Update on ARC's recent eval efforts
More information about ARC's evaluations of GPT-4 and Claude
Topic feed
LLM evaluation, model quality, and reliability measurement.
More information about ARC's evaluations of GPT-4 and Claude
We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and...
We’ve made progress towards stable and scalable training of energy-based models (EBMs) resulting in better sample quality and generalization ability than existing models. Generation in EBMs spends more compute to continually refine its answers and doing so...
Deep learning is an empirical science, and the quality of a group’s infrastructure is a multiplier on progress. Fortunately, today’s open-source ecosystem makes it possible for anyone to build great deep learning infrastructure.