
Research Papers

Curated papers every AI practitioner should know - from foundations to frontier.


Foundational Papers

Attention Is All You Need (2017)

Authors: Vaswani et al. (Google) Significance: Introduced the Transformer architecture, replacing RNNs with self-attention. The foundation of every major LLM today. Read: arXiv
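
To make "self-attention" concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The random weights, the dimensions, and the function names are illustrative assumptions, and the sketch omits multi-head projection, masking, and positional encodings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model); each weight matrix has shape (d_model, d_k).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ v, weights

# Illustrative sizes and random parameters (assumptions, not values from the paper).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Every output position is a weighted mix of all value vectors, which is what lets the model relate distant tokens without a recurrent state.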

BERT: Pre-training of Deep Bidirectional Transformers (2018)

Authors: Devlin et al. (Google) Significance: Showed that bidirectional pre-training followed by task-specific fine-tuning outperforms unidirectional language models on language-understanding benchmarks. Read: arXiv

GPT-3: Language Models are Few-Shot Learners (2020)

Authors: Brown et al. (OpenAI) Significance: Demonstrated that scaling models to 175B parameters unlocks in-context learning - no fine-tuning needed for many tasks. Read: arXiv

Training Language Models to Follow Instructions (InstructGPT, 2022)

Authors: Ouyang et al. (OpenAI) Significance: Applied RLHF (reinforcement learning from human feedback), a technique from earlier work, to align LLMs with user intent at scale. The approach ChatGPT later built on. Read: arXiv


Retrieval-Augmented Generation

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Authors: Lewis et al. (Facebook AI) Significance: Formalized the RAG pattern - augment LLMs with external knowledge retrieval. The foundation of many production LLM systems. Read: arXiv
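
The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not the paper's method: the paper uses a learned dense retriever, whereas this sketch swaps in bag-of-words cosine similarity. The corpus, the prompt wording, and all function names are hypothetical, and the final call to a language model is left out.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    # Put the retrieved passages in front of the question.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical three-document corpus.
corpus = [
    "The Transformer architecture relies on self-attention.",
    "QLoRA combines 4-bit quantization with low-rank adapters.",
    "Paris is the capital of France.",
]
query = "What does QLoRA combine?"
passages = retrieve(query, corpus, k=1)
prompt = build_prompt(query, passages)  # this string would be sent to the LLM
```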

Lost in the Middle: How Language Models Use Long Contexts (2023)

Authors: Liu et al. (Stanford) Significance: Showed that LLMs perform worst on information in the middle of long contexts - critical insight for RAG system design. Read: arXiv
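
One common mitigation in RAG pipelines, which is a heuristic and not something the paper prescribes, is to reorder retrieved passages so the strongest ones sit at the start and end of the context and the weakest land in the middle. A minimal sketch, assuming the input list is already sorted best-first (the function name is hypothetical):

```python
def reorder_for_long_context(ranked_docs):
    """Place the best-ranked documents at the edges of the context.

    ranked_docs is assumed to be sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]

ordered = reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"])
# ordered == ["d1", "d3", "d5", "d4", "d2"]
```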


Reasoning & Agents

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022)

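Authors: Wei et al. (Google) Significance: Showed that few-shot prompts whose examples spell out intermediate reasoning steps substantially improve accuracy on reasoning tasks in sufficiently large models. The zero-shot “think step by step” variant came from follow-up work (Kojima et al., 2022). Read: arXiv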
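
A chain-of-thought prompt is usually assembled from worked examples whose answers show the intermediate steps, followed by the new question. The sketch below builds such a prompt. The exemplar is an illustrative arithmetic problem in the style used in the paper, and the `Q:`/`A:` layout and function name are assumptions.

```python
# One worked exemplar: the reasoning is written out before the final answer.
EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?"
        ),
        "reasoning": (
            "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
            "6 tennis balls. 5 + 6 = 11."
        ),
        "answer": "11",
    },
]

def build_cot_prompt(question, exemplars=EXEMPLARS):
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in exemplars
    ]
    # The new question ends with an open "A:" for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

cot_prompt = build_cot_prompt(
    "A cafeteria had 23 apples. It used 20 and bought 6 more. How many are left?"
)
```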

Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023)

Authors: Yao et al. (Princeton) Significance: Extended chain-of-thought to explore multiple reasoning paths simultaneously, with backtracking. Read: arXiv
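
The search idea can be sketched as a breadth-first search that keeps only the most promising partial solutions at each depth. In the paper a language model both proposes and evaluates the thoughts. Here both roles are replaced by hypothetical hand-written functions on a toy problem (pick numbers from 1, 3, 5 that sum to 9), so this shows only the control flow.

```python
def tree_of_thoughts_bfs(root, propose, score, is_goal, beam_width=2, max_depth=3):
    """Breadth-first search over partial solutions with a fixed-width beam."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every state on the frontier into its candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Keep only the highest-scoring candidates; the rest are pruned.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            break
    return None

# Toy stand-ins for the LLM's proposal and evaluation steps (assumptions).
TARGET = 9
propose = lambda state: [state + (n,) for n in (1, 3, 5)]
score = lambda state: -abs(TARGET - sum(state))
is_goal = lambda state: sum(state) == TARGET

solution = tree_of_thoughts_bfs((), propose, score, is_goal)
```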

ReAct: Synergizing Reasoning and Acting in Language Models (2022)

Authors: Yao et al. (Princeton) Significance: Combined reasoning traces with action steps - the pattern behind modern agent frameworks. Read: arXiv
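
The reason-then-act loop can be sketched as follows. The model is replaced by a scripted stand-in so the example runs without an API, and the tool, the `Action: name[arg]` text format, and all names are simplified assumptions, not the paper's exact prompt format.

```python
import re

def lookup(term):
    # Hypothetical tool backed by a tiny in-memory table.
    facts = {"transformer": "Introduced in 2017 by Vaswani et al."}
    return facts.get(term.lower(), "No result.")

TOOLS = {"lookup": lookup}

def scripted_model(transcript):
    """Stand-in for an LLM call: returns a fixed thought/action sequence."""
    if "Observation:" not in transcript:
        return "Thought: I should look up the Transformer.\nAction: lookup[Transformer]"
    return "Thought: I have what I need.\nAction: finish[2017]"

ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")

def react(question, model, tools, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)              # model emits a thought and an action
        transcript += "\n" + step
        match = ACTION_RE.search(step)
        if match is None:
            break                             # no parseable action: stop
        name, arg = match.groups()
        if name == "finish":
            return arg, transcript            # the model declared a final answer
        observation = tools[name](arg) if name in tools else f"Unknown tool: {name}"
        transcript += f"\nObservation: {observation}"  # feed the result back
    return None, transcript

answer, transcript = react("When was the Transformer introduced?", scripted_model, TOOLS)
```

Each tool result is appended to the transcript, so the next reasoning step is conditioned on what the previous action returned.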


Open-Source & Efficiency

LLaMA: Open and Efficient Foundation Language Models (2023)

Authors: Touvron et al. (Meta) Significance: Showed that smaller models trained on more data can match larger models. Sparked the open-source LLM revolution. Read: arXiv

QLoRA: Efficient Finetuning of Quantized Language Models (2023)

Authors: Dettmers et al. (UW) Significance: Made fine-tuning of 65B models possible on a single 48 GB GPU by combining 4-bit quantization of the frozen base weights with trainable low-rank adapters. Read: arXiv
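
A rough back-of-the-envelope calculation shows why the combination helps. The sketch counts only weight storage and ignores quantization constants, activations, and optimizer state. The 8192 x 8192 layer and rank 16 are illustrative assumptions, not settings taken from the paper.

```python
def weight_memory_gb(n_params, bits_per_param):
    # Bytes = params * bits / 8; reported in decimal gigabytes.
    return n_params * bits_per_param / 8 / 1e9

def lora_params(d_in, d_out, rank):
    # A rank-r adapter factors the update into (d_in x r) and (r x d_out) matrices.
    return rank * (d_in + d_out)

n_params = 65e9
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit base weights

# Trainable parameters for one hypothetical 8192 x 8192 projection.
full_layer = 8192 * 8192
adapter = lora_params(8192, 8192, rank=16)
reduction = full_layer / adapter

print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
print(f"adapter trains {adapter:,} params vs {full_layer:,} ({reduction:.0f}x fewer)")
```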

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)

Authors: DeepSeek Significance: Open-weight reasoning model reported to perform comparably to OpenAI o1 on reasoning benchmarks at a fraction of the cost. Demonstrated that reinforcement learning can teach reasoning. Read: arXiv


Scaling & Emergent Behavior

Scaling Laws for Neural Language Models (2020)

Authors: Kaplan et al. (OpenAI) Significance: Established power-law relationships between model size, dataset size, training compute, and loss. Read: arXiv
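
The model-size relationship takes the form L(N) = (N_c / N)^alpha_N, where N counts non-embedding parameters and L is the loss. The sketch below evaluates it. The default constants are approximate values for the paper's fit and should be treated as illustrative assumptions to be checked against the paper.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law loss as a function of model size: L(N) = (N_c / N) ** alpha_N.

    n_c and alpha_n are approximate, illustrative constants (assumptions).
    """
    return (n_c / n_params) ** alpha_n

# Doubling model size multiplies the loss by 2 ** -alpha_n, whatever the starting size.
ratio_small = loss_from_params(2e9) / loss_from_params(1e9)
ratio_large = loss_from_params(2e11) / loss_from_params(1e11)
```

Because the exponent is small, each doubling of parameters gives only a modest, but predictable, reduction in loss.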

Sparks of Artificial General Intelligence: Early Experiments with GPT-4 (2023)

Authors: Bubeck et al. (Microsoft) Significance: Broad study of the capabilities of an early version of GPT-4, arguing it exhibits “sparks” of AGI. Sparked debate on measuring intelligence. Read: arXiv


Where to Find More