Understanding Large Language Models
Computer Science | 7 min read (1702 words).
What to read to understand the latest AI developments? Here are my most-recommended research papers to understand LLMs!
Introduction
AI in the form of deep learning of various kinds keeps getting more popular, and since the release of ChatGPT on November 30, 2022, it feels like the world cannot stop talking about AI in general and large language models in particular. Having been interested in machine learning and AI since I was a kid, I keep an eye on the major developments. This year, I read up on large language models (LLMs) in particular, which are also the latest hype fuelled by ChatGPT’s popularity. I think multi-modal models are catching on more and more now, but I will focus on LLMs here; they are more than interesting enough on their own.
This blog post will give a quick overview of what LLMs are and recommendations on what papers to read to catch up on how they work. This is based on my reading and I will recommend only my top favourites here.
Background
Before diving into research papers, it is important to first have a basic understanding of machine learning, neural networks, and deep learning. For those who don’t know the basics, the best way to get that knowledge is probably to go through Google’s machine learning crash course, which is completely free and which I can highly recommend. There is no need to fully work through all the exercises, however.
Here is also a very quick overview. First, machine learning can be seen as a sub-field of artificial intelligence (AI), which is in turn a sub-field of computer science (CS).1 At its core, machine learning is a way to create software based on data instead of explicit programming: a machine learning model is ’trained’ on known data and thereby learns to handle similar, previously unseen data in the future. The most popular type of model is currently the deep neural network, which consists of many connected layers of simple parameterized nodes, each computing a simple function of the nodes in the previous layer. These neural networks are (in theory) universal function approximators and can approximate a lot of different practical functions reasonably well. This makes them quite ideal, because they can learn almost anything from data.
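To make the “layers of simple parameterized nodes” idea concrete, here is a minimal sketch of the forward pass of a small fully-connected network in NumPy. The layer sizes and random weights are made up purely for illustration; a real model would learn the weights from data.

```python
import numpy as np

def relu(x):
    # Simple elementwise non-linearity used between layers.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Forward pass of a small fully-connected network.

    Each layer computes a simple parameterized function of the
    previous layer's outputs: a linear map followed by a non-linearity.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    # Last layer is linear (e.g. for regression).
    return h @ weights[-1] + biases[-1]

# Made-up layer sizes: 4 inputs -> 16 hidden -> 16 hidden -> 1 output.
rng = np.random.default_rng(0)
sizes = [4, 16, 16, 1]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(8, 4))               # batch of 8 example inputs
print(forward(x, weights, biases).shape)  # (8, 1)
```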
Neural networks have been around for a long time, but a few important improvements have been made over the past 15 years. First of all, deep learning is popular nowadays, which simply means that the models have a large number of hidden layers and can therefore learn good internal representations directly from the raw data without expert guidance. Such deep models used to be impossible to train efficiently, however, due to problems such as exploding gradients. A series of small improvements have made this a non-issue nowadays: proper initialization, batch normalization, ReLU activations, residual connections, etc.2 Combined with powerful modern GPU hardware and new neural network architectures, neural networks are finally fulfilling their promise in impressive ways!
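As an illustration of two of these ingredients, here is a minimal sketch (using PyTorch, purely as an example) of a residual block with ReLU activations. The dimensions and depth are made up; the point is that the skip connection gives gradients a direct path through deep stacks of layers.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = x + f(x).

    The identity 'skip' connection gives gradients a direct path
    through the block, which is one of the tricks that makes very
    deep networks trainable.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)

# A deep stack of residual blocks with made-up dimensions.
model = nn.Sequential(*[ResidualBlock(64) for _ in range(12)])
print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```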
These neural networks are still trained using variants of classical gradient descent, which has been known since the 1800s. At the moment, it is common to use, for example, AdamW with some kind of linear warmup and linear decay learning rate schedule. For more background and information about deep learning in general, the book Deep Learning is still highly recommended.
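Here is a minimal sketch of such a setup, assuming PyTorch; the model, batch data, and hyperparameters (learning rate, warmup length, total steps) are made-up placeholders just to show the shape of the training loop.

```python
import torch
from torch import nn

model = nn.Linear(64, 1)  # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 100, 1000  # made-up schedule lengths

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 up to the base learning rate...
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # ...followed by linear decay back towards 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x, y = torch.randn(32, 64), torch.randn(32, 1)  # dummy batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```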
For anyone interested in open-source machine-learning models (including the latest LLMs), Huggingface is absolutely amazing and highly recommended.
Generative AI and language models
So, what are generative AI and LLMs specifically? First, generative AI is simply a name for machine learning models that generate text, images, audio, or similar content. This ties in with the typical large language model, which generates text. The typical LLM takes existing text as input (encoded as ’tokens’, roughly similar to words) and predicts the next token by assigning probabilities to all possible tokens. When used in the way ChatGPT is, an algorithm generates text by repeatedly evaluating the LLM and picking the next token according to some decoding strategy. By training these models on huge amounts of real text, they become impressively good at handling both language and various forms of reasoning.
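As a rough sketch of this generation loop, here is greedy next-token decoding using Huggingface’s transformers library; “gpt2” is just an example checkpoint, and real systems use smarter decoding strategies than always picking the most likely token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is just an example checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(tokens).logits             # scores for every vocabulary token
    probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the *next* token
    next_token = torch.argmax(probs)              # greedy choice; sampling is also common
    tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0]))
```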
Recommended AI papers
Enough background. Let’s get into the recommended papers to read!
Transformers and foundation models
To understand today’s LLMs, it is important to first understand the basic neural network architecture that they are all based on: the transformer. Google published this architecture in the paper Attention Is All You Need on 12 June 2017. The name comes from the fact that transformers are built around a self-attention mechanism, which is still the foundation of all modern language models (and beyond). The main advantage over earlier architectures such as LSTM networks is that transformers can be trained efficiently on GPUs, since all positions in a sequence are processed in parallel rather than one step at a time.
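For intuition, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the core operation of the transformer; the sequence length and dimensions are made up, and real models add multiple heads, masking, and learned projections per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    Every position attends to every other position, so the whole
    sequence can be processed in parallel on a GPU.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity between positions
    weights = softmax(scores, axis=-1)       # attention weights per position
    return weights @ V                       # weighted mix of value vectors

# Made-up sizes: sequence of 5 tokens, model dimension 16.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(0, 0.1, (16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 16)
```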
Another important development that has happened in relation to the large language models of recent years is the emergence of ‘foundation models’: broadly-trained models that can be adapted to a large range of downstream tasks using very little additional training data. This is a game-changer compared to before, when only Big Tech had enough money and data to train such models. By starting from a foundation model, it is now often possible for just about anyone to train competent models using their own limited data! Stanford coined the term foundation model in their report On the Opportunities and Risks of Foundation Models.
Important language models
These milestone language models each contributed something important and have papers:
- GPT-1: Improving Language Understanding by Generative Pre-Training (PDF)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- GPT-2: Language Models are Unsupervised Multitask Learners (PDF)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- GPT-3: Language Models are Few-Shot Learners
- FLAN: Finetuned Language Models Are Zero-Shot Learners
- InstructGPT: Training language models to follow instructions with human feedback
- PaLM: Scaling Language Modeling with Pathways
- LLaMA: Open and Efficient Foundation Language Models
- GPT-4 Technical Report
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 (there’s also a video lecture)
- PaLM 2 Technical Report
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Mistral 7B
Needless to say, it is not necessary to read about all these models to understand modern LLMs, but they are all famous in their own right, and many of them have contributed key improvements, techniques, or insights.
Scaling laws and emergent abilities
One of the most important changes over the past 5 years has been the scale of language models, which is why we call them large language models. With scale, new abilities have emerged. Here are several interesting papers about this:
- Scaling Laws for Neural Language Models
- Training Compute-Optimal Large Language Models
- Emergent Abilities of Large Language Models
These emergent abilities have led to phenomena such as in-context learning (ICL) and prompt engineering3. The models’ in-context learning abilities can be compared to the alternative of fine-tuning the model using traditional training. These papers investigate in-context learning and prompting (a minimal sketch of a few-shot prompt follows the list):
- Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation
- In-Context Learning Learns Label Relationships but Is Not Conventional Learning
- A Survey on In-context Learning
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
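To make in-context learning concrete, here is a minimal sketch of a few-shot prompt for sentiment classification. The example reviews and labels are made up, and any instruction-tuned model could be asked to complete the pattern.

```python
# A few in-context examples are placed directly in the prompt; the model
# is expected to continue the pattern for the final, unlabeled input.
examples = [
    ("The film was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("A solid, if unremarkable, sequel.", "neutral"),
]
query = "The soundtrack alone makes it worth watching."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # feed this string to an LLM and read off the generated label
```

Note that no parameters are updated here: the “learning” happens entirely within the prompt, which is exactly what the comparison papers above contrast with fine-tuning.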
Parameter-efficient fine-tuning (PEFT)
In addition to prompt engineering, PEFT has emerged as a cost-effective way to adapt models to new tasks, and Huggingface provides a PEFT library implementing many of these methods. These papers describe some of the key techniques (a minimal LoRA sketch follows the list):
- LoRA: Low-Rank Adaptation of Large Language Models
- P-Tuning: GPT Understands, Too
- P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- QLoRA: Efficient Finetuning of Quantized LLMs
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning
- Open, Closed, or Small Language Models for Text Classification?
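As a small example, here is a sketch of wrapping a model with LoRA adapters, assuming Huggingface’s peft library; the “gpt2” checkpoint and the hyperparameters (rank, scaling, dropout) are purely illustrative choices.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# "gpt2" and the hyperparameters below are just illustrative choices.
model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor for the update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Only the small LoRA matrices are trainable; the base model stays frozen.
model.print_trainable_parameters()
```

The appeal is that only a tiny fraction of the parameters are trained and stored per task, which is what makes adaptation so cheap compared to full fine-tuning.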
Other methods
One popular method for teaching LLMs about new text is RAG (retrieval-augmented generation), where relevant data is looked up in a database and used to augment the prompt (a minimal sketch follows the list). Papers about RAG:
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Retrieve Anything To Augment Large Language Models
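Here is a minimal sketch of the RAG idea, with a hypothetical embed() function standing in for a real embedding model and a tiny in-memory list standing in for a vector database; the documents and query are made up.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; in practice this would call a
    real embedding model (see the text embeddings section below)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=64)

# A tiny in-memory "database" of documents and their embeddings.
documents = [
    "The Eiffel Tower is 330 metres tall.",
    "Python was first released in 1991.",
    "The Great Barrier Reef lies off the coast of Queensland.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(-scores)[:k]]

query = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(query))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the augmented prompt is then sent to the LLM
```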
All the most well-known language models are also trained on human preferences using methods such as:
- RLHF: Training language models to follow instructions with human feedback
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- PPO: Proximal Policy Optimization Algorithms
See Huggingface’s Transformer Reinforcement Learning library for implementations of these.
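As an illustration of how simple the core of one of these methods is, here is a minimal PyTorch sketch of the DPO loss, given summed log-probabilities of the preferred (“chosen”) and dispreferred (“rejected”) responses under the policy being trained and a frozen reference model; the inputs below are random placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trained policy and a frozen reference model.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one,
    # while the reference model keeps it from drifting too far.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Made-up log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```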
Text embeddings
For some applications, it is useful to compute a vector embedding that describes the meaning of a piece of text, such that similar text can be found efficiently (by having a similar vector). There are plenty of models for this, as well as vector databases for searching for similar vectors (a minimal sketch follows the list). Some papers that describe this:
- Text and Code Embeddings by Contrastive Pre-Training
- C-Pack: Packaged Resources To Advance General Chinese Embedding
- Retrieve Anything To Augment Large Language Models
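For a quick feel of how such embeddings are used, here is a sketch assuming the sentence-transformers library; “all-MiniLM-L6-v2” is just one example of a small embedding model on Huggingface, and the sentences are made up.

```python
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is just one example of a small embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps for recovering account access",
    "Best hiking trails near Oslo",
]
embeddings = model.encode(sentences)  # one vector per sentence

# Cosine similarity: semantically similar sentences get similar vectors,
# so the first two should score much higher with each other than with the third.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```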
Foundational techniques
Here are some of the foundational techniques that remain important:
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- Adam: A Method for Stochastic Optimization
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- AdamW: Decoupled Weight Decay Regularization
- Bagging predictors (PDF)
Other interesting papers
Here are a few interesting papers that did not fit elsewhere:
- Allies: Prompting Large Language Model with Beam Search
- Scaling Transformer to 1M tokens and beyond with RMT
- Reward is enough
Ending remarks
Those are a lot of papers, all more or less focused on language models. New papers are published every day, but I believe the above papers are significant enough to be worth reading both now and for years to come. To understand multi-modal models or models focused on images or sound, there are many more interesting papers to read. That’s for another day!
Have I read all the papers above? To varying degrees. Some, I have read multiple times. Some, I have only skimmed. Not all are worth the same time and focus.
Machine learning can also be seen as a part of mathematical statistics, in the form of statistical learning theory. ↩︎
Usually, it is enough to use only some of these mitigations. ↩︎
Andrew Ng has a good online course about prompt engineering. ↩︎