Understanding Large Language Models

Computer Science | 7 min read (1702 words).

What should you read to understand the latest AI developments? Here are my most-recommended research papers for understanding LLMs!

Introduction

AI, in the form of deep learning of various kinds, keeps growing in popularity, and since the release of ChatGPT on November 30, 2022, it feels like the world cannot stop talking about AI in general and large language models in particular. Having been interested in machine learning and AI since I was a kid, I keep an eye on the major developments. This year, I read up on large language models (LLMs) in particular, the latest wave of hype fuelled by ChatGPT’s popularity. Multi-modal models are catching on more and more now, but I will focus on LLMs here; they are more than interesting enough on their own.

This blog post gives a quick overview of what LLMs are and recommends papers to read to catch up on how they work. The selection is based on my own reading, and I include only my top favourites here.

Background

Before diving into research papers, it is important to have a basic understanding of machine learning, neural networks, and deep learning. For those who don’t know the basics, the best way to get that knowledge is probably Google’s machine learning crash course, which is completely free and which I can highly recommend. There is no need to work through all the exercises, however.

Here also follows a very quick overview. First, machine learning can be seen as a sub-field of artificial intelligence (AI), which is a sub-field of computer science (CS).1 At its core, machine learning is a way to create software from data instead of explicit programming: a machine learning model is 'trained' on known data and thereby learns to handle similar, previously unseen data in the future. The most popular type of model today is the deep neural network, which consists of many connected layers of simple parameterized nodes, each computing a simple function of the nodes in the previous layer. Neural networks are (in theory) universal function approximators and can approximate many practically useful functions reasonably well. This makes them very versatile, because they can learn almost anything from data.
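
To make 'layers of simple parameterized nodes' concrete, here is a tiny sketch of such a network (my own illustration in PyTorch, with made-up layer sizes):

```python
import torch

# A small feed-forward network: each layer computes a simple parameterized
# function (linear map + nonlinearity) of the previous layer's outputs.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 32),   # input features -> hidden layer
    torch.nn.ReLU(),
    torch.nn.Linear(32, 32),  # hidden layer -> hidden layer
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),   # hidden layer -> output
)

x = torch.randn(10, 4)  # a batch of 10 examples with 4 features each
print(model(x).shape)   # torch.Size([10, 1])
```

Training adjusts the parameters inside the Linear layers so that the network’s outputs match the known data.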

Neural networks have been around for a long time, but a number of improvements have been made over the past 15 years. First of all, deep learning is popular nowadays, which simply means that the models have a large number of hidden layers and can therefore learn good internal representations directly from raw data without expert guidance. Such deep models used to be impossible to train efficiently due to problems such as vanishing and exploding gradients, but a series of small improvements have made this a non-issue: proper initialization, batch normalization, ReLU activations, residual connections, and so on.2 Combined with powerful modern GPU hardware and new neural network architectures, neural networks are finally fulfilling their promise in impressive ways!

These neural networks are still trained using variants of classical gradient descent, which has been known since the 1800s. At the moment, it is common to use, for example, AdamW with a learning rate schedule combining linear warmup and linear decay. For more background and information about deep learning in general, the book Deep Learning is still highly recommended.
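
As a small illustration of that setup (my own sketch, not from any particular paper; the hyperparameters are made up), AdamW with linear warmup followed by linear decay can be wired together in PyTorch roughly like this:

```python
import torch

# A tiny stand-in model; the hyperparameters below are illustrative only.
model = torch.nn.Linear(16, 1)

total_steps = 1000
warmup_steps = 100

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    # Linear warmup from 0 to 1, then linear decay back down to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    x, y = torch.randn(8, 16), torch.randn(8, 1)  # dummy batch
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```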

For anyone interested in open-source machine-learning models (including the latest LLMs), Huggingface is absolutely amazing and highly recommended.

Generative AI and language models

So, what are generative AI and LLMs, specifically? Generative AI is simply a name for machine learning models that generate text, images, audio, or similar content. This ties in with the typical large language model, which generates text. The typical LLM takes existing text as input (encoded as 'tokens', roughly similar to words) and predicts the next token by assigning probabilities to all possible tokens. When used in a ChatGPT-like way, an algorithm generates text by repeatedly evaluating the LLM and picking the next token in some way. By training these models on huge amounts of real text, they become impressively good at handling both language and various forms of reasoning.
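
To make the "repeatedly evaluate the LLM and pick the next token" loop concrete, here is a minimal greedy-decoding sketch (my own illustration, assuming the Huggingface transformers library and the small GPT-2 model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; any causal LM would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The transformer architecture is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate 20 new tokens
        logits = model(input_ids).logits                             # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)       # append and repeat

print(tokenizer.decode(input_ids[0]))
```

In practice one would use the library’s generate() method with sampling, temperature, and so on, but conceptually this loop is what happens.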

Enough background. Let’s get into the recommended papers to read!

Transformers and foundation models

To understand today’s LLMs, it is important to first understand the basic neural network architecture that they are all based on: the transformer. Google published this architecture in the paper Attention Is All You Need on 12 Jun 2017. The name comes from the fact that transformers are built around a self-attention mechanism. This is still the foundation of all modern language models (and beyond). The main advantage over earlier architectures such as LSTM networks is that transformers process all tokens in parallel rather than sequentially, which makes it possible to train them efficiently on GPUs.
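
The heart of the architecture, scaled dot-product self-attention, is surprisingly compact. Here is a rough single-head sketch in PyTorch (my own illustration, not code from the paper):

```python
import math
import torch

def self_attention(x: torch.Tensor,
                   w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token representations.
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                         # project tokens to queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])   # pairwise similarity between tokens
    weights = torch.softmax(scores, dim=-1)                     # attention weights per token
    return weights @ v                                          # weighted mix of value vectors

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```

Real transformers add multiple heads, causal masking for language modelling, feed-forward layers, and residual connections around all of this.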

Another important development behind the large language models of recent years is the rise of ‘foundation models’: broadly trained models that can be adapted to a large range of downstream tasks using very little additional training data. This is a game-changer: previously, only Big Tech had enough money and data to train competent models, but by starting from a foundation model, just about anyone can now train competent models on their own limited data. Stanford coined the term foundation model in their report On the Opportunities and Risks of Foundation Models.

Important language models

Each of these milestone language models contributed something important and has an accompanying paper:

Needless to say, it is not necessary to read about all of these models to understand modern LLMs, but they are all famous in their own right, and many of them contributed key improvements, techniques, or insights.

Scaling laws and emergent abilities

One of the most important changes over the past five years has been the scale of language models, which is why we call them large language models. With scale, new abilities have appeared (a rough sketch of what scaling laws typically look like follows below). Here are several interesting papers about this:
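
The promised sketch: scaling-law papers typically fit the test loss as a power law in the number of parameters N and the number of training tokens D, roughly of the form below (the exact parametrization and the fitted constants vary between papers):

```latex
% Illustrative parametric form; E, A, B, \alpha, \beta are fitted constants.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Emergent abilities are, by contrast, capabilities that seem to appear rather abruptly once models pass some scale, instead of improving smoothly.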

These emergent abilities have led to phenomena such as in-context learning (ICL) and prompt engineering.3 In-context learning means that the model picks up a task from examples given directly in the prompt (see the small sketch below), which can be compared against the alternative of fine-tuning the model with traditional training. These papers investigate in-context learning and prompting:
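
Here is the promised sketch: a minimal few-shot prompt, where the 'training examples' live entirely inside the prompt (my own toy example, assuming the Huggingface transformers text-generation pipeline):

```python
from transformers import pipeline

# Tiny model for illustration; larger LLMs handle this far better.
generator = pipeline("text-generation", model="gpt2")

# A few labelled examples followed by a new input that the model
# should complete in the same pattern.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe =>"
)

print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```

GPT-2 is far too small to handle this reliably; the point is only the shape of the prompt.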

Parameter-efficient fine-tuning (PEFT)

In addition to prompt engineering, parameter-efficient fine-tuning (PEFT) has emerged as a cost-effective way to adapt models to new tasks by training only a small number of extra or selected parameters. Huggingface provides a library for this, called peft (a short LoRA sketch follows below). These papers describe some of the key techniques:
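
And here is the promised LoRA sketch (my own example, assuming the Huggingface peft library; the hyperparameters are made up):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA: freeze the base model and train small low-rank adapter matrices instead.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank of the adapter matrices (illustrative)
    lora_alpha=16,     # scaling factor
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

The wrapped model is then fine-tuned as usual, but only the small adapter matrices are updated and need to be stored.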

Other methods

One popular method for teaching LLMs about new text is RAG (retrieval-augmented generation): relevant documents are looked up in a database in some way (often via vector similarity) and the prompt is augmented with them before generation (a minimal sketch follows below). Papers about RAG:
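
The promised retrieve-then-augment sketch (my own illustration; the embed() helper is a hypothetical stand-in for a real text-embedding model like the ones discussed further down):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real text-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

documents = [
    "Our office is closed on public holidays.",
    "The VPN must be used when working remotely.",
    "Expense reports are due on the last Friday of each month.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(question: str, k: int = 1) -> list[str]:
    q = embed(question)
    # Cosine similarity between the question and every document.
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(-sims)[:k]]

question = "When are expense reports due?"
context = "\n".join(retrieve(question))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this augmented prompt is what gets sent to the LLM
```

With a real embedding model the retrieval becomes meaningful; the point here is only the flow: embed, search, and prepend.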

All the most well-known language models are also trained on human preferences using methods such as:

See Huggingface’s Transformer Reinforcement Learning (TRL) library for implementations of these.
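
To give a flavour of the preference-modelling idea behind these methods, here is a hedged sketch of a pairwise reward-model loss in plain PyTorch (my own illustration, not the TRL API):

```python
import torch
import torch.nn.functional as F

# Toy reward model: maps a (pooled) response representation to a scalar score.
reward_model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Dummy features standing in for encoded (prompt, response) pairs.
chosen = torch.randn(8, 16)    # responses humans preferred
rejected = torch.randn(8, 16)  # responses humans rejected

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise preference loss: push the chosen score above the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In an RLHF-style pipeline, a reward model trained like this is then used to fine-tune the language model itself with reinforcement learning.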

Text embeddings

For some applications, it is useful to compute a vector embedding that captures the meaning of a piece of text, such that similar texts can be found efficiently (because they get similar vectors). There are plenty of models for this, as well as vector databases for searching among the vectors (a small example follows below). One paper that describes this:
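
The promised example (my own illustration, assuming the sentence-transformers library and one of its small pretrained models):

```python
from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model, used here purely as an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is lovely today.",
]
embeddings = model.encode(sentences)  # one vector per sentence

# Cosine similarity: the first two sentences should score much higher
# against each other than against the third.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```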

Foundational techniques

Here are some of the foundational techniques that remain important:

Other interesting papers

Here are a few interesting papers that did not fit elsewhere:

Ending remarks

That is a lot of papers, all more or less focused on language models. New papers are published every day, but I believe the ones above are significant enough to be worth reading both now and for years to come. To understand multi-modal models, or models focused on images or sound, there are many more interesting papers to read. That’s for another day!

Have I read all the papers above? To varying degrees. Some, I have read multiple times. Some, I have only skimmed. Not all are worth the same time and focus.


  1. Machine learning can also be seen as a part of mathematical statistics, in the form of statistical learning theory. ↩︎

  2. Usually, it is enough to use only some of these mitigations. ↩︎

  3. Andrew Ng has a good online course about prompt engineering. ↩︎