Technology of Large Language Models

Let's learn the fundamentals of Natural Language Processing and language modeling with this workbook.

What are large language models?

Large Language Models (LLMs) like OpenAI’s ChatGPT or Google’s Bard are large neural networks trained on huge amounts of data. They can solve many different tasks involving natural language generation at a previously unseen level of quality. But how do they work and what makes them so powerful? This workbook aims to provide an intuitive understanding of the essential underlying components and techniques, such as language modeling and transformers. Along the way, we will highlight key characteristics of contemporary LLMs, for example their being “foundational” for a wide range of tasks. Don’t worry if any of these terms are unfamiliar to you; we’ll go through these concepts one after another in this workbook.

Key Terminology

Click on the tabs below to learn more about each of them.

Language Modeling

“You shall know a word by the company it keeps”, Firth, J. R. 1957

This famous formulation about the meaning of words by the linguist John Rupert Firth has proven to be the most fundamental insight behind the success of today's LLMs. It means that the meaning of a word can be derived from all the contexts in which it occurs. The statistical approach to this is called language modeling. In practice, this means processing large amounts of text in order to determine how probable it is for each word to occur in a certain context, that is, to capture each word's distributional information. If the learned probabilities are accurate enough, they enable the generation of new text of the quality we see today.

Let's look at an example:

“Yesterday was a beautiful day since the sun was <>.”

Based on our perception of the world, humans would likely complete this sentence with “beautiful” rather than, say, “dreadful”, so the former is far more probable in this context. To collect this information, all we need is text that covers as many aspects of a word's meaning as possible, as reflected by the contexts it appears in. The advent of the internet was the key enabler for exploiting the true power of this approach: it made vast amounts of textual data readily available, which has led to steadily improving language models (LMs) ever since.
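The idea of estimating how probable each word is in a given context can be sketched with simple counting. The following is a minimal illustration, not how modern LLMs are actually built: it looks only at the single preceding word (a so-called bigram model), and the tiny corpus as well as the function names `train_bigram_counts` and `next_word_probabilities` are invented for this example.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "the sun was beautiful",
    "the sun was beautiful today",
    "the weather was dreadful",
    "the sun was warm",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def next_word_probabilities(counts, context_word):
    """Turn the raw counts for one context word into relative frequencies."""
    followers = counts.get(context_word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # context word never seen in the corpus
    return {word: n / total for word, n in followers.items()}


counts = train_bigram_counts(corpus)
probs = next_word_probabilities(counts, "was")
print(probs)  # "beautiful" receives the highest probability after "was"
```

With more text, the counts cover more contexts and the estimated probabilities become more reliable, which is the intuition behind training on very large datasets.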

Why develop large language models?

Most importantly, LLMs are highly versatile and can be used for a surprisingly wide range of tasks. Contemporary models like GPT-3.5, which powers ChatGPT, can answer questions, write poems or essays, translate between languages, or even generate code while being trained mainly on raw text. While some of those capabilities are still somewhat surprising, it is clear that the task of language modeling is far more useful than it seemed even just a few years ago. It has been shown to be suitable for learning a range of linguistic regularities, such as how sentences are structured, how word forms depend on each other, and which roles words play in a sentence.

LLMs are also efficient in that they provide an end-to-end solution to the tasks mentioned above within a single model, whereas in the past several distinct methods were required, each responsible for one aspect of generating language for a specific task. Additionally, the data required to learn all this information does not need tedious manual processing and can simply be scraped from the internet or other sources of textual data, as we will see in the next section. For all of these reasons, LLMs have the potential to change the way people create or design content across different fields in the near future.

How do large language models work?

LLM Training Techniques

For LLMs to become powerful, they need to be trained on large amounts of data, which nowadays is mainly enabled by a technique called self-supervised learning. But before we explain that, we need to take a look at a closely related underlying concept: supervised learning.

Supervised Learning

In supervised learning, a neural network is presented with, for example, a sentence, and the task is to predict whether the sentence says something positive or negative. In order to learn the relationship between the words in a sentence and the expressed sentiment, it is presented with many (ranging from thousands to millions) pairs of example sentences and sentiments (also known as labels).

Example where the sentiment can be positive or negative:

Sentence: Pretty good. Didn't know any of the comedians but the first time viewing put a smile on my face. I'll check out the next season soon.

Sentiment to be predicted: Positive.

Assigning labels to examples is called annotation and usually needs to be done by humans. This makes it clear why annotation work has been a serious bottleneck for model training in the past: employing humans to scale up datasets can be expensive.
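To make the idea of learning from labeled pairs concrete, here is a minimal sketch. Note the assumptions: a simple word-count classifier stands in for the neural network described above, the labeled sentences are invented for this example, and `train` and `predict` are hypothetical helper names.

```python
from collections import Counter

# Hand-labeled (sentence, sentiment) pairs, invented for illustration.
# In practice these labels come from human annotators.
training_data = [
    ("pretty good and funny", "positive"),
    ("it put a smile on my face", "positive"),
    ("good show I loved it", "positive"),
    ("boring and bad", "negative"),
    ("bad jokes I hated it", "negative"),
    ("a waste of time", "negative"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {"positive": Counter(), "negative": Counter()}
    for sentence, label in examples:
        word_counts[label].update(sentence.lower().split())
    return word_counts


def predict(word_counts, sentence):
    """Pick the label whose training sentences share the most words with the input."""
    words = sentence.lower().split()
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    # Ties fall back to the first label ("positive") in this simple sketch.
    return max(scores, key=scores.get)


model = train(training_data)
print(predict(model, "pretty good"))
print(predict(model, "bad and boring"))
```

The quality of such a classifier depends directly on how many labeled pairs are available, which is why the cost of annotation matters so much.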

Transformer and Attention

The transformer is a neural network architecture introduced by Vaswani et al. in 2017. Its key ability lies in the so-called attention mechanism, which allows it to make use of contextual information more effectively and flexibly than previous artificial neural network models. Attention in transformers is loosely analogous to attention in human cognition. From a cognitive perspective, attention plays a critical role in flexibly directing our awareness towards selected aspects of information. In a similar vein, the attention mechanism in artificial neural networks such as transformers enables the model to flexibly relate each word to all others by weighting their importance relative to one another. This way, important contextual information is highlighted and associated with the respective words, and the input is “transformed” into something meaningful for downstream tasks.

For example, if we look at the following sentence, some words might be more important than others for its overall meaning:

Alice crossed the road as she was in a hurry

Here the pronoun “she” refers to the entity “Alice”. Identifying this sort of association is known as coreference resolution. Intuitively, the attention mechanism allows transformers to learn that some words (e.g., “Alice”) in this sentence relate more strongly to other words (e.g., “she”) and carry important information for them. If we think about the meaning of “she”, “Alice” might be the most important word, whereas “the” is quite unimportant. From a computational perspective, this means that more weight is given to “Alice” for the word “she”, such that a strong association between them is maintained.

Clearly, a huge amount of computation is needed to build up these word-to-word associations when there is a large amount of data. Fortunately, one important advantage of transformers is precisely their efficiency in handling these complex associations: they are designed so that the computations for each word can be carried out independently and simultaneously. This matters because training is a very time- and resource-consuming process.
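The weighting of “Alice” for the word “she” can be illustrated with a small sketch of scaled dot-product attention, the form of attention described by Vaswani et al. The numbers are assumptions chosen by hand: in a real transformer the query, key, and value vectors are learned during training and have many more dimensions, and here the key vectors are simply reused as values.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Hand-picked 2-dimensional vectors (hypothetical, not learned).
words = ["Alice", "crossed", "the", "road", "she"]
keys = [[2.0, 2.0], [0.5, 0.0], [0.0, 0.0], [0.0, 0.5], [0.5, 0.5]]
query_for_she = [1.0, 1.0]

weights, output = attention(query_for_she, keys, keys)
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

In this toy setup, “Alice” receives the largest weight and “the” the smallest. Because the computation for each query word depends only on the shared keys and values, the same function could be run for all words at the same time, which is what makes the approach easy to parallelize.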

Famous model architectures. Click on the tabs below to learn more about each of them.

What makes large language models so powerful? Key characteristics of large language models

Foundation Models

One key characteristic of large language models is their capability for generalization. By generalization we simply mean that the knowledge these models extract from their input, including the meaning of words, regularities of sentence structure, etc., can be applied to many different situations. This characteristic ties LLMs closely to the concept of foundation models. The term foundation models, coined in 2021, is motivated by a number of observations regarding the characteristics and capabilities of large neural network models trained on a broad range of data at scale, which are explained below.

First of all, the power of foundation models lies in their capability to capture relevant information in a way that could be generally used as the “foundation” for multiple purposes. This sort of generalized knowledge may be used for a wide variety of downstream applications, thus distinguishing foundation models from traditional approaches.

In the case of building a chatbot, for example, a traditional natural language understanding engine may have to assemble a complicated set of methods to eventually help the chatbot “understand” user input. When a user asks “Can you describe the development of quantum computing in the last decade?”, the engine may first have to identify that the user is asking for a description and then recognize the relevant time and entity information in the input. After that, relevant data needs to be retrieved from a database and a complete response needs to be formed. These methods need to be integrated, which may give rise to considerable complexity (see figure below).

Figure: A classical natural language understanding (NLU) engine may involve the integration of multiple components such as intent classification, slot filling, etc., whereas large language models provide an end-to-end, unified approach.
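To give a feeling for the classical multi-component approach, the following sketch chains a hand-written intent classifier, a slot filler, and a lookup step. Everything in it is hypothetical and heavily simplified: the keyword rules, the regular expressions, the `KNOWLEDGE_BASE` dictionary with its placeholder entry, and the function names are assumptions made for this example, not parts of any real NLU engine.

```python
import re

# Placeholder "database"; the stored text is a stand-in, not real content.
KNOWLEDGE_BASE = {
    ("quantum computing", "last decade"): "<summary retrieved from the database>",
}


def classify_intent(text):
    """Component 1: decide what the user wants, using simple keyword rules."""
    lowered = text.lower()
    if "describe" in lowered or "explain" in lowered:
        return "request_description"
    return "unknown"


def fill_slots(text):
    """Component 2: extract time and entity information with patterns."""
    slots = {}
    time_match = re.search(r"last (decade|year|month|week)", text, re.IGNORECASE)
    if time_match:
        slots["time"] = time_match.group(0).lower()
    topic_match = re.search(r"development of (.+?) in the", text, re.IGNORECASE)
    if topic_match:
        slots["topic"] = topic_match.group(1).strip().lower()
    return slots


def respond(text):
    """Component 3: retrieve data and form a complete response."""
    if classify_intent(text) != "request_description":
        return "Sorry, I did not understand the request."
    slots = fill_slots(text)
    key = (slots.get("topic"), slots.get("time"))
    answer = KNOWLEDGE_BASE.get(key, "No matching entry found.")
    return f"Here is what I found: {answer}"


question = "Can you describe the development of quantum computing in the last decade?"
print(classify_intent(question))
print(fill_slots(question))
print(respond(question))
```

Each component has to be designed, tuned, and maintained separately, and any new kind of request requires new rules. An LLM replaces this chain with a single model that maps the user input directly to a response.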

With large language models, however, we are now able to capture all of this information from plain user input and directly output the desired answers using a single huge neural network trained on relevant data in advance. Furthermore, because of their generality, the knowledge representations these large language models extract may even be reused to handle information in other modalities, like speech, audio, images or even source code. Famous examples include text-to-code models like Codex and text-to-image models like DALL·E 2 or Stable Diffusion, which make heavy use of general linguistic information captured by LLMs.

Since LLMs are neural networks which store generalized knowledge representations in their parameters, the question of model size is closely tied to the level of generalization that can be achieved. In simple terms, the more data we have, the more information there is to be extracted, and the more capacity (parameters) a model may need in order to memorize it. This relationship has led to a rapid growth in model size, ranging from a few hundred million to around a trillion parameters.

Homogenization

As briefly mentioned earlier, the number of modeling techniques used today to achieve generalization has been greatly reduced. The transformer is used as the essential component in various model architectures (mostly variants of either BERT or GPT) due to its efficient and powerful attention mechanism. Training can be done in one or two stages: first by learning general knowledge during pre-training on raw data, and then, if task-specific data is available, by fine-tuning on it to endow the model with more specific capabilities.

Finally, the homogenization of methodological approaches has also greatly facilitated research across different fields of applications. For example, LLMs can be used for protein sequence modeling as well as speech processing or image generation.

Emergence of capabilities

When a neural network is trained, there is usually a very specific task which the model learns to solve. Therefore, when pretraining models on the classic language modeling task, we expect them to be good at producing coherent, linguistically correct text. It turns out that, when scaled sufficiently, LLMs can solve a surprisingly wide range of tasks from a description in natural language (a prompt) alone. Those tasks include machine translation, arithmetic, code generation (by specifying what the program is supposed to do) and general question answering.

Typically, users describe the task in natural language in the input fed to the pretrained model. Based on this input, the model generates the most probable next word, appends it to the input, and continues until some stopping criterion is reached (e.g. an artificial word indicating end-of-sentence has been generated). Supplying only a task description is known as zero-shot learning, while additionally providing a few samples of what a correct answer might look like is called few-shot learning.
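The generation loop and the two prompting styles described above can be sketched as follows. This is an illustration under strong assumptions: the hand-written `NEXT_WORD` table stands in for a trained model and only looks at the last word, whereas a real LLM conditions on the whole input. The `<eos>` marker, the prompt layout with `=>`, and all function names are hypothetical choices for this example.

```python
END = "<eos>"  # artificial word indicating the end of the sentence

# Hypothetical next-word probabilities standing in for a trained model.
NEXT_WORD = {
    "the": {"sun": 0.6, "moon": 0.4},
    "sun": {"was": 0.9, END: 0.1},
    "was": {"shining": 0.7, "hidden": 0.3},
    "shining": {END: 1.0},
}


def most_probable_next(words):
    """Return the most probable next word given the words so far."""
    if not words:
        return END
    options = NEXT_WORD.get(words[-1], {END: 1.0})
    return max(options, key=options.get)


def generate(prompt, max_new_words=10):
    """Repeatedly append the most probable next word until a stop condition."""
    words = prompt.split()
    for _ in range(max_new_words):  # stopping criterion 1: length limit
        next_word = most_probable_next(words)
        if next_word == END:  # stopping criterion 2: end marker generated
            break
        words.append(next_word)
    return " ".join(words)


def zero_shot_prompt(task, query):
    """Only a task description plus the actual input."""
    return f"{task}\n{query} =>"


def few_shot_prompt(task, examples, query):
    """Task description, a few solved examples, then the actual input."""
    shots = "\n".join(f"{source} => {target}" for source, target in examples)
    return f"{task}\n{shots}\n{query} =>"


print(generate("the"))  # the sun was shining
print(few_shot_prompt(
    "Translate English to German.",
    [("cheese", "Käse"), ("bread", "Brot")],
    "water",
))
```

Always picking the single most probable word is the simplest decoding strategy. The prompt helpers only build the text that would be fed to a model; the model then continues that text word by word in the same way as `generate`.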

What's next in the workbooks?

In this workbook, we tried to give an overview of the most fundamental concepts, techniques, and characteristics of large language models. We covered the idea behind language modeling and why it is a good idea to use large neural networks, such as architectures based on BERT and GPT. We also introduced the training scheme of self-supervised learning as well as transformers and their powerful attention mechanism. Finally, we described crucial characteristics that we observe in LLMs like ChatGPT, for example, the generalized knowledge they extract from raw text, which enables their emergent capabilities.

With the integration of ChatGPT into Microsoft's Bing and of Bard into Google Search, the wider adoption of generative LLMs has only just started. Even before that, code generation based on natural language prompts was a highlighted emergent feature of GPT-3, which is exploited by GitHub’s Copilot (legal issues notwithstanding) and has already proven to be a powerful assistant to software engineers. We will dive deeper into potential and real use cases in WB2.

While LLMs show compelling capabilities in producing highly coherent texts, this brings up ethical questions regarding the expectations humans may have when interacting with a machine. Also, even the best LLMs have been shown to generate falsehoods (hallucinate) or produce slurs. Ethical considerations with regard to societal aspects must therefore include the dangers of misinformation at scale and the reproduction of biases in the training data. On top of these, there are also environmental and sustainability issues surrounding the training and usage of LLMs. These are hard and largely unsolved problems which we will cover in more depth in WB3.

Workbook 2