An Introduction to Large Language Models


Large language models (LLMs) are a recent advance in deep learning built to work with human language, and they have already demonstrated some impressive use cases. A large language model is a trained deep-learning model that understands and generates text in a human-like fashion. Behind the scenes, a large transformer model does all the magic.

In this post, you will learn about the structure of large language models and how they work. In particular, you will know:

  • What is a transformer model
  • How a transformer model reads text and generates output
  • How a large language model can produce text in a human-like fashion.

Tracing the Evolution of Large Language Models

The history of large language models can be traced back to the 1960s, when the first steps were taken in natural language processing (NLP). In 1966, Joseph Weizenbaum at MIT developed ELIZA, the first-ever NLP program. ELIZA employed pattern matching and substitution techniques to simulate conversation with humans. Shortly after, around 1970, another MIT team built SHRDLU, an NLP program that could understand and respond to natural-language commands about a restricted "blocks world."

In 1988, the introduction of Recurrent Neural Networks (RNNs) brought advancements in capturing sequential information in text data. However, RNNs struggled with longer sequences. To overcome this, Long Short-Term Memory (LSTM) was proposed in 1997. LSTM made significant progress in applications based on sequential data and gained attention in the research community. In the years that followed, attention mechanisms also began to attract research interest.

While LSTM addressed the issue of processing longer sentences to some extent, it still struggled with very long sequences. Additionally, training LSTM models proved time-consuming because their sequential computation cannot be parallelized. These concerns prompted further research and development in the field of large language models.

The year 2017 marked a significant breakthrough in NLP research with the publication of the influential paper "Attention Is All You Need." This paper introduced a groundbreaking architecture called Transformers, which revolutionized the NLP landscape. Transformers were designed to address the limitations faced by LSTM-based models.

Transformers represented a major leap forward in the development of Large Language Models (LLMs) due to their ability to handle large amounts of data and incorporate attention mechanisms effectively. Because the architecture scales to an enormous number of parameters, Transformers became the foundation of the first LLMs developed at such scale. They quickly emerged as state-of-the-art models in the field, surpassing the performance of previous architectures like LSTMs.

To this day, Transformers continue to have a profound impact on the development of LLMs. Their innovative architecture and attention mechanisms have inspired further research and advancements in the field of NLP. The success and influence of Transformers have led to the continued exploration and refinement of LLMs, leveraging the key principles introduced in the original paper.

Building Large Language Models

Over the past five years, extensive research has been dedicated to advancing Large Language Models (LLMs) beyond the initial Transformers architecture. One notable trend has been the exponential increase in the size of LLMs, both in terms of parameters and training datasets. Through experimentation, it has been established that larger LLMs and more extensive datasets enhance their knowledge and capabilities.

LLMs such as BERT and XLNet, along with the GPT family (GPT, GPT-2, GPT-3, GPT-3.5), have been introduced, featuring progressively larger parameter counts and training datasets. These models have pushed the boundaries of language understanding and generation, setting new benchmarks in NLP tasks.

In 2022, another breakthrough occurred in the field of NLP with the introduction of ChatGPT. ChatGPT is an LLM specifically optimized for dialogue and exhibits an impressive ability to answer a wide range of questions and engage in conversations. Shortly after, Google introduced Bard as a competitor to ChatGPT, further driving innovation and progress in dialogue-oriented LLMs.

Over the past year, the development of Large Language Models has accelerated rapidly, resulting in the creation of hundreds of models. To track and compare these models, you can refer to the Hugging Face Open LLM leaderboard, which provides a list of open-source LLMs along with their rankings. At the time of writing, Falcon-40B-Instruct tops that leaderboard, showcasing the continuous advancements in the field.

What are Large Language Models?

Large language models largely represent a class of deep learning architectures called transformer networks. A transformer model is a neural network that learns context and meaning by tracking relationships in sequential data, like the words in this sentence.

A transformer is made up of multiple transformer blocks, also known as layers. For example, a transformer has self-attention layers, feed-forward layers, and normalization layers, all working together to decipher input to predict streams of output at inference. The layers can be stacked to make deeper transformers and powerful language models. Transformers were first introduced by Google in the 2017 paper “Attention Is All You Need.”
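
As a rough sketch (not the architecture of any particular production model), one such block can be written in PyTorch as follows; the layer sizes are placeholder values, and a real language model would also add a causal attention mask:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A minimal sketch of one transformer block: self-attention,
    a feed-forward layer, and layer normalization with residuals."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention with a residual connection.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward with a residual connection.
        x = self.norm2(x + self.ff(x))
        return x

# Stacking blocks yields a deeper, more powerful model.
model = nn.Sequential(*[TransformerBlock() for _ in range(6)])
```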

There are two key innovations that make transformers particularly adept for large language models: positional encodings and self-attention. 

Positional encoding embeds the order in which the input occurs within a given sequence. Essentially, instead of feeding words within a sentence strictly one after another, positional encoding allows the words to be fed into the neural network non-sequentially while still preserving their order information.
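
For example, the original "Attention Is All You Need" paper used sinusoidal positional encodings. A minimal sketch:

```python
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()            # (d_model/2,)
    angle = pos / (10000 ** (i / d_model))             # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to the token embeddings, injecting order information

# Each position gets a unique vector the model can learn to use.
print(positional_encoding(seq_len=4, d_model=8).shape)  # torch.Size([4, 8])
```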

Self-attention assigns a weight to each part of the input data while processing it. This weight signifies the importance of that input in context to the rest of the input. In other words, models no longer have to dedicate the same attention to all inputs and can focus on the parts of the input that actually matter. This representation of which parts of the input the neural network needs to pay attention to is learned over time as the model sifts through and analyzes mountains of data.
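
At its core, self-attention computes these weights as a softmax over scaled query-key dot products. A minimal, single-head sketch (the projection matrices here are random placeholders; in a real model they are learned):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x
    of shape (seq_len, d_model)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Each row of `weights` says how much one token attends to the others.
    scores = q @ k.T / math.sqrt(k.shape[-1])
    weights = F.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(5, 16)                      # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # torch.Size([5, 16])
```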

These two techniques in conjunction allow for analyzing the subtle ways and contexts in which distinct elements influence and relate to each other over long distances, non-sequentially. 

The ability to process data non-sequentially enables the decomposition of the complex problem into multiple, smaller, simultaneous computations. Naturally, GPUs are well suited to solving these types of problems in parallel, allowing for large-scale processing of large-scale unlabeled datasets and enormous transformer networks.

The Importance of Large Language Models

In the history of artificial intelligence, the focus had traditionally been on perception and understanding. However, a significant shift occurred with the emergence of large language models (LLMs): models containing hundreds of billions of parameters, trained on extensive internet-scale datasets. This breakthrough has unlocked the potential of AI models to generate content that closely resembles human expression.

These models demonstrate versatility across various tasks such as reading, writing, coding, drawing, and creating content in a manner that appears convincingly human. They have become instrumental in enhancing human creativity and boosting productivity across industries, offering solutions to some of the world's most challenging problems.

The applications of LLMs are wide-ranging. For instance, in the field of medicine, an AI system can learn the language of protein sequences to propose viable compounds, aiding scientists in the development of groundbreaking, life-saving vaccines. Moreover, these models empower individuals in their creative pursuits, serving as a resource for overcoming challenges.

A writer grappling with writer's block can turn to a large language model to ignite their creative spark. Similarly, software programmers can significantly improve productivity by leveraging LLMs to generate code based on natural language descriptions. In essence, these models have become valuable tools that not only augment human capabilities but also open new avenues for addressing complex problems and fostering innovation.

What is the difference between LLMs and traditional language models?

Large Language Models (LLMs) are trained using self-supervised or semi-supervised learning on enormous amounts of unlabeled text, as opposed to traditional language models, which are trained on labeled data. LLMs are general-purpose models that perform well across a variety of applications rather than being trained for a single job.

LLMs provide human-like responses to prompts by combining deep learning and natural language generation techniques with a massive text corpus. Because they are general-purpose, the same pre-trained LLM can be adapted to many downstream tasks through prompting or fine-tuning, whereas a traditional language model typically has to be built and trained separately for each narrow task.

Different Kinds of LLMs

LLMs can indeed be broadly categorized into two types based on their tasks: text continuation and dialogue optimization.

  1. Continuing the Text: LLMs in this category are trained to predict the next sequence of words in a given input text. Their primary objective is to continue the text in a coherent and meaningful manner. For example, when provided with the input "How are you," these LLMs might complete the sentence with phrases such as "How are you doing?" or "How are you? I am fine." Popular LLMs in this category include GPT, GPT-2, GPT-3, GPT-4, XLNet, and others.

However, a limitation of these LLMs is that they excel at text completion rather than providing specific answers. While they can generate plausible continuations, they may not always address the specific question or provide a precise answer.

  2. Dialogue Optimized: Dialogue-optimized LLMs were introduced to overcome the limitations of text continuation LLMs. These models are specifically designed to generate meaningful responses in dialogue or conversation scenarios. They are trained to understand the context of the conversation and aim to provide accurate and contextually appropriate answers.

Unlike text continuation LLMs, dialogue-optimized LLMs focus on delivering relevant answers rather than simply completing the text. For instance, when given the input "How are you?", these LLMs strive to respond with an appropriate answer like "I am doing fine" rather than just completing the sentence. Some examples of dialogue-optimized LLMs are InstructGPT, ChatGPT, Bard, Falcon-40B-Instruct, and others.

The introduction of dialogue-optimized LLMs aims to enhance their ability to engage in interactive and dynamic conversations, enabling them to provide more precise and relevant answers to user queries.

The Challenges of Training LLMs: Infrastructure & Cost

Training Large Language Models (LLMs) from scratch presents significant challenges, primarily related to infrastructure and cost considerations.

  1. Infrastructure: LLMs require a massive amount of computational resources for training due to their large parameter counts and the vast text corpora used. Training on such a scale necessitates distributed and parallel computing with multiple GPUs. For example, training GPT-3, which has 175 billion parameters, on a single GPU would take an estimated 355 years. To overcome this, setups with thousands of GPUs are required. Popular LLMs like Falcon-40B were trained on hundreds of GPUs (384 A100 40GB GPUs in Falcon's case). Training such models demands specialized infrastructure capable of supporting these computational demands (see the back-of-envelope sketch after this list).
  2. Cost: The infrastructure required for training LLMs is accompanied by significant costs. Setting up the necessary GPU infrastructure at the required scale can be financially burdensome, involving substantial investments by companies or research institutions. Training a model like GPT-3 from scratch is estimated to have cost around $4.6 million, and even training a smaller-scale 7-billion-parameter model can cost approximately $25,000. These costs encompass the hardware, energy consumption, maintenance, and related expenses.
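
To see roughly where estimates like these come from, a common rule of thumb from the scaling-laws literature puts training compute at about 6 FLOPs per parameter per training token. The numbers below are a hedged back-of-envelope sketch, not exact figures:

```python
# Back-of-envelope training compute for a GPT-3-scale model.
params = 175e9   # 175B parameters
tokens = 300e9   # ~300B training tokens (reported for GPT-3)
flops = 6 * params * tokens   # ~3.15e23 FLOPs

# Assume a single GPU sustains ~30 TFLOP/s on this workload (illustrative).
gpu_flops_per_s = 30e12
seconds = flops / gpu_flops_per_s
years = seconds / (3600 * 24 * 365)
print(f"{years:.0f} GPU-years")  # ~333 GPU-years, the same ballpark as the
                                 # ~355-year single-GPU figure quoted above
```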

Considering the infrastructure and cost challenges, it is crucial to carefully plan and allocate resources when training LLMs from scratch. Organizations must assess their computational capabilities, budgetary constraints, and availability of hardware resources before undertaking such endeavors.

How Do You Train LLMs from Scratch?

The training process differs depending on the kind of LLM you want to build: one that continues text or one that is dialogue-optimized. The performance of an LLM mainly depends on two factors: the dataset and the model architecture.

Let’s discuss the different steps involved in training the LLMs.

1. Continuing the Text

The training process for LLMs that continue the text is known as pre-training. These LLMs are trained with self-supervised learning to predict the next word in the text. Below, we walk through the different steps involved in training such LLMs from scratch.

a. Dataset Collection

The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. For example, OpenChat, a recent dialogue-optimized LLM inspired by LLaMA-13B, achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. The reason behind its success? High-quality data: it was fine-tuned on only about 6K examples.

Training data is created by scraping the internet: websites, social media platforms, academic sources, and so on. Make sure the training data is as diverse as possible.

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.

What does it say? Let me explain.

You might have come across headlines like "ChatGPT failed at engineering exams" or "ChatGPT fails to clear the UPSC exam," and so on. What are the possible reasons? The model lacked the necessary knowledge, and this is heavily dependent on the dataset used for training. Hence, the demand for diverse datasets continues to rise, as high-quality cross-domain data has a direct impact on how well a model generalizes across different tasks.


Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions dating back to 2008; its size is in the petabyte range (1 petabyte = 1e6 GB). Large language models trained on this dataset showed effective results but failed to generalize well across other tasks. Hence, a new dataset called the Pile was created from 22 diverse, high-quality datasets, combining existing data sources and new datasets for a total of about 825 GB. More recently, a refined version of Common Crawl was released under the name RefinedWeb.
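
As a hedged illustration, the Hugging Face datasets library can stream a corpus like RefinedWeb so the data never has to fit on disk; the dataset ID and field name below are assumptions, so check the Hub for the exact schema:

```python
from datasets import load_dataset

# Stream the corpus instead of downloading it all up front.
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for example in ds.take(3):
    # The text field name depends on the dataset's schema.
    print(example["content"][:80])
```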

Note: The datasets used for GPT-3 and GPT-4 have not been open-sourced in order to maintain a competitive advantage over the others.

b. Dataset Preprocessing

The next step is to preprocess and clean the dataset. Because the dataset is crawled from many web pages and different sources, it quite often contains noise and inconsistencies. We must eliminate these and prepare a high-quality dataset for model training.

The specific preprocessing steps depend on the dataset you are working with. Common steps include removing HTML code, fixing spelling mistakes, eliminating toxic/biased data, converting emoji into their text equivalents, and data deduplication. Data deduplication, the process of removing duplicate content from the training corpus, is one of the most significant preprocessing steps while training LLMs.

The training data will inevitably contain duplicate or nearly identical sentences, since it is collected from various sources. We need data deduplication for two primary reasons: it helps the model avoid memorizing the same data again and again, and it lets us evaluate LLMs more honestly, because the training and test data then contain no overlapping information. If they do overlap, there is a very good chance that information seen in the training set is reproduced as output at test time, so the reported numbers may overstate true generalization. You can read more about deduplication techniques in the paper "Deduplicating Training Data Makes Language Models Better."
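
As a minimal sketch, exact-match deduplication can be done by hashing a normalized form of each document; production pipelines (including those in the paper above) typically add fuzzy methods such as MinHash:

```python
import hashlib

def deduplicate(docs):
    """Drop exact duplicates by hashing a normalized form of each document."""
    seen, unique = set(), []
    for doc in docs:
        # Normalize whitespace and case so trivial variants collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world!", "hello   world!", "Something else."]
print(deduplicate(docs))  # ['Hello world!', 'Something else.']
```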

c. Dataset Preparation

The next step is to create the input and output pairs for training the model. During the pre-training phase, LLMs are trained to predict the next token in the text. Hence, input and output pairs are created accordingly.

For example, let’s take a simple corpus:

  • Example 1: I am a GSC Assistant.
  • Example 2: GSC stands for Global Science Conference.
  • Example 3: I can provide you with information about the Global Science Conference.

In the case of Example 1, we can create the input-output pairs as per below:

  • Input: "I" → Output: "am"
  • Input: "I am" → Output: "a"
  • Input: "I am a" → Output: "GSC"
  • Input: "I am a GSC" → Output: "Assistant."
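
A toy sketch of this pair construction, using whitespace tokens purely for illustration (real pipelines operate on subword tokens produced by a tokenizer such as BPE):

```python
def next_token_pairs(text):
    """Build (input, target) pairs for next-token prediction,
    using whitespace tokens for illustration."""
    tokens = text.split()
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for inp, target in next_token_pairs("I am a GSC Assistant."):
    print(f"{inp!r} -> {target!r}")
# 'I' -> 'am'
# 'I am' -> 'a'
# 'I am a' -> 'GSC'
# 'I am a GSC' -> 'Assistant.'
```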

d. Model Architecture

The subsequent step involves defining the model architecture and initiating the training process for the LLM.

Currently, there is a substantial number of LLMs being developed, and you can explore various LLMs on the Hugging Face Open LLM leaderboard. Researchers generally follow a standardized process when constructing LLMs. They often start with an existing Large Language Model architecture, such as GPT-3, and utilize the model's initial hyperparameters as a foundation. From there, they make adjustments to both the model architecture and hyperparameters to develop a state-of-the-art LLM.

For example,

  • Falcon is recognized as a cutting-edge LLM and holds the top rank on the open-source LLM leaderboard. It draws inspiration from the architecture of GPT-3 but incorporates a few modifications and tweaks to enhance its performance and capabilities.

By leveraging existing LLM architectures and fine-tuning them with customized adjustments, researchers can push the boundaries of language understanding and generation, leading to the development of state-of-the-art models like Falcon.
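
As a concrete, deliberately small sketch, the Hugging Face transformers library lets you instantiate a GPT-style architecture from a handful of hyperparameters; scaling up values like n_layer and n_embd is essentially how larger models in this family are defined:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# A small GPT-style decoder; real LLMs scale these values far higher.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,  # maximum sequence length
    n_layer=12,        # number of transformer blocks
    n_head=12,         # attention heads per block
    n_embd=768,        # embedding / hidden size
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~124M
```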

Hyperparameter Search

Hyperparameter tuning is indeed a resource-intensive process, both in terms of time and cost, especially for models with billions of parameters. Running exhaustive experiments for hyperparameter tuning on such large-scale models is often infeasible. A practical approach is to leverage the hyperparameters from previous research, such as those used in models like GPT-3, and then fine-tune them on a smaller scale before applying them to the final model.

During the hyperparameter tuning process, various aspects of the model can be explored and adjusted, including weight initialization, positional embeddings, optimizer, activation function, learning rate, weight decay, loss function, sequence length, number of layers, attention heads, parameters, dense vs. sparse layers, batch size, and dropout.

For popular hyperparameters, there are some best practices to consider (a minimal configuration sketch follows this list):

  • Batch Size: It is ideal to choose a batch size that fits the GPU memory, maximizing computational efficiency.
  • Learning Rate Scheduler: It is beneficial to decrease the learning rate as training progresses to overcome local minima and improve model stability. Commonly used learning rate schedulers include Step Decay and Exponential Decay.
  • Weight Initialization: Proper weight initialization plays a crucial role in the convergence of the model. Techniques like T-Fixup are commonly used for weight initialization in transformer models. It's important to note that weight initialization techniques are typically applied when defining custom LLM architectures.
  • Regularization: LLMs can be prone to overfitting, so incorporating regularization techniques is crucial. Dropout and L1/L2 weight regularization are commonly used to mitigate overfitting, and transformer models additionally rely on layer normalization for training stability.
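
A hedged sketch of how such choices might be gathered into a single training configuration; the values are illustrative, loosely in the range reported for GPT-3-class models, not a recommended recipe:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    batch_size_tokens: int = 3_200_000  # tokens per batch, sized to GPU memory
    learning_rate: float = 6e-5         # peak LR, decayed during training
    lr_schedule: str = "cosine"         # decrease LR as training progresses
    warmup_steps: int = 2_000           # ramp LR up before decaying
    weight_decay: float = 0.1           # L2-style regularization
    dropout: float = 0.1                # regularization against overfitting
    sequence_length: int = 2048
    n_layers: int = 96
    n_heads: int = 96

config = TrainingConfig()
print(config.learning_rate, config.lr_schedule)
```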

By following these best practices and exploring the impact of various hyperparameters, researchers can optimize the performance and stability of LLMs while managing the computational and time constraints involved in the training process.

2. Dialogue-optimized LLMs

For dialogue-optimized LLMs, the first step is the same as the pre-training discussed above. After pre-training, these LLMs are capable of completing text. To make the model generate an answer to a specific question, it is then fine-tuned on a supervised dataset of question-answer pairs. By the end of this step, the model can generate answers to questions.
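
As a minimal sketch, supervised fine-tuning data is usually rendered into prompt-response strings, with the loss computed only on the response tokens; the template below is purely illustrative, not any model's official format:

```python
def format_example(question, answer):
    """Render a QA pair into a single training string using an
    illustrative prompt template."""
    prompt = f"### Question:\n{question}\n\n### Answer:\n"
    return prompt, prompt + answer

prompt, full_text = format_example(
    "What is a transformer?",
    "A neural network architecture built around self-attention.",
)
# During fine-tuning, the tokens belonging to `prompt` are typically
# masked out of the loss, so the model is only trained on the answer.
print(full_text)
```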

ChatGPT is a dialogue-optimized LLM. Its training method is similar to the steps discussed above, with an additional step known as RLHF (reinforcement learning from human feedback) on top of pre-training and supervised fine-tuning.

But recently, the paper "LIMA: Less Is More for Alignment" suggested that you may not need RLHF at all: pre-training on a huge dataset followed by supervised fine-tuning on fewer than 1,000 high-quality examples can be enough.

As of today, OpenChat (mentioned earlier) is the latest dialogue-optimized large language model inspired by LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation and was fine-tuned on only about 6K high-quality examples.

How Do You Evaluate LLMs?

The evaluation of LLMs cannot be subjective; it has to be a rigorous, quantitative process.

In classification or regression problems, we have true labels and predicted labels and compare the two to understand how well the model is performing; we look at a confusion matrix for this, right? But what about large language models? They simply generate free-form text.

There are two ways to evaluate LLMs: intrinsic and extrinsic methods.

Intrinsic Methods

Traditional language models were evaluated using intrinsic metrics like perplexity, bits per character, etc. These metrics track performance on the language-modeling front, i.e., how well the model predicts the next word.
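
Perplexity, for example, is just the exponential of the average negative log-probability the model assigns to the true next tokens. A minimal sketch:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the true tokens).
    `token_probs` are the probabilities the model assigned to each
    actual next token in a held-out text."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns higher probability to the true tokens is less
# "perplexed": lower is better.
print(perplexity([0.5, 0.5, 0.5]))   # 2.0
print(perplexity([0.9, 0.8, 0.95]))  # ~1.13
```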

Extrinsic Methods

With the advancements in LLMs today, extrinsic methods are preferred. The recommended way to evaluate LLMs is to look at how well they perform at different tasks like problem-solving, reasoning, mathematics, and computer science, as well as competitive exams such as MIT course exams or India's JEE.

EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this framework into its Open LLM Leaderboard to evaluate open-source LLMs developed by the community.

The leaderboard evaluates LLMs across four benchmark datasets; the final score is an aggregation of the scores on each (an invocation sketch follows the list).

  • AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.
  • HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).
  • MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.
  • TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.
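
As a hedged illustration, recent versions of the Evaluation Harness expose a Python entry point along these lines; the function signature and task names vary across versions, so treat this as a sketch rather than a definitive invocation:

```python
# pip install lm-eval
import lm_eval

# Evaluate any causal LM from the Hugging Face Hub on a subset of the
# leaderboard tasks (task names differ by harness version).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```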

In conclusion, Large Language Models (LLMs) represent a groundbreaking advancement in artificial intelligence, fundamentally reshaping our interaction with language. These models are already making significant contributions in various applications, from chatbots and virtual assistants to language translation and content creation, leveraging their ability to generate text that closely resembles human speech and perform diverse language-related tasks.

However, like any transformative technology, LLMs bring forth important ethical and societal considerations that must be addressed. As we strive to enhance the capabilities of LLMs, it is crucial to acknowledge their broader implications and actively work towards building a more responsible and equitable future for this groundbreaking technology.

We hope you found this exploration insightful, and if you're seeking more information, feel free to explore our FAQs:

FAQs

Q1: What are Large Language Models (LLMs)?

Large Language Models (LLMs) are machine learning models trained on extensive datasets to learn how to represent language. These models leverage advanced natural language processing techniques, such as the self-attention mechanism, to grasp the context and meaning of language.

Q2: What is the self-attention mechanism used in LLMs?

The self-attention mechanism is a crucial component of LLMs, allowing the model to assign varying weights to different parts of the input text based on their relevance to the task at hand. This mechanism enhances the model's understanding of the context and meaning of the input text.

Q3: What are some use cases for LLMs?

LLMs find applications in various natural language processing tasks, including sentiment analysis, question answering, text generation, and text summarization. They are commonly used in applications like chatbots, virtual assistants, and content creation software.

Q4: What is fine-tuning in the context of LLMs?

During the fine-tuning process, a pre-trained LLM is further trained for a specific task or domain, enhancing its performance by adapting to the specific nuances of that task or domain.

Q5: What are some challenges associated with building and training LLMs?

Developing and training LLMs require high-quality data, powerful computing resources, and specialized machine learning expertise. Challenges have been raised regarding the ethical implications, potential biases, and environmental concerns associated with the use of LLMs.