Large Language Models, or LLMs, understand nuanced linguistic detail and can generate human-like text. In this discussion, we will delve deeper into how they do this with the help of examples, sample code, diagrams, and more. In short, Large Language Models analyze input text and predict output, but how do they interpret and process text data, transforming it into a format they can work with?
Large Language Models, or LLMs, are a type of machine learning model. Machine learning means training computers to learn from data and make predictions or decisions without being explicitly programmed for the specific task: rather than following a set of instructions written by a developer for each task, the model learns to perform tasks based on the patterns it identifies in the data.
Like all machine learning models, LLMs learn from large volumes of data. In this case, it's text data. This data could range from books, articles, websites, and more. Training an LLM involves feeding it with a corpus of text data.
The model learns the statistical patterns in the data, like the probability of a word following another word or a sequence of words.
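To make "statistical patterns" concrete, here is a minimal sketch that counts how often each word follows another in a tiny made-up corpus and turns those counts into probabilities. The corpus and all values below are invented purely for illustration:

```python
from collections import Counter, defaultdict

# Tiny made-up corpus for demonstration
corpus = "the cat sat on the mat and the cat slept".split()

# Count how often each word follows another
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

# Probability of each word following "the"
total = sum(following["the"].values())
probs = {word: count / total for word, count in following["the"].items()}
print(probs)  # "cat" follows "the" twice, "mat" once: roughly {'cat': 0.67, 'mat': 0.33}
```

An LLM's learned distribution is vastly more sophisticated, but at its heart it answers the same question: given what came before, how likely is each possible next token?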
LLMs typically use a form of artificial neural network architecture called a transformer. The transformer is a specific machine learning architecture that uses self-attention mechanisms and is exceptionally well suited to capturing the complex patterns of human language. In the transformer architecture, the input text is first divided into smaller units called tokens. These tokens are then converted into vectors through a process called embedding. Afterward, each token is contextualized against the other tokens within a specified window via a parallel multi-head attention mechanism. This process allows important tokens to be emphasized and less significant ones to be de-emphasized.
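To give a feel for the attention step, here is a minimal sketch of scaled dot-product attention over toy two-dimensional vectors. The tokens and vector values are invented for illustration; real transformers use learned, high-dimensional embeddings and learned query/key/value projections:

```python
import math

# Toy 2-dimensional "embeddings" for three tokens (made-up values)
tokens = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

def attention_weights(query, keys):
    """Scaled dot-product attention: score each key against the query,
    then normalize the scores with a softmax."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# How strongly does "the" attend to each token (including itself)?
weights = attention_weights(vectors[0], vectors)
for token, w in zip(tokens, weights):
    print(f"{token}: {w:.3f}")
```

The weights always sum to 1, and tokens whose vectors align more closely with the query receive larger weights, which is exactly the highlight/de-emphasize behavior described above.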
Here's a simple, foundational Python script to illustrate the central principle of language prediction used by Large Language Models (LLMs).
import random
from collections import Counter
Above, we import the few standard-library modules used throughout this experiment.
Preparing Data
Corpus: Let's start with a small text corpus. You can either paste in a couple of paragraphs of your choice or load a small text file with Python's built-in open() and read().
text_corpus = """
A language model is a statistical method that predicts the next word in a sequence.
Statistical language models rely on probability to guess the next word.
They are trained on vast amounts of text data containing patterns.
"""
Cleanup: A little preprocessing for consistency:
import re
text_corpus = text_corpus.lower() # Make everything lowercase
text_corpus = re.sub(r"[^a-z\s]", '', text_corpus) # Remove special characters
words = text_corpus.split() # Split into individual words
Building a Simple Model
We'll create a dictionary-based model tracking word frequencies and subsequent words:
word_pairs = {}
for i in range(len(words) - 1):
    current_word = words[i]
    next_word = words[i + 1]
    if current_word in word_pairs:
        word_pairs[current_word].append(next_word)
    else:
        word_pairs[current_word] = [next_word]
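The Counter we imported earlier is handy for inspecting this dictionary: it tallies how often each successor appears after a given word, which is exactly the frequency information a real model converts into probabilities. A quick self-contained check against the small corpus above:

```python
import re
from collections import Counter

text_corpus = """
A language model is a statistical method that predicts the next word in a sequence.
Statistical language models rely on probability to guess the next word.
They are trained on vast amounts of text data containing patterns.
"""
text_corpus = re.sub(r"[^a-z\s]", '', text_corpus.lower())
words = text_corpus.split()

# Same pair-building step as above, written compactly
word_pairs = {}
for current, nxt in zip(words, words[1:]):
    word_pairs.setdefault(current, []).append(nxt)

# Tally how often each word follows "the"
print(Counter(word_pairs["the"]))  # "the" is always followed by "next" in this corpus
```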
Generating Text
Now, it's time for the LLM-like prediction:
def generate_sentence(start_word, length=10):
    sentence = [start_word]
    for _ in range(length):
        possible_words = word_pairs.get(sentence[-1], [])  # Possible next words
        if possible_words:
            next_word = random.choice(possible_words)  # Randomly choose one
            sentence.append(next_word)
        else:
            break  # No options found, stop
    return ' '.join(sentence)
print(generate_sentence('language', 10))
Based on the above idea, let's build a more complete model:
import random
import re
from collections import defaultdict, Counter
class NgramModel:
    def __init__(self, corpus, n=2):
        """Initializes the N-gram model with the given corpus and N."""
        self.n = n
        self.ngram_counts = defaultdict(Counter)
        # Clean and process the corpus
        words = self.clean_text(corpus)
        self.build_ngram_counts(words)

    def clean_text(self, corpus):
        """Preprocesses the text."""
        corpus = corpus.lower()
        corpus = re.sub(r"[^a-z\s]", '', corpus)
        return corpus.split()

    def build_ngram_counts(self, words):
        """Builds the N-gram count structure."""
        for i in range(len(words) - self.n + 1):
            ngram = tuple(words[i:i + self.n])
            self.ngram_counts[ngram[:-1]][ngram[-1]] += 1

    def generate_sentence(self, start_word, length=10):
        """Generates a sentence of the specified length.

        Note: a single start word only provides enough context for n=2;
        larger n requires n - 1 seed words.
        """
        sentence = [start_word]
        for _ in range(length):
            context = tuple(sentence[-(self.n - 1):])
            possible_next_words = list(self.ngram_counts[context].keys())
            if not possible_next_words:
                break
            counts = list(self.ngram_counts[context].values())
            # Add-one (Laplace) smoothing over the observed successors
            smoothed_counts = [count + 1 for count in counts]
            total_count = sum(smoothed_counts)
            probs = [count / total_count for count in smoothed_counts]
            next_word = random.choices(possible_next_words, weights=probs)[0]
            sentence.append(next_word)
        return ' '.join(sentence)
# ---- text corpus ----
text_corpus = """Artificial intelligence is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. Within the field of computer science, AI research is defined as the study of intelligent agents. This means that anything can be considered an AI if it mimics human intelligence to learn, understand, and apply knowledge, reason and solve problems, perceive the environment, and interact. In medicine, we can use AI and machine learning to predict patterns and assist with diagnoses. In advertising, we can utilize AI to better target audiences and make predictions about buying behavior. And in education, we can apply AI to personalized learning, making instructions tailor-fit to each learner's needs."""
# ---- model ----
model = NgramModel(text_corpus, n=2)
start_word = input("Enter a starting word: ")
print(model.generate_sentence(start_word, 15))
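The same idea generalizes beyond bigrams: with n=3, each prediction is conditioned on the previous two words, which typically yields more coherent (if more repetitive) output on a small corpus. Since generation then needs n - 1 seed words rather than one, here is a simplified, self-contained variant of the class above whose generator accepts a list of start words (the class is repeated so the snippet runs standalone, and the corpus is a shortened, made-up excerpt in the spirit of the one above):

```python
import random
import re
from collections import defaultdict, Counter

class NgramModel:
    """Condensed version of the class above, with list-of-words seeding."""
    def __init__(self, corpus, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)
        words = re.sub(r"[^a-z\s]", '', corpus.lower()).split()
        for i in range(len(words) - n + 1):
            ngram = tuple(words[i:i + n])
            self.ngram_counts[ngram[:-1]][ngram[-1]] += 1

    def generate_sentence(self, start_words, length=10):
        sentence = list(start_words)  # seed with n - 1 (or more) words
        for _ in range(length):
            context = tuple(sentence[-(self.n - 1):])
            candidates = self.ngram_counts[context]
            if not candidates:
                break
            words_, counts = zip(*candidates.items())
            sentence.append(random.choices(words_, weights=counts)[0])
        return ' '.join(sentence)

corpus = ("artificial intelligence is intelligence demonstrated by machines "
          "unlike the natural intelligence displayed by humans and animals")

model = NgramModel(corpus, n=3)
print(model.generate_sentence(["artificial", "intelligence"], 8))
# -> artificial intelligence is intelligence demonstrated by machines unlike the natural
```

On this tiny corpus every two-word context has a unique successor, so the trigram model simply replays the source text; with a larger corpus, contexts recur with different successors and the sampling becomes genuinely probabilistic.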