Artificial Intelligence — Natural Language Processing
2026-05-19LLM Architecture Overview
Main Notes
Definition
A Large Language Model is a deep-learning model trained on massive text corpora to understand and generate human-like text. Built on the Transformer architecture (Vaswani et al., 2017). Parameters range from billions to trillions.
Key Components
Tokenizer — splits raw text into tokens (subwords via BPE or WordPiece)
Embeddings — maps each token to a dense vector in high-dimensional space
Output Head — projects final hidden state to vocabulary probabilities (softmax)
Self-Attention core mechanism
Every token attends to every other token in the context window simultaneously. Multi-head attention runs in parallel across H heads, each learning different relationship patterns.
Pre-training — next-token prediction on trillions of tokens (self-supervised, no labels)
Supervised Fine-tuning (SFT) — adapt to task-specific labeled instruction data
RLHF — human raters rank outputs; reward model trained; policy updated via PPO
Scaling Laws & Emergent Abilities
Performance scales predictably with compute, data, and parameters (Hoffmann et al., Chinchilla). Emergent capabilities — in-context learning, chain-of-thought, arithmetic — appear abruptly beyond certain scale thresholds and were not explicitly trained for.
Context Window
Maximum tokens processed in a single forward pass. GPT-3: 4K → GPT-4 Turbo: 128K → Gemini 1.5 Pro: 1M+. Larger windows enable document-level reasoning but memory scales quadratically — attention is O(n²).
Hallucinations
Models generate confident but factually incorrect text — trained to produce plausible continuations, not verified facts. Common mitigations:
RAG — ground responses in retrieved documents
Tool use — call external APIs and databases at inference time
Constitutional AI / RLHF — reduce sycophancy and over-confidence
Retrieval-Augmented Generation (RAG)
At inference time: (1) embed query → (2) retrieve top-k chunks from vector DB → (3) inject into prompt → (4) generate grounded response. Keeps knowledge current without retraining.
Prompt Engineering
Zero-shot — describe task only; rely on pre-trained world knowledge
Few-shot — provide 2–8 worked examples directly in the prompt
Chain-of-Thought (CoT) — "think step by step" unlocks multi-step reasoning
System prompts — persistent instructions shaping model persona and constraints
Key Limitations
No persistent memory across sessions without external storage layer
Knowledge cutoff date — stale on events beyond training data
Highly sensitive to prompt phrasing (brittle, fragile outputs)
High inference cost at scale (GPU/TPU hours, significant energy spend)
Bias and harmful stereotypes inherited from web-scraped training data
Mistral Large (Mistral AI) — efficient European open model
Summary
Large Language Models are Transformer-based neural networks pre-trained on internet-scale text to predict the next token.
Their power derives from self-attention — which captures rich contextual relationships across the full input — and
from scale, where billions of parameters unlock emergent reasoning not explicitly trained for.
The standard pipeline is self-supervised pre-training → supervised fine-tuning → RLHF alignment.
Core practical concerns include hallucinations (mitigated via RAG and tool use), bounded context windows,
knowledge cutoffs, and prompt sensitivity. Effective LLM deployment combines the right blend of prompt engineering,
fine-tuning, and retrieval augmentation matched to the requirements of the target task.
New Notebook
What would you like to call this notebook?
Give it a name that describes what you're studying. You can always rename it later by clicking the badge.