What an LLM is, how it's built, and why it works
Main Notes
An LLM is a deep neural network trained on massive text corpora to predict the next token given a sequence of prior tokens. The underlying architecture is the Transformer (Vaswani et al., 2017). Parameter counts range from ~7B (edge-deployable) to 1T+ (frontier). The model learns grammar, facts, reasoning patterns, and style purely from the statistical structure of language — no hand-coded rules.
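To make "predict the next token" concrete, here is a minimal sketch that swaps the neural network for a simple bigram count table. This is a stand-in for illustration only, not how an LLM is implemented; the function names are hypothetical. The objective is the same in spirit: assign high probability to the token that actually comes next, measured by average negative log-likelihood.

```python
from collections import Counter, defaultdict
import math

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_nll(counts, tokens):
    """Average negative log-likelihood (in nats) of each next token.

    Only defined here for sequences whose bigrams were seen in training.
    """
    nll = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        nll -= math.log(next_token_probs(counts, prev)[nxt])
    return nll / (len(tokens) - 1)

tokens = "the cat sat on the mat".split()
counts = train_bigram(tokens)
probs_after_the = next_token_probs(counts, "the")  # "cat" and "mat" are equally likely
loss = average_nll(counts, tokens)
```

An LLM replaces the count table with a Transformer conditioned on the whole preceding context, and lowers the same kind of loss by gradient descent rather than by counting.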
Subword tokenizers, typically BPE (Byte-Pair Encoding) or a unigram model and often implemented with the SentencePiece library, split text into subword tokens. 100 words of English text ≈ 130 tokens. Vocabulary size is typically 32K–128K. Tokenization explains why LLMs struggle with character-level tasks (counting letters, anagrams): the model operates on token IDs, so it rarely "sees" individual characters.
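The merge procedure behind BPE can be sketched in a few lines. This is a toy version under simplifying assumptions (character-level start symbols, no byte fallback, no end-of-word marker), and the function names are hypothetical rather than taken from any tokenizer library.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe_merges(["low"] * 5 + ["lower"] * 2, num_merges=2)
tokens = tokenize("lowest", merges)
```

After two merges the frequent string "low" has become a single symbol, so "lowest" is split into "low", "e", "s", "t". The model downstream receives one ID for "low" and has no direct view of its three letters, which is the root of the character-level difficulty.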
For each token, compute Query, Key, Value projections. Attention weights tell each token how much to "borrow" from every other token's Value.
Attn(Q,K,V) = softmax( Q·Kᵀ / √d_k ) · V
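Dividing by √d_k keeps the dot products from growing with the key dimension; without it, large scores saturate the softmax and leave it with vanishingly small gradients. Causal masking blocks attention to future tokens, both during training and during generation.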
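The formula and the causal mask can be sketched in plain Python for a single attention head. This is a minimal, unoptimized illustration (real implementations use batched tensor operations and multiple heads); the function names are hypothetical, and Q, K, V are assumed to be already-projected lists of vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q and K are lists of n vectors of length d_k; V is a list of n vectors.
    Returns (outputs, weights), where weights[i][j] is how much token i
    borrows from token j.
    """
    n, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for i in range(n):
        # Causal mask: token i only scores positions 0..i. Leaving out the
        # future positions is equivalent to setting their scores to -inf.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        row = softmax(scores) + [0.0] * (n - i - 1)
        weights.append(row)
        # Output is the attention-weighted sum of the Value vectors.
        outputs.append([
            sum(row[j] * V[j][c] for j in range(n))
            for c in range(len(V[0]))
        ])
    return outputs, weights

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
outputs, weights = causal_attention(Q, K, V)
```

The first token can only attend to itself, so its output equals its own Value vector; every later row of weights sums to 1 and is zero above the diagonal.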
Loss falls predictably, roughly as a power law, as compute, data, and parameters grow (Chinchilla law: for a fixed compute budget, scale data and parameters together rather than parameters alone). Emergent capabilities such as chain-of-thought reasoning, in-context learning, and multi-step arithmetic have been reported to appear abruptly at certain scale thresholds with no explicit training signal.
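A rough worked example of the Chinchilla trade-off, under two commonly quoted rules of thumb that are assumptions here and not stated in these notes: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of about 20 tokens per parameter. The function name and defaults are hypothetical.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a training compute budget into parameters and tokens.

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    Both constants are approximations, not exact laws.
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget of about 5.88e23 FLOPs, roughly the scale of the Chinchilla model.
n_params, n_tokens = compute_optimal_split(5.88e23)
```

Under these assumptions the budget above works out to about 7e10 parameters and 1.4e12 tokens, in line with the 70B-parameter, 1.4T-token Chinchilla configuration. Doubling parameters without adding data moves away from this optimum.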
The context window is the maximum number of tokens processed in one forward pass. Standard self-attention is O(n²) in memory and compute. GPT-3: 2K → GPT-4 Turbo: 128K → Gemini 1.5 Pro: 1M+. Solutions: sparse attention, sliding-window attention, RoPE/ALiBi positional encodings (which help models generalise to longer sequences), and state-space models (Mamba) that are O(n).
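A back-of-the-envelope sketch of the quadratic cost, assuming the n × n score matrix is materialized in full for one head in one layer at 2 bytes per element (16-bit floats). These are illustrative assumptions; optimized kernels may avoid storing the full matrix, and the function names are hypothetical.

```python
def attention_matrix_bytes(n_tokens, bytes_per_element=2):
    """Memory for one full n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_element

def sliding_window_bytes(n_tokens, window, bytes_per_element=2):
    """Memory when each token attends to at most `window` positions."""
    return n_tokens * min(n_tokens, window) * bytes_per_element

MIB, GIB = 2**20, 2**30
full_4k = attention_matrix_bytes(4096) / MIB               # 32.0 MiB
full_128k = attention_matrix_bytes(131072) / GIB           # 32.0 GiB
windowed_128k = sliding_window_bytes(131072, 4096) / GIB   # 1.0 GiB
```

Growing the context 32× (4K to 128K) grows the full matrix 1024×, while a fixed 4K sliding window grows only linearly with sequence length.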
The model generates plausible text, not verified text. It has no ground-truth lookup; it samples from a learned distribution. High confidence ≠ factual accuracy. Key mitigations include retrieval-augmented generation (RAG) and tool use, which ground answers in external sources.
Summary
Large Language Models are Transformer-based networks trained via self-supervised next-token prediction on internet-scale text. The Transformer's self-attention mechanism is the engine: every token attends to every token before it (causal masking hides later ones), capturing long-range dependencies. Pre-training encodes world knowledge; fine-tuning (SFT + RLHF/DPO) aligns behaviour. Scale unlocks emergent abilities such as reasoning and instruction-following that were not explicitly trained for. Core limitations (hallucination, bounded context, knowledge cutoff) are addressed through RAG, tool use, and extended context architectures. Understanding these foundations is the prerequisite for using LLMs effectively.