Cornell Method — LLM Fundamentals

Main Notes

What is a Large Language Model?

An LLM is a deep neural network trained on massive text corpora to predict the next token given a sequence of prior tokens. The underlying architecture is the Transformer (Vaswani et al., 2017). Parameter counts range from a few billion (small enough for edge deployment, e.g. ~7B) to a reported 1T+ (frontier). The model learns grammar, facts, reasoning patterns, and style purely from the statistical structure of language, with no hand-coded rules.

Transformer Architecture

Input Embedding — tokens mapped to dense vectors; positional encodings added
Multi-Head Self-Attention — each token attends to every other simultaneously across H parallel heads
Feed-Forward Network — two linear layers with GELU activation applied to each token independently
Layer Norm + Residuals — stabilise training; allow very deep stacks (32–96 layers typical)
Output Head — final hidden state projected to vocabulary distribution via softmax
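
The last two steps of the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the function names, the toy vectors, and the 4-token vocabulary are all made up for the example, and the layer norm omits the learned scale and shift that real implementations carry.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_head(hidden, w_vocab):
    """Project a hidden state to one logit per vocabulary entry, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in w_vocab]
    return softmax(logits)

# Hypothetical toy values: d_model = 3, vocabulary of 4 tokens.
hidden = [0.5, -1.0, 2.0]
sublayer_out = [0.1, 0.2, -0.3]  # stand-in for an attention or FFN output
residual = [h + s for h, s in zip(hidden, sublayer_out)]  # residual connection
normed = layer_norm(residual)

w_vocab = [
    [0.2, -0.1, 0.4],
    [-0.3, 0.5, 0.1],
    [0.0, 0.2, -0.2],
    [0.6, 0.1, 0.3],
]
probs = output_head(normed, w_vocab)
next_token_id = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
```

Greedy selection is only one decoding choice; sampling from probs instead gives more varied output.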

Tokenization

BPE (Byte-Pair Encoding) and related subword schemes (often applied through tools such as SentencePiece) compress text to subword tokens. For English, 100 words ≈ 130 tokens. Vocabulary size is typically 32K–128K. Tokenization helps explain why LLMs struggle with character-level tasks (counting letters, anagrams): the model receives subword IDs, so it rarely "sees" the individual characters inside a token.
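
A toy version of BPE training shows how subword tokens arise. The sketch below repeatedly merges the most frequent adjacent symbol pair; the corpus, its word frequencies, and the helper names are invented for illustration, and production tokenizers add byte-level handling, special tokens, and far larger merge tables.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, each word split into characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # 'low' is now a single symbol in every word
```

After two merges the model sees "low" as one unit, which is why it no longer has direct access to the letters l, o, w.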

Self-Attention (core mechanism)

For each token, compute Query, Key, Value projections. Attention weights tell each token how much to "borrow" from every other token's Value.

Attn(Q,K,V) = softmax( Q·Kᵀ / √d_k ) · V

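Dividing by √d_k keeps the dot products from growing with the key dimension; unscaled, large scores would saturate the softmax and leave it with vanishingly small gradients. Causal masking blocks attention to future tokens, both during training and during generation.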
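
The formula and the causal mask can be written out directly. This is a single-head sketch in plain Python, assuming Q, K, and V are already projected; the function name and the toy inputs are illustrative, and real implementations use batched tensor operations instead of loops.

```python
import math

def softmax(row):
    """Softmax over one row of scores; exp(-inf) evaluates to 0.0 for masked entries."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V are lists of n vectors (one per token); Q and K share dimension d_k.
    Returns the attended outputs and the attention weight matrix.
    """
    n, d_k = len(Q), len(K[0])
    outputs, weights = [], []
    for i in range(n):
        # Score token i against every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            for j in range(n)
        ]
        # Causal mask: positions j > i get -inf, so softmax gives them weight 0.
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted sum of value vectors.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        )
    return outputs, weights

# Hypothetical toy input: 3 tokens, d_k = 2, with Q = K = V for simplicity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(X, X, X)
```

The first token can only attend to itself, so its weight row is [1.0, 0.0, 0.0] and its output equals its own value vector.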

Training Phases

Pre-training — self-supervised next-token prediction on trillions of tokens. Encodes world knowledge into weights.
Supervised Fine-Tuning (SFT) — teach the model to follow instructions using curated human-written demonstrations.
RLHF — human raters rank outputs; a reward model is trained on rankings; policy updated via PPO to maximise reward.
DPO — Direct Preference Optimisation; simpler RLHF variant that skips the explicit reward model.
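
Two of the objectives above reduce to short formulas. The sketch below shows the per-position next-token loss used in pre-training and SFT, and the DPO loss for a single preference pair. The function names, the input numbers, and beta = 0.1 are assumptions chosen for illustration; the log-probabilities would normally come from the policy and a frozen reference model.

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy at one position: negative log-probability of the true next token."""
    return -math.log(probs[target_id])

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under the
    policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    sigmoid = 1.0 / (1.0 + math.exp(-beta * margin))
    return -math.log(sigmoid)

# Hypothetical numbers for illustration only.
ce = next_token_loss([0.25, 0.5, 0.25], target_id=1)   # -log(0.5)
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)              # margin 0, loss = log(2)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)             # positive margin, lower loss
```

The DPO loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is how it avoids training a separate reward model.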

Scaling Laws & Emergent Abilities

Loss falls predictably, roughly as a power law, as compute, data, and parameter count grow (Chinchilla law: for a fixed compute budget, scale data and parameters together). Emergent capabilities such as chain-of-thought reasoning, in-context learning, and multi-step arithmetic have been reported to appear abruptly at certain scale thresholds with no explicit training signal.
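
A back-of-the-envelope calculation makes the Chinchilla trade-off concrete. The sketch below relies on two widely quoted rules of thumb that are not stated in these notes and should be treated as assumptions: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter.

```python
import math

# Assumed rules of thumb (approximations, not exact laws).
FLOPS_PER_PARAM_TOKEN = 6   # C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # D ≈ 20 * N at the compute-optimal point

def training_flops(params, tokens):
    """Approximate training compute for a dense Transformer."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal_split(flops_budget):
    """Split a FLOP budget into parameters and tokens under the assumptions above.

    Substituting D = 20 * N into C = 6 * N * D gives N = sqrt(C / 120).
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# A budget of 5.88e23 FLOPs works out to roughly 70B parameters and 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal_split(budget)
```

Under these assumptions, a 4x larger budget doubles both the parameter count and the token count, since each grows with the square root of compute.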

Context Window

Maximum tokens processed in one forward pass. Standard attention is O(n²) in memory and compute. GPT-3: 2K → GPT-4 Turbo: 128K → Gemini 1.5 Pro: 1M+. Approaches to longer contexts: sparse and sliding-window attention (reduce the quadratic cost), RoPE/ALiBi positional encodings (help models generalise to longer sequences), and state-space models (Mamba) that are O(n).
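
The quadratic cost is easy to see by counting the entries of the attention score matrix. The sketch below compares naive full attention with a sliding window. The function names, the head count, and the 2-byte (fp16) value size are illustrative assumptions, and optimised kernels can avoid storing the full matrix, so treat the numbers as an upper bound on a naive implementation.

```python
def attention_score_bytes(n_tokens, n_heads, bytes_per_value=2):
    """Memory for the full n x n score matrices of one layer (naive attention)."""
    return n_heads * n_tokens * n_tokens * bytes_per_value

def sliding_window_score_bytes(n_tokens, n_heads, window, bytes_per_value=2):
    """Memory when each token attends to at most `window` positions."""
    return n_heads * n_tokens * min(window, n_tokens) * bytes_per_value

# Hypothetical configuration: 32 heads, fp16 scores.
full_4k = attention_score_bytes(4096, 32)    # 2**30 bytes = 1 GiB per layer
full_8k = attention_score_bytes(8192, 32)    # 4x the 4K figure
local_4k = sliding_window_score_bytes(4096, 32, window=512)
local_8k = sliding_window_score_bytes(8192, 32, window=512)  # 2x the 4K figure
```

Doubling the context quadruples the full-attention footprint but only doubles the sliding-window one, which is the motivation for the sub-quadratic variants listed above.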

Hallucination

The model generates plausible text, not verified text. It has no ground-truth lookup — it samples from a learned distribution. High confidence ≠ factual accuracy. Key mitigations:

RAG — retrieve and inject real source documents into the prompt
Tool use — call external APIs, databases, calculators at inference time
RLHF alignment — reduce sycophancy and overconfidence during training
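
The RAG idea can be illustrated with a toy retriever. The sketch below ranks documents by word overlap with the question and pastes the best match into the prompt. The document list, function names, and prompt wording are hypothetical, and real systems typically use embedding-based search over a vector index instead of word overlap.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Inject the retrieved sources into the prompt ahead of the question."""
    sources = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer using only the sources below.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )

# Hypothetical mini knowledge base built from facts in these notes.
documents = [
    "The Transformer architecture was introduced by Vaswani et al. in 2017.",
    "Byte-Pair Encoding merges frequent symbol pairs into subword tokens.",
    "Mamba is a state-space model with linear-time sequence processing.",
]
question = "Who introduced the Transformer architecture?"
prompt = build_prompt(question, documents)
```

The resulting prompt would then be sent to the model, which can ground its answer in the retrieved text instead of relying only on its weights.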

Key Trade-offs

Capability vs. Cost — larger models perform better but cost more to run
Latency vs. Quality — smaller models respond faster (and streaming hides part of the wait), but larger models usually give better answers
Open vs. Closed — open weights give control; closed APIs offer convenience
Safety vs. Helpfulness — over-refusal reduces utility; under-refusal raises risk

Summary

Large Language Models are Transformer-based networks trained via self-supervised next-token prediction on internet-scale text. The Transformer's self-attention mechanism is the engine: every token attends to every other token (or, under causal masking, every earlier one), capturing long-range dependencies. Pre-training encodes world knowledge; fine-tuning (SFT + RLHF/DPO) aligns behaviour. Scale unlocks emergent abilities, such as chain-of-thought reasoning and in-context learning, that are not explicitly trained for. Core limitations (hallucination, bounded context, knowledge cutoff) are addressed through RAG, tool use, and extended context architectures. Understanding these foundations is the prerequisite for using LLMs effectively.