Cornell Notes — Large Language Models

Cues & Keywords

What is an LLM? Definition & architecture family

Transformer Core building block since 2017

Self-Attention How does a token "see" context?

Training Phases Pre-train → Fine-tune → RLHF

Scaling Laws Why does model size matter?

Context Window What limits document length?

Hallucination Confident but wrong — why?

RAG Retrieval-Augmented Generation

Prompt Engineering Zero-shot / Few-shot / CoT

Limitations When do LLMs fail?

Notable Models 2024–25 landscape

GPT-4o Claude 4 Gemini Llama 3

Main Notes

Definition

A Large Language Model is a deep-learning model trained on massive text corpora to understand and generate human-like text. Built on the Transformer architecture (Vaswani et al., 2017). Parameters range from billions to trillions.

Key Components

Tokenizer — splits raw text into tokens (subwords via BPE or WordPiece)
Embeddings — maps each token to a dense vector in high-dimensional space
Transformer Layers — stacked blocks of multi-head self-attention + feed-forward network
Output Head — projects final hidden state to vocabulary probabilities (softmax)

Self-Attention core mechanism

Every token attends to every other token in the context window simultaneously. Multi-head attention runs in parallel across H heads, each learning different relationship patterns.

Formula: Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

Training Phases

Pre-training — next-token prediction on trillions of tokens (self-supervised, no labels)
Supervised Fine-tuning (SFT) — adapt to task-specific labeled instruction data
RLHF — human raters rank outputs; reward model trained; policy updated via PPO

Scaling Laws & Emergent Abilities

Performance scales predictably with compute, data, and parameters (Hoffmann et al., Chinchilla). Emergent capabilities — in-context learning, chain-of-thought, arithmetic — appear abruptly beyond certain scale thresholds and were not explicitly trained for.

Context Window

Maximum tokens processed in a single forward pass. GPT-3: 4K → GPT-4 Turbo: 128K → Gemini 1.5 Pro: 1M+. Larger windows enable document-level reasoning but memory scales quadratically — attention is O(n²).

Hallucinations

Models generate confident but factually incorrect text — trained to produce plausible continuations, not verified facts. Common mitigations:

RAG — ground responses in retrieved documents
Tool use — call external APIs and databases at inference time
Constitutional AI / RLHF — reduce sycophancy and over-confidence

Retrieval-Augmented Generation (RAG)

At inference time: (1) embed query → (2) retrieve top-k chunks from vector DB → (3) inject into prompt → (4) generate grounded response. Keeps knowledge current without retraining.

Prompt Engineering

Zero-shot — describe task only; rely on pre-trained world knowledge
Few-shot — provide 2–8 worked examples directly in the prompt
Chain-of-Thought (CoT) — "think step by step" unlocks multi-step reasoning
System prompts — persistent instructions shaping model persona and constraints

Key Limitations

No persistent memory across sessions without external storage layer
Knowledge cutoff date — stale on events beyond training data
Highly sensitive to prompt phrasing (brittle, fragile outputs)
High inference cost at scale (GPU/TPU hours, significant energy spend)
Bias and harmful stereotypes inherited from web-scraped training data

Notable Models (2024–2025)

GPT-4o (OpenAI) — multimodal, 128K context, vision + audio
Claude 3.5 / 4.x (Anthropic) — strong reasoning & safety alignment focus
Gemini 1.5 Pro (Google DeepMind) — 1M token context window
Llama 3.3 405B (Meta) — open-weight, permissive commercial license
Mistral Large (Mistral AI) — efficient European open model

Summary

Large Language Models are Transformer-based neural networks pre-trained on internet-scale text to predict the next token. Their power derives from self-attention — which captures rich contextual relationships across the full input — and from scale, where billions of parameters unlock emergent reasoning not explicitly trained for. The standard pipeline is self-supervised pre-training → supervised fine-tuning → RLHF alignment. Core practical concerns include hallucinations (mitigated via RAG and tool use), bounded context windows, knowledge cutoffs, and prompt sensitivity. Effective LLM deployment combines the right blend of prompt engineering, fine-tuning, and retrieval augmentation matched to the requirements of the target task.

What would you like to call this notebook?