Main Notes

Definition

A Large Language Model (LLM) is a deep-learning model trained on massive text corpora to process and generate human-like text. It is built on the Transformer architecture (Vaswani et al., 2017). Parameter counts range from billions to trillions.

Key Components
  • Tokenizer — splits raw text into tokens (subwords via BPE or WordPiece)
  • Embeddings — maps each token to a dense vector in high-dimensional space
  • Transformer Layers — stacked blocks of multi-head self-attention + feed-forward network
  • Output Head — projects final hidden state to vocabulary probabilities (softmax)
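
The four components above can be wired together in a toy sketch. This is only an illustration under loud assumptions: the vocabulary, the random weights, and the names `tokenize` and `next_token_probs` are hypothetical, a whitespace split stands in for BPE/WordPiece, and the Transformer layers are skipped entirely (the last token's embedding is used as a stand-in for the final hidden state).

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

# Hypothetical toy vocabulary; real LLMs learn subword vocabularies (BPE, WordPiece).
VOCAB = ["<unk>", "the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
D_MODEL = 4  # embedding width, chosen arbitrarily for the example

# Embedding table: one dense vector per vocabulary entry.
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]
# Output head: one weight vector per vocabulary entry (hidden state -> logits).
OUTPUT_HEAD = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in VOCAB]


def tokenize(text):
    """Whitespace split standing in for a subword tokenizer."""
    unk = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(word, unk) for word in text.lower().split()]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(text):
    """Tokenize, embed, skip the Transformer layers, apply the output head."""
    token_ids = tokenize(text)
    hidden = EMBEDDINGS[token_ids[-1]]  # stand-in for the final hidden state
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in OUTPUT_HEAD]
    return softmax(logits)
```

Calling `next_token_probs("the cat sat")` returns one probability per vocabulary entry. With untrained random weights the distribution is meaningless; the point is only the data flow from text to token IDs to vectors to probabilities.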

Self-Attention (core mechanism)

Every token attends to the other tokens in the context window in parallel; decoder-only LLMs apply a causal mask so each token attends only to itself and earlier positions. Multi-head attention runs H heads in parallel, each able to learn different relationship patterns.

Formula: Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.
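
The formula can be sketched for a single head in plain Python. This is a minimal illustration, not an optimized implementation: the function name `attention` and the `causal` flag are assumptions made for the example, and real models compute this with batched matrix operations on learned projections of the input.

```python
import math


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0 for masked slots."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors. With causal=True, position i may only
    attend to positions <= i, as in decoder-only language models.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output column at a time.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Each row of the returned weights sums to 1, and with `causal=True` the first token can only attend to itself, so its output equals its own value vector.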

Training Phases
  • Pre-training — next-token prediction on trillions of tokens (self-supervised, no labels)
  • Supervised Fine-tuning (SFT) — fine-tune on labeled instruction-response pairs for the target tasks
  • RLHF — human raters rank outputs; reward model trained; policy updated via PPO
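
Two of the objectives behind these phases can be written down compactly. The sketch below assumes the model's per-position probabilities and the reward scores are already available as plain Python numbers; the function names are hypothetical, and the pairwise loss shown is one commonly used Bradley-Terry style formulation for reward models, not necessarily what any particular lab uses.

```python
import math


def next_token_loss(probs_per_position, token_ids):
    """Average cross-entropy for next-token prediction (pre-training).

    probs_per_position[t] is the model's distribution over the vocabulary
    after reading token_ids[: t + 1]; the label is simply token_ids[t + 1],
    which is why no human annotation is needed.
    """
    targets = token_ids[1:]
    losses = [-math.log(probs[target])
              for probs, target in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Reward-model loss for one ranked pair: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))
```

A model that predicts uniformly over a 4-token vocabulary gets a next-token loss of ln 4, and the pairwise loss shrinks as the reward model scores the human-preferred output further above the rejected one.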

Scaling Laws & Emergent Abilities

Loss improves predictably as compute, data, and parameters grow (Hoffmann et al., 2022, "Chinchilla"). Emergent capabilities such as in-context learning, chain-of-thought reasoning, and arithmetic have been reported to appear abruptly beyond certain scale thresholds without being explicitly trained for, though how abrupt they look can depend on the evaluation metric.
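
As a rough worked example of compute-optimal scaling, the sketch below combines two widely quoted approximations: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Both constants are approximations assumed for illustration, and the function name `compute_optimal` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # common approximation: C ~ 6 * N * D
TOKENS_PER_PARAM = 20      # rough Chinchilla rule of thumb, not an exact constant


def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, training tokens).

    Solves C = 6 * N * D together with D = 20 * N for N and D.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens
```

For a budget of about 5.9e23 FLOPs this gives roughly 70 billion parameters and 1.4 trillion tokens, which matches the configuration reported for Chinchilla.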

Context Window

Maximum number of tokens the model can process in a single forward pass. GPT-3: 2K → GPT-4 Turbo: 128K → Gemini 1.5 Pro: 1M+. Larger windows enable document-level reasoning, but the cost of standard self-attention grows quadratically with sequence length: O(n²).
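
To make the quadratic growth concrete, the sketch below counts the bytes needed to hold the full n × n attention score matrix, assuming a naive implementation that materializes it and 2-byte (fp16) values. The function name and defaults are assumptions for illustration; optimized attention kernels avoid storing the whole matrix, though the number of score computations remains quadratic.

```python
def attention_matrix_bytes(n_tokens, n_heads=1, bytes_per_value=2):
    """Bytes to hold one n x n score matrix per head (fp16 by default)."""
    return n_tokens * n_tokens * n_heads * bytes_per_value


small = attention_matrix_bytes(4 * 1024)    # 4K-token window
large = attention_matrix_bytes(128 * 1024)  # 128K-token window
growth = large // small                     # 32x more tokens -> 1024x more memory
```

Under these assumptions a single head needs 32 MiB at 4K tokens and 32 GiB at 128K tokens: a 32-fold longer window costs 1,024 times as much.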

Hallucinations

Models can generate confident but factually incorrect text because they are trained to produce plausible continuations, not verified facts. Common mitigations:

  • RAG — ground responses in retrieved documents
  • Tool use — call external APIs and databases at inference time
  • Constitutional AI / RLHF — train the model to prefer honest, calibrated answers over sycophantic or over-confident ones

Retrieval-Augmented Generation (RAG)

At inference time: (1) embed query → (2) retrieve top-k chunks from vector DB → (3) inject into prompt → (4) generate grounded response. Keeps knowledge current without retraining.
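
Steps (1) to (3) can be sketched without any external service. This is a toy under stated assumptions: bag-of-words counts stand in for a learned embedding model, an in-memory list stands in for the vector DB, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical. Step (4) would pass the resulting prompt to an LLM, which is left out here.

```python
import math
from collections import Counter


def embed(text):
    """Bag-of-words counts standing in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Inject the retrieved chunks into a prompt for the generator."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Because the knowledge lives in `chunks` rather than in the model weights, updating the document store changes what the model can ground its answer in without any retraining.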

Prompt Engineering
  • Zero-shot — describe task only; rely on pre-trained world knowledge
  • Few-shot — provide 2–8 worked examples directly in the prompt
  • Chain-of-Thought (CoT) — prompting with "think step by step" often improves multi-step reasoning
  • System prompts — persistent instructions shaping model persona and constraints
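
The four techniques above differ mainly in what text is placed before the question, which a small prompt builder can show. The function name `format_prompt` and the "Q:/A:" layout are assumptions for illustration; chat-style APIs usually take the system prompt as a separate message rather than as part of one flat string.

```python
def format_prompt(task, question, examples=(), chain_of_thought=False, system=None):
    """Assemble a prompt string.

    examples: (question, answer) pairs; empty means zero-shot, non-empty means few-shot.
    chain_of_thought: if True, start the answer with a step-by-step cue.
    system: optional persistent instruction placed first.
    """
    parts = []
    if system:
        parts.append(f"System: {system}")
    parts.append(task)
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    answer_prefix = "A: Let's think step by step." if chain_of_thought else "A:"
    parts.append(f"Q: {question}\n{answer_prefix}")
    return "\n\n".join(parts)
```

For example, `format_prompt("Classify the sentiment.", "great movie")` is zero-shot, while passing `examples=[("awful plot", "negative"), ("loved it", "positive")]` turns the same call into a few-shot prompt.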

Key Limitations
  • No persistent memory across sessions without external storage layer
  • Knowledge cutoff date — stale on events beyond training data
  • Highly sensitive to prompt phrasing (small wording changes can shift outputs)
  • High inference cost at scale (GPU/TPU hours, significant energy spend)
  • Bias and harmful stereotypes inherited from web-scraped training data

Notable Models (2024–2025)
  • GPT-4o (OpenAI) — multimodal, 128K context, vision + audio
  • Claude 3.5 / 4.x (Anthropic) — strong reasoning & safety alignment focus
  • Gemini 1.5 Pro (Google DeepMind) — 1M token context window
  • Llama 3.1 405B (Meta) — open-weight, community license that permits commercial use with conditions
  • Mistral Large (Mistral AI) — efficient European model; Mistral Large 2 weights released under a research license

Summary

Large Language Models are Transformer-based neural networks pre-trained on internet-scale text to predict the next token. Their power derives from self-attention — which captures rich contextual relationships across the full input — and from scale, where billions of parameters unlock emergent reasoning not explicitly trained for. The standard pipeline is self-supervised pre-training → supervised fine-tuning → RLHF alignment. Core practical concerns include hallucinations (mitigated via RAG and tool use), bounded context windows, knowledge cutoffs, and prompt sensitivity. Effective LLM deployment combines the right blend of prompt engineering, fine-tuning, and retrieval augmentation matched to the requirements of the target task.