I.Data Preparation (foundation of every LLM)
A.Data Collection
Internet-scale corpora assembled from diverse sources.
1.Web crawls — CommonCrawl, C4, FineWeb (multi-TB scale)
2.Curated books and papers — Books3, Project Gutenberg, arXiv papers
3.Code — GitHub, Stack Overflow (critical for coding ability)
4.Wikipedia, encyclopedias — high-density factual text
B.Data Cleaning & Filtering
Quality matters more than raw volume at scale.
1.Deduplication — exact and near-duplicate removal (MinHash, suffix arrays)
2.Heuristic filters — remove boilerplate, spam, low-information pages
3.Language identification — keep target-language content
4.Safety filtering — remove CSAM, extreme content
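The near-duplicate removal step above can be sketched with MinHash. This is a minimal illustration, not a production pipeline: the function names, the 3-word shingle size and the 64 hash functions are all assumptions chosen for readability, and real pipelines usually add locality-sensitive hashing so that not every pair of documents is compared.

```python
import hashlib
import re

def shingles(text, k=3):
    """Return the set of k-word shingles of a document (k=3 is an arbitrary choice)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
similarity = estimated_jaccard(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
# Documents whose estimated similarity exceeds a chosen threshold (e.g. 0.8)
# would be treated as near-duplicates and all but one dropped.
```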
C.Tokenization
Text → integer token IDs for the model to consume.
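1.BPE (Byte-Pair Encoding) — used by GPT series; builds vocab by repeatedly merging the most frequent adjacent pair of symbols (bytes in byte-level BPE)
2.SentencePiece (BPE or Unigram mode) — used by LLaMA 1/2 (BPE) and T5 (Unigram); language-agnostic subword segmentation on raw text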
3.Vocabulary size typically 32K–128K; ~1.3 tokens per word in English
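The BPE merge loop described above can be sketched on a toy corpus. This is a simplified illustration under stated assumptions: it starts from characters rather than bytes, ignores word-boundary markers, and `train_bpe` is a hypothetical helper, not the API of any tokenizer library.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

toy_corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(toy_corpus, num_merges=3)
```

Each learned merge becomes one new vocabulary entry, so the number of merges plus the base alphabet determines the vocabulary size.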
II.Pre-Training (most expensive stage)
A.Objective
Causal Language Modelling — predict the next token given all prior tokens. Self-supervised; no labels needed.
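The next-token objective can be written as an average cross-entropy over positions. The sketch below assumes the model has already produced one list of vocabulary scores (logits) per position; `next_token_loss` is an illustrative name, and real training code would use a tensor library rather than Python lists.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the prefix ending at t.

    logits[t] holds one score per vocabulary entry, computed from token_ids[: t + 1].
    The targets are simply the input sequence shifted left by one, so no labels are needed.
    """
    targets = token_ids[1:]
    total = 0.0
    for scores, target in zip(logits, targets):
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]  # equals -log softmax(scores)[target]
    return total / len(targets)

# A model with no information (uniform scores over 4 tokens) scores log(4) per token.
uniform_loss = next_token_loss([[0.0] * 4] * 3, [1, 2, 3, 0])
```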
B.Architecture Choices
1.Decoder-only Transformer (GPT, LLaMA, Claude, Gemini) — standard for generative LLMs
2.Encoder-decoder (T5, BART) — used for seq2seq tasks
3.Positional encoding — RoPE (usually combined with frequency scaling for longer contexts) or ALiBi (designed for length extrapolation)
4.Grouped Query Attention (GQA) — reduces KV-cache memory at inference
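The KV-cache saving from GQA follows from simple arithmetic: the cache stores keys and values per KV head, so sharing each KV head across several query heads shrinks it proportionally. The model dimensions below are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Size of the KV cache; bytes_per_value=2 assumes bf16/fp16 storage."""
    # Leading factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical 32-layer model, 32 query heads of size 128, 4096-token context.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # one KV head per query head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)   # 4 query heads share a KV head
print(mha / 2**30, "GiB vs", gqa / 2**20, "MiB")  # 2.0 GiB vs 512.0 MiB
```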
C.Optimisation
1.Optimizer: AdamW with cosine LR decay + warmup
2.Mixed precision (bf16) + gradient checkpointing
3.Distributed training: tensor parallelism, pipeline parallelism, ZeRO sharding
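The warmup plus cosine decay schedule listed above can be sketched as a single function of the step index. The linear warmup shape, the `min_lr` floor and all numeric values are assumptions for illustration; actual runs tune these.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only: peak 3e-4, 100 warmup steps, 1000 steps in total.
schedule = [lr_at_step(s, 3e-4, 100, 1000, min_lr=3e-5) for s in range(1000)]
```

Warmup avoids large, noisy updates while the optimizer's moment estimates are still unreliable; the slow decay lets the model settle late in training.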
D.Compute Scale
1.Chinchilla-optimal: training tokens ≈ 20 × N params (e.g. 70B model → ~1.4T tokens)
2.Frontier runs: 10²³–10²⁵ FLOP; thousands of H100/A100 GPUs for months
3.Cost: $5M–$100M+ for a single frontier pre-training run
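The figures above can be connected with back-of-the-envelope arithmetic. The sketch assumes the common approximation of about 6 FLOP per parameter per training token, which is not stated in this outline, and the GPU count, peak throughput and utilisation are hypothetical round numbers rather than measurements.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget under the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param

def training_flop(n_params, n_tokens):
    """Approximate training compute: ~6 FLOP per parameter per token (assumption)."""
    return 6 * n_params * n_tokens

def gpu_days(total_flop, n_gpus, peak_flops_per_gpu, utilisation):
    """Wall-clock days given a cluster size and the fraction of peak actually achieved."""
    seconds = total_flop / (n_gpus * peak_flops_per_gpu * utilisation)
    return seconds / 86_400

n = 70_000_000_000                # 70B parameters
d = chinchilla_tokens(n)          # 1_400_000_000_000 tokens (1.4T)
c = training_flop(n, d)           # 5.88e23 FLOP, inside the 10^23 to 10^25 range
days = gpu_days(c, n_gpus=2048, peak_flops_per_gpu=1e15, utilisation=0.4)  # hypothetical cluster
```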
III.Supervised Fine-Tuning (SFT)
A.Purpose
Teaches the model to follow instructions and respond helpfully, rather than just continue text.
B.Data Format
1.Prompt-completion pairs written by human contractors or sourced from real user queries
2.Instruction-following datasets: FLAN, Alpaca, ShareGPT, Dolly
3.Scale: 10K–1M examples (much smaller than pre-training)
C.Parameter-Efficient Variants
1.LoRA — train low-rank adapter matrices; freeze base model weights
2.QLoRA — LoRA on 4-bit quantized base model; fits in consumer GPU VRAM
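The LoRA idea can be sketched for a single linear layer: the frozen weight W is left untouched and a low-rank product B·A is added on top. The `alpha / r` scaling follows the usual LoRA formulation; the helper names and the plain-list matrices are assumptions made to keep the example dependency-free.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x). W stays frozen; only A (r x d_in) and B (d_out x r) train."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, r):
    """Adapter parameters for one layer, versus d_in * d_out for full fine-tuning."""
    return r * (d_in + d_out)

# For a hypothetical 4096 x 4096 projection with rank 8:
adapter = lora_trainable_params(4096, 4096, r=8)  # 65_536
full = 4096 * 4096                                # 16_777_216
```

B is typically initialised to zero, so the adapted layer starts out identical to the base layer.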
IV.Alignment (safety-critical)
A.RLHF — Reinforcement Learning from Human Feedback
1.Collect human preference data: raters rank multiple model outputs
2.Train a reward model (RM) on those rankings
3.Fine-tune the LLM policy to maximise RM score using PPO
4.KL-divergence penalty keeps policy close to SFT model (limits reward hacking)
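Two of the RLHF steps above reduce to short formulas. The sketch assumes a Bradley-Terry pairwise loss for the reward model and a sequence-level KL estimate from log-probabilities, which is one common formulation; the function names and the `beta` value are illustrative, and the PPO update itself is omitted.

```python
import math

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss -log sigmoid(r_chosen - r_rejected), written in a stable form."""
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_penalised_reward(rm_score, logprob_policy, logprob_sft, beta):
    """Reward passed to the policy update: RM score minus a KL penalty.

    logprob_policy - logprob_sft for the sampled response is a single-sample
    estimate of the KL divergence between the policy and the SFT model.
    """
    return rm_score - beta * (logprob_policy - logprob_sft)

# The RM is penalised less when it already ranks the preferred output higher.
good_ranking = reward_model_loss(2.0, 0.0)
bad_ranking = reward_model_loss(0.0, 2.0)
# A response the policy finds far more likely than the SFT model loses reward.
shaped = kl_penalised_reward(rm_score=1.0, logprob_policy=-10.0, logprob_sft=-12.0, beta=0.1)
```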
B.DPO — Direct Preference Optimisation
1.Bypasses explicit reward model; optimises policy directly from preference pairs
2.Simpler and typically more stable to train than PPO-based RLHF; widely adopted since 2023
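The DPO objective for a single preference pair can be sketched from summed log-probabilities under the policy and a frozen reference model. The argument names and `beta=0.1` are assumptions for illustration; in practice the log-probabilities come from forward passes over the chosen and rejected responses.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one preference pair."""
    chosen_shift = policy_chosen_lp - ref_chosen_lp        # how much the policy raised the chosen response
    rejected_shift = policy_rejected_lp - ref_rejected_lp  # how much it raised the rejected response
    z = beta * (chosen_shift - rejected_shift)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))    # stable -log sigmoid(z)

# At initialisation the policy equals the reference, so the loss is log(2).
start = dpo_loss(-20.0, -25.0, -20.0, -25.0)
# Raising the chosen response relative to the rejected one lowers the loss.
improved = dpo_loss(-15.0, -30.0, -20.0, -25.0)
```

The log-ratio against the reference plays the role of the implicit reward, which is why no separate reward model is trained.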
C.Constitutional AI (CAI)
1.Model critiques its own outputs against a written "constitution" of principles
2.Reduces need for human labelling at scale; used by Anthropic
V.Evaluation
A.Automated Benchmarks
1.MMLU — 57-subject multiple choice; measures breadth of world knowledge
2.HumanEval / SWE-bench — code generation and resolving real GitHub issues
3.MATH, GSM8K — mathematical reasoning
4.GPQA — graduate-level, "Google-proof" questions in biology, physics and chemistry
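Code benchmarks such as HumanEval are commonly scored with pass@k, the probability that at least one of k sampled solutions passes the unit tests. The metric is not named in this outline, so treat the sketch below as supplementary; it uses the standard unbiased estimator computed from n samples of which c pass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 of them pass the tests.
p1 = pass_at_k(n=10, c=3, k=1)  # equals c / n
p5 = pass_at_k(n=10, c=3, k=5)
```

The per-problem values are averaged over the benchmark to give the reported score.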
B.Human Evaluation
1.Chatbot Arena — Elo-style ratings from blind pairwise comparisons
2.Red-teaming — adversarial probing for safety failures
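The pairwise ratings used by Chatbot Arena can be illustrated with the classic online Elo update. This is a sketch of the textbook formula, not necessarily the exact procedure any leaderboard runs; the K-factor of 32 and the 400-point scale are conventional defaults assumed here.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """Return new ratings; score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two models start level; a rater prefers model A in a blind comparison.
new_a, new_b = elo_update(1000, 1000, score_a=1.0)  # (1016.0, 984.0)
```

An upset win against a higher-rated model moves both ratings further than an expected win does.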
VI.Deployment
A.Serving Infrastructure
1.Inference servers: vLLM, TGI, TensorRT-LLM — paged KV-cache management (PagedAttention, introduced by vLLM) for memory efficiency
2.Quantization: 8-bit, 4-bit (GPTQ, AWQ) to reduce VRAM and cost
3.Speculative decoding — small draft model proposes several tokens that the large model verifies in one forward pass, speeding up generation
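The quantization item above can be illustrated with the simplest scheme, symmetric absmax rounding to int8. GPTQ and AWQ are considerably more sophisticated (they use calibration data and per-group scales); this sketch only shows the basic store-small, rescale-on-use idea, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each weight now needs 1 byte instead of 2 (bf16) or 4 (fp32),
# at the cost of a rounding error of at most scale / 2 per value.
```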
B.Access Patterns
1.API (OpenAI, Anthropic, Google) — pay-per-token; no infrastructure management
2.Self-hosted open weights (Llama, Mistral) — control, privacy, lower cost at scale
3.Edge deployment — quantized small models on device (Phi-3, Gemma 2B)