Outline Method — LLM Training Pipeline

I.Data Preparation (foundation of every LLM)

A.Data Collection

Internet-scale corpora assembled from diverse sources.

Web crawls — CommonCrawl, C4, FineWeb (multi-TB scale)
Books and papers — Books3, Project Gutenberg, arXiv
Code — GitHub, Stack Overflow (critical for coding ability)
Wikipedia, encyclopedias — high-density factual text

B.Data Cleaning & Filtering

Quality matters more than raw volume at scale.

Deduplication — exact and near-duplicate removal (MinHash, suffix arrays)
Heuristic filters — remove boilerplate, spam, low-information pages
Language identification — keep target-language content
Safety filtering — remove CSAM, extreme content
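
A minimal sketch of the near-duplicate step above, assuming character 5-gram shingles and 64 salted BLAKE2b hashes as the MinHash family. The function names and parameters are illustrative choices, not taken from any particular pipeline. Production systems usually add locality-sensitive hashing on top so that not every pair of documents has to be compared.

```python
import hashlib


def shingles(text, k=5):
    """Set of k-character shingles of lower-cased, whitespace-normalised text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value under each of num_hashes salted hash functions."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("The quick brown fox jumps over the lazy dog."))
    b = minhash_signature(shingles("The quick brown fox jumps over the lazy dog!"))
    print(estimated_jaccard(a, b))  # close to 1 for near-duplicates
```

Documents whose estimated similarity exceeds a chosen threshold (an assumption to tune per corpus) would be treated as duplicates and all but one copy dropped.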

C.Tokenization

Text → integer token IDs for the model to consume.

BPE (Byte-Pair Encoding) — used by GPT series; builds vocab by merging frequent byte pairs
SentencePiece — language-agnostic subword tokenizer library; LLaMA 1/2 use its BPE mode, T5 its Unigram mode
Vocabulary size typically 32K–128K (some recent models use ~256K); ~1.3 tokens per word in English
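
To make the BPE merge loop concrete, here is a toy trainer. It starts from characters rather than bytes to stay readable (byte-level BPE as in GPT-2 starts from the 256 byte values), and the tie-breaking rule and function names are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties broken by the pair itself so training is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Apply the learned merges in training order to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Each learned merge adds one entry to the vocabulary, so the number of merges largely determines the vocabulary size. A final lookup table (not shown) maps each symbol to its integer ID.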

II.Pre-Training (most expensive stage)

A.Objective

Causal Language Modelling — predict the next token given all prior tokens. Self-supervised; no labels needed.
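
The next-token objective can be sketched as a mean cross-entropy in which the logits at position t are scored against the token at position t + 1. This is a plain-Python illustration with hypothetical function names; real training code computes the same quantity on batched tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits, token_ids):
    """Mean cross-entropy where logits[t] predicts token_ids[t + 1].

    logits: one list of vocabulary scores per input position.
    token_ids: the input sequence itself; the targets are the same sequence
    shifted left by one, which is why no external labels are needed.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)
```

With uniform logits over a vocabulary of size V the loss equals ln V, which is a useful sanity check for the starting loss of an untrained model.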

B.Architecture Choices

Decoder-only Transformer (GPT, LLaMA, Claude, Gemini) — standard for generative LLMs
Encoder-decoder (T5, BART) — used for seq2seq tasks
Positional encoding — RoPE or ALiBi for long-context extrapolation
Grouped Query Attention (GQA) — reduces KV-cache memory at inference
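
The KV-cache saving from GQA is simple arithmetic: the cache scales with the number of key/value heads, not the number of query heads. The configuration below (32 layers, 32 query heads, head dimension 128, 4096-token context, 2-byte values) is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
    print(mha / 2**30, gqa / 2**30)  # 2.0 GiB vs 0.5 GiB per sequence
```

Sharing each key/value head across four query heads cuts the cache by a factor of four in this example, which directly raises the batch size or context length that fits in memory.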

C.Optimisation

Optimizer: AdamW with cosine LR decay + warmup
Mixed precision (bf16) + gradient checkpointing
Distributed training: tensor parallelism, pipeline parallelism, ZeRO sharding
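
A sketch of the cosine schedule with linear warmup named above. It covers only the learning-rate schedule, not AdamW itself, and the step counts and rates in the usage lines are placeholder values, not recommendations.

```python
import math


def lr_at(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr if training runs past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 99, 550, 1000):
        print(step, lr_at(step, max_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5))
```

Warmup avoids large, noisy updates while the optimizer's moment estimates are still poor; the cosine tail lowers the rate smoothly so late training makes finer adjustments.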

D.Compute Scale

Chinchilla-optimal: training tokens ≈ 20 × N parameters (e.g. 70B model → ~1.4T tokens)
Frontier runs: 10²³–10²⁵ FLOP; thousands of H100/A100 GPUs for months
Cost: $5M–$100M+ for a single frontier pre-training run
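
The scale figures above can be tied together with back-of-envelope arithmetic. The sketch uses the 20-tokens-per-parameter rule from the outline plus the common approximation that training costs about 6 FLOP per parameter per token, which is an added assumption not stated above. The hardware peak and utilisation in the usage lines are round placeholder numbers, to be replaced with real figures.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Roughly compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming ~6 FLOP per parameter per token."""
    return 6 * n_params * n_tokens


def gpu_days(flops, peak_flops_per_gpu, utilisation):
    """GPU-days needed at a given per-GPU peak throughput and achieved utilisation."""
    seconds = flops / (peak_flops_per_gpu * utilisation)
    return seconds / 86_400


if __name__ == "__main__":
    n = 70_000_000_000
    d = chinchilla_tokens(n)       # 1.4e12 tokens
    c = training_flops(n, d)       # 5.88e23 FLOP
    # Placeholder hardware numbers: 1e15 FLOP/s peak, 40% utilisation.
    print(d, c, gpu_days(c, peak_flops_per_gpu=1e15, utilisation=0.4))
```

For the 70B example this gives about 5.9 × 10²³ FLOP, which sits inside the 10²³–10²⁵ range quoted for frontier runs.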

III.Supervised Fine-Tuning (SFT)

A.Purpose

Teaches the model to follow instructions and respond helpfully, rather than just continue text.

B.Data Format

Prompt-completion pairs written by human contractors or sourced from real user queries
Instruction-following datasets: FLAN, Alpaca, ShareGPT, Dolly
Scale: 10K–1M examples (much smaller than pre-training)
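
One common way to turn a prompt-completion pair into a training example is to concatenate the two and compute the loss only on the completion tokens. The sketch below assumes that convention (it is not the only option) and uses -100 as the masking sentinel, which matches the default `ignore_index` of PyTorch's cross-entropy loss. The token IDs are made up.

```python
IGNORE = -100  # assumed sentinel for "no loss at this position"


def build_sft_example(prompt_ids, completion_ids, eos_id):
    """Return (input_ids, labels) for one prompt-completion pair.

    Prompt positions are masked so the model is trained to produce the
    completion, not to reproduce the prompt.
    """
    input_ids = prompt_ids + completion_ids + [eos_id]
    labels = [IGNORE] * len(prompt_ids) + completion_ids + [eos_id]
    return input_ids, labels


if __name__ == "__main__":
    print(build_sft_example([11, 12, 13], [21, 22], eos_id=0))
```

Including the end-of-sequence token in the labels is what teaches the model to stop after answering.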

C.Parameter-Efficient Variants

1.LoRA — train low-rank adapter matrices; freeze base model weights

2.QLoRA — LoRA on 4-bit quantized base model; fits in consumer GPU VRAM
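
A small sketch of the LoRA idea: the frozen weight W (d_out × d_in) is augmented with a trainable low-rank product B·A scaled by alpha / r. Matrices are plain nested lists here to keep the example dependency-free, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where A is r x d_in and B is d_out x r."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_counts(d_out, d_in, r):
    """(parameters in the full matrix, trainable parameters in the adapter)."""
    return d_out * d_in, r * (d_out + d_in)


if __name__ == "__main__":
    print(lora_param_counts(4096, 4096, 8))  # (16777216, 65536)
```

B is normally initialised to zero, so the adapted model starts out identical to the base model. For a 4096 × 4096 layer at rank 8 the adapter trains 256 times fewer parameters than the full matrix.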

IV.Alignment (safety-critical)

A.RLHF — Reinforcement Learning from Human Feedback

Collect human preference data: raters rank multiple model outputs
Train a reward model (RM) on those rankings
Fine-tune the LLM policy to maximise RM score using PPO
KL-divergence penalty keeps policy close to SFT model (limits reward hacking)
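
Two pieces of the RLHF recipe above can be written down compactly: the pairwise loss commonly used to train the reward model (a Bradley-Terry style objective, assumed here) and the KL-penalised reward the policy optimises. The PPO update itself is omitted, and beta and the function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen, r_rejected):
    """Pairwise loss: small when the preferred output scores higher."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalised_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Reward model score minus beta times a per-sample KL estimate.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens
    under the current policy and under the frozen SFT reference model.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate
```

The penalty grows as the policy assigns its samples much higher probability than the SFT model would, which discourages drifting into outputs that score well only because they exploit flaws in the reward model.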

B.DPO — Direct Preference Optimisation

1.Bypasses explicit reward model; optimises policy directly from preference pairs

2.Simpler, more stable training; widely adopted post-2023
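
The DPO objective for a single preference pair can be sketched directly from sequence log-probabilities under the policy and under a frozen reference model. Argument names and the beta value are illustrative assumptions.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response:
    pi_* under the policy being trained, ref_* under the reference model.
    """
    chosen_gain = pi_chosen - ref_chosen        # how much the policy upweights the chosen response
    rejected_gain = pi_rejected - ref_rejected  # how much it upweights the rejected one
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))
```

When the policy equals the reference the loss is ln 2; it falls as the policy raises the chosen response relative to the rejected one. No separate reward model or sampling loop is needed, which is the simplification the outline refers to.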

C.Constitutional AI (CAI)

1.Model critiques its own outputs against a written "constitution" of principles

2.Reduces need for human labelling at scale; used by Anthropic

V.Evaluation

A.Automated Benchmarks

MMLU — 57-subject multiple choice; measures breadth of world knowledge
HumanEval / SWE-Bench — code generation and real bug fixing
MATH, GSM8K — mathematical reasoning
GPQA — expert-level PhD questions in science
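
Code benchmarks such as HumanEval are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The outline does not name the metric, so treat this as added context. The sketch below uses the standard unbiased estimator computed from n samples of which c pass.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct, for k <= n."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=3, k=5))
```

The per-problem values are then averaged over the benchmark. Drawing n > k samples and using this estimator gives lower variance than sampling exactly k solutions once.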

B.Human Evaluation

1.Chatbot Arena — Elo ratings from blind pairwise comparisons

2.Red-teaming — adversarial probing for safety failures
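
For the pairwise-comparison ratings mentioned under Chatbot Arena, the classic Elo update looks like the sketch below. The K-factor of 32 and the starting ratings are conventional placeholder values, and a live leaderboard may fit ratings with a different statistical procedure.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B implied by the current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


if __name__ == "__main__":
    print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

An upset win against a much higher-rated model moves both ratings more than an expected win does, so the ratings converge toward the observed preference rates.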

VI.Deployment

A.Serving Infrastructure

Inference servers: vLLM, TGI, TensorRT-LLM — PagedAttention for KV-cache efficiency
Quantization: 8-bit, 4-bit (GPTQ, AWQ) to reduce VRAM and cost
Speculative decoding — small draft model speeds up large model generation
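
To illustrate the quantization bullet, here is plain symmetric round-to-nearest quantization with a single absmax scale, together with the weight-memory arithmetic. This is deliberately the simplest scheme; GPTQ and AWQ use more elaborate, calibration-based procedures that are not shown here. The 7B parameter count is a hypothetical example.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]


def weight_bytes(n_params, bits):
    """Memory for the weights alone, ignoring scales and activations."""
    return n_params * bits // 8


if __name__ == "__main__":
    q, scale = quantize_absmax([0.4, -1.0, 0.25, 0.0])
    print(q, dequantize(q, scale))
    print(weight_bytes(7_000_000_000, 16), weight_bytes(7_000_000_000, 4))
```

Going from 16-bit to 4-bit weights cuts weight memory by a factor of four (14 GB to 3.5 GB in the 7B example), at the price of a rounding error of up to half a scale step per weight.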

B.Access Patterns

API (OpenAI, Anthropic, Google) — pay-per-token; no infrastructure management
Self-hosted open weights (Llama, Mistral) — control, privacy, cost at scale
Edge deployment — quantized small models on device (Phi-3, Gemma 2B)