A.Data Collection
Internet-scale corpora assembled from diverse sources.
1.Web crawls — CommonCrawl, C4, FineWeb (multi-TB scale)
2.Curated books and papers — Books3, Project Gutenberg, arXiv
3.Code — GitHub, Stack Overflow (critical for coding ability)
4.Wikipedia, encyclopedias — high-density factual text
B.Data Cleaning & Filtering
Quality matters more than raw volume at scale.
1.Deduplication — exact and near-duplicate removal (MinHash, suffix arrays)
2.Heuristic filters — remove boilerplate, spam, low-information pages
3.Language identification — keep target-language content
4.Safety filtering — remove CSAM (child sexual abuse material) and other extreme content
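
The deduplication step above can be sketched in a few lines. This is a minimal, standard-library illustration and not a production pipeline: the function names (`shingles`, `minhash_signature`, `deduplicate`), the 3-word shingle size, the 128 hash functions, and the 0.7 similarity threshold are all assumptions chosen for the example. It compares every new document against all kept ones, which real systems avoid by using locality-sensitive hashing on top of MinHash.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """64-bit hash of an item; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """MinHash signature: the minimum hash value under each seeded hash."""
    return tuple(min(_seeded_hash(seed, s) for s in items)
                 for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.7):
    """Keep the first copy of each document, dropping exact and near duplicates."""
    kept, kept_sigs, seen_exact = [], [], set()
    for doc in docs:
        # Exact duplicates: hash of the whitespace-normalized, lowercased text.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue
        # Near duplicates: compare MinHash signatures against kept documents.
        sig = minhash_signature(shingles(doc))
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        seen_exact.add(digest)
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Calling `deduplicate(pages)` on a list of page texts returns the pages in their original order with later exact copies and near copies removed. The threshold trades recall for precision: a lower value removes more templated pages but also risks dropping distinct documents that share long passages.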
C.Tokenization
Text → integer token IDs for the model to consume.
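1.BPE (Byte-Pair Encoding) — used by the GPT series; builds the vocabulary by repeatedly merging the most frequent adjacent pair of symbols (bytes in GPT-2 and later)
2.SentencePiece (BPE or Unigram) — language-agnostic subword segmentation on raw text; T5 uses its Unigram model, LLaMA 1 and 2 its BPE model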
3.Vocabulary size is typically 32K–128K tokens; English text averages roughly 1.3 tokens per word, depending on the tokenizer
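
To make the BPE merge loop concrete, here is a toy sketch. It is an illustration under simplifying assumptions, not the tokenizer any particular model uses: it works on characters within whitespace-separated words instead of raw bytes, breaks ties between equally frequent pairs by picking the lexicographically larger pair, and the names `train_bpe`, `merge_pair`, `encode`, and `build_vocab` are invented for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-tokenized corpus."""
    # Each distinct word starts as a tuple of single characters, with its count.
    word_freqs = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges


def encode(word, merges):
    """Split a word into subword symbols by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(corpus, merges):
    """Map each base character and each merged symbol to an integer ID."""
    symbols = sorted({ch for word in corpus.split() for ch in word})
    for a, b in merges:
        if a + b not in symbols:
            symbols.append(a + b)
    return {symbol: idx for idx, symbol in enumerate(symbols)}


corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges = train_bpe(corpus, 4)
vocab = build_vocab(corpus, merges)
pieces = encode("lowest", merges)
print(pieces)                      # ['low', 'est']
print([vocab[p] for p in pieces])  # integer token IDs for the model
```

The word "lowest" never appears in the toy corpus, yet it encodes into two known subwords. This is the property that lets a fixed vocabulary cover unseen words. Production tokenizers add byte-level fallback, pre-tokenization rules, and special tokens on top of this loop.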