- 01.The Transformer architecture was introduced in "Attention Is All You Need" (Vaswani et al., Google, 2017), replacing recurrent networks as the dominant sequence model.
- 02.GPT-1 (2018) showed that unsupervised pre-training on raw text followed by fine-tuning could outperform task-specific supervised models.
- 03.BERT (2018) used bidirectional self-attention and masked language modelling, becoming the standard for discriminative NLP for several years.
- 04.GPT-3 (2020, 175B parameters) demonstrated in-context learning — the ability to solve tasks from a few examples without any gradient update.
- 05.InstructGPT (2022) showed that RLHF dramatically improved GPT-3's helpfulness and reduced harmful outputs despite using a smaller model.
- 06.ChatGPT's public launch (November 2022) reached 100 million users in two months — the fastest consumer product adoption in history at that time.
- 07.LLaMA (Meta, 2023) was the first open-weight frontier-class model, enabling researchers and developers to run, fine-tune, and study a capable LLM locally.
- 08.An LLM generates text by predicting a probability distribution over its vocabulary at each position, then sampling or taking the argmax.
- 09.Tokenization using Byte-Pair Encoding means the model never sees individual characters — letter-counting tasks ("how many e's in 'excellence'?") are unexpectedly hard.
- 10.Self-attention computes queries, keys, and values for each token, then scores all pairs and applies a weighted sum — this is what lets the model relate "it" to the noun it refers to.
- 11.Multi-head attention runs H independent attention computations in parallel, each learning to attend to different types of relationships (syntax, coreference, topic).
- 12.A typical frontier LLM has 32–96 Transformer layers; each layer has a self-attention sublayer and a feed-forward sublayer — both wrapped with layer normalisation and residual connections.
- 13.The KV-cache stores computed key and value tensors for previously generated tokens so the model doesn't recompute them at each new step — without it, generation would be O(n²).
- 14.Temperature scales the logits before softmax: low temperature makes outputs more deterministic; high temperature increases diversity and creativity, at the cost of coherence.
- 15.Top-p (nucleus) sampling dynamically selects the minimum set of tokens whose cumulative probability exceeds p before sampling, avoiding long tails of low-probability tokens.
- 16.Frontier LLMs are trained on trillions of tokens — GPT-4 and LLaMA-3 training sets are estimated at 8–15 trillion tokens from web, books, and code.
- 17.The Chinchilla scaling law (Hoffmann et al., 2022) showed that optimal performance requires scaling training data proportionally with model size — previous models were undertrained.
- 18.Training a frontier model requires 10²³–10²⁵ FLOPs, typically on thousands of GPUs (H100, A100) running continuously for weeks or months.
- 19.Emergent abilities — capabilities that suddenly appear at scale thresholds without specific training — include multi-step arithmetic, code generation, and translation into rare languages.
- 20.LoRA (Low-Rank Adaptation) fine-tunes only two small adapter matrices per layer, leaving the base model frozen, reducing trainable parameters by 99%+ and fitting on a single GPU.
- 21.Hallucination is not a bug — it's a consequence of the model predicting plausible tokens rather than verified ones; the model has no truth-checking mechanism built in.
- 22.LLMs have a knowledge cutoff date — facts from after the training data collection date are unknown unless injected via context, tools, or updated fine-tuning.
- 23.The lost-in-the-middle problem shows that LLMs attend more strongly to the beginning and end of long contexts, with important information in the middle sometimes ignored.
- 24.Prompt sensitivity is a real limitation — rephrasing the same question differently or changing capitalisation can change the model's answer significantly.
- 25.LLMs are stochastic — even with identical prompts, temperature > 0 means outputs vary between runs, making reproducibility a challenge in production systems.
- 26.Training data bias transfers into model behaviour — if the training corpus over-represents certain demographics, viewpoints, or languages, the model will reflect those biases in outputs.
- 27.RAG (Retrieval-Augmented Generation) grounds responses in retrieved source documents, reducing hallucination and enabling use of up-to-date or proprietary information.
- 28.Chain-of-thought prompting ("think step by step") dramatically improves performance on multi-step reasoning tasks by eliciting intermediate steps before the final answer.
- 29.Few-shot prompting places 2–8 worked examples in the prompt before the test question; the model infers the task pattern from the examples without any weight updates.
- 30.System prompts prime the model's behaviour for a session — defining persona, constraints, output format, and domain knowledge before any user turn.
- 31.Tool use / function calling lets the model invoke external APIs, databases, and calculators at runtime, converting it from a static pattern-matcher to an active agent.
- 32.Speculative decoding uses a small "draft" model to propose multiple tokens, which a large "verifier" model checks in parallel, significantly reducing wall-clock generation time.
- 33.GitHub Copilot, built on Codex (GPT-3 fine-tuned on code), was adopted by 1 million developers within two months of launch, demonstrating clear ROI for AI-assisted coding.
- 34.LLMs pass the US Bar Exam, the USMLE medical licensing exam, and SAT reading at scores in the top 10–15% of human test-takers — as of GPT-4 and equivalents.
- 35.Agentic systems wrap LLMs in a loop: the model plans, calls tools, observes results, and updates its plan until a goal is accomplished — extending from single responses to multi-step workflows.
- 36.LLM-generated content is now common in customer support, content marketing, software documentation, legal drafting, and scientific literature — raising questions about attribution, accuracy, and over-reliance.
- 37.The inference cost for a single GPT-4 query is roughly $0.003–0.03 — cheap per-query but expensive at billion-query scale; cost reduction is a primary engineering goal.
- 38.Multimodal LLMs (GPT-4V, Gemini, Claude) accept images, audio, and video alongside text, enabling document parsing, diagram analysis, and audio transcription within a single model.
- 39.SWE-Bench measures LLM ability to fix real GitHub issues; top models solve 40–55% of issues autonomously — a task that was 0% two years prior.
- 40.The consensus in the field is that LLMs are tools, not oracles — they work best in human-in-the-loop systems where outputs are reviewed, verified, and grounded before action.