Sentence Method — LLM Key Facts

Origins & History

01.The Transformer architecture was introduced in "Attention Is All You Need" (Vaswani et al., Google, 2017), replacing recurrent networks as the dominant sequence model.
02.GPT-1 (2018) showed that unsupervised pre-training on raw text followed by fine-tuning could outperform task-specific supervised models on many benchmarks.
03.BERT (2018) used bidirectional self-attention and masked language modelling, becoming the standard for discriminative NLP for several years.
04.GPT-3 (2020, 175B parameters) demonstrated in-context learning — the ability to solve tasks from a few examples without any gradient update.
05.InstructGPT (2022) showed that RLHF substantially improved helpfulness and reduced harmful outputs: human raters preferred outputs from the 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3.
06.ChatGPT's public launch (November 2022) reached an estimated 100 million monthly users in about two months, the fastest consumer product adoption on record at that time.
07.LLaMA (Meta, 2023) was one of the first openly available models competitive with much larger proprietary systems, enabling researchers and developers to run, fine-tune, and study a capable LLM locally.

Architecture & Mechanics

08.An LLM generates text by predicting a probability distribution over its vocabulary at each position, then sampling or taking the argmax.
09.Tokenization using Byte-Pair Encoding means the model operates on subword tokens rather than individual characters, which is one reason letter-counting tasks ("how many e's in 'excellence'?") are unexpectedly hard.
10.Self-attention computes a query, key, and value vector for each token, scores every query against every key, normalises the scores with softmax, and uses them to form a weighted sum of the values; this is what lets the model relate "it" to the noun it refers to.
11.Multi-head attention runs H independent attention computations in parallel, each learning to attend to different types of relationships (syntax, coreference, topic).
12.Large LLMs stack dozens of Transformer layers (for example, 32 in LLaMA-7B and 96 in GPT-3); each layer has a self-attention sublayer and a feed-forward sublayer, both wrapped with layer normalisation and residual connections.
13.The KV-cache stores computed key and value tensors for previously generated tokens so the model doesn't recompute them at each new step; without it, every new token would require reprocessing the whole prefix, raising the attention cost per step from O(n) to O(n²).
14.Temperature divides the logits by a constant T before softmax: T < 1 sharpens the distribution and makes outputs more deterministic; T > 1 flattens it, increasing diversity and creativity at the cost of coherence.
15.Top-p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalises, and samples from that set, avoiding the long tail of low-probability tokens.
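
Items 14 and 15 can be illustrated with a minimal sketch in plain Python. The function names (softmax_with_temperature, top_p_filter, sample_token) and the toy logits are assumptions made up for illustration; production inference libraries implement the same idea on tensors and may differ in tie-breaking and edge cases.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalise over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, temperature=1.0, p=0.9, rng=random):
    """Temperature scaling followed by nucleus sampling; returns a token id."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# Toy 4-token vocabulary (hypothetical logits).
toy_logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax_with_temperature(toy_logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(toy_logits, temperature=2.0)   # more uniform
nucleus = top_p_filter(softmax_with_temperature(toy_logits), p=0.9)
token = sample_token(toy_logits, temperature=1.0, p=0.9, rng=random.Random(0))
```

With these toy logits the least likely token falls outside the p = 0.9 nucleus, so it can never be sampled, while lowering the temperature shifts more probability onto the top token.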

Training & Scale

16.Frontier LLMs are trained on trillions of tokens drawn from web, books, and code; Meta reports more than 15 trillion tokens for LLaMA-3, while GPT-4's training set size is undisclosed and only estimated.
17.The Chinchilla scaling law (Hoffmann et al., 2022) showed that, for a fixed compute budget, training data should scale roughly in proportion to model size (about 20 tokens per parameter); by that measure, earlier large models were undertrained.
18.Training a frontier model requires 10²³–10²⁵ FLOPs, typically on thousands of GPUs (H100, A100) running continuously for weeks or months.
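19.Emergent abilities are capabilities that appear to arise abruptly at scale thresholds without specific training; reported examples include multi-step arithmetic, code generation, and translation into rare languages, though some researchers attribute the apparent jumps to the choice of evaluation metric.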
20.LoRA (Low-Rank Adaptation) trains a pair of small low-rank matrices for each adapted weight matrix while leaving the base model frozen, which can reduce trainable parameters by 99%+ and often lets fine-tuning fit on a single GPU.
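
The parameter saving in item 20 follows from simple arithmetic, sketched below in plain Python. The layer size (4096 x 4096), the rank (8), and all function names are hypothetical choices for illustration; the forward pass uses the usual LoRA form W x + (alpha / r) B A x with nested lists standing in for tensors.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two adapter matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def lora_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the frozen weight matrix they adapt."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)

def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical 4096 x 4096 projection adapted with rank 8.
fraction = lora_fraction(4096, 4096, 8)  # 65,536 / 16,777,216, under 0.4%

# Tiny worked example: 2 x 3 frozen weight, rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.0]]
x = [1.0, 2.0, 3.0]
untrained = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0, rank=1)
trained = lora_forward(W, A, [[2.0], [-1.0]], x, alpha=2.0, rank=1)
```

Because B starts at zero in the standard setup, the adapted layer initially reproduces the frozen layer exactly; only A and B receive gradient updates during fine-tuning.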

Limitations

21.Hallucination is not a bug — it's a consequence of the model predicting plausible tokens rather than verified ones; the model has no truth-checking mechanism built in.
22.LLMs have a knowledge cutoff date — facts from after the training data collection date are unknown unless injected via context, tools, or updated fine-tuning.
23.In the lost-in-the-middle problem, LLMs use information at the beginning and end of a long context more reliably than information in the middle, which is sometimes ignored.
24.Prompt sensitivity is a real limitation: rephrasing a question or changing its capitalisation can change the model's answer significantly.
25.LLMs are stochastic — even with identical prompts, temperature > 0 means outputs vary between runs, making reproducibility a challenge in production systems.
26.Training data bias transfers into model behaviour — if the training corpus over-represents certain demographics, viewpoints, or languages, the model will reflect those biases in outputs.

Techniques & Mitigations

27.RAG (Retrieval-Augmented Generation) grounds responses in retrieved source documents, reducing hallucination and enabling use of up-to-date or proprietary information.
28.Chain-of-thought prompting ("think step by step") can substantially improve performance on multi-step reasoning tasks by eliciting intermediate steps before the final answer.
29.Few-shot prompting places 2–8 worked examples in the prompt before the test question; the model infers the task pattern from the examples without any weight updates.
30.System prompts prime the model's behaviour for a session — defining persona, constraints, output format, and domain knowledge before any user turn.
31.Tool use / function calling lets the model invoke external APIs, databases, and calculators at runtime, converting it from a static pattern-matcher to an active agent.
32.Speculative decoding uses a small "draft" model to propose multiple tokens, which a large "verifier" model checks in parallel, significantly reducing wall-clock generation time.
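
The draft-and-verify idea in item 32 can be sketched with toy models. This is a simplified greedy variant under stated assumptions: target_next and draft_next are hypothetical deterministic stand-ins for the large and small models, and acceptance is by exact token match rather than the probabilistic acceptance rule used with sampling in real systems.

```python
def target_next(context):
    """Hypothetical 'large' model: deterministic next-token rule over tokens 0..6."""
    return (context[-1] * 3 + 1) % 7

def draft_next(context):
    """Hypothetical 'small' model: agrees with the target unless the last token is 5."""
    return 0 if context[-1] == 5 else (context[-1] * 3 + 1) % 7

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The verifier scores all k + 1 positions; in a real system this is
        #    a single batched forward pass, simulated here with a loop.
        passes += 1
        verified = [target_next(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and verifier agree.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        out.extend(proposal[:n_accept])
        # 4. Append the verifier's own token at the first disagreement
        #    (or a bonus token if every proposal was accepted).
        out.append(verified[n_accept])
    return out[:len(prompt) + n_new], passes

tokens, passes = speculative_decode([1], n_new=12, k=4)
baseline = greedy_decode([1], n_new=12)
```

The output matches plain greedy decoding from the target model, but the verifier runs far fewer passes than the number of generated tokens whenever the draft guesses well.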

Applications & Impact

33.GitHub Copilot, built on Codex (GPT-3 fine-tuned on code), was used by more than 1 million developers during its technical preview according to GitHub, indicating strong demand for AI-assisted coding.
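34.GPT-4-class models pass the US Bar Exam and the USMLE medical licensing exam, and score around the top 10% of test-takers on SAT reading and writing, according to results reported for GPT-4; some percentile estimates, notably for the bar exam, have since been disputed.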
35.Agentic systems wrap LLMs in a loop: the model plans, calls tools, observes results, and updates its plan until a goal is accomplished — extending from single responses to multi-step workflows.
36.LLM-generated content is now common in customer support, content marketing, software documentation, legal drafting, and scientific literature — raising questions about attribution, accuracy, and over-reliance.
37.The inference cost for a single GPT-4 query is roughly $0.003–0.03 depending on prompt and response length; this is cheap per query but expensive at billion-query scale, so cost reduction is a primary engineering goal.
38.Multimodal LLMs (GPT-4V, Gemini, Claude) accept images and, in some cases, audio and video alongside text, enabling document parsing, diagram analysis, and audio transcription within a single model.
39.SWE-Bench measures LLM ability to fix real GitHub issues; top systems solve roughly 40–55% of issues autonomously, up from under 2% for the best model when the benchmark was introduced in 2023.
40.A widely held view in the field is that LLMs are tools, not oracles; they work best in human-in-the-loop systems where outputs are reviewed, verified, and grounded before action.
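
The plan, call tools, observe, and update loop from item 35 can be sketched as follows. Everything here is a hypothetical stand-in: scripted_model replaces a real LLM call with a fixed script, calculator is a toy tool, and the action dictionary format is an assumption for illustration rather than any vendor's function-calling schema.

```python
def calculator(expression):
    """Hypothetical tool: evaluates 'a op b' for +, -, and * only."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return {"+": a + b, "-": a - b, "*": a * b}[op]

TOOLS = {"calculator": calculator}

def scripted_model(goal, observations):
    """Stand-in for an LLM: picks the next action from the history so far."""
    if len(observations) == 0:
        return {"type": "tool", "name": "calculator", "input": "12 * 7"}
    if len(observations) == 1:
        return {"type": "tool", "name": "calculator",
                "input": f"{observations[0]} + 16"}
    return {"type": "final", "answer": observations[-1]}

def run_agent(goal, model, tools, max_steps=5):
    """Loop until the model returns a final answer or the step budget runs out."""
    observations = []
    for _ in range(max_steps):
        action = model(goal, observations)          # plan the next step
        if action["type"] == "final":
            return action["answer"], observations
        result = tools[action["name"]](action["input"])  # call the tool
        observations.append(result)                 # observe, then re-plan
    raise RuntimeError("step budget exhausted before the goal was reached")

answer, trace = run_agent("compute 12 * 7 + 16", scripted_model, TOOLS)
```

The max_steps budget is one simple guard against runaway loops; in line with item 40, a real deployment would typically add human review before any consequential action.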