Zettelkasten #07 LLM Notes

LLM Concept Network — Atomic Cards

One idea per card. Every card links to related concepts. Navigate the network.

2026-05-19
Z-001 fundamentals
Token
The atomic unit of text for an LLM. In English, one token is roughly 3–4 characters, or about ¾ of a word. Words are split into subword tokens by algorithms such as BPE or the unigram model, often via the SentencePiece library. The model never sees raw characters, only token IDs from a fixed vocabulary, typically 32K–128K entries.
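A toy sketch of how BPE builds subword tokens: repeatedly merge the most frequent adjacent symbol pair. The corpus and word frequencies are hypothetical, and production tokenizers add byte-level handling and special tokens that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word split into characters, mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)   # learned merge rules, in order
print(corpus)   # words re-expressed with merged subword symbols
```

Each learned merge becomes one vocabulary entry; running more merge rounds grows the vocabulary toward the sizes quoted above.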
Z-002 representation
Embedding
A dense vector (e.g., 4096 floats) that encodes the meaning of a token in a high-dimensional space. Tokens with related meanings have similar embeddings. Embeddings are learned during pre-training and are the bridge between discrete tokens and continuous neural computation.
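"Similar embeddings" is usually measured with cosine similarity. A minimal sketch, assuming made-up 4-dimensional vectors (real models learn thousands of dimensions during pre-training):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-picked for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```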
Z-003 mechanism
Self-Attention
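Allows each token to "look at" other tokens in the context (in decoder-only LLMs, a causal mask limits this to the current and earlier tokens). Computes queries, keys, and values, then weights each value by the dot-product similarity of its key with the query, divided by √d_k and normalized with softmax: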
softmax( QKᵀ / √d_k ) · V
Enables long-range dependency capture — crucial for coreference, agreement, and discourse.
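The formula above, written out for a single head in plain Python. This is a sketch under simplifying assumptions: Q, K, and V are given directly as small lists (a real model computes them with learned projections), and the optional causal mask is the variant used by decoder-only LLMs.

```python
import math

def softmax(xs):
    """Numerically stable softmax; masked (-inf) entries get weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=True):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Position i may only attend to positions 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights

# Hypothetical 3-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(Q, K, V)
print(weights)
```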
Z-004 architecture
Transformer Block
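The repeating unit in an LLM. Each block contains: (1) multi-head self-attention, (2) add + layer norm, (3) a feed-forward network (two linear layers with a GELU between them), (4) add + layer norm. This is the original post-norm ordering; many recent LLMs instead normalize before each sublayer (pre-norm). Blocks are typically stacked 32–96 times in large models. Residual connections let gradients flow through deep stacks.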
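A rough worked example of how block structure turns into parameter count. The assumptions are not from the card: a feed-forward hidden size of 4 × d_model, full-size Q/K/V/O projections, and biases, norms, and embedding tables ignored. The configuration values are hypothetical.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight count of one transformer block (assumptions above)."""
    attn = 4 * d_model * d_model               # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection and down-projection
    return attn + ffn

# Hypothetical configuration for illustration.
d_model, n_layers = 4096, 32
per_block = block_params(d_model)
total = n_layers * per_block
print(f"per block: {per_block:,}  stack of {n_layers}: {total:,}")
```

Under these assumptions each block holds about 12 · d_model² weights, so a 32-block stack at d_model = 4096 lands near 6.4 billion parameters before embeddings.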
Z-005 training
Pre-Training
Self-supervised learning on massive text corpora by predicting the next token. No labels required. The model sees trillions of tokens and encodes statistical regularities of language, world knowledge, and reasoning patterns into its weights. The most expensive phase by far.
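A small sketch of why no labels are needed: the targets are simply the input shifted by one position, and the loss is the average negative log-probability assigned to each true next token. The token IDs and the uniform "model" are hypothetical stand-ins.

```python
import math

def make_training_pairs(token_ids):
    """Each token is the input for predicting the token that follows it."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def cross_entropy(predicted, targets):
    """Mean negative log-likelihood of the target tokens."""
    return -sum(math.log(p[t]) for p, t in zip(predicted, targets)) / len(targets)

tokens = [3, 1, 2, 1]                      # hypothetical token IDs, vocabulary size 4
pairs = make_training_pairs(tokens)
targets = [nxt for _, nxt in pairs]

# An untrained model that guesses uniformly over the 4-token vocabulary.
uniform = [[0.25] * 4 for _ in pairs]
loss_uniform = cross_entropy(uniform, targets)

# A model that puts 0.9 on the correct next token at every position.
confident = [[0.9 if i == t else 0.1 / 3 for i in range(4)] for t in targets]
loss_confident = cross_entropy(confident, targets)
print(pairs, loss_uniform, loss_confident)
```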
Z-006 training
Fine-Tuning (SFT)
Continues training on a small dataset of (prompt, ideal response) pairs. Teaches the base model to follow instructions and behave as an assistant rather than just continuing text. LoRA and QLoRA make SFT feasible on single GPUs by training only small adapter matrices.
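To see why adapters are cheap, compare trainable weights for one square weight matrix. LoRA freezes the d_out × d_in matrix and trains two low-rank factors of shapes d_out × r and r × d_in. The dimensions and rank below are assumed values for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Weights in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)

d, r = 4096, 8                      # hypothetical layer width and adapter rank
full = d * d                        # weights updated by full fine-tuning
lora = lora_trainable_params(d, d, r)
fraction = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  fraction: {fraction:.4%}")
```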
Z-007 alignment
RLHF / DPO
RLHF trains a reward model on human preference rankings, then uses PPO to push the LLM policy toward outputs with higher reward. DPO (Direct Preference Optimisation), introduced in 2023, skips the explicit reward model and optimises the policy directly on (preferred, rejected) response pairs, which is simpler and usually more stable to train.
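A sketch of the per-pair DPO loss, assuming the sequence log-probabilities under the policy and under a frozen reference model have already been computed. The numeric inputs are hypothetical and beta = 0.1 is only an illustrative setting.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares how much the
    policy has moved toward the chosen vs. the rejected response, relative
    to the reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: margin is 0, loss is ln 2.
loss_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy now prefers the chosen response more than the reference does.
loss_better = dpo_loss(-3.0, -7.0, -5.0, -5.0)
print(loss_start, loss_better)
```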
Z-008 technique
RAG
Retrieval-Augmented Generation. At inference time: embed the query → retrieve top-k chunks from a vector store → inject retrieved text into the prompt as context → generate a grounded response. Keeps knowledge current without retraining. Reduces hallucination for factual queries.
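The embed → retrieve → inject steps can be sketched end to end. Everything here is a stand-in: bag-of-words counts replace a learned embedding model, a Python list replaces the vector store, and the prompt template is hypothetical. The final generation call is left out.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "The Eiffel Tower is in Paris",
    "Python is a programming language",
    "Paris is the capital of France",
]
query = "what is the capital of France"
prompt = build_prompt(query, chunks, k=1)
print(prompt)
```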
Z-009 limitation
Hallucination
The model generates confident but factually incorrect text. Root cause: the training objective is next-token likelihood, not truth. The model interpolates from its training distribution, so if a plausible-sounding fact fits the context, it may generate it. Key mitigations: RAG, tool use, and preference training (e.g., RLHF) that penalises overconfident answers.
Z-010 architecture
Context Window
Maximum number of tokens processed in one forward pass. Standard self-attention is O(n²) in compute, and O(n²) in memory when the full score matrix is materialized. Expanded via RoPE/ALiBi positional encodings and hardware upgrades. GPT-3: 2K → GPT-3.5: 4K → Claude 3: 200K → Gemini 1.5: 1M+. The "lost-in-the-middle" effect degrades recall of information placed in the centre of long contexts.
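Back-of-the-envelope arithmetic for the quadratic cost, assuming a naive implementation that stores the full n × n score matrix for one head in one layer at 2 bytes per value. Optimized kernels avoid materializing this matrix, so treat the numbers as an upper-bound illustration.

```python
def attention_matrix_bytes(n_tokens, bytes_per_value=2):
    """Size of one n x n attention score matrix (single head, single layer)."""
    return n_tokens * n_tokens * bytes_per_value

for n in (4_096, 200_000, 1_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.2f} GiB")
```

Doubling the context length quadruples this figure, which is why long windows depend on both algorithmic and hardware improvements.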
Z-011 technique
Prompt Engineering
Crafting inputs to steer model behaviour without touching weights. Key patterns: few-shot examples, chain-of-thought ("think step by step"), role assignment ("You are an expert..."), format constraints, negative examples. Small prompt changes can have an outsized effect on output quality.
Z-012 technique
Tool Use / Function Calling
The model is given a schema of callable functions and can emit structured JSON to invoke them. The host program executes the function, returns the result, and the model incorporates it into its response. Converts the LLM from a static reasoner to an active agent with access to real-world data and actions.
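A minimal sketch of the host side of this exchange. The JSON shape (`name`, `arguments`), the `get_weather` tool, and its canned result are all hypothetical; real provider APIs define their own schemas.

```python
import json

def get_weather(city):
    """Hypothetical tool: returns canned data instead of calling a real API."""
    return {"city": city, "temp_c": 21}

# Registry mapping tool names the model may emit to host functions.
TOOLS = {"get_weather": get_weather}

def handle_model_output(raw):
    """Parse a model-emitted JSON tool call, run it, and return the result as JSON."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps(result)

# What the model might emit, written by hand here.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
tool_result = handle_model_output(model_output)
print(tool_result)  # this string would be fed back to the model as context
```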
Z-013 architecture
Agentic Loop
A design pattern: (1) Plan — break goal into steps; (2) Act — call a tool; (3) Observe — read the output; (4) Reflect — update plan; repeat until done. Requires: long context for trajectory history, reliable tool-calling, and failure recovery. Foundation for autonomous software agents.
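The plan, act, observe, reflect cycle as a loop skeleton. The planner below is a scripted stand-in for the LLM, and the two arithmetic tools are hypothetical; the point is only the control flow, including a step limit and surfacing tool errors back to the planner.

```python
def run_agent(goal, plan_fn, tools, max_steps=5):
    """Run the loop until the planner finishes or the step budget runs out."""
    history = []                                  # trajectory kept in context
    for _ in range(max_steps):
        action = plan_fn(goal, history)           # plan / reflect on history
        if action["tool"] == "finish":
            return action["answer"], history
        try:
            observation = tools[action["tool"]](*action["args"])   # act
        except Exception as exc:
            observation = f"error: {exc}"         # let the planner recover
        history.append((action, observation))     # observe
    return None, history

def scripted_planner(goal, history):
    """Stand-in for the LLM: a fixed two-step plan, then finish."""
    if not history:
        return {"tool": "add", "args": (2, 3)}
    if len(history) == 1:
        return {"tool": "mul", "args": (history[-1][1], 4)}
    return {"tool": "finish", "answer": history[-1][1]}

tools = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
answer, trajectory = run_agent("compute (2 + 3) * 4", scripted_planner, tools)
print(answer, len(trajectory))
```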
Z-014 inference
Sampling & Temperature
After computing logits over the vocabulary, the logits are divided by temperature T before softmax: T<1 makes the distribution sharper (more deterministic); T>1 flattens it (more random). Top-p (nucleus) sampling then keeps only the smallest set of most-probable tokens whose cumulative probability reaches p, renormalizes, and draws from that set. Temperature 0 is treated as greedy argmax.
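A plain-Python sketch of temperature scaling followed by nucleus filtering. The logits are hypothetical, and temperature 0 is special-cased as argmax because dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize. Returns {token_id: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])   # greedy
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [1.0, 3.0, 2.0]            # hypothetical scores for a 3-token vocabulary
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 2.0))
print(sample(logits, temperature=0.8, top_p=0.9))
```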
Z-015 scaling
Scaling Laws
Test loss falls as a power law in compute, data, and parameters. Chinchilla (2022) showed that compute-optimal training uses roughly 20 training tokens per parameter, i.e., ~20N tokens for a model with N parameters. Some abilities appear to emerge non-linearly at certain compute thresholds: measured quality looks flat, then jumps suddenly, though how much of this depends on the choice of metric is debated.
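A worked example of the ~20 tokens-per-parameter rule. The training-compute estimate of about 6 · N · D FLOPs is a widely used approximation that is not stated in the card, so treat it as an assumption; the 70B model size is just an illustrative input.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb training set size for a compute-optimal model."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Assumed approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n_params = 70_000_000_000                      # illustrative 70B-parameter model
n_tokens = chinchilla_optimal_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(f"tokens: {n_tokens:.2e}  training FLOPs: {flops:.2e}")
```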