Zettelkasten — LLM Core Concepts

Z-001 fundamentals

Token

The atomic unit of text for an LLM. In English a token averages roughly 3–4 characters, or about ¾ of a word. Text is split into subword tokens by an algorithm such as BPE, often through a library such as SentencePiece. The model never sees raw characters, only token IDs from a fixed vocabulary, typically 32K–128K entries.
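A minimal sketch of how BPE-style merging builds subword tokens, assuming a hypothetical three-word corpus with made-up frequencies. Real tokenizers work on bytes or normalised text and learn tens of thousands of merges; this only illustrates the repeated "merge the most frequent adjacent pair" step.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (split into characters) -> frequency.
corpus = {tuple("lower"): 5, tuple("lowest"): 3, tuple("low"): 7}

merges = []
for _ in range(2):  # learn two merges
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)   # the learned merge rules, in order
print(corpus)   # words now expressed in subword symbols
```

After two merges the frequent prefix "low" has become a single symbol, while rarer endings stay split into characters.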

links → Z-002, Z-010

Z-002 representation

Embedding

A dense vector (e.g., 4096 floats) that encodes the meaning of a token in a high-dimensional space. Tokens with related meanings have similar embeddings. Embeddings are learned during pre-training and are the bridge between discrete tokens and continuous neural computation.
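To make "similar embeddings" concrete, here is a small sketch using cosine similarity. The 4-dimensional vectors are invented for illustration (they are not taken from any model); real embeddings have thousands of dimensions and are learned.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for identical direction, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, hand-written so related words point the same way.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine(embeddings["cat"], embeddings["car"])
print(f"cat~dog: {sim_cat_dog:.3f}  cat~car: {sim_cat_car:.3f}")
```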

links → Z-001, Z-003, Z-008

Z-003 mechanism

Self-Attention

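Lets each token "look at" the other tokens in the context (in decoder-only LLMs a causal mask restricts this to earlier tokens). Each token is projected to a query, a key, and a value; value contributions are weighted by the softmax of the query-key dot products divided by √d_k.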

softmax( QKᵀ / √d_k ) · V

Enables long-range dependency capture — crucial for coreference, agreement, and discourse.
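The formula can be sketched in plain Python for a single head. The Q, K, and V matrices below are hypothetical toy values, and no causal mask is applied, so this is the unmasked variant only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one row of Q at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs: 3 tokens, d_k = 2 (assumed values for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```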

links → Z-004, Z-010

Z-004 architecture

Transformer Block

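The repeating unit in an LLM. Each block contains: (1) multi-head self-attention, (2) add + layer norm, (3) feed-forward network (two linear layers with a nonlinearity such as GELU between them), (4) add + layer norm. This is the original post-norm ordering; many recent models normalise before each sublayer instead. Large models typically stack 32–96 blocks. Residual connections (the "add" steps) allow gradient flow through deep stacks.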
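A rough sketch of one block in the order listed above, under several simplifying assumptions: a single attention head with identity Q/K/V projections, layer norm without learned scale or bias, and random (untrained) feed-forward weights. All sizes and names are illustrative.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(X):
    # Single head, identity projections (assumption made for brevity).
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[c] for e, v in zip(exps, X)) for c in range(d)])
    return out

def feed_forward(x, W1, W2):
    # Two linear layers with GELU in between.
    return matvec(W2, [gelu(h) for h in matvec(W1, x)])

def transformer_block(X, W1, W2):
    attn = self_attention(X)                                           # (1) attention
    X = [layer_norm([a + b for a, b in zip(x, s)]) for x, s in zip(X, attn)]  # (2) add + norm
    ff = [feed_forward(x, W1, W2) for x in X]                          # (3) feed-forward
    return [layer_norm([a + b for a, b in zip(x, f)]) for x, f in zip(X, ff)]  # (4) add + norm

rng = random.Random(0)
d_model, d_ff, n_tokens = 4, 8, 3
W1 = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
W2 = [[rng.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d_model)]
X = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]

Y = transformer_block(X, W1, W2)  # same shape as X, so blocks can be stacked
```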

links → Z-003, Z-005

Z-005 training

Pre-Training

Self-supervised learning on massive text corpora by predicting the next token. No labels required. The model sees trillions of tokens and encodes statistical regularities of language, world knowledge, and reasoning patterns into its weights. The most expensive phase by far.
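The "no labels required" point can be illustrated with the loss itself: the target at each position is simply the next token in the text. A minimal sketch, assuming a hypothetical 3-token vocabulary and hand-written logits in place of a real model's output:

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    total = 0.0
    for logits, target in zip(logits_per_position, token_ids[1:]):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # log-sum-exp
        total += log_z - logits[target]                              # -log p(target)
    return total / (len(token_ids) - 1)

token_ids = [2, 0, 1]  # the text itself supplies the targets: 2 -> 0, 0 -> 1

# Hypothetical model outputs for positions 0 and 1 (vocabulary size 3).
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]      # knows nothing
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]    # favours the correct next token

uniform_loss = next_token_loss(uniform, token_ids)
confident_loss = next_token_loss(confident, token_ids)
print(uniform_loss, confident_loss)
```

A model with no knowledge scores ln(vocabulary size); training pushes the loss below that by shifting probability toward the tokens that actually follow.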

links → Z-006, Z-011, Z-012

Z-006 training

Fine-Tuning (SFT)

Continues training on a small dataset of (prompt, ideal response) pairs. Teaches the base model to follow instructions and behave as an assistant rather than just continuing text. LoRA and QLoRA make SFT feasible on single GPUs by training only small adapter matrices.
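A sketch of why low-rank adapters are cheap, assuming a single 4096×4096 weight matrix and rank 8 (both numbers chosen for illustration). The forward pass adds a low-rank update B·A to the frozen weight W; B starts at zero so the adapted layer initially matches the base layer.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full matrix vs. the two adapter matrices."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, adapter

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

full, adapter = lora_param_counts(4096, 4096, 8)
fraction = adapter / full
print(f"{adapter:,} of {full:,} parameters ({fraction:.2%})")

# Tiny hypothetical layer: d_in = 3, d_out = 2, rank = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 0.25]]
B = [[0.0], [0.0]]            # zero-initialised, so the update starts at zero
x = [1.0, 2.0, 3.0]
y = lora_forward(W, A, B, x)
```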

links → Z-005, Z-007

Z-007 alignment

RLHF / DPO

RLHF trains a reward model on human preference rankings, then uses PPO to push the LLM policy toward outputs with higher reward. DPO (Direct Preference Optimisation) skips the explicit reward model and optimises the policy directly on preference pairs (chosen vs. rejected), measured against a frozen reference model. It is simpler to run and often more stable to train, and has been widely adopted since 2023.
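A sketch of the DPO loss for one preference pair, assuming the sequence log-probabilities have already been computed. The numbers below are invented; `beta` controls how strongly the policy may drift from the reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * margin), where margin compares how much more the
    policy prefers the chosen response than the reference model does."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Hypothetical log-probabilities of whole responses.
no_preference = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy == reference
prefers_chosen = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # policy moved the right way
prefers_rejected = dpo_loss(-12.0, -8.0, -10.0, -10.0) # policy moved the wrong way
print(no_preference, prefers_chosen, prefers_rejected)
```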

links → Z-006, Z-009

Z-008 technique

RAG

Retrieval-Augmented Generation. At inference time: embed the query → retrieve top-k chunks from a vector store → inject retrieved text into the prompt as context → generate a grounded response. Keeps knowledge current without retraining. Reduces hallucination for factual queries.
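The embed → retrieve → inject steps can be sketched end to end. As a stand-in for a real embedding model and vector store, this assumes a bag-of-words "embedding" and a linear scan over three hypothetical chunks; the final generation call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)   # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical knowledge base.
chunks = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
    "Paris is the capital of France.",
]

prompt = build_prompt("Who created Python?", chunks, k=1)
print(prompt)  # this string would be sent to the LLM
```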

links → Z-002, Z-009, Z-011

Z-009 limitation

Hallucination

The model generates confident but factually incorrect text. Root cause: the objective is next-token likelihood, not truth. The model interpolates from its training distribution: if a plausible-sounding fact fits the context, it may generate it whether or not it is true. Key mitigations: RAG, tool use, RLHF against overconfidence.

links → Z-007, Z-008, Z-013

Z-010 architecture

Context Window

Maximum number of tokens the model can process in one forward pass (prompt plus generated output). Naive self-attention is O(n²) in memory and compute. Expanded via positional schemes such as RoPE and ALiBi, more efficient attention implementations, and hardware upgrades. GPT-3: 2K → Claude 3: 200K → Gemini 1.5: 1M+. The "lost-in-the-middle" effect degrades recall of information placed in the centre of long contexts.
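A back-of-the-envelope sketch of the quadratic cost: the size of the full n×n attention score matrix for one layer if it is materialised naively. The 32 heads and 2-byte (fp16) values are assumptions for illustration; memory-efficient attention kernels avoid storing this matrix, but the compute still grows quadratically.

```python
def attention_matrix_bytes(n_tokens, n_heads=32, bytes_per_value=2):
    """Bytes needed to hold one layer's n x n score matrix for every head."""
    return n_tokens * n_tokens * n_heads * bytes_per_value

for n in (2_048, 4_096, 200_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>8} tokens -> {gib:,.2f} GiB per layer")
```

Doubling the context length quadruples the figure, which is why long contexts depend on more than simply adding hardware.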

links → Z-001, Z-003, Z-008

Z-011 technique

Prompt Engineering

Crafting inputs to steer model behaviour without touching weights. Key patterns: few-shot examples, chain-of-thought ("think step by step"), role assignment ("You are an expert..."), format constraints, negative examples. Small prompt changes can have an outsized effect on output quality.
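A small sketch that combines several of these patterns (role assignment, a format constraint, few-shot examples, and a chain-of-thought cue) into one prompt string. The function name, template wording, and examples are hypothetical; no particular model or API is assumed.

```python
def build_few_shot_prompt(role, examples, query, output_format):
    """Assemble a prompt from a role, a format constraint, and worked examples."""
    lines = [f"You are {role}.", f"Respond in this format: {output_format}", ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    # Chain-of-thought cue placed where the model's answer begins.
    lines += [f"Q: {query}", "A: Let's think step by step."]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    role="an expert arithmetic tutor",
    examples=[("What is 2 + 2?", "4"), ("What is 3 * 5?", "15")],
    query="What is 12 * 11?",
    output_format="a single integer on the last line",
)
print(prompt)
```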

links → Z-005, Z-008, Z-012

Z-012 technique

Tool Use / Function Calling

The model is given a schema of callable functions and can emit structured JSON to invoke them. The host program executes the function, returns the result, and the model incorporates it into its response. Converts the LLM from a static reasoner to an active agent with access to real-world data and actions.
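A sketch of the host side of this exchange. The schema layout, the `{"name": ..., "arguments": ...}` message shape, and the `get_weather` stub are all hypothetical and do not follow any specific vendor's API; the model's output is hard-coded as a string.

```python
import json

def get_weather(city):
    # Hypothetical stub; a real tool would call an external service.
    return {"city": city, "temp_c": 21}

# Registry of callable tools, plus the schema that would be shown to the model.
TOOLS = {"get_weather": get_weather}
TOOL_SCHEMA = [
    {
        "name": "get_weather",
        "description": "Current temperature for a city",
        "parameters": {"city": {"type": "string"}},
    }
]

def handle_model_output(raw):
    """Parse the model's JSON tool call, run it, and return a JSON result."""
    call = json.loads(raw)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps(fn(**call["arguments"]))

# Pretend the model emitted this structured call.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = handle_model_output(model_output)
print(result)  # this string is fed back to the model as the tool result
```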

links → Z-009, Z-011, Z-013

Z-013 architecture

Agentic Loop

A design pattern: (1) Plan — break goal into steps; (2) Act — call a tool; (3) Observe — read the output; (4) Reflect — update plan; repeat until done. Requires: long context for trajectory history, reliable tool-calling, and failure recovery. Foundation for autonomous software agents.
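The plan / act / observe / reflect cycle can be sketched as a bounded loop. Here `plan_fn` stands in for the LLM (a scripted function in this example), the tools are trivial arithmetic, and tool failures are recorded as observations rather than crashing the loop; every name is illustrative.

```python
def run_agent(goal, plan_fn, tools, max_steps=10):
    """Run plan -> act -> observe -> reflect until the plan is empty or steps run out."""
    history = []                          # trajectory kept in context
    plan = plan_fn(goal, history)         # (1) plan
    for _ in range(max_steps):
        if not plan:
            break                         # nothing left to do
        tool_name, arg = plan[0]
        try:
            observation = tools[tool_name](arg)   # (2) act
        except Exception as exc:                  # failure recovery: record, don't crash
            observation = f"error: {exc}"
        history.append((tool_name, arg, observation))  # (3) observe
        plan = plan_fn(goal, history)             # (4) reflect: re-plan with new info
    return history

def scripted_planner(goal, history):
    # Stand-in for the model: compute (2 + 3) * 4 in two tool calls.
    if len(history) == 0:
        return [("add", (2, 3))]
    if len(history) == 1:
        return [("mul", (history[0][2], 4))]
    return []

tools = {
    "add": lambda args: args[0] + args[1],
    "mul": lambda args: args[0] * args[1],
}

history = run_agent("compute (2 + 3) * 4", scripted_planner, tools)
print(history)
```

The `max_steps` cap is a deliberate design choice: without it, a planner that never returns an empty plan would loop forever.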

links → Z-010, Z-012

Z-014 inference

Sampling & Temperature

After the model computes logits over the vocabulary, they are divided by temperature T before the softmax: T<1 makes the distribution sharper (more deterministic); T>1 flattens it (more random). Top-p (nucleus) sampling then keeps only the smallest set of most-probable tokens whose cumulative probability reaches p before drawing. T = 0 is treated as greedy argmax.
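A sketch of temperature scaling followed by top-p filtering, using only the standard library. The logits are toy values, and `temperature == 0` is special-cased as argmax to avoid dividing by zero.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return a token index sampled with temperature and nucleus (top-p) filtering."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax

    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices treats weights as relative, so no explicit renormalisation.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [1.0, 3.0, 2.0]                      # toy logits for a 3-token vocabulary
greedy = sample_token(logits, temperature=0)
rng = random.Random(0)
nucleus = [sample_token(logits, temperature=1.0, top_p=0.5, rng=rng) for _ in range(20)]
```

With these logits the top token alone holds about two thirds of the probability mass, so `top_p=0.5` leaves it as the only candidate.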

links → Z-001, Z-004

Z-015 scaling

Scaling Laws

Test loss falls as a power law in compute, data, and parameters. Chinchilla (2022) showed that compute-optimal training scales parameters and data together: a model with N parameters should see roughly 20N training tokens. Emergent abilities appear to arise non-linearly at certain compute thresholds: measured task performance looks flat, then jumps.
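A quick arithmetic sketch of the ~20 tokens-per-parameter rule. The training-compute estimate uses the common approximation C ≈ 6·N·D FLOPs, which is an assumption added here and not part of the card; the 70B parameter count is just an example input.

```python
def chinchilla_budget(n_params, tokens_per_param=20):
    """Return (training tokens, approximate training FLOPs) for a given model size."""
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens      # rule-of-thumb estimate, not an exact count
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)
print(f"tokens: {tokens:.2e}, training FLOPs: {flops:.2e}")
```

Because tokens grow with parameters, doubling the model size roughly quadruples the estimated training compute under this rule.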

links → Z-005, Z-004

LLM Concept Network — Atomic Cards