Flow Notes — LLM Inference Process

Tokenizer splits input text into integer IDs from a fixed vocabulary. ~1.3 tokens per English word.

Transformer ×N is the core compute. Each of the N layers runs multi-head self-attention and a feed-forward network.

KV-Cache stores computed keys and values so each new token only needs one forward pass, not N².

Auto-regressive loop: each generated token is appended to the input and fed back through the model until <eos>.

Temperature and top-p control diversity. T=0 is greedy; T=0.7 + top-p=0.9 is typical for chat.

System prompt is prepended before the user turn and shapes model persona, constraints, and output format.