User Prompt "Explain backprop" natural language input Tokenizer BPE / SentencePiece text → [token IDs] Embedding + pos. encoding IDs → dense vectors Transformer × 32–96 layers self-attn → FFN → LayerNorm hidden state h ∈ ℝ^d_model Output Head linear projection h → logits ∈ ℝ^|vocab| Softmax logits → probs sum to 1.0 Sampling temp · top-p → token selected Generated Token appended to sequence loop until <eos> System Prompt prepended before user turn KV Cache prev. K,V stored → O(1) step Temperature T 0 = greedy · 1 = standard auto-regressive loop FORWARD PASS DECODE
Tokenizer splits input text into integer IDs from a fixed vocabulary. ~1.3 tokens per English word.
Transformer ×N is the core compute. Each of the N layers runs multi-head self-attention and a feed-forward network.
KV-Cache stores computed keys and values so each new token only needs one forward pass, not N².
Auto-regressive loop: each generated token is appended to the input and fed back through the model until <eos>.
Temperature and top-p control diversity. T=0 is greedy; T=0.7 + top-p=0.9 is typical for chat.
System prompt is prepended before the user turn and shapes model persona, constraints, and output format.