Tokenizer splits input text into integer IDs from a fixed vocabulary. ~1.3 tokens per English word.
Transformer ×N is the core compute. Each of the N layers runs multi-head self-attention and a feed-forward network.
KV-Cache stores computed keys and values so each new token only needs one forward pass, not N².
Auto-regressive loop: each generated token is appended to the input and fed back through the model until <eos>.
Temperature and top-p control diversity. T=0 is greedy; T=0.7 + top-p=0.9 is typical for chat.
System prompt is prepended before the user turn and shapes model persona, constraints, and output format.