Charting Method — LLM Model Comparison

Attribute	GPT-4o (OpenAI)	Claude 3.5/4 (Anthropic)	Gemini 1.5 Pro (Google)	Llama 3.3 70B (Meta)	Mistral Large (Mistral AI)
Access & Weights
License	Closed API	Closed API	Closed API	Open weights	Open weights
Commercial use	Via API (paid)	Via API (paid)	Via API (paid)	Permissive	Permissive
Context & Multimodality
Context window	128K tokens	200K tokens	1M tokens	128K tokens	128K tokens
Vision / images	Yes	Yes	Yes	No	No
Audio / speech	Yes (native)	No	Yes (native)	No	No
Video input	No	No	Yes	No	No
Benchmark Performance
MMLU (knowledge)	~88%	~88–90%	~85%	~82%	~81%
HumanEval (code)	~90%	~92%	~71%	~72%	~68%
MATH (reasoning)	~76%	~78%	~67%	~58%	~56%
Strengths & Fit
Primary strength	Multimodal, broad capability, tool use	Long docs, reasoning, safety, coding	Massive context, multimodal, search	Self-hosting, privacy, cost at scale	Efficiency, European compliance
Best use case	General assistant, agents, vision	Long-form analysis, code review, writing	Video/audio analysis, enterprise search	On-prem RAG, fine-tuning, cost control	EU data residency, fast inference
Safety alignment	RLHF	Constitutional AI	RLHF	Community	Basic
Cost & Inference
Input cost (approx.)	$2.50 / 1M tok	$3.00 / 1M tok	$1.25 / 1M tok	Self-hosted ~$0.10	$2.00 / 1M tok
Self-host VRAM req.	Not available	Not available	Not available	~40 GB (bf16)	~48 GB (bf16)
Fine-tuning support	Via API	Limited	Vertex AI	Full (LoRA/QLoRA)	Full (LoRA/QLoRA)

Notes: Benchmarks vary by test date and version; treat as rough order-of-magnitude comparisons. Cost figures reflect mid-2025 API pricing and change frequently. For on-device inference, quantized variants (Q4, Q8) halve VRAM requirements at some quality cost. Context windows shown are maximum advertised; effective performance degrades in the 50–80% range.