Attribute GPT-4o (OpenAI) Claude 3.5/4 (Anthropic) Gemini 1.5 Pro (Google) Llama 3.3 70B (Meta) Mistral Large (Mistral AI)
Access & Weights
License Closed API Closed API Closed API Open weights Open weights
Commercial use Via API (paid) Via API (paid) Via API (paid) Permissive Permissive
Context & Multimodality
Context window 128K tokens 200K tokens 1M tokens 128K tokens 128K tokens
Vision / images Yes Yes Yes No No
Audio / speech Yes (native) No Yes (native) No No
Video input No No Yes No No
Benchmark Performance
MMLU (knowledge) ~88% ~88–90% ~85% ~82% ~81%
HumanEval (code) ~90% ~92% ~71% ~72% ~68%
MATH (reasoning) ~76% ~78% ~67% ~58% ~56%
Strengths & Fit
Primary strength Multimodal, broad capability, tool use Long docs, reasoning, safety, coding Massive context, multimodal, search Self-hosting, privacy, cost at scale Efficiency, European compliance
Best use case General assistant, agents, vision Long-form analysis, code review, writing Video/audio analysis, enterprise search On-prem RAG, fine-tuning, cost control EU data residency, fast inference
Safety alignment RLHF Constitutional AI RLHF Community Basic
Cost & Inference
Input cost (approx.) $2.50 / 1M tok $3.00 / 1M tok $1.25 / 1M tok Self-hosted ~$0.10 $2.00 / 1M tok
Self-host VRAM req. Not available Not available Not available ~40 GB (bf16) ~48 GB (bf16)
Fine-tuning support Via API Limited Vertex AI Full (LoRA/QLoRA) Full (LoRA/QLoRA)
Notes: Benchmarks vary by test date and version; treat as rough order-of-magnitude comparisons. Cost figures reflect mid-2025 API pricing and change frequently. For on-device inference, quantized variants (Q4, Q8) halve VRAM requirements at some quality cost. Context windows shown are maximum advertised; effective performance degrades in the 50–80% range.