May 2026 LLM Dev Sprint
Mon 05 May
  • Goal: Build a production-ready RAG pipeline for our internal knowledge base by end of month
  • Kickoff meeting — chose Claude Sonnet as base model; decided against GPT-4o for cost at scale
  • Read the Anthropic API docs — tool use, context caching, prompt caching endpoints
  • Set up dev environment: anthropic Python SDK, chromadb for vector store, sentence-transformers
  • Prompt caching saves 90% cost for repeated system prompts — critical for our use case
  • Min 1024 tokens in system prompt to qualify for caching
  • Choose embedding model: text-embedding-3-small vs nomic-embed-text — benchmark both
Tue 06 May
  • Embedded 2,400 internal docs into ChromaDB — avg 500 tokens/chunk with 50-token overlap
  • Chunk size matters enormously — too large = diluted retrieval, too small = loses context. 400–600 tokens is the sweet spot for prose.
  • Tried cosine similarity retrieval top-k=5; recall was poor on multi-hop questions
  • Investigate hybrid search (BM25 + dense) to improve recall on keyword-heavy queries
  • Read "RAPTOR" paper — hierarchical summarisation of document trees improves long-doc retrieval
  • Add re-ranking step: cross-encoder/ms-marco-MiniLM to sort retrieved chunks by relevance
  • Hallucinations still occur even with retrieval — model sometimes "fills in" missing context
  • Mitigation: add explicit instruction "Answer only using provided context. Say 'I don't know' if not found."
Wed 07 May
  • Implemented hybrid retrieval — BM25 for keyword match + dense vector for semantic; improved top-5 recall by ~22%
  • Pair-programmed with Claude on the re-ranking code — caught a subtle bug in batching logic
  • Tool use pattern: give the model a search_docs(query) tool — it reformulates its own retrieval query, which is often better than passing the raw user query verbatim
  • Implement query rewriting: model rephrases user question to maximise retrieval recall before search
  • Latency budget: 200ms retrieval + 800ms LLM generation = ~1s p95. Acceptable for async chat.
  • Stream responses to frontend — reduces perceived latency even if total time is similar
Thu 08 May
  • Streaming SSE endpoint live — time-to-first-token ~220ms, feels snappy
  • Demo to stakeholders — major feedback: "It sometimes makes things up confidently"
  • Always include source citations in the response. Users trust grounded answers significantly more, and it makes hallucinations immediately visible.
  • Add citation injection: append source doc titles and URLs to each answer
  • Implement output validation: check if all claims in response are supported by retrieved chunks
  • Considered fine-tuning but decided against it — RAG improvements are faster to iterate and don't require retraining
  • Fine-tune only for style/format, not for factual knowledge
Fri 09 May
  • Citation format: [1] doc_title (section) appended inline — clean and clear
  • Basic evals suite: 50 question-answer pairs with known ground truth — baseline 62% exact match
  • Eval insight: hardest questions are multi-doc — require reasoning across 3+ sources
  • Next week: implement multi-hop retrieval for complex queries
  • Write prompt variants for system message — A/B test conciseness vs. detail level
  • Read Anthropic's "Building Effective Agents" post — key insight: start simple (single LLM call), add complexity (tools, loops) only when needed
  • Most LLM application bugs are prompt bugs, not code bugs. Invest time in prompt engineering before optimising infrastructure.
Mon 12 May — Weekly Review
  • Sprint retrospective: RAG pipeline MVP shipped. Evals at 78% (+16pp from week 1).
  • The three highest-leverage improvements so far: (1) hybrid retrieval, (2) query rewriting, (3) "answer only from context" instruction
  • Open question: how to handle queries that are intentionally out-of-scope? Need a classifier or "I don't know" path
  • This week: agentic loop — let the model decide when to search vs. answer directly
  • Explore multi-agent: planner LLM → executor LLM → reviewer LLM for complex multi-step tasks
  • Migrate eval harness to promptfoo for better parallelism and LLM-as-judge scoring
Future Log — Scheduled Tasks
  • 19 May: Multi-hop retrieval experiment — chained queries for complex questions
  • 22 May: Fine-grained context caching — measure cost reduction on high-traffic prompts
  • 26 May: Load test pipeline at 100 concurrent users — identify bottlenecks
  • 31 May: Full eval suite — target 85% on 200-question benchmark before v1 launch
  • 02 Jun: v1 soft launch to 50 internal users
  • Ongoing: monitor token usage, latency, and hallucination rate via structured logging
task to do completed event / read note / observation key insight migrated forward