Bullet Journal — Building with LLMs

May 2026 LLM Dev Sprint

Mon 05 May

✦ Goal: Build a production-ready RAG pipeline for our internal knowledge base by end of month
○ Kickoff meeting — chose Claude Sonnet as base model; decided against GPT-4o for cost at scale
• Read the Anthropic API docs — tool use, context caching, prompt caching endpoints
✕ Set up dev environment: anthropic Python SDK, chromadb for vector store, sentence-transformers
— Prompt caching saves 90% cost for repeated system prompts — critical for our use case
— Min 1024 tokens in system prompt to qualify for caching
• Choose embedding model: text-embedding-3-small vs nomic-embed-text — benchmark both

Tue 06 May

✕ Embedded 2,400 internal docs into ChromaDB — avg 500 tokens/chunk with 50-token overlap
✦ Chunk size matters enormously — too large = diluted retrieval, too small = loses context. 400–600 tokens is the sweet spot for prose.
— Tried cosine similarity retrieval top-k=5; recall was poor on multi-hop questions
• Investigate hybrid search (BM25 + dense) to improve recall on keyword-heavy queries
○ Read "RAPTOR" paper — hierarchical summarisation of document trees improves long-doc retrieval
• Add re-ranking step: cross-encoder/ms-marco-MiniLM to sort retrieved chunks by relevance
— Hallucinations still occur even with retrieval — model sometimes "fills in" missing context
— Mitigation: add explicit instruction "Answer only using provided context. Say 'I don't know' if not found."

Wed 07 May

✕ Implemented hybrid retrieval — BM25 for keyword match + dense vector for semantic; improved top-5 recall by ~22%
○ Pair-programmed with Claude on the re-ranking code — caught a subtle bug in batching logic
✦ Tool use pattern: give the model a search_docs(query) tool — it reformulates its own retrieval query, which is often better than passing the raw user query verbatim
• Implement query rewriting: model rephrases user question to maximise retrieval recall before search
— Latency budget: 200ms retrieval + 800ms LLM generation = ~1s p95. Acceptable for async chat.
› Stream responses to frontend — reduces perceived latency even if total time is similar

Thu 08 May

✕ Streaming SSE endpoint live — time-to-first-token ~220ms, feels snappy
○ Demo to stakeholders — major feedback: "It sometimes makes things up confidently"
✦ Always include source citations in the response. Users trust grounded answers significantly more, and it makes hallucinations immediately visible.
• Add citation injection: append source doc titles and URLs to each answer
• Implement output validation: check if all claims in response are supported by retrieved chunks
— Considered fine-tuning but decided against it — RAG improvements are faster to iterate and don't require retraining
— Fine-tune only for style/format, not for factual knowledge

Fri 09 May

✕ Citation format: [1] doc_title (section) appended inline — clean and clear
✕ Basic evals suite: 50 question-answer pairs with known ground truth — baseline 62% exact match
— Eval insight: hardest questions are multi-doc — require reasoning across 3+ sources
• Next week: implement multi-hop retrieval for complex queries
• Write prompt variants for system message — A/B test conciseness vs. detail level
○ Read Anthropic's "Building Effective Agents" post — key insight: start simple (single LLM call), add complexity (tools, loops) only when needed
✦ Most LLM application bugs are prompt bugs, not code bugs. Invest time in prompt engineering before optimising infrastructure.

Mon 12 May — Weekly Review

○ Sprint retrospective: RAG pipeline MVP shipped. Evals at 78% (+16pp from week 1).
✦ The three highest-leverage improvements so far: (1) hybrid retrieval, (2) query rewriting, (3) "answer only from context" instruction
— Open question: how to handle queries that are intentionally out-of-scope? Need a classifier or "I don't know" path
• This week: agentic loop — let the model decide when to search vs. answer directly
• Explore multi-agent: planner LLM → executor LLM → reviewer LLM for complex multi-step tasks
› Migrate eval harness to promptfoo for better parallelism and LLM-as-judge scoring

Future Log — Scheduled Tasks

• 19 May: Multi-hop retrieval experiment — chained queries for complex questions
• 22 May: Fine-grained context caching — measure cost reduction on high-traffic prompts
• 26 May: Load test pipeline at 100 concurrent users — identify bottlenecks
• 31 May: Full eval suite — target 85% on 200-question benchmark before v1 launch
○ 02 Jun: v1 soft launch to 50 internal users
— Ongoing: monitor token usage, latency, and hallucination rate via structured logging

• task to do ✕ completed ○ event / read — note / observation ✦ key insight › migrated forward