Big Ideas. Cheap Models.
Ok, fair enough.
Then why scroll through the entire website?
Welcome to Garage Inference — a hackathon where the constraint is the creativity. Forget frontier models and sky-high API bills. We're going small, going cheap, and going all-in on what the weakest, most affordable LLMs can actually accomplish when paired with clever engineering.
Build remarkable, useful, surprising things with models that cost fractions of a penny per call — or run on your laptop's CPU.
The constraint IS the creativity.
When you can't brute-force your way to quality with a 400B parameter model, you have to actually think. You have to architect. You have to be clever. That's where the real engineering happens.
Most hackathons celebrate the biggest models, the fattest GPU clusters, the most expensive API calls. We're going the other direction — and we think that's where the most interesting work lives.
Build a working project powered by the cheapest and smallest large language models available — whether you run them locally on your own hardware or call the most budget-friendly API tier out there.
Your goal: squeeze remarkable results out of limited intelligence and limited budget. Then show us the gap between how bad your model is and how good your project turned out.
best_project = weakest_possible_model × highest_possible_impact

A solid app on Gemma 3 1B will beat a decent app on GPT-5 Nano every time. A useful tool running Qwen 3 0.6B in-browser will beat the same tool calling Ministral 8B via API. Go as low as you can. The constraint is the point.
Using Phi-4 Mini to power a real-time code reviewer that actually works
Using GPT-5.2 to build a chatbot — won't cut it
We don't care how you get there. We care what you ship — and how little intelligence it took to ship it.
Not all cheap models are equally cheap. The tier system makes scoring fair: lower tier = more points. Weaker model, same quality output — that's harder to do, so it scores higher.
Grey area? Ask in Discord before the hackathon. We'll maintain a living list of approved models and tiers on the event page. When in doubt, drop a tier — judges reward it.
Maximum constraint, maximum points. Everything here barely follows instructions.
Qwen 3 0.6B/1.7B, Qwen 3.5 Small 0.8B, Gemma 3 1B, Gemma 3n E2B, SmolLM2 135M/360M/1.7B, Phi-4 Mini, Ministral 3B, DeepSeek-R1-Distill 1.5B, browser & edge models, any other model under 4B
Cheap and limited, but they follow instructions most of the time.
Qwen 3 4B/8B, Qwen 3.5 Small 4B/9B, Gemma 3 4B (Q4), Gemma 3n E4B, Ministral 8B, DeepSeek-R1-Distill 7B/8B, Llama 4 Scout (Q2/Q3). API: GPT-5 Nano, Gemini 2.0 Flash-Lite, Gemini 2.0 Flash, Ministral 8B, DeepSeek V3.2
Decent models at low prices. You'll need strong engineering to stand out.
Qwen 3 14B/32B, Qwen 3 30B-A3B (MoE), Gemma 3 12B/27B (Q4), Ministral 14B, DeepSeek-R1-Distill 14B/32B. API: GPT-5 Mini, Gemini 2.5 Flash, $0.25–$1.50/M tokens
Real capability. Prove the engineering mattered, not just the model.
Llama 4 Scout (Q4+), DeepSeek-R1-Distill 70B, Qwen 3 235B-A22B (Q4). API: Claude Haiku 4.5, Gemini 2.5 Pro (batch), $1.50–$3/M tokens
Choose one or more LLMs from Tier 1, 2, or 3. You can combine multiple small models. You can use different models for different subtasks. You can use a tiny local model for 90% of tasks and a cheap API model for the remaining 10%. Declare everything — we want full transparency on what's doing the thinking.
Choose something that matters. Something a user would actually care about. Not a toy demo, not a benchmark wrapper — a product or tool that delivers real value. Ask yourself: "Would someone use this next week?" If yes, you're on the right track.
This is where the hackathon lives. Your model is dumb. It will hallucinate. It will forget context. It will misunderstand instructions. Your job as an engineer is to build the scaffolding that compensates.
You don't need to use all of these. Pick what fits your project. But the best submissions will show that you thought deeply about the model's limitations and engineered solutions for them.
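One common piece of that scaffolding is a validate-and-retry loop: ask for structured JSON, check it, and re-ask with a tighter instruction when the model drifts. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever completion call you actually use (llama.cpp, Ollama, a cheap API):

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stub for your tiny model's completion call.
    # In a real project this would hit llama.cpp, Ollama, or a budget API.
    return '{"sentiment": "positive", "confidence": 0.9}'

def ask_with_retries(prompt: str, required_keys: set, max_tries: int = 3):
    """Re-ask until the model returns valid JSON with the expected keys."""
    for _ in range(max_tries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Tighten the instruction and try again.
            prompt += "\nReturn ONLY valid JSON. No prose."
            continue
        if required_keys <= data.keys():
            return data
        prompt += f"\nThe JSON must contain keys: {sorted(required_keys)}."
    return None  # caller falls back to a rule-based default
```

The point is that failure handling is part of the design, not an afterthought: the caller always has a deterministic fallback when the model refuses to cooperate.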
During the hackathon window. Code must be original. You can use pre-existing frameworks, libraries, and tools. You can use AI coding assistants to write your code — we judge the final product, not your process.
A working demo, source code, a short write-up explaining which model(s) you used and why, what problem you solved, what engineering techniques you used, your total cost (if using APIs), and your honest assessment of where the model still fails — we respect transparency.
These are starting points. The best projects will be ones we didn't think of.
Let's be completely transparent about what wins this hackathon. Every project gets scored on five criteria. But those scores get viewed through the lens of what model you used.
The tier system isn't a formal point multiplier — it's a philosophical lens. Judges are instructed to always ask themselves: "How impressive is this result given the model that produced it?"
A judge sees two projects with the same score on "Practical Usefulness." Project A uses GPT-5 Nano (Tier 2). Project B uses Qwen 3 0.6B running in-browser (Tier 1). Project B wins. Every time. No debate.
The core metric of Garage Inference. How wide is the gap between your model's expected capability and your project's actual output? Did you make a model that can barely count to ten produce something that feels intelligent? Did you build a production-quality tool on a model the industry uses as a throwaway baseline?
Would a real person use this? Not "would a person be impressed at a demo day" — would they actually install this, bookmark this, or pay for this? Does it solve a genuine problem? Could someone with no ML background benefit from it?
Is the project stable? Does it respond in reasonable time? Does the architecture make smart trade-offs? Did you handle edge cases? Did you build proper error handling for when the model inevitably produces garbage? Is the code clean enough that someone else could extend it?
Did you find a novel angle? An unexpected use case for small models? A technique the judges haven't seen before? Did you combine models, tools, or approaches in a surprising way?
Can someone else run this? Is the setup documented? If it's a local project — can a developer with a regular laptop get it running in under 15 minutes? If it's an API project — is the cost clear and manageable? Could a student in a developing country use this?
Match your ambition to a strategy. When in doubt, start smaller than you think you need.
"I want maximum points and I'm a strong engineer"
Go Tier 1. Pick Qwen 3 0.6B or Gemma 3 1B. You'll fight the model on everything. Build a solid engineering layer around it — structured outputs, tight prompts, fallback logic. Tier 1 wins get the most attention.
"I want to win and I want a realistic shot"
Go Tier 1 or Tier 2. Pick Phi-4 Mini or GPT-5 Nano. Weak enough to be a real constraint, reliable enough to build something polished in 72 hours.
"I want to build something useful and prove a point about cost"
Go Tier 3 or Tier 4. Use Gemini 2.5 Flash or Claude Haiku 4.5. Focus on a real product with real users. Show that a $20/month API bill delivers production-grade AI features. Include a cost breakdown.
"I want to push the hardware constraint"
Go Tier 1, extreme end. Run inference in the browser. Deploy on a Raspberry Pi. Fine-tune a 1B model to beat a 7B on a specific task. The hardware constraint is the project.
"I have a team and we want to go deep"
Consider a multi-model architecture. Use a Tier 1 model for 80% of tasks, a Tier 2 model for the hard 20%. Or build an ensemble of three Tier 1 models that specialize in different subtasks. Show that cheap models collaborating can rival a single expensive model.
Transparency is core to Garage Inference. Every submission must include:
A working demo that runs: a live URL, video walkthrough, or reproducible local setup (deployed web app, recorded walkthrough, or installable project). No slide decks, no "imagine if this worked."
Public repo with a clean README. Include setup instructions that actually work. Someone should be able to clone, install, and run your project without asking you questions.
"Qwen 3 4B Q4_K_M" — not "a small Llama model." Specify where inference runs: local (CPU/GPU, RAM, device) or cloud API (provider and tier). Include quantization details if applicable — format, precision, tooling.
Explain which model(s) you used and why, what problem you solved, what engineering techniques you used to compensate for model weakness. Be honest about the division of labor — if your RAG pipeline does 90% of the work and the LLM just formats the answer, say so. That's not a weakness — that's smart engineering.
Total API cost if using cloud models — actual dollars spent during development and for the demo. Screenshots of billing dashboards are appreciated. Latency, accuracy, tokens per second — whatever matters for your use case.
Screen recording showing the project in action. Plus: where does the model still break? What inputs produce garbage? Judges respect honesty — and it helps them understand the real "wow gap."
We're finalizing the judging panel. Expect people who've shipped real products with small models, not just benchmarked them. Follow us on Discord for announcements.
Judge details coming soon.
We're still looking for judges. If you have experience shipping with small/local models and want to help evaluate projects, reach out at hello@raptors.dev.
The name says it. The best companies started in garages. The best music came from garage bands. The best hacking happens when you strip away resources and rely on raw ingenuity.
Frontier models are impressive. But they're also expensive, rate-limited, dependent on someone else's infrastructure, and inaccessible to most builders on the planet. The cheapest models — the ones nobody writes blog posts about, the ones that sit at the bottom of every leaderboard, the ones companies release as an afterthought next to their flagship — that's where the real democratization of AI lives.
There are billions of people and millions of developers who will never have a $200/month API budget. There are use cases that will never justify the cost of a frontier model. There are devices that will never connect to a cloud endpoint. The future of AI isn't just the biggest model — it's the smallest model that still gets the job done.
This hackathon is about proving that future is already here. No cloud budget required. No GPU cluster. Just the weakest model you can find and the smartest engineering you can muster.
Build something that makes people forget the model is cheap.
The interesting work isn't happening on billion-dollar clusters. It's on laptops, single GPUs, and free-tier endpoints. Builders who treat constraints as architecture decisions, not limitations. Quantized weights, prompt chains, systems running on hardware nobody expected them to run on — that's where the signal is. Capability per dollar is the metric that matters.