
Build something cool with the cheapest AI model you can find. Submit a working demo, source code, and a short writeup. Win money and bragging rights.


Garage Inference

Big Ideas. Cheap Models.

The constraint is the creativity.

Welcome to Garage Inference — a hackathon where the constraint is the creativity. Forget frontier models and sky-high API bills. We're going small, going cheap, and going all-in on what the weakest, most affordable LLMs can actually accomplish when paired with clever engineering.

Build remarkable, useful, surprising things with models that cost fractions of a penny per call — or run on your laptop's CPU.

The constraint IS the creativity.

When you can't brute-force your way to quality with a 400B parameter model, you have to actually think. You have to architect. You have to be clever. That's where the real engineering happens.

Most hackathons celebrate the biggest models, the fattest GPU clusters, the most expensive API calls. We're going the other direction — and we think that's where the most interesting work lives.

Use weak models to build strong products.

Build a working project powered by the cheapest and smallest large language models available — whether you run them locally on your own hardware or call the most budget-friendly API tier out there.

Your goal: squeeze remarkable results out of limited intelligence and limited budget. Then show us the gap between how bad your model is and how good your project turned out.

best_project = weakest_possible_model × highest_possible_impact

A solid app on Gemma 3 1B will beat a decent app on GPT-5 Nano every time. A useful tool running Qwen 3 0.6B in-browser will beat the same tool calling Ministral 8B via API. Go as low as you can. The constraint is the point.

Good

Using Phi-4 Mini to power a real-time code reviewer that actually works

Bad

Using GPT-5.2 to build a chatbot — won't cut it

We don't care how you get there. We care what you ship — and how little intelligence it took to ship it.

Pick your weight class.

Not all cheap models are equally cheap. The tier system keeps judging fair: the lower your tier, the more credit you earn. Weaker model, same quality output — that's harder to do, so it scores higher.

Grey area? Ask in Discord before the hackathon. We'll maintain a living list of approved models and tiers on the event page. When in doubt, drop a tier — judges reward it.

Tier 1 — Absolute Garage MAX

Maximum constraint, maximum points. Everything here barely follows instructions.

Qwen 3 0.6B/1.7B, Qwen 3.5 Small 0.8B, Gemma 3 1B, Gemma 3n E2B, SmolLM2 135M/360M/1.7B, Phi-4 Mini, Ministral 3B, DeepSeek-R1-Distill 1.5B, browser & edge models, any model ≤4B

Tier 2 — Proper Garage HIGH

Cheap and limited, but they follow instructions most of the time.

Qwen 3 4B/8B, Qwen 3.5 Small 4B/9B, Gemma 3 4B (Q4), Gemma 3n E4B, Ministral 8B, DeepSeek-R1-Distill 7B/8B, Llama 4 Scout (Q2/Q3). API: GPT-5 Nano, Gemini 2.0 Flash-Lite, Gemini 2.0 Flash, Ministral 8B, DeepSeek V3.2

Tier 3 — Home Workshop MOD

Decent models at low prices. You'll need strong engineering to stand out.

Qwen 3 14B/32B, Qwen 3 30B-A3B (MoE), Gemma 3 12B/27B (Q4), Ministral 14B, DeepSeek-R1-Distill 14B/32B. API: GPT-5 Mini, Gemini 2.5 Flash, $0.25–$1.50/M tokens

Tier 4 — Rented Studio LOW

Real capability. Prove the engineering mattered, not just the model.

Llama 4 Scout (Q4+), DeepSeek-R1-Distill 70B, Qwen 3 235B-A22B (Q4). API: Claude Haiku 4.5, Gemini 2.5 Pro (batch), $1.50–$3/M tokens

Banned

GPT-5/5.2/5.2 Pro, Claude Sonnet 4.6/Opus 4.6, Gemini 3 Pro, o3-pro, Llama 4 Maverick (full precision), any model over $3/M input tokens, any frontier model

Five steps. No shortcuts.

01

Pick Your Model (or models)

Choose one or more LLMs from Tiers 1–4. You can combine multiple small models. You can use different models for different subtasks. You can use a tiny local model for 90% of tasks and a cheap API model for the remaining 10%. Declare everything — we want full transparency on what's doing the thinking.

02

Pick a Real Problem

Choose something that matters. Something a user would actually care about. Not a toy demo, not a benchmark wrapper — a product or tool that delivers real value. Ask yourself: "Would someone use this next week?" If yes, you're on the right track.

03

Engineer Around the Model's Weaknesses

This is where the hackathon lives. Your model is dumb. It will hallucinate. It will forget context. It will misunderstand instructions. Your job as an engineer is to build the scaffolding that compensates.

Structured prompts — Constrain output to what the model can actually handle
RAG — The model doesn't need to "know" anything — just read and respond
Tool use — Orchestrate actions rather than generating raw knowledge
Multi-step pipelines — Break complex tasks into tiny subtasks a weak model can handle
Validation layers — Catch bad outputs and retry, rephrase, or reroute
Caching — Don't repeat expensive work — pre-compute what you can
Fine-tuning — A tiny model can become an expert at one narrow thing
Ensemble methods — Multiple tiny models voting or collaborating
Confidence scoring — Let the model say "I don't know" instead of hallucinating

You don't need to use all of these. Pick what fits your project. But the best submissions will show that you thought deeply about the model's limitations and engineered solutions for them.
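To make the validation-layer idea concrete, here is a minimal sketch. Everything in it is illustrative: `validated_json` and `flaky_model` are hypothetical names, and the stub simply simulates a small model that only emits valid JSON on its third try — swap in your real inference call.

```python
import json

def validated_json(call_model, prompt, required_keys, max_retries=3):
    """Ask a small model for JSON and retry until the reply parses and
    contains the required keys. Returns None if every attempt fails."""
    for attempt in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Nudge the model instead of repeating the same prompt verbatim.
            prompt = prompt + "\nRespond with valid JSON only."
            continue
        if all(key in data for key in required_keys):
            return data
    return None

# Stub standing in for a flaky 1B model: fails twice, then succeeds.
attempts = []
def flaky_model(prompt):
    attempts.append(prompt)
    if len(attempts) < 3:
        return "Sure! Here is the JSON you asked for..."
    return '{"sentiment": "positive", "confidence": 0.72}'

result = validated_json(flaky_model, "Classify: 'great product'",
                        ["sentiment", "confidence"])
print(result)  # {'sentiment': 'positive', 'confidence': 0.72}
```

The same wrapper doubles as a reroute point: when retries are exhausted, hand the prompt to a rule-based fallback or a higher-tier model instead of returning None.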

04

Build It

During the hackathon window. Code must be original. You can use pre-existing frameworks, libraries, and tools. You can use AI coding assistants to write your code — we judge the final product, not your process.

05

Submit

A working demo, source code, and a short write-up explaining which model(s) you used and why, what problem you solved, what engineering techniques you used, your total cost (if using APIs), and your honest assessment of where the model still fails — we respect transparency.

Inspiration, not prescription.

These are starting points. The best projects will be ones we didn't think of.

"No way that's a 3B model"

  • Customer support bot — Handles real conversations with a 1B model, RAG over your docs, and structured tool use.
  • Local code review assistant — Running Phi-4 Mini that catches actual bugs in pull requests.
  • Personal finance advisor — Running entirely in-browser via WebLLM — no server, no data leaks, no cost.
  • Meeting notes summarizer — Runs on a Raspberry Pi plugged into your conference room.

"Cheapest possible API, maximum value"

  • Content moderation pipeline — Using GPT-5 Nano that matches GPT-5-level accuracy through prompt chaining and confidence thresholds.
  • Multi-language FAQ system — Built on Gemini 2.0 Flash that handles 15 languages for under $0.10/day.
  • Real-time writing assistant — Using Claude Haiku 4.5 that gives feedback as you type — fractions of a cent per session.
  • Personalized news digest — Reads 500 articles and produces a briefing using DeepSeek V3.2 for under $0.01 total.

"This shouldn't work but it does"

  • Browser-only page summarizer — Gemma 3n E2B running entirely in-browser, powering an extension — zero server cost, zero latency, fully private.
  • Legal contract extractor — Qwen 3 0.6B fine-tuned on legal contract language that extracts key terms better than GPT-4o-mini.
  • NPC dialogue system — SmolLM2 1.7B powering a playable game — with personality, memory, and contextual responses.
  • 1B model ensemble — Three 1B models collaborating on writing tasks, each specialized (outlining, drafting, editing), rivaling a single 8B model.

"Extreme garage"

  • Qwen 3 0.6B doing anything useful — At all. Seriously. If you can make this model produce value, you deserve a trophy.
  • Voice-controlled RPi Zero — A model running on a $35 Raspberry Pi Zero that responds to voice commands.
  • SmolLM2 home automation — SmolLM2 135M as the brains of a home automation system.
  • Inference on a 2020 phone — A literal phone from 2020. Running inference. Doing something useful. Good luck.

How we judge.

Let's be completely transparent about what wins this hackathon. Every project gets scored on five criteria. But those scores get viewed through the lens of what model you used.

The tier system isn't a formal point multiplier — it's a philosophical lens. Judges are instructed to always ask themselves: "How impressive is this result given the model that produced it?"

A judge sees two projects with the same score on "Practical Usefulness." Project A uses GPT-5 Nano (Tier 2). Project B uses Qwen 3 0.6B running in-browser (Tier 1). Project B wins. Every time. No debate.
35%

Results vs. Model Capability

The Wow Gap

The core metric of Garage Inference. How wide is the gap between your model's expected capability and your project's actual output? Did you make a model that can barely count to ten produce something that feels intelligent? Did you build a production-quality tool on a model the industry uses as a throwaway baseline?

Scores high: A Tier 1 model producing results that feel like a Tier 3 model. A tool that makes users forget they're talking to a 1B model. A demo where the judge says "wait, this is running WHERE?"
Scores low: A capable Tier 3 model doing exactly what you'd expect it to do. A project where the model's quality is carrying the work rather than your engineering.
25%

Practical Usefulness

Would a real person use this? Not "would a person be impressed at a demo day" — would they actually install this, bookmark this, or pay for this? Does it solve a genuine problem? Could someone with no ML background benefit from it?

Scores high: A tool you'd actually recommend to a friend. Something that saves time, money, or effort. A product with a clear user and a clear use case.
Scores low: A cool demo with no real-world application. A benchmark wrapper. A chatbot that just... chats.
20%

Technical Execution

Is the project stable? Does it respond in reasonable time? Does the architecture make smart trade-offs? Did you handle edge cases? Did you build proper error handling for when the model inevitably produces garbage? Is the code clean enough that someone else could extend it?

Scores high: Smooth UX, fast inference, graceful failure modes, clean code, smart caching, thoughtful architecture decisions documented in the write-up.
Scores low: Crashes during demo. 30-second response times with no loading state. No error handling. A monolithic prompt with no engineering around it.
10%

Creativity & Innovation

Did you find a novel angle? An unexpected use case for small models? A technique the judges haven't seen before? Did you combine models, tools, or approaches in a surprising way?

Scores high: "I've never seen anyone try this with a model this small." A new prompting technique. A creative architecture. An unexpected domain.
Scores low: Standard RAG chatbot. Standard summarizer. Anything that looks like a tutorial project with a different coat of paint.
10%

Accessibility & Reproducibility

Can someone else run this? Is the setup documented? If it's a local project — can a developer with a regular laptop get it running in under 15 minutes? If it's an API project — is the cost clear and manageable? Could a student in a developing country use this?

Scores high: One-command setup. Clear README. Documented hardware requirements. Runs on a $500 laptop. Cost breakdown included.
Scores low: Requires obscure dependencies. No documentation. Only runs on your specific machine. Costs are hidden or hand-waved.

Choose your weapon.

Match your ambition to a strategy. When in doubt, start smaller than you think you need.

"I want maximum points and I'm a strong engineer"

Go Tier 1. Pick Qwen 3 0.6B or Gemma 3 1B. You'll fight the model on everything. Build a solid engineering layer around it — structured outputs, tight prompts, fallback logic. Tier 1 wins get the most attention.

"I want to win and I want a realistic shot"

Go Tier 1 or Tier 2. Pick Phi-4 Mini or GPT-5 Nano. Weak enough to be a real constraint, reliable enough to build something polished in 72 hours.

"I want to build something useful and prove a point about cost"

Go Tier 3 or 4. Use Gemini 2.5 Flash or Claude Haiku 4.5. Focus on a real product with real users. Show that a $20/month API bill delivers production-grade AI features. Include a cost breakdown.

"I want to push the hardware constraint"

Go Tier 1, extreme end. Run inference in the browser. Deploy on a Raspberry Pi. Fine-tune a 1B model to beat a 7B on a specific task. The hardware constraint is the project.

"I have a team and we want to go deep"

Consider a multi-model architecture. Use a Tier 1 model for 80% of tasks, a Tier 2 model for the hard 20%. Or build an ensemble of three Tier 1 models that specialize in different subtasks. Show that cheap models collaborating can rival a single expensive model.
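That 80/20 split can be sketched as a tiny router. This is a hedged illustration, not a prescribed policy: `tier1` and `tier2` are hypothetical stand-ins for a local 1B model and a cheap API model, and the word-count and "I don't know" checks are placeholder heuristics you'd replace with your own escalation signal.

```python
def make_router(cheap_model, capable_model, max_cheap_words=64):
    """Route short requests to the tiny model first and escalate long
    or low-confidence ones. Both arguments are callables on a prompt."""
    def route(prompt):
        # Crude heuristic: short prompts try the cheap model first.
        if len(prompt.split()) <= max_cheap_words:
            answer = cheap_model(prompt)
            # Escalate only when the small model signals uncertainty.
            if "i don't know" not in answer.lower():
                return ("tier1", answer)
        return ("tier2", capable_model(prompt))
    return route

# Hypothetical stand-ins for a Tier 1 local model and a Tier 2 API model.
tier1 = lambda p: "I don't know" if "contract" in p else "42"
tier2 = lambda p: "detailed answer"

route = make_router(tier1, tier2)
print(route("What is 6 * 7?"))           # ('tier1', '42')
print(route("Summarize this contract"))  # ('tier2', 'detailed answer')
```

Logging which tier answered each request gives you the 80/20 ratio for free — useful evidence for the cost section of your write-up.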

What you must declare.

Transparency is core to Garage Inference. Every submission must include:

01

Working demo or deployed application

Live URL, video walkthrough, or reproducible local setup. It must run. No slide decks, no "imagine if this worked."

02

Source code on GitHub

Public repo with a clean README. Include setup instructions that actually work. Someone should be able to clone, install, and run your project without asking you questions.

03

Exact model declaration

"Qwen 3 4B Q4_K_M" — not "a small Llama model." Specify where inference runs: local (CPU/GPU, RAM, device) or cloud API (provider and tier). Include quantization details if applicable — format, precision, tooling.

04

Technical writeup

Explain which model(s) you used and why, what problem you solved, what engineering techniques you used to compensate for model weakness. Be honest about the division of labor — if your RAG pipeline does 90% of the work and the LLM just formats the answer, say so. That's not a weakness — that's smart engineering.

05

Cost & performance metrics

Total API cost if using cloud models — actual dollars spent during development and for the demo. Screenshots of billing dashboards are appreciated. Latency, accuracy, tokens per second — whatever matters for your use case.
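If you're estimating spend rather than reading it off a dashboard, the arithmetic is just token counts times per-million prices. The numbers below are made up for illustration — check your provider's current pricing.

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollars spent, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Illustrative run: 2M input tokens, 300k output tokens
# at a hypothetical $0.10/$0.40 per million tokens.
total = api_cost(2_000_000, 300_000, 0.10, 0.40)
print(f"${total:.2f}")  # $0.32
```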

06

2-minute demo video & known failures

Screen recording showing the project in action. Plus: where does the model still break? What inputs produce garbage? Judges respect honesty — and it helps them understand the real "wow gap."

Who's judging.

We're finalizing the judging panel. Expect people who've shipped real products with small models, not just benchmarked them. Follow us on Discord for announcements.

TBA

TBA

Judge details coming soon.


We're still looking for judges. If you have experience shipping with small/local models and want to help evaluate projects, reach out at hello@raptors.dev.

Why we build in garages.

The name says it. The best companies started in garages. The best music came from garage bands. The best hacking happens when you strip away resources and rely on raw ingenuity.

Frontier models are impressive. But they're also expensive, rate-limited, dependent on someone else's infrastructure, and inaccessible to most builders on the planet. The cheapest models — the ones nobody writes blog posts about, the ones that sit at the bottom of every leaderboard, the ones companies release as an afterthought next to their flagship — that's where the real democratization of AI lives.

There are billions of people and millions of developers who will never have a $200/month API budget. There are use cases that will never justify the cost of a frontier model. There are devices that will never connect to a cloud endpoint. The future of AI isn't just the biggest model — it's the smallest model that still gets the job done.

This hackathon is about proving that future is already here. No cloud budget required. No GPU cluster. Just the weakest model you can find and the smartest engineering you can muster.

Build something that makes people forget the model is cheap.

The current is shifting.

The interesting work isn't happening on billion-dollar clusters. It's on laptops, single GPUs, and free-tier endpoints. Builders who treat constraints as architecture decisions, not limitations. Quantized weights, prompt chains, systems running on hardware nobody expected them to run on — that's where the signal is. Capability per dollar is the metric that matters.