
Build something cool with the cheapest AI model you can find. Submit a working demo, source code, and a short writeup. Win money and bragging rights.


Garage Inference

Big Ideas. Cheap Models.

The constraint is the creativity.

Welcome to Garage Inference — a hackathon where the constraint is the creativity. Forget frontier models and sky-high API bills. We're going small, going cheap, and going all-in on what the weakest, most affordable LLMs can actually accomplish when paired with clever engineering.

Build remarkable, useful, surprising things with models that cost fractions of a penny per call — or run on your laptop's CPU.

The constraint IS the creativity.

When you can't brute-force your way to quality with a 400B parameter model, you have to actually think. You have to architect. You have to be clever. That's where the real engineering happens.

Most hackathons celebrate the biggest models, the fattest GPU clusters, the most expensive API calls. We're going the other direction — and we think that's where the most interesting work lives.

Use weak models to build strong products.

Build a working project powered by the cheapest and smallest large language models available — whether you run them locally on your own hardware or call the most budget-friendly API tier out there.

Your goal: squeeze remarkable results out of limited intelligence and limited budget. Then show us the gap between how bad your model is and how good your project turned out.

best_project = weakest_possible_model × highest_possible_impact

A solid app on Gemma 3 1B will beat a decent app on GPT-5 Nano every time. A useful tool running Qwen 3 0.6B in-browser will beat the same tool calling Ministral 8B via API. Go as low as you can. The constraint is the point.

Good

Using Phi-4 Mini to power a real-time code reviewer that actually works

Bad

Using GPT-5.2 to build a chatbot — won't cut it

We don't care how you get there. We care what you ship — and how little intelligence it took to ship it.

Pick your weight class.

Not all cheap models are equally cheap. The tier system keeps judging fair: the lower your tier, the more credit you earn. Weaker model, same quality output — that's harder to do, so it scores higher.

Grey area? Ask in Discord before the hackathon. We'll maintain a living list of approved models and tiers on the event page. When in doubt, drop a tier — judges reward it.

Tier 1 — Absolute Garage MAX

Maximum constraint, maximum points. Everything here barely follows instructions.

Qwen 3 0.6B/1.7B, Qwen 3.5 Small 0.8B, Gemma 3 1B, Gemma 3n E2B, SmolLM2 135M/360M/1.7B, Phi-4 Mini, Ministral 3B, DeepSeek-R1-Distill 1.5B, browser & edge models, any model ≤4B

Tier 2 — Proper Garage HIGH

Cheap and limited, but they follow instructions most of the time.

Qwen 3 4B/8B, Qwen 3.5 Small 4B/9B, Gemma 3 4B (Q4), Gemma 3n E4B, Ministral 8B, DeepSeek-R1-Distill 7B/8B, Llama 4 Scout (Q2/Q3). API: GPT-5 Nano, Gemini 2.0 Flash-Lite, Gemini 2.0 Flash, Ministral 8B, DeepSeek V3.2

Tier 3 — Home Workshop MOD

Decent models at low prices. You'll need strong engineering to stand out.

Qwen 3 14B/32B, Qwen 3 30B-A3B (MoE), Gemma 3 12B/27B (Q4), Ministral 14B, DeepSeek-R1-Distill 14B/32B. API: GPT-5 Mini, Gemini 2.5 Flash, $0.25–$1.50/M tokens

Tier 4 — Rented Studio LOW

Real capability. Prove the engineering mattered, not just the model.

Llama 4 Scout (Q4+), DeepSeek-R1-Distill 70B, Qwen 3 235B-A22B (Q4). API: Claude Haiku 4.5, Gemini 2.5 Pro (batch), $1.50–$3/M tokens

Banned

GPT-5/5.2/5.2 Pro, Claude Sonnet 4.6/Opus 4.6, Gemini 3 Pro, o3-pro, Llama 4 Maverick (full precision), any model over $3/M input tokens, any frontier model

Five steps. No shortcuts.

01

Pick Your Model (or models)

Choose one or more LLMs from Tiers 1–4. You can combine multiple small models. You can use different models for different subtasks. You can use a tiny local model for 90% of tasks and a cheap API model for the remaining 10%. Declare everything — we want full transparency on what's doing the thinking.

02

Pick a Real Problem

Choose something that matters. Something a user would actually care about. Not a toy demo, not a benchmark wrapper — a product or tool that delivers real value. Ask yourself: "Would someone use this next week?" If yes, you're on the right track.

03

Engineer Around the Model's Weaknesses

This is where the hackathon lives. Your model is dumb. It will hallucinate. It will forget context. It will misunderstand instructions. Your job as an engineer is to build the scaffolding that compensates.

Structured prompts — Constrain output to what the model can actually handle
RAG — The model doesn't need to "know" anything — just read and respond
Tool use — Orchestrate actions rather than generating raw knowledge
Multi-step pipelines — Break complex tasks into tiny subtasks a weak model can handle
Validation layers — Catch bad outputs and retry, rephrase, or reroute
Caching — Don't repeat expensive work — pre-compute what you can
Fine-tuning — A tiny model can become an expert at one narrow thing
Ensemble methods — Multiple tiny models voting or collaborating
Confidence scoring — Let the model say "I don't know" instead of hallucinating

You don't need to use all of these. Pick what fits your project. But the best submissions will show that you thought deeply about the model's limitations and engineered solutions for them.
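To make the validation-layer idea concrete, here is a minimal sketch. Everything in it is illustrative: `validated_json` and `flaky_model` are hypothetical names, and the stub simply simulates a small model that only emits valid JSON on its third try — swap in your real inference call.

```python
import json

def validated_json(call_model, prompt, required_keys, max_retries=3):
    """Ask a small model for JSON and retry until the reply parses and
    contains the required keys. Returns None if every attempt fails."""
    for attempt in range(max_retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Nudge the model instead of repeating the same prompt verbatim.
            prompt = prompt + "\nRespond with valid JSON only."
            continue
        if all(key in data for key in required_keys):
            return data
    return None

# Stub standing in for a flaky 1B model: fails twice, then succeeds.
attempts = []
def flaky_model(prompt):
    attempts.append(prompt)
    if len(attempts) < 3:
        return "Sure! Here is the JSON you asked for..."
    return '{"sentiment": "positive", "confidence": 0.72}'

result = validated_json(flaky_model, "Classify: 'great product'",
                        ["sentiment", "confidence"])
print(result)  # {'sentiment': 'positive', 'confidence': 0.72}
```

The same wrapper doubles as a reroute point: when retries are exhausted, hand the prompt to a rule-based fallback or a higher-tier model instead of returning None.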

04

Build It

During the hackathon window. Code must be original. You can use pre-existing frameworks, libraries, and tools. You can use AI coding assistants to write your code — we judge the final product, not your process.

05

Submit

A working demo, source code, and a short write-up explaining which model(s) you used and why, what problem you solved, what engineering techniques you used, your total cost (if using APIs), and your honest assessment of where the model still fails — we respect transparency.

Inspiration, not prescription.

These are starting points. The best projects will be ones we didn't think of.

"No way that's a 3B model"

  • Customer support bot — Handles real conversations with a 1B model, RAG over your docs, and structured tool use.
  • Local code review assistant — Running Phi-4 Mini that catches actual bugs in pull requests.
  • Personal finance advisor — Running entirely in-browser via WebLLM — no server, no data leaks, no cost.
  • Meeting notes summarizer — Runs on a Raspberry Pi plugged into your conference room.

"Cheapest possible API, maximum value"

  • Content moderation pipeline — Using GPT-5 Nano that matches GPT-5-level accuracy through prompt chaining and confidence thresholds.
  • Multi-language FAQ system — Built on Gemini 2.0 Flash that handles 15 languages for under $0.10/day.
  • Real-time writing assistant — Using Claude Haiku 4.5 that gives feedback as you type — fractions of a cent per session.
  • Personalized news digest — Reads 500 articles and produces a briefing using DeepSeek V3.2 for under $0.01 total.

"This shouldn't work but it does"

  • Browser-only page summarizer — Gemma 3n E2B running entirely in-browser, powering an extension — zero server cost, zero latency, fully private.
  • Legal contract extractor — Qwen 3 0.6B fine-tuned on legal contract language that extracts key terms better than GPT-4o-mini.
  • NPC dialogue system — SmolLM2 1.7B powering a playable game — with personality, memory, and contextual responses.
  • 1B model ensemble — Three 1B models collaborating on writing tasks, each specialized (outlining, drafting, editing), rivaling a single 8B model.

"Extreme garage"

  • Qwen 3 0.6B doing anything useful — At all. Seriously. If you can make this model produce value, you deserve a trophy.
  • Voice-controlled RPi Zero — A model running on a $35 Raspberry Pi Zero that responds to voice commands.
  • SmolLM2 home automation — SmolLM2 135M as the brains of a home automation system.
  • Inference on a 2020 phone — A literal phone from 2020. Running inference. Doing something useful. Good luck.

How we judge.

Let's be completely transparent about what wins this hackathon. Every project gets scored on five criteria. But those scores get viewed through the lens of what model you used.

The tier system isn't a formal point multiplier — it's a philosophical lens. Judges are instructed to always ask themselves: "How impressive is this result given the model that produced it?"

A judge sees two projects with the same score on "Practical Usefulness." Project A uses GPT-5 Nano (Tier 2). Project B uses Qwen 3 0.6B running in-browser (Tier 1). Project B wins. Every time. No debate.
35%

Results vs. Model Capability

The Wow Gap

The core metric of Garage Inference. How wide is the gap between your model's expected capability and your project's actual output? Did you make a model that can barely count to ten produce something that feels intelligent? Did you build a production-quality tool on a model the industry uses as a throwaway baseline?

Scores high: A Tier 1 model producing results that feel like a Tier 3 model. A tool that makes users forget they're talking to a 1B model. A demo where the judge says "wait, this is running WHERE?"
Scores low: A capable Tier 3 model doing exactly what you'd expect it to do. A project where the model's quality is carrying the work rather than your engineering.
25%

Practical Usefulness

Would a real person use this? Not "would a person be impressed at a demo day" — would they actually install this, bookmark this, or pay for this? Does it solve a genuine problem? Could someone with no ML background benefit from it?

Scores high: A tool you'd actually recommend to a friend. Something that saves time, money, or effort. A product with a clear user and a clear use case.
Scores low: A cool demo with no real-world application. A benchmark wrapper. A chatbot that just... chats.
20%

Technical Execution

Is the project stable? Does it respond in reasonable time? Does the architecture make smart trade-offs? Did you handle edge cases? Did you build proper error handling for when the model inevitably produces garbage? Is the code clean enough that someone else could extend it?

Scores high: Smooth UX, fast inference, graceful failure modes, clean code, smart caching, thoughtful architecture decisions documented in the write-up.
Scores low: Crashes during demo. 30-second response times with no loading state. No error handling. A monolithic prompt with no engineering around it.
10%

Creativity & Innovation

Did you find a novel angle? An unexpected use case for small models? A technique the judges haven't seen before? Did you combine models, tools, or approaches in a surprising way?

Scores high: "I've never seen anyone try this with a model this small." A new prompting technique. A creative architecture. An unexpected domain.
Scores low: Standard RAG chatbot. Standard summarizer. Anything that looks like a tutorial project with a different coat of paint.
10%

Accessibility & Reproducibility

Can someone else run this? Is the setup documented? If it's a local project — can a developer with a regular laptop get it running in under 15 minutes? If it's an API project — is the cost clear and manageable? Could a student in a developing country use this?

Scores high: One-command setup. Clear README. Documented hardware requirements. Runs on a $500 laptop. Cost breakdown included.
Scores low: Requires obscure dependencies. No documentation. Only runs on your specific machine. Costs are hidden or hand-waved.

Choose your weapon.

Match your ambition to a strategy. When in doubt, start smaller than you think you need.

"I want maximum points and I'm a strong engineer"

Go Tier 1. Pick Qwen 3 0.6B or Gemma 3 1B. You'll fight the model on everything. Build a solid engineering layer around it — structured outputs, tight prompts, fallback logic. Tier 1 wins get the most attention.

"I want to win and I want a realistic shot"

Go Tier 1 or Tier 2. Pick Phi-4 Mini or GPT-5 Nano. Weak enough to be a real constraint, reliable enough to build something polished in 72 hours.

"I want to build something useful and prove a point about cost"

Go Tier 3 or 4. Use Gemini 2.5 Flash or Claude Haiku 4.5. Focus on a real product with real users. Show that a $20/month API bill delivers production-grade AI features. Include a cost breakdown.

"I want to push the hardware constraint"

Go Tier 1, extreme end. Run inference in the browser. Deploy on a Raspberry Pi. Fine-tune a 1B model to beat a 7B on a specific task. The hardware constraint is the project.

"I have a team and we want to go deep"

Consider a multi-model architecture. Use a Tier 1 model for 80% of tasks, a Tier 2 model for the hard 20%. Or build an ensemble of three Tier 1 models that specialize in different subtasks. Show that cheap models collaborating can rival a single expensive model.
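That 80/20 split can be sketched as a tiny router. This is a hedged illustration, not a prescribed policy: `tier1` and `tier2` are hypothetical stand-ins for a local 1B model and a cheap API model, and the word-count and "I don't know" checks are placeholder heuristics you'd replace with your own escalation signal.

```python
def make_router(cheap_model, capable_model, max_cheap_words=64):
    """Route short requests to the tiny model first and escalate long
    or low-confidence ones. Both arguments are callables on a prompt."""
    def route(prompt):
        # Crude heuristic: short prompts try the cheap model first.
        if len(prompt.split()) <= max_cheap_words:
            answer = cheap_model(prompt)
            # Escalate only when the small model signals uncertainty.
            if "i don't know" not in answer.lower():
                return ("tier1", answer)
        return ("tier2", capable_model(prompt))
    return route

# Hypothetical stand-ins for a Tier 1 local model and a Tier 2 API model.
tier1 = lambda p: "I don't know" if "contract" in p else "42"
tier2 = lambda p: "detailed answer"

route = make_router(tier1, tier2)
print(route("What is 6 * 7?"))           # ('tier1', '42')
print(route("Summarize this contract"))  # ('tier2', 'detailed answer')
```

Logging which tier answered each request gives you the 80/20 ratio for free — useful evidence for the cost section of your write-up.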

What you must declare.

Transparency is core to Garage Inference. Every submission must include:

01

Working demo or deployed application

Live URL, video walkthrough, or reproducible local setup. It must run. No slide decks, no "imagine if this worked."

02

Source code on GitHub

Public repo with a clean README. Include setup instructions that actually work. Someone should be able to clone, install, and run your project without asking you questions.

03

Exact model declaration

"Qwen 3 4B Q4_K_M" — not "a small Llama model." Specify where inference runs: local (CPU/GPU, RAM, device) or cloud API (provider and tier). Include quantization details if applicable — format, precision, tooling.

04

Technical writeup

Explain which model(s) you used and why, what problem you solved, what engineering techniques you used to compensate for model weakness. Be honest about the division of labor — if your RAG pipeline does 90% of the work and the LLM just formats the answer, say so. That's not a weakness — that's smart engineering.

05

Cost & performance metrics

Total API cost if using cloud models — actual dollars spent during development and for the demo. Screenshots of billing dashboards are appreciated. Latency, accuracy, tokens per second — whatever matters for your use case.
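If you're estimating spend rather than reading it off a dashboard, the arithmetic is just token counts times per-million prices. The numbers below are made up for illustration — check your provider's current pricing.

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollars spent, given per-million-token prices."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000

# Illustrative run: 2M input tokens, 300k output tokens
# at a hypothetical $0.10/$0.40 per million tokens.
total = api_cost(2_000_000, 300_000, 0.10, 0.40)
print(f"${total:.2f}")  # $0.32
```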

06

2-minute demo video & known failures

Screen recording showing the project in action. Plus: where does the model still break? What inputs produce garbage? Judges respect honesty — and it helps them understand the real "wow gap."

Who's judging.

We're finalizing the judging panel. Expect people who've shipped real products with small models, not just benchmarked them. Follow us on Discord for announcements.

TBA

TBA

Judge details coming soon.


We're still looking for judges. If you have experience shipping with small/local models and want to help evaluate projects, reach out at hello@raptors.dev.

Why we build in garages.

The name says it. The best companies started in garages. The best music came from garage bands. The best hacking happens when you strip away resources and rely on raw ingenuity.

Frontier models are impressive. But they're also expensive, rate-limited, dependent on someone else's infrastructure, and inaccessible to most builders on the planet. The cheapest models — the ones nobody writes blog posts about, the ones that sit at the bottom of every leaderboard, the ones companies release as an afterthought next to their flagship — that's where the real democratization of AI lives.

There are billions of people and millions of developers who will never have a $200/month API budget. There are use cases that will never justify the cost of a frontier model. There are devices that will never connect to a cloud endpoint. The future of AI isn't just the biggest model — it's the smallest model that still gets the job done.

This hackathon is about proving that future is already here. No cloud budget required. No GPU cluster. Just the weakest model you can find and the smartest engineering you can muster.

Build something that makes people forget the model is cheap.

The current is shifting.

The interesting work isn't happening on billion-dollar clusters. It's on laptops, single GPUs, and free-tier endpoints. Builders who treat constraints as architecture decisions, not limitations. Quantized weights, prompt chains, systems running on hardware nobody expected them to run on — that's where the signal is. Capability per dollar is the metric that matters.