CactusRoute — 7-Layer Adaptive Edge/Cloud Hybrid Router

🏆 1st Place — Google DeepMind x Cactus Compute Global Hackathon (Feb 2026)

FunctionGemma 270M (on-device) ↔ Gemini 2.5 Flash (cloud)
Schema-driven adaptive routing for function calling — backed by 8 arXiv papers

What is this?

A hybrid inference strategy for the FunctionGemma Hackathon that dynamically routes tool-calling queries between a 270M on-device model (FunctionGemma via Cactus) and Gemini 2.5 Flash in the cloud.

Instead of using a fixed confidence threshold (the baseline uses 0.99, routing nearly everything to cloud), CactusRoute uses a 7-layer schema-driven adaptive framework with output repair, semantic validation, deterministic extraction, retry with prompt variation, and per-difficulty adaptive thresholds — every technique grounded in peer-reviewed research.

Architecture

User Query
    │
    ▼
┌──────────────────────────────────────────┐
│  Layer 1: Pre-flight Difficulty          │  Zero-cost heuristic: tool count +
│  Estimation  (easy / medium / hard)      │  multi-intent markers ("and", commas)
│  ↳ ODIA (2507.08877)                     │  Backed by: simple/complex routing
└───────────────────┬──────────────────────┘
                    │
                    ▼
┌──────────────────────────────────────────┐
│  FunctionGemma  (On-Device, 270M)        │  Always runs first (~50-100ms)
│  force_tools=True, constrained JSON      │  Speculative local-first approach
│  ↳ TinyAgent (2409.00608)                │  Backed by: SLM ≥ GPT-4-Turbo
│  ↳ Hammer (2410.04587)                   │  Backed by: description-aware calling
└───────────────────┬──────────────────────┘
                    │
     ┌──────────────┴──────────────┐
     │  Layer 2: Handoff Signals   │
     │  cloud_handoff (1st token)  │──→ catastrophic entropy → Layer 7
     │  spike_handoff (mid-gen)    │──→ entropy spike → Layer 7
     │  ↳ STEER (2511.06190)       │
     │  ↳ U-HLM (2412.12687)      │
     └──────────────┬──────────────┘
                    │ (generation succeeded)
                    ▼
     ┌──────────────────────────────┐
     │  Layer 3: Output Repair     │  AM/PM hour correction, negative fix,
     │  repair_output()            │  semantic mismatch fill, type coercion
     │  ↳ Hybrid-Code (2512.23743) │  Backed by: format normalization
     └──────────────┬──────────────┘
                    │
                    ▼
     ┌──────────────────────────────┐
     │  Layer 4: Multi-Gate        │  A. Structural: tool names + required params
     │  Validation                 │  B. Semantic: word-overlap + integer ranges
     │  validate_output()          │  C. Intent coverage: expected vs actual calls
     │  semantic_validate()        │
     │  ↳ PARSE (2510.08623)       │  Backed by: reflection-based guardrails
     │  ↳ Hammer (2410.04587)      │  Backed by: description-aware validation
     │  ↳ ToolRM (2510.26167)      │  Backed by: rule-based scoring
     └──────────────┬──────────────┘
                    │
                    ▼
     ┌──────────────────────────────┐
     │  Layer 5: Adaptive Conf.    │  easy=0.25  medium=0.45  hard=0.60
     │  Thresholds                 │  Dynamic > fixed (bimodal distribution)
     │  ↳ STEER (2511.06190)       │  Backed by: GMM-fitted confidence
     │  ↳ ODIA (2507.08877)        │  Backed by: simple/complex routing
     └──────┬───────────┬──────────┘
            │ PASS      │ FAIL
            ▼           ▼
     ┌──────────┐ ┌────────────────────┐
     │  ACCEPT  │ │  Layer 6: Retry    │  Alternate system prompt
     │  Local   │ │  with Prompt       │  Full re-validation pipeline
     │  Result  │ │  Variation         │
     └──────────┘ │  ↳ PARSE           │  Backed by: 92% error reduction
                  │  ↳ ToolRM          │  Backed by: self-correction +11.4pts
                  └──────┬─────────────┘
                         │ (retry also failed)
                         ▼
                  ┌────────────────────┐
                  │  Layer 7: Determ.  │  Schema-driven regex extraction
                  │  Extraction +     │  from raw user text; segment
                  │  Cloud Fallback   │  decomposition for multi-call
                  │  ↳ Hybrid-Code    │  Backed by: keyword fallback
                  │  ↳ TinyAgent      │  Backed by: Tool RAG patterns
                  └──────┬─────────────┘
                         │ (extraction failed)
                         ▼
                  ┌────────────────────┐
                  │  Gemini 2.5 Flash  │  Cloud fallback (last resort)
                  │  Cloud Endpoint    │
                  └────────────────────┘

The 7 Layers Explained

Layer	Function	Research Backing
1. Difficulty estimation	Classifies query as easy/medium/hard via tool count + NLP markers — zero model inference	ODIA (2507.08877): ByteDance's simple/complex routing handles 60% of traffic with small model
2. Handoff signals	Cactus `cloud_handoff` (1st token entropy) and `spike_handoff` (mid-generation entropy spike)	STEER (2511.06190): logit confidence is bimodal → clean separation; U-HLM (2412.12687): speculative local-first saves 46% cloud calls
3. Output repair	AM/PM hour correction, negative value fix, semantic mismatch fill, type coercion	Hybrid-Code (2512.23743): "format normalization" auto-corrects LLM output errors; 0% hallucination rate
4. Multi-gate validation	Structural (tool names + required params) + semantic (word-overlap + integer range) + intent coverage	PARSE (2510.08623): reflection-based guardrails; Hammer (2410.04587): description-aware validation; ToolRM (2510.26167): rule-based scoring
5. Adaptive thresholds	Per-difficulty confidence bars: easy=0.25, medium=0.45, hard=0.60	STEER: dynamic > fixed thresholds with GMM-fitted bimodal distribution; ODIA: difficulty-based routing proven in production
6. Retry with prompt variation	Second on-device attempt with alternate system prompt; full re-validation	PARSE: 92% error reduction within first retry; ToolRM: self-correction yields +11.4 accuracy points
7. Deterministic extraction	Schema-driven regex parsing from raw text; segment decomposition for multi-call; cloud fallback as last resort	Hybrid-Code: "reliability through redundancy"; TinyAgent (2409.00608): 1.1B model exceeds GPT-4-Turbo via structured extraction

Semantic Role System

The framework uses 11 semantic roles to map tool parameters to extraction strategies:

Role	Extraction Strategy	Example
`ROLE_HOUR`	Time regex: `at 3pm`, `3:00`	`set_alarm(hour=15)`
`ROLE_MINUTE`	Time regex: `3:30`, `half past`	`set_alarm(minute=30)`
`ROLE_DURATION`	Duration regex: `10 minutes`, `1 hour`	`set_timer(duration=10)`
`ROLE_LOCATION`	Location patterns: `in Paris`, `weather for NYC`	`get_weather(location="Paris")`
`ROLE_PERSON`	Proper name patterns: `send Bob`, `to Alice`	`send_message(contact="Bob")`
`ROLE_MESSAGE`	Message patterns: `saying "hello"`, `"meet me"`	`send_message(message="hello")`
`ROLE_TITLE`	Reminder patterns: `to buy milk`	`create_reminder(title="buy milk")`
`ROLE_SONG`	Play patterns: `play Bohemian Rhapsody`	`play_music(song="...")`
`ROLE_QUERY`	Search patterns: `find contact`, `search for`	`search_contacts(query="...")`
`ROLE_TIME_STR`	Full time string: `at 3pm`, `3:00 PM`	`set_alarm(time="3:00 PM")`
`ROLE_UNKNOWN`	Cloud fallback — cannot extract deterministically	—

Scoring Formula

$$\text{Score} = \sum_{d} w_d \cdot (0.60 \cdot F1 + 0.15 \cdot \text{TimeScore} + 0.25 \cdot \text{OnDeviceRatio})$$

Difficulty weights: easy=20%, medium=30%, hard=50%.

Our 7-layer framework maximizes all three components: high F1 through multi-gate validation and repair, low latency through local-first execution, and high on-device ratio through adaptive thresholds + retry + deterministic extraction.

Project Structure

cactus-hack/
├── README.md                  ← You are here
├── RESEARCH.md                ← 83 papers searched, 8 deeply analyzed, 140+ learnings
├── STRATEGY.md                ← Detailed strategy with research findings
│
├── functiongemma-hackathon/   ← Hackathon submission
│   ├── main.py                ← 7-layer adaptive router (~1200 lines)
│   ├── benchmark.py           ← Official benchmark (30 cases: 10 easy/10 med/10 hard)
│   ├── submit.py              ← Leaderboard submission script
│   ├── demo.py                ← Rich interactive demo (4 modes)
│   └── tests.py               ← 239 unit tests, 27 test classes (any platform)
│
├── deep-research-mcp-server/  ← Deep research pipeline (Gemini-powered)
│   ├── src/                   ← TypeScript source
│   └── output/                ← Research outputs (learnings JSON + reports)
│
├── cactus/                    ← Cactus SDK (git submodule)
│   ├── python/                ← Python bindings
│   └── weights/               ← Model weights (downloaded via cactus CLI)
│
└── papers/                    ← Saved research papers

Quick Start

Prerequisites

uv (Python package manager)
Mac with Cactus SDK for benchmark/demo (tests run anywhere)
GEMINI_API_KEY environment variable

Setup

cd functiongemma-hackathon
uv sync

Run Tests (any platform)

uv run python tests.py -v     # 206 tests, 23 classes, ~0.01s, no Cactus needed

Run Benchmark (Mac only — requires Cactus)

export GEMINI_API_KEY="your-key"
uv run python benchmark.py

Run Demo (Mac only — requires Cactus)

uv run python demo.py                # Curated scenarios with dashboard
uv run python demo.py --interactive  # Free-form text input
uv run python demo.py --voice        # Voice-to-action via Whisper
uv run python demo.py --compare      # Baseline vs CactusRoute side-by-side
uv run python demo.py --benchmark    # Full 30-case benchmark run

Submit to Leaderboard

uv run python submit.py --team "YourTeamName" --location "YourCity"

Key Optimizations Over Baseline

#	Optimization	Research Backing	Impact
1	Model singleton — load once, reuse	—	Saves ~7-15s across 30 benchmark calls
2	Pre-flight difficulty — tool count + NLP heuristics	ODIA (ByteDance)	Zero-cost routing signal
3	Adaptive thresholds — 0.25 / 0.45 / 0.60	STEER, FrugalGPT	Maximizes on-device without sacrificing F1
4	Schema-driven output repair — AM/PM, negatives, mismatches	Hybrid-Code	Rescues otherwise-rejected local outputs
5	Semantic validation — word overlap + range checks	PARSE, Hammer	Catches hallucinated parameters
6	Role-based extraction — 11 semantic roles mapped to regex	PARSE (ARCHITECT)	Deterministic fallback for on-device
7	Retry with prompt variation — alternate system prompt	PARSE (92% 1st retry), ToolRM (+11.4pts)	Cheap second chance on-device
8	Deterministic extraction — schema-driven text parsing	Hybrid-Code (keyword fallback)	Extracts calls without any LLM
9	Intent coverage augmentation — fills missing calls	TinyAgent (LLMCompiler)	Catches incomplete multi-call output
10	Type coercion — string→int based on schema	ToolRM (argument similarity)	`"10"` ≠ `10` in F1 comparator
11	`tool_rag_top_k=0` — use ALL tools	TinyAgent (Tool RAG)	Default=2 misses needed tools
12	Dynamic system prompt — multi-call instruction for hard queries	—	"Call ALL relevant tools"
13	Cloud model fix — `gemini-2.5-flash`	—	Baseline's `gemini-2.0-flash` is deprecated
14	Source tag normalization — all on-device paths report `"on-device"`	—	Benchmark checks `source == "on-device"` exactly; `"on-device (retry)"` etc. were scored as cloud

Note: All on-device execution paths (direct, retry, extracted) set source = "on-device" for benchmark compatibility. The fine-grained detail (e.g. "on-device (retry)", "on-device (extracted)") is preserved in result["_detail"] and shown in benchmark/demo display output. To restore verbose source tags, change "source" assignments back to the "_detail" values in generate_hybrid() and _try_extraction_then_cloud().

Test Suite

239 tests across 27 test classes — runs on any platform, no Cactus or API keys needed:

Test Class	Tests	What it covers
`TestEstimateDifficulty`	12	Tool count + multi-intent classification
`TestCountExpectedIntents`	5	NLP-based intent counting
`TestCoerceArgTypes`	8	Schema-driven type coercion
`TestValidateOutput`	8	Structural validation (names + params)
`TestInferParamRole`	12	Semantic role inference from schema
`TestExtractForRole`	25	Regex extraction for all 11 roles
`TestSemanticValidate`	6	Word-overlap + range validation
`TestRepairOutput`	7	AM/PM, negatives, semantic repair
`TestBuildCallsFromText`	9	Deterministic extraction pipeline
`TestRoutingDecisions`	19	End-to-end routing with thresholds
`TestThresholdBoundaries`	5	Exact boundary conditions for thresholds
`TestSignalPriority`	3	Handoff checked before confidence
`TestBenchmarkCompatibility`	2	F1 normalization + call matching
`TestToolRelevance`	9	Keyword-based tool ranking
`TestSegmentQuery`	7	Multi-intent query splitting
`TestAugmentCalls`	5	Missing intent augmentation
`TestBuildCallsFromSegments`	6	Segmented extraction pipeline
`TestBenchmarkExtraction`	14	Benchmark-realistic extraction patterns
`TestExtractionF1`	17	F1 scoring against real benchmark expected values
`TestRepairChainSafety`	9	Repair doesn't degrade valid output
`TestCrossEntityConfusion`	4	Entity isolation across tools/params
`TestSemanticEdgeCases`	9	Hallucination rejection + edge cases
`TestRoutingPipelineIntegration`	5	Full pipeline F1 with realistic model failures
`TestFailingBenchmarkCases`	4	Regression tests for specific benchmark failures
`TestBenchmarkExactMatch`	8	Exact-match validation for benchmark cases
`TestSemanticValidationRejectsWrongValues`	16	Semantic rejection of hallucinated values
`TestFullPipelineFallback`	5	End-to-end fallback chain validation

Research

Phase 1: Routing Fundamentals

arxiv MCP — 30 papers on edge inference, model routing, confidence calibration
deep-research MCP — Two full runs: Gemini 2.5 Flash (76 learnings, 34 URLs, 248s) and Gemini 3.0 Flash Preview (64 learnings, 50 URLs, 158s)
bluera-knowledge MCP — Cactus SDK source analysis (confidence calculation, handoff signals)
GitHub MCP — Competitive landscape (158 forks analyzed, 3 implementations read)

Phase 2: 7-Layer Technique Validation

arxiv MCP — 53 additional papers; 6 deeply analyzed and cited throughout implementation:
- PARSE (2510.08623) — Schema optimization + reflection-based guardrails → validates our infer_param_role() + semantic_validate()
- Hybrid-Code (2512.23743) — 3-tier neuro-symbolic framework → validates our LLM → extraction → verification pipeline
- TinyAgent (2409.00608) — 1.1B model exceeds GPT-4-Turbo on function calling → validates SLM-first approach
- Hammer (2410.04587) — Function masking for description-aware calling → validates extract_for_role() semantic patterns
- ODIA (2507.08877) — Simple/complex query routing, 78% latency reduction → validates estimate_difficulty()
- ToolRM (2510.26167) — Tool-use reward modeling with self-correction → validates repair_output() + retry mechanism

Totals

83 papers searched, 8 deeply analyzed, 140+ learnings extracted
Every layer of the 7-layer framework is backed by at least one peer-reviewed paper

See RESEARCH.md for the full synthesis.

License

Hackathon project — see cactus-compute/functiongemma-hackathon for terms.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CactusRoute — 7-Layer Adaptive Edge/Cloud Hybrid Router

🏆 1st Place — Google DeepMind x Cactus Compute Global Hackathon (Feb 2026)

What is this?

Architecture

The 7 Layers Explained

Semantic Role System

Scoring Formula

Project Structure

Quick Start

Prerequisites

Setup

Run Tests (any platform)

Run Benchmark (Mac only — requires Cactus)

Run Demo (Mac only — requires Cactus)

Submit to Leaderboard

Key Optimizations Over Baseline

Test Suite

Research

Phase 1: Routing Fundamentals

Phase 2: 7-Layer Technique Validation

Totals

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.bluera/bluera-knowledge		.bluera/bluera-knowledge
.vscode		.vscode
functiongemma-hackathon		functiongemma-hackathon
papers		papers
.gitignore		.gitignore
README.md		README.md
RESEARCH.md		RESEARCH.md
STRATEGY.md		STRATEGY.md

Folders and files

Latest commit

History

Repository files navigation

CactusRoute — 7-Layer Adaptive Edge/Cloud Hybrid Router

🏆 1st Place — Google DeepMind x Cactus Compute Global Hackathon (Feb 2026)

What is this?

Architecture

The 7 Layers Explained

Semantic Role System

Scoring Formula

Project Structure

Quick Start

Prerequisites

Setup

Run Tests (any platform)

Run Benchmark (Mac only — requires Cactus)

Run Demo (Mac only — requires Cactus)

Submit to Leaderboard

Key Optimizations Over Baseline

Test Suite

Research

Phase 1: Routing Fundamentals

Phase 2: 7-Layer Technique Validation

Totals

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages