Inspiration

In 2023 alone, $25 million was lost to AI voice scams. We watched news stories of elderly people losing their life savings to deepfaked voices of their grandchildren, and saw reports of CEOs being tricked into wire transfers by AI-cloned executive voices. The terrifying part? Modern AI can clone any voice with just 3 seconds of audio.

Existing voice authentication systems focus only on who is speaking. They verify the voice matches, but they can't tell if it's a real human or an AI reading a script. We realized there's a fundamental gap: no one is testing whether the speaker actually understands what they're saying.

That's when we had our breakthrough: humans comprehend instructions, but text-to-speech systems just read verbatim.

What It Does

Catphish is a two-layer voice authentication system that stops AI deepfakes:

Layer 1: Speaker Verification (Resemblyzer)

  • Verifies the voice matches the registered user
  • Uses 256-dimensional voice embeddings with cosine similarity
  • Answers: "Is this the right person?"

Layer 2: Comprehension Analysis (Gemini)

  • Tests whether the speaker understands multi-step instructions
  • Detects the difference between humans and TTS systems
  • Answers: "Is this a real human speaking?"

The Anti-TTS Innovation

We generate adversarial phrases that expose AI imposters:

Challenge: "Say the sum of 2+3, then say apple banana."

  • Human response: "Five, apple banana"
  • AI TTS response: "Say the sum of 2+3, then say apple banana"

Humans naturally understand and execute complex instructions; AI text-to-speech systems are designed to read verbatim and struggle with dynamic cognitive challenges.
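As a rough illustration, a challenge generator and answer checker along these lines could distinguish a comprehended answer from a verbatim reading. This is a simplified sketch, not Catphish's actual implementation: the function names are made up, and a real checker would also normalize ASR output (e.g. "five" vs. "5").

```python
import random

# Hypothetical sketch of an anti-TTS challenge generator. Each challenge
# pairs a spoken instruction with the tokens a comprehending human would say;
# a TTS pipeline fed the raw prompt reads it back verbatim instead.

def make_challenge(rng: random.Random) -> dict:
    a, b = rng.randint(1, 4), rng.randint(1, 5)
    words = rng.sample(["apple", "banana", "cherry", "mango"], 2)
    prompt = f"Say the sum of {a}+{b}, then say {words[0]} {words[1]}."
    # Simplification: a real system would accept "five" as well as "5".
    return {"prompt": prompt, "expected": [str(a + b)] + words}

def looks_comprehended(transcript: str, challenge: dict) -> bool:
    text = transcript.lower()
    # A verbatim reading still contains instruction words like "say"/"sum";
    # a comprehended answer contains only the expected tokens.
    if "sum" in text or "say" in text:
        return False
    return all(tok in text for tok in challenge["expected"])
```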

Real-World Use Cases

  • Banks verify wire transfers and high-value transactions
  • Crypto exchanges confirm withdrawal requests
  • Healthcare providers authorize prescription refills
  • Enterprises secure VPN and system access

How We Built It

Architecture

Our system connects six major components:

FastAPI Backend - Multi-tenant API with enrollment, verification, and challenge generation endpoints. Deployed on Vultr infrastructure with API key authentication, versioning, and rate limiting.

Valkey Database - In-memory storage for voice embeddings and session state. Achieves sub-10ms retrieval times, critical for real-time authentication. Stores approximately 1KB per user with multi-tenant data isolation.
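A minimal sketch of what this storage layer might look like, assuming a Redis-compatible Valkey client (e.g. valkey-py) injected as `store`. The key scheme and helper names are illustrative, not the project's actual schema; the 256 float32 values pack into 1024 bytes, matching the ~1KB-per-user figure above.

```python
import struct

def embedding_key(tenant_id: str, user_id: str) -> str:
    # Tenant-prefixed keys keep each organization's biometric data isolated.
    return f"{tenant_id}:{user_id}:embedding"

def pack_embedding(vec: list) -> bytes:
    # 256 float32 values -> 1024 bytes stored per user.
    return struct.pack(f"{len(vec)}f", *vec)

def unpack_embedding(blob: bytes) -> list:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

def save_embedding(store, tenant_id: str, user_id: str, vec: list) -> None:
    store.set(embedding_key(tenant_id, user_id), pack_embedding(vec))

def load_embedding(store, tenant_id: str, user_id: str):
    blob = store.get(embedding_key(tenant_id, user_id))
    return None if blob is None else unpack_embedding(blob)
```

Because keys are namespaced by tenant, a lookup for one organization can never return another organization's embedding.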

Resemblyzer Engine - Generates 256-dimensional voice embeddings and performs cosine similarity matching with an 85% threshold. No raw audio is stored, only embeddings, which preserves privacy.
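The similarity check itself is simple; a sketch of the decision step, assuming embeddings are plain float vectors and using the 0.85 threshold described above (function names are illustrative):

```python
import math

THRESHOLD = 0.85  # the tuned value discussed in the challenges section

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def voices_match(enrolled, candidate, threshold=THRESHOLD):
    # Layer 1 decision: does the candidate voice match the enrolled user?
    return cosine_similarity(enrolled, candidate) >= threshold
```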

Google Gemini 2.5 Flash - Analyzes audio to detect verbatim reading versus instruction-following. Identifies TTS artifacts and unnatural speech patterns in real-time (3-6 seconds per verification).

Solana Blockchain - Provides immutable audit trail for every verification event. Logs hashed audio fingerprints, timestamps, and results. Currently on devnet, ready for mainnet.

React Frontends - User-facing verification UI with microphone recording plus demo banking integration. Enrollment takes 2-5 seconds, verification takes 3-6 seconds.

Development Timeline

Hour 0-8: Architecture design and API scaffolding. Designed two-layer verification system, set up FastAPI with multi-tenant structure, integrated Valkey for session management.

Hour 8-16: Core voice processing and AI integration. Implemented Resemblyzer embedding generation, built the anti-TTS phrase generator with cognitive challenges, and integrated Gemini audio comprehension analysis.

Hour 16-24: Frontend and deployment. Built the React verification UI, created the demo banking integration, deployed to Vultr infrastructure, tested against various TTS systems (ElevenLabs, Azure TTS, Google TTS), refined similarity thresholds and phrase generation, added comprehensive error handling, and added Solana transaction logging.

Challenges We Faced

Anti-TTS Phrase Generation

Creating phrases that reliably distinguish humans from AI without being too difficult for legitimate users was our first major challenge.

We tried complex math problems (too hard for users), simple word repetition (too easy for advanced TTS), and visual formatting cues (accessibility issues).

Our solution: multi-part instructions that require cognitive processing. Examples include "Say the number five, spell the word two, then say both words" and "Count backwards from 3, then say your favorite color." These work because TTS systems are designed to read verbatim, not interpret instructions.

Real-Time Performance

Voice authentication needs to feel instant. Users won't tolerate 10+ second delays.

Our pipeline includes audio upload and processing, Resemblyzer embedding generation, Gemini API call for comprehension analysis, Valkey lookups for stored embeddings, and Solana transaction logging.

We solved this with Valkey for sub-10ms embedding retrieval (versus 50-200ms with disk databases), asynchronous processing for parallel Gemini and Resemblyzer calls, optimized prompts to minimize Gemini token usage, and non-blocking blockchain writes.

Result: 3-6 second total verification time, fast enough for production.
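The parallelization step can be sketched with `asyncio.gather`, which runs both layers concurrently so total latency approaches the slower call rather than the sum of both. The worker coroutines below are placeholders for the real Resemblyzer and Gemini calls, and the function names are illustrative:

```python
import asyncio

async def speaker_score(audio: bytes) -> float:
    # Placeholder for embedding generation + cosine similarity (Layer 1).
    await asyncio.sleep(0.01)
    return 0.91

async def comprehension_ok(audio: bytes) -> bool:
    # Placeholder for the Gemini audio comprehension analysis (Layer 2).
    await asyncio.sleep(0.01)
    return True

async def verify(audio: bytes) -> bool:
    # Both layers run in parallel; latency is max(layer1, layer2), not the sum.
    score, comprehended = await asyncio.gather(
        speaker_score(audio), comprehension_ok(audio)
    )
    return score >= 0.85 and comprehended
```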

Similarity Threshold Tuning

Finding the right balance between security and usability required extensive testing. Too strict (over 90%) and legitimate users get rejected because voices change with colds, stress, and time of day. Too loose (under 80%) and attackers with good voice clones get through.

We settled on an 85% similarity threshold after testing with different recording devices (phone, laptop, headset), various environmental conditions (quiet room, background noise), multiple TTS systems trying to spoof, and the same user at different times of day.

This gives us 85-90% anti-spoofing accuracy while maintaining good UX.

Multi-Tenant Architecture

Banks and exchanges can't share voice embeddings. Strict data isolation is required.

We implemented API key-based tenant separation, Valkey key prefixing (tenant_id:user_id:embedding), separate Solana wallets per tenant, and ensured no cross-tenant data leakage is possible.

This was critical for production readiness. Enterprises won't use a system that could leak biometric data between organizations.

Privacy vs. Auditability

Blockchain audit trails versus biometric data privacy presented a design challenge.

Storing raw audio on-chain would create immutable voice recordings that could be replayed, violate privacy regulations, and bloat the blockchain with large audio files.

Our approach: hash audio fingerprints with SHA-256, log hash plus verification result plus timestamp to Solana. This provides immutable proof that verification occurred with zero PII (Personally Identifiable Information) or biometric data on-chain, and allows authenticity verification without exposing voice data.
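A sketch of what such an audit record might contain, assuming SHA-256 fingerprints as described above. The actual Solana write is out of scope here; the point is that only a hash, a result, and a timestamp get logged, never audio or PII:

```python
import hashlib
import time

def audit_record(audio: bytes, verified: bool, ts=None) -> dict:
    # Hypothetical record shape: a one-way fingerprint of the audio plus the
    # verification outcome. The hash proves the event happened without making
    # the voice data recoverable from the chain.
    return {
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "verified": verified,
        "timestamp": ts if ts is not None else time.time(),
    }
```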

What We Learned

Technical Insights

Valkey is incredibly fast. We initially planned to use PostgreSQL. Switching to Valkey reduced our embedding retrieval time from 150ms to under 10ms. For real-time biometrics, this difference makes the product viable.

Gemini's multimodal capabilities are underutilized. Most projects use Gemini for text generation. Its audio comprehension analysis is incredibly powerful for detecting subtle differences between human speech and AI synthesis. The model can identify verbatim reading patterns, lack of natural pauses and inflections, absence of cognitive processing delays, and TTS artifacts in audio frequencies.

AI versus AI works. Using Gemini to detect AI-generated voices feels like fighting fire with fire, and it works. The key insight: AI is great at pattern recognition, and TTS systems have patterns humans don't.

Blockchain needs a real use case. Most hackathon blockchain projects feel forced, focused purely on the financial side. Immutable audit trails for high-stakes authentication decisions are exactly the kind of problem blockchain solves. Companies need to prove to regulators that they verified identities correctly. Solana gives them that proof.

What's Next

3-Month Roadmap

  • Pilot deployments with organizations
  • Additional machine learning classifiers for spoofed audio (an extra layer of security)
  • Multi-language support (Gemini handles 100+ languages)

6-Month Roadmap

  • Production Solana mainnet deployment
  • iOS/Android SDKs for mobile integration
  • Live phone call integration (verify during calls)

12-Month Roadmap

  • Custom AI models for improved TTS detection
  • Scale to 1M+ active users
  • Expansion beyond authentication (fraud detection, call center verification)

Our Dare

Try to fool our system with AI voice clones (ElevenLabs, Azure TTS), pre-recorded audio of yourself, or someone who sounds similar to you.

"Try to deepfake us, we dare you."
