Feature: Twilio Skill — Phone Calls, SMS, and Voice from the Agent #409

@teknium1

Overview

Give Hermes Agent a phone number. Users in the Nous Research Discord have demonstrated one of the most compelling agent use cases: an AI that can call your phone. A user recently gave Hermes on Kimi K-2.5 a single prompt — "Download TTS, spin up its own voice, then call my phone with the single weirdest joke it could invent" — and it actually did it.

This issue proposes a Twilio skill that teaches Hermes how to make outgoing phone calls (with TTS or custom audio), send SMS/MMS messages, and eventually receive incoming calls and texts. The combination of Hermes's existing text_to_speech tool + Twilio's Programmable Voice API creates a powerful voice-enabled agent that can reach out into the real world.

Use cases:

  • Agent calls you with status updates, alerts, or results from long-running tasks
  • Agent sends SMS notifications when deployments complete, cron jobs finish, or issues need attention
  • Agent leaves voicemail-style messages with TTS-generated briefings
  • Agent makes calls with custom AI-generated voices (via existing TTS tool → audio file → Twilio)
  • Receive SMS commands to trigger agent actions (lightweight alternative to full messaging gateways)
  • Phone-based two-factor authentication flows for agent-managed services
  • Automated phone tree navigation (agent calls a support line, navigates IVR menus with DTMF tones)

Research Findings

Twilio API Architecture

Twilio's REST API is straightforward — everything is curl-able with Basic Auth:

Make a phone call with TTS:

# --data-urlencode percent-encodes the "+" in phone numbers and the TwiML markup
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Twiml=<Response><Say voice=\"Polly.Amy\">Hello! This is your Hermes Agent calling with an important update.</Say></Response>"

Make a call with a pre-recorded audio file:

curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Twiml=<Response><Play>https://example.com/my-tts-audio.mp3</Play></Response>"

Send an SMS:

curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Messages.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Body=Your deployment to production completed successfully. All 42 tests passing."

Send MMS with image:

curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Messages.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Body=Here is the chart you requested" \
  --data-urlencode "MediaUrl=https://example.com/chart.png"
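For cases where the agent prefers Python over curl, the same requests can be sketched with the standard library alone. This is a minimal sketch, not part of the proposed skill: `build_twilio_request` is a hypothetical helper name, and it only builds the request so that it can be inspected before anything is sent.

```python
import base64
import urllib.parse
import urllib.request

API_BASE = "https://api.twilio.com/2010-04-01"


def build_twilio_request(account_sid, auth_token, resource, params):
    """Build (but do not send) a form-encoded POST for a Twilio resource.

    `resource` is "Calls" or "Messages"; `params` maps Twilio parameter
    names (To, From, Twiml, Body, MediaUrl) to string values.
    """
    url = f"{API_BASE}/Accounts/{account_sid}/{resource}.json"
    # urlencode percent-encodes "+", "<", ">" and turns spaces into "+",
    # so the leading "+" of an E.164 number survives form decoding.
    body = urllib.parse.urlencode(params).encode("ascii")
    credentials = f"{account_sid}:{auth_token}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate from sending makes it easy to log or review a call before it costs money.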

Key Technical Details

Authentication: HTTP Basic Auth with Account SID + Auth Token. Environment variables: TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER.

TwiML (Twilio Markup Language): XML instructions for call behavior. Key verbs:

  • <Say> — Text-to-speech (supports 26+ voices including Amazon Polly voices)
  • <Play> — Play an audio file from a URL
  • <Gather> — Collect DTMF input (phone keypad) or speech input
  • <Record> — Record the caller's voice
  • <Dial> — Connect to another phone number
  • <Pause> — Wait N seconds
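Because TwiML is XML, message text containing `&`, `<` or `>` has to be escaped before it is placed inside a verb. A minimal sketch of generating two of the verbs above with Python's standard library follows; the helper names `say_twiml` and `play_twiml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def say_twiml(text, voice="Polly.Amy"):
    """Return a TwiML document that speaks `text` with the given voice."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", {"voice": voice})
    say.text = text  # ElementTree escapes &, < and > on serialization
    return ET.tostring(response, encoding="unicode")


def play_twiml(audio_url, pause_seconds=0):
    """Return TwiML that optionally pauses, then plays audio from a URL."""
    response = ET.Element("Response")
    if pause_seconds:
        ET.SubElement(response, "Pause", {"length": str(pause_seconds)})
    ET.SubElement(response, "Play").text = audio_url
    return ET.tostring(response, encoding="unicode")
```

The returned string can be used directly as the value of the `Twiml` parameter.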

Call flow with custom audio (Hermes TTS → Twilio):

1. Hermes generates TTS audio via text_to_speech tool → local .mp3 file
2. Upload audio to a publicly accessible URL (e.g., via here.now skill #378, 
   cloud storage, or a simple Python HTTP server that is reachable from the internet)
3. Make Twilio call with <Play> pointing to the audio URL
4. Twilio calls the recipient and plays the custom AI voice
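Step 2 is the only part that needs infrastructure. As a rough sketch of the "simple HTTP server" option, the snippet below serves the audio cache directory from a background thread. The function name `serve_directory` is hypothetical, and the server binds to 127.0.0.1, so Twilio could only fetch from it through a tunnel or after rebinding to a public interface.

```python
import functools
import http.server
import threading


def serve_directory(directory, port=0):
    """Serve `directory` over HTTP in a background thread.

    Returns (server, port). With port=0 the OS picks a free port.
    Call server.shutdown() once the call has finished.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

A temporary server like this should be shut down as soon as the call completes, since everything in the directory is readable by anyone who can reach the port.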

Twilio CLI alternative: The twilio CLI tool provides even simpler commands:

twilio api:core:messages:create --from "$TWILIO_PHONE_NUMBER" --to "+15558675310" --body "Hello from Hermes"
twilio api:core:calls:create --from "$TWILIO_PHONE_NUMBER" --to "+15558675310" --twiml "<Response><Say>Hello</Say></Response>"

Pricing context:

  • Outgoing calls: ~$0.014/min (US), ~$0.02-0.05/min (international)
  • SMS: ~$0.0079/message (US)
  • Phone numbers: ~$1.15/month (US local)
  • Free trial: $15.50 credit with a trial account
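To put these rates in perspective, here is a small worked estimate. It uses only the approximate US figures listed above and assumes each call is billed in whole minutes, rounded up; the function name is hypothetical and actual Twilio pricing may differ.

```python
import math

# Approximate US rates quoted in this issue; not authoritative pricing.
CALL_RATE_PER_MIN = 0.014
SMS_RATE_PER_MSG = 0.0079
NUMBER_RATE_PER_MONTH = 1.15


def estimate_monthly_cost(call_seconds, sms_count):
    """Estimate one month's spend in USD.

    `call_seconds` is a list of call durations in seconds. Assumes each
    call is billed per whole minute, rounded up.
    """
    billed_minutes = sum(math.ceil(s / 60) for s in call_seconds)
    total = (
        NUMBER_RATE_PER_MONTH
        + billed_minutes * CALL_RATE_PER_MIN
        + sms_count * SMS_RATE_PER_MSG
    )
    return round(total, 4)
```

For example, a 45-second briefing call every day for 30 days plus 100 texts comes to 1.15 + 30 × 0.014 + 100 × 0.0079 = 2.36 USD under these assumptions.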

Rate limits:

  • Outbound call creation: 1 call per second (CPS) by default, up to 5 CPS for verified accounts
  • SMS: varies by number type (10DLC, toll-free, short code)
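An agent running in a loop can exceed 1 CPS without noticing, so client-side pacing is one option. The sketch below is a hypothetical helper (`CallPacer` is not an existing Hermes or Twilio class) that enforces a minimum interval between call starts; the clock and sleep functions are injectable so that it can be tested without waiting.

```python
import time


class CallPacer:
    """Block so that successive calls start at least `min_interval` apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_start = None

    def wait(self):
        """Sleep if needed, then record the start time of the next call."""
        now = self._clock()
        if self._last_start is not None:
            remaining = self.min_interval - (now - self._last_start)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_start = now
```

Calling `pacer.wait()` immediately before each request to the Calls API keeps a single process under the default limit; it does not coordinate across multiple agent processes.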

Integration with Existing Hermes Tools

text_to_speech tool — Hermes already generates speech audio. The natural flow:

  1. Agent writes the message text
  2. text_to_speech generates audio with the user's configured voice
  3. Audio is uploaded to a public URL
  4. Twilio call uses <Play> to deliver the custom voice

image_generate tool — Agent generates images, sends via MMS

schedule_cronjob tool — Schedule recurring phone calls or SMS:

  • "Call me every morning at 8am with a weather briefing"
  • "Text me when the server health check fails"

send_message tool — SMS could be an additional delivery target alongside Telegram/Discord


Current State in Hermes Agent

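What we already have:

  • The text_to_speech, image_generate, schedule_cronjob, and send_message tools described above, plus terminal for running curl

What's missing: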

  • No skill teaching the agent how to use Twilio
  • No structured pattern for making phone calls
  • No SMS sending capability
  • No integration between TTS output and phone delivery

Implementation Plan

Skill vs. Tool Classification

This should be a skill (Phase 1-2) because:

  • All Twilio operations can be done via curl to the REST API or the twilio CLI
  • Credentials are managed via environment variables (standard pattern)
  • No binary data handling — audio URLs and text are all that's needed
  • No streaming or real-time events for outgoing operations
  • The agent already has terminal to run curl commands; the skill teaches it how

Phase 3 (incoming) may warrant a lightweight tool or gateway adapter if we want to handle incoming calls/SMS as a full messaging platform. But Phase 1-2 as a skill is clean and sufficient.

Bundled vs Skills Hub: Skills Hub initially. Twilio is a paid service that not everyone uses. Once community adoption proves demand, consider bundling. (For comparison, the google-workspace skill is bundled even though it also requires API setup, so bundling later would not be unprecedented.)

What We'd Need

  1. twilio skill — Instructions for making calls, sending SMS/MMS, checking status
  2. Environment variable documentation — TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER
  3. TwiML templates — Common call flows (say message, play audio, gather input)
  4. Integration patterns — How to combine with text_to_speech, image_generate, schedule_cronjob

Phased Rollout

Phase 1: Outgoing Calls + SMS (MVP)

  • Skill teaching curl-based Twilio API usage
  • Make phone calls with Twilio's built-in TTS (<Say> verb)
  • Make phone calls with custom audio (<Play> verb + TTS tool integration)
  • Send SMS messages
  • Send MMS with media URLs
  • Check call/message status
  • Templates for common TwiML patterns
  • Deliverables: twilio skill with reference templates
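For the "check call/message status" item, a call can be polled by fetching its resource with a GET and reading the `status` field. The sketch below only builds the URL and interprets a response body; the helper names are hypothetical, and the set of final statuses reflects my understanding of Twilio's call lifecycle rather than anything specified in this issue.

```python
import json

# Statuses after which a call is finished (assumed from Twilio's call lifecycle).
TERMINAL_CALL_STATUSES = {"completed", "busy", "failed", "no-answer", "canceled"}


def call_status_url(account_sid, call_sid):
    """Return the REST URL of a single call resource."""
    return (
        "https://api.twilio.com/2010-04-01"
        f"/Accounts/{account_sid}/Calls/{call_sid}.json"
    )


def summarize_call(response_body):
    """Extract the SID and status from a call resource JSON body."""
    payload = json.loads(response_body)
    status = payload["status"]
    return {
        "sid": payload["sid"],
        "status": status,
        "done": status in TERMINAL_CALL_STATUSES,
    }
```

An agent could poll until `done` is true and then report the outcome to the user instead of assuming the call connected.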

Phase 2: Advanced Voice + Notifications

  • Integration with schedule_cronjob for recurring calls/texts
  • Phone call with DTMF navigation (<Gather> for navigating IVR menus)
  • Call recording and playback
  • Multi-step call flows (say something, wait for keypad input, respond)
  • SMS as a delivery target for send_message tool
  • Conference calls (connect multiple numbers)
  • Voicemail detection and handling
  • Deliverables: Enhanced skill + send_message SMS adapter

Phase 3: Incoming Calls + SMS Gateway

  • Receive incoming SMS as agent messages (Twilio webhook → Hermes)
  • Receive incoming calls with voice interaction (Twilio → STT → agent → TTS → Twilio)
  • This becomes a full telephony gateway platform (like Telegram/Discord but for phone)
  • Would need a webhook server (could use the Pinggy skill, #361, or a simple Flask/FastAPI server)
  • Deliverables: gateway/platforms/twilio.py telephony adapter
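A rough sketch of the core of such a webhook is shown below: verify that the request came from Twilio, then answer an inbound SMS with TwiML. It assumes Twilio's form-POST signature scheme (HMAC-SHA1 over the request URL followed by the sorted parameter names and values, sent in the `X-Twilio-Signature` header); the function names and the `reply_for` callback are hypothetical, and HTTP server wiring is left out.

```python
import base64
import hashlib
import hmac
import urllib.parse
import xml.etree.ElementTree as ET


def expected_signature(auth_token, url, params):
    """Compute the signature expected for a form-encoded webhook POST."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(
        auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def handle_incoming_sms(auth_token, url, raw_body, signature, reply_for):
    """Return TwiML replying to an inbound SMS, or None if the signature is wrong.

    `raw_body` is the form-encoded request body as a string and
    `reply_for(sender, text)` produces the reply text.
    """
    params = dict(urllib.parse.parse_qsl(raw_body, keep_blank_values=True))
    expected = expected_signature(auth_token, url, params)
    if not hmac.compare_digest(expected, signature):
        return None
    response = ET.Element("Response")
    reply = reply_for(params["From"], params["Body"])
    ET.SubElement(response, "Message").text = reply
    return ET.tostring(response, encoding="unicode")
```

In a gateway adapter, `reply_for` would hand the text to the agent; rejecting unsigned requests matters here because the webhook URL is public.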

Phase 4: Conversational Voice Agent


Example Workflows

1. "Call me with a joke" (what the Discord user did)

User: Call +15551234567 with the weirdest joke you can think of

Agent:
1. Compose the joke
2. text_to_speech("Why did the quantum physicist break up with the biologist?...")
   → ~/.hermes/audio_cache/joke_abc123.mp3
3. Upload audio to public URL (python -m http.server or here.now)
4. curl -X POST Twilio Calls API with <Play> pointing to the audio URL
5. "Done! Calling +15551234567 now. Call SID: CA123..."

2. "Text me when the deploy finishes"

User: Deploy to production and text me at +15551234567 when it's done

Agent:
1. Run deployment commands
2. Verify deployment succeeded
3. curl -X POST Twilio Messages API
   Body: "✅ Deployment to production completed. 42/42 tests passing. 
          Deployed commit abc123 at 3:45 PM."

3. Scheduled morning briefing

User: Every morning at 8am, call me with a weather and calendar summary

Agent:
1. schedule_cronjob:
   schedule: "0 8 * * *"
   prompt: "Check the weather for Las Vegas and my Google Calendar. 
            Generate a TTS briefing and call +15551234567 with the summary 
            using the twilio skill."

4. Navigate a phone tree

User: Call my ISP at +18001234567 and navigate to the billing department

Agent:
1. curl Twilio Calls API with:
   Twiml: <Response>
     <Dial><Number sendDigits="wwww1ww2ww#">+18001234567</Number></Dial>
   </Response>
   (sendDigits belongs on <Number>; each "w" is a 0.5 s pause, so this
    waits 2 s, presses 1, waits 1 s, presses 2, waits 1 s, presses #)
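Hand-writing digit strings is error-prone, so the skill could generate them. In Twilio's sendDigits syntax a lowercase `w` is a half-second pause; the sketch below builds the string from (wait, keys) pairs. `build_send_digits` is a hypothetical helper, and real IVR menus will need timings found by trial.

```python
def build_send_digits(steps):
    """Build a sendDigits string from (wait_seconds, keys) pairs.

    Each "w" is a half-second pause; keys may contain 0-9, * and #.
    """
    parts = []
    for wait_seconds, keys in steps:
        half_seconds = round(wait_seconds * 2)
        if half_seconds < 0 or half_seconds != wait_seconds * 2:
            raise ValueError("waits must be non-negative multiples of 0.5 s")
        if not keys or any(k not in "0123456789*#" for k in keys):
            raise ValueError(f"invalid DTMF keys: {keys!r}")
        parts.append("w" * half_seconds + keys)
    return "".join(parts)
```

For the billing example, `build_send_digits([(2, "1"), (1, "2"), (1, "#")])` yields `wwww1ww2ww#`.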

Pros & Cons

Pros

  • Reach beyond the screen — Phone calls and SMS work everywhere, even without internet. The agent can reach you when you're away from your computer.
  • Proven viral appeal — The Discord demo shows this captures people's imagination. "My AI called my phone" is a powerful demo.
  • Simple API — Twilio's REST API is just curl with Basic Auth. No complex SDKs needed.
  • Composable with existing tools — TTS + Twilio = custom voice calls. image_generate + MMS = visual notifications. Cronjobs + calls = scheduled briefings.
  • Low barrier — Twilio has a free trial ($15.50 credit), costs are pennies per call/text.
  • Natural extension of send_message — SMS is just another delivery channel alongside Telegram/Discord.

Cons / Risks

  • Cost — Unlike Telegram/Discord (free), Twilio costs money. Runaway agent loops could rack up charges.
  • Phone number requirement — Need to buy a Twilio number ($1.15/month). Requires Twilio account setup.
  • Abuse potential — An agent with phone call capabilities could be used for spam or harassment. Need clear safety guidelines and rate limiting.
  • Compliance — Twilio requires A2P 10DLC registration for US SMS, toll-free verification, etc. Can be bureaucratic.
  • Audio hosting for custom TTS — The agent's TTS output is a local file. Needs to be uploaded to a public URL for Twilio to play. This is a friction point (but solvable with the here.now skill, #378, or a simple HTTP server).
  • Incoming calls complexity — Phase 3 (receiving) requires a webhook server, which is more infrastructure than a pure skill can handle.

Open Questions

  1. Twilio CLI vs curl? The twilio CLI is more ergonomic but adds a dependency. curl works everywhere. Skill should document both, prefer curl for portability.
  2. Audio hosting for custom voices — Best approach? Options: python3 -m http.server (temporary), here.now skill (public), ngrok/Pinggy tunnel, cloud storage (S3). Need a reliable default.
  3. Rate limiting — Should the skill include explicit rate limit warnings? Twilio's default is 1 CPS. Agent could easily exceed this in a loop.
  4. Cost guardrails — Should we recommend a spending limit? Twilio supports account-level spending limits in the console.
  5. SMS as send_message target? — Should Phase 2 add SMS as a first-class delivery target in the send_message tool (alongside Telegram, Discord)? This would mean adding to gateway/config.py Platform enum.
  6. Incoming SMS → full gateway? — If we support incoming SMS in Phase 3, should it be a full gateway platform (like Telegram) or a lighter integration?
  7. Safety policy — Should we add a confirmation step before making phone calls? ("I'm about to call +1555... — proceed?") Or trust the user's intent?
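Questions 3, 4 and 7 could share one pre-flight check that runs before any call or text. The sketch below is one possible shape, not a decided policy: `OutboundGuard`, the allowlist of user-confirmed numbers, and the daily cap are all hypothetical.

```python
import re

# E.164: "+" followed by up to 15 digits, the first of which is not 0.
E164_PATTERN = re.compile(r"^\+[1-9]\d{1,14}$")


class OutboundGuard:
    """Hypothetical pre-flight check for outgoing calls and texts."""

    def __init__(self, allowed_numbers, max_per_day):
        self.allowed_numbers = set(allowed_numbers)
        self.max_per_day = max_per_day
        self._sent = {}  # date string -> number of approved actions

    def check(self, to_number, today):
        """Return (ok, reason); the action is counted only when ok is True."""
        if not E164_PATTERN.match(to_number):
            return False, "not an E.164 number"
        if to_number not in self.allowed_numbers:
            return False, "number needs user confirmation"
        if self._sent.get(today, 0) >= self.max_per_day:
            return False, "daily limit reached"
        self._sent[today] = self._sent.get(today, 0) + 1
        return True, "ok"
```

With this shape, a confirmation prompt is needed only the first time a number is used, and a runaway loop stops at the daily cap rather than at the account balance.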
