Give Hermes Agent a phone number. Users in the Nous Research Discord have demonstrated one of the most compelling agent use cases: an AI that can call your phone. A user recently gave Hermes on Kimi K-2.5 a single prompt — "Download TTS, spin up its own voice, then call my phone with the single weirdest joke it could invent" — and it actually did it.
This issue proposes a Twilio skill that teaches Hermes how to make outgoing phone calls (with TTS or custom audio), send SMS/MMS messages, and eventually receive incoming calls and texts. The combination of Hermes's existing text_to_speech tool + Twilio's Programmable Voice API creates a powerful voice-enabled agent that can reach out into the real world.
Use cases:
Agent calls you with status updates, alerts, or results from long-running tasks
Agent sends SMS notifications when deployments complete, cron jobs finish, or issues need attention
Agent leaves voicemail-style messages with TTS-generated briefings
Agent makes calls with custom AI-generated voices (via existing TTS tool → audio file → Twilio)
Receive SMS commands to trigger agent actions (lightweight alternative to full messaging gateways)
Phone-based two-factor authentication flows for agent-managed services
Automated phone tree navigation (agent calls a support line, navigates IVR menus with DTMF tones)
Research Findings
Twilio API Architecture
Twilio's REST API is straightforward — everything is curl-able with Basic Auth:
Make a phone call with TTS:
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Twiml=<Response><Say voice=\"Polly.Amy\">Hello! This is your Hermes Agent calling with an important update.</Say></Response>"
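The same request can be assembled with Python's standard library, which may be easier to reuse from a helper script than shell quoting. This is a minimal sketch rather than the skill's implementation: the name build_call_request is hypothetical, and the request is only constructed here, never sent.

```python
import base64
import urllib.parse
import urllib.request

API_BASE = "https://api.twilio.com/2010-04-01"


def build_call_request(account_sid, auth_token, from_number, to_number, twiml):
    """Build (but do not send) a POST to the Calls endpoint used by the curl example."""
    # Hypothetical helper: form-encode the same three fields the curl example passes.
    form = urllib.parse.urlencode(
        {"To": to_number, "From": from_number, "Twiml": twiml}
    )
    request = urllib.request.Request(
        f"{API_BASE}/Accounts/{account_sid}/Calls.json",
        data=form.encode("ascii"),
        method="POST",
    )
    # HTTP Basic Auth is base64("sid:token"), the same thing curl's -u flag produces.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    return request
```

Passing the returned object to urllib.request.urlopen would place a real, billable call, so the sketch stops at construction.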
TwiML (Twilio Markup Language): XML instructions for call behavior. Key verbs:
<Say> — Text-to-speech (supports 26+ voices including Amazon Polly voices)
<Play> — Play an audio file from a URL
<Gather> — Collect DTMF input (phone keypad) or speech input
<Record> — Record the caller's voice
<Dial> — Connect to another phone number
<Pause> — Wait N seconds
Call flow with custom audio (Hermes TTS → Twilio):
1. Hermes generates TTS audio via text_to_speech tool → local .mp3 file
2. Upload audio to a publicly accessible URL (e.g., via here.now skill #378,
or a simple python HTTP server, or cloud storage)
3. Make Twilio call with <Play> pointing to the audio URL
4. Twilio calls the recipient and plays the custom AI voice
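Step 3 of the flow above needs a TwiML document that wraps the audio URL. A hedged sketch of how that document could be generated follows; play_twiml is a hypothetical helper name, and building the XML with ElementTree (instead of string formatting) is a design choice made here so that characters such as & in a signed storage URL are escaped correctly.

```python
import xml.etree.ElementTree as ET


def play_twiml(audio_url, pause_seconds=0):
    """Return TwiML that optionally pauses, then plays the audio at audio_url."""
    response = ET.Element("Response")
    if pause_seconds > 0:
        # <Pause length="N"/> waits N seconds before the next verb runs.
        ET.SubElement(response, "Pause", length=str(pause_seconds))
    play = ET.SubElement(response, "Play")
    play.text = audio_url  # ElementTree escapes &, < and > when serializing
    return ET.tostring(response, encoding="unicode")
```

The returned string is what would be sent as the Twiml form field of the Calls request.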
Twilio CLI alternative: The twilio CLI tool provides even simpler commands:
twilio api:core:messages:create --from "+15017122661" --to "+15558675310" --body "Hello from Hermes"
twilio api:core:calls:create --from "+15017122661" --to "+15558675310" --twiml "<Response><Say>Hello</Say></Response>"
What's missing: no integration between TTS output and phone delivery.
Implementation Plan
Skill vs. Tool Classification
This should be a skill (Phase 1-2) because:
All Twilio operations can be done via curl to the REST API or the twilio CLI
Credentials are managed via environment variables (standard pattern)
No binary data handling — audio URLs and text are all that's needed
No streaming or real-time events for outgoing operations
The agent already has terminal to run curl commands; the skill teaches it how
Phase 3 (incoming) may warrant a lightweight tool or gateway adapter if we want to handle incoming calls/SMS as a full messaging platform. But Phase 1-2 as a skill is clean and sufficient.
Bundled vs Skills Hub: Skills Hub initially. Twilio is a paid service that not everyone uses. Once community adoption proves demand, consider bundling. (Similar rationale to the google-workspace skill, which is bundled despite requiring API setup.)
What We'd Need
twilio skill — Instructions for making calls, sending SMS/MMS, checking status
Deliverables: Streaming voice tool + Twilio Media Streams integration
Example Workflows
1. "Call me with a joke" (what the Discord user did)
User: Call +15551234567 with the weirdest joke you can think of
Agent:
1. Compose the joke
2. text_to_speech("Why did the quantum physicist break up with the biologist?...")
→ ~/.hermes/audio_cache/joke_abc123.mp3
3. Upload audio to public URL (here.now, or python3 -m http.server behind a tunnel)
4. curl -X POST Twilio Calls API with <Play> pointing to the audio URL
5. "Done! Calling +15551234567 now. Call SID: CA123..."
2. "Text me when the deploy finishes"
User: Deploy to production and text me at +15551234567 when it's done
Agent:
1. Run deployment commands
2. Verify deployment succeeded
3. curl -X POST Twilio Messages API
Body: "✅ Deployment to production completed. 42/42 tests passing.
Deployed commit abc123 at 3:45 PM."
3. Scheduled morning briefing
User: Every morning at 8am, call me with a weather and calendar summary
Agent:
1. schedule_cronjob:
schedule: "0 8 * * *"
prompt: "Check the weather for Las Vegas and my Google Calendar.
Generate a TTS briefing and call +15551234567 with the summary
using the twilio skill."
4. Navigate a phone tree
User: Call my ISP at +18001234567 and navigate to the billing department
Agent:
1. curl Twilio Calls API with:
Twiml: <Response>
<Dial><Number sendDigits="wwww1ww2ww#">+18001234567</Number></Dial>
</Response>
(waits 2 s, presses 1, waits 1 s, presses 2, waits 1 s, presses #; each w is a 0.5-second pause)
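The digit string in this workflow is easy to get wrong by hand, since each lowercase w stands for a half-second pause. The sketch below shows one way the agent could derive it from a list of (wait, keys) steps; build_send_digits and ALLOWED_KEYS are hypothetical names, and the wait times themselves would have to be guessed or measured for each phone tree.

```python
ALLOWED_KEYS = set("0123456789*#")


def build_send_digits(steps):
    """Turn [(wait_seconds, keys), ...] into a sendDigits string.

    Each lowercase "w" is a 0.5-second pause, so a wait is rounded
    to the nearest half second.
    """
    parts = []
    for wait_seconds, keys in steps:
        if not set(keys) <= ALLOWED_KEYS:
            raise ValueError(f"unsupported keypad characters in {keys!r}")
        parts.append("w" * round(wait_seconds * 2) + keys)
    return "".join(parts)
```

For the ISP example, build_send_digits([(2, "1"), (1, "2"), (1, "#")]) yields the same string used in the workflow.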
Pros & Cons
Pros
Reach beyond the screen — Phone calls and SMS work everywhere, even without internet. The agent can reach you when you're away from your computer.
Proven viral appeal — The Discord demo shows this captures people's imagination. "My AI called my phone" is a powerful demo.
Simple API — Twilio's REST API is just curl with Basic Auth. No complex SDKs needed.
Cons / Risks
Incoming calls complexity — Phase 3 (receiving) requires a webhook server, which is more infrastructure than a pure skill can handle.
Open Questions
Twilio CLI vs curl? The twilio CLI is more ergonomic but adds a dependency. curl works everywhere. Skill should document both, prefer curl for portability.
Audio hosting for custom voices — Best approach? Options: python3 -m http.server (temporary), here.now skill (public), ngrok/Pinggy tunnel, cloud storage (S3). Need a reliable default.
Rate limiting — Should the skill include explicit rate limit warnings? Twilio's default is 1 call per second (CPS). Agent could easily exceed this in a loop.
Cost guardrails — Should we recommend a spending limit? Twilio supports account-level spending limits in the console.
SMS as send_message target? — Should Phase 2 add SMS as a first-class delivery target in the send_message tool (alongside Telegram, Discord)? This would mean adding to gateway/config.py Platform enum.
Incoming SMS → full gateway? — If we support incoming SMS in Phase 3, should it be a full gateway platform (like Telegram) or a lighter integration?
Safety policy — Should we add a confirmation step before making phone calls? ("I'm about to call +1555... — proceed?") Or trust the user's intent?
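To make the rate-limiting question concrete, here is one possible guard a helper script could place in front of every outgoing call. It is only a sketch under assumptions: CallThrottle is a hypothetical name, the one-second default mirrors the 1 CPS figure quoted above, and the clock and sleep functions are injectable purely so the behaviour can be checked without real waiting.

```python
import time


class CallThrottle:
    """Enforce a minimum spacing between consecutive outgoing calls."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling throttle.wait() immediately before each Calls request keeps a loop at or below the configured rate; it does nothing about cost, which is the separate guardrail question above.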
Make a call with a pre-recorded audio file:
Send an SMS:
Send MMS with image:
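For the SMS and MMS requests named above, the form body differs from a call only in its fields: Body carries the text, and one MediaUrl field per attachment turns the message into an MMS. The sketch below builds that body without sending it; message_form and messages_endpoint are hypothetical helper names, and authentication would be the same Basic Auth used for calls.

```python
import urllib.parse

API_BASE = "https://api.twilio.com/2010-04-01"


def messages_endpoint(account_sid):
    """URL of the Messages resource for the given account."""
    return f"{API_BASE}/Accounts/{account_sid}/Messages.json"


def message_form(to_number, from_number, body, media_urls=()):
    """Form-encode an SMS; passing media_urls makes it an MMS."""
    fields = [("To", to_number), ("From", from_number), ("Body", body)]
    # MediaUrl may be repeated, once per attachment.
    fields += [("MediaUrl", url) for url in media_urls]
    return urllib.parse.urlencode(fields)
```

The encoded string is what would be POSTed to messages_endpoint(account_sid).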
Key Technical Details
Authentication: HTTP Basic Auth with Account SID + Auth Token. Environment variables: TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER.
Pricing context:
Rate limits:
Integration with Existing Hermes Tools
text_to_speech tool — Hermes already generates speech audio. The natural flow:
text_to_speech generates audio with the user's configured voice
… <Play> to deliver the custom voice
image_generate tool — Agent generates images, sends via MMS
schedule_cronjob tool — Schedule recurring phone calls or SMS
send_message tool — SMS could be an additional delivery target alongside Telegram/Discord
Current State in Hermes Agent
What we already have:
text_to_speech tool — generates audio files, returns paths/URLs
send_message tool — sends to Telegram, Discord, etc. (no SMS)
schedule_cronjob — recurring tasks (could schedule calls/texts)
terminal tool — can run curl commands (so technically Twilio works already, but clunky)
Phased Rollout
Phase 1: Outgoing Calls + SMS (MVP)
… <Say> verb)
… <Play> verb + TTS tool integration)
… twilio skill with reference templates
Phase 2: Advanced Voice + Notifications
… schedule_cronjob for recurring calls/texts
… <Gather> for navigating IVR menus)
… send_message tool
Phase 3: Incoming Calls + SMS Gateway
… gateway/platforms/twilio.py telephony adapter
Phase 4: Conversational Voice Agent
References
tools/tts_tool.py — Existing text-to-speech tool
tools/send_message_tool.py — Existing cross-platform messaging