Feature: Twilio Skill — Phone Calls, SMS, and Voice from the Agent

## Overview

Give Hermes Agent a phone number. Users in the Nous Research Discord have demonstrated one of the most compelling agent use cases: **an AI that can call your phone**. A user recently gave Hermes on Kimi K-2.5 a single prompt — "Download TTS, spin up its own voice, then call my phone with the single weirdest joke it could invent" — and it actually did it.

This issue proposes a **Twilio skill** that teaches Hermes how to make outgoing phone calls (with TTS or custom audio), send SMS/MMS messages, and eventually receive incoming calls and texts. The combination of Hermes's existing `text_to_speech` tool + Twilio's Programmable Voice API creates a powerful voice-enabled agent that can reach out into the real world.

**Use cases:**
- Agent calls you with status updates, alerts, or results from long-running tasks
- Agent sends SMS notifications when deployments complete, cron jobs finish, or issues need attention
- Agent leaves voicemail-style messages with TTS-generated briefings
- Agent makes calls with custom AI-generated voices (via existing TTS tool → audio file → Twilio)
- Receive SMS commands to trigger agent actions (lightweight alternative to full messaging gateways)
- Phone-based two-factor authentication flows for agent-managed services
- Automated phone tree navigation (agent calls a support line, navigates IVR menus with DTMF tones)

---

## Research Findings

### Twilio API Architecture

Twilio's REST API is straightforward — everything is curl-able with Basic Auth:

**Make a phone call with TTS:**
```bash
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Twiml=<Response><Say voice=\"Polly.Amy\">Hello! This is your Hermes Agent calling with an important update.</Say></Response>"
```
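
Agent-written message text can contain characters such as `&` or `<`, which would make inline TwiML invalid XML. The following is a minimal sketch of escaping text before embedding it in `<Say>`; the helper names `xml_escape` and `say_twiml` are hypothetical and not part of Twilio or Hermes.

```bash
# Hypothetical helper: escape the five XML special characters so arbitrary
# text can be embedded inside a TwiML element. "&" must be replaced first.
xml_escape() {
  printf '%s' "$1" | sed \
    -e 's/&/\&amp;/g' \
    -e 's/</\&lt;/g' \
    -e 's/>/\&gt;/g' \
    -e 's/"/\&quot;/g' \
    -e "s/'/\&apos;/g"
}

# Hypothetical helper: wrap the escaped text in a <Say> response.
say_twiml() {
  printf '<Response><Say voice="Polly.Amy">%s</Say></Response>' "$(xml_escape "$1")"
}
```

The result of `say_twiml "Tests & deploy finished"` could then be passed as the `Twiml` parameter of the call request.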

**Make a call with a pre-recorded audio file:**
```bash
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Twiml=<Response><Play>https://example.com/my-tts-audio.mp3</Play></Response>"
```

**Send an SMS:**
```bash
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Messages.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Body=Your deployment to production completed successfully. All 42 tests passing."
```

**Send MMS with image:**
```bash
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Messages.json" \
  -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
  --data-urlencode "To=+15558675310" \
  --data-urlencode "From=$TWILIO_PHONE_NUMBER" \
  --data-urlencode "Body=Here is the chart you requested" \
  --data-urlencode "MediaUrl=https://example.com/chart.png"
```
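
Because all four requests share the same URL shape and credentials, a skill could wrap them in one function. This is only a sketch: `twilio_post` and the `TWILIO_DRY_RUN` variable are hypothetical names introduced here. It passes every field through curl's `--data-urlencode` so that `+`, `&`, and spaces survive form encoding.

```bash
# Hypothetical wrapper. Usage: twilio_post <Calls|Messages> "Key=Value"...
# TWILIO_DRY_RUN is an assumed variable for this sketch, not a Twilio setting.
twilio_post() {
  resource="$1"   # "Calls" or "Messages"
  shift
  url="https://api.twilio.com/2010-04-01/Accounts/${TWILIO_ACCOUNT_SID}/${resource}.json"

  if [ "${TWILIO_DRY_RUN:-0}" = "1" ]; then
    # Show the target and fields without contacting Twilio or printing the token.
    printf 'POST %s\n' "$url"
    printf '  %s\n' "$@"
    return 0
  fi

  # Rotate through the original arguments, re-appending each one prefixed
  # with --data-urlencode so curl percent-encodes the value.
  remaining=$#
  while [ "$remaining" -gt 0 ]; do
    set -- "$@" --data-urlencode "$1"
    shift
    remaining=$((remaining - 1))
  done

  curl -sS -X POST "$url" \
    -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}" \
    "$@"
}
```

For example, `TWILIO_DRY_RUN=1 twilio_post Messages "To=+15558675310" "From=$TWILIO_PHONE_NUMBER" "Body=Hello"` prints the request it would make instead of sending it.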

### Key Technical Details

**Authentication:** HTTP Basic Auth with Account SID + Auth Token. Environment variables: `TWILIO_ACCOUNT_SID`, `TWILIO_AUTH_TOKEN`, `TWILIO_PHONE_NUMBER`.

**TwiML (Twilio Markup Language):** XML instructions for call behavior. Key verbs:
- `<Say>` — Text-to-speech (supports 26+ voices including Amazon Polly voices)
- `<Play>` — Play an audio file from a URL
- `<Gather>` — Collect DTMF input (phone keypad) or speech input
- `<Record>` — Record the caller's voice
- `<Dial>` — Connect to another phone number
- `<Pause>` — Wait N seconds

**Call flow with custom audio (Hermes TTS → Twilio):**
```
1. Hermes generates TTS audio via text_to_speech tool → local .mp3 file
2. Upload audio to a publicly accessible URL (e.g., via here.now skill #378,
   a simple Python HTTP server exposed through a tunnel, or cloud storage)
3. Make Twilio call with <Play> pointing to the audio URL
4. Twilio calls the recipient and plays the custom AI voice
```
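
Steps 2 and 3 of this flow could be sketched as follows. The helper names `play_twiml` and `audio_url_reachable` are hypothetical. The reachability check runs from the local machine, so it cannot prove that Twilio's servers can fetch the file; it only catches obvious mistakes such as a 404.

```bash
# Hypothetical helper: build <Play> TwiML for an audio URL. Query strings may
# contain "&", which has to be escaped to keep the XML valid.
play_twiml() {
  escaped_url="$(printf '%s' "$1" | sed \
    -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')"
  printf '<Response><Play>%s</Play></Response>' "$escaped_url"
}

# Hypothetical helper: succeed only if a HEAD request to the URL returns a
# non-error HTTP status within 10 seconds (checked from this machine only).
audio_url_reachable() {
  curl -fsSI --max-time 10 "$1" >/dev/null
}
```

A skill might call `audio_url_reachable "$url"` first and only place the call with `play_twiml "$url"` if it succeeds.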

**Twilio CLI alternative:** The `twilio` CLI tool provides even simpler commands:
```bash
twilio api:core:messages:create --from "$TWILIO_PHONE_NUMBER" --to "+15558675310" --body "Hello from Hermes"
twilio api:core:calls:create --from "$TWILIO_PHONE_NUMBER" --to "+15558675310" --twiml "<Response><Say>Hello</Say></Response>"
```

**Pricing context:**
- Outgoing calls: ~$0.014/min (US), ~$0.02-0.05/min (international)
- SMS: ~$0.0079/message (US)
- Phone numbers: ~$1.15/month (US local)
- Free trial: $15.50 credit with a trial account

**Rate limits:**
- Outbound calls: 1 call per second (CPS) per account by default, up to 5 CPS for verified accounts
- SMS: varies by number type (10DLC, toll-free, short code)
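
One way an agent could stay under the default limit when contacting several recipients is to pause between requests. This is a rough sketch; `throttled_each` is a hypothetical helper, and the one-second delay assumes the 1 CPS default quoted above.

```bash
# Hypothetical helper: run a command once per recipient, sleeping between
# iterations. Stops at the first failure so a broken loop cannot keep dialing.
# Usage: throttled_each <delay_seconds> <command> <recipient>...
throttled_each() {
  delay="$1"
  cmd="$2"
  shift 2
  first=1
  for recipient in "$@"; do
    if [ "$first" -eq 0 ]; then
      sleep "$delay"
    fi
    first=0
    "$cmd" "$recipient" || return 1
  done
}
```

Here `<command>` would be a function that places one call or sends one message, for example `throttled_each 1 place_call +15558675310 +15551234567`, where `place_call` is whatever the skill defines.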

### Integration with Existing Hermes Tools

**`text_to_speech` tool** — Hermes already generates speech audio. The natural flow:
1. Agent writes the message text
2. `text_to_speech` generates audio with the user's configured voice
3. Audio is uploaded to a public URL
4. Twilio call uses `<Play>` to deliver the custom voice

**`image_generate` tool** — Agent generates images, sends via MMS

**`schedule_cronjob` tool** — Schedule recurring phone calls or SMS:
- "Call me every morning at 8am with a weather briefing"
- "Text me when the server health check fails"

**`send_message` tool** — SMS could be an additional delivery target alongside Telegram/Discord

---

## Current State in Hermes Agent

**What we already have:**
- `text_to_speech` tool — generates audio files, returns paths/URLs
- `send_message` tool — sends to Telegram, Discord, etc. (no SMS)
- `schedule_cronjob` — recurring tasks (could schedule calls/texts)
- `terminal` tool — can run curl commands (so technically Twilio works already, but clunky)
- #314 — Voice Mode issue (speech input/output for CLI, related but different scope)
- #378 — here.now skill (could host audio files for Twilio to play)

**What's missing:**
- No skill teaching the agent how to use Twilio
- No structured pattern for making phone calls
- No SMS sending capability
- No integration between TTS output and phone delivery

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **skill** (Phase 1-2) because:
- All Twilio operations can be done via curl to the REST API or the `twilio` CLI
- Credentials are managed via environment variables (standard pattern)
- No binary data handling — audio URLs and text are all that's needed
- No streaming or real-time events for outgoing operations
- The agent already has `terminal` to run curl commands; the skill teaches it how

**Phase 3 (incoming)** may warrant a lightweight **tool** or **gateway adapter** if we want to handle incoming calls/SMS as a full messaging platform. But Phase 1-2 as a skill is clean and sufficient.

**Bundled vs Skills Hub:** **Skills Hub** initially. Twilio is a paid service that not everyone uses. Once community adoption proves demand, consider bundling; the google-workspace skill sets a precedent, since it is bundled despite requiring API setup.

### What We'd Need

1. **`twilio` skill** — Instructions for making calls, sending SMS/MMS, checking status
2. **Environment variable documentation** — TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, TWILIO_PHONE_NUMBER
3. **TwiML templates** — Common call flows (say message, play audio, gather input)
4. **Integration patterns** — How to combine with text_to_speech, image_generate, schedule_cronjob
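
The skill could start with a preflight check for the three environment variables, so the agent fails early with a clear message instead of sending a malformed request. A minimal sketch, with `twilio_env_missing` as a hypothetical helper name:

```bash
# Hypothetical helper: print the names of any required variables that are
# unset or empty, separated by spaces. Prints nothing when all are present.
twilio_env_missing() {
  missing=""
  for name in TWILIO_ACCOUNT_SID TWILIO_AUTH_TOKEN TWILIO_PHONE_NUMBER; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  printf '%s' "${missing# }"
}
```

A caller might write `m="$(twilio_env_missing)"` and abort with `Missing: $m` when the result is non-empty.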

### Phased Rollout

**Phase 1: Outgoing Calls + SMS (MVP)**
- Skill teaching curl-based Twilio API usage
- Make phone calls with Twilio's built-in TTS (`<Say>` verb)
- Make phone calls with custom audio (`<Play>` verb + TTS tool integration)
- Send SMS messages
- Send MMS with media URLs
- Check call/message status
- Templates for common TwiML patterns
- Deliverables: `twilio` skill with reference templates
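
For the status check, the create requests return a SID that can be fetched again later; the returned JSON includes a `status` field. The sketch below picks the resource from the SID prefix (`CA` for calls, `SM` or `MM` for messages). The helper names are hypothetical.

```bash
# Hypothetical helper: map a SID to its REST resource URL.
twilio_resource_url() {
  case "$1" in
    CA*)     kind="Calls" ;;
    SM*|MM*) kind="Messages" ;;
    *)       echo "unrecognized SID prefix: $1" >&2; return 1 ;;
  esac
  printf 'https://api.twilio.com/2010-04-01/Accounts/%s/%s/%s.json' \
    "$TWILIO_ACCOUNT_SID" "$kind" "$1"
}

# Hypothetical helper: fetch the resource as JSON (performs a network call).
twilio_status() {
  url="$(twilio_resource_url "$1")" || return 1
  curl -sS "$url" -u "${TWILIO_ACCOUNT_SID}:${TWILIO_AUTH_TOKEN}"
}
```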

**Phase 2: Advanced Voice + Notifications**
- Integration with `schedule_cronjob` for recurring calls/texts
- Phone call with DTMF navigation (`<Gather>` for navigating IVR menus)
- Call recording and playback
- Multi-step call flows (say something, wait for keypad input, respond)
- SMS as a delivery target for `send_message` tool
- Conference calls (connect multiple numbers)
- Voicemail detection and handling
- Deliverables: Enhanced skill + send_message SMS adapter
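
A multi-step flow with keypad input could use a TwiML template along these lines. This is a sketch only: `gather_twiml` is a hypothetical helper, the `action` URL must point at a webhook that Twilio can reach (which this proposal defers to Phase 3), and the prompt and URL are assumed to be XML-safe already.

```bash
# Hypothetical helper: say a prompt, wait up to 5 seconds for one keypress,
# and POST the digit to action_url. If nothing is pressed, say goodbye.
# Assumes both arguments are already XML-escaped.
gather_twiml() {
  prompt="$1"
  action_url="$2"
  cat <<EOF
<Response>
  <Gather numDigits="1" timeout="5" action="$action_url" method="POST">
    <Say>$prompt</Say>
  </Gather>
  <Say>No input received. Goodbye.</Say>
</Response>
EOF
}
```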

**Phase 3: Incoming Calls + SMS Gateway**
- Receive incoming SMS as agent messages (Twilio webhook → Hermes)
- Receive incoming calls with voice interaction (Twilio → STT → agent → TTS → Twilio)
- This becomes a full telephony gateway platform (like Telegram/Discord but for phone)
- Would need a webhook server (could use Pinggy #361 or a simple Flask/FastAPI server)
- Deliverables: `gateway/platforms/twilio.py` telephony adapter

**Phase 4: Conversational Voice Agent**
- Real-time voice conversations over phone (bidirectional audio streaming)
- Twilio Media Streams (WebSocket-based audio streaming)
- Integration with #314 Voice Mode for consistent voice experience
- Low-latency STT → agent → TTS pipeline
- Deliverables: Streaming voice tool + Twilio Media Streams integration

---

## Example Workflows

### 1. "Call me with a joke" (what the Discord user did)
```
User: Call +15551234567 with the weirdest joke you can think of

Agent:
1. Compose the joke
2. text_to_speech("Why did the quantum physicist break up with the biologist?...")
   → ~/.hermes/audio_cache/joke_abc123.mp3
3. Upload audio to public URL (python -m http.server or here.now)
4. curl -X POST Twilio Calls API with <Play> pointing to the audio URL
5. "Done! Calling +15551234567 now. Call SID: CA123..."
```

### 2. "Text me when the deploy finishes"
```
User: Deploy to production and text me at +15551234567 when it's done

Agent:
1. Run deployment commands
2. Verify deployment succeeded
3. curl -X POST Twilio Messages API
   Body: "✅ Deployment to production completed. 42/42 tests passing. 
          Deployed commit abc123 at 3:45 PM."
```
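
Steps 2 and 3 could be tied together by deriving the SMS body from the deployment's exit code, so a failed deploy also produces a text. A minimal sketch; `deploy_message` and the `./deploy.sh` script in the comments are hypothetical.

```bash
# Hypothetical helper: build the SMS body from an exit code and commit id.
deploy_message() {
  exit_code="$1"
  commit="$2"
  if [ "$exit_code" -eq 0 ]; then
    printf 'Deployment to production completed. Deployed commit %s.' "$commit"
  else
    printf 'Deployment to production FAILED (exit code %s) at commit %s.' \
      "$exit_code" "$commit"
  fi
}

# Possible usage (not executed here; ./deploy.sh is an assumed script):
#   ./deploy.sh; code=$?
#   body="$(deploy_message "$code" "$(git rev-parse --short HEAD)")"
#   ...then send "$body" as the Body field of the Messages request.
```

Keeping the body plain ASCII may also be worth considering, since characters outside the GSM-7 alphabet (such as emoji) reduce how much text fits in one SMS segment.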

### 3. Scheduled morning briefing
```
User: Every morning at 8am, call me with a weather and calendar summary

Agent:
1. schedule_cronjob:
   schedule: "0 8 * * *"
   prompt: "Check the weather for Las Vegas and my Google Calendar. 
            Generate a TTS briefing and call +15551234567 with the summary 
            using the twilio skill."
```

### 4. Navigate a phone tree
```
User: Call my ISP at +18001234567 and navigate to the billing department

Agent:
1. curl Twilio Calls API with:
   Twiml: <Response>
     <Dial>
       <Number sendDigits="wwww1ww2ww#">+18001234567</Number>
     </Dial>
   </Response>
   (once the ISP answers: waits 2 s, presses 1, waits 1 s, presses 2, waits 1 s, presses #;
    each "w" is a 0.5 s pause)
```
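
Hand-writing `sendDigits` strings is error-prone, so the skill could generate them from a list of key presses. The sketch below assumes a fixed wait before every key and relies on each `w` meaning a 0.5-second pause; `build_send_digits` is a hypothetical helper.

```bash
# Hypothetical helper: prefix every key with the same pause.
# Usage: build_send_digits <wait_seconds> <key>...
# Each "w" is a 0.5 s pause, so N seconds needs 2*N of them.
build_send_digits() {
  wait_seconds="$1"
  shift
  pause=""
  i=0
  while [ "$i" -lt $((wait_seconds * 2)) ]; do
    pause="${pause}w"
    i=$((i + 1))
  done
  out=""
  for key in "$@"; do
    out="${out}${pause}${key}"
  done
  printf '%s' "$out"
}
```

Real IVR menus vary in how long their prompts run, so the waits would need tuning per phone tree.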

---

## Pros & Cons

### Pros
- **Reach beyond the screen** — Phone calls and SMS reach any phone, even one with no data connection, so the agent can reach you when you're away from your computer.
- **Proven viral appeal** — The Discord demo shows this captures people's imagination. "My AI called my phone" is a powerful demo.
- **Simple API** — Twilio's REST API is just curl with Basic Auth. No complex SDKs needed.
- **Composable with existing tools** — TTS + Twilio = custom voice calls. image_generate + MMS = visual notifications. Cronjobs + calls = scheduled briefings.
- **Low barrier** — Twilio has a free trial ($15.50 credit), costs are pennies per call/text.
- **Natural extension of send_message** — SMS is just another delivery channel alongside Telegram/Discord.

### Cons / Risks
- **Cost** — Unlike Telegram/Discord (free), Twilio costs money. Runaway agent loops could rack up charges.
- **Phone number requirement** — Need to buy a Twilio number ($1.15/month). Requires Twilio account setup.
- **Abuse potential** — An agent with phone call capabilities could be used for spam or harassment. Need clear safety guidelines and rate limiting.
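- **Compliance** — US carriers require A2P 10DLC registration for application SMS sent from local (10-digit) numbers, and toll-free numbers need verification before they can send at volume. The process can be bureaucratic.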
- **Audio hosting for custom TTS** — The agent's TTS output is a local file. Needs to be uploaded to a public URL for Twilio to play. This is a friction point (but solvable with here.now #378 or a simple HTTP server).
- **Incoming calls complexity** — Phase 3 (receiving) requires a webhook server, which is more infrastructure than a pure skill can handle.

---

## Open Questions

1. **Twilio CLI vs curl?** The `twilio` CLI is more ergonomic but adds a dependency. curl works everywhere. Skill should document both, prefer curl for portability.
2. **Audio hosting for custom voices** — Best approach? Options: `python3 -m http.server` (temporary), here.now skill (public), ngrok/Pinggy tunnel, cloud storage (S3). Need a reliable default.
3. **Rate limiting** — Should the skill include explicit rate limit warnings? Twilio's default is 1 CPS. Agent could easily exceed this in a loop.
4. **Cost guardrails** — Should we recommend a spending limit? Twilio supports account-level spending limits in the console.
5. **SMS as send_message target?** — Should Phase 2 add SMS as a first-class delivery target in the `send_message` tool (alongside Telegram, Discord)? This would mean adding to `gateway/config.py` Platform enum.
6. **Incoming SMS → full gateway?** — If we support incoming SMS in Phase 3, should it be a full gateway platform (like Telegram) or a lighter integration?
7. **Safety policy** — Should we add a confirmation step before making phone calls? ("I'm about to call +1555... — proceed?") Or trust the user's intent?
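
As one possible input to the safety discussion, the skill could validate numbers and restrict recipients to a user-configured allowlist before any call or text. This is a sketch under assumptions: `TWILIO_ALLOWED_NUMBERS` is a hypothetical comma-separated variable, and both helper names are invented for illustration.

```bash
# Hypothetical helper: accept only E.164 numbers, i.e. "+" followed by
# 2 to 15 digits with a non-zero first digit.
is_e164() {
  printf '%s\n' "$1" | grep -Eq '^\+[1-9][0-9]{1,14}$'
}

# Hypothetical helper: succeed only if the number is valid E.164 and appears
# in the comma-separated TWILIO_ALLOWED_NUMBERS list (an assumed variable).
is_allowed_recipient() {
  is_e164 "$1" || return 1
  case ",${TWILIO_ALLOWED_NUMBERS:-}," in
    *",$1,"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

With an empty allowlist every recipient is rejected, which gives a safe default until the user opts in.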

---

## References

- [Twilio Voice API — Make Outbound Calls](https://www.twilio.com/docs/voice/tutorials/how-to-make-outbound-phone-calls)
- [Twilio Call Resource API Reference](https://www.twilio.com/docs/voice/api/call-resource)
- [Twilio SMS Tutorial — Send Messages](https://www.twilio.com/docs/messaging/tutorials/how-to-send-sms-messages)
- [Twilio CLI Quickstart](https://www.twilio.com/docs/twilio-cli/quickstart)
- [TwiML Reference — Say Verb](https://www.twilio.com/docs/voice/twiml/say)
- [TwiML Reference — Play Verb](https://www.twilio.com/docs/voice/twiml/play)
- Hermes `tools/tts_tool.py` — Existing text-to-speech tool
- Hermes `tools/send_message_tool.py` — Existing cross-platform messaging
- #314 — Voice Mode (related: speech I/O for CLI)
- #378 — here.now skill (related: audio file hosting for Twilio playback)
- #361 — Pinggy skill (related: localhost tunnels for webhook receiving)

