Redefining MCP from Data Access to Physical Body Control
AgentShell enables cloud-based AI to possess physical PTZ cameras as embodied agents, transforming how AI interacts with the physical world. Instead of constant surveillance, the AI appears only when called: seeing, speaking, and acting through autonomous camera control.
Core Innovation: We've redefined the Model Context Protocol (MCP) from "information retrieval" to "body control," giving AI physical presence through vision, voice, and movement.
One Mind, Many Bodies
A single cloud AI consciousness (Amazon Bedrock AgentCore + Strands Agents) controls multiple physical "bodies" (PTZ cameras) through the Model Context Protocol, redefined from "information retrieval" to "body control."
We've transformed MCP from a data access protocol into a physical control interface, giving AI:
- Vision: Snapshot capture & multimodal analysis (Amazon Bedrock Nova)
- Voice: Natural speech synthesis & recognition (AWS Polly/Transcribe)
- Movement: PTZ control, nods, and head shakes for emotional expression (ONVIF)
- Autonomy: Self-selects cameras and tools via agentic loop (Strands Agents)
- Security: Secure remote control without exposing camera ports
- Scalability: Horizontal scaling by simply adding MCP servers
Key Paradigm: A single cloud consciousness inhabits multiple camera bodies only when needed, executing see → think → speak → listen cycles autonomously (a minimal tool sketch follows below).
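To make "body control over MCP" concrete, here is a minimal sketch of a single body-control tool exposed with FastMCP. The server name, port, and tool body are illustrative assumptions, not the project's actual `mcp_server/server.py`:

```python
# Minimal sketch (assumptions: FastMCP 2.x, SSE transport, placeholder behavior).
from fastmcp import FastMCP

mcp = FastMCP("camera1-body")  # one MCP server per physical body

@mcp.tool()
def camera1_speak_on_camera(text: str, voice: str = "Matthew") -> str:
    """Speak `text` through the camera speaker (a body action, not a data lookup)."""
    # The real project would synthesize audio with AWS Polly and push it via go2rtc;
    # this placeholder only reports what would have happened.
    return f"[camera1] spoke with voice {voice}: {text}"

if __name__ == "__main__":
    # Expose the body over SSE so the cloud agent can reach it through a tunnel.
    mcp.run(transport="sse", host="127.0.0.1", port=9006)
```

The point is that an MCP tool can be an action on a physical device rather than a context lookup.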
Transform affordable surveillance cameras into AI agents
AgentShell brings enterprise-grade AI capabilities to consumer-level PTZ cameras:
- Hardware: ~$20-40 per camera (Tapo C210, similar ONVIF cameras)
- AI Power: Full agentic capabilities via cloud (Bedrock AgentCore + Strands)
- Reuse existing equipment: No need for specialized, expensive hardware
- Pay-per-use: AWS charges only when the agent is active (not 24/7 surveillance)
- Scale economically: Each additional camera costs only ~$20-40
Cost comparison:
- Traditional AI camera systems: $500-2000+ per camera (dedicated hardware)
- AgentShell approach: $20-40 camera + cloud compute (only when active)
- Up to 50x cost reduction while gaining more flexibility and intelligence
Real-world example:
- 5-camera home setup: ~$100-200 (cameras) vs $2,500-10,000 (traditional AI cameras)
- Business deployment: 20 cameras for ~$400-800 vs $10,000-40,000
- No maintenance costs for on-premise AI servers
- Automatic updates through cloud AI improvements
Democratization: Makes AI agent technology accessible to homes, small businesses, and developing regions.
First project to demonstrate MCP as a "body interface" rather than just data access. Traditional MCP enables AI to access data and context. We repurposed it for physical controlβproving AI can control physical devices through standard protocols.
True "One Mind, Many Bodies" implementation:
- Single consciousness inhabits multiple physical forms
- Autonomous camera and tool selection (no manual routing)
- Seamless possession transitions between bodies
- Strands Agents' agentic loop orchestrates MCP tool calls autonomously (see the sketch below)
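As a rough illustration, the sketch below connects one Strands agent to two MCP servers, one per camera body; the URLs match the sample `.env`, but the exact wiring in the project's code may differ:

```python
# Sketch only: one agent ("mind"), two MCP camera servers ("bodies").
from mcp.client.sse import sse_client
from strands import Agent
from strands.tools.mcp import MCPClient

living_room = MCPClient(lambda: sse_client("http://127.0.0.1:9006/sse/"))
front_door = MCPClient(lambda: sse_client("http://127.0.0.1:9007/sse/"))

with living_room, front_door:
    # Merge camera1_* and camera2_* tools into a single tool set.
    tools = living_room.list_tools_sync() + front_door.list_tools_sync()
    agent = Agent(tools=tools)
    # The agentic loop decides which body (camera) and tool to use.
    agent("Please ask the visitor at the front door what they need.")
```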
PTZ camera movements create emotional presence and natural interaction (a gesture sketch follows this list):
- Nods for agreement
- Head shakes for negation
- Gaze shifts to look around
- Multiple voices for character expression (AWS Polly: Matthew, Joanna, etc.)
- Visual + audio feedback builds trust
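A "nod", for instance, can be approximated by two short tilt moves over ONVIF. The sketch below assumes the python-onvif-zeep library and placeholder credentials; the project's actual `nod_head` implementation in `camera_utils/ptz.py` may differ:

```python
# Rough sketch of a nod gesture via ONVIF ContinuousMove (python-onvif-zeep assumed).
import time
from onvif import ONVIFCamera

def nod(ip: str, port: int, user: str, password: str) -> None:
    cam = ONVIFCamera(ip, port, user, password)
    media = cam.create_media_service()
    ptz = cam.create_ptz_service()
    token = media.GetProfiles()[0].token

    def tilt(speed: float, seconds: float) -> None:
        req = ptz.create_type("ContinuousMove")
        req.ProfileToken = token
        req.Velocity = {"PanTilt": {"x": 0.0, "y": speed}}  # y = tilt axis
        ptz.ContinuousMove(req)
        time.sleep(seconds)
        ptz.Stop({"ProfileToken": token})

    tilt(-0.5, 0.3)  # dip down briefly...
    tilt(0.5, 0.3)   # ...then come back up: a simple nod

nod("192.168.11.34", 2020, "your_username", "your_password")  # values from .env
```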
New paradigm: Not constant surveillance; the agent appears only when called
- Balances privacy and peace of mind
- Cost-efficient (cloud compute only when active)
- Trust-building experience
- No camera port exposure (RTSP/management API remain local)
- All control flows through authorized remote MCP
- One-way tunnel connection via ngrok/cloudflared
- Centralized operations and auditing
Adding MCP = Adding a body. No UI or workflow changes required.
- Same agent, same policy, infinite cameras
- Theoretically unlimited expansion
Enterprise AI capabilities with consumer-grade hardware (~$20-40/camera vs $500-2000+ for traditional AI cameras). Up to 50x cost reduction while maintaining full agentic intelligence. Transforms expensive technology into accessible solutions for everyone.
Watch AgentShell in action:
Situation: Visitor at the front door. Two cameras: living room + entrance.
Note: Alexa integration is currently in testing phase using Alexa Developer Console. Production deployment with real Alexa devices is planned for future releases.
- Doorbell rings
- You: "Alexa, start Agent Shell" (via Alexa Developer Console Test Simulator)
- Living room camera: "Hello! I'm Agent Shell. How can I assist you?"
- You: "Please ask the visitor at the front door what they need."
- Possession effect: Living → Door
- Front door camera: "Hello at the door, how may I help you today?"
- Visitor: "I have a delivery for Mr. Kanamaru."
- Possession effect: Door → Living
- Living room camera: "The visitor says they have a delivery for Ryota Kanamaru."
- You: "Please tell them to leave it at the door."
- Possession effect: Living → Door
- Front door camera: "Please leave the package at the door. Thank you."
Continuous interaction: After each report, the agent automatically listens for your next instruction via `listen_on_camera`, creating a natural conversation flow.
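For reference, a `listen_on_camera`-style step typically records a short clip, uploads it to S3, and runs AWS Transcribe on it. The sketch below is a simplified boto3 version with a placeholder bucket and no error handling, not the project's exact `aws_stt.py`:

```python
# Sketch of a listen-and-transcribe step (boto3; bucket/key names are placeholders).
import json
import time
import urllib.request

import boto3

def transcribe_clip(wav_path: str, bucket: str = "agentshell-audio") -> str:
    s3 = boto3.client("s3")
    transcribe = boto3.client("transcribe")

    key = "clips/visitor.wav"
    s3.upload_file(wav_path, bucket, key)

    job = f"agentshell-{int(time.time())}"
    transcribe.start_transcription_job(
        TranscriptionJobName=job,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat="wav",
        LanguageCode="en-US",
    )

    # Poll until the job finishes (simplified).
    while True:
        status = transcribe.get_transcription_job(TranscriptionJobName=job)
        if status["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            break
        time.sleep(2)

    # Fetch the transcript JSON from the presigned URL Transcribe returns.
    uri = status["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    with urllib.request.urlopen(uri) as resp:
        result = json.load(resp)
    return result["results"]["transcripts"][0]["transcript"]
```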
Additional demo: Camera movement demonstration
```
┌────────────────────────────────────────────────────┐
│ Cloud Layer (AWS)                                  │
│  • Amazon Bedrock AgentCore (Agentic Loop)         │
│  • Strands Agents SDK (MCP Tool Orchestration)     │
│  • Amazon Bedrock Nova (Multimodal Image Analysis) │
│  • AWS Polly/Transcribe (Voice I/O)                │
│  • Amazon S3 (Transcribe Audio Storage)            │
└──────────────────────┬─────────────────────────────┘
                       │ Authorized Remote MCP
                       │ (ngrok/cloudflared tunnel)
┌──────────────────────┴─────────────────────────────┐
│ Local MCP Servers (FastMCP Framework)              │
│  • camera1_* tools (7 tools per camera)            │
│  • camera2_* tools                                 │
│     - analyze_camera_image, speak_on_camera        │
│     - listen_on_camera, move_camera                │
│     - nod_head, shake_head, reset_position         │
│  • go2rtc (RTSP Stream & Audio Management)         │
└──────────────────────┬─────────────────────────────┘
                       │ ONVIF/RTSP (Local Network Only)
┌──────────────────────┴─────────────────────────────┐
│ Physical Devices                                   │
│  • PTZ Cameras × 2                                 │
└────────────────────────────────────────────────────┘
```
Traditional approach (not secure):

```
Cloud AI → Direct RTSP/API → Camera (ports exposed to the internet)
```

Our approach (secure):

```
Cloud AI → Authorized MCP → ngrok tunnel → Local MCP Server → Camera (no port exposure)
```
Security Features:
- No camera port exposure (RTSP/management API remain local)
- One-way tunnel connection via ngrok/cloudflared (example below)
- Authorized MCP only: all control flows through authenticated endpoints
Note: Currently in experimental phase. Authorization mechanisms are planned for production deployment.
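For reference, publishing the local MCP server through a one-way tunnel can look like the commands below. The basic-auth flag and quick-tunnel form are standard ngrok v3 / cloudflared features, not settings taken from this project's scripts:

```bash
# Publish the local MCP server (port 9006) without opening any camera ports.

# Option A: ngrok, with HTTP basic auth on the public endpoint (ngrok v3).
ngrok http 9006 --basic-auth="agent:strong-password"

# Option B: cloudflared quick tunnel (the public URL is printed on startup).
cloudflared tunnel --url http://localhost:9006
```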
- Python: 3.11+ (3.13 recommended)
- uv: Python package manager
- AWS Account: Bedrock access with Nova model
- PTZ Camera: Tapo C210 (~$20-40) or any ONVIF-compatible camera
- Cost advantage: Use affordable consumer cameras instead of expensive AI cameras ($500-2000+)
- No specialized hardware needed: Standard surveillance cameras work perfectly
- ngrok or cloudflared: For secure tunneling (free tier available)
- Alexa Developer Account (optional): For voice trigger testing via Developer Console
```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync
```

Create a `.env` file in the project root (multi-camera example):

```
# AWS Configuration
AWS_REGION=ap-northeast-1
BEDROCK_MODEL_ID=apac.amazon.nova-pro-v1:0
# MCP Servers (comma-separated for multiple cameras)
MCP_SERVER_URLS=http://127.0.0.1:9006/sse/,http://127.0.0.1:9007/sse/
# Camera 1 (Living Room)
CAMERA1_IP=192.168.11.34
CAMERA1_PORT=2020
CAMERA1_USER=your_username
CAMERA1_PASSWORD=your_password
CAMERA1_STREAM_NAME=tapo_cam1
# Camera 2 (Front Door)
CAMERA2_IP=192.168.11.24
CAMERA2_PORT=2020
CAMERA2_USER=your_username
CAMERA2_PASSWORD=your_password
CAMERA2_STREAM_NAME=tapo_cam2
# go2rtc Configuration
GO2RTC_API_URL=http://localhost:1984/api/ffmpeg
```

For a single-camera setup, use instead:

```
# AWS Configuration
AWS_REGION=ap-northeast-1
BEDROCK_MODEL_ID=apac.amazon.nova-pro-v1:0
# MCP Server (single URL)
MCP_SERVER_URL=http://127.0.0.1:9006/sse/
# Camera Configuration
CAMERA_IP=192.168.11.34
CAMERA_PORT=2020
CAMERA_USER=your_username
CAMERA_PASSWORD=your_password
GO2RTC_CAMERA_STREAM_NAME=tapo_cam
```

Configure AWS credentials and request Bedrock model access:

```bash
# Configure AWS credentials
aws configure
# Request Bedrock model access
# Go to AWS Console β Amazon Bedrock β Model access
# Request access to:
# - Amazon Nova Pro/Micro
# - Anthropic Claude 3.5/4 Sonnet (optional)
```

Run locally with multiple cameras:

```bash
# Terminal 1: Start all MCP servers
bash scripts/start_all_mcp_servers.sh
# Terminal 2: Start AgentCore (local testing)
bash scripts/start_agentcore.sh
# Terminal 3: Test the system
curl -X POST http://localhost:8080/invocations \
-H "Content-Type: application/json" \
-d '{"prompt": "Please ask the visitor at the front door what they need"}'# Terminal 1: Start MCP server
bash scripts/start_mcp_server.sh
# Terminal 2: Start AgentCore
bash scripts/start_agentcore.sh
# Terminal 3: Test
curl -X POST http://localhost:8080/invocations \
-H "Content-Type: application/json" \
-d '{"prompt": "Look around and tell me what you see"}'# Publish MCP server via ngrok
ngrok http 9006 # Note the public URL
# Deploy AgentCore to AWS
uv run agentcore launch \
--env MCP_SERVER_URLS="https://your-ngrok-url/sse/" \
--env AWS_REGION="ap-northeast-1" \
--env BEDROCK_MODEL_ID="apac.amazon.nova-pro-v1:0"
# Check deployment status
uv run agentcore status
# Test deployed agent
uv run agentcore invoke '{"prompt": "Hello, can you hear me?"}'
```

Each MCP server provides 7 tools with camera-specific prefixes (`camera1_`, `camera2_`):
| Tool | Description | Example |
|---|---|---|
| `analyze_camera_image` | Analyze camera view using AI (Nova) | Identify objects, people, situations |
| `speak_on_camera` | Output speech via camera speaker | Greet visitors, provide information |
| `listen_on_camera` | Record and transcribe audio | Listen to user commands, visitor responses |
| `move_camera` | Pan/tilt camera control | Look around, track movement |
| `nod_head` | Nod camera up/down (with optional speech) | Show agreement, acknowledgment |
| `shake_head` | Shake camera left/right (with optional speech) | Show disagreement, negation |
| `reset_camera_position` | Return camera to home position | Reset to default view |
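Under the hood, a tool like `analyze_camera_image` can pass a snapshot to Amazon Bedrock Nova through the Converse API. The sketch below is a simplified boto3 version with an assumed snapshot path, not the project's exact tool code:

```python
# Sketch: analyze a camera snapshot with Amazon Bedrock Nova via the Converse API.
import boto3

def analyze_snapshot(jpeg_path: str, prompt: str) -> str:
    client = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
    with open(jpeg_path, "rb") as f:
        image_bytes = f.read()

    response = client.converse(
        modelId="apac.amazon.nova-pro-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": prompt},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# Example: describe who is at the door (snapshot path is a placeholder).
print(analyze_snapshot("snapshots/front_door.jpg", "Describe the person briefly"))
```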
- Camera 1 (Living room): Default voice = Matthew (male)
- Camera 2 (Front door): Default voice = Joanna (female)
- Other available voices: Ivy, Kendra, Emma, Amy, Justin, Joey, Salli, etc.
Example:
```json
{
  "prompt": "camera1_nod_head(speech_text='I understand', voice='Matthew')"
}
```

Tool call constraints:

- No Parallel Calls on Same Camera: Each camera executes ONE tool at a time
- Cross-Camera Parallelism Allowed: Different cameras can operate simultaneously
- Sequential Pattern: Multi-step tasks execute one tool per response
- Synchronous Operations: All tools block until completion
- Preferred Communication: Use `nod_head` with `speech_text` for natural interaction
User: "Please ask the visitor at the front door what they need."
Agent Execution:
1. `camera1_nod_head(speech_text="I understand. Let me check the front door for you.")`
2. `camera2_nod_head(speech_text="Hello at the door, how may I help you today?")`
3. `camera2_listen_on_camera(duration_seconds=5)`
4. `camera2_analyze_camera_image(prompt="Describe the person briefly")`
5. `camera1_nod_head(speech_text="The visitor says [message]. They appear to be [description].")`
6. `camera1_listen_on_camera(duration_seconds=10)` → continues listening for the next instruction
- Privacy-focused: On-demand intervention, not constant surveillance
- Voice interaction: Soothe baby, remind elderly of medication
- Visual check: Assess situation through AI analysis
- Peace of mind: Balances safety and privacy
- Surrogate eyes: AI becomes your eyes, delivers info via voice
- Product information: "What's the expiration date?" β Camera reads label
- Navigation assistance: Guide through unfamiliar spaces
- Daily life support: Supports people with physical constraints
- Unmanned reception: AI greets visitors during off-hours
- After-hours service: Provide information when staff unavailable
- Multi-language support: Global AWS Polly/Transcribe
- Cost-effective: No need for 24/7 staffing
- Optimal viewpoints: Provides explanations from best camera angles
- Character expression: Use different voices and gestures for engagement
- Interactive experience: Answer visitor questions in real-time
- On-demand explanations: Detailed information when requested
- Physical gestures: Create character-driven AI performers
- Multiple personalities: Different voices and movement styles
- Engaging interaction: Visual + audio + movement creates presence
- Single consciousness, multiple locations: One AI across offices/stores
- Instant presence: Appear where needed instantly
- Consistent service: Same AI personality everywhere
```
AgentShell/
├── strands_agent/               # Strands Agent Core
│   ├── __init__.py
│   ├── agentcore_app.py         # AgentCore application (deployment)
│   └── core.py                  # Local agent execution
│
├── mcp_server/                  # MCP Server
│   ├── __init__.py
│   └── server.py                # FastMCP tool definitions
│                                #  - CAMERA_PROFILE env selects camera
│                                #  - Tools prefixed: camera1_, camera2_
│
├── camera_utils/                # Camera Control Utilities
│   ├── __init__.py
│   ├── ptz.py                   # PTZ control (ONVIF)
│   ├── aws_tts.py               # AWS Polly TTS
│   └── aws_stt.py               # AWS Transcribe STT
│
├── alexa_skill/                 # Alexa Integration
│   ├── lambda_function.py       # Alexa Skill handler
│   └── interaction_model.json   # Voice interaction model
│
├── services/go2rtc/             # Streaming Service
│   ├── config/
│   │   └── go2rtc.yaml          # Multi-camera stream config
│   └── docker-compose.yml       # Docker setup
│
├── config/                      # Configuration Files
│   └── strands.env.example      # Environment template
│
├── scripts/                     # Utility Scripts
│   ├── agentcore_launch.sh      # Deploy to AWS
│   ├── start_agentcore_local.sh # Start AgentCore locally
│   ├── start_all_mcp_servers.sh # Start all MCP servers
│   ├── start_ngrok.sh           # Start ngrok tunnel
│   ├── start_strands_system.sh  # Start complete system
│   ├── startup_commands.txt     # Reference commands
│   └── stop_all_mcp_servers.sh  # Stop all MCP servers
│
├── docs/                        # Documentation (see separate docs/)
│
├── pyproject.toml               # Python project configuration
├── Dockerfile                   # Docker image definition
├── .bedrock_agentcore.yaml.example
├── .env                         # Environment variables (local)
├── sample.env                   # Sample environment template
├── tsconfig.node.json           # TypeScript config (for auxiliary tools)
└── README.md                    # This file
```
Symptoms: camera2_* tools not available
Solution:
```bash
# 1. Check environment variable
cat .env | grep MCP_SERVER
# Should be (multi-camera):
# MCP_SERVER_URLS=http://127.0.0.1:9006/sse/,http://127.0.0.1:9007/sse/
# 2. Verify MCP servers running
lsof -i :9006
lsof -i :9007
# 3. Restart all services
bash scripts/stop_all_mcp_servers.sh
bash scripts/start_all_mcp_servers.sh
bash scripts/start_agentcore.sh
```

If Bedrock model access fails:

```bash
# Check available models
uv run python scripts/check_bedrock_access.py
# Verify AWS credentials
aws sts get-caller-identity
# Request model access in AWS Console
# Bedrock → Model access → Request access to Nova
```

If the camera does not respond:

- Verify camera IP address and credentials
- Ensure ONVIF is enabled on camera
- Check camera and PC are on same network
- Test RTSP stream: `ffplay rtsp://user:pass@camera-ip:554/stream1` (an ONVIF connectivity check is sketched below)
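If ONVIF connectivity itself is in doubt, a quick check like the one below (assuming python-onvif-zeep and the credentials from your `.env`) can confirm the camera answers before debugging the MCP layer:

```python
# Quick ONVIF sanity check (python-onvif-zeep assumed; IP/credentials from your .env).
from onvif import ONVIFCamera

cam = ONVIFCamera("192.168.11.34", 2020, "your_username", "your_password")
info = cam.devicemgmt.GetDeviceInformation()
print(info.Manufacturer, info.Model, info.FirmwareVersion)
```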
- Video Planning - Demo scenario script
- Multi-Camera Setup - Detailed multi-camera guide
- Detailed Article - Full project documentation
- Architecture - System architecture details
This project was built for the AWS AI Agent Global Hackathon.
- LLM hosted on Amazon Bedrock (Nova, Claude)
- Uses Strands SDK for agent building (AgentCore + agentic loop)
- Reasoning LLMs for autonomous decision-making
- External tool integration (MCP, redefined for physical control)
- Demonstrates a practical real-world application
- Novel approach to embodied AI
- Measurable impact (privacy, cost savings, accessibility)
- Amazon Bedrock AgentCore: Core agentic loop for autonomous decision-making
- Strands Agents SDK: Automates MCP tool selection and execution
- Amazon Bedrock Nova (Micro/Pro): On-demand multimodal image analysis
- AWS Polly (Neural TTS): Multiple voices for character expression (Matthew, Joanna, etc.)
- AWS Transcribe: Real-time speech recognition from camera microphones
- Amazon S3: Audio file storage for AWS Transcribe processing
- MCP (Model Context Protocol): Redefined from data access to body control
- FastMCP: MCP server framework for rapid development
- End-to-end functionality verified (Alexa Developer Console → camera selection → conversation loop)
- Tested with 2 cameras (living room + entrance)
- Secure architecture with no port exposure
- Autonomous tool selection via Strands Agents

Note: Alexa integration tested via Developer Console (production deployment pending)
- Add MCP server = Add body (no code changes)
- Same agent, same policy, infinite cameras
- Theoretically unlimited expansion
- Production Alexa deployment: Move from Developer Console testing to real Alexa devices with custom skill certification
- Dashboard for event timeline & snapshot history
- Snapshot storage in S3 for visual history and playback
- Multi-language support (global AWS Polly/Transcribe)
- Integration with smart home, robots, and drones via MCP
- Edge AI for low latency & offline operation
- Character/personality modes for diverse applications
- Enhanced privacy features for home use
This is a hackathon project, but contributions and feedback are welcome!
MIT License - see LICENSE file
- AWS Bedrock Team for powerful AI models
- Strands SDK for excellent agent framework
- Model Context Protocol for extensible tool system
- FastMCP for rapid MCP server development
AgentShell: Not just context. Control.
Giving AI physical presence through the Model Context Protocol

