CoffeeAgntcy Optimization — Group 27

Advanced Python for Data Science | Spring 2026

Team — Group 27

Name	Role
Deepali Balakrishna Ksheersagar	Profiling & Analysis
Aditya Desai	Optimization & Implementation

Project Overview

This project profiles and optimizes CoffeeAgntcy — a real-world distributed multi-agent AI system built on the open-source AGNTCY infrastructure.

CoffeeAgntcy simulates a fictitious coffee company where AI agents communicate with each other over a message bus to answer questions about coffee origins, flavors, and profiles.

Our goal: Identify performance bottlenecks in this distributed agent workflow and apply Advanced Python techniques to optimize them — then measure and compare the results with real data.

System Architecture

┌─────────────────────────────────────────────────────────┐
│                    User (Browser UI)                    │
│                  http://localhost:3000                  │
└─────────────────────┬───────────────────────────────────┘
                      │ HTTP POST /agent/prompt
┌─────────────────────▼───────────────────────────────────┐
│           Supervisor Agent — Exchange Server            │
│        (LangGraph orchestrated, port 8000)              │
└─────────────────────┬───────────────────────────────────┘
                      │ A2A Protocol over SLIM message bus
┌─────────────────────▼───────────────────────────────────┐
│             SLIM — Secure Low-Latency                   │
│            Interactive Messaging Bus                    │
│                   (port 46357)                          │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│           Grader Agent — Farm Server                    │
│        (Q Grader Sommelier, LangGraph + GPT-4o)         │
└─────────────────────┬───────────────────────────────────┘
                      │ API Call
┌─────────────────────▼───────────────────────────────────┐
│                  OpenAI GPT-4o API                      │
│              (External — main bottleneck)               │
└─────────────────────────────────────────────────────────┘

All 10 Docker Containers Running

Container	Purpose
`exchange-server`	Supervisor Agent — receives user prompts
`farm-server`	Grader Agent — answers using AI
`slim`	Message bus between agents
`nats`	Backup pub/sub messaging system
`clickhouse-server`	Database for conversation history
`grafana`	Performance dashboard
`otel-collector`	OpenTelemetry data collector
`mce-api-layer`	Metrics computation API
`metrics-computation-engine`	Processes performance metrics
`ui`	Frontend website

Repository Structure

Coffee-Agentcy-Optimization/
└── Version-1/
    ├── README.md                              ← You are here
    ├── profiler.py                            ← Baseline profiling script
    ├── optimized.py                           ← Optimization v1 (asyncio + cache)
    ├── optimized_v2.py                        ← Optimization v2 (advanced)
    └── Presentation/
        └── Group_27_CoffeeAgntcy_Optimized.pptx

How to Run This Project

Prerequisites

Make sure you have these installed:

Tool	Version	Install
Python	3.11+	python.org
Docker Desktop	Any	docker.com
Git	Any	git-scm.com
OpenAI API Key	—	platform.openai.com

Step 1 — Clone This Repo

git clone https://github.com/Desaiadi/Coffee-Agentcy-Optimization.git
cd Coffee-Agentcy-Optimization/coffeeAGNTCY/coffee_agents/corto
cp .env.example .env

Open .env and add these lines at the bottom:

LLM_MODEL=openai/gpt-4o
OPENAI_API_KEY=your-openai-api-key-here
OPENAI_ENDPOINT=https://api.openai.com/v1
OPENAI_MODEL_NAME=gpt-4o

Start all containers:

docker compose up

Wait ~2 minutes, then open http://localhost:3000
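Instead of waiting a fixed two minutes, a script can poll the UI until it responds. The sketch below is one possible way to do that; `http_probe` and `wait_until_ready` are hypothetical helper names that are not part of this repository, and the timeout and interval values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # The server is up but returned an error status
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar
        return False


def wait_until_ready(probe, timeout: float = 180.0, interval: float = 5.0) -> bool:
    """Call probe() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Usage, assuming the UI address given above: `wait_until_ready(lambda: http_probe("http://localhost:3000"))`.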

Step 2 — Install Python Dependencies

pip install memory_profiler aiohttp

Step 3 — Run Baseline Profiler

python profiler.py

This will output:

Latency per request (min, max, avg)
Throughput (requests per second)
Peak and average memory usage
cProfile breakdown of all function calls

Step 4 — Run Optimized Version

python optimized_v2.py

This runs three tests:

Cold run (no cache) — concurrent requests
Warm run (all cached) — instant responses
Mixed run — some cached, some new

Phase 1: Profiling — What We Measured

We wrote profiler.py around three Python profiling tools and used docker stats for container-level measurements:

Tool 1: `time.perf_counter()`

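A monotonic, high-resolution clock intended for timing short intervals. We use it to measure the wall-clock time of each request, including time spent waiting on the network.

import time

start = time.perf_counter()
result = send_request(prompt)
elapsed = time.perf_counter() - start
print(f"Request took: {elapsed:.3f}s")

Think of it as: A digital stopwatch whose resolution is far finer than the seconds-long requests we are timing.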
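The snippets in this section call `send_request` without showing it. A minimal sketch of what such a helper could look like follows, assuming the endpoint from the architecture diagram (`POST /agent/prompt` on port 8000) and a JSON body with a `prompt` field. `build_request` and `timed` are hypothetical names, and the actual `profiler.py` may differ.

```python
import json
import time
import urllib.request

# Assumed from the architecture diagram; adjust if your setup differs
API_URL = "http://localhost:8000/agent/prompt"


def build_request(prompt: str, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the prompt."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(prompt: str, timeout: float = 60.0) -> dict:
    """Send the prompt and return the decoded JSON reply (blocking)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

With the containers running, `timed(send_request, "What does Ethiopian coffee taste like?")` returns the reply together with its latency.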

Tool 2: `cProfile`

Records every single Python function that was called, how many times it was called, and how long it ran.

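import cProfile
import pstats

# Profile a single request
profiler = cProfile.Profile()
profiler.enable()
send_request("What does Ethiopian coffee taste like?")
profiler.disable()

# Show the 10 entries with the largest cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")
stats.print_stats(10)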

Think of it as: A security camera that records every step of the program.

Tool 3: `memory_profiler`

Monitors how much RAM the program uses during execution, sampled every 0.1 seconds.

from memory_profiler import memory_usage
mem = memory_usage((send_request, (prompt,)), interval=0.1)
print(f"Peak memory: {max(mem):.2f} MiB")

Think of it as: A RAM meter that watches your program's memory in real time.

Tool 4: `docker stats`

Measures CPU and memory usage of each Docker container while a request is running.

docker stats --no-stream

Baseline Results (Before Optimization)

Latency

Run	Time
Request 1	3.001s
Request 2	2.093s
Request 3	2.830s
Average	2.641s
Min	2.093s
Max	3.001s

Throughput & Memory

Metric	Value
Throughput	0.379 requests/second
Peak Memory	34.34 MiB
Avg Memory	33.37 MiB
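The summary figures follow directly from the three measured latencies. As a quick check of the arithmetic, assuming throughput is defined as the number of requests divided by the total sequential wall time:

```python
from statistics import mean

# Seconds per request, copied from the latency table above
latencies = [3.001, 2.093, 2.830]

total = sum(latencies)
summary = {
    "total_s": round(total, 3),
    "avg_s": round(mean(latencies), 3),
    "min_s": min(latencies),
    "max_s": max(latencies),
    # Sequential throughput: requests completed per second of wall time
    "throughput_rps": round(len(latencies) / total, 3),
}
print(summary)
```

This reproduces the 2.641 s average and the 0.379 requests/second reported above.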

Container Resources (During Request)

Container	CPU Usage	Memory
exchange-server	48.59%	711.6 MiB
farm-server	0.00%	422 MiB
slim	0.00%	3.97 MiB
clickhouse	5.58%	577.3 MiB

Most Important Finding — cProfile Output

560 function calls in 2.107 seconds

ncalls  tottime  percall  cumtime  function
     1    0.000    0.000    2.107    profiler.py:17(send_request)
     1    0.000    0.000    2.106    urllib/request.py(urlopen)
     1    0.000    0.000    2.106    urllib/request.py(open)
     ...

tottime = 0.000 for every Python-level function shown.

This means the Python code itself does almost no work. Nearly all of the 2.1 seconds (2.106 s of 2.107 s cumulative) is spent inside urlopen, blocked on the HTTP response, which in turn waits on the OpenAI API. Making the Python code faster would not help; we needed to change how we make requests.
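The difference between tottime (time spent in a function's own code) and cumtime (time including everything it calls) can be reproduced without the real service. In the sketch below, `fake_network_call` is a hypothetical stand-in that sleeps instead of calling the API, and `profile_report` is a helper name invented for this example.

```python
import cProfile
import io
import pstats
import time


def fake_network_call(delay: float = 0.2) -> str:
    """Stand-in for a blocking HTTP call: it only waits."""
    time.sleep(delay)
    return "ok"


def profile_report(func, *args) -> str:
    """Profile one call and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_report(fake_network_call, 0.2)
print(report)
```

In the printed report, `fake_network_call` should show a cumtime close to the delay but a tottime near zero, while the built-in `time.sleep` carries the time. That is the same pattern as in the profile above, where the wait sits below `urlopen` rather than in our own code.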

Phase 2: Optimizations

Optimization 1 — Asyncio + Concurrent Requests

The Problem

Before optimization, requests ran sequentially — one at a time:

Timeline:
[Request 1: 2.6s wait...] [Request 2: 2.6s wait...] [Request 3: 2.6s wait...]
Total = 7.9 seconds

Python sat idle while waiting for the AI to respond, then started the next request.

The Fix

Use Python's asyncio (asynchronous programming) to send all requests at the same time:

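import asyncio
import aiohttp

# Supervisor Agent endpoint from the architecture diagram above
API_URL = "http://localhost:8000/agent/prompt"
# Example workload; replace with the prompts you want to measure
prompts = ["What does Ethiopian coffee taste like?"] * 3

async def send_request_async(session, prompt):
    # POST the prompt and parse the JSON reply
    async with session.post(API_URL, json={"prompt": prompt}) as resp:
        resp.raise_for_status()
        return await resp.json()

async def run_concurrent():
    async with aiohttp.ClientSession() as session:
        # Create one coroutine per prompt
        tasks = [send_request_async(session, p) for p in prompts]
        # Await them together so their network waits overlap
        results = await asyncio.gather(*tasks)
    return results

asyncio.run(run_concurrent())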

The Result

Timeline:
[Request 1: 6.5s ─────────────────────────────────►]
[Request 2: 5.8s ───────────────────────────────►  ]  All done in 6.5s!
[Request 3: 6.1s ────────────────────────────────► ]

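Wall time: 7.924s → 6.571s = 17% faster

Each request took about 6 s under concurrency, versus about 2.6 s when sent alone. This suggests the backend handles concurrent requests largely one after another, which would explain why the gain is 17% rather than close to 3x.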
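To see the sequential-versus-concurrent difference in isolation, the sketch below replaces the real API with `asyncio.sleep`. `fake_request`, `timed_run`, and the delays are illustrative assumptions, not measurements from this project.

```python
import asyncio
import time


async def fake_request(prompt: str, delay: float) -> str:
    """Simulate a request whose only cost is waiting."""
    await asyncio.sleep(delay)
    return f"answer to {prompt!r}"


async def run_sequential(jobs):
    # Each request starts only after the previous one has finished
    return [await fake_request(p, d) for p, d in jobs]


async def run_concurrent(jobs):
    # All requests are started together; their waits overlap
    return await asyncio.gather(*(fake_request(p, d) for p, d in jobs))


def timed_run(coro_fn, jobs):
    """Run one strategy to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(jobs))
    return results, time.perf_counter() - start


jobs = [("ethiopia", 0.10), ("colombia", 0.15), ("brazil", 0.20)]
seq_results, seq_time = timed_run(run_sequential, jobs)
con_results, con_time = timed_run(run_concurrent, jobs)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

When the waits are independent, the concurrent run takes roughly as long as the slowest single request, while the sequential run takes the sum. How close a real system gets to that ideal depends on how much work the server can overlap.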

Optimization 2 — Response Caching

The Problem

Every time someone asks "What does Ethiopian coffee taste like?", the system makes a brand new API call to OpenAI, waits 2-3 seconds, and pays for a new API call — even if we already know the answer.

The Fix

Store answers in a dictionary (cache). On the second request for the same question, return the stored answer instantly:

import hashlib
import time

cache = {}

def cache_key(prompt):
    # Normalize case and surrounding whitespace, then hash to a fixed-length key
    return hashlib.md5(prompt.lower().strip().encode()).hexdigest()

async def send_request_cached(session, prompt):
    key = cache_key(prompt)

    # Check if we already know the answer
    if key in cache:
        print("Cache hit! Returning instantly...")
        return cache[key], 0.0   # ← no network call, so we report 0 s

    # First time seeing this question: ask the AI and time the call
    start = time.perf_counter()
    result = await send_request_async(session, prompt)
    elapsed = time.perf_counter() - start
    cache[key] = result['response']   # ← remember the answer
    return result['response'], elapsed

The Result

First ask:   "Ethiopian coffee taste?" → 2.64 seconds (API call)
Second ask:  "Ethiopian coffee taste?" → 0.000 seconds (cache hit!)

Repeat query latency: 2.64s → 0.000s (effectively a 100% reduction; the lookup rounds to zero at millisecond precision)

Cached throughput: 0.379 → 4,701 req/sec ≈ 12,000x improvement
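The caching behaviour can be checked without the real API. The sketch below wraps a fake backend and counts hits and misses. `PromptCache`, `fake_backend`, and the placeholder answer are assumptions made for this example. It also uses SHA-256 and collapses internal whitespace when building the key, which are choices of this sketch rather than of `optimized.py`.

```python
import asyncio
import hashlib


class PromptCache:
    """In-memory cache keyed by a hash of the normalized prompt."""

    def __init__(self, backend):
        self._backend = backend            # async callable: prompt -> answer
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    async def ask(self, prompt: str) -> str:
        k = self.key(prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        answer = await self._backend(prompt)
        self._store[k] = answer
        return answer


backend_calls = []


async def fake_backend(prompt: str) -> str:
    """Pretend to be the slow API; records every call it receives."""
    backend_calls.append(prompt)
    await asyncio.sleep(0.01)
    return "placeholder answer"


async def demo():
    cache = PromptCache(fake_backend)
    first = await cache.ask("What does Ethiopian coffee taste like?")
    second = await cache.ask("  what does ethiopian coffee   taste like? ")
    return cache, first, second


cache, first, second = asyncio.run(demo())
print(cache.hits, cache.misses, len(backend_calls))
```

The second question differs only in case and spacing, so it is served from the cache and the backend is called once. A plain dictionary like this grows without bound and never expires entries, which is acceptable for a benchmark but would need a size or age limit in a long-running service.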

Optimization 3 — Connection Pooling + Semaphore

The Problem

Every HTTP request was opening a brand new TCP connection to the server, using it once, then closing it. This is like buying a new disposable phone for every call instead of keeping one line open.

Also, sending too many requests simultaneously could overwhelm the server.

The Fix

Connection Pooling: Allow up to 20 pooled connections and reuse idle ones instead of reconnecting:

connector = aiohttp.TCPConnector(
    limit=20,              # At most 20 simultaneous connections in total
    limit_per_host=10,     # At most 10 simultaneous connections to one host
    keepalive_timeout=30,  # Keep idle connections open for 30s for reuse
    ttl_dns_cache=300      # Cache DNS lookups for 5 minutes
)

Semaphore: Limit to 3 concurrent requests max to avoid server overload:

semaphore = asyncio.Semaphore(3)

async def bounded_request(session, prompt):
    async with semaphore:   # Only 3 can run at once
        return await send_with_retry(session, prompt)
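The effect of the semaphore can be verified with simulated requests. The sketch below tracks how many workers are inside the guarded section at once; `run_bounded` and its parameters are hypothetical names for this example, and `asyncio.sleep` stands in for the real request.

```python
import asyncio


async def run_bounded(n_tasks: int, limit: int, delay: float = 0.02) -> int:
    """Run n_tasks fake requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def worker() -> None:
        nonlocal in_flight, peak
        async with semaphore:            # blocks once `limit` are running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(delay)   # simulated request
            in_flight -= 1

    await asyncio.gather(*(worker() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_bounded(n_tasks=10, limit=3))
print(f"peak concurrent requests: {peak}")
```

All ten tasks are created up front, but no more than three are ever inside the guarded section at the same time. The semaphore is created inside the running coroutine here, which avoids event-loop binding problems on older Python versions.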

Retry with Exponential Backoff: If a request fails, wait and try again, doubling the wait each time:

async def send_with_retry(session, prompt, retries=3):
    for attempt in range(retries):
        try:
            return await send_request_async(session, prompt)
        except Exception:
            if attempt == retries - 1:
                raise                 # Out of attempts: surface the error
            wait = 2 ** attempt       # Wait 1s, then 2s
            await asyncio.sleep(wait)
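The backoff logic can be exercised against a deliberately flaky function. This is a sketch under stated assumptions: `retry_async` and `flaky` are hypothetical names, the delays are shortened so the demo runs quickly, and the small random jitter is an optional addition that is not in the project code.

```python
import asyncio
import random


async def retry_async(func, *, retries: int = 3, base_delay: float = 1.0,
                      retry_on=(Exception,)):
    """Await func(); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return await func()
        except retry_on:
            if attempt == retries - 1:
                raise                      # no attempts left
            delay = base_delay * (2 ** attempt)
            # Optional jitter so many clients do not retry in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))


attempts = []


async def flaky() -> str:
    """Fail twice with a simulated transient error, then succeed."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = asyncio.run(retry_async(flaky, retries=3, base_delay=0.01))
print(result, len(attempts))
```

The call succeeds on the third attempt. If every attempt fails, the last exception propagates to the caller instead of being swallowed, so a failed request is never mistaken for an empty result.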

The Result

Throughput: 0.379 → 0.514 req/sec = 36% improvement

Final Results — Complete Before vs After

Performance Comparison Table

Metric	Baseline	Optimized (Cold)	Optimized (Cached)	Best Improvement
Wall Time (3 req)	7.924s	6.571s	0.001s	↓ 99.99%
Throughput	0.379 req/s	0.514 req/s	4,701 req/s	↑ 12,000x
Repeat Query Time	2.641s	2.641s	0.000s	↓ 100%
Avg Memory	33.37 MiB	30.17 MiB	30.17 MiB	↓ 9.6%
Peak Memory	34.34 MiB	38.91 MiB	38.91 MiB	↑ 13.3%
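The percentages in the table can be recomputed from the raw values. A short check follows, where `pct_change` is a helper defined only for this example and a negative result means the metric went down:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the comparison table above
changes = {
    "wall_time_cold": pct_change(7.924, 6.571),
    "wall_time_cached": pct_change(7.924, 0.001),
    "throughput_cold": pct_change(0.379, 0.514),
    "avg_memory": pct_change(33.37, 30.17),
    "peak_memory": pct_change(34.34, 38.91),
}

# Cached throughput expressed as a multiple of the baseline
speedup_cached = 4701 / 0.379

for name, value in changes.items():
    print(f"{name:>18}: {value:+.2f}%")
print(f"{'cached speedup':>18}: {speedup_cached:,.0f}x")
```

The cached speedup comes out slightly above 12,000x, which the table rounds down.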

Throughput Breakdown

Baseline Sequential:   ████░░░░░░░░░░░░░░░░  0.379 req/sec
Optimized Concurrent:  █████░░░░░░░░░░░░░░░  0.514 req/sec  (+36%)
Mixed (partial cache): ██████░░░░░░░░░░░░░░  0.649 req/sec  (+71%)
Full Cache (warm):     ████████████████████  4,701 req/sec  (≈12,000x)

💡 Key Lessons Learned

1. Profile First, Optimize Second

Without cProfile, we might have spent time optimizing Python code that was already running in microseconds. The profiler revealed the real bottleneck was network I/O — completely changing our optimization strategy.

2. The Bottleneck Was Not Where We Expected

We expected the Python code to be slow. Instead, tottime = 0.000 for the Python functions in the profile showed that the code itself takes negligible time: about 99.9% of each request is spent waiting for the response, which is dominated by the OpenAI API call. A program like this is called I/O-bound.

3. Caching Has the Highest ROI

A simple dictionary cache gave us a roughly 12,000x throughput increase for repeat queries with only a handful of lines of code. In production systems, the same idea is often implemented with a shared store such as Redis.

4. Asyncio Is Essential for I/O-Bound Tasks

When your program spends most of its time waiting (for APIs, databases, files), asyncio lets you do something useful during that wait instead of sitting idle.

5. Connection Pooling Reduces Hidden Overhead

Every TCP connection has setup overhead. Reusing connections with TCPConnector avoids paying it on every request. Combined with bounded concurrency, this raised throughput by 36% over the sequential baseline.

🔮 Future Work

Improvement	Description	Expected Gain
Redis Cache	Persistent cache that survives restarts	Same 12,000x but permanent
Auto-scaling	Multiple farm agent instances	Linear scaling with workers
Numba JIT	Accelerate serialization code	10-100x for CPU-bound parts
Request Batching	Group similar queries together	Reduce API calls by 50%+
Streaming Responses	Stream tokens as they generate	Perceived latency ↓ 80%

Tech Stack

Technology	Purpose
Python 3.14	Main programming language
asyncio	Asynchronous concurrent programming
aiohttp	Async HTTP client with connection pooling
cProfile	Built-in Python function profiler
memory_profiler	RAM usage tracking
Docker / Docker Compose	Container orchestration
LangGraph	Agent workflow orchestration
SLIM	Secure agent-to-agent messaging bus
OpenAI GPT-4o	Large language model
litellm	LLM provider abstraction layer

References

Group 27 — Advanced Python for Data Science — Spring 2026
