Inspiration
For the past few years, we have both been surrounded by meetings in a variety of settings, whether at industry giants (Qualcomm, Samsara, and Nutanix, to name a few), at startups, or in research groups. While these meetings clarify a lot and set meaningful action items for your team, we both feel that amid the many exchanges and constant other worries of a workday, those action items rarely carry their weight once you leave the Zoom call.
What it does
Sidekick joins your Zoom meetings and, in real time, monitors key action items that must be tracked and noted. That way, when you hop off the call, your tasks, reminders, and follow-up syncs are already logged for you - all in one place. With a plethora of integrations to your team's most widely used platforms, Sidekick keeps everything from your calendar to your GitHub up to date so you can focus on the things that are important.
How we built it
The technical pipeline goes something like this.
- We used the open-source Attendee bot to join Zoom meetings via a WebSocket endpoint. Instead of relying on prebuilt transcription APIs, we streamed raw PCM audio directly from Zoom using the WebSocket. The audio was buffered in 30ms chunks, normalized to a 16kHz mono format for compatibility, and passed through Voice Activity Detection (VAD) to eliminate silence and background noise. This optimized stream was then routed to Groq’s Whisper v3, enabling real-time transcription with sub-100ms latency.
- Once we collected the transcription text from Whisper v3 (OpenAI's model, served via Groq), we sent it to a supervisor agent built in LangGraph that parses the transcription, filters out noise, and extracts commands relevant to our available integrations.
- Worker agents, spawned in the background through LangGraph by the supervisor agent, then call different MCP servers (e.g., GitHub, Outlook, Calendar, Linear) to build tool-call procedures and execute the parsed commands. Each worker agent is responsible for the set of tools belonging to one specific integration.
- A JSON record of tool-call procedures and commands is stored on the backend and continuously pushed to Streamlit as agents act on the commands identified by the supervisor agent, so the user can approve or deny incoming actions - keeping a human in the loop.
- The user approves or denies the actions displayed in Streamlit, and approved actions are executed across the different platforms.
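The audio step above can be sketched in plain Python. This is a minimal, illustrative version: it gates 30ms frames of 16kHz mono 16-bit PCM on a naive RMS energy threshold, whereas the real pipeline uses a proper VAD and forwards the surviving frames to Groq's Whisper endpoint. The threshold value and function names are our own, not Sidekick's actual API.

```python
import math
import struct

SAMPLE_RATE = 16_000  # 16 kHz mono, as the stream is normalized
FRAME_MS = 30         # 30 ms chunks, as buffered from Zoom
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a frame of 16-bit little-endian PCM."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def voiced_frames(pcm: bytes, threshold: float = 500.0):
    """Yield only frames whose energy exceeds the silence threshold.

    A real deployment would use a trained VAD instead of this naive
    energy gate; the threshold here is an illustrative guess.
    """
    frame_bytes = SAMPLES_PER_FRAME * 2  # 2 bytes per 16-bit sample
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        frame = pcm[i : i + frame_bytes]
        if rms(frame) >= threshold:
            yield frame  # would be forwarded to the transcription API
```

Dropping silent frames before they ever reach the model is what keeps the per-chunk transcription cost low enough for sub-100ms latency.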
Challenges we ran into
We ran into two primary challenges:
1) Integrating Zoom and Whisper: Initially, we tried using an outdated open-source bot (MeetingBot) that failed to give us direct audio access. After pivoting to Attendee, we were able to successfully stream raw audio and pipe it directly into Groq’s Whisper model for low-latency transcription.
2) Keeping LLM calls on the Whisper output low-latency: At first, our strategy was to have the agents make their calls in real time. But we realized this meant wrapping the Whisper output in the much slower operation of deploying agents and analyzing commands, significantly degrading the live-streaming experience. So we pivoted to parsing commands in real time while running the agents in the background, ensuring that by the time you get off the call, the work has already been done for you.
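The pivot in challenge 2 boils down to a producer/consumer split: the streaming loop only enqueues parsed commands, and a background worker drains them. A minimal stdlib sketch of that pattern (the command shape and handler are illustrative, not Sidekick's actual code):

```python
import queue
import threading

# Commands parsed from the live transcript are queued immediately, so the
# streaming loop never blocks on slow agent/tool calls.
command_queue: "queue.Queue[dict | None]" = queue.Queue()
completed: list[dict] = []

def agent_worker() -> None:
    """Drain commands in the background; None is a shutdown sentinel."""
    while True:
        cmd = command_queue.get()
        if cmd is None:
            break
        # A real worker would invoke the matching MCP tool here.
        completed.append({**cmd, "status": "done"})
        command_queue.task_done()

worker = threading.Thread(target=agent_worker, daemon=True)
worker.start()

# The real-time parsing loop only pays this cheap enqueue per command:
command_queue.put({"integration": "github", "action": "create_issue"})
command_queue.put({"integration": "calendar", "action": "schedule_sync"})
command_queue.put(None)  # signal shutdown once the call ends
worker.join()
```

The enqueue is effectively free, so transcription latency stays decoupled from however long the agents take downstream.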
Accomplishments that we're proud of
We might have the WORLD'S FIRST ZOOM bot that can transcribe audio in real time and give it back to you!
The end-to-end meeting-to-action pipeline is something we spent a lot of time on, from the modularity we built into our code to the final result. Although very difficult, it pushed our technical boundaries.
We are also very proud of the agent design and code. We spent hours on intelligent state design to create context-aware agents that persist through multiple actions and complex live streaming, and we developed each individual agent within a modular StateGraph system. The thought and strategy that went into just the design and planning made us appreciate the whole process.
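To make the supervisor/worker split concrete, here is a deliberately LLM-free sketch of the routing idea. In Sidekick this is a LangGraph StateGraph whose nodes are LLM-backed agents; here plain functions and keyword matching stand in so the state design is visible. All names below are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MeetingState:
    """State persisted across actions: parsed commands in, tool calls out."""
    transcript_commands: list[str]
    pending_tool_calls: list[dict] = field(default_factory=list)

def github_worker(command: str) -> dict:
    return {"integration": "github", "tool": "create_issue", "arg": command}

def calendar_worker(command: str) -> dict:
    return {"integration": "calendar", "tool": "create_event", "arg": command}

# Each worker owns the tools for exactly one integration.
WORKERS = {"github": github_worker, "calendar": calendar_worker}

def supervisor(state: MeetingState) -> MeetingState:
    """Route each parsed command to the worker owning that integration."""
    for command in state.transcript_commands:
        for keyword, worker in WORKERS.items():
            if keyword in command.lower():
                # Queued for human approval in the UI, not executed directly.
                state.pending_tool_calls.append(worker(command))
    return state
```

Keeping all accumulated context in one state object is what lets agents stay context-aware across multiple actions, and it mirrors how LangGraph threads a shared state through every node of the graph.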
What we learned
A huge learning that served us well later on was code modularity: developing with integration in mind made it significantly easier to see where to integrate, and to do so quickly as time ran out. Another big learning was simply working with more agents and MCP integrations. Since we spent most of our time after building the core infrastructure working on various integrations, we got to experiment a lot with how MCPs can fuel agents and learn about them practically and in depth.
What's next for Sidekick
Streaming architectures are hard. Even small latencies in LLM inference add up, especially when processing audio in real time, and reducing them is the first milestone the Sidekick team will tackle. The beauty of the idea, however, is just how modular it is: while we are currently working on more generic platform integrations, the same pipeline can power internal tooling and much more.
Built With
- amazon-web-services
- anthropic
- docker
- github-api
- groq
- langgraph
- mcps
- openai
- outlook
- python
- streamlit
- terraform
- whisper
- zoom-sdk