Astra

1. PROJECT OVERVIEW

Project Name: Astra
Tagline: Astra is an autonomous, promptless AI secretary that helps friends enjoy activities together without the hassle of tedious group coordination.
Elevator Pitch: Astra coordinates and handles situations in the ever-changing, unpredictable world of group events without being intrusive, so participants can focus on what really matters. Astra gathers context and understands when to step in to make bookings, coordinate timings, and deal with unpredictable changes in plans efficiently and seamlessly.

2. INSPIRATION

Typical AI agents today require the user to trigger them through an interface, such as chatting with them or calling an endpoint. However, we envision future AI agents as somewhat like co-workers: embedded into our existing systems and able to intelligently decide whether they should do something. We think this autonomy will once again push the boundaries of what is possible with AI agents.

We wanted to start exploring this by first tackling a problem that each of us faces frequently: managing group events. Having coordinated many events ourselves, we got tired of burdening someone with the task of dealing with everyone's constantly changing plans, as well as the struggle of finding places and reservations that fit everyone's schedule. We decided to do something about this and make a friend dedicated to this very task, a friend for life.

3. WHAT IT DOES

Promptless: Astra is able to embed in existing systems, watch for changes, and intelligently decide whether it should perform an action, all without being prompted by the user. This is vastly different from current user experiences, where users usually have to trigger an AI agent to perform an action. The result is a seamless experience where AI agents can truly work alongside us.
Autonomous functionality: Astra has a fully autonomous thinking process. Once given a goal, it repeatedly thinks, executes actions, and observes results, step by step. This allows Astra to receive external signals and gain contextual knowledge along the way, and to modify or improve its interpretation of the goal while still working toward it. For example, Astra could be in the middle of making a booking when a user sends a message saying they would like to eat somewhere else. Astra is able to process that information, recognize that it now needs to make a booking at the new place, and either abandon its previous booking process or proceed to cancel it.
Execute complex web flows: Booking a reservation, when it is not available on platforms like Resy, is a fairly complex and varied task. It usually involves some form of Google search, multiple button clicks, and sometimes even a call to the venue. The solution space for how this can be done is huge. As Astra attempts the task, it goes through rounds of decision making and self-evaluation to determine whether it has achieved its goal of booking a reservation. If previous attempts have failed, Astra can also decide whether to try an alternative method, such as calling the venue, and execute it, or whether to stop trying and simply inform the user of the booking status. This self-evaluation loop means that Astra can almost always complete a booking autonomously, even though booking a reservation is a vastly varied process.
Integration with an existing messaging app: Astra is completely embedded into Telegram, a common messaging app that friend groups already use. This allows users to adopt the tool quickly.

Use Cases

In any group setting where events need to be coordinated and managed, you can integrate Astra as your group's personal secretary. For example, if you are planning a karaoke session with a group of friends, Astra helps you seamlessly get and manage reservations at the karaoke place as your group's requirements change. Have that last-minute "I can't make it" or "Can Eunice come along too?" Astra takes care of all of that for you. No more "Let me call to see if they have a bigger room". The promptless nature of Astra allows this to happen without users feeling like they still need to 'manage' the bot.
When arranging a meetup with a friend, you can integrate Astra into the conversation to check whether restaurants are available and book them.

4. HOW WE BUILT IT

Architecture Diagram

Overall Architecture

[Figure: overall architecture diagram]

Agent Architecture

We played around with different architectures for different agents, depending on the type of agent we wanted each to be (e.g. more task-focused vs. more flexible). In general, however, our agents consisted of the following components:

Task Composition Pattern

Tree of thought: Generate a plan based on the goal, then break down each task in the plan into subtasks.
- Each step within the tree of thought still had its own self-reflection loop in which it would attempt to complete its task autonomously. Because the tasks were all children of the initially generated plan, the agent stayed extremely focused on the goal it was given. This architecture was useful for task-focused agents that had simpler tasks and needed to be less easily distracted.
Chain of thought: Generate the next thought based on everything that is currently happening.
- This architecture tells the agent to perform an action or thought, then generate its next action or thought, and so on.
- This was useful for agents that needed to be more flexible (i.e. able to incorporate external signals).
- This was also useful for agents that had more complex tasks: because every thought and action was accessible on the same nesting level, the agent's self-evaluation and remediation loop became much smarter.
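
To make the difference between the two patterns concrete, here is a minimal Python sketch. The names `Task`, `build_plan_tree`, `leaf_tasks`, `run_chain`, and the stubbed `decompose` and `next_step` functions are hypothetical and only for illustration; they are not Astra's actual code, and in a real agent both stubs would be backed by a language model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    """A node in the plan tree: a task plus the subtasks derived from it."""
    description: str
    subtasks: List["Task"] = field(default_factory=list)


def build_plan_tree(goal: str, decompose: Callable[[str], List[str]], depth: int = 2) -> Task:
    """Tree pattern: plan up front, then break each task into subtasks."""
    node = Task(goal)
    if depth > 0:
        for step in decompose(goal):
            node.subtasks.append(build_plan_tree(step, decompose, depth - 1))
    return node


def leaf_tasks(node: Task) -> List[str]:
    """Return the executable leaf tasks in depth-first order."""
    if not node.subtasks:
        return [node.description]
    leaves: List[str] = []
    for child in node.subtasks:
        leaves.extend(leaf_tasks(child))
    return leaves


def run_chain(goal: str,
              next_step: Callable[[str, List[str]], Optional[str]],
              max_steps: int = 10) -> List[str]:
    """Chain pattern: one flat history; each step is chosen from everything so far."""
    history: List[str] = []
    for _ in range(max_steps):
        step = next_step(goal, history)
        if step is None:  # the stub signals that nothing is left to do
            break
        history.append(step)
    return history


# Hypothetical stand-ins for model calls.
PLAN = {
    "book karaoke": ["find venue", "reserve room"],
    "find venue": ["search web", "check opening hours"],
}
SCRIPT = ["think: need a venue", "act: search web", "observe: found a venue"]


def decompose(task: str) -> List[str]:
    return PLAN.get(task, [])


def next_step(goal: str, history: List[str]) -> Optional[str]:
    return SCRIPT[len(history)] if len(history) < len(SCRIPT) else None


tree = build_plan_tree("book karaoke", decompose)
print(leaf_tasks(tree))
print(run_chain("book karaoke", next_step))
```

In the tree, every leaf stays tied to the original plan, which matches the focus described above. In the chain, all thoughts, actions, and observations sit in one flat history that the next step can inspect.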

Self reflection

Our self-reflection loop closely resembles the ReAct framework. For each task or subtask, the agent:

Thinks about what the next best action to complete the task is
Performs an action
Evaluates the results of the action
Determines whether the task has been completed; if not, it repeats the loop
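
The loop above can be sketched in a few lines of Python. This is an illustrative outline under assumptions, not Astra's implementation: `reflection_loop`, `Step`, and the stubbed `think`, `act`, and `is_done` functions are hypothetical names, and the stubs stand in for model calls and real tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str
    action: str
    observation: str


def reflection_loop(
    task: str,
    think: Callable[[str, List[Step]], Tuple[str, str]],
    act: Callable[[str], str],
    is_done: Callable[[str, List[Step]], bool],
    max_iterations: int = 5,
) -> Tuple[bool, List[Step]]:
    """Think, act, record the observation, and check completion until done."""
    history: List[Step] = []
    for _ in range(max_iterations):
        thought, action = think(task, history)       # 1. choose the next action
        observation = act(action)                    # 2. perform it
        history.append(Step(thought, action, observation))  # 3. record the result
        if is_done(task, history):                   # 4. evaluate completion
            return True, history
    return False, history  # gave up after max_iterations


# Hypothetical stubs: a real agent would call a model and real tools here.
def think(task: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return "Try the website first", "search_web"
    return "The website did not work, so call instead", "call_venue"


def act(action: str) -> str:
    return {"search_web": "no online booking found", "call_venue": "booked"}[action]


def is_done(task: str, history: List[Step]) -> bool:
    return history[-1].observation == "booked"


done, steps = reflection_loop("book a table for 5", think, act, is_done)
print(done, [s.action for s in steps])
```

The `max_iterations` cap is an assumption added so the sketch always terminates; the write-up does not say how Astra bounds its loop.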

Memory

All our agents share similar memory principles. Each architecture contains the following few concepts:

Short-Term Memory (STM) or Working Memory: Stores information that the agent currently needs or just the most recent events.
- A list of weighted events. An event is anything that happens, including external signals, thoughts the agent had, actions the agent took, and observations it made from those actions.
- The short-term memory condenses itself whenever it exceeds a certain length. Self-condensation is a moving-window summarization of its current memories.
- At every condensation step, we keep roughly the most recent half of the short-term memory intact and condense only the earlier half. This ensures that the agent's short-term memory doesn't jump and cause the agent to have drastically different thoughts after a memory condensation.
- We also added a weight to each event to signify that certain events are more important and hence should stay in short-term memory longer.
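
A minimal sketch of this condensation scheme is shown below. It is an assumption-laden illustration, not Astra's code: `ShortTermMemory`, `Event`, `keep_weight`, and the `join_summary` placeholder are hypothetical names, the summarizer simply joins text where a real agent would ask a model, and halving the weight of a retained event is one possible way to make important events stay "longer" rather than forever.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    kind: str      # e.g. "signal", "thought", "action", "observation", "summary"
    text: str
    weight: float = 1.0


def join_summary(events: List[Event]) -> str:
    """Placeholder summarizer; a real agent would ask a model to summarize."""
    return "; ".join(e.text for e in events)


class ShortTermMemory:
    def __init__(self, max_events: int = 4, keep_weight: float = 2.0,
                 summarize: Callable[[List[Event]], str] = join_summary) -> None:
        self.max_events = max_events
        self.keep_weight = keep_weight
        self.summarize = summarize
        self.events: List[Event] = []

    def add(self, event: Event) -> None:
        self.events.append(event)
        if len(self.events) > self.max_events:
            self._condense()

    def _condense(self) -> None:
        # Keep the most recent half intact; only the earlier half is condensed.
        half = len(self.events) // 2
        older, recent = self.events[:half], self.events[half:]
        pinned = [e for e in older if e.weight >= self.keep_weight]
        faded = [e for e in older if e.weight < self.keep_weight]
        for e in pinned:
            e.weight /= 2  # assumed decay: heavy events survive longer, not forever
        summary = [Event("summary", self.summarize(faded))] if faded else []
        self.events = summary + pinned + recent


stm = ShortTermMemory(max_events=4)
stm.add(Event("signal", "Ben: karaoke on Friday?"))
stm.add(Event("signal", "Eunice: I am allergic to peanuts", weight=3.0))
stm.add(Event("thought", "Need a room for 5"))
stm.add(Event("action", "GoogleSearch karaoke rooms"))
stm.add(Event("observation", "Found two venues"))
for event in stm.events:
    print(event.kind, "|", event.text)
```

In this run the first message is folded into a summary event, while the heavily weighted allergy message survives the first condensation verbatim.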
Permanent memory:
- Procedural (tools or actions it has access to)
- Fixed context information (e.g. the role of the agent and/or general user preferences)
Long-Term Memory (LTM): Stores all information for the agent to access when needed. It is stored in a vector database and must be explicitly retrieved by the agent when it thinks that past memories could help.

Explicit retrieval from long-term memory is particularly useful when the user brings up a point that needs context from past messages. At that point, the agent's short-term memory might be missing this context, so the agent needs to retrieve it by searching through its past memories.
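
The retrieval step can be illustrated with a toy in-memory store. This sketch makes strong simplifying assumptions: the bag-of-words `embed` function and the `LongTermMemory` class are hypothetical stand-ins for a real embedding model and the vector database mentioned above, and they only demonstrate the idea of ranking past memories by similarity to a query.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class LongTermMemory:
    """Stores every memory and returns the k most similar ones on request."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def store(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


ltm = LongTermMemory()
ltm.store("Eunice is allergic to peanuts")
ltm.store("karaoke booked for 7pm on Friday")
ltm.store("Sam prefers spicy food")

# The agent decides it lacks context and explicitly queries its past memories.
print(ltm.retrieve("what time is karaoke booked", k=1))
```

The important design point is that retrieval is an explicit action: nothing from long-term memory enters the agent's working context unless the agent asks for it.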

Tools

All agents have a concept of tools, or actions they can take. These include:
- Reasoning-like actions, e.g. 'Think', 'Clarify', 'Wait'
- Tool-based actions, e.g. 'CallRestaurant', 'GoogleSearch'
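
One common way to expose such actions to an agent is a name-to-function registry, sketched below. The action names come from the list above, but the `tool` decorator, the `dispatch` helper, and every function body are hypothetical stubs; the real 'GoogleSearch' and 'CallRestaurant' tools would perform actual searches and calls.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Register a function under the action name the agent will emit."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("Think")
def think(thought: str) -> str:
    # Reasoning-like action: no side effects, the thought is just logged.
    return f"thought: {thought}"


@tool("Wait")
def wait(reason: str) -> str:
    return f"waiting: {reason}"


@tool("GoogleSearch")
def google_search(query: str) -> str:
    # Stub only: a real tool would run a web search here.
    return f"[stub] search results for {query!r}"


@tool("CallRestaurant")
def call_restaurant(name: str, request: str) -> str:
    # Stub only: a real tool would place a voice call here.
    return f"[stub] called {name} about {request!r}"


def dispatch(action: str, **kwargs: str) -> str:
    """Run the named action, returning an error string the agent can observe."""
    if action not in TOOLS:
        return f"error: unknown tool {action!r}"
    return TOOLS[action](**kwargs)


print(dispatch("Think", thought="the group size changed"))
print(dispatch("GoogleSearch", query="karaoke rooms near me"))
print(dispatch("BookFlight"))
```

Returning an error string for unknown actions, rather than raising, lets the agent treat a bad tool choice as just another observation to reflect on.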

Technology Stack:

    - Python
    - Vapi AI
    - Playwright
    - Telegram API
    - MongoDB
    - Various GPT models

Development Process

We split the work of developing the system into four main components:
- Building the Main Agent architectures which drive each agent
- Building out tools and integrations for the agents to use
- Building the sub-agents to perform the coordination, booking and navigation
- Integrating all components into a single system
We started by working on the main agent architectures: two different architectures that we intended to use for different sub-agents. At the same time, we were able to set up all the required middleware and start working on the more complex tools (page navigation and booking).
- Building tools from the bottom up allowed us to convert steps that we thought we had to automate into discrete, agent-friendly tools. This allowed us to leverage the power of agents and build some complex tools as agents themselves.
Once we had our main agent architectures, we started building out the agents and integrating the tools we had built. This was an iterative process that allowed us to constantly improve the existing tools and make context-specific adjustments on top of the agent architecture.
- We had some issues here trying to coax the more task-focused agents to pay attention to feedback from previous tasks. This led us to integrate log memory and context into the agent's internal memory.
- However, this also caused the logs to grow too large, diluting the value of each log entry. This led us to integrate log compaction into the agent architecture.
- Lastly, log compaction caused agents to lose some pertinent memories containing important information. To combat this, we integrated three strategies:
  - We integrated a focused memory to ensure important details would be persisted. The agent could identify information which is important to the task and persist it in memory.
  - We assigned a weight to each type of log action to skew how logs are compacted and persist more important information.
  - We introduced a tool to allow the agent to perform a search on chat history to retrieve important information and context.
Finally, we stitched the agents together, allowing them to behave and communicate as a system, and started testing end-to-end flows.

5. CHALLENGES WE RAN INTO

Technical Hurdles:

Allowing the agent to be aware of multiple external signals while trying to achieve a goal, and to use those signals to adapt its current execution process

Often, an AI agent performing its loop to complete its goal runs in a single main thread, where it needs to await the result of each step. However, because we wanted the agent to keep running while continuously monitoring the chat and adjusting its goal based on new context, we had to play around with the architecture a little.
We decoupled the agent's memory log from the execution engine, which allowed the agent to constantly reference new events when deciding whether to proceed with its current course of action.
We forced the agent to break down large tasks into multiple smaller, discrete tasks. This gives the agent more frequent breaks between its steps in which to reference new events while executing actions to reach its goal.
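
The two ideas above, a memory log that lives outside the execution loop and small steps with a check between them, can be sketched with `asyncio`. This is an illustration under assumptions rather than Astra's actual architecture: `EventLog`, `chat_listener`, `run_agent`, and the timings are hypothetical, and the "replan" reaction is reduced to a single log line.

```python
import asyncio
from typing import List


class EventLog:
    """Shared, append-only memory log that lives outside the execution loop."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)

    def since(self, cursor: int) -> List[str]:
        return self.events[cursor:]


async def chat_listener(log: EventLog, messages: List[str], delay: float) -> None:
    """Simulates chat messages arriving while the agent is busy."""
    for message in messages:
        await asyncio.sleep(delay)
        log.append(message)


async def run_agent(log: EventLog, steps: List[str], step_time: float) -> List[str]:
    """Runs small steps and checks the log for new events between them."""
    cursor = 0
    performed: List[str] = []
    for step in steps:
        new_events = log.since(cursor)
        cursor = len(log.events)
        if new_events:
            # A real agent would re-evaluate its goal here instead of stopping.
            performed.append(f"replan after: {new_events[-1]}")
            break
        await asyncio.sleep(step_time)  # stands in for the real work of one step
        performed.append(step)
    return performed


async def demo() -> List[str]:
    log = EventLog()
    steps = ["open booking page", "pick a time slot", "fill in details", "confirm"]
    performed, _ = await asyncio.gather(
        run_agent(log, steps, step_time=0.05),
        chat_listener(log, ["Can we go somewhere else instead?"], delay=0.01),
    )
    return performed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The smaller each step is, the sooner a new message is noticed, which is the motivation given above for breaking large tasks into discrete pieces.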

Trying to get the agent to recognize contextual and social cues in a group chat setting to determine when it should offer assistance

We experimented with multiple prompts and structures to tune the behaviour of the agent.
We structured the agent's memory as an event-based log and allowed it to learn from feedback based on its interactions with various groups. We then allowed the agent to condense its memory to solidify its learning.
We also integrated entity memory into the agent's memories to allow it to better understand and deal with different personalities in the group chat.

Navigating complex booking webpages and search flows

We created a separate task-focused agent architecture to perform booking and search tasks. This allows the agent to generate and re-generate subtasks and actions to navigate the complexity of different scenarios.
We designed our booking management and navigation tool as a separate agent with its own memory to allow it to dynamically navigate complex HTML pages and find the optimal sequence of steps to perform a booking, regardless of layout.
We integrated a variety of tools for the agent to perform different actions to achieve its aim, such as page navigation, search, and voice calling. This allowed the agent to recover from failures by trying different strategies, or to find information via different avenues.
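
Recovering from failure by switching strategies can be sketched as an ordered fallback. Everything here is hypothetical and simplified: `book_with_fallback`, the strategy names, and the stub functions are invented for illustration, and in Astra the choice of the next strategy is made by the agent's own reasoning rather than by a fixed order.

```python
from typing import Callable, List, Optional, Tuple

Strategy = Callable[[str], bool]


def book_with_fallback(
    request: str, strategies: List[Tuple[str, Strategy]]
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each strategy in turn; return the one that worked plus an attempt log."""
    attempts: List[Tuple[str, str]] = []
    for name, strategy in strategies:
        try:
            succeeded = strategy(request)
        except Exception as exc:  # a failed tool is an observation, not a crash
            attempts.append((name, f"error: {exc}"))
            continue
        attempts.append((name, "success" if succeeded else "failed"))
        if succeeded:
            return name, attempts
    return None, attempts  # nothing worked: report the status to the user


# Hypothetical stubs standing in for page navigation and voice calling.
def book_online(request: str) -> bool:
    raise RuntimeError("no booking form found")


def call_venue(request: str) -> bool:
    return True


winner, log = book_with_fallback(
    "table for 5 at 7pm",
    [("WebBooking", book_online), ("CallRestaurant", call_venue)],
)
print(winner)
print(log)
```

Keeping the attempt log matters as much as the result: it is what lets the agent explain to the group what it tried when every avenue fails.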

Project Management:

Some of us had trouble with the hackathon being on a weekday, as most of us have full-time jobs.

6. ACCOMPLISHMENTS THAT WE’RE PROUD OF

We successfully created multiple agent architectures to fit multiple types of agents, allowing them to achieve their goals in ways which fit the task.
We successfully created and deployed our agent in a group chat setting via Telegram integration and trained it to be a useful member of the group. In one instance, the agent hilariously determined that one group participant was being hostile and that it should try to defuse the situation.
We were able to create an agent capable of performing the end-to-end booking flow via multiple avenues, without any human intervention.
Our agent actually managed to call a real restaurant (Spicy King) and secure a reservation for 5 for Wednesday June 5th at 7pm.