Inspiration

As we go about our days, we create massive amounts of unstructured data. Whether we are students taking lecture notes, pedestrians photographing flyers, or engineers writing code, we are constantly generating data.

These massive, unstructured data lakes might exist as sales call reports at the bottom of your desk or a folder of screenshots sitting in your phone. However, if we could transform this unstructured data into structured, queryable databases, we would open up a whole new world of data analysis and visualization.

Even better, we can augment generative models with this previously unstructured, multimodal data that tends to be locked away from them, giving users greater control over the data accessible to their virtual assistants and the ability to improve it iteratively.

What it does

Our application takes multimodal data lakes (photos, text files, PDFs, and more) and, through an intelligent, dynamic conversation with the user, generates a structure that makes this data most useful to the user and to their model. From there, it creates a queryable database that is ready out of the box for data analysis and visualization.

This queryable database serves as high-fidelity memory for a retrieval-augmented generation (RAG) model. By providing a condensed, structured version of the user's data to the model, we transform the black box of retrieval into intelligent queries against structured databases.

How we built it

Our application consists of two segments: memory generation and agent action.

Memory Generation

A user uploads a series of files into our application. If the files are photos, we use Google's Pix2Struct model to create highly descriptive, text-based versions of each photo, extracting the important information.
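The photo-to-text step might look roughly like the sketch below, which runs Pix2Struct through Hugging Face's transformers library. The checkpoint name and generation settings are assumptions on our part; the actual hackathon code may differ.

```python
def photo_to_text(image_path):
    """Convert a photo or screenshot into a descriptive text rendering
    using Google's Pix2Struct (loaded lazily via transformers).
    The checkpoint below is an assumed captioning variant."""
    from PIL import Image
    from transformers import (Pix2StructForConditionalGeneration,
                              Pix2StructProcessor)

    name = "google/pix2struct-textcaps-base"
    processor = Pix2StructProcessor.from_pretrained(name)
    model = Pix2StructForConditionalGeneration.from_pretrained(name)

    # Preprocess the image and generate a text description of its contents.
    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

The returned description then joins the user's text-native files as ordinary training input for the structuring step.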

The user then picks training files, which can be either their text-native files or their text-converted photos (now descriptions). In conversation with the GPT-4 LLM, the user and the LLM select an appropriate structure for the data, defining fields by which a group of files can be condensed. For example, if a user uploads twenty screenshots of tax reports, the model and user might generate region, name, date, and other useful queryable fields. We built this on a Flask back end that calls OpenAI's GPT API, adaptively managing the context window to ensure GPT produces useful fields.
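The context-window management might be sketched as follows. This is a simplified, stdlib-only illustration: it trims each file's excerpt so the combined prompt fits a budget (counted in characters here for simplicity, where a real implementation would count tokens), and the resulting prompt would then be sent to GPT-4 via the OpenAI API. The function name and wording are our own.

```python
def build_schema_prompt(file_texts, budget=4000):
    """Assemble the field-generation prompt, trimming each file's excerpt
    so the combined prompt stays within a rough context budget.
    file_texts: list of strings (text-native files or photo descriptions)."""
    header = ("You are designing a database schema. Given these documents, "
              "propose concise, queryable field names.\n\n")
    # Split the remaining budget evenly across files; always keep at least
    # one character per file so no document is dropped entirely.
    per_file = max((budget - len(header)) // max(len(file_texts), 1), 1)
    excerpts = [text[:per_file] for text in file_texts]
    return header + "\n---\n".join(excerpts)
```

In the real pipeline, the user and GPT-4 then iterate on the proposed fields conversationally until the schema fits the data.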

From there, GPT iterates through each file, searching for useful information to condense into each field. Once it has finished, we provide the user with a JSON or Excel file ready to be queried, visualized, and analyzed.
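The condensation loop can be illustrated with a small stand-in. In the real pipeline, each file's text goes to GPT along with the agreed field list and GPT returns one value per field; here the GPT call is replaced with a trivial "field: value" line parser so the data flow is visible. All names are illustrative.

```python
import json

EXAMPLE_FIELDS = ["region", "name", "date"]  # fields from the conversation step

def extract_fields(text, fields):
    """Stand-in for the GPT extraction call. The real pipeline prompts GPT
    with the file text and field list; this placeholder just scans for
    'field: value' lines."""
    row = {f: None for f in fields}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() in row and value.strip():
            row[key.strip().lower()] = value.strip()
    return row

def condense(files, fields=EXAMPLE_FIELDS):
    """Condense every file into one row per file and emit the queryable JSON."""
    rows = [extract_fields(text, fields) for text in files]
    return json.dumps(rows, indent=2)
```

The same rows could equally be written to an Excel sheet (e.g., via openpyxl) instead of JSON.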

Retrieval-Augmented Generative Action

Our application also allows the user to interact with a model, providing the model with the database it needs to respond intelligently to prompts. The user can ask questions related to the memory they have uploaded to the application, and the application will use the JSON file and the user's prompting to generate a response.
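This is where structured retrieval replaces opaque vector search: matching rows are selected from the JSON database by field, then handed to the model as grounded context. A minimal sketch, with hypothetical function names:

```python
import json

def retrieve(rows, **filters):
    """Select rows whose fields match the query exactly. Because the data is
    structured, retrieval is a transparent filter rather than a black box."""
    return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

def build_rag_prompt(question, matching_rows):
    """Pack the retrieved rows into the prompt as grounded context;
    the result would be sent to GPT-4 to generate the answer."""
    context = json.dumps(matching_rows)
    return f"Using only this data:\n{context}\n\nAnswer: {question}"
```

A query like "which reports came from the EU?" thus becomes a `retrieve(rows, region="EU")` call whose results the model can cite directly.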

Challenges we ran into

Generation of Fields

Generating fields and structuring data is a difficult problem to solve in a short amount of time. Our JSON representations of previously unstructured data are not always the cleanest, but by continually reordering the context window given to GPT, we can significantly improve the output.

Photo Upload

It was difficult to find a framework that would let us extract useful data from photos and make it usable by the GPT API. However, we were able to use Google's Pix2Struct to do so.

Accomplishments that we're proud of

We are proud of our application's ability to adaptively structure previously unstructured data. We are also extremely excited that the model can access JSONs and even processed photos, competently applying this information to help the user.

What we learned

We learned about the surprising ability of GPT to interact intelligently with large and complex multimodal inputs and their textual representations. We also got a ton of practice interacting with various models and cutting-edge AI frameworks.

What's next for Icarus

We would like to improve the database functionality and add real, integrated analysis and visualization tools to take the onus off the user.
