Inspiration

We’re a team of developers and hackers who love tinkering with new technologies. Naturally, this means we have been excited about building projects in the ML space, especially applications involving LLMs. However, we identified two crucial issues with using LLMs in production applications. First, LLMs hallucinate, confidently responding with incorrect information. Second, we cannot explain on what basis an LLM gives its answer. This is why essentially all production LLM applications use retrieval-augmented generation (RAG). With RAG, you supply the LLM with relevant, factual, and citable information, significantly increasing the quality of its responses. Our initial idea for Treehacks was to build an app based on such a system: a fully automated literature review. Yet, when building the system, we spent most of our time sourcing, cleaning, processing, embedding, and maintaining data for the retrieval process.

After talking with other developers, we realized that this is a significant hurdle many in the AI community face: the LLM app ecosystem provides robust abstractions for most parts of the backend infrastructure, yet it falls short on the critical data component needed for retrieval. This gap makes embedding data into vector databases a slow, expensive, and arduous part of building RAG applications. The challenge of sourcing, embedding, and maintaining data, with its high costs and slow processing times, threw us off our initial course and became the issue we were determined to solve.

We observed that most RAG applications require similar types of data, such as legal documents, health records, research papers, news articles, educational material, and books. Each time developers create a RAG application, they find themselves having to reinvent the wheel to populate their vector databases—collecting, pre-processing, and managing data instead of focusing on the actual application development.

To solve this problem, we built an API that lets developers retrieve relevant data for their AI/LLM application without collecting, preprocessing, or managing it themselves. Our tool sits between developers and vector databases, abstracting away all the complexity of sourcing and managing data for RAG applications. This lets developers focus on what they do best: building applications. Our solution also addresses a critical mismatch: the vast amount of data developers must preprocess versus how little they actually use. Given the steep prices of embedding models, developers pay to embed all the data they ingest, regardless of how much is ultimately queried. Our experience suggests that a small subset of the embedded data is queried frequently, while the vast majority is never read. Blanket eliminates this financial burden for developers.

Finally, we are also building the infrastructure to process and embed unstructured data, giving developers access to ten times the amount of data that they previously could harness, significantly enhancing the capabilities of their applications. For example, until now only the abstracts of ArXiv research papers had been embedded, as the full papers are stored in difficult-to-process PDF files. Over the course of Treehacks, we were able to embed the actual paper content itself, unlocking an incredible wealth of knowledge.

In the current RAG development stack, despite the abstractions provided by tools like Langchain, open-source vector databases like Chroma, and APIs to LLM models, collecting relevant data remains the last significant hurdle for developers building AI/LLM applications. Blanket emerges as the final piece of this puzzle, offering an API that lets developers query the data they need with a single line of code, streamlining development and significantly reducing overhead.

We want to emphasize that this is not a theoretical solution; we have demonstrated its efficacy. For our demo, we built an application that automatically generates a literature review from a research question, using Langchain and the Blanket API. Achieved in merely six lines of code, this showcases the power and efficiency of our solution, making Blanket a groundbreaking tool for developers in the AI space.

What it does

Blanket is an API that lets developers retrieve relevant data for their AI/LLM application. We are a developer tool that sits between developers and vector databases (such as ChromaDB and Pinecone), abstracting away all the complexity of sourcing and managing data for RAG applications. We aim to embed large, high-quality, citable datasets, built from both structured and unstructured data, from major verticals (legal, health, education, research, news, ...) into vector databases such as Chroma. Our service will ensure that the data is up-to-date, accurate, and citable, freeing developers from the tedious work of data management.

During Treehacks, we embedded the full contents and abstracts of around 20,000 computer-science-related ArXiv research papers (limited by time and cost constraints). We built an easy-to-use API that lets users query our databases in their AI/LLM application, removing the need for them to deal with data themselves. Finally, we realized our original idea, an app that generates an academic literature review for a research question, using the Blanket API with only 6 lines of code.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from blanket.utils.api import Blanket

def get_lit_review(query):

    prompt_template = """
    Create a 600 word literature review on the following topic: {query}.

    Use the following papers and context. Cite the authors and the title of the paper when you quote. 
    Only use the context that is relevant, dont add a references section or a title just the review itself.
    Context: \n\n
    {context}
    """

    prompt = ChatPromptTemplate.from_template(prompt_template)
    model = ChatOpenAI()
    chain = prompt | model

    context = Blanket().get_research_data_sort_paper(query, num_results=10)

    return chain.invoke({"query": query, "context": context}).content, context

The API we have built currently only supports querying data related to research papers. Below are the three user-facing functions, each returning the data in a different format so that developers can choose the one best suited to their application.

    def get_research_data(self, query: str, num_results: int = 10) -> list[dict]:
        """
        Retrieves research data based on a specified query, formatted for client-facing applications.

        This method conducts a search for research papers related to the given query and compiles
        a list of relevant papers, including their metadata. Each item in the returned list represents
        a single result, formatted with both the textual content found by the configured Vector DB and structured
        metadata about the paper itself.

        Parameters:
        - query (str): The query string to search for relevant research papers.
        - num_results (int, optional): The number of results to return. Defaults to 10.

        Returns:
        - list[dict]: A list where each element is a dictionary containing:
            - "text": The textual content related to the query as found by the Vector DB, which may include
            snippets from the paper or generated summaries.
            - "meta": A dictionary of metadata for the paper, including:
                - "title": The title of the paper.
                - "authors": A list or string of the paper's authors.
                - "abstract": The abstract of the paper.
                - "source": A URL to the full text of the paper, typically pointing to a PDF on arXiv.

        The return format is designed to be easily used in client-facing applications, where both
        the immediate context of the query's result ("text") and detailed information about the source
        ("meta") are valuable for end-users. This method is particularly useful for applications
        requiring quick access to research papers' metadata and content based on specific queries,
        such as literature review tools or academic search engines.

        Example Usage:
        >>> api = YourAPIClass()
        >>> research_data = api.get_research_data("deep learning", 5)
        >>> print(research_data[0]["meta"]["title"])
        "Title of the first relevant paper"

        Note:
        Multiple elements of the list may relate to the same paper, to return results batched by paper
        please use the `get_research_data_sort_paper` method instead.
        """

    def get_research_data_sort_paper(self, query: str, num_results: int = 10) -> dict[str, dict]:
        """
        Retrieves and organizes research data based on a specified query, with a focus on sorting
        and structuring the data by paper ID.

        This method searches for research papers relevant to the given query. It then organizes
        the results into a dictionary, where each key is a paper ID, and its value is another
        dictionary containing detailed metadata about the paper and its contextual relevance
        to the query.

        Parameters:
        - query (str): The query string to search for relevant research papers.
        - num_results (int, optional): The desired number of results to return. Defaults to 10.

        Returns:
        - dict[str, dict]: A nested dictionary where each key is a paper ID and each value is a
        dictionary with the following structure:
            - "title": The title of the research paper.
            - "authors": The authors of the paper.
            - "abstract": The abstract of the paper.
            - "source": A URL to the full text of the paper, typically pointing to arXiv.
            - "context": A dictionary where each key is an index (starting from 0) and each value
            is a text snippet or summary relevant to the query, as found in the paper or generated.

        This structure is especially useful for client-facing applications that require detailed
        information about each paper, along with contextual snippets or summaries that highlight
        the paper's relevance to the query. The `context` dictionary within each paper's data allows
        for a granular presentation of how each paper relates to the query, facilitating a deeper
        understanding and exploration of the research landscape.

        Example Usage:
        >>> api = YourAPIClass()
        >>> sorted_research_data = api.get_research_data_sort_paper("neural networks", 5)
        >>> for paper_id, paper_info in sorted_research_data.items():
        >>>     print(paper_info["title"], paper_info["source"])
        "Title of the first paper", "https://arxiv.org/pdf/paper_id.pdf"
        """

    def get_research_data_easy_cite(self, query: str, num_results: int = 10) -> list[str]:
        """
        Generates a list of easily citable strings for research papers relevant to a given query.

        This method conducts a search for research papers that match the specified query and formats
        the key information about each paper into a citable string. This includes the title, authors,
        abstract, and a direct source link to the full text, along with a relevant text snippet or
        summary that highlights the paper's relevance to the query.

        Parameters:
        - query (str): The query string to search for relevant research papers.
        - num_results (int, optional): The desired number of results to return. Defaults to 10.

        Returns:
        - list[str]: A list of strings, each representing a citable summary of a research paper.
        Each string includes the paper's title, authors, abstract, source URL, and a relevant
        text snippet. This format is designed to provide a quick, comprehensive overview suitable
        for citation purposes in academic or research contexts.

        Example Usage:
        >>> api = YourAPIClass()
        >>> citations = api.get_research_data_easy_cite("deep learning", 5)
        >>> for cite in citations:
        >>>     print(cite)
        Paper title: [Title of the Paper]
        Authors: [Authors List]
        Abstract: [Abstract Text]
        Source: [URL to the paper]
        Text: [Relevant text snippet or summary]
        """

How we built it

We built our solution by blending innovative tech (such as vectorDBs), optimization techniques, and a seamless design for developers. Here’s how we pieced together our project:

1. Cloud Infrastructure We established our cloud infrastructure by creating two Azure cloud instances. One instance is dedicated to continuously managing the embedding process, while the other manages the deployed vector database.

2. Vector Database Selection For our backend database, we chose Chroma DB. This decision was driven by Chroma DB's compatibility with our goals and ethos of seamless developer tooling. Chroma DB serves as one of the backbone tools of our system, storing the embedded databases and enabling fast, reliable retrieval of embedded information.

3. Embedding Model We embed documents using VoyageAI’s voyage-lite-02-instruct model. We selected it for its strong semantic similarity performance on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, while this model offers superior accuracy, it comes with higher costs and slower embedding times, a trade-off we accepted for the sake of quality.

4. Data Processing and Ingestion Pipeline With our infrastructure in place, we focused on building a robust data processing and ingestion pipeline. Written in Python, this pipeline is responsible for collecting, processing, embedding, and storing the academic papers into our database. This step was crucial for automating the data flow and ensuring our database remains extensive and comprehensive.
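The pipeline's stages can be sketched as plain functions. This is a simplified illustration, not our exact implementation: the function names are ours, the extract and chunking logic is stubbed, and in the real pipeline `embed` and `store` wrap the VoyageAI embedding API and Chroma respectively.

```python
def extract_text(pdf_bytes: bytes) -> str:
    """Stub: a real implementation would parse the PDF here."""
    return pdf_bytes.decode("utf-8", errors="ignore")

def chunk_text(text: str, size: int = 500) -> list[str]:
    """Split extracted text into fixed-size chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest_paper(pdf_bytes: bytes, meta: dict, embed, store) -> int:
    """Run one paper through extract -> chunk -> embed -> store.

    `embed` maps a list of chunks to vectors; `store` persists one
    (chunk, vector, metadata) triple. Returns the number of chunks stored.
    """
    chunks = chunk_text(extract_text(pdf_bytes))
    vectors = embed(chunks)
    for chunk, vec in zip(chunks, vectors):
        store(chunk, vec, meta)
    return len(chunks)
```

Keeping each stage behind a small function boundary made it easy to swap chunking strategies and embedding backends while we experimented.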

5. Optimization Techniques We also optimized our data processing. By leveraging a wide array of systems optimization techniques, including batch processing and parallelization, we ensured our infrastructure could handle large volumes of data efficiently. These techniques allowed us to maximize our system's performance and speed, laying the groundwork for quickly processing new data.
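The batch-plus-threads pattern looks roughly like this (a minimal sketch; the parameter values are illustrative, and `embed_batch` stands in for the call to the embedding API):

```python
from concurrent.futures import ThreadPoolExecutor

def embed_in_batches(chunks, embed_batch, batch_size=64, workers=4):
    """Embed chunks in fixed-size batches across a thread pool.

    `embed_batch` is the I/O-bound embedding API call; running batches
    concurrently hides network latency, while batching keeps each
    request within per-call size limits.
    """
    batches = [chunks[i:i + batch_size]
               for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(embed_batch, batches))
    # Flatten per-batch results back into one list, preserving input order.
    return [vec for batch in results for vec in batch]
```

Since embedding is network-bound rather than CPU-bound, threads (rather than processes) were enough to keep many requests in flight at once.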

6. Literature Review Demo App The culmination of our efforts is the literature review demo application. Utilizing our API and integrating with Langchain, we developed an application capable of generating accurate, high-quality literature reviews for research questions in a matter of seconds. This demonstration not only showcases the power of our API but also the practical application of our system in academic research.

7. Frontend Development Finally, to make our application accessible and user-friendly, we designed a simple yet effective frontend using HTML. This interface allows users to interact with our demo app easily, submitting research questions and receiving comprehensive literature reviews in return.

Challenges we ran into

Over the course of this project, we ran into a few challenges:

1. Optimizing chunking and retrieval accuracy. In order to ensure accurate and relevant retrieval of data, we needed to choose smart chunking strategies. We thus had to experiment with many different strategies, measure which ones performed better compared to others, and ultimately make a decision based on data we collected.
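One of the strategies we compared is simple sliding-window chunking with character overlap, sketched below (the sizes are illustrative; our production values came out of the experiments described above). The overlap ensures a sentence split at a chunk boundary still appears intact in at least one chunk, which helps retrieval recall.

```python
def chunk_with_overlap(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Sliding-window chunking: consecutive chunks share `overlap`
    characters so content near a boundary is never lost to the split."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```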

2. Dealing with embedding models. A crucial part of the system is the generation of embeddings for data. However, most high-quality embedding models are run through APIs. This makes them expensive. In addition, accessing these embedding APIs is at times very slow.

3. Dealing with PDFs. As PDFs use specific encoding formats, extracting and processing data from PDFs is not straightforward. We had to deal with quite a few error cases and had to find ways to filter for badly-formatted data. This took more time and effort than we had initially expected.
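The filtering step can be sketched as a simple heuristic like the one below (the thresholds here are illustrative, not the exact values we tuned): pages dominated by symbols, ligature debris, or near-empty extraction output are dropped before embedding.

```python
def looks_well_formed(page_text: str, min_alpha_ratio: float = 0.6,
                      min_length: int = 200) -> bool:
    """Heuristic filter for badly extracted PDF pages: reject pages that
    are too short or whose letter/whitespace ratio is too low."""
    if len(page_text) < min_length:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in page_text)
    return alpha / len(page_text) >= min_alpha_ratio
```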

4. Deploying the database. In order to access our database through our API, we deployed Chroma on Azure, running it in a Docker container. However, the database crashed twice due to memory constraints, causing us to lose our generated embeddings. So we figured out how to persist the database to disk by directly inspecting ChromaDB’s source code.

Accomplishments that we're proud of

1. Embedding full-text ArXiv papers. We are the first team to embed the full texts of thousands of ArXiv papers into a widely accessible database. We believe that this can have a wide range of use cases, from application development, to education and academic research.

2. Pivoting during the Hackathon. We successfully pivoted from creating an LLM application to building a developer tool after identifying a key point of friction in the development pipeline. Ultimately, we were able to recreate our initial application in just 6 lines of code on top of our new API.

3. Optimizing our code. When we initially created our data processing and embedding pipeline, it was fairly slow. However, through a combination of systems optimizations, we were able to achieve 10x speedups over our original approach.

4. Creating cloud architecture. We built and configured a server to run the ArXiv embedding pipeline in perpetuity until all papers are embedded. In addition, we created a different server that fully manages our backend database infrastructure.

What we learned

Over the course of Treehacks, we learned a tremendous amount about the development process of LLM applications. We delved deeply into the tradeoffs between different tools and architectures; the wide variety of technical requirements in our own project let us explore these tradeoffs firsthand. We also gained experience applying optimization strategies: on one hand, we optimized our data processing at the systems level; on the other, we improved retrieval accuracy by experimenting with different chunking and embedding strategies. Overall, we have gained a much greater appreciation for the problem of data management in RAG-based applications and for the LLM application ecosystem as a whole.

What's next for Blanket.ai

After Treehacks, we want to start working closely with LLM application developers to better understand their data and infrastructure needs. In addition, we plan to embed the full texts of all of ArXiv’s (approximately 3 million) research papers into a database accessible through our API to any developer. To do so, we aim to make the API production-ready, decreasing response times, increasing throughput capabilities, and releasing documentation. Furthermore, we want to spread the word about the Blanket API by advertising on forums and developer meetups. Finally, we aim to build widely-available databases for data in other verticals, such as legal, health, and education.
