ClauseAI is a document processing and querying application that uses AI-powered tools to extract metadata, vectorize content, and answer questions about uploaded documents. It combines OpenAI models with the Qdrant vector database to store, search, and query documents.
Follow the steps below to set up and run ClauseAI locally. You will need:
- Python 3.10 or higher
- Virtual environment manager (optional but recommended)
- Clone the Repository:
    git clone <repository_url>
    cd ClauseAI
- Set up a Virtual Environment:
    python -m venv virtual
    source virtual/bin/activate  # On Windows, use virtual\Scripts\activate
- Install Dependencies:
    pip install -r requirements.txt
- Set Environment Variables: Create a .env file in the root directory with the following content:
    QDRANT_URL=<your_qdrant_url>
    QDRANT_API_KEY=<your_qdrant_api_key>
    OPENAI_API_KEY=<your_openai_api_key>
- Run the Application:
    streamlit run workflow.py
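The three variables above have to be visible to the application at startup. The sketch below shows one way to fail fast when any of them is missing. It assumes the values from .env have already been loaded into the process environment (for example by a loader such as python-dotenv, which is an assumption and not something this README confirms); `load_settings` and `REQUIRED_KEYS` are hypothetical names, not part of ClauseAI.

```python
import os

# The variable names come from the .env template above.
REQUIRED_KEYS = ("QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY")


def load_settings(env=None):
    """Return the required settings, raising if any is missing or empty."""
    # Assumption: the .env values are already present in the environment.
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Calling `load_settings()` once before any client is created turns a misconfigured .env into a single clear error message rather than a failure deep inside a request.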
ClauseAI consists of two main functionalities, document processing and document querying.
Document processing:
- Upload a PDF document.
- Convert the document to Markdown format and extract metadata.
- Generate vector embeddings for the document content and store them in Qdrant.
- Extract entities using GPT-4 for metadata enrichment.
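The vectorization step can be sketched roughly as follows: split the Markdown text into overlapping chunks, embed each chunk, and attach a payload so the chunk can be traced back to its document. This is a minimal sketch, not ClauseAI's actual code. The chunk sizes, the payload fields, and the names `chunk_text` and `build_points` are all assumptions, and `embed` is an injected callable standing in for a call to the OpenAI embeddings API.

```python
import uuid


def chunk_text(text, size=500, overlap=50):
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop early so the last window is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def build_points(doc_id, markdown, embed, size=500, overlap=50):
    """Pair each chunk with its vector and a payload, ready to be stored."""
    chunks = chunk_text(markdown, size, overlap)
    vectors = embed(chunks)  # assumed to return one vector per chunk, in order
    return [
        {
            "id": str(uuid.uuid4()),
            "vector": vector,
            # Hypothetical payload layout; the real schema is not documented here.
            "payload": {"doc_id": doc_id, "chunk_index": i, "text": chunk},
        }
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
```

The resulting dictionaries carry the three things a vector store such as Qdrant needs per point (an ID, a vector, and a payload), so they could be converted to the store's own point type and upserted into a collection.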
Document querying:
- Select a processed document by its ID.
- Query the document using two mechanisms:
  - Qdrant: Fetch the most relevant context chunks.
  - LLM: Refine the Qdrant output using OpenAI GPT-4 for a natural-language response.
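The two query mechanisms form a retrieve-then-refine pipeline. The sketch below illustrates the idea with an in-memory cosine-similarity search standing in for Qdrant's vector search, followed by assembly of the prompt that would be sent to the LLM. Everything here is an assumption for illustration: the point layout (dicts with `vector` and `payload` keys), the prompt wording, and the function names are hypothetical and not taken from ClauseAI.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_chunks(query_vector, points, doc_id, k=3):
    """Return the text of the k chunks of `doc_id` most similar to the query."""
    # Restrict the search to the selected document, then rank by similarity.
    candidates = [p for p in points if p["payload"]["doc_id"] == doc_id]
    ranked = sorted(
        candidates,
        key=lambda p: cosine(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["payload"]["text"] for p in ranked[:k]]


def build_prompt(question, chunks):
    """Assemble a prompt asking the LLM to answer from the retrieved context."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Keeping retrieval and prompt construction separate makes it possible to inspect the retrieved chunks on their own, which matches the two outputs described above: the raw context from the vector search and the refined natural-language answer.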
Key features:
- PDF to Markdown Conversion: Extracts textual content and metadata from uploaded PDF documents.
- Vectorization: Converts document content into vector embeddings using OpenAI embeddings and stores them in Qdrant for efficient querying.
- Entity Extraction: Uses GPT-4 to identify and extract key entities in the document.
- Intelligent Querying: Combines Qdrant's vector search and GPT-4's natural language understanding to deliver detailed query responses.
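Entity extraction with an LLM usually comes down to two pieces: a prompt that requests structured output, and a parser that tolerates small formatting variations in the reply. The sketch below shows one possible shape for both. The entity types (`parties`, `dates`), the prompt text, and the function names are hypothetical; the README does not specify which entities ClauseAI extracts or how the reply is parsed.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap around JSON


def build_entity_prompt(text):
    """Ask the model for a JSON object mapping entity types to lists of strings."""
    return (
        "Extract the key entities from the document below. "
        'Reply with a JSON object such as {"parties": [], "dates": []}.\n\n'
        + text
    )


def parse_entities(reply):
    """Parse the model reply, tolerating a code fence around the JSON."""
    cleaned = reply.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only entries whose value is a list, as the prompt requests.
    return {key: list(values) for key, values in data.items() if isinstance(values, list)}
```

A reply that is not valid JSON raises `json.JSONDecodeError` (a subclass of `ValueError`), so the caller can catch `ValueError` and either retry the request or store the document without enriched metadata.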
We welcome contributions to ClauseAI! Please fork the repository and create a pull request with your changes.