Inspiration
Our team was inspired by how unintelligible and opaque public information about defense contracts is to most people who want to learn more about it. The data always seems to be buried in obscure locations and concealed behind jargon. Yet for anyone trying to understand how these contracts affect financial products, this information needs to be easily accessible and digestible. The DoD, with an annual budget of $1.6 trillion, is one of the most influential organizations in the world. Hence, this project.
What it does
Defense Insights is a tool that automatically extracts, analyzes, and visualizes critical data on US defense contracts. It parses over 2,500 web pages to extract all available historical data on contracts signed by the Department of Defense. Using a GPT-3.5 Turbo model, it analyzes the text and identifies nine critical data points in each contract: company name, location, federal agency, date, contract value, contract type, purpose, completion date, and contract number. This information is collected from over 30,000 contracts, then summarized and quantified.
From over 2,500 web pages' worth of unstructured contract information to structured data.
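To picture the structured output, each contract reduces to a record with those nine fields. A minimal sketch (field names and sample values are illustrative, not the project's actual schema or real contract data):

```python
from dataclasses import dataclass

@dataclass
class ContractRecord:
    """One structured record extracted from an unstructured contract announcement."""
    company_name: str
    location: str
    federal_agency: str
    date: str
    contract_value: float
    contract_type: str
    purpose: str
    completion_date: str
    contract_number: str

# Illustrative values only, not real contract data.
record = ContractRecord(
    company_name="Example Corp",
    location="Arlington, VA",
    federal_agency="Navy",
    date="2024-01-15",
    contract_value=12_500_000.0,
    contract_type="firm-fixed-price",
    purpose="engineering services",
    completion_date="2026-01-15",
    contract_number="N00014-24-C-0001",
)
```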
How we built it
Web scraping: We deployed Python scripts with Selenium and BeautifulSoup to scrape the dynamic pages and capture the latest contract documents. We also implemented a polling routine that checks every three hours whether the website has been updated.
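The change-detection loop can be sketched like this (the listing URL, hashing approach, and link-extraction details are assumptions, not the project's actual code):

```python
import hashlib
import time

import requests
from bs4 import BeautifulSoup

# Assumed entry point for the listing of contract announcements.
LISTING_URL = "https://www.defense.gov/News/Contracts/"

def fingerprint(html: str) -> str:
    """Hash a page body so we can tell whether it changed between polls."""
    return hashlib.sha256(html.encode()).hexdigest()

def extract_links(html: str) -> list[str]:
    """Pull announcement links out of a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

def poll_for_updates(url: str = LISTING_URL, interval_hours: float = 3.0) -> None:
    """Re-check the site on an interval and scrape only when it changes."""
    last = None
    while True:
        html = requests.get(url, timeout=30).text
        current = fingerprint(html)
        if current != last:
            last = current
            for link in extract_links(html):
                pass  # hand each new announcement off to the parsing pipeline
        time.sleep(interval_hours * 3600)
```

Hashing the page body is one cheap way to detect updates; comparing a last-modified header or the newest announcement date would work as well.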
AI Data Parsing: We used a custom-trained GPT model to intelligently parse and structure raw text into defined categories like amounts, dates, and company names.
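The parsing step amounts to prompting the model for a JSON object with the nine expected keys and validating what comes back. A sketch under those assumptions (the exact prompt wording and validation are ours, not the project's):

```python
import json

# The nine fields the pipeline extracts from each contract.
FIELDS = [
    "company_name", "location", "federal_agency", "date", "contract_value",
    "contract_type", "purpose", "completion_date", "contract_number",
]

def build_prompt(raw_text: str) -> str:
    """Ask the model to return the nine fields as a single JSON object."""
    return (
        "Extract the following fields from this defense contract announcement "
        f"and reply with one JSON object using exactly these keys: "
        f"{', '.join(FIELDS)}.\n\n" + raw_text
    )

def parse_response(reply: str) -> dict:
    """Validate the model reply: it must be JSON containing all nine keys."""
    data = json.loads(reply)
    missing = [f for f in FIELDS if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data

# The model call itself would wrap these two helpers (sketch; requires the
# openai package and an API key):
# client = openai.OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_prompt(raw_text)}],
# ).choices[0].message.content
# contract = parse_response(reply)
```

Validating every reply matters here because, across tens of thousands of contracts, the model will occasionally drop a field or return malformed JSON.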
Databases: We used Neon, a serverless managed Postgres database, to ingest the parsed JSON files and build the database.
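Loading the parsed JSON into Postgres reduces to mapping each contract dict onto a fixed column order. A sketch (table and column names are assumptions; the connection snippet is commented out because it needs a live Neon instance):

```python
# Column order for the (assumed) contracts table.
FIELDS = (
    "company_name", "location", "federal_agency", "date", "contract_value",
    "contract_type", "purpose", "completion_date", "contract_number",
)

INSERT_SQL = (
    "INSERT INTO contracts (" + ", ".join(FIELDS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(FIELDS)) + ")"
)

def to_row(parsed: dict) -> tuple:
    """Order a parsed-contract dict into the column order the INSERT expects."""
    return tuple(parsed.get(f) for f in FIELDS)

# With a Neon connection string (sketch; env var name is an assumption):
# import os, psycopg2
# conn = psycopg2.connect(os.environ["NEON_DATABASE_URL"])
# with conn, conn.cursor() as cur:
#     cur.execute(INSERT_SQL, to_row(parsed))
```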
Data Visualization: Rendered dynamic visualizations of the relevant data points using Django.
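Before anything is rendered, the per-contract rows get aggregated into chart-ready summaries. A minimal sketch of one such aggregation (the grouping choice is ours for illustration):

```python
from collections import defaultdict

def total_value_by_agency(contracts: list[dict]) -> dict[str, float]:
    """Sum contract values per federal agency, e.g. to feed a bar chart."""
    totals: defaultdict[str, float] = defaultdict(float)
    for c in contracts:
        totals[c["federal_agency"]] += c["contract_value"]
    return dict(totals)

# In a Django view, a dict like this would be passed into the template
# context or serialized with JsonResponse for a chart library to render.
```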
Challenges we ran into
Our IP address was flagged by the DoD website for repeatedly making requests while the web scraper ran: the program was trying to access about 2,500 URLs on the site back-to-back, and we were temporarily blocked from accessing it.
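The fix for this kind of block is to pace the requests. A hedged sketch of a politer fetch loop (the delay and retry parameters are guesses, not tuned values from the project):

```python
import random
import time

import requests

def backoff_seconds(attempt: int) -> float:
    """Exponential backoff between retries: 10s, 20s, 40s for attempts 0-2."""
    return 10.0 * (2 ** attempt)

def fetch_politely(urls: list[str],
                   min_delay: float = 2.0,
                   max_delay: float = 5.0) -> list[str]:
    """Fetch URLs with randomized gaps and retry backoff to avoid rate limits."""
    pages = []
    for url in urls:
        for attempt in range(3):
            resp = requests.get(url, timeout=30)
            if resp.status_code == 200:
                pages.append(resp.text)
                break
            time.sleep(backoff_seconds(attempt))  # back off and retry
        time.sleep(random.uniform(min_delay, max_delay))  # pause between URLs
    return pages
```

Randomizing the inter-request gap and backing off on non-200 responses keeps the crawl from looking like a burst of 2,500 back-to-back hits.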
The most significant challenge was tweaking the AI model to handle hundreds of never-ending edge cases, which are inevitable when extracting over 250,000 data points from purely unstructured text.
Accomplishments that we're proud of
- Managing the project's many moving pieces, including the front end, within 36 hours
- Accurately steering the LLM to extract the required data points
- The sheer scale of the project
What we learned
- Sending three thousand back-to-back requests to the Department of Defense website is not wise.
- When a team builds different pieces of a pipeline at once, development needs to proceed in parallel at a similar pace so the pieces can meet in the middle.