Inspiration
We’ve spent the past few months thinking about information in markets. From playing around with Kalshi and Polymarket data to taking courses in agent-based modeling and special-interest politics, the subject has completely taken over our free time. BallotBox is our shot at democratizing scattered election data to give candidates and constituents visibility into, and control over, the campaigns they’re participating in.
What it does
BallotBox is a platform built to provide comprehensive and meaningful insights into future ballot measures and elections – whether they be national or local.
First, one can use the platform as a historical research engine. With data stretching back to the early 1900s, one can easily see how the topics covered in ballot measures, along with their relative support among American sub-populations, have changed over time.
One can run similar analyses on historical elections, covering races at every level, from the smallest local contests to national elections. We break down the stances politicians have taken over time to deepen the user’s understanding of US politics.
Looking ahead, we have also built tools for optimizing the messaging and phrasing of future ballot initiatives. Given a ballot measure in a particular state, we use current data on that state to predict the likelihood of the measure passing. Our tool then suggests rephrasings that maximize the odds of passage, often improving them by up to 15% (see the sketch below).
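As a rough sketch of how that rephrasing loop works (the function arguments below are illustrative stand-ins, not our exact implementation):

```python
from typing import Callable, Iterable

def optimize_measure(
    text: str,
    score: Callable[[str], float],             # trained pass-probability model
    rephrase: Callable[[str], Iterable[str]],  # LLM-backed paraphrase generator
) -> tuple[str, float]:
    """Return the phrasing with the highest predicted probability of passing."""
    best_text, best_prob = text, score(text)
    for candidate in rephrase(text):
        prob = score(candidate)
        if prob > best_prob:
            best_text, best_prob = candidate, prob
    return best_text, best_prob
```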
For elections, we analyze politicians’ stances and platforms in depth to predict whether they will win. At 70% accuracy on recent elections, we believe this method, which considers not only a politician’s stances but also the context they are campaigning in and their competitors’ messaging, represents a new paradigm in political analysis.
How I built it
Using the CivicEngine API, Census data, and BallotPedia, we put together a database of election and ballot measure outcomes going back to the early 1900s. We cross-referenced BallotPedia data on vote distributions against the CivicEngine API to create datasets for training models that predict two things: the likelihood of a ballot measure passing given specific verbiage and state metadata, and the likelihood of a given candidate beating a set of political opponents based on their public stances on issues.
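To give a flavor of the cross-referencing step, here is a minimal sketch; the filenames and column names are hypothetical, since the real join keys differed between sources and needed more cleanup:

```python
import pandas as pd

# Hypothetical filenames and columns -- stand-ins for the real dumps.
civic = pd.read_csv("civicengine_measures.csv")       # measure text + state metadata
ballotpedia = pd.read_csv("ballotpedia_results.csv")  # vote shares + pass/fail outcomes

labeled = civic.merge(
    ballotpedia[["measure_id", "yes_share", "passed"]],
    on="measure_id",
    how="inner",  # keep only measures we could match across both sources
)
labeled.to_csv("training_data.csv", index=False)
```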
For the ballot measure prediction model, we used OpenAI’s API to create embeddings of the key parts of each ballot measure: the affirmative statement, the negative statement, and the question statement. We joined these with heuristics about the area in which the measure was considered, including population, median wage, and more, before passing the result to a model trained to predict whether the measure will pass.
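A minimal sketch of that feature construction, assuming the current OpenAI Python SDK (the embedding model and the specific numeric features shown are illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def measure_features(affirmative: str, negative: str, question: str,
                     population: float, median_wage: float) -> np.ndarray:
    # Concatenate the three statement embeddings with area-level heuristics.
    return np.concatenate([
        embed(affirmative),
        embed(negative),
        embed(question),
        [np.log1p(population), median_wage],  # simple numeric features
    ])
```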
For the election prediction model, we once again embedded every candidate’s stances using the OpenAI API. Each embedding is joined with the candidate’s political party affiliation, the region they are running in, the type of election they are contesting, and more, and we stored this vector for every candidate we could consider. We then trained a model that, given the target candidate’s vector and the competition’s vectors, predicts the likelihood of that candidate winning the election.
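In sketch form, pairing a target candidate with the rest of the field might look like this; mean pooling here is a simple stand-in for the learned aggregation we actually used:

```python
import numpy as np

def election_features(target: np.ndarray, competitors: list[np.ndarray]) -> np.ndarray:
    """Pair a candidate's vector with a pooled summary of their opponents."""
    field = np.mean(competitors, axis=0) if competitors else np.zeros_like(target)
    # Target, field, and their difference give the classifier a relative view.
    return np.concatenate([target, field, target - field])
```

The resulting vector is what gets fed to the win-probability classifier.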
All of this information feeds into a frontend based on the Sway website. We adapted Sway’s assets to match its style, and built the full stack using a combination of Next.js, React, TailwindCSS, MySQL, Vercel, Docker, PHP, TypeScript, Python, FastAPI, and more.
Challenges I ran into
Most of the challenges we ran into during this project were around data scraping and management. For one, our team is not particularly well-versed in GraphQL, and the structure of the API does not lend itself to mass scraping for supervised training. This made pulling the data we were after from the CivicEngine API particularly difficult.
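For reference, a paginated GraphQL pull looks roughly like this; the endpoint, query, and field names are hypothetical, since CivicEngine’s actual schema and auth requirements differ:

```python
import requests

URL = "https://api.example.com/graphql"  # placeholder endpoint
QUERY = """
query Measures($cursor: String) {
  ballotMeasures(first: 100, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes { id title state }
  }
}
"""

def fetch_all(token: str) -> list[dict]:
    rows, cursor = [], None
    while True:
        resp = requests.post(
            URL,
            json={"query": QUERY, "variables": {"cursor": cursor}},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        page = resp.json()["data"]["ballotMeasures"]
        rows.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return rows
        cursor = page["pageInfo"]["endCursor"]  # advance to the next page
```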
Besides these challenges, we also found that many of the data columns we expected to exist were not, in fact, in the dataset when we finally dumped everything. For instance, we expected the ballot measures section to include a field telling us whether a given measure passed. Unfortunately, it did not, so we had to do extra work to obtain pass/fail labels for supervised training. This took an exceptional amount of time due to the protections in place against scraping PowerBI tables.
With respect to model training, our dataset of labeled ballot measures was limited to ~15,000 entries, which meant our models easily overfit. We were careful about tuning hyperparameters and balancing training against validation loss to ensure our final model generalized well enough to hold up in a production environment.
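Our actual training loop was custom, but in scikit-learn terms the guardrails we leaned on look roughly like this, assuming a feature matrix `X` and labels `y` from the pipeline above:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(64,),
    alpha=1e-3,              # L2 penalty to dampen overfitting
    early_stopping=True,     # stop when validation score plateaus
    validation_fraction=0.2,
    n_iter_no_change=10,
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_val, y_val))
```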
Finally, we also struggled to implement Modal in the earlier parts of the project. Understanding and unlocking that compute sooner would have made our training much more efficient.
Accomplishments that I'm proud of
We built a pretty expansive, full-stack project in a day, and we’re particularly proud of how well we split the work between the three of us across data scraping, machine learning, and frontend, making this one of the most efficient projects we’ve worked on together. Reconciling the three workstreams at the end was challenging, but we pulled together a functioning version of our platform that incorporated everyone’s work in its entirety.
What I learned
For one of us, it was a first time properly using Docker to host local databases (MySQL via phpMyAdmin), as well as a first experience running jobs with Modal. Both were enjoyable and are tools we’ll keep using in the near future – it was fun to learn them through real use cases rather than theory alone.
Further, we explored forecasting and prediction methods we hadn’t had the chance to try before, such as using GNNs to aggregate adversarial information and writing custom implementations of attention (a simplified version is sketched below). It was rewarding to put theory from class into practice in the real world.
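As a flavor of the attention piece, here is a parameter-free, single-head stand-in for the learned aggregation; our actual implementation had trainable projections:

```python
import numpy as np

def attention_pool(target: np.ndarray, competitors: np.ndarray) -> np.ndarray:
    """Weight each opponent vector by its scaled dot-product similarity to the target.

    target: (d,) candidate vector; competitors: (n, d) matrix of opponent vectors.
    """
    scores = competitors @ target / np.sqrt(target.shape[0])  # scaled dot-product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over opponents
    return weights @ competitors                              # weighted sum, shape (d,)
```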
What's next for BallotBox
We’ve built a modular framework that lets us plug in better models, which will improve with more labeled data and better text-encoding models. We’re interested in building Polymarket and Kalshi integrations, both to serve as a strong benchmark for our predicted outcome probabilities and as an opportunity for users to leverage our models for market alpha.
Since our focus was primarily on the CivicEngine API, we incorporated a relatively limited amount of supplementary metadata to further contextualize our model. We would almost certainly improve model performance with further effort here, as long as we scale our architecture appropriately (i.e., being cautious of vanishing gradients, etc.).
We’re immensely grateful for Sway’s support this weekend – both for granting access to the CivicEngine data and for their in-person troubleshooting throughout the two days.
Built With
- modal
- phpmyadmin
- python