Inspiration

As students at the University of Cincinnati who have lived in Ohio our whole lives, we are closely tied to Ohio and its politics. Ohio's redistricting efforts (along with similar efforts across the country) have been in the news recently. Seeing these political redistricting battles inspired us to create this project: remove the political effort and instead draw districts by grouping similar demographics, creating a more representative map.

What it does

This project creates new congressional district maps for most states in the United States. These maps are designed to cluster similar counties together, the idea being that similar counties will have similar interests in their representatives. Ultimately, more people end up with a representative who matches their wishes.

How we built it

There are three main sections of how we built this project: Data Preparation, District Construction Algorithm, and District Visualization.

Data Preparation

The first step of this project was to acquire and prepare the data. We used US Census data containing demographic information by county from 2020 to 2024. After loading this data into a Pandas DataFrame, we made a copy of the total population figures for later use, then normalized all features on a per-state basis. After normalization, we have a 74-dimensional embedding representing each county, which the next section uses to calculate the similarity between counties.
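
The per-state normalization step can be sketched as follows. The column names and toy values are illustrative stand-ins (the real frame carries 74 feature columns), not the actual census schema:

```python
import pandas as pd

# Hypothetical demographic frame: one row per county with a 'state'
# column and numeric feature columns (the real data has 74 of these).
df = pd.DataFrame({
    "county": ["Hamilton", "Butler", "Warren", "Cuyahoga"],
    "state": ["OH", "OH", "OH", "OH"],
    "total_pop": [830639, 390357, 242337, 1264817],
    "pop_male": [404000, 192000, 119000, 602000],
})

# Keep the raw populations aside for the population-balancing step later.
populations = df.set_index("county")["total_pop"].copy()

feature_cols = ["total_pop", "pop_male"]

# Min-max normalize each feature within its own state so every
# dimension contributes comparable weight to the similarity metric.
df[feature_cols] = df.groupby("state")[feature_cols].transform(
    lambda col: (col - col.min()) / ((col.max() - col.min()) or 1)
)
```

After this step each feature lies in [0, 1] relative to the other counties of the same state, while `populations` still holds the raw counts needed for district balancing.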

The embedding space alone is not enough: it does not force counties that are close in the embedding space to be physically adjacent, so we also needed a dataset identifying which counties border each other. To prepare this data, we first split each entry to separate the county name from its state, then removed any adjacency connections between counties in different states, since congressional districts cannot cross state lines. Finally, we built an adjacency-list dictionary with an entry for each county and saved it as a JSON file for later use.
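
This preparation might look like the sketch below, where the pair list and the "County, ST" naming convention are illustrative assumptions rather than the exact format of the raw adjacency file:

```python
import json
from collections import defaultdict

# Hypothetical raw adjacency pairs in "County, ST" form.
pairs = [
    ("Hamilton, OH", "Butler, OH"),
    ("Hamilton, OH", "Dearborn, IN"),   # crosses a state line
    ("Butler, OH", "Warren, OH"),
]

def state_of(name):
    # Split "County, ST" and return the state part.
    return name.rsplit(", ", 1)[1]

adjacency = defaultdict(list)
for a, b in pairs:
    # Drop connections that cross state lines, since congressional
    # districts cannot.
    if state_of(a) == state_of(b):
        adjacency[a].append(b)
        adjacency[b].append(a)

# Persist the adjacency list for the district-construction step.
with open("county_adjacency.json", "w") as f:
    json.dump(adjacency, f)
```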

District Construction Algorithm

The next step of the project was the actual algorithm to construct the districts. There are two main phases to the algorithm: creating the embedding space and using the cosine distance of the vectors in the embedding space as a heuristic to select new counties.

Building the Embeddings

Embeddings are simply one-dimensional vectors that represent data. For counties, the embedding space can include any empirical demographic shared across counties. Given our time constraints, we built a 74-dimensional embedding containing population size, ethnicity, and gender data. Each embedding is then normalized against the other counties in the same state, so that all features carry comparable weight in the embedding space. Once the embedding space is created, we can use the distance between two counties to evaluate their similarity. For this we chose cosine similarity rather than Euclidean distance, because Euclidean metrics lose their effectiveness as dimensionality increases and are therefore a poor fit for a 74-dimensional embedding space.
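
A minimal illustration of the metric, with NumPy vectors standing in for county embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 for parallel,
    # 0.0 for orthogonal, independent of vector magnitude.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])   # same direction, different magnitude
c = np.array([0.0, 1.0, 0.0])   # orthogonal to a
```

Because only the angle matters, two counties with proportionally similar demographic profiles score as similar even if their raw magnitudes differ.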

Building the Congressional Districts

After creating the embedding space, we can begin building the congressional districts, which must follow these rules:

  • Each district must have similar population sizes
  • A district cannot cross state lines
  • Each state is allocated a specific number of Congressional Districts
  • Each district must be contiguous

Each rule affects our algorithm and requires more information than the embedding space alone provides. For a contiguous district that does not cross state lines, we must know a state's counties and which counties are adjacent to each other. Then, as we build our districts, we must track the population size of each county and ensure we do not construct districts of disproportionate populations.

To build congressional districts following the specified rules, we use a list of adjacent counties per county, the embedding of each county, and a running list of available counties to be added to congressional districts.

First, we randomly select a county from the available counties, add it to a congressional district, and record its population. Then we take the average embedding of every county already placed in a district, find the available county farthest from that average, and use it to seed a new, empty district. We repeat this process until each district has one county.

Once the districts have been seeded with mutually dissimilar counties, we pick the district with the smallest population. We then find the available county that is most similar to the district and adjacent to at least one of its counties, and add that county's embedding and population to the district. We repeat this process until every county has been assigned to a district.
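
The seeding and growth phases described above can be sketched as follows. The county names, embeddings, adjacency list, and populations are toy stand-ins, and the tie-breaking and landlocked-district fallback are simplifications not spelled out above:

```python
import random
import numpy as np

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_districts(embeddings, adjacency, populations, n_districts, seed=0):
    """Greedy construction: seed districts with mutually dissimilar
    counties, then grow the least-populated district toward its most
    similar adjacent county."""
    rng = random.Random(seed)
    available = set(embeddings)

    # Phase 1: random first seed, then repeatedly seed with the county
    # least similar to the average embedding of everything placed so far.
    first = rng.choice(sorted(available))
    districts = [[first]]
    available.discard(first)
    while len(districts) < n_districts:
        placed = [c for d in districts for c in d]
        mean = np.mean([embeddings[c] for c in placed], axis=0)
        far = min(available, key=lambda c: cos_sim(embeddings[c], mean))
        districts.append([far])
        available.discard(far)

    # Phase 2: grow the smallest-population district by the available
    # adjacent county most similar to its average embedding.
    while available:
        idx = min(range(n_districts),
                  key=lambda i: sum(populations[c] for c in districts[i]))
        mean = np.mean([embeddings[c] for c in districts[idx]], axis=0)
        frontier = {n for c in districts[idx]
                    for n in adjacency.get(c, [])} & available
        if not frontier:        # landlocked district: simplification here
            frontier = available
        best = max(frontier, key=lambda c: cos_sim(embeddings[c], mean))
        districts[idx].append(best)
        available.discard(best)
    return districts

# Toy four-county state split into two districts.
embeddings = {"A": np.array([1.0, 0.1]), "B": np.array([0.9, 0.2]),
              "C": np.array([0.1, 1.0]), "D": np.array([0.2, 0.9])}
adjacency = {"A": ["B", "C"], "B": ["A", "D"],
             "C": ["A", "D"], "D": ["B", "C"]}
populations = {"A": 100, "B": 100, "C": 100, "D": 100}
districts = build_districts(embeddings, adjacency, populations, 2)
```

Growing only the smallest district at each step is what keeps the populations roughly balanced without an explicit rebalancing pass.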

District Visualization

The final part of the project was visualizing our new maps, which we did with Tableau's mapping functionality. We started with a shapefile, which can be found here, containing the geospatial data for every US county. Once this data was loaded into Tableau, we joined the counties with our new congressional districts, which let us easily visualize the districts using Tableau's built-in mapping support.

Challenges we ran into

Throughout this project, the number one issue we ran into was simply finding good data that was easy to work with. We had originally intended to use a wider range of attributes in our embedding space, including economic and education data, and found county-level datasets for these metrics provided by the USDA here. However, there were a number of inconsistencies both among those datasets and with the population dataset from the Census Bureau. We spent multiple hours attempting to sanitize the data so it could all be joined together, but ultimately decided our time was better spent on the actual algorithm and visualization than on further data cleaning.

This issue of finding good, clean data is also why we chose not to visualize districts for Connecticut, Hawaii, and parts of Alaska. The census data that we used to create the districts did not match the shapefile data used to create the mapping visualization in these states. The inconsistency makes it very difficult to properly show the correct calculated districts in the mapping visualization. This is why the visualization has several chunks missing in these states. This is something that could be addressed in the future; however, we chose to prioritize other aspects of the project given the time restrictions of a hackathon.

We also ran into the issue of not being able to split counties into two districts in our visualization. Because we use a predefined shapefile for the counties, visualizing split counties would require editing the shapefile every time we generate a set of districts, which is not feasible in a hackathon time frame, but could be a good extension of the project for the future.

Accomplishments that we're proud of

We are both very proud that our algorithm worked, and worked well. It was something we devised entirely ourselves, with no pre-planning and no use of AI, so seeing it generate districts was very rewarding. It especially felt good to look at our generated district map for Ohio, compare it against the actual congressional map, and see that our map was relatively similar to what is actually in use. It felt like we had generated a completely feasible map.

You can also view our congressional district map here.

What we learned

Throughout this project, we gained a wide variety of experience and knowledge. We got lots of first-hand experience searching through and then cleaning messy data, along with a refresher on working with data in Pandas. We also deepened our knowledge of embedding spaces, something that had been covered in classes but that we had never worked with hands-on. Another important lesson was the value of planning your program ahead and writing pseudocode before implementation; we built this into our project timeline, and it was EXTREMELY beneficial to us. Finally, we got more experience with data visualization and Tableau.

What's next for Representative Redistricting

There are several clear areas of improvement for this project. The first one is expanding the dataset and embedding space. Adding more forms of data, such as economic, family, or education data, could lead to a more accurate similarity metric and better districts. We also see future use for this project as a way to evaluate current district maps against our impartial maps, as a way to evaluate the bias of a given congressional district map.
