Portfolio

Public opinion toward climate change initiatives, Women’s Summit Fellow

Analyzes Twitter data on the Green New Deal and the Paris Agreement to provide a different perspective on public opinion toward them.

Code

Understanding public opinion is essential for politicians and business people alike. In the context of climate change, a robust understanding of public opinion toward climate change initiatives is important for determining the most effective implementation approach. Evaluating that opinion with social media data and natural language processing (NLP) techniques supplements traditional polling and survey methods, likely captures the views of a different portion of the population, and can help build a more complete picture of public opinion on the subject.

As Correlation One Women’s Summit Fellows, my team and I formulated the problem of evaluating public opinion toward climate change policies and sought out data to address it. Using sentiment and topic analysis, we compared discussions about the Green New Deal and the Paris Agreement. The similarity of the discussions around both suggested that opinion toward them has more to do with opinion toward climate change in general than with the specific policy.
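As a rough illustration of how sentiment scores for two groups of tweets can be compared with a t-test, here is a minimal sketch. It assumes each tweet already has a numeric sentiment score, and it uses Welch's unequal-variance form of the test, which is an assumption rather than a statement of the variant the project used. The function name and the scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom
    for two independent samples of sentiment scores."""
    # Per-sample variance of the mean (sample variance / sample size).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-tweet sentiment scores in [-1, 1]; not real project data.
green_new_deal = [0.2, -0.1, 0.4, 0.0, 0.3]
paris_agreement = [0.1, 0.0, 0.3, -0.2, 0.5]

t_stat, dof = welch_t(green_new_deal, paris_agreement)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A t statistic close to zero relative to its degrees of freedom would be consistent with the two discussions having similar average sentiment.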

NLP, sentiment analysis, t-tests, latent Dirichlet allocation (LDA), Azure Cloud, Databricks

Mapping Historical New York, Research Assistant

Determines geographic location of 1850 NYC census records.

Code

In 1850 New York City, census enumerators didn’t record addresses. Recreating the geography of New York City at that time is part of the work of the Columbia University Spatial Research Center’s Mapping Historical New York Project. This process involves linking census records and 1850 city directory records, which did have addresses, and using the addresses found through linkage to extrapolate further.

As a research assistant, I fine-tuned the linkage process, and developed a machine learning modeling approach to assign geographic locations to unlinked records. This approach involved first creating classifiable geographical locations by clustering block numbers, and then using gradient boosting to predict those locations.

Engineering sequence features, which group records that are near each other geographically, was essential to associating census records (primarily demographic data) with geography. The sequence with the strongest effect on the model was built by calculating the distance between consecutive records whose locations were known through linkage, selecting a threshold based on the distribution of those distances, and starting a new sequence any time that threshold was exceeded.
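The threshold-based sequence idea can be sketched roughly as follows. This is a simplified illustration, not the project code: it assumes every record already has an x,y location and takes the threshold as a parameter instead of deriving it from the distance distribution. The function name and coordinates are hypothetical.

```python
from math import hypot


def assign_sequence_ids(points, threshold):
    """Label records in enumeration order with a sequence id, starting a
    new sequence whenever the jump from the previous record's location
    is larger than `threshold`."""
    if not points:
        return []
    ids = [0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        jump = hypot(x1 - x0, y1 - y0)  # straight-line distance between neighbours
        ids.append(ids[-1] + 1 if jump > threshold else ids[-1])
    return ids


# Made-up coordinates listed in census enumeration order.
records = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
sequence_ids = assign_sequence_ids(records, threshold=1.0)
print(sequence_ids)  # [0, 0, 0, 1, 1]
```

The resulting sequence id can then be used as a categorical feature, so that a record without a linked address inherits a hint about where its neighbours in the enumeration were located.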

The final model I worked on could predict block cluster with 0.77 accuracy.

Gradient boosting, unsupervised clustering, pandas, matplotlib, feature engineering

Capstone: GE Asset Tracking

Clusters trajectories (series of x,y points) by shape.

Code

Report

Trajectories are one way to think about how things move, from medical equipment within hospitals to taxis in cities to planes across the globe. Although each of these things moves very differently in scale and pattern, their movements can all be represented as a series of x,y points.

As part of our Master's capstone, my team and I worked with a mentor from GE Research to cluster those trajectories by shape. Before deciding how things should move (to minimize cost of transport, cost of inventory, etc.), we needed to understand how they do move. We developed models on simulated datasets with varying characteristics and a known ground truth, which let us evaluate how well the models worked.

The most successful approach used an LSTM autoencoder to create representations of trajectories, then clustered those representations with the k-means algorithm. The model clustered simulated trajectories with a Fowlkes-Mallows score of 0.87.
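For readers unfamiliar with the metric, the Fowlkes-Mallows score compares predicted clusters to ground-truth labels by counting pairs of points: it is the geometric mean of pairwise precision and recall, and 1.0 means the two groupings agree exactly. The sketch below computes it from scratch on toy labels. The function name and labels are illustrative only; in practice a library routine such as scikit-learn's `fowlkes_mallows_score` would normally be used.

```python
from itertools import combinations
from math import sqrt


def fowlkes_mallows(true_labels, predicted_labels):
    """Pair-counting Fowlkes-Mallows score: TP / sqrt((TP + FP) * (TP + FN))."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Toy labels: one trajectory from the first true cluster is misassigned.
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print(round(fowlkes_mallows(truth, predicted), 3))
```

Because the score only looks at which points share a cluster, it does not depend on how the cluster ids are numbered.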

Neural networks, autoencoders, LSTM, unsupervised clustering, distance metrics, numpy

Visualizing Colonialism

Explores economic impact of colonization today through visualizations.

Code

Report

Colonialism was an economic enterprise that involved moving resources from colonized countries into colonizing countries for hundreds of years, along with building infrastructure and systems to support that process. In this project, we investigated what data could tell us about colonization’s continued impact, especially in terms of economics.

While firm conclusions were impossible due to the complexity of the subject, we were curious about what the data would suggest and how it would compare to qualitative analyses of colonialism. One of the most interesting results was a graph of the current per capita GDPs of all previously colonized countries. Every country with a relatively high GDP had a well-known story behind its wealth: either it was among the Asian Tigers or it had vast oil resources.

R, ggplot2, data wrangling and cleaning, exploratory data analysis, visualization, heat maps, missing data analysis
