Welcome! This repository contains scripts that train and apply a machine learning model to classify political advertisements based on their content and determine which political party (Democratic, Republican, or Other) the ads belong to.
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step in our pipeline.
This repo contains scripts for a multinomial ad-level party classifier that classifies ads as DEM, REP, or OTHER. It differs from the other party classifier in its training data: here, the training data consists of individual ads whose pd_id has party_all coded in the WMP entity file, a list of all unique sponsors of ads on Google and Facebook, whereas the other party classifier concatenates all ads of a pd_id into one. If you need clear, specific predictions about the party affiliation of ads, the other party classifier is the better choice: it operates under the assumption that all ads associated with a single pd_id belong to the same party, which leads to more consistent, and potentially more accurate, predictions when the ads are viewed collectively rather than individually. The main purpose of this ad-level classifier is to produce predictions for individual ads, which can then be used to express the degree to which an ad belongs to either party.
Data processed and generated by the scripts in this repository are stored as compressed CSV files (csv.gz) in the /data folder. The outputs include class labels (DEM/REP/OTHER) and aggregated labels at the pd_id level (advertiser_id for Google) determined by a majority vote. In case of a tie in which the classifier can't decide the party, the label defaults to OTHER. In addition to the class labels, the classifier computes probabilities that indicate the likelihood of each ad belonging to the DEM, REP, or OTHER categories. However, to obtain more accurate class probabilities, we recommend you use the other party classifier.
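The pd_id-level aggregation described above can be sketched as follows. This is a minimal illustration with toy labels and column names; the actual aggregation lives in the repo's scripts:

```python
from collections import Counter

import pandas as pd

def aggregate_party(labels):
    """Majority vote over ad-level labels; ties default to OTHER."""
    top = Counter(labels).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "OTHER"
    return top[0][0]

# Toy ad-level predictions; "pd_id" and "predicted_party" are illustrative names.
ads = pd.DataFrame({
    "pd_id": ["a", "a", "a", "b", "b"],
    "predicted_party": ["DEM", "DEM", "REP", "REP", "DEM"],
})

# One label per sponsor (pd_id) by majority vote.
agg = ads.groupby("pd_id")["predicted_party"].apply(aggregate_party)
```

Sponsor "a" gets DEM (2 of 3 ads), while sponsor "b" is a 1–1 tie and falls back to OTHER.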
To start setting up the repo and run the scripts, first clone this repo to your local directory:
```bash
git clone https://github.com/Wesleyan-Media-Project/party_classifier.git
```

Then, ensure you have the required dependencies installed. The scripts are tested on Python 3.9, 3.10, and 3.11. The packages we use are listed in requirements_py.txt. You can install them by running:

```bash
pip install -r requirements_py.txt
```

The scripts in the code directory are numbered in the order in which they should be run: follow the order 01, 02, 03, etc., according to the file names. Scripts that directly depend on one another are ordered sequentially. Scripts with the same number are alternatives; usually they are the same script applied to different data, or with minor variations. For example, 03_inference_google_2022.py and 03_inference_fb_2022.py apply the trained party classifier to different datasets. Inference scripts for the 2022 political advertising datasets contain "_2022" in their file names.
If you want to use the trained model we provide, you can skip training and run only the inference scripts, since the model files are already present in the /models folder.
Note: If you do not want to train models from scratch, you can use the trained model we provide here, and skip to 3.4.
To run this repo, you first need to train a classification model. We have two training scripts that use two different training datasets:
- Training on the portions of the Meta and Google 2022 datasets for which the sponsors' party leanings or affiliations are known, based on merging with the most recent WMP entity files: wmpentity_2022_012125_mergedFECids.dta (for Meta, version 2025-01-21) and 2022_google_entities_20240303_woldidstomerge.csv (for Google, version 2024-03-03). You need the following files:
  - Meta 2022 entity file: wmpentity_2022_012125_mergedFECids.dta
  - Google 2022 entity file: 2022_google_entities_20240303_woldidstomerge.csv
  - Google 2022 general election ad datasets: g2022_adid_text.csv.gz and g2022_adid_var1.csv.gz
  - Facebook 2022 general election ad datasets: fb_2022_adid_text.csv.gz and fb_2022_adid_var1.csv.gz
For our training data, all ads associated with a specific sponsor can only appear in either the training set or the test set, never both. Prior to the train/test split, the concatenated ads are de-duplicated, so that only one version of every concatenated ad content can go into either train or test (we could potentially de-duplicate only within page_ids, but currently don't).
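A minimal sketch of this de-duplication and sponsor-level split, using scikit-learn's GroupShuffleSplit on toy data (the column names here are illustrative, not the repo's actual schema):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the training table; "pd_id", "text", "party_all" are illustrative.
df = pd.DataFrame({
    "pd_id": ["p1", "p1", "p2", "p2", "p3", "p4"],
    "text": ["vote blue", "vote blue", "tax cuts", "tax cuts now",
             "healthcare", "jobs"],
    "party_all": ["DEM", "DEM", "REP", "REP", "DEM", "REP"],
})

# De-duplicate identical ad content before splitting.
df = df.drop_duplicates(subset="text")

# Keep all ads of a sponsor (pd_id) on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["pd_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```

Grouping by pd_id guarantees no sponsor leaks across the train/test boundary.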
The following fields are used in the classifier by concatenating them in the following order, separated by a single space:
`disclaimer`, `page_name`, `ad_creative_body`, `ad_creative_link_caption`, `ad_creative_link_description`, `ad_creative_link_title`, `ocr`, `asr`
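A sketch of how such a concatenation might look (the helper name and dict-based row are illustrative; the repo's scripts define their own preprocessing):

```python
# Field order matches the list above; missing or null fields become empty strings.
FIELDS = [
    "disclaimer", "page_name", "ad_creative_body",
    "ad_creative_link_caption", "ad_creative_link_description",
    "ad_creative_link_title", "ocr", "asr",
]

def build_text(row):
    """Join the fields in order, separated by single spaces, skipping empties."""
    parts = [str(row.get(f) or "") for f in FIELDS]
    return " ".join(p for p in parts if p)

# Example ad with some fields missing (ocr is None, asr absent).
ad = {
    "disclaimer": "Paid for by X",
    "page_name": "X for Senate",
    "ad_creative_body": "Vote on Nov 8",
    "ocr": None,
}
text = build_text(ad)
```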
We use two versions of a logistic regression classifier with different regularization strengths. We found that stronger regularization provides more accurate results.
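A hedged sketch of this kind of model. The TF-IDF features, toy data, and the specific `C` value are assumptions for illustration, not the repo's actual hyperparameters; note that in scikit-learn, a smaller `C` means stronger regularization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative training texts and labels; real data comes from the merged ad files.
texts = [
    "vote democrat healthcare", "republican tax cuts", "democrat climate",
    "republican border security", "school board meeting", "local bake sale",
]
labels = ["DEM", "REP", "DEM", "REP", "OTHER", "OTHER"]

# Smaller C = stronger L2 regularization on the logistic regression weights.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(C=0.1, max_iter=1000)),
])
clf.fit(texts, labels)

# Class probabilities for a new ad, one column per class (DEM, OTHER, REP).
probs = clf.predict_proba(["democrat healthcare plan"])
```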
You can find the trained models we provide here.
Here is the model performance on the held-out test set:
```
              precision    recall  f1-score   support

         DEM       0.89      0.94      0.92      3953
       OTHER       0.66      0.13      0.22       142
         REP       0.86      0.82      0.84      2066

    accuracy                           0.88      6161
   macro avg       0.80      0.63      0.66      6161
weighted avg       0.88      0.88      0.88      6161
```
Please note that to access the files stored on Figshare, you will need to fill out a brief form; you will then immediately get data access.
Once you have your model ready, you can run the inference scripts. All the inference scripts are named starting with 03_. For Facebook 2022 inference, you will need fb_2022_adid_text.csv.gz and fb_2022_adid_var1.csv.gz. For Google 2022 inference, you will need g2022_adid_text.csv.gz and g2022_adid_var1.csv.gz.
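The inference inputs are pairs of compressed CSVs keyed by ad id. A toy sketch of reading and joining them with pandas (the file contents and the ad_id column name are illustrative assumptions; pandas infers gzip compression from the .gz suffix):

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real csv.gz inputs, written to a temp directory.
tmp = tempfile.mkdtemp()
text_path = os.path.join(tmp, "fb_2022_adid_text.csv.gz")
var_path = os.path.join(tmp, "fb_2022_adid_var1.csv.gz")
pd.DataFrame({"ad_id": [1, 2],
              "ad_creative_body": ["vote blue", "tax cuts"]}).to_csv(
    text_path, index=False)
pd.DataFrame({"ad_id": [1, 2],
              "pd_id": ["p1", "p2"]}).to_csv(var_path, index=False)

# Read both files and join text fields with ad-level variables on ad id.
text_df = pd.read_csv(text_path)
var_df = pd.read_csv(var_path)
ads = text_df.merge(var_df, on="ad_id")
```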
Training and inference scripts for the 2020 data are also written in Python; they have 2020 in their file names and are located in the 2020 directory.
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.
