Welcome! This repository contains scripts that train and apply a machine learning model to classify political advertisements based on their content and determine which political party (Democratic, Republican, or Other) the ads belong to.
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step in our pipeline.
This repo contains scripts for a multinomial ad-level party classifier that classifies ads as DEM, REP, or OTHER. It differs from the other party classifier in its training data: here, the training data consists of individual ads whose pd_id has party_all coded in the WMP entity file, a list of all unique sponsors of ads on Google and Facebook, whereas the other party classifier concatenates all ads of a pd_id into one. If you need clear, specific predictions about the party affiliation of ads, the other party classifier is the better choice: it operates under the assumption that all ads associated with a single pd_id belong to the same party, which leads to more consistent, and potentially more accurate, predictions when the ads are viewed collectively rather than individually. The main purpose of this ad-level classifier is to produce predictions for individual ads, which can then be used to express the degree to which an ad belongs to either party.
Data processed and generated by the scripts in this repository are stored as compressed CSV files (csv.gz) in the /data folder. The outputs include class labels (DEM/REP/OTHER) and aggregated labels at the pd_id level (advertiser_id for Google) determined by a majority vote. In case of a tie in which the classifier can't decide the party, the label defaults to OTHER. In addition to the class labels, the classifier computes probabilities that indicate the likelihood of each ad belonging to the DEM, REP, or OTHER categories. However, to obtain more accurate class probabilities, we recommend you use the other party classifier.
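The pd_id-level aggregation described above can be sketched as follows. This is a minimal illustration with toy labels and column names; the actual aggregation lives in the repo's scripts:

```python
from collections import Counter

import pandas as pd

def aggregate_party(labels):
    """Majority vote over ad-level labels; ties default to OTHER."""
    top = Counter(labels).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "OTHER"
    return top[0][0]

# Toy ad-level predictions; "pd_id" and "predicted_party" are illustrative names.
ads = pd.DataFrame({
    "pd_id": ["a", "a", "a", "b", "b"],
    "predicted_party": ["DEM", "DEM", "REP", "REP", "DEM"],
})

# One label per sponsor (pd_id) by majority vote.
agg = ads.groupby("pd_id")["predicted_party"].apply(aggregate_party)
```

Sponsor "a" gets DEM (2 of 3 ads), while sponsor "b" is a 1–1 tie and falls back to OTHER.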
To start setting up the repo and run the scripts, first clone this repo to your local directory:
```bash
git clone https://github.com/Wesleyan-Media-Project/party_classifier.git
```

Then, ensure you have the required dependencies installed. The scripts are tested on Python 3.9, 3.10, and 3.11. The packages we use are listed in requirements_py.txt. You can install them by running:

```bash
pip install -r requirements_py.txt
```

The scripts in the code directory are numbered in the order in which they should be run: follow the order 01, 02, 03, etc., according to the file names. Scripts that directly depend on one another are ordered sequentially. Scripts with the same number are alternatives; usually they are the same script applied to different data, or with minor variations. For example, 03_inference_google_2022.py and 03_inference_fb_2022.py apply the trained party classifier to different datasets. Inference scripts for the 2022 political advertising datasets contain "_2022" in their file names.
If you want to use the trained model we provide, you can skip training and run only the inference scripts, since the model files are already present in the /models folder.
Note: If you do not want to train models from scratch, you can use the trained model we provide here, and skip to 3.4.
To run this repo, you first need to train a classification model. We have two training scripts that use two different training datasets:
- Training on the portions of the Meta and Google 2022 datasets for which the sponsors' party leanings or affiliations are known, based on merging with the most recent WMP entity files: wmpentity_2022_012125_mergedFECids.dta (for Meta, version 2025-01-21) and 2022_google_entities_20240303_woldidstomerge.csv (for Google, version 2024-03-03). You need the following files:
  - Meta 2022 entity file: wmpentity_2022_012125_mergedFECids.dta
  - Google 2022 entity file: 2022_google_entities_20240303_woldidstomerge.csv
  - Google 2022 general election ad datasets: g2022_adid_text.csv.gz and g2022_adid_var1.csv.gz
  - Facebook 2022 general election ad datasets: fb_2022_adid_text.csv.gz and fb_2022_adid_var1.csv.gz
For our training data, all ads associated with a specific sponsor can only appear in either the training set or the test set, never both. Prior to the train/test split, the concatenated ads are de-duplicated, so that only one version of every concatenated ad content can go into either train or test (we could potentially de-duplicate only within page_ids, but currently don't).
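A minimal sketch of this de-duplication and sponsor-level split, using scikit-learn's GroupShuffleSplit on toy data (the column names here are illustrative, not the repo's actual schema):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the training table; "pd_id", "text", "party_all" are illustrative.
df = pd.DataFrame({
    "pd_id": ["p1", "p1", "p2", "p2", "p3", "p4"],
    "text": ["vote blue", "vote blue", "tax cuts", "tax cuts now",
             "healthcare", "jobs"],
    "party_all": ["DEM", "DEM", "REP", "REP", "DEM", "REP"],
})

# De-duplicate identical ad content before splitting.
df = df.drop_duplicates(subset="text")

# Keep all ads of a sponsor (pd_id) on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["pd_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```

Grouping by pd_id guarantees no sponsor leaks across the train/test boundary.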
The following fields are used in the classifier by concatenating them in the following order, separated by a single space:
`disclaimer`, `page_name`, `ad_creative_body`, `ad_creative_link_caption`, `ad_creative_link_description`, `ad_creative_link_title`, `ocr`, `asr`
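A sketch of how such a concatenation might look (the helper name and dict-based row are illustrative; the repo's scripts define their own preprocessing):

```python
# Field order matches the list above; missing or null fields become empty strings.
FIELDS = [
    "disclaimer", "page_name", "ad_creative_body",
    "ad_creative_link_caption", "ad_creative_link_description",
    "ad_creative_link_title", "ocr", "asr",
]

def build_text(row):
    """Join the fields in order, separated by single spaces, skipping empties."""
    parts = [str(row.get(f) or "") for f in FIELDS]
    return " ".join(p for p in parts if p)

# Example ad with some fields missing (ocr is None, asr absent).
ad = {
    "disclaimer": "Paid for by X",
    "page_name": "X for Senate",
    "ad_creative_body": "Vote on Nov 8",
    "ocr": None,
}
text = build_text(ad)
```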
We use two versions of a logistic regression classifier with different regularization strengths. We found that stronger regularization provides more accurate results.
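A hedged sketch of this kind of model. The TF-IDF features, toy data, and the specific `C` value are assumptions for illustration, not the repo's actual hyperparameters; note that in scikit-learn, a smaller `C` means stronger regularization:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative training texts and labels; real data comes from the merged ad files.
texts = [
    "vote democrat healthcare", "republican tax cuts", "democrat climate",
    "republican border security", "school board meeting", "local bake sale",
]
labels = ["DEM", "REP", "DEM", "REP", "OTHER", "OTHER"]

# Smaller C = stronger L2 regularization on the logistic regression weights.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression(C=0.1, max_iter=1000)),
])
clf.fit(texts, labels)

# Class probabilities for a new ad, one column per class (DEM, OTHER, REP).
probs = clf.predict_proba(["democrat healthcare plan"])
```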
You can find the trained models we provide here.
Here is the model performance on the held-out test set:
```
              precision    recall  f1-score   support

         DEM       0.89      0.94      0.92      3953
       OTHER       0.66      0.13      0.22       142
         REP       0.86      0.82      0.84      2066

    accuracy                           0.88      6161
   macro avg       0.80      0.63      0.66      6161
weighted avg       0.88      0.88      0.88      6161
```
Please note that to access the files stored on Figshare, you will need to fill out a brief form; you will then immediately get data access.
Once you have your model ready, you can run the inference scripts. All the inference scripts are named starting with 03_. For Facebook 2022 inference, you will need fb_2022_adid_text.csv.gz and fb_2022_adid_var1.csv.gz. For Google 2022 inference, you will need g2022_adid_text.csv.gz and g2022_adid_var1.csv.gz.
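The inference inputs are pairs of compressed CSVs keyed by ad id. A toy sketch of reading and joining them with pandas (the file contents and the ad_id column name are illustrative assumptions; pandas infers gzip compression from the .gz suffix):

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the real csv.gz inputs, written to a temp directory.
tmp = tempfile.mkdtemp()
text_path = os.path.join(tmp, "fb_2022_adid_text.csv.gz")
var_path = os.path.join(tmp, "fb_2022_adid_var1.csv.gz")
pd.DataFrame({"ad_id": [1, 2],
              "ad_creative_body": ["vote blue", "tax cuts"]}).to_csv(
    text_path, index=False)
pd.DataFrame({"ad_id": [1, 2],
              "pd_id": ["p1", "p2"]}).to_csv(var_path, index=False)

# Read both files and join text fields with ad-level variables on ad id.
text_df = pd.read_csv(text_path)
var_df = pd.read_csv(var_path)
ads = text_df.merge(var_df, on="ad_id")
```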
Training and inference scripts for the 2020 data are also written in Python; they have 2020 in their file names and are located in the 2020 directory.
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.
