Using Data Analysis & Visualization to Understand the Gender Gap in Literacy

By Rachel Walter, Fall 2019

Gender Gap

Photo retrieved from Variety

Introduction

Motivation

I am a computer science major with a minor in women's studies. I thought it would be interesting to explore how these two fields can interact by learning how to analyze and visualize data with a critical focus on gender. This project will specifically investigate the gender gap in education globally, as measured by literacy rates of each country. We will also look into how location and gender interact to yield gender disparities.

By the end of this tutorial you will be able to:

  1. Collect data from an online source
  2. Clean & reorganize data
  3. Find summary statistics for a dataset
  4. Use map visualizations to see geographic trends in data
  5. Graph and model the relationships between two sets of data
  6. Train a classification model and choose its hyperparameters
  7. So much more!

This process can be applied to a variety of datasets that distinguish location (e.g. country, city, continent) and are sex-disaggregated.

Required Libraries/Tools

The following libraries/tools will be used throughout the project. They will allow us to do and visualize more! If you do not have any of these libraries installed, you can install them into your development environment with $ pip3 install [package]. You can find more information on them through their documentation linked below:

  • pandas, a data analysis toolkit which helps us work with data frames - docs
  • numpy, a package for scientific computing with special n-d arrays - docs
  • folium for creating map visualizations - docs
  • sklearn for regressions and modelling - docs
  • matplotlib for graphing - docs
  • requests for HTTP requests - docs
  • BeautifulSoup for web-scraping - docs
In [1]:
# importing required libraries and tools
import pandas as pd
import numpy as np
import folium
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve, cross_val_score
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup 

Background Readings: Gender Gap in Education

Globally, access to education is dependent on a variety of factors including infrastructure development, presence of conflict, socioeconomic status, and gender. In this case, I am going to look only at how gender (and location) affects educational attainment. This, of course, may not paint a full picture since multiple forces can compound to privilege or oppress educational opportunity.

Readings below offer a perspective on known trends in the gender gap in education and why it exists:

Background Readings: Classification

When I was learning about classification and refining hyperparameters, I used the following resources

Part 1: Data Collection

The first step of this project is to collect data. In my case, this meant choosing a data source that provided the information I needed for analyzing gender and education. I found that UNICEF keeps publicly available data on various global development measures surrounding children, such as childhood survival rates. I was able to find data on youth and adult literacy rates (sex disaggregated).

I updated the Excel spreadsheets to only contain the data tables, making it much easier to access the data. These updated .xlsx files are available to download from my GitHub for this project. We are going to start collection with one pandas dataframe each for the adult data set and the youth data set.

In [2]:
# Read the data from the Excel file -- YOU MIGHT HAVE TO CHANGE THE FILE PATH
youth = pd.read_excel('Table-Youth-and-Adult-Literacy-Rate-updated-Oct.-2015_78.xlsx', sheet_name=0)
adult = pd.read_excel('Table-Youth-and-Adult-Literacy-Rate-updated-Oct.-2015_78.xlsx', sheet_name=1)
print(youth.head())
print(adult.head())
  ISO Code Countries and areas  Reference year(s)     Total  Unnamed: 4  \
0      NaN                 NaN                 NaN      NaN         NaN   
1      AFG         Afghanistan              2011.0  46.9901         NaN   
2      ALB             Albania              2011.0  98.7912         NaN   
3      DZA             Algeria              2006.0  91.7796         NaN   
4      AND             Andorra                 NaN        -         NaN   

       Sex  Unnamed: 6 Unnamed: 7  Unnamed: 8                           Source  
0     Male         NaN     Female         NaN                              NaN  
1  61.8791         NaN    32.1132         NaN  UNESCO Institute for Statistics  
2  98.7314         NaN    98.8562         NaN  UNESCO Institute for Statistics  
3  94.3815         NaN    89.1382         NaN  UNESCO Institute for Statistics  
4        -         NaN          -         NaN                              NaN  
  ISO Code Countries and areas  Reference year(s)     Total  Unnamed: 4  \
0      NaN                 NaN                 NaN      NaN         NaN   
1      AFG         Afghanistan              2011.0  31.7411         NaN   
2      ALB             Albania              2011.0  96.8453         NaN   
3      DZA             Algeria              2006.0  72.6487         NaN   
4      AND             Andorra                 NaN        -         NaN   

       Sex  Unnamed: 6 Unnamed: 7  Unnamed: 8                           Source  
0     Male         NaN     Female         NaN                              NaN  
1  45.4171         NaN    17.6121         NaN  UNESCO Institute for Statistics  
2  98.0082         NaN    95.6915         NaN  UNESCO Institute for Statistics  
3   81.284         NaN    63.9188         NaN  UNESCO Institute for Statistics  
4        -         NaN          -         NaN                              NaN  

Explanation of Data

  • This data set is collected across multiple years, ranging from 2005-2013
  • Data was collected by UNESCO Institute for Statistics and Prepared by the Data and Analytics Section; Division of Data, Research and Policy, UNICEF
  • Total, Male, and Female all represent the percentage of literacy in that population

Part 2: Data Processing

Initial Data Cleaning

Next we need to process the data a little bit more to make it clean. This means everything will be well labelled and we will handle missing data. Other times you might want to reorganize your data table to be tidy, but since our tables are fairly simple and clean I am not going to reorganize.

We can see just from head() that there are extra columns that are all NaN and that the Source column only ever says "UNESCO Institute for Statistics," which is not useful for our analysis. We also have to work on re-labelling the Male and Female columns, which are currently "Sex" and "Unnamed: 7." Some countries are included in the table but have no actual data, such as Andorra.

In [3]:
# CLEAN YOUTH DATA TABLE
# Drop the columns that are all NaNs
youth = youth.drop(axis=1, labels=['Unnamed: 4','Unnamed: 6', 'Unnamed: 8'])

# Drop the Source column
youth = youth.drop(axis=1, labels=['Source'])

# Relabel the Male/Female columns and drop the first row
youth = youth.drop(axis=0, index=0)
youth = youth.rename(columns={"Sex": "Boy", "Unnamed: 7": "Girl"})

# Drop rows that do not have Total, Male, or Female data
youth = youth[youth.Total != '-']
youth = youth[youth.Boy != '-']
youth = youth[youth.Girl != '-']
youth = youth.dropna()

# Rename Total to Youth_Total so that the combined table can keep both Total columns
youth = youth.rename(columns={"Total": "Youth_Total"})

youth.head()
Out[3]:
ISO Code Countries and areas Reference year(s) Youth_Total Boy Girl
1 AFG Afghanistan 2011.0 46.9901 61.8791 32.1132
2 ALB Albania 2011.0 98.7912 98.7314 98.8562
3 DZA Algeria 2006.0 91.7796 94.3815 89.1382
5 AGO Angola 2013.0 72.9818 79.3764 66.6683
7 ARG Argentina 2013.0 99.261 99.083 99.4445

Let us repeat this process on the adult data set.

In [4]:
# CLEAN ADULT DATA TABLE
# Drop the columns that are all NaNs
adult = adult.drop(axis=1, labels=['Unnamed: 4','Unnamed: 6', 'Unnamed: 8'])

# Drop the Source column
adult = adult.drop(axis=1, labels=['Source'])

# Relabel the Male/Female columns and drop the first row
adult = adult.drop(axis=0, index=0)
adult = adult.rename(columns={"Sex": "Male", "Unnamed: 7": "Female"})

# Drop rows that do not have Total, Male, or Female data
adult = adult[adult.Total != '-']
adult = adult[adult.Male != '-']
adult = adult[adult.Female != '-']
adult = adult.dropna()

adult.head()
Out[4]:
ISO Code Countries and areas Reference year(s) Total Male Female
1 AFG Afghanistan 2011.0 31.7411 45.4171 17.6121
2 ALB Albania 2011.0 96.8453 98.0082 95.6915
3 DZA Algeria 2006.0 72.6487 81.284 63.9188
5 AGO Angola 2013.0 70.7784 82.3233 59.6714
6 ATG Antigua and Barbuda 2013.0 98.95 98.4 99.42
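
Since the youth and adult cells repeat the same steps, they could be folded into one helper. The sketch below is only an illustration: `clean_literacy_table` and the toy `raw_toy` frame are hypothetical names of mine, not part of the notebook. It also differs from the cells above in one respect: it converts the rate columns to floats instead of leaving them as mixed objects.

```python
import numpy as np
import pandas as pd

def clean_literacy_table(raw, male_name, female_name):
    """Hypothetical helper that mirrors the cleaning cells above."""
    # Drop the all-NaN spacer columns and the Source column
    out = raw.drop(columns=['Unnamed: 4', 'Unnamed: 6', 'Unnamed: 8', 'Source'])
    # Drop the sub-header row, then relabel the Male/Female columns
    out = out.drop(index=0)
    out = out.rename(columns={'Sex': male_name, 'Unnamed: 7': female_name})
    # Turn '-' placeholders into NaN and the rates into floats
    rate_cols = ['Total', male_name, female_name]
    out[rate_cols] = out[rate_cols].apply(pd.to_numeric, errors='coerce')
    # Drop every row that is missing any value
    return out.dropna()

# Toy stand-in for the raw sheet (same column layout as the head() output)
raw_toy = pd.DataFrame({
    'ISO Code': [np.nan, 'AFG', 'AND'],
    'Countries and areas': [np.nan, 'Afghanistan', 'Andorra'],
    'Reference year(s)': [np.nan, 2011.0, np.nan],
    'Total': [np.nan, 31.7411, '-'],
    'Unnamed: 4': [np.nan, np.nan, np.nan],
    'Sex': ['Male', 45.4171, '-'],
    'Unnamed: 6': [np.nan, np.nan, np.nan],
    'Unnamed: 7': ['Female', 17.6121, '-'],
    'Unnamed: 8': [np.nan, np.nan, np.nan],
    'Source': [np.nan, 'UNESCO Institute for Statistics', np.nan],
})

adult_toy = clean_literacy_table(raw_toy, 'Male', 'Female')
print(adult_toy)
```

For the youth table you would call it with 'Boy' and 'Girl' and then rename Total to Youth_Total as before.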

Adding Country-Continent Mapping

As stated in the introduction, I am interested in how location and gender may interact to yield differences in education and literacy. Since each country only has one data point, it would be useful to be able to group countries geographically to see if this relationship exists. The easiest way to do this is to match each country to its respective continent. Do not worry, this is much easier than you might expect thanks to tools like BeautifulSoup.

I found a Wikipedia page with a table which stores the country name, 2-character code, 3-character code, and continent code. I used Beautiful Soup and pandas to create a dataframe from that table. The hardest part was using print(soup.prettify()) to find my data in the HTML. However, most tables are of the class "wikitable sortable," making it much easier to find and separate out that data!

In [5]:
# Request Wikipedia Page with Data Table
website_url = requests.get(
    'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)').text

# Parse the HTML (naming the parser explicitly avoids depending on which parsers are installed)
soup = BeautifulSoup(website_url, 'html.parser')
# You can print out the HTML if you need to find your table
#print(soup.prettify())

# Find the data table we want to scrape
table = soup.find('table',{'class':'wikitable sortable'})

# Use the pandas library to convert that HTML table into a pandas dataframe
countrydf = pd.read_html(str(table))[0]
countrydf.head()
Out[5]:
CC a-2 a-3 # Name
0 AS AF AFG 4.0 Afghanistan, Islamic Republic of
1 EU AL ALB 8.0 Albania, Republic of
2 AN AQ ATA 10.0 Antarctica (the territory South of 60 deg S)
3 AF DZ DZA 12.0 Algeria, People's Democratic Republic of
4 OC AS ASM 16.0 American Samoa

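Now, as you can see above, we have a dataset that can match our ISO country codes with the continent codes (EU = Europe, SA = South America, AS = Asia, OC = Oceania, AF = Africa, AN = Antarctica, and NA = North America). One caveat: pandas reads the string "NA" as a missing value by default, so North American countries end up with NaN in the CC column, which is why later cells filter out NaN continent codes. We will now make a few changes to countrydf so that it can be combined with our main data sets.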

In [6]:
# Drop name, number, and 2 character code from the table, we don't need these
countrydf = countrydf.drop(axis=1, labels=['Name', '#', 'a-2'])

# Rename the a-3 column to "ISO Code." This will help with the merge!
countrydf = countrydf.rename(columns={"a-3": "ISO Code"})

# Merge countrydf into the main data sets (youth and adult) on the shared "ISO Code" column
youth = pd.merge(youth, countrydf)
adult = pd.merge(adult, countrydf)

# See below in "Creating Combined Table" for more information on "merge"
countrydf.head()
Out[6]:
CC ISO Code
0 AS AFG
1 EU ALB
2 AN ATA
3 AF DZA
4 OC ASM
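
If you want to keep the North America code instead of losing it to pandas' default missing-value handling, one option is the keep_default_na parameter. The sketch below is a toy illustration under my own assumptions: it uses read_csv on three made-up rows so that it needs no HTML parser, and read_html accepts the same keep_default_na parameter. I have not changed the cells above, so the rest of this tutorial still treats "NA" as missing.

```python
from io import StringIO

import pandas as pd

# Toy rows in the same shape as countrydf (not the real Wikipedia table)
csv_text = "CC,ISO Code\nAS,AFG\nNA,USA\nEU,ALB\n"

# Default behaviour: the string "NA" is parsed as a missing value
default_parse = pd.read_csv(StringIO(csv_text))

# keep_default_na=False keeps "NA" as an ordinary string
literal_parse = pd.read_csv(StringIO(csv_text), keep_default_na=False)

print(default_parse['CC'].isna().sum())
print(literal_parse['CC'].tolist())
```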

Creating Combined Table

Now that both tables are tidy, I want to create a third table that will help me explore if the youth and adult scores are related for each country. However, we are not guaranteed to have all that data for every country. I am going to do a merge of the youth and adult dataframes to create a new table of only countries that are represented in both datasets.

The alternative to merging the tables would be making a data table schema like the following:

| ISO Code | Countries and areas | Reference year(s) | Youth/Adult | Total | Male | Female |

Where Youth/Adult would be some 0/1 or Y/A value indicating whether that row comes from the youth or adult data set. This would be more in line with tidy data theory. However, for the sake of ease in coding up visualizations, I am keeping my data in its less tidy form.

In [7]:
df = pd.merge(youth,adult)
df.head()
Out[7]:
ISO Code Countries and areas Reference year(s) Youth_Total Boy Girl CC Total Male Female
0 AFG Afghanistan 2011.0 46.9901 61.8791 32.1132 AS 31.7411 45.4171 17.6121
1 ALB Albania 2011.0 98.7912 98.7314 98.8562 EU 96.8453 98.0082 95.6915
2 DZA Algeria 2006.0 91.7796 94.3815 89.1382 AF 72.6487 81.284 63.9188
3 AGO Angola 2013.0 72.9818 79.3764 66.6683 AF 70.7784 82.3233 59.6714
4 ARG Argentina 2013.0 99.261 99.083 99.4445 SA 97.9738 97.933 98.0119

We now have tables that, for each country represented, store the ISO code, name, reference years, and literacy rates for each population/sub-population.
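
One detail about merge worth knowing: when no `on` argument is given, pandas performs an inner join on every column name the two tables share. In our case that includes the reference year, so a country only survives the merge if both tables agree on it. The toy frames below (made-up codes and values, purely for illustration) sketch the difference between the default and an explicit key.

```python
import pandas as pd

# Toy tables with made-up country codes and rates
youth_toy = pd.DataFrame({
    'ISO Code': ['AAA', 'BBB', 'CCC'],
    'Reference year(s)': [2011.0, 2011.0, 2006.0],
    'Youth_Total': [50.0, 90.0, 80.0],
})
adult_toy = pd.DataFrame({
    'ISO Code': ['AAA', 'BBB', 'DDD'],
    'Reference year(s)': [2011.0, 2012.0, 2006.0],
    'Total': [40.0, 85.0, 70.0],
})

# Default: inner join on all shared columns (ISO Code AND Reference year(s))
default_merge = pd.merge(youth_toy, adult_toy)

# Explicit key: join on ISO Code only, keeping both year columns
code_merge = pd.merge(youth_toy, adult_toy, on='ISO Code',
                      suffixes=('_youth', '_adult'))

print(default_merge['ISO Code'].tolist())
print(code_merge['ISO Code'].tolist())
```

Whether to join on the year as well is a design choice: the default keeps only rows measured in the same year, while joining on the code alone keeps more countries at the cost of comparing different years.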

Part 3: Exploratory Analysis & Data Visualization

Now that our data is all set, we can begin analyzing it! There are a few beginning exploratory visualizations and analyses I want to do with our data, all of which will be described and demonstrated below.

General Statistical Analysis

Find the Min, Max, Mean, Median, and Standard Deviation for Literacy of Each Population

These are the basic statistics for any data set, which can help us see the central tendency and get a better feel for our data. I am looking at the statistics for each population category (i.e. Youth_Total, Total, Boy, Girl, Male, and Female).

In [8]:
# Analyze Stats for YOUTH_TOTAL
mean = np.mean(youth['Youth_Total'])
median = np.median(youth['Youth_Total'])
mini = np.min(youth['Youth_Total'])
maxi = np.max(youth['Youth_Total'])
stddev = np.std(youth['Youth_Total'])

print('SUMMARY STATS FOR YOUTH_TOTAL')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR YOUTH_TOTAL
Mean:  88.66993818147021 
Std Dev:  16.87815559166715 
Median:  97.80792625902775 
Min:  23.5237779546856 
Max:  100
In [9]:
# Analyze Stats for GIRL
mean = np.mean(youth['Girl'])
median = np.median(youth['Girl'])
mini = np.min(youth['Girl'])
maxi = np.max(youth['Girl'])
stddev = np.std(youth['Girl'])

print('SUMMARY STATS FOR GIRL')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR GIRL
Mean:  87.17530295221185 
Std Dev:  19.74295951751332 
Median:  98.24875928925945 
Min:  15.0577712412646 
Max:  100
In [10]:
# Analyze Stats for BOY
mean = np.mean(youth['Boy'])
median = np.median(youth['Boy'])
mini = np.min(youth['Boy'])
maxi = np.max(youth['Boy'])
stddev = np.std(youth['Boy'])

print('SUMMARY STATS FOR BOY')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR BOY
Mean:  90.2758547066034 
Std Dev:  14.102681234776536 
Median:  97.6878457490552 
Min:  34.5336241866628 
Max:  100
In [11]:
# Analyze Stats for TOTAL
mean = np.mean(adult['Total'])
median = np.median(adult['Total'])
mini = np.min(adult['Total'])
maxi = np.max(adult['Total'])
stddev = np.std(adult['Total'])

print('SUMMARY STATS FOR TOTAL')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR TOTAL
Mean:  82.01606722387197 
Std Dev:  20.415916141369255 
Median:  92.2261457776693 
Min:  15.4566976828889 
Max:  99.9982624282257
In [12]:
# Analyze Stats for FEMALE
mean = np.mean(adult['Female'])
median = np.median(adult['Female'])
mini = np.min(adult['Female'])
maxi = np.max(adult['Female'])
stddev = np.std(adult['Female'])

print('SUMMARY STATS FOR FEMALE')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR FEMALE
Mean:  78.36248070375366 
Std Dev:  24.206984176149945 
Median:  90.0686971973496 
Min:  8.93973571419654 
Max:  99.9976152595066
In [13]:
# Analyze Stats for MALE
mean = np.mean(adult['Male'])
median = np.median(adult['Male'])
mini = np.min(adult['Male'])
maxi = np.max(adult['Male'])
stddev = np.std(adult['Male'])

print('SUMMARY STATS FOR MALE')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
SUMMARY STATS FOR MALE
Mean:  85.81623887131262 
Std Dev:  16.829881046242363 
Median:  93.3789429416669 
Min:  23.2474730495273 
Max:  99.9989629520247
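
The six cells above repeat the same five calculations, so they could also be computed in one pass. This is just a sketch: `summarize` is a hypothetical helper of mine, and the toy frame holds only the first three rows of the youth table shown earlier. Note that np.std defaults to the population standard deviation (ddof=0) while pandas' std defaults to the sample version (ddof=1), so I pass ddof=0 to stay consistent with the cells above.

```python
import numpy as np
import pandas as pd

def summarize(frame, columns):
    """Hypothetical helper: one row of summary stats per column."""
    values = frame[columns].astype(float)
    return pd.DataFrame({
        'Mean': values.mean(),
        'Std Dev': values.std(ddof=0),  # ddof=0 matches np.std's default
        'Median': values.median(),
        'Min': values.min(),
        'Max': values.max(),
    })

# First three rows of the cleaned youth table (AFG, ALB, DZA)
toy = pd.DataFrame({
    'Boy': [61.8791, 98.7314, 94.3815],
    'Girl': [32.1132, 98.8562, 89.1382],
})

stats = summarize(toy, ['Boy', 'Girl'])
print(stats)
```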

From these summary statistics, a few features appear. First and foremost, the mean and minimum literacy rates for female/girl are lower than for the corresponding male populations, and the female/girl standard deviations are larger. The one exception is the youth median, where girls (98.2) are slightly ahead of boys (97.7). Another interesting thing to notice is that the mean is always less than the median, indicating there may be some lower-end outliers bringing down the mean. This knowledge might come in handy later when seeing how region relates to gender and literacy.

Visualize Distributions Among Populations with Boxplots

If we want a visual of the distribution within the literacy rate data sets instead of the summary numbers, we can also graph box and whisker plots, as shown below for the total, male, and female literacy rates (separated into the youth and adult categories):

In [14]:
# Boxplot showing the distribution of adult literacy rates
fig1, ax1 = plt.subplots()
ax1.set_title('Literacy Rate Among Adult Population')
ax1.boxplot([adult['Total'], adult['Male'], adult['Female']], labels=['Total', 'Male','Female'])
plt.show()
In [15]:
# Boxplot showing the distribution of youth literacy rates
fig1, ax1 = plt.subplots()
ax1.set_title('Literacy Rate Among Youth Population')
ax1.boxplot([youth['Youth_Total'], youth['Boy'], youth['Girl']],  labels=['Total', 'Male','Female'])
plt.show()

Investigate Geographic Region Independently

Graph Total Literacy Percent on a Map

This preliminary visualization will help me see how literacy varies across the globe. Some countries, regions, or continents might have higher literacy rates. Seeing this mapped out will help us judge later, in the in-depth analysis, whether location and gender seem to interact.

I am using the Folium library to map the percentage of literacy. Note that you have to retrieve country map data in GeoJSON format before being able to map it out. Then, we can match our data's ISO codes to the IDs of countries' geographical areas to map out literacy rates visually.

In [16]:
# Get the GeoJSON data you need for choropleth map
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
In [17]:
# VISUALIZING GLOBAL LITERACY RATE (ADULT TOTAL)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth',
    data=adult,
    columns=['ISO Code', 'Total'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Literacy Rate (%)'
).add_to(m)

folium.LayerControl().add_to(m)

m
Out[17]:

Glancing over the map, it looks like South Asia and the Saharan/Sub-Saharan region in Africa have the lowest literacy rates. Most of Europe, Asia, and South America appear to have higher literacy rates.

Relationship Between Continent and Literacy

While on the topic of geographic location, let's use the continent categorization of data to see if there really are differences in literacy by continent. Once again, we will make use of the box and whisker plots to see the distribution of literacy rates for each continent.

In [18]:
# Boxplot showing the distribution of adult literacy rates by continent
fig1, ax1 = plt.subplots()
arr = []
ax1.set_title('Distribution of Adult Literacy Rates by Continent')
continents = adult['CC'].unique()
continents = [x for x in continents if str(x) != 'nan']
for continent in continents:
    temp = adult[adult.CC == continent]
    arr.append(temp['Total'])
ax1.boxplot(arr, labels=continents)
plt.show()
In [19]:
# Boxplot showing the distribution of youth literacy rates by continent
fig1, ax1 = plt.subplots()
arr = []
ax1.set_title('Distribution of Youth Literacy Rates by Continent')
continents = youth['CC'].unique()
continents = [x for x in continents if str(x) != 'nan']
for continent in continents:
    temp = youth[youth.CC == continent]
    arr.append(temp['Youth_Total'])
ax1.boxplot(arr, labels=continents)
plt.show()

We can see that many of the trends we noticed before on the map visualization are true. South America and Europe have the highest literacy rates. Asia and Oceania are generally high, with some countries that have lower literacy rates. However, Africa has by far the largest range in literacy rates and its upper and lower quartiles fall between roughly 50% and 80%. To me, this indicates that location may have a relationship with literacy.
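
If you would rather read these per-continent distributions as numbers than off a plot, a groupby gives the same grouping in one line. The sketch below uses only the five adult rows shown in the combined table earlier, so the figures it prints describe that tiny sample, not the full data set. By default groupby also skips rows whose key is NaN, which parallels the 'nan' filter in the boxplot cells.

```python
import pandas as pd

# The five adult rows shown in the combined table (AFG, ALB, DZA, AGO, ARG)
sample = pd.DataFrame({
    'CC': ['AS', 'EU', 'AF', 'AF', 'SA'],
    'Total': [31.7411, 96.8453, 72.6487, 70.7784, 97.9738],
})

# One row of summary numbers per continent code
by_continent = sample.groupby('CC')['Total'].agg(['count', 'min', 'median', 'max'])
print(by_continent)
```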

How might geography and gender interact?

Graph Female & Male Literacy Percent on a Map

I want to quickly visualize whether the same geographic divides in literacy apply to both genders. For example, will men's and women's literacy be equally low in Africa, or does it impact women more? Will other countries/regions have much worse female vs male literacy?

In [20]:
# VISUALIZING GLOBAL LITERACY RATE (ADULT FEMALE)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth',
    data=adult,
    columns=['ISO Code', 'Female'],
    key_on='feature.id',
    fill_color='RdPu',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Literacy Rate (%)'
).add_to(m)

folium.LayerControl().add_to(m)

m
Out[20]:
In [21]:
# VISUALIZING GLOBAL LITERACY RATE (ADULT MALE)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth',
    data=adult,
    columns=['ISO Code', 'Male'],
    key_on='feature.id',
    fill_color='Blues',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Literacy Rate (%)'
).add_to(m)

folium.LayerControl().add_to(m)

m
Out[21]:

Focusing on gender, it appears that both genders have lower literacy in Saharan and Sub-Saharan Africa and South Asia. However, on the whole women have lower literacy and have a new low-literacy region: the Arabian peninsula. These visualizations suggest to me that both gender and region affect literacy and that they likely interact.

Map the Gap in Men's and Women's Literacy

To answer a question raised above about the gaps between men's and women's literacy, we can map the difference between the literacy rates of both populations. For example, it is not fair to fault a country that has 55% female literacy if the men's literacy rate is only 56%. Looking at the differences in literacy might be a more precise way to identify/visualize gender disparities.

In [22]:
# VISUALIZING GLOBAL LITERACY RATE GENDER GAP
# add gap column to the adult dataset for this visualization
adult['gap'] = adult['Male'] - adult['Female']

# map the gap between men's and women's literacy
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=country_shapes,
    name='choropleth',
    data=adult,
    columns=['ISO Code', 'gap'],
    key_on='feature.id',
    fill_color='RdBu',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Literacy Gap: Male minus Female (percentage points)'
).add_to(m)

folium.LayerControl().add_to(m)

m
Out[22]:

From this map, we can see that most places have roughly equal literacy rates for men and women. However, in parts of sub-Saharan Africa and South-West Asia/the Middle East, the male literacy rate is as much as 34 percentage points higher than the female rate.

Investigate Gender Independently

Relationship Between Boy & Girl Literacy

My next phase of visualizations/data analysis is understanding whether the literacy rates of different population groups are correlated. For example, in the visualization directly below we want to see if boy literacy is related to girl literacy and to what extent. We would expect that as boys' literacy increases, so would girls' literacy at an equal rate (i.e. a slope of 1 and, in an ideal world, an intercept at the origin). If this is not true, it could reveal gender disparities and we could use the slope to find the ratio of boys' to girls' educational attainment.

In [23]:
# Create a linear regression for the relationship between boy's and girl's literacy rates
X = youth['Boy'].values.reshape(-1,1)
y = youth['Girl'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)

# Graph a scatter plot of boys vs girls literacy rates
plt.figure(figsize=(10,10))
plt.plot(youth['Boy'], youth['Girl'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Male Youth)')
plt.ylabel('Literacy Rate (Female Youth)')
plt.title('Youth Female vs Youth Male Literacy Rate')
plt.show()

print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Slope:  [[1.34727296]] Intercept:  [-34.45091462]

We can see that although the relationship appears to be positively linear, there is an interesting relationship between boys' and girls' literacy. The slope is about 1.35 and the intercept is about -34.5. In countries with low boy literacy rates, girl literacy starts even lower. However, as boy literacy rates get higher, girls catch up, because the fitted line gains about 1.35 points of girl literacy for every point of boy literacy.

Relationship Between Adult Male & Adult Female Literacy

In the visualization directly below we want to see if adult male literacy is related to adult female literacy and to what extent. We would expect as men's literacy increases, so would women's literacy at an equal rate (i.e. a slope of 1 and, in an ideal world, an intercept at the origin). If this is not true, it could reveal gender disparities and we could use the slope to find the ratio of men's to women's educational attainment.

In [24]:
# Create a linear regression for the relationship between adult male and female literacy rates
X = adult['Male'].values.reshape(-1,1)
y = adult['Female'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)

# Graph a scatter plot of men vs women literacy rates
plt.figure(figsize=(10,10))
plt.plot(adult['Male'], adult['Female'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Male)')
plt.ylabel('Literacy Rate (Adult Female)')
plt.title('Adult Female vs Adult Male Literacy Rate')
plt.show()

print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Slope:  [[1.37570915]] Intercept:  [-39.69570417]

Adult and youth literacy rates have an almost identical relationship. This means gender disparities are worse in countries where male literacy is lower. This may be indicative of other factors which vary country-to-country, described in the background readings on causes of the educational gap. For example, poverty can make the gender gap in education worse, and countries that are more impoverished are likely to have lower literacy in general.
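
To make the slopes and intercepts more concrete, we can plug the two fitted lines back in and ask two questions: how large is the predicted gap at a given male literacy rate, and at what male rate does the line predict parity? The helper names below are mine, and the coefficients are simply copied from the printed outputs above. This is arithmetic on the fitted lines, not a new model.

```python
def predicted_female(male_rate, slope, intercept):
    """Female literacy predicted by the fitted line for a given male rate."""
    return slope * male_rate + intercept

def parity_point(slope, intercept):
    """Male rate where the line predicts equal female literacy.

    Solves slope * x + intercept = x for x.
    """
    return intercept / (1.0 - slope)

# (slope, intercept) copied from the regression outputs above
fits = {
    'youth': (1.34727296, -34.45091462),
    'adult': (1.37570915, -39.69570417),
}

for label, (slope, intercept) in fits.items():
    gap_at_50 = 50 - predicted_female(50, slope, intercept)
    print(label, 'gap at 50% male literacy:', round(gap_at_50, 2),
          '| parity at male rate:', round(parity_point(slope, intercept), 2))
```

For the youth fit the predicted gap only closes at a boy literacy rate of roughly 99%. For the adult fit the parity point falls above 100%, so within the observable range the line always predicts women slightly behind men.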

Investigate Age Independently

Relationship Between Child & Adult Literacy

The next population difference I wanted to look into was age as a factor of analysis. Were children more or less likely to be literate? If there are major differences between the two groups, it might imply changing access to education, educational quality, or the addition/removal of barriers to education.

In [25]:
# Create a linear regression for the relationship between adult and youth literacy rates
X = df['Total'].values.reshape(-1,1)
y = df['Youth_Total'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)

# Graph a scatter plot of adult vs youth literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Total'], df['Youth_Total'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult)')
plt.ylabel('Literacy Rate (Youth)')
plt.title('Youth vs Adult Literacy Rate')
plt.show()

print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Slope:  [[0.79285457]] Intercept:  [23.72918931]

This once again shows a strong positive linear trend, like the women vs men and girl vs boy correlations. Youth are more likely to be literate than adults, and the gap narrows as adult literacy increases. Since youth literacy is higher in general, this suggests that children are receiving more access to education than past generations.

How might age and gender interact?

Relationship Between Child & Adult Literacy (Sex Disaggregated)

If men have high literacy, will boys do as well? I wanted to do a secondary analysis of boys vs men and girls vs women, just in case the literacy trends are different for the different genders.

In [26]:
# Create a linear regression for the relationship between adult male and boy literacy rates
X = df['Male'].values.reshape(-1,1)
y = df['Boy'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)

# Graph a scatter plot of men vs boys literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Male'], df['Boy'], 'co')
plt.plot(zto100, pred, 'b')
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Male)')
plt.ylabel('Literacy Rate (Youth Male)')
plt.title('Youth Male vs Adult Male Literacy Rate')
plt.show()

print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Slope:  [[0.80470037]] Intercept:  [21.28440709]

This once again has a similar trend as the adult vs youth correlations. Boys are more likely to be literate than men until the gap closes as men's literacy increases.

In [27]:
# Create a linear regression for the relationship between adult female and girl literacy rates
X = df['Female'].values.reshape(-1,1)
y = df['Girl'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)

# Graph a scatter plot of women vs girls literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Female'], df['Girl'], 'mo')
plt.plot(zto100, pred, 'k')
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Female)')
plt.ylabel('Literacy Rate (Youth Female)')
plt.title('Youth Female vs Adult Female Literacy Rate')
plt.show()

print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Slope:  [[0.77672985]] Intercept:  [26.4136716]

This once again has a similar trend as the women vs men and girl vs boy correlations. Girls are more likely to be literate than women until the gap closes as women's literacy increases.

Since both boys and girls have higher literacy rates than men and women, this may indicate new generations are having better access to education, etc. than the older generations.

Part 4: Deeper Analysis: Classification

We understand from the preliminary analysis and visualization that age, location, and gender may be related and may interact to affect literacy. To see if we are able to model this, I thought it might be fun to see if we can create a classifier that, when given literacy data, can accurately predict the continent of origin. This might help us understand how closely literacy data and geographic location are connected.

Modify the Continent Encoding

In order to train the random forest classifier on our dataset, we need to convert our continent codes from strings to numbers. The pandas factorize function does this very quickly for us! It assigns -1 to missing values, so we drop those rows afterwards.

In [28]:
df['CC'], uniques = pd.factorize(df['CC'])
df = df[df['CC'] != -1]
df.head()
Out[28]:
ISO Code Countries and areas Reference year(s) Youth_Total Boy Girl CC Total Male Female
0 AFG Afghanistan 2011.0 46.9901 61.8791 32.1132 0 31.7411 45.4171 17.6121
1 ALB Albania 2011.0 98.7912 98.7314 98.8562 1 96.8453 98.0082 95.6915
2 DZA Algeria 2006.0 91.7796 94.3815 89.1382 2 72.6487 81.284 63.9188
3 AGO Angola 2013.0 72.9818 79.3764 66.6683 2 70.7784 82.3233 59.6714
4 ARG Argentina 2013.0 99.261 99.083 99.4445 3 97.9738 97.933 98.0119
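
Here is a minimal sketch of what factorize is doing, on a toy series of continent codes that I made up for illustration. It numbers the codes in order of first appearance and returns the lookup of unique values alongside, so you can translate the numbers back into continent codes later.

```python
import numpy as np
import pandas as pd

# Toy continent codes, including one missing value
codes_toy = pd.Series(['AS', 'EU', 'AF', 'AF', np.nan, 'SA'])

# factorize returns integer codes plus the unique values they index into
encoded, uniques = pd.factorize(codes_toy)

print(encoded)
print(list(uniques))

# Translate a number back to its continent code
print(uniques[2])
```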

Choosing Hyperparameters

Hyperparameters are basically variables that control how classification models are trained/created. You can optimize the hyperparameters to increase the accuracy of your classification model.

For random forest hyperparameters, I did research through this tutorial. The author said that fully optimizing the hyperparameters could take over 25 minutes to calculate. The author also showed that many of the default settings led to a high level of accuracy. I decided to follow the author's approach of using a validation curve to optimize one hyperparameter at a time. A validation curve trains the model with several candidate values for one hyperparameter and reports the accuracy for each value. I chose the optimal values for n_estimators, max_depth, min_samples_split, and min_samples_leaf. Find more information about what these hyperparameters do in the sklearn documentation. This process took about a minute for each hyperparameter and was much faster than the full optimization.

Note: There is a variation in the optimization each time! I ended up choosing values that seemed to do best when run multiple times.

The gist of what you do is test how accurately a classification model performs when generated with different settings for an isolated hyperparameter. By always choosing the hyperparameter value which yields the most accurate classifications, you are optimizing your classification model so it can be its best! For ease of identifying the best hyperparameter settings, I graph the results of the training and testing accuracies.

In [29]:
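# import the sklearn helpers used in this part (harmless if they are already loaded)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve, cross_val_score

# isolate the literacy data to act as the inputs for our classification model
rfinput = df[['Youth_Total', 'Girl', 'Boy', 'Total', 'Male', 'Female']]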

# generate training/testing data for testing our hyperparameters
X_train, X_test, y_train, y_test = train_test_split( 
              rfinput, df['CC'], test_size = 0.3, random_state = 0) 
In [30]:
# Optimize n_estimators
num_est = [100, 300, 500, 750, 800, 1200]
train_scoreNum, test_scoreNum = validation_curve(
                                RandomForestClassifier(),
                                X = X_train, y = y_train, 
                                param_name = 'n_estimators', 
                                param_range = num_est, cv = 3)

plt.figure(figsize=(10,10))
plt.plot(num_est, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(num_est, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
In [31]:
# Optimize max_depth
max_depth = [5, 8, 15, 25, 30]
train_scoreNum, test_scoreNum = validation_curve(
                                RandomForestClassifier(n_estimators=100),
                                X = X_train, y = y_train, 
                                param_name = 'max_depth', 
                                param_range = max_depth, cv = 3)

plt.figure(figsize=(10,10))
plt.plot(max_depth, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(max_depth, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
In [32]:
# Optimize min_samples_split
min_samples_split = [2, 5, 10, 15, 100]
train_scoreNum, test_scoreNum = validation_curve(
                                RandomForestClassifier(n_estimators=100),
                                X = X_train, y = y_train, 
                                param_name = 'min_samples_split', 
                                param_range = min_samples_split, cv = 3)

plt.figure(figsize=(10,10))
plt.plot(min_samples_split, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(min_samples_split, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
In [33]:
# Optimize min_samples_leaf
min_samples_leaf = [1, 2, 5, 10] 
train_scoreNum, test_scoreNum = validation_curve(
                                RandomForestClassifier(n_estimators=100),
                                X = X_train, y = y_train, 
                                param_name = 'min_samples_leaf', 
                                param_range = min_samples_leaf, cv = 3)

plt.figure(figsize=(10,10))
plt.plot(min_samples_leaf, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(min_samples_leaf, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
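
Instead of reading the best value off each plot by eye, you can also pick it numerically. The sketch below is one way to do that, not the method used above: it runs on synthetic data from `make_classification` as a stand-in for the literacy data, and the smaller `n_estimators=50` is chosen only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the literacy data (an assumption, not the real dataset)
X, y = make_classification(n_samples=150, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

param_range = [1, 2, 5, 10]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, param_name='min_samples_leaf', param_range=param_range, cv=3)

# Average the scores across the 3 folds, then take the value with the best mean
mean_cv = cv_scores.mean(axis=1)
best_value = param_range[int(np.argmax(mean_cv))]
print('mean CV accuracy per value:', mean_cv)
print('best min_samples_leaf:', best_value)
```

Passing a fixed `random_state` to the classifier removes the forest's own randomness from the comparison, which should reduce the run-to-run variation mentioned in the note above.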

Train Classification Model

We can see from our optimizations that the best values for hyperparameters appear to be the following:

  • n_estimators = 300
  • max_depth = 8
  • min_samples_split = 15
  • min_samples_leaf = 10
In [34]:
# Create our classifier with optimized hyperparameters
rfclassifier = RandomForestClassifier(n_estimators=300, max_depth=8, min_samples_split=15, min_samples_leaf=10) 

See How Well Our Model Performs: 5-fold cross-validation

In simple terms, 5-fold cross-validation breaks up our dataset into 5 random groups (folds). In each round, one group is used to test the model and the other four are used to train it, rotating until every group has served as the test set once. You then get 5 scores, one per round, showing how well the classification performed. This can help us see how accurate our classification model is! For further reading on k-fold cross-validation, check out this resource.

We used k = 5 rather than a larger k because "the least populated class in y (the continent data) has only 5 members." In order for Oceania (the continent with the fewest countries represented in the data) to be in each training/testing set, you can split the data into at most 5 groups.
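
To see why the smallest class limits k, here is a small sketch using `StratifiedKFold`, which is what `cross_val_score` uses when you pass an integer `cv` for a classifier. The labels are toy values made up for illustration: one class with 20 members and a rare class with only 5.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): class 0 has 20 members, class 1 has only 5
y = np.array([0] * 20 + [1] * 5)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rare_per_fold = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many rare-class rows land in this test fold
    rare_per_fold.append(int((y[test_idx] == 1).sum()))
    print(len(train_idx), 'train rows,', len(test_idx), 'test rows,',
          rare_per_fold[-1], 'rare-class row(s) in the test fold')
```

With 5 rare-class rows and 5 folds, each test fold gets exactly one of them. With more folds than rare-class rows, some folds would have none.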

In [35]:
rfscores = cross_val_score(rfclassifier, rfinput, df['CC'], cv=5)
rfscores
Out[35]:
array([0.5483871 , 0.63333333, 0.57142857, 0.71428571, 0.62962963])

The highest fold accuracy is around 71% and the average across the five folds is about 62%, while guessing at random would yield roughly 20% accuracy. While our model is not the best at classifying the continent based on the given data, it does show that there is at least some relationship between literacy rates and continent groupings.
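
The exact random-guess figure depends on how many continent labels there are and how unevenly the countries are spread across them. One way to measure a baseline rather than estimate it is sklearn's `DummyClassifier`. The sketch below uses hypothetical class sizes, not the real counts from our data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical class sizes for six continent labels (not the real counts)
y = np.repeat(np.arange(6), [50, 45, 40, 20, 12, 5])
X = np.zeros((len(y), 1))  # dummy classifiers ignore the features

# Baseline 1: always predict the most common class
majority = DummyClassifier(strategy='most_frequent')
# Baseline 2: guess uniformly at random among the classes
uniform = DummyClassifier(strategy='uniform', random_state=0)

majority_acc = cross_val_score(majority, X, y, cv=5).mean()
uniform_acc = cross_val_score(uniform, X, y, cv=5).mean()
print('always-majority baseline:', round(majority_acc, 3))
print('uniform-guess baseline:  ', round(uniform_acc, 3))
```

Swapping in `rfinput` and `df['CC']` for `X` and `y` would give baselines for the actual data to compare against the random forest scores.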

This imperfect classification performance is likely because some continents have similar literacy data. An interesting follow-up would be to check whether including only certain literacy data, e.g. only the adult literacy rates, yields more or less accurate results.
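
As a starting point for that follow-up, the sketch below compares cross-validated accuracy for different column subsets. It runs on a synthetic DataFrame (the `demo` frame and its labels are invented for illustration), so the numbers it prints say nothing about the real literacy data. Only the column names mirror ours.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120

# Synthetic stand-in: three made-up "continent" labels with shifted literacy levels
labels = rng.integers(0, 3, size=n)
base = 60 + 12 * labels
columns = ['Youth_Total', 'Girl', 'Boy', 'Total', 'Male', 'Female']
demo = pd.DataFrame({col: base + rng.normal(0, 8, size=n) for col in columns})

subsets = {
    'all': columns,
    'adult only': ['Total', 'Male', 'Female'],
    'youth only': ['Youth_Total', 'Girl', 'Boy'],
}

results = {}
for name, cols in subsets.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    results[name] = cross_val_score(clf, demo[cols], labels, cv=5).mean()
    print(name, round(results[name], 3))
```

To run the comparison on the real data, replace `demo` with `df` and `labels` with `df['CC']`.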

Part 5: Insight & Policy Decision

Insight

From my analysis, a few trends become apparent:

  • women in general have lower literacy than men
  • adults may end up having lower literacy than children
  • certain countries and regions have worse literacy rates and gaps in literacy than others
  • countries where literacy rates are already lower have greater gaps between youth and adults and between genders

Policy Recommendations

There is an obvious gap in literacy for women and girls. However, drawing on the background of our dataset and a feminist analysis of how the figure of the girl is used in philanthropic work, I know that adult women are not always given the same aid as girls and may even be seen as a lost cause. Interventions can target both women and girls, including strategic partnerships that build financial independence, job skills, educational attainment, and leadership experience for women and girls. A prime example of this is The Pad Project, which helps provide menstrual products to women in rural areas in India, enabling girls to continue their education and giving women and girls the important skills described previously for social capital development.

This data analysis reveals groups and countries/geographic areas where literacy is the lowest or where there are large gaps between men and women. However, my analysis cannot explain why the rate is low or why there is such a difference between men and women. It might be belief systems, lack of resources, or a state of crisis. These types of issues would have to be addressed before the countries could improve literacy rates.

How Can You Use What You Learned?

There are a lot of ways privilege and oppression can manifest themselves. It can be hard to fully see the impact of systems of inequality. By knowing how to collect, visualize, and analyze data, you are now able to see and communicate more about inequality & injustice. Another great example of this data-for-good approach I have seen recently is a collaboration between NPR and the Howard Center for Investigative Journalism on Heat & Health in American Cities.

By completing this tutorial, you have seen and been able to implement multiple ways of exploring how gender, age, and location are linked to educational inequality. Now it is up to you to choose data sources and problems important to you!