By Rachel Walter, Fall 2019

I am a computer science major with a minor in women's studies. I thought it would be interesting to explore how these two fields can interact by learning how to analyze and visualize data with a critical focus on gender. This project will specifically investigate the gender gap in education globally, as measured by literacy rates of each country. We will also look into how location and gender interact to yield gender disparities.
By the end of this tutorial, you will be able to collect, clean, visualize, and analyze data with a critical focus on gender. This process can be applied to a variety of datasets that distinguish location (i.e. country, city, continent) and are sex disaggregated.
The following libraries/tools will be used throughout the project. They will allow us to do more with our data and to visualize it! If you do not have any of these libraries installed, you can install them into your development environment with the help of $ pip3 install [package]. You can find more information on them through their documentation linked below:
pandas, a data analysis toolkit which helps us work with data frames - docs
numpy, a package for scientific computing with special n-d arrays - docs
folium for creating map visualizations - docs
sklearn for regressions and modelling - docs
matplotlib for graphing - docs
requests for HTTP requests - docs
BeautifulSoup for web-scraping - docs
# importing required libraries and tools
import pandas as pd
import numpy as np
import folium
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve, cross_val_score
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
Globally, access to education is dependent on a variety of factors including infrastructure development, presence of conflict, socioeconomic status, and gender. In this case, I am going to look only at how gender (and location) affects educational attainment. This, of course, may not paint a full picture since multiple forces can compound to privilege or oppress educational opportunity.
Readings below offer a perspective on known trends in the gender gap in education and why it exists:
When I was learning about classification and refining hyperparameters, I used the following resources:
The first step of this project is to collect data. In my case, this meant choosing a data source that provided the information I needed for analyzing gender and education. I found that UNICEF keeps publicly available data on various global development measures surrounding children, such as childhood survival rates. I was able to find data on youth and adult literacy rates (sex disaggregated).
I updated the Excel spreadsheets to only contain the data tables, making it much easier to access the data. These updated .xlsx files are available to download from my GitHub for this project. We are going to start collection with one pandas dataframe each for the youth and adult datasets.
# Read the data from the Excel file -- YOU MIGHT HAVE TO CHANGE THE FILE PATH
youth = pd.read_excel('Table-Youth-and-Adult-Literacy-Rate-updated-Oct.-2015_78.xlsx', sheet_name=0)
adult = pd.read_excel('Table-Youth-and-Adult-Literacy-Rate-updated-Oct.-2015_78.xlsx', sheet_name=1)
print(youth.head())
print(adult.head())
Next we need to process the data a little bit more to make it clean. This means everything will be well labelled and missing data will be handled. Other times you might want to reorganize your data table to be tidy, but since our tables are fairly simple and clean I am not going to reorganize.
We can see just from head() that there are extra columns that are all NaN and that the Source column always says "UNESCO Institute for Statistics," which is not useful for our analysis. We also have to work on re-labelling the Male and Female columns, which are currently "Unnamed: 7" and "Sex." Some countries are included in the table but have no actual data, such as Andorra.
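As a sanity check, you can also find the all-NaN columns programmatically instead of reading them off of head(). Here is a minimal sketch on a toy frame (the "Unnamed" column name just mimics the real table):

```python
import numpy as np
import pandas as pd

# Toy frame with an all-NaN column, mimicking the extra columns above
toy = pd.DataFrame({'Country': ['A', 'B'],
                    'Unnamed: 4': [np.nan, np.nan],
                    'Total': [91.2, 80.5]})
# Columns where every value is NaN are safe to drop
all_nan = toy.columns[toy.isna().all()].tolist()
toy = toy.drop(columns=all_nan)
print(all_nan)            # ['Unnamed: 4']
print(list(toy.columns))  # ['Country', 'Total']
```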
# CLEAN YOUTH DATA TABLE
# Drop the columns that are all NaNs
youth = youth.drop(axis=1, labels=['Unnamed: 4','Unnamed: 6', 'Unnamed: 8'])
# Drop the Source column
youth = youth.drop(axis=1, labels=['Source'])
# Relabel the Male/Female columns and drop the first row
youth = youth.drop(axis=0, index=0)
youth = youth.rename(columns={"Sex": "Boy", "Unnamed: 7": "Girl"})
# Drop rows that do not have Total, Male, or Female data
youth = youth[youth.Total != '-']
youth = youth[youth.Boy != '-']
youth = youth[youth.Girl != '-']
youth = youth.dropna()
# Rename the Total to Youth_Total for when we create a combined table we can keep both Total columns
youth = youth.rename(columns={"Total": "Youth_Total"})
youth.head()
Let us repeat this process on the adult data set.
# CLEAN ADULT DATA TABLE
# Drop the columns that are all NaNs
adult = adult.drop(axis=1, labels=['Unnamed: 4','Unnamed: 6', 'Unnamed: 8'])
# Drop the Source column
adult = adult.drop(axis=1, labels=['Source'])
# Relabel the Male/Female columns and drop the first row
adult = adult.drop(axis=0, index=0)
adult = adult.rename(columns={"Sex": "Male", "Unnamed: 7": "Female"})
# Drop rows that do not have Total, Male, or Female data
adult = adult[adult.Total != '-']
adult = adult[adult.Male != '-']
adult = adult[adult.Female != '-']
adult = adult.dropna()
adult.head()
As stated in the introduction, I am interested in how location and gender may interact to yield differences in education and literacy. Since each country only has one data point, it would be useful to be able to group countries geographically to see if this relationship exists. The easiest way to do this is to match each country to its respective continent. Do not worry, this is much easier than you might expect thanks to tools like BeautifulSoup.
I found a Wikipedia page with a table which stores the country name, 2-character code, 3-character code, and continent code. I used Beautiful Soup and pandas to create a dataframe from that table. The hardest part was using print(soup.prettify()) to find my data in the HTML; however, most tables are of the class "wikitable sortable," making it much easier to find and separate out that data!
# Request Wikipedia Page with Data Table
website_url = requests.get(
'https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)').text
# Extract HTML for parsing
soup = BeautifulSoup(website_url, 'html.parser')
# You can print out the HTML if you need to find your table
#print(soup.prettify())
# Find the data table we want to scrape
table = soup.find('table',{'class':'wikitable sortable'})
# Use the pandas library to convert that HTML table into a pandas dataframe
countrydf = pd.read_html(str(table))[0]
countrydf.head()
Now, as you can see above, we have a dataset that can match our ISO country codes with the continent codes (EU = Europe, SA = South America, AS = Asia, OC = Oceania, AF = Africa, and NA = North America). We will now make a few changes to the countrydf so that it can be combined with our main data sets.
# Drop name, number, and 2 character code from the table, we don't need these
countrydf = countrydf.drop(axis=1, labels=['Name', '#', 'a-2'])
# Rename the a-3 column to "ISO Code." This will help with the merge!
countrydf = countrydf.rename(columns={"a-3": "ISO Code"})
# Merge country df into the main data sets (youth, adult, and the combined set)
youth = pd.merge(youth, countrydf)
adult = pd.merge(adult, countrydf)
# See below in "Creating Combined Table" for more information on "merge"
countrydf.head()
Now that both tables are tidy, I want to create a third table that will help me explore if the youth and adult scores are related for each country. However, we are not guaranteed to have all that data for every country. I am going to do a merge of the youth and adult dataframes to create a new table of only countries that are represented in both datasets.
The alternate option to merging the tables would be making a data table schema like the following:
| ISO Code | Countries and areas | Reference year(s) | Adult/Youth | Total | Male | Female |
Where Adult/Youth would be some 0/1 or Y/A value indicating whether that row belongs to the youth or adult data set. This would be more in line with tidy data theory. However, for the ease of coding up visualizations, I am keeping my data in its less tidy form.
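For reference, that tidier schema could be built with pd.concat plus an indicator column. A minimal sketch with made-up numbers (only the column names come from the real tables):

```python
import pandas as pd

# Made-up one-country example of the long "Adult/Youth" schema
youth_part = pd.DataFrame({'ISO Code': ['AAA'], 'Total': [95.0],
                           'Male': [96.0], 'Female': [94.0]})
adult_part = pd.DataFrame({'ISO Code': ['AAA'], 'Total': [80.0],
                           'Male': [85.0], 'Female': [75.0]})
# Tag each table with its population group, then stack them
youth_part['Adult/Youth'] = 'Y'
adult_part['Adult/Youth'] = 'A'
tidy = pd.concat([youth_part, adult_part], ignore_index=True)
print(tidy)
```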
df = pd.merge(youth,adult)
df.head()
We now have tables that, for each country represented, store the ISO code, name, reference years, and literacy rates for each population/sub-population.
Now that our data is all set, we can begin analyzing it! There are a few beginning exploratory visualizations and analyses I want to do with our data, all of which will be described and demonstrated below.
These are the basic statistics for any data set, which can help us see the central tendency and get a better feel for our data. I am looking at the statistics for each population category (i.e. Youth_Total, Total, Boy, Girl, Male, and Female).
# Analyze the summary stats for each population category
for frame, col, label in [
        (youth, 'Youth_Total', 'YOUTH_TOTAL'), (youth, 'Girl', 'GIRL'),
        (youth, 'Boy', 'BOY'), (adult, 'Total', 'TOTAL'),
        (adult, 'Female', 'FEMALE'), (adult, 'Male', 'MALE')]:
    vals = frame[col]
    mean = np.mean(vals)
    median = np.median(vals)
    mini = np.min(vals)
    maxi = np.max(vals)
    stddev = np.std(vals)
    print('SUMMARY STATS FOR', label)
    print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
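As an aside, pandas can produce most of these summaries in one call with describe(); one caveat is that describe() reports the sample standard deviation (ddof=1), while np.std defaults to the population version. A toy example with made-up numbers:

```python
import pandas as pd

# describe() reports count, mean, std, min, quartiles, and max at once
toy = pd.DataFrame({'Girl': [50.0, 70.0, 90.0], 'Boy': [60.0, 80.0, 95.0]})
stats = toy.describe()
print(stats)
```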
From these summary statistics, a few patterns appear. First and foremost, all of the statistics for the female/girl populations were lower than those for the corresponding male populations. Another interesting thing to notice is that the mean is always less than the median, indicating there may be some lower-end outliers bringing down the mean. This knowledge might come in handy later when seeing how region relates to gender and literacy.
If we want a visual of the distribution within the literacy rate data sets instead of the summary statistics, we can also graph box and whisker plots, as shown below for the total, male, and female literacy rates (separated into the youth and adult categories):
# Boxplot showing the distribution of adult literacy rates
fig1, ax1 = plt.subplots()
ax1.set_title('Literacy Rate Among Adult Population')
ax1.boxplot([adult['Total'], adult['Male'], adult['Female']], labels=['Total', 'Male','Female'])
plt.show()
# Boxplot showing the distribution of youth literacy rates
fig1, ax1 = plt.subplots()
ax1.set_title('Literacy Rate Among Youth Population')
ax1.boxplot([youth['Youth_Total'], youth['Boy'], youth['Girl']], labels=['Total', 'Male','Female'])
plt.show()
This preliminary visualization will help me see how literacy varies across the globe. Some countries, regions, or continents might have higher literacy rates. Seeing this mapped out will help us know, later in the in-depth analysis, whether location and gender seem to interact.
I am using the Folium library to map the percentage of literacy. Note that you have to retrieve country map data in GeoJSON format before being able to map it out. Then, we can match our data's ISO codes to the IDs of countries' geographical areas to map out literacy rates visually.
# Get the GeoJSON data you need for choropleth map
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
# VISUALIZING GLOBAL LITERACY RATE (ADULT TOTAL)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
geo_data=country_shapes,
name='choropleth',
data=adult,
columns=['ISO Code', 'Total'],
key_on='feature.id',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Literacy Rate (%)'
).add_to(m)
folium.LayerControl().add_to(m)
m
Glancing over the map, it looks like South Asia and the Saharan/Sub-Saharan region in Africa have the lowest literacy rates. Most of Europe, Asia, and South America appear to have higher literacy rates.
While on the topic of geographic location, let's use the continent categorization of data to see if there really are differences in literacy by continent. Once again, we will make use of the box and whisker plots to see the distribution of literacy rates for each continent.
# Boxplot showing the distribution of adult literacy rates by continent
fig1, ax1 = plt.subplots()
arr = []
ax1.set_title('Distribution of Adult Literacy Rates by Continent')
continents = adult['CC'].unique()
continents = [x for x in continents if str(x) != 'nan']
for continent in continents:
temp = adult[adult.CC == continent]
arr.append(temp['Total'])
ax1.boxplot(arr, labels=continents)
plt.show()
# Boxplot showing the distribution of youth literacy rates by continent
fig1, ax1 = plt.subplots()
arr = []
ax1.set_title('Distribution of Youth Literacy Rates by Continent')
continents = youth['CC'].unique()
continents = [x for x in continents if str(x) != 'nan']
for continent in continents:
temp = youth[youth.CC == continent]
arr.append(temp['Youth_Total'])
ax1.boxplot(arr, labels=continents)
plt.show()
We can see that many of the trends we noticed before on the map visualization hold true. South America and Europe have the highest literacy rates. Asia and Oceania are generally high, with some countries having lower literacy rates. However, Africa has by far the largest range in literacy rates, and its upper and lower quartiles span roughly 50-80%. To me, this indicates that location may have a relationship with literacy.
I want to quickly visualize if the same geographic divides for literacy apply to the genders. For example, will both men and women's literacy be equally low in Africa, or does it impact women more? Will other countries/regions have much worse Female vs Male literacy?
# VISUALIZING GLOBAL LITERACY RATE (ADULT FEMALE)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
geo_data=country_shapes,
name='choropleth',
data=adult,
columns=['ISO Code', 'Female'],
key_on='feature.id',
fill_color='RdPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Literacy Rate (%)'
).add_to(m)
folium.LayerControl().add_to(m)
m
# VISUALIZING GLOBAL LITERACY RATE (ADULT MALE)
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
geo_data=country_shapes,
name='choropleth',
data=adult,
columns=['ISO Code', 'Male'],
key_on='feature.id',
fill_color='Blues',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Literacy Rate (%)'
).add_to(m)
folium.LayerControl().add_to(m)
m
Focusing on gender, it appears that both genders have lower literacy in Saharan and Sub-Saharan Africa and South Asia. However, on the whole women have lower literacy, and a new low-literacy region appears for women: the Arabian peninsula. These visualizations suggest to me that both gender and region affect literacy and that they likely interact.
To answer a question raised above about the gaps between men's and women's literacy, we can map the difference between the literacy rates of both populations. For example, it is not fair to fault a country that has 55% female literacy if the men's literacy rate is only 56%. Looking at the differences in literacy might be a more precise way to identify/visualize gender disparities.
# VISUALIZING GLOBAL LITERACY RATE GENDER GAP
# add gap column to the adult dataset for this visualization
adult['gap'] = adult['Male'] - adult['Female']
# map the gap between men's and women's literacy
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
geo_data=country_shapes,
name='choropleth',
data=adult,
columns=['ISO Code', 'gap'],
key_on='feature.id',
fill_color='RdBu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Literacy Rate (%)'
).add_to(m)
folium.LayerControl().add_to(m)
m
From this map, we can see that most places have roughly equal literacy rates for men and women. However, in areas of sub-Saharan Africa and Southwest Asia/the Middle East, men's literacy rates exceed women's by as much as 34 percentage points.
My next phase of visualizations/data analysis is understanding whether the literacy rates of different population groups are correlated. For example, in the visualization directly below we want to see if boys' literacy is related to girls' literacy and to what extent. We would expect that as boys' literacy increases, girls' literacy would increase at an equal rate (i.e. a slope of 1 and, in an ideal world, an intercept at the origin). If this is not true, it could reveal gender disparities, and we could use the slope to find the ratio of boys' to girls' educational attainment.
# Create a linear regression for the relationship between boy's and girl's literacy rates
X = youth['Boy'].values.reshape(-1,1)
y = youth['Girl'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)
# Graph a scatter plot of boys vs girls literacy rates
plt.figure(figsize=(10,10))
plt.plot(youth['Boy'], youth['Girl'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Male Youth)')
plt.ylabel('Literacy Rate (Female Youth)')
plt.title('Youth Female vs Youth Male Literacy Rate')
plt.show()
print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
We can see that although the relationship appears to be positively linear, there is an interesting relationship between boys' and girls' literacy. The slope is about 1.35. In countries with low boy literacy rates, girl literacy starts even lower. However, as boy literacy rates get higher, girls are able to catch up because of that steeper rate of change.
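To make that slope concrete, here is a hypothetical line with the reported slope of 1.35; the intercept of -35 is an assumption chosen so the line passes through (100, 100), not the fitted value printed above:

```python
# Hypothetical line: slope from the text, intercept assumed for illustration
slope, intercept = 1.35, -35.0
for boy in (60, 80, 100):
    girl = slope * boy + intercept
    print(boy, girl)  # girls trail badly at low boy literacy, reach parity at 100%
```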
In the visualization directly below we want to see if adult male literacy is related to adult female literacy and to what extent. We would expect as men's literacy increases, so would women's literacy at an equal rate (i.e. a slope of 1 and, in an ideal world, an intercept at the origin). If this is not true, it could reveal gender disparities and we could use the slope to find the ratio of men's to women's educational attainment.
# Create a linear regression for the relationship between adult male and female literacy rates
X = adult['Male'].values.reshape(-1,1)
y = adult['Female'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)
# Graph a scatter plot of men vs women literacy rates
plt.figure(figsize=(10,10))
plt.plot(adult['Male'], adult['Female'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Male)')
plt.ylabel('Literacy Rate (Adult Female)')
plt.title('Adult Female vs Adult Male Literacy Rate')
plt.show()
print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
Adult and youth literacy rates have an almost identical relationship. This means gender disparities are worse in countries where male literacy is lower. This may be indicative of other factors, varying country-to-country, described in the background readings on causes of the educational gap. For example, poverty can make the gender gap in education worse, and countries that are more impoverished are likely to have lower literacy in general.
The next population difference I wanted to look into was age as a factor of analysis. Were children more or less likely to be literate? If there are major differences between the two groups, it might imply changing access to education, educational quality, or the addition/removal of barriers to education.
# Create a linear regression for the relationship between adult and youth literacy rates
X = df['Total'].values.reshape(-1,1)
y = df['Youth_Total'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)
# Graph a scatter plot of men vs boys literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Total'], df['Youth_Total'], 'o')
plt.plot(zto100, pred)
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult)')
plt.ylabel('Literacy Rate (Youth)')
plt.title('Youth vs Adult Literacy Rate')
plt.show()
print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
This once again shows a similar trend to the women vs men and girl vs boy correlations. Youth are more likely to be literate than adults until the gap closes as adult literacy increases. Since in general youth literacy is higher, this means that children are receiving more access to education than past generations.
If men have high literacy, will boys do as well? I wanted to do a secondary analysis of boys vs men and girls vs women, just in case the literacy trends are different for the different genders.
# Create a linear regression for the relationship between adult male and boy literacy rates
X = df['Male'].values.reshape(-1,1)
y = df['Boy'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)
# Graph a scatter plot of men vs boys literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Male'], df['Boy'], 'co')
plt.plot(zto100, pred, 'b')
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Male)')
plt.ylabel('Literacy Rate (Youth Male)')
plt.title('Youth Male vs Adult Male Literacy Rate')
plt.show()
print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
This once again shows a similar trend to the adult vs youth correlation. Boys are more likely to be literate than men until the gap closes as men's literacy increases.
# Create a linear regression for the relationship between adult female and girl literacy rates
X = df['Female'].values.reshape(-1,1)
y = df['Girl'].values.reshape(-1,1)
reg = linear_model.LinearRegression().fit(X,y)
zto100 = np.arange(0, 101, 1).reshape(-1, 1)
pred = reg.predict(zto100)
# Graph a scatter plot of women vs girls literacy rates
plt.figure(figsize=(10,10))
plt.plot(df['Female'], df['Girl'], 'mo')
plt.plot(zto100, pred, 'k')
plt.ylim((0,100))
plt.xlim((0,100))
plt.xlabel('Literacy Rate (Adult Female)')
plt.ylabel('Literacy Rate (Youth Female)')
plt.title('Youth Female vs Adult Female Literacy Rate')
plt.show()
print('Slope: ', reg.coef_, 'Intercept: ', reg.intercept_)
This once again shows a similar trend to the women vs men and girl vs boy correlations. Girls are more likely to be literate than women until the gap closes as women's literacy increases.
Since both boys and girls have higher literacy rates than men and women, this may indicate new generations are having better access to education, etc. than the older generations.
We understand from the preliminary analysis and visualization that age, location, and gender appear to be related and to interact together to affect literacy. To see if we are able to model this, I thought it might be fun to create a classifier that, when given literacy data, can predict the continent of origin. This might help us understand how closely literacy data and geographic location are connected.
In order to train on our dataset with the random forest classifier, we need to convert our continent codes from strings to numbers. pandas' factorize does this very quickly for us!
df['CC'], uniques = pd.factorize(df['CC'])
df = df[df['CC'] != -1]
df.head()
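To see what factorize does, here is a quick standalone example on toy labels:

```python
import pandas as pd

# factorize maps each distinct label to an integer code and returns
# the unique labels; missing values get the code -1 (filtered out above)
codes, uniques = pd.factorize(pd.Series(['EU', 'AF', 'EU', None, 'AS']))
print(list(codes))    # [0, 1, 0, -1, 2]
print(list(uniques))  # ['EU', 'AF', 'AS']
```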
Hyperparameters are basically variables that control how classification models are trained/created. You can optimize the hyperparameters to increase the accuracy of your classification model.
For the random forest hyperparameters, I did research through this tutorial. The author said that fully optimizing the hyperparameters could take over 25 minutes to calculate, and also showed that many of the default settings led to a high level of accuracy. I decided to follow along with how the author used a validation curve to optimize one hyperparameter at a time: train with different values and plot the accuracy for each setting of the hyperparameter. I chose the optimal values for n_estimators, max_depth, min_samples_split, and min_samples_leaf. Find more information about what these hyperparameters do in the sklearn documentation. This process took about a minute per hyperparameter and was much faster than the full optimization.
Note: There is variation in the optimization results each time! I ended up choosing values that seemed to do best when run multiple times.
The gist of what you do is test how accurately a classification model performs when generated with different settings for an isolated hyperparameter. By always choosing the hyperparameter value which yields the most accurate classifications, you are optimizing your classification model so it can be its best! For ease of identifying the best hyperparameter settings, I graph the results of the training and testing accuracies.
# isolate the literacy data to act as the inputs for our classification model
rfinput = df[['Youth_Total', 'Girl', 'Boy', 'Total', 'Male', 'Female']]
# generate training/testing data for testing our hyperparameters
X_train, X_test, y_train, y_test = train_test_split(
rfinput, df['CC'], test_size = 0.3, random_state = 0)
# Optimize n_estimators
num_est = [100, 300, 500, 750, 800, 1200]
train_scoreNum, test_scoreNum = validation_curve(
RandomForestClassifier(),
X = X_train, y = y_train,
param_name = 'n_estimators',
param_range = num_est, cv = 3)
plt.figure(figsize=(10,10))
plt.plot(num_est, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(num_est, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
# Optimize max_depth
max_depth = [5, 8, 15, 25, 30]
train_scoreNum, test_scoreNum = validation_curve(
RandomForestClassifier(n_estimators=100),
X = X_train, y = y_train,
param_name = 'max_depth',
param_range = max_depth, cv = 3)
plt.figure(figsize=(10,10))
plt.plot(max_depth, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(max_depth, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
# Optimize min_samples_split
min_samples_split = [2, 5, 10, 15, 100]
train_scoreNum, test_scoreNum = validation_curve(
RandomForestClassifier(n_estimators=100),
X = X_train, y = y_train,
param_name = 'min_samples_split',
param_range = min_samples_split, cv = 3)
plt.figure(figsize=(10,10))
plt.plot(min_samples_split, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(min_samples_split, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
# Optimize min_samples_leaf
min_samples_leaf = [1, 2, 5, 10]
train_scoreNum, test_scoreNum = validation_curve(
RandomForestClassifier(n_estimators=100),
X = X_train, y = y_train,
param_name = 'min_samples_leaf',
param_range = min_samples_leaf, cv = 3)
plt.figure(figsize=(10,10))
plt.plot(min_samples_leaf, np.mean(test_scoreNum, axis=1), label='Cross-Validation Score')
plt.plot(min_samples_leaf, np.mean(train_scoreNum, axis=1), label='Training Score')
plt.legend()
plt.show()
We can see from our optimizations that the best values for hyperparameters appear to be the following:
n_estimators = 300
max_depth = 8
min_samples_split = 15
min_samples_leaf = 10
# Create our classifier with optimized hyperparameters
rfclassifier = RandomForestClassifier(n_estimators=300, max_depth=8, min_samples_split=15, min_samples_leaf=10)
In simple terms, a 5-fold cross-validation breaks up our dataset into 5 random groups. One group is used to test the model and the other four are used to train it, repeating until each group has served as the test set once. You will then get 5 scores of how well the classification performed. This can help us see how accurate our classification model is! For further reading on k-fold cross-validation, check out this resource.
We had to use k = 5 because "the least populated class in y (the continent data) has only 5 members." In order for Oceania (the continent with the fewest countries represented in the data) to appear in every training/testing split, you can split the data into at most 5 groups.
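A quick standalone illustration of the 5-fold split sizes (20 toy samples, so each fold holds out 4):

```python
import numpy as np
from sklearn.model_selection import KFold

# Each of the 5 folds trains on 16 samples and tests on the held-out 4
X = np.arange(20).reshape(-1, 1)
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print(len(train_idx), len(test_idx))  # 16 train / 4 test in every fold
```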
rfscores = cross_val_score(rfclassifier, rfinput, df['CC'], cv=5)
rfscores
The highest accuracy it achieves is around 70%, while guessing at random would yield roughly 20% accuracy. While our model is not the best at classifying the continent based on the given data, it does show that there is at least some relationship between literacy rates and continent groupings.
This imperfect classification performance is likely because some continents have similar data. An interesting follow-up would be to see whether including only certain literacy data, i.e. only the adult literacy rates, yields a more or less accurate result.
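That follow-up could be sketched like this; the data here is random stand-in data (the real version would reuse df and rfclassifier from above), so the scores themselves are meaningless:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random stand-in for the literacy table: 100 "countries", 6 rate columns
rng = np.random.default_rng(0)
X_all = rng.uniform(20, 100, size=(100, 6))
y = rng.integers(0, 5, size=100)   # fake continent codes
X_adult = X_all[:, 3:]             # keep only the adult-rate columns
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X_adult, y, cv=5)
print(scores.mean())
```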
From my analysis, a few trends become apparent:
There is an obvious gap in literacy for women and girls. However, having background on our dataset and a feminist analysis of the use of the girl in philanthropic work, I know that adult women are not always given the same aid as girls and may even be seen as a lost cause. There can be interventions that target both women and girls, even strategic partnerships that build financial independence, job skills, educational attainment, and leadership experience. A prime example of this is The Pad Project, which helps provide menstrual products to women in rural areas in India, enabling girls to continue their education and giving women and girls the important skills described previously for social capital development.
This data analysis reveals groups and countries/geographic areas where literacy is the lowest or where there are large gaps between men and women. However, my analysis cannot begin to explain why a rate is low or why there is such a difference between men and women. It might be belief systems, lack of resources, or a state of crisis. These types of issues would have to be addressed before the countries could improve literacy rates.
There are a lot of ways privilege and oppression can manifest themselves. It can be hard to fully see the impact of systems of inequality. By knowing how to collect, visualize, and analyze data, you are now able to see and communicate more about inequality & injustice. Another great example of this data-for-good approach I have seen recently is a collaboration between NPR and the Howard Center for Investigative Journalism on Heat & Health in American Cities.
By completing this tutorial, you have seen and been able to implement multiple ways of exploring how gender, age, and location are linked to educational inequality. Now it is up to you to choose data sources and problems important to you!