Headline Readability Varies By News Outlet

Of the following two sentences, which do you think is more readable; that is, which one is easier to read and understand?

  1. Supreme Court strikes down overall political donation cap
  2. Supreme Court allows more private money in election campaigns

I don’t know about you, but I had to look up the meaning of “overall political donation cap.” In contrast, I could easily infer what “private money in election campaigns” means.

These sentences are news headlines (from the New York Times and CNN, respectively) about the Supreme Court’s decision on McCutcheon v. FEC. In reading the news coverage of this decision, I was struck by these two headlines: they communicate the same information, but they vary so starkly in readability!

These two headlines raise the question: do different news outlets write headlines that vary systematically in their readability? For example, are headlines from the New York Times less readable, on average, than CNN’s? Or is the readability difference we see between these two McCutcheon headlines a product of chance?

To try to answer this question, we need data in the form of news headlines.

COLLECTING HEADLINES FROM GOOGLE NEWS

Thankfully, Google News offers plenty of such data. I wrote a Python script to scrape headlines and the names of their outlets from this website about 20 times a day for about two and a half weeks (April 14 to May 2, 2014).

After cleaning the results (text data are messy!), data collection yielded 9,289 unique headlines from 928 different outlets. However, we want to focus on outlets that provide a decent number of headlines. Limiting the analysis to outlets that provided at least 100 headlines leaves 4,476 unique headlines from 20 outlets.
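The filtering step described above might look roughly like the sketch below. It is an illustration only, not the code from the repository: the function name `filter_by_outlet`, the (headline, outlet) row layout, and the toy rows are all assumptions.

```python
from collections import Counter


def filter_by_outlet(rows, min_headlines=100):
    """Keep unique (headline, outlet) pairs from outlets with enough headlines."""
    unique_rows = set(rows)  # drop headlines that were scraped more than once
    counts = Counter(outlet for _, outlet in unique_rows)
    keep = {outlet for outlet, n in counts.items() if n >= min_headlines}
    return sorted(row for row in unique_rows if row[1] in keep)


# Toy data: hypothetical rows, not the scraped dataset.
rows = [
    ("Headline A", "Outlet X"),
    ("Headline A", "Outlet X"),  # duplicate from a later scrape
    ("Headline B", "Outlet X"),
    ("Headline C", "Outlet Y"),
]

# A threshold of 2 stands in for the 100-headline cutoff used in the analysis.
kept = filter_by_outlet(rows, min_headlines=2)
print(kept)
```

With the toy threshold of 2, only the two unique headlines from the hypothetical "Outlet X" survive.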

(The data as well as the code used to collect, analyze, and visualize them can be found in the GitHub repository google-news.)

MEASURING READABILITY

How do you measure the readability of a piece of text? One famous metric is called the Flesch-Kincaid Grade Level. It uses word length (in syllables) and sentence length (in words) to estimate the years of schooling that a reader requires to read and understand a piece of text: the longer the words or the sentences, the more schooling is needed to read them.

For example, the New York Times and CNN headlines above have grade levels of 11.1 and 8.9, respectively. These scores suggest that 11th graders should be able to read both headlines easily whereas 9th graders may struggle reading the New York Times headline.
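To make the metric concrete, here is a minimal sketch of the standard Flesch-Kincaid Grade Level formula applied to the two headlines. The function name `fk_grade_level` is hypothetical, and the syllable counts are hand counts (16 syllables in each headline), which is an assumption; the original analysis may have counted syllables with a library or dictionary instead.

```python
def fk_grade_level(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid Grade Level from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Each headline is treated as one sentence; syllables were counted by hand.
# "Supreme Court strikes down overall political donation cap"
nyt = fk_grade_level(n_words=8, n_sentences=1, n_syllables=16)
# "Supreme Court allows more private money in election campaigns"
cnn = fk_grade_level(n_words=9, n_sentences=1, n_syllables=16)

print(round(nyt, 1), round(cnn, 1))  # 11.1 8.9
```

Both headlines have the same number of syllables, so the New York Times headline scores higher only because its syllables are packed into fewer, longer words.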

DISTRIBUTION OF HEADLINE READABILITY

Analysis of our Google News data revealed that news headlines are generally easy to read; with a mean grade level of 7.7 (SD = 4.3), 8th graders can read most of them. The graph below (interactive version here) shows how headline readability is distributed around this average grade level.

Headlines binned by Flesch-Kincaid Grade Level.

Elementary school graduates (grade levels less than 6) can read about a third of headlines (35.9%). Middle school graduates (grade levels between 6 and 9) can handle almost two thirds of them (65.2%). Finally, high school graduates should be able to read and understand 9 out of 10 headlines on Google News.
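Shares like these can be computed as the percentage of headlines whose grade level falls below each cutoff. The sketch below assumes the percentages are cumulative and uses made-up grade levels rather than the scraped data; the helper name `share_below` and the cutoff of 12 for high school are also assumptions.

```python
from statistics import mean, stdev


def share_below(grades, threshold):
    """Percentage of headlines with a grade level below the threshold."""
    return 100 * sum(g < threshold for g in grades) / len(grades)


# Hypothetical grade levels, one per headline.
grades = [3.2, 5.1, 6.4, 7.7, 8.9, 11.1, 12.5, 14.0]

print(f"mean = {mean(grades):.1f}, SD = {stdev(grades):.1f}")
for label, cutoff in [("elementary", 6), ("middle", 9), ("high", 12)]:
    print(f"{label} school graduates: {share_below(grades, cutoff):.1f}%")
```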

AVERAGE HEADLINE READABILITY BY NEWS OUTLET

To answer the question that motivated this project, we can compare headline readability across the 20 different outlets. The graph below (interactive version here) shows that different news outlets write headlines that vary quite systematically in their readability.

Mean Flesch-Kincaid Grade Level by news outlet. Error bars denote +/- 1 standard error.

Voice of America wrote the least readable headlines, requiring almost a 10th grade education to read them. It was followed somewhat distantly by Fox News and BBC News, with grade levels around 8.5. The Fox News result is surprising given that its audience tends to be less educated than those of other outlets (Pew Research, 2012).

The average grade levels of the next 14 outlets ranged from 8.2 (Los Angeles Times) to 7.2 (Businessweek). That is, the average headline readability of 70% of the outlets in the sample fell within a single grade level. Most of the differences between these outlets are too small and variable to call statistically significant.

On the lower end of the spectrum, ESPN had the most readable headlines, which required 5 and a half years of education. It was followed by two outlets with headlines of 6th grade readability: USA Today (6.3) and ABC News (6.8).
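The per-outlet means and standard errors shown in the graph could be computed along the following lines. This is a sketch under stated assumptions, not the repository's code: the function name `outlet_summary`, the (outlet, grade level) row layout, and the toy values are hypothetical. The standard error is taken as the sample standard deviation divided by the square root of the number of headlines.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def outlet_summary(rows):
    """Map each outlet to (mean grade level, standard error of the mean)."""
    by_outlet = defaultdict(list)
    for outlet, grade in rows:
        by_outlet[outlet].append(grade)
    summary = {}
    for outlet, grades in by_outlet.items():
        standard_error = stdev(grades) / sqrt(len(grades))
        summary[outlet] = (mean(grades), standard_error)
    return summary


# Toy data: hypothetical outlets and grade levels.
rows = [
    ("Outlet X", 4.0), ("Outlet X", 6.0), ("Outlet X", 8.0),
    ("Outlet Y", 9.0), ("Outlet Y", 10.0), ("Outlet Y", 11.0),
]

summary = outlet_summary(rows)
# Print outlets from least to most readable (highest mean grade level first).
for outlet, (avg, se) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{outlet}: {avg:.1f} +/- {se:.2f}")
```

When two outlets' error bars overlap substantially, as they do for most of the middle 14 outlets, the difference between their means is hard to distinguish from sampling noise.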

CONCLUSION

News headlines are relatively easy to read: a high school graduate can handle 9 in 10. However, headline readability varies strikingly between outlets. The least and most readable outlets differ in average grade level by almost 5 years. Fifth graders and above can read ESPN headlines, but it takes 10th graders to read Voice of America’s.

We began by comparing two headlines: one from the New York Times (11.1 grade level) and another from CNN (8.9 grade level). If our data collection had ended there, then we would have incorrectly concluded that New York Times headlines are harder to read than CNN’s.

As our systematic data collection from Google News showed, these two headlines do not reflect the overall trend in readability between their outlets (on average, CNN headlines are harder to read than the NYT’s).

The discrepancy between these two conclusions highlights how anecdotal evidence, if untested with the systematic collection of data, can skew our understanding of how the world really works.