Compare Web Page Entities with Google NLP in Python
This is part 2 of a two-part series. Please see Getting Started with Google NLP API Using Python first.
For search engines and SEO, Natural Language Processing (NLP) has been a revolution. NLP is the methodology by which machines understand human language. This matters because machines perform the bulk of page evaluation. While some knowledge of the science behind NLP is useful, we now have tools that let us use NLP without a data science degree. By understanding how machines interpret our content, we can adjust for misalignment or ambiguity. Let’s go!
In this intermediate tutorial (part 2), using two web pages, I’ll show you how you can:
- Compare entities and their salience between two web pages
- Display missing entities between two pages
I recommend reading the full Google NLP documentation for instructions on setting up Google Cloud Platform, enabling the NLP API, and configuring authentication.
Table of Contents
Requirements and Assumptions
- Python 3 is installed and you have a basic understanding of Python syntax
- Access to a Linux installation (Ubuntu recommended) or Google Colab
- Google Cloud Platform account
- NLP API Enabled
- Credentials created (service account) and JSON file downloaded
Import Modules and Set Authentication
Several modules must be installed and imported. If you use Google Colab, many are preinstalled; otherwise install the Google NLP client library.
- os – setting the environment variable for credentials
- google.cloud – Google’s NLP modules
- pandas – for organizing data into dataframes
- fake_useragent – for generating a user agent when making a request
- matplotlib – for the scatter plots
Of those, two need installation: fake_useragent and a specific pandas version (Google Colab may include an older pandas). Install the packages shown below.
!pip3 install fake_useragent
!pip3 install pandas==1.1.2
import os
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from fake_useragent import UserAgent
from google.cloud import language_v1
from google.cloud.language_v1 import enums
Next, set the environment variable that points to the credentials JSON file for the Google API. Google requires that the credentials be available via an environment variable. The example below assumes Google Colab (remember to upload the file first). To set the variable persistently on Linux (Ubuntu), add an equivalent export line to ~/.profile or ~/.bashrc, replacing the path as needed. Keep this JSON file secure.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path_to_json_credentials_file"
Build NLP Function
Because the same process evaluates both pages, create a function to reduce redundant code. The function processhtml() below will:
- Create a new user agent for the request header
- Make the request to the web page and store the HTML content
- Initialize the Google NLP
- Communicate to Google that you are sending them HTML, rather than plain text
- Send the request to Google NLP
- Store the JSON response
- Convert the JSON into a python dictionary with the entities and salience scores (adjust rounding as needed)
- Convert the keys to lower case (for comparing)
- Return the new dictionary to the main script
def processhtml(url):
    # Generate a fresh Chrome user agent for the request header
    ua = UserAgent()
    headers = {'User-Agent': ua.chrome}

    # Fetch the web page and store the raw HTML
    res = requests.get(url, headers=headers)
    html_page = res.text

    url_dict = {}

    # Initialize the NLP client and tell Google we are sending HTML, not plain text
    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.HTML
    document = {"content": html_page, "type": type_, "language": "en"}
    encoding_type = enums.EncodingType.UTF8

    # Send the request and store the entity/salience pairs (adjust rounding as needed)
    response = client.analyze_entities(document, encoding_type=encoding_type)
    for entity in response.entities:
        url_dict[entity.name] = round(entity.salience, 4)

    # Lowercase the keys so comparisons between pages are case-insensitive
    url_dict = {k.lower(): v for k, v in url_dict.items()}
    return url_dict
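The lowercasing step at the end of the function deduplicates case variants of the same entity so the later comparisons match up. In isolation it works like this (the sample entity/salience pairs are made up, standing in for the API response):

```python
# Hypothetical entity/salience pairs standing in for the API response.
url_dict = {"SEO": 0.41, "Content Marketing": 0.12}

# Lowercase the keys so comparisons between pages are case-insensitive.
url_dict = {k.lower(): v for k, v in url_dict.items()}
print(url_dict)  # {'seo': 0.41, 'content marketing': 0.12}
```

Note that if two entities collide after lowercasing (e.g. "SEO" and "seo"), the last one encountered wins, which is usually acceptable since Google tends to assign such variants similar salience.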
Process NLP Data and Calculate Salience Difference
Now that we have the function, set the variables containing the web page URLs to compare and send them to the function.
url1 = "https://www.rocketclicks.com/seo/"
url2 = "http://www.jenkeller.com/websitesearchengineoptimization.html"

url1_dict = processhtml(url1)
url2_dict = processhtml(url2)
We now have NLP data for each URL. Next, compare the two entity lists. When entities match, calculate the difference in salience if the competitor’s score is higher. This code snippet will:
- Create an empty dataframe with four columns (Entity, URL1, URL2, Difference). URL1 and URL2 contain the salience scores for each entity on that URL.
- Compare each entity in both lists; if they match, store each salience score in variables.
- If the competitor’s salience score for a keyword is greater than yours, record the difference (adjust rounding as needed).
- Add the comparison data for the entity to the dataframe.
- Print the dataframe after processing all matched entities.
df = pd.DataFrame([], columns=['Entity', 'URL1', 'URL2', 'Difference'])

for key in set(url1_dict) & set(url2_dict):
    # Compare salience as floats, not strings
    url1_salience = float(url1_dict.get(key, 0))
    url2_salience = float(url2_dict.get(key, 0))

    # Record the gap only when the competitor's salience is higher
    if url2_salience > url1_salience:
        diff = round(url2_salience - url1_salience, 3)
    else:
        diff = 0

    new_row = {'Entity': key, 'URL1': url1_salience, 'URL2': url2_salience, 'Difference': diff}
    df = df.append(new_row, ignore_index=True)

print(df.sort_values(by='Difference', ascending=False))
Example Output
This output shows entities found on both pages where Google NLP assigns higher salience on the competitor page. These are keywords worth investigating to determine whether your page can communicate those concepts more clearly. Salience scores are rounded to three decimal places; adjust the rounding to reveal finer differences.
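If you prefer a vectorized approach over the loop above, the same matched-entity comparison can be sketched with a pandas merge. The toy dictionaries below stand in for the output of processhtml(); the entity names and scores are hypothetical:

```python
import pandas as pd

# Toy salience dictionaries standing in for processhtml() output.
url1_dict = {"seo": 0.30, "content": 0.10, "links": 0.05}
url2_dict = {"seo": 0.45, "content": 0.08, "keywords": 0.20}

df1 = pd.DataFrame(url1_dict.items(), columns=["Entity", "URL1"])
df2 = pd.DataFrame(url2_dict.items(), columns=["Entity", "URL2"])

# An inner merge keeps only entities present on both pages.
merged = df1.merge(df2, on="Entity")

# Record the gap only where the competitor's salience is higher.
merged["Difference"] = (merged["URL2"] - merged["URL1"]).clip(lower=0).round(3)
print(merged.sort_values(by="Difference", ascending=False))
```

The merge replaces both the set intersection and the loop, and keeping the scores numeric means the final sort orders by actual salience difference rather than string order.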

Find Difference in Named Entities
Next, it’s useful—especially for a competitor page that is outranking yours—to find entities present on their page but missing from yours. The snippet below:
- Uses set() to compare entities between the two dictionaries; entities present in the competitor list but not yours are stored in diff_lists.
- Because set() operates on keys, it discards values (the salience scores), so we add them back in.
- Create the final_diff dictionary and convert it to a dataframe.
- Print the dataframe and sort by score in descending order.
# Entities on the competitor page (url2) that are absent from yours (url1)
diff_lists = set(url2_dict) - set(url1_dict)

# Reattach the salience scores that set() discarded
final_diff = {k: url2_dict[k] for k in diff_lists}

df = pd.DataFrame(final_diff.items(), columns=['Keyword', 'Score'])

# Sort by score first, then take the top 25
print(df.sort_values(by='Score', ascending=False).head(25))
Example Output
This list shows the top 25 entities by salience that appear on the competitor page but not on your page. Adjust head() to view more or fewer entries. Use this to find entity opportunities that competing pages use but yours does not.
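On toy data, the set difference and score lookup behave like this (the dictionaries and their values are hypothetical, standing in for real processhtml() output):

```python
# Hypothetical salience dictionaries for your page and a competitor page.
url1_dict = {"seo": 0.30, "content": 0.10}
url2_dict = {"seo": 0.45, "keywords": 0.20, "audit": 0.02}

# Entities on the competitor page (url2) that are absent from yours (url1).
missing = set(url2_dict) - set(url1_dict)

# Reattach the salience scores that set() discarded.
final_diff = {k: url2_dict[k] for k in missing}
print(final_diff)
```

Because "seo" appears in both dictionaries, only the entities unique to the competitor page survive, each with its original salience score.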

Conclusion
I hope you found this two-part series useful for getting started with NLP and comparing entities between web pages. These scripts are foundations and can be extended as needed. Explore data blending with other sources to mine further insights. Enjoy, and as always, follow me on Twitter and let me know what you think and how you’re using Google NLP!