<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Python Scripts for SEOs - Daniel Heredia</title>
	<atom:link href="https://www.danielherediamejias.com/python-scripts-seo/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.danielherediamejias.com</link>
	<description>Digital Marketing &#38; SEO</description>
	<lastBuildDate>Wed, 16 Feb 2022 15:00:13 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=5.9.1</generator>

<image>
	<url>https://www.danielherediamejias.com/wp-content/uploads/2020/01/cropped-Captura-de-pantalla-2020-01-04-a-las-2.05.48-32x32.png</url>
	<title>Python Scripts for SEOs - Daniel Heredia</title>
	<link>https://www.danielherediamejias.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Find your main relevant words with TF IDF and Python</title>
		<link>https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=find-your-main-relevant-words-with-tf-idf-and-python</link>
					<comments>https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/#comments</comments>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Mon, 22 Nov 2021 16:28:00 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1286</guid>

					<description><![CDATA[<p>Since John Mueller confirmed that bold text has some SEO benefit, helping Google understand a page better, there has been quite a lot of discussion about it. In this post I am going to show you how you can use a TF IDF model to find the main words from a group of pages and/or articles so that you can bold them. But first of all, let&#8217;s try to clarify a bit what the TF IDF logic is and how the model works. In short, our TF IDF model will replace the words of the&#8230; <span><a href="https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/">Find your main relevant words with TF IDF and Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Since John Mueller confirmed that bold text has some SEO benefit, helping Google understand a page better, there has been quite a lot of discussion about it. In this post I am going to show you how you can use a TF IDF model to find the main words from a group of pages and/or articles so that you can bold them.</p>



<p>But first of all, let&#8217;s try to clarify a bit what the TF IDF logic is and how the model works. In short, our TF IDF model will replace the words of the article with identifiers and give a higher score to those terms that appear on our page but not in the other articles. With this logic, if the sample of documents or articles is large enough, it will give a low score to stop words such as articles and connectors and highlight the actual principal terms. If you would like to know more about TF IDF and its technicalities, you can check <a href="https://www.holisticseo.digital/python-seo/tf-idf-analyse/">this article that Koray Tuğberk recently posted</a>.</p>
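<p>To make this logic more concrete, here is a quick toy example (not part of the original script) that applies the same scoring idea to a tiny made-up corpus:</p>

```python
import math

# Toy corpus: three tiny "documents" as lists of words (illustrative only)
docs = [
    ["seo", "python", "scripts"],
    ["seo", "marketing"],
    ["seo", "content"],
]

def tfidf(word, doc, docs):
    tf = doc.count(word) / len(doc)                 # term frequency in this document
    containing = sum(1 for d in docs if word in d)  # how many documents contain the word
    idf = math.log(len(docs) / (1 + containing))    # rarer across documents = higher
    return tf * idf

# "python" appears only in the first document, "seo" in all three:
print(round(tfidf("python", docs[0], docs), 3))  # 0.135
print(round(tfidf("seo", docs[0], docs), 3))     # -0.096
```

<p>Notice how &#8220;seo&#8221;, which appears in every document, gets a low (here even negative) score, while the term unique to one document stands out.</p>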



<p>So, considering how a TF IDF model works, in this article we are going to scrape the content from most of my blog posts, create our own TF IDF model and obtain the main keywords to be bolded. It is important to mention that, as the &#8220;seed&#8221; articles will all be very specialized in one field (concretely SEO and Python), the model will not give a high score to terms such as &#8220;Python&#8221; that are common to the whole corpus. This is positive, because it helps to differentiate the articles and highlight the main &#8220;gap&#8221; terms. </p>



<p>Creating our own TF IDF model is a technique that works very well for already clustered and specialized pieces of content. If we had used a generic TF IDF model instead, it would highlight terms such as &#8220;Python&#8221; that are not so frequent in general language but very common among my articles.</p>



<h2>1.- Scraping the content from the pages</h2>



<p>First, we will scrape the &lt;p&gt; content from the pages with <a href="https://pypi.org/project/cloudscraper/">cloudscraper</a> and <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a>, process these texts with TextBlob and store the processed texts in a list. In the code below, you need to insert the pages that you would like to use for your model as a list:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import cloudscraper
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

list_pages = &#91;&lt;insert your pages in a list format&gt;]

scraper = cloudscraper.create_scraper() 
 
list_content = &#91;]

for x in list_pages:
    content = &quot;&quot;
    html = scraper.get(x)
    soup = BeautifulSoup(html.text, 'html.parser')
    
    for y in soup.find_all('p'):
        content = content + &quot; &quot; + y.text.lower()

    list_content.append(tb(content))
    
</pre></div>


<h2>2.- Obtaining the main terms with TF IDF</h2>



<p>Now that we have the content, we can compute the term frequencies, build our model and obtain the main terms for each of the pages. I learnt how to use the <a href="https://stevenloria.com/tf-idf/">TF IDF model with TextBlob</a> in this article written by Steven Loria.</p>



<p>First we declare our functions:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
</pre></div>


<p>And now we can iterate over the list with the page contents, obtain the top 5 terms for each page and store them in a list that we will later export to Excel:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
list_words_scores = &#91;&#91;&quot;URL&quot;,&quot;Word&quot;,&quot;TF-IDF score&quot;]]
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x&#91;1], reverse=True)
    for word, score in sorted_words&#91;:5]:
        list_words_scores.append(&#91;list_pages&#91;i],word,score])
</pre></div>


<h2>3.- Exporting as an Excel file</h2>



<p>Finally, we can export the results as an Excel file with <a href="https://pandas.pydata.org/">Pandas</a>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd
 
df = pd.DataFrame(list_words_scores)
df.to_excel('&lt;filename&gt;.xlsx', header=False, index=False)
</pre></div>


<p>It will create an Excel file with three columns: the URL, the word and its TF IDF score (the closer the score is to 1, the more relevant the term).</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1.png"><img width="1024" height="491" src="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1-1024x491.png" alt="" class="wp-image-1287" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1-1024x491.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1-300x144.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1-768x368.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-1.png 1074w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>






<h2>4.- Try the Google Colab notebook</h2>



<p>You can try the <a href="https://colab.research.google.com/drive/1FFuBIpuQ7mpesH3xYsP0xRya8z8oPnK8">Google Colab notebook to find your most relevant terms over here</a>! You will be prompted to provide your group of pages with a sitemap, and you will also need to grant Google Colab access to your Drive so that it can export the main terms as an Excel file on Drive.</p>
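<p>If you would rather build the list of pages outside of the notebook, the URLs can be pulled out of a sitemap with the standard library. This is just a sketch over a made-up sitemap snippet; in practice you would first download your real sitemap, for example with cloudscraper:</p>

```python
import xml.etree.ElementTree as ET

# Made-up sitemap snippet; in practice you would fetch it first, e.g.
# xml_text = scraper.get(<your sitemap URL>).text
xml_text = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-1/</loc></url>
  <url><loc>https://example.com/post-2/</loc></url>
</urlset>"""

# The sitemap namespace must be declared to find the <loc> elements
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(xml_text)
list_pages = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(list_pages)  # ['https://example.com/post-1/', 'https://example.com/post-2/']
```

<p>The resulting list_pages can then be fed directly into the scraping loop from the first section.</p>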



<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/">Find your main relevant words with TF IDF and Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.danielherediamejias.com/find-your-main-relevant-words-with-tf-idf-and-python/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Using MOZ API with Python</title>
		<link>https://www.danielherediamejias.com/using-moz-api-with-python/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-moz-api-with-python</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Tue, 09 Nov 2021 17:12:12 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1283</guid>

					<description><![CDATA[<p>In this post I am going to show you how you can extract the page authority and the domain authority among other metrics from MOZ in bulk by using its API. As well as the domain authority, we will also be able to fetch the following metrics with the freemium version: Page meta title if available. Equity links: meaning those links that pass PageRank. Total number of links. URL response code. Page Authority. Domain Authority. Timestamp: when the URL was last crawled. The final output of this script will look like: It is worth mentioning that Moz API&#8230; <span><a href="https://www.danielherediamejias.com/using-moz-api-with-python/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/using-moz-api-with-python/">Using MOZ API with Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In this post I am going to show you how you can extract the page authority and the domain authority, among other metrics, from MOZ in bulk by using its API. Besides the domain authority, we will also be able to fetch the following metrics with the freemium version:</p>



<ul><li>Page meta title, if available.</li><li>Equity links: those links that pass PageRank.</li><li>Total number of links.</li><li>URL response code.</li><li>Page Authority.</li><li>Domain Authority.</li><li>Timestamp: when the URL was last crawled.</li></ul>






<p>The final output of this script will look like:</p>



<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image.png"><img width="650" height="287" src="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image.png" alt="" class="wp-image-1277" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/11/image.png 650w, https://www.danielherediamejias.com/wp-content/uploads/2021/11/image-300x132.png 300w" sizes="(max-width: 650px) 100vw, 650px" /></a></figure></div>



<p>It is worth mentioning that the Moz API allows up to 2,500 free requests, and you <a href="https://moz.com/products/api">can sign up over here</a> if you do not have an account yet. In <a href="https://martechwithme.com/bulk-check-domain-authority-da-websites-python/">this article</a> written by Yanis Illoul you can find more information about how to get the key needed to make use of the API. </p>



<p>Does this sound interesting? Let&#8217;s get started then! </p>



<h2>1.- Installing the library</h2>



<p>First of all, you need to run the following command in your terminal to install the <a href="https://github.com/seomoz/SEOmozAPISamples/blob/master/python/README.md">Mozscape</a> library:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
pip install &quot;git+https://github.com/seomoz/SEOmozAPISamples.git#egg=mozscape&amp;subdirectory=python&quot;
</pre></div>


<p>However, I unfortunately encountered some issues and did not manage to install the library, so what I did was execute the piece of code that defines the Mozscape class in the notebook before making the request to the API. The piece of code that I ran <a href="https://github.com/seomoz/SEOmozAPISamples/blob/master/python/mozscape.py">can be found here</a>.</p>



<h2>2.- Making the request</h2>



<p>Now that we have either installed the library or defined the class, we can make the request and fetch the metrics from MOZ in bulk. First, though, we will import the list of domains that we want to check from an Excel file and create a list with them. In order to import the domains from the Excel file I am going to use <a href="https://pandas.pydata.org/">Pandas</a>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd

df = pd.read_excel('Domains.xlsx')
names = &#91;&#91;x&#91;0]] for x in df.values.tolist()]  # keep only the domain string of each row
</pre></div>


<p>Once the domains are imported and stored in the list, we can iterate over it, make the request to the API endpoint and append the API response with the metrics for each domain. The API can get overloaded if many requests are made, so we will wrap the call in a try/except block that sleeps for 10 seconds and retries in case the API returns an error. You will need to add your key and Moz account in this piece of code:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import time

client = Mozscape('&lt;your-moz-account&gt;', '&lt;your-key&gt;')


for x in names:
    try:
        print(x&#91;0])
        domainAuthority = client.urlMetrics(x&#91;0])
        x.append(domainAuthority)
    except Exception as e:
        print(e)
        time.sleep(10)
        domainAuthority = client.urlMetrics(x&#91;0])
        x.append(domainAuthority)
</pre></div>


<p>Once we have got all the metrics, we will format them and export them as an Excel file. We need to remove some deprecated metrics that are returned by the API with the pop method, and we will also adapt the timestamp, since the API returns it in Unix format.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import datetime

list_values = &#91;list(x&#91;1].values()) for x in names]

for y in list_values:
    # convert the Unix timestamp into a readable date first...
    y&#91;11] = datetime.datetime.fromtimestamp(y&#91;11]).strftime('%Y-%m-%d %H:%M:%S')
    # ...then drop four deprecated fields (the index shifts after each pop)
    y.pop(4)
    y.pop(4)
    y.pop(4)
    y.pop(4)
</pre></div>


<p>Finally, we can export it as an Excel file again with Pandas:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd
pd.DataFrame(list_values).to_excel(&quot;MOZ_results.xlsx&quot;, header=&#91;&quot;Title&quot;,&quot;URL&quot;,&quot;Equity Links&quot;,&quot;Links&quot;,&quot;Response Code&quot;,&quot;PA&quot;,&quot;DA&quot;,&quot;Last Crawled&quot;], index=False)
</pre></div>


<p>This will create an Excel file that will look like the one above. That is all folks, I hope that you found this post interesting! </p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/using-moz-api-with-python/">Using MOZ API with Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using Cloudflare Analytics API with Python</title>
		<link>https://www.danielherediamejias.com/using-cloudflare-analytics-api-with-python/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-cloudflare-analytics-api-with-python</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Thu, 28 Oct 2021 11:08:36 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1264</guid>

					<description><![CDATA[<p>Cloudflare has become a very interesting tool for SEO in order to improve web performance optimization as it enables you to: Serve a cached version of your site from the location nearest to the user making the request. Through the Cloudflare workers you can also write and execute JS on the Cloudflare networks and they will be able to intercept and modify requests, cache content, combine third-party scripts and more. If you do not know how to set up the Cloudflare workers you can check this article written by JC Chouinard that will walk you through&#8230; <span><a href="https://www.danielherediamejias.com/using-cloudflare-analytics-api-with-python/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/using-cloudflare-analytics-api-with-python/">Using Cloudflare Analytics API with Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Cloudflare has become a very interesting tool for SEO and web performance optimization, as it enables you to:</p>



<ul><li>Serve a cached version of your site from the location nearest to the user making the request.</li><li>Write and execute JS on the Cloudflare network through Cloudflare Workers, which can intercept and modify requests, cache content, combine third-party scripts and more. If you do not know how to set up Cloudflare Workers, you can check<a href="https://www.jcchouinard.com/cloudflare-workers-for-pagespeed/"> this article</a> written by JC Chouinard that will walk you through the process.</li><li>Improve the security of your site, as Cloudflare will handle DDoS attacks and block the aggressive scraping that could bring your site down. </li></ul>






<p>Due to all these advantages, the use of Cloudflare has lately been skyrocketing, and between 15% and 20% of sites are already using it. </p>



<p>In today&#8217;s post, we are going to learn how to use the <a href="https://developers.cloudflare.com/analytics/graphql-api">GraphQL Cloudflare Analytics API</a> with Python to get some data that can be very insightful for profiling the hits handled by Cloudflare. The report that we are going to generate can be done with the freemium version of Cloudflare, but there are other specific reports that require an upgraded plan.</p>



<h2>1.- Fetching the account information</h2>



<p>In order to authenticate against the Cloudflare API we will need two things: the email address that is attached to your Cloudflare account and an API key (I use the Global API key). Initially, we can request the account information so that we can extract some data from the API, such as the website Zone ID, which we will later use to extract the Cloudflare hit data for a specific site.</p>



<p>The email address and the Global API key are sent with the headers to be able to authenticate successfully. </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

headers = {
    'X-Auth-Email': '&lt;your-email-address&gt;',
    'X-Auth-Key': '&lt;Global API key&gt;',
    'Content-Type': 'application/json'
}

response = requests.request(
    'GET',
    'https://api.cloudflare.com/client/v4/zones',
    headers=headers
)


data = response.json()
</pre></div>


<h2>2.- Making the request to the Analytics API </h2>



<p>The request to the Analytics API needs to be made to a different endpoint with a POST HTTP request. The metrics that we are interested in are sent with a GraphQL query. The GraphQL query that we are going to use replaces the deprecated Cloudflare Analytics API and returns the very same output.</p>



<p>We are going to use the report called httpRequests1hGroups, which is supported by the free version of Cloudflare and returns the statistics for the hits that have been made to our site over a maximum time range of 259,200 seconds (3 days). It can be interesting to fetch this data and store it somewhere else, because the API only lets you access data not older than 262,800 seconds (around 3 days).</p>
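<p>As a small helper (my own addition, not part of the original script), the two datetime strings for the query filter can be generated so that they cover exactly that 3-day window; the end date below is just an example:</p>

```python
from datetime import datetime, timedelta, timezone

# Build the largest window the free plan allows (259,200 s = 3 days)
end = datetime(2021, 10, 28, 20, 0, tzinfo=timezone.utc)  # example end time
start = end - timedelta(seconds=259200)                   # 3 days earlier

fmt = "%Y-%m-%dT%H:%M:%SZ"  # the ISO-like format the GraphQL filter expects
datetime_geq, datetime_lt = start.strftime(fmt), end.strftime(fmt)
print(datetime_geq, datetime_lt)  # 2021-10-25T20:00:00Z 2021-10-28T20:00:00Z
```

<p>These two strings can then be interpolated into the datetime_geq and datetime_lt filters of the query below.</p>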



<p>When making the request, we can choose whether to receive the data grouped for the whole time range or split by hours. If you would like to split it by hours, you need to add the datetime dimension to your query and iterate over the hours in the JSON response object. In our case, for simplicity&#8217;s sake, we are going to request the grouped data.</p>
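<p>If you do opt for the hourly split, the response contains one httpRequests1hGroups element per hour. Here is a sketch of iterating over it, using a made-up sample object with the same shape as the API response:</p>

```python
# Hypothetical sample mirroring the shape of an hourly GraphQL response
# (the real query would also include the datetime dimension)
sample = {"data": {"viewer": {"zones": [{"httpRequests1hGroups": [
    {"dimensions": {"datetime": "2021-10-27T22:00:00Z"}, "sum": {"requests": 120}},
    {"dimensions": {"datetime": "2021-10-27T23:00:00Z"}, "sum": {"requests": 95}},
]}]}}}

groups = sample["data"]["viewer"]["zones"][0]["httpRequests1hGroups"]
hourly = [(g["dimensions"]["datetime"], g["sum"]["requests"]) for g in groups]
for hour, hits in hourly:
    print(hour, hits)
```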



<p>In the code below, you will need to enter again your email address and Global API key, plus the Zone ID of the site that you would like to extract the data from and the initial and final dates of the time range that you would like to check. </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests


headers = {
    'X-Auth-Email': '&lt;your-email-address&gt;',
    'X-Auth-Key': '&lt;GLOBAL API KEY&gt;',
    'Content-Type': 'application/json'
}


data = &quot;&quot;&quot;{
  viewer {
    zones(filter: {zoneTag: &quot;&lt;website zone ID&gt;&quot;}) {
      httpRequests1hGroups( limit: 100, filter: {datetime_geq: &quot;2021-10-27T22:00:00Z&quot;, datetime_lt: &quot;2021-10-28T20:02:00Z&quot;}) {

        sum {
          browserMap {
            pageViews
            uaBrowserFamily
          }
          bytes
          cachedBytes
          cachedRequests
          contentTypeMap {
            bytes
            requests
            edgeResponseContentTypeName
          }
          clientSSLMap {
            requests
            clientSSLProtocol
          }
          countryMap {
            bytes
            requests
            threats
            clientCountryName
          }
          encryptedBytes
          encryptedRequests
          ipClassMap {
            requests
            ipType
          }
          pageViews
          requests
          responseStatusMap {
            requests
            edgeResponseStatus
          }
          threats
          threatPathingMap {
            requests
            threatPathingName
          }
        }
        uniq {
          uniques
        }
      }
    }
  }
}&quot;&quot;&quot;

response = requests.request(
    'POST',
    'https://api.cloudflare.com/client/v4/graphql',
    headers=headers,
    json={'query': data}
)
</pre></div>


<p>Once the request is made successfully, we can take a look at it and see what data we have got and how we can parse the JSON object that we receive with the response.</p>



<h2>3.- What metrics have we got?</h2>



<p>In the response we can find: the number of pageviews, the number of requests, the number of encrypted bytes, the number of encrypted requests, the number of bytes that have been served, the number of cached bytes that have been served and the number of cached requests. This can give us an idea about how well our cache policy is working and the volume of requests and bytes that are being handled by Cloudflare.</p>
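<p>As a quick illustration of that idea, with made-up figures standing in for the values we will parse below, a cache hit ratio can be derived from those fields:</p>

```python
# Made-up figures standing in for the parsed API values (illustrative only)
requests_cf = 4000
cached_requests = 3200
bytes_cf = 900_000_000
cached_bytes = 630_000_000

cache_hit_ratio = cached_requests / requests_cf   # share of requests served from cache
cached_bytes_ratio = cached_bytes / bytes_cf      # share of bytes served from cache
print(f"{cache_hit_ratio:.0%} of requests and {cached_bytes_ratio:.0%} of bytes were cached")
# 80% of requests and 70% of bytes were cached
```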



<p>In addition, we can also find a response code map, an IP class map, a country map, a content type map, a browser map and a client SSL map. Even if most of this data can be found in the actual user interface, it can still be useful to query it with Python and store it somewhere else for further analysis, as it is only available for the previous 30 days. </p>
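<p>For example, one possible way to persist one of these maps for later analysis, using Pandas and a made-up country map in the shape returned by the API:</p>

```python
import pandas as pd

# Hypothetical countryMap entries mirroring the API's structure
country_map = [
    {"clientCountryName": "US", "requests": 1500, "bytes": 420_000_000, "threats": 3},
    {"clientCountryName": "ES", "requests": 900,  "bytes": 210_000_000, "threats": 0},
]

df = pd.DataFrame(country_map)
df.to_csv("cloudflare_country_map.csv", index=False)  # keep a copy beyond the 30-day window
print(df.shape)  # (2, 4)
```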



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8.png"><img width="1024" height="604" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8-1024x604.png" alt="" class="wp-image-1272" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8-1024x604.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8-300x177.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8-768x453.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-8.png 1047w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h3>3.1.- Pageviews, requests, bytes, encryption and cache</h3>



<p>With the keys that can be found below, we can get the number of pageviews, the number of requests, the number of encrypted bytes, the number of encrypted requests, the number of bytes served, the number of cached bytes served and the number of cached requests.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
pageviews = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;pageViews&quot;]
requests_cf = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;requests&quot;]
encrypted_bytes = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;encryptedBytes&quot;]
encrypted_requests = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;encryptedRequests&quot;]
bytes_cf = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;bytes&quot;]
cached_bytes = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;cachedBytes&quot;]
cached_requests = response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;cachedRequests&quot;]
</pre></div>


<h3>3.2.- Response Status Map</h3>



<p>With the code below we can extract the different status codes and how many requests each one received. Moreover, with Matplotlib we can plot a bar chart that displays this info in a more visual way.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
response_codes = &#91;str(x&#91;&quot;edgeResponseStatus&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;responseStatusMap&quot;]]
requests = &#91;x&#91;&quot;requests&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;responseStatusMap&quot;]]

for x,y in zip(response_codes,requests):

    label = &quot;{:.2f}&quot;.format(y)
    plt.annotate(label, (x,y), textcoords=&quot;offset points&quot;,  xytext=(0,10), ha='center')

ax.bar(response_codes,requests)
plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-2.png"><img width="520" height="330" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-2.png" alt="" class="wp-image-1265" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-2.png 520w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-2-300x190.png 300w" sizes="(max-width: 520px) 100vw, 520px" /></a></figure></div>



<h3>3.3.- Browser Map</h3>



<p>We can get the browser map and plot a bar chart with this piece of code:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
browser = &#91;str(x&#91;&quot;uaBrowserFamily&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;browserMap&quot;]]
pageviews = &#91;x&#91;&quot;pageViews&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;browserMap&quot;]]

for x,y in zip(browser,pageviews):

    label = &quot;{:.2f}&quot;.format(y)
    plt.annotate(label, (x,y), textcoords=&quot;offset points&quot;,  xytext=(0,10), ha='center')

ax.bar(browser,pageviews)
plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-3.png"><img width="520" height="330" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-3.png" alt="" class="wp-image-1266" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-3.png 520w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-3-300x190.png 300w" sizes="(max-width: 520px) 100vw, 520px" /></a></figure></div>



<h3>3.4.- Client SSL Map</h3>



<p>We can get the client SSL Map and plot a bar chart with the code below:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
ssl_protocol = &#91;str(x&#91;&quot;clientSSLProtocol&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;clientSSLMap&quot;]]
requests = &#91;x&#91;&quot;requests&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;0]&#91;&quot;sum&quot;]&#91;&quot;clientSSLMap&quot;]]

for x,y in zip(ssl_protocol,requests):

    label = &quot;{:.2f}&quot;.format(y)
    plt.annotate(label, (x,y), textcoords=&quot;offset points&quot;,  xytext=(0,10), ha='center')

ax.bar(ssl_protocol,requests)
plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-4.png"><img width="520" height="330" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-4.png" alt="" class="wp-image-1267" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-4.png 520w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-4-300x190.png 300w" sizes="(max-width: 520px) 100vw, 520px" /></a></figure></div>



<h3>3.5.- IP Class Map</h3>



<p>As in the previous sections, we can get the IP Class Map and plot a bar chart:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
type_ip = &#91;str(x&#91;&quot;ipType&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;ipClassMap&quot;]]
requests = &#91;x&#91;&quot;requests&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;ipClassMap&quot;]]

for x,y in zip(type_ip,requests):

    label = &quot;{:.2f}&quot;.format(y)
    plt.annotate(label, (x,y), textcoords=&quot;offset points&quot;,  xytext=(0,10), ha='center')

ax.bar(type_ip,requests)
plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-5.png"><img width="520" height="330" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-5.png" alt="" class="wp-image-1268" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-5.png 520w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-5-300x190.png 300w" sizes="(max-width: 520px) 100vw, 520px" /></a></figure></div>



<h3>3.6.- Country Map</h3>



<p>With the code below we will get the country map data and plot a chart with two Y axes: one for the requests and the other for the bytes.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
country = &#91;str(x&#91;&quot;clientCountryName&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;countryMap&quot;]]
bytes_request = &#91;x&#91;&quot;bytes&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;countryMap&quot;]]
requests = &#91;x&#91;&quot;requests&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;countryMap&quot;]]
ax.set_ylabel(&quot;Bytes&quot;,color=&quot;red&quot;,fontsize=14)


ax.plot(country, bytes_request, color=&quot;red&quot;, marker=&quot;o&quot;)
ax2=ax.twinx()
ax2.plot(country, requests,color=&quot;blue&quot;,marker=&quot;o&quot;)
ax2.set_ylabel(&quot;Requests&quot;,color=&quot;blue&quot;,fontsize=14)


plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-6.png"><img width="529" height="341" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-6.png" alt="" class="wp-image-1269" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-6.png 529w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-6-300x193.png 300w" sizes="(max-width: 529px) 100vw, 529px" /></a></figure></div>



<h3>3.7.- Content Type Map</h3>



<p>Last but not least, we can get the content type data and plot another graph with two Y axes for the number of requests and the number of bytes.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes(&#91;0,0,1,1])
content_type = &#91;str(x&#91;&quot;edgeResponseContentTypeName&quot;]) for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;contentTypeMap&quot;]]
bytes_request = &#91;x&#91;&quot;bytes&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;contentTypeMap&quot;]]
requests = &#91;x&#91;&quot;requests&quot;] for x in response.json()&#91;&quot;data&quot;]&#91;&quot;viewer&quot;]&#91;&quot;zones&quot;]&#91;0]&#91;&quot;httpRequests1hGroups&quot;]&#91;2]&#91;&quot;sum&quot;]&#91;&quot;contentTypeMap&quot;]]
ax.set_ylabel(&quot;Bytes&quot;,color=&quot;red&quot;,fontsize=14)


ax.plot(content_type, bytes_request, color=&quot;red&quot;, marker=&quot;o&quot;)
ax2=ax.twinx()
ax2.plot(content_type, requests,color=&quot;blue&quot;,marker=&quot;o&quot;)
ax2.set_ylabel(&quot;Requests&quot;,color=&quot;blue&quot;,fontsize=14)


plt.show()
</pre></div>


<div class="wp-block-image"><figure class="aligncenter size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-7.png"><img width="529" height="341" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-7.png" alt="" class="wp-image-1270" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-7.png 529w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-7-300x193.png 300w" sizes="(max-width: 529px) 100vw, 529px" /></a></figure></div>
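<p>Since the raw byte counts on these charts can run into the millions, a small helper (not part of the original script; the function name is mine) can turn them into human-readable labels before annotating or printing:</p>

```python
def format_bytes(num_bytes):
    """Turn a raw byte count into a human-readable label."""
    for unit in ["B", "KB", "MB", "GB"]:
        if num_bytes < 1024:
            return "{:.1f} {}".format(num_bytes, unit)
        num_bytes /= 1024
    return "{:.1f} TB".format(num_bytes)

print(format_bytes(1536))     # 1.5 KB
print(format_bytes(3400000))  # 3.2 MB
```

<p>You could, for instance, map <code>format_bytes</code> over <code>bytes_request</code> when building the chart annotations.</p>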



<p>That is all folks, I hope this article helps you get started with the Cloudflare Analytics API and Python!</p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/using-cloudflare-analytics-api-with-python/">Using Cloudflare Analytics API with Python</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Extracting keywords ideas with Python and Keyword Planner API segmented by language</title>
		<link>https://www.danielherediamejias.com/python-keyword-planner-api-by-language/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=python-keyword-planner-api-by-language</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Thu, 21 Oct 2021 15:03:37 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1259</guid>

					<description><![CDATA[<p>On today&#8217;s post I am going to show you how you can use Keyword Planner API and Python to extract keyword ideas segmented by language with a practical case. From my point of view, being able to segment by language when doing a keyword research is a very interesting feature that other tools do not offer which enables you to find potential keyword niches for minority languages in very competitive markets. The process that we will follow is: Importing the seed terms: to be able to get the keyword ideas we need a list of seed terms that we are&#8230; <span><a href="https://www.danielherediamejias.com/python-keyword-planner-api-by-language/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/python-keyword-planner-api-by-language/">Extracting keywords ideas with Python and Keyword Planner API segmented by language</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you how you can use the Keyword Planner API and Python to extract keyword ideas segmented by language, with a practical case. From my point of view, being able to segment by language when doing keyword research is a very interesting feature that other tools do not offer, as it enables you to find potential keyword niches for minority languages in very competitive markets.</p>



<p>The process that we will follow is:</p>



<ol><li><strong>Importing the seed terms</strong>: to get keyword ideas we need a list of seed terms, which we are going to obtain from GitHub based on their frequency. In this way, we can cover most of the queries that are made in a country for a specific language and create a quite extensive database. However, if you would like to focus your keyword research on an industry rather than keeping it generic, then you might need to use a list of terms specific to that industry.</li><li><strong>Making the requests to the API</strong>: for this practical example we will insert a list of seed terms in Spanish and extract the keyword ideas for queries made in Spanish in the United States through the Keyword Planner API and Python.</li><li><strong>Exporting to an Excel file</strong>: finally, we can dump all the data into an Excel file if we would like to do further analyses.</li></ol>






<p>This is just a practical example, but if you would like to find more information about how to set up the Keyword Planner API, you can read the <a href="https://www.danielherediamejias.com/python-keyword-planner-google-ads-api/">guide that I published to use Keyword Planner API with Python</a>.</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-daniel-heredia wp-block-embed-daniel-heredia"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="MBZ9Jq40m7"><a href="https://www.danielherediamejias.com/python-keyword-planner-google-ads-api/">Generating keyword ideas with Python and Keyword Planner from Google Ads API</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;Generating keyword ideas with Python and Keyword Planner from Google Ads API&#8221; &#8212; Daniel Heredia" src="https://www.danielherediamejias.com/python-keyword-planner-google-ads-api/embed/#?secret=MBZ9Jq40m7" data-secret="MBZ9Jq40m7" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>



<h2>1.- Finding and importing the seed terms</h2>



<p>As mentioned before, we will get the list of seed terms from GitHub, thanks to the amazing work of <a href="https://github.com/hermitdave">Hermit Dave</a>, who created a <a href="https://github.com/hermitdave/FrequencyWords/tree/master/content/2018">repository with frequency-sorted lists of terms for many languages</a>, built from OpenSubtitles. With <a href="https://pandas.pydata.org/">Pandas</a>, we can fetch txt and csv files directly from GitHub without having to download them by appending &#8220;?raw=true&#8221; to the end of GitHub&#8217;s URL.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd

url = 'https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/es/es_50k.txt?raw=true'
df = pd.read_csv(url,header=None, sep=&quot; &quot;)
seed_terms = &#91;x for x in df&#91;0]]
</pre></div>


<h2>2.- Making the requests to the API</h2>



<p>In order to make the requests to the API, we will iterate over the list of seed terms and extract the keyword ideas for each of them. Keep in mind that before proceeding with this piece of code, you need to configure the Google Ads account and create the YAML file with the credentials, as explained in the guide to the Keyword Planner API. Also, you will need to check these two pages to get the location and language IDs to localize the keyword research: <a href="https://developers.google.com/adwords/api/docs/appendix/geotargeting">Geotargets</a> and <a href="https://developers.google.com/google-ads/api/reference/data/codes-formats#expandable-7">Language Codes</a>.</p>
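<p>For reference, the YAML file that the client loads holds the Google Ads OAuth credentials. The exact setup is covered in the linked guide; the keys below are the standard <code>google-ads.yaml</code> fields, and every value here is a placeholder:</p>

```yaml
# google-ads.yaml — placeholder values, obtain yours as described in the guide
developer_token: "YOUR_DEVELOPER_TOKEN"
client_id: "YOUR_OAUTH_CLIENT_ID"
client_secret: "YOUR_OAUTH_CLIENT_SECRET"
refresh_token: "YOUR_REFRESH_TOKEN"
login_customer_id: "YOUR_MCC_ID_WITHOUT_DASHES"
```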



<p>In the case of our practical example, the location ID for the USA is 2840 and the Spanish language ID is 1003. The seed terms list holds 50,000 terms in total, but if we do not need to extract so many keyword ideas, we can limit it; in the code below, I cap it to the first 1,000 terms. At the end of each iteration we let the script sleep for 3 seconds to avoid overloading the API.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from google.ads.googleads.client import GoogleAdsClient
import time

client = GoogleAdsClient.load_from_storage(&quot;&lt;your-yaml-file-name&gt;&quot;)

list_keywords = &#91;]
# main() is the keyword-ideas function defined in the Keyword Planner API guide linked above
for x in seed_terms&#91;0:1000]:
    list_keywords = list_keywords + main(client, &quot;&lt;your-client-id&gt;&quot;, &#91;&quot;2840&quot;], &quot;1003&quot;, &#91;x], None)
    time.sleep(3)

</pre></div>
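<p>One caveat: different seed terms often return overlapping ideas, so the accumulated list may contain duplicates. Here is a minimal deduplication sketch, keyed on the <code>.text</code> attribute that the result objects expose (the helper name and the stand-in <code>Idea</code> tuple are mine, for illustration):</p>

```python
from collections import namedtuple

def dedupe_by_text(keyword_ideas):
    """Keep only the first result for each distinct keyword text."""
    seen = set()
    unique = []
    for idea in keyword_ideas:
        if idea.text not in seen:
            seen.add(idea.text)
            unique.append(idea)
    return unique

# Stand-in for the API result objects, which expose a .text attribute
Idea = namedtuple("Idea", "text")
ideas = [Idea("coche segunda mano"), Idea("moto"), Idea("coche segunda mano")]
print([i.text for i in dedupe_by_text(ideas)])  # ['coche segunda mano', 'moto']
```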


<h2>3.- Exporting to Excel</h2>



<p>Finally, we can export it as an Excel file with Pandas. The output will contain seven columns with the keyword, the average monthly searches, the competition level, the competition index, the searches from the past months, the past months themselves and the query categorizations (Branded, Non-branded, Cars, Year, etcetera). Special mention to <a href="https://twitter.com/alex_papageo">Alex Papageorgiou</a>, who taught me how to also extract the keyword categories.</p>



<p>First, we will need to adjust a bit our list with the keywords to be able to export it with Pandas and then we will export it.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
list_to_excel = &#91;]
for x in range(len(list_keywords)):
    list_months = &#91;]
    list_searches = &#91;]
    list_annotations = &#91;]
    for y in list_keywords&#91;x].keyword_idea_metrics.monthly_search_volumes:
        # str(y.month) looks like &quot;MonthOfYear.JANUARY&quot;; slicing from index 12 strips the enum prefix
        list_months.append(str(y.month)&#91;12::] + &quot; - &quot; + str(y.year))
        list_searches.append(y.monthly_searches)
        
    for y in list_keywords&#91;x].keyword_annotations.concepts:
        list_annotations.append(y.concept_group.name)
        
        
    list_to_excel.append(&#91;list_keywords&#91;x].text, list_keywords&#91;x].keyword_idea_metrics.avg_monthly_searches, str(list_keywords&#91;x].keyword_idea_metrics.competition)&#91;28::], list_keywords&#91;x].keyword_idea_metrics.competition_index, list_searches, list_months, list_annotations ])
    
pd.DataFrame(list_to_excel, columns = &#91;&quot;Keyword&quot;, &quot;Average Searches&quot;, &quot;Competition Level&quot;, &quot;Competition Index&quot;, &quot;Searches Past Months&quot;, &quot;Past Months&quot;, &quot;List Annotations&quot;]).to_excel('output.xlsx', header=True, index=False)
</pre></div>


<p>The final output of this exercise looks like:</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image.png"><img width="1024" height="466" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1024x466.png" alt="" class="wp-image-1260" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1024x466.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-300x136.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-768x349.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image.png 1304w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>Having the option of filtering by category is very convenient, as you can spot terms related to your industry quickly and easily by just filtering the categories column for the categories you are most interested in. For instance, if I wanted to check queries about language-related doubts, I would only need to filter by &#8220;Language&#8221;.</p>
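<p>As a sketch, the same filtering can also be done directly with Pandas before exporting, while the annotations are still Python lists (the sample rows below are made up for illustration):</p>

```python
import pandas as pd

# Made-up miniature of the keyword export: keyword, searches, categories
df = pd.DataFrame(
    [
        ["whom vs who", 120, ["Language"]],
        ["buy used car", 880, ["Cars", "Non-branded"]],
        ["ser vs estar", 390, ["Language"]],
    ],
    columns=["Keyword", "Average Searches", "List Annotations"],
)

# Keep only the rows whose annotations include the "Language" category
language_df = df[df["List Annotations"].apply(lambda cats: "Language" in cats)]
print(language_df["Keyword"].tolist())  # ['whom vs who', 'ser vs estar']
```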



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1.png"><img width="1024" height="466" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1-1024x466.png" alt="" class="wp-image-1261" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1-1024x466.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1-300x136.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1-768x349.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/image-1.png 1304w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>






<p>When analyzing the file, you will notice that at the beginning most of the seed terms are articles or connector words, so the generated keywords are not very meaningful; however, once more specific terms are input, the API returns much more insightful and meaningful keyword ideas.</p>



<p>That is all folks, I hope that you found this article interesting! </p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/python-keyword-planner-api-by-language/">Extracting keywords ideas with Python and Keyword Planner API segmented by language</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Checking metatitles rewrites with Python and Oxylab&#8217;s API</title>
		<link>https://www.danielherediamejias.com/checking-metatitles-rewrites-python-oxylabs-api/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=checking-metatitles-rewrites-python-oxylabs-api</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Mon, 04 Oct 2021 00:25:06 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1228</guid>

					<description><![CDATA[<p>On today&#8217;s post I am going to show you how you can make use of the service provided by Oxylabs called the Real Time Crawler API and Python to scrape the SERPs, extract the metatitle showing up on the SERPs for a page and compare it with your on-page metatitle and H1 to analyze if Google is rewriting your metatitles. The final output of this script will return an Excel file like the one in the screenshot below (without the conditional formatting): If you are not familiar with Oxylab&#8217;s API, you can have a read at this article where I&#8230; <span><a href="https://www.danielherediamejias.com/checking-metatitles-rewrites-python-oxylabs-api/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/checking-metatitles-rewrites-python-oxylabs-api/">Checking metatitles rewrites with Python and Oxylab&#8217;s API</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you how you can make use of Oxylabs&#8217; <a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;aff_id=284&amp;url_id=9">Real Time Crawler API</a> and Python to scrape the SERPs, extract the metatitle that shows up on the SERPs for a page, and compare it with your on-page metatitle and H1 to analyze whether Google is rewriting your metatitles. The final output of this script will be an Excel file like the one in the screenshot below (without the conditional formatting):</p>



<figure class="wp-block-image size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/10/Captura-de-pantalla-2021-10-04-a-las-2.04.44.png"><img width="997" height="471" src="https://www.danielherediamejias.com/wp-content/uploads/2021/10/Captura-de-pantalla-2021-10-04-a-las-2.04.44.png" alt="" class="wp-image-1229" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/10/Captura-de-pantalla-2021-10-04-a-las-2.04.44.png 997w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/Captura-de-pantalla-2021-10-04-a-las-2.04.44-300x142.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/10/Captura-de-pantalla-2021-10-04-a-las-2.04.44-768x363.png 768w" sizes="(max-width: 997px) 100vw, 997px" /></a></figure>



<p>If you are not familiar with Oxylabs&#8217; API, you can read this article where I explain <a href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">how you can use Oxylabs&#8217; API and Python to scrape the SERPs</a>. You will learn how the API works, what type of data you can obtain from it, and how to get the most out of it for SEO with some practical cases.</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-daniel-heredia wp-block-embed-daniel-heredia"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="VqGZaKp6du"><a href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">Scraping the Google SERPs with Python and Oxylabs&#8217; API</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;Scraping the Google SERPs with Python and Oxylabs&#8217; API&#8221; &#8212; Daniel Heredia" src="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/embed/#?secret=VqGZaKp6du" data-secret="VqGZaKp6du" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>



<p>Having said this, let&#8217;s get started with the metatitles checker!</p>



<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2>1.- How the script works</h2>



<p>Essentially, we will use Oxylabs&#8217; API to scrape the SERPs with the &#8220;site:&#8221; command for a list of given URLs, and then we will scrape the URLs themselves to extract the on-page metatitles and H1s (as Google is likely to use the H1 when not using the actual metatitle).</p>



<p>It might be interesting to do this exercise with a sitemap of URLs to get to know:</p>



<ul><li>Which URLs are not indexed at all, so you can take action on them: in the short term, a manual indexation can be requested on Google Search Console. In the long term, some other actions might be required, starting from sanity checks to make sure that the pages are readable by Googlebot, to on-page optimizations, to the creation of internal and/or external links to increase their PageRank, etcetera.</li><li>Which URLs show a metatitle on the SERPs different from the on-page title: we can analyze why Google might be changing the metatitle and what alternative it considers better for the user. On my site I noticed that Google was in many cases excluding the final site name &#8220;- Daniel Heredia&#8221;, shortening the metatitles because they were already quite long.</li><li>Which URLs are showing the H1 as the metatitle on the SERPs: in those cases I would recommend going over the H1s and optimizing them if there is any way to make them more appealing for users. It can be an enriching process for sites that optimized their H1s mainly for search engines, with unnatural exact matches based on an H1 pattern, while disregarding them from a user experience perspective.</li></ul>






<p>For the script we will use the libraries:</p>



<ul><li><a href="https://docs.python-requests.org/en/latest/">Requests</a>: to make the request to Oxylab&#8217;s endpoint and scrape the SERPs.</li><li><a href="https://pypi.org/project/cloudscraper/">Cloudscraper</a>: to scrape the URLs. It could also be done with requests but cloudscraper is more reliable with sites that use Cloudflare. On the <a href="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/">guide to SEO on-page scraping with Python</a> I introduced cloudscraper and I explained how to scrape metatitles and H1s alongside the rest of SEO elements that can be valuable for SEO.</li><li><a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a>: to parse the object that we are going to receive from our request with cloudscraper.</li></ul>



<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2>2.- Using the script</h2>



<p>First, we need to import the list of URLs that we are going to check. We can use Requests and BeautifulSoup to extract them easily from a sitemap, although they could be imported from any other data source.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from bs4 import BeautifulSoup
import requests

r = requests.get(&quot;https://www.yoursite.com/sitemap.xml&quot;)
xml = r.text

soup = BeautifulSoup(xml, &quot;html.parser&quot;)
urls_list = &#91;x.text for x in soup.find_all(&quot;loc&quot;)]
</pre></div>


<p>After importing the URLs, we only need to run the script to scrape the SERPs with Oxylabs and the actual URLs from the list:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import cloudscraper

list_comparison = &#91;]

for url in urls_list:
    scraper = cloudscraper.create_scraper() 

    indexation = False
    metatitle_coincidence = False
    metatitle_coincidence_h1 = False

    payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:' + url,
    'parse':'true' 
    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
        json=payload,
    )
 

    for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
        try:
            if x&#91;&quot;url&quot;].endswith(url):
                indexation = True
                html = scraper.get(url)
                soup = BeautifulSoup(html.text, &quot;html.parser&quot;)
                metatitle = (soup.find('title')).get_text()
                h1 = (soup.find('h1')).get_text()

                if x&#91;&quot;title&quot;] == metatitle:
                    metatitle_coincidence = True
                    
                if x&#91;&quot;title&quot;] == h1:
                    metatitle_coincidence_h1 = True
                    

                list_comparison.append(&#91;url,indexation,x&#91;&quot;title&quot;],metatitle,h1,metatitle_coincidence,metatitle_coincidence_h1])

                break
        except:
            pass
        
        
    if indexation == False:
        list_comparison.append(&#91;url,indexation,&quot;&quot;,&quot;&quot;,&quot;&quot;,metatitle_coincidence,metatitle_coincidence_h1])
</pre></div>


<p>This will generate a list containing the URLs, their indexation statuses, the SERP metatitles, the on-page metatitles, the on-page H1s, and two boolean variables indicating whether the metatitle from the SERPs is equal to the on-page metatitle and/or the on-page H1.</p>



<p>We can now export this list with Pandas as an Excel file:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd
 
df = pd.DataFrame(list_comparison, columns = &#91;&quot;URL&quot;,&quot;Indexation&quot;, &quot;Metatitle SERPs&quot;, &quot;Metatitle&quot;, &quot;H1&quot;, &quot;Metatitle Coincidence&quot;, &quot;H1 - metatitle Coincidence&quot;])
df.to_excel('&lt;filename&gt;.xlsx', header=True, index=False)
</pre></div>


<p>This will return an Excel file that will look like the one shown in the screenshot at the beginning of the post.</p>
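<p>Once you have the Excel file, a quick Pandas summary can tell you how many URLs are not indexed and how many metatitles Google is rewriting. The rows below are made up to illustrate the shape of the report:</p>

```python
import pandas as pd

# Made-up rows mirroring the columns of the exported report
df = pd.DataFrame(
    [
        ["https://example.com/a/", True, "Title A", "Title A", "H1 A", True, False],
        ["https://example.com/b/", True, "H1 B", "Title B", "H1 B", False, True],
        ["https://example.com/c/", False, "", "", "", False, False],
    ],
    columns=["URL", "Indexation", "Metatitle SERPs", "Metatitle", "H1",
             "Metatitle Coincidence", "H1 - metatitle Coincidence"],
)

not_indexed = int((~df["Indexation"]).sum())
rewritten = int((df["Indexation"] & ~df["Metatitle Coincidence"]).sum())
print(not_indexed, rewritten)  # 1 1
```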



<p>That is all folks, I hope that you found this post interesting!</p>



<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/checking-metatitles-rewrites-python-oxylabs-api/">Checking metatitles rewrites with Python and Oxylab&#8217;s API</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Pruning with Python and htaccess for SEO</title>
		<link>https://www.danielherediamejias.com/pruning-with-python-and-htaccess-for-seo/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pruning-with-python-and-htaccess-for-seo</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Mon, 06 Sep 2021 20:44:34 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1212</guid>

					<description><![CDATA[<p>On today&#8217;s post I am going to show you a very easy trick to create a txt snippet that you can use in your htaccess file to set pages as not indexable based on their performances. The logic that we will use is: We use Screaming Frog connected with Google Search Console API to crawl our website and get those pages that do not turn up on Google Search Console. We export those pages as an Excel file. We run a very simple Python code to create a txt snippet that we can paste in our htaccess file to serve&#8230; <span><a href="https://www.danielherediamejias.com/pruning-with-python-and-htaccess-for-seo/">( Read More )</a></span></p>
<p>La entrada <a rel="nofollow" href="https://www.danielherediamejias.com/pruning-with-python-and-htaccess-for-seo/">Pruning with Python and htaccess for SEO</a> se publicó primero en <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you a simple trick to create a txt snippet that you can use in your htaccess file to make pages non-indexable based on their performance. The logic we will use is:</p>



<ul><li>We use <a href="https://www.screamingfrog.co.uk/seo-spider/">Screaming Frog</a> connected to the Google Search Console API to crawl our website and collect the pages that do not show up in Google Search Console.</li><li>We export those pages as an Excel file.</li><li>We run a very simple Python script to create a txt snippet that we can paste into our htaccess file to serve a noindex tag in the HTTP response, preventing Googlebot from indexing (and, in most cases, rendering) the underperforming pages.</li></ul>



<p>Does this sound interesting? Then let&#8217;s get started!</p>



<h2>1.- Getting the underperforming pages with Screaming Frog</h2>



<p>The first thing we need to do is connect Screaming Frog to our Google Search Console account so that we can obtain a report with the URLs that are found in the crawl but not in GSC&#8217;s data. We can access this feature in the navigation menu under API Access &#8211;&gt; Google Search Console:</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image.png"><img width="1024" height="535" src="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1024x535.png" alt="" class="wp-image-1213" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1024x535.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-300x157.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-768x401.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image.png 1440w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>After that, Screaming Frog will prompt us to connect to our Google Search Console account. We just need to click on New account and log in. Once we are logged in, we need to select the account that corresponds to the site we are going to crawl.</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1.png"><img width="1024" height="535" src="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1-1024x535.png" alt="" class="wp-image-1214" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1-1024x535.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1-300x157.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1-768x401.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-1.png 1440w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<p>Another important point: if needed, we can extend the date range of the data in the Date Range tab to work with a bigger volume of data and avoid marking as non-indexable URLs whose lack of impressions is only due to seasonality.</p>



<p>Once everything is set up, we just need to run the crawl and, when it is finished, export the report called &#8220;No GSC data&#8221;, which can be found in the sidebar under the Search Console section.</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2.png"><img width="1024" height="535" src="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2-1024x535.png" alt="" class="wp-image-1215" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2-1024x535.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2-300x157.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2-768x401.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-2.png 1440w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<h2>2.- Creating the TXT snippets with Python</h2>



<p><a href="https://yoast.com/x-robots-tag-play/">This article from Yoast inspired me</a> to create these txt snippets to be added to the htaccess file, as I realized that with Python we could very easily iterate over the list of pages without GSC data and write an exclusion rule for each of them. However, even if this article can serve as inspiration to automate the pruning task, I would recommend that SEOs without much technical knowledge get help from a developer: the htaccess file is very sensitive and can break your site if something is inserted incorrectly, although the logic can still be used. If you would still like to take some risk, you can test the htaccess file with an htaccess validator, as I recommended in this article about <a href="https://www.danielherediamejias.com/redirects-python-htaccess/">page to page redirects with Python and htaccess</a>.</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-daniel-heredia wp-block-embed-daniel-heredia"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="cnH3MxWCQe"><a href="https://www.danielherediamejias.com/redirects-python-htaccess/">Page to page redirects with Python and htaccess</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;Page to page redirects with Python and htaccess&#8221; &#8212; Daniel Heredia" src="https://www.danielherediamejias.com/redirects-python-htaccess/embed/#?secret=cnH3MxWCQe" data-secret="cnH3MxWCQe" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>



<p>Another important thing to mention when pruning your website based on underperforming pages is that some pages might be underperforming because of technical issues or other reasons. Before deindexing the URLs, take some time to look at the type of URLs you are about to set as non-indexable and identify why they do not rank well. Pruning is mainly recommended for sites with lots of thin-content pages, so if your pages are not performing well due to thin content and there is no intention of extending their contents, it might be a good decision to exclude them. However, if you believe that the content on your pages is of high quality, it is likely that there are other issues at play, or, if the content was published recently, it may just be a matter of time until Google values it. </p>



<p>After this small disclaimer, let&#8217;s go over the code. First, we import with <a href="https://pandas.pydata.org/">Pandas</a> the document with the URLs without GSC data and exclude all the URLs that do not return a 200 response code, because they are already not indexable.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd

file_name = 'search_console_no_gsc_data.xlsx' 
df = pd.read_excel(file_name)

df_200 = df.loc&#91;df&#91;'Status Code'] == 200]
list_200 = df_200.values.tolist()
</pre></div>
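<p>As a quick sanity check, the same filter can be run on a tiny in-memory DataFrame with made-up URLs instead of the Excel export (the column name &#8220;Status Code&#8221; matches Screaming Frog&#8217;s export):</p>

```python
import pandas as pd

# Hypothetical sample mimicking Screaming Frog's "No GSC data" export
df = pd.DataFrame({
    'Address': [
        'https://example.com/old-page/',
        'https://example.com/gone/',
        'https://example.com/thin-content/',
    ],
    'Status Code': [200, 404, 200],
})

# Keep only the URLs that respond with 200; the rest are already not indexable
df_200 = df.loc[df['Status Code'] == 200]
list_200 = df_200.values.tolist()

print(len(list_200))  # 2 rows survive the filter
```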


<p>Once the data is imported, we can iterate over the URLs and create the text snippet with a for loop. We will also make use of <a href="https://docs.python.org/3/library/urllib.parse.html">urllib.parse</a> to break down each URL and keep only the relative path, which is what we need to add to the htaccess. The code to create the snippet for Apache servers is:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from urllib.parse import urlparse

text = &quot;&quot;
for x in list_200:
    text = text + '&lt;FilesMatch &quot;' + urlparse(x&#91;0]).path&#91;1::] + '&quot;&gt;\nHeader set X-Robots-Tag &quot;noindex&quot;\n&lt;/FilesMatch&gt;\n'
    
</pre></div>
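<p>To see what each pass of the loop produces, here is the same FilesMatch block built for a single hypothetical URL; urlparse(url).path[1:] is what strips the leading slash:</p>

```python
from urllib.parse import urlparse

# Hypothetical underperforming URL
url = 'https://example.com/thin-content/'

relative_path = urlparse(url).path[1:]  # 'thin-content/'

# One exclusion rule, as written into the txt snippet by the loop
snippet = ('<FilesMatch "' + relative_path + '">\n'
           'Header set X-Robots-Tag "noindex"\n'
           '</FilesMatch>\n')

print(snippet)
```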


<p>Whereas the piece of code to create this txt snippet for Nginx servers is:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
from urllib.parse import urlparse

text = &quot;&quot;
for x in list_200:
    
    # Nginx exact-match locations must start with a slash, so we keep the full path
    text = text + '''
location = ''' + urlparse(x&#91;0]).path + ''' {
    add_header  X-Robots-Tag &quot;noindex&quot;;
}
    '''
</pre></div>


<p>Finally, in both cases we can export the htaccess snippet with:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
file_ = open(&quot;Pruning.txt&quot;, 'w')
file_.write(text)
file_.close()
</pre></div>


<p>If everything goes well, you will export a txt file that should be pasted into your htaccess file and will look like this:</p>



<figure class="wp-block-image size-full"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-3.png"><img width="839" height="494" src="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-3.png" alt="" class="wp-image-1216" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-3.png 839w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-3-300x177.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/09/image-3-768x452.png 768w" sizes="(max-width: 839px) 100vw, 839px" /></a></figure>



<p></p>



<p>That is all folks, I hope that you found this article useful! </p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/pruning-with-python-and-htaccess-for-seo/">Pruning with Python and htaccess for SEO</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Google Alerts and Outreach with Python</title>
		<link>https://www.danielherediamejias.com/google-alerts-outreach-python/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=google-alerts-outreach-python</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Mon, 16 Aug 2021 22:38:36 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1205</guid>

					<description><![CDATA[<p>In today&#8217;s post I am going to show you how you can use Google Alerts with Python and how you can set up an automated workflow to reach out to websites that mention your brand, or a term closely related to your business, without linking to your site. Basically, what we are going to do in this post is: learn how to install the library google-alerts for Python; set up some alerts and parse the RSS feed generated with the matches; download the matches as an Excel file; scrape the URLs and search&#8230; <span><a href="https://www.danielherediamejias.com/google-alerts-outreach-python/">( Read More )</a></span></p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/google-alerts-outreach-python/">Google Alerts and Outreach with Python</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you how you can use Google Alerts with Python and how you can set up an automated workflow to reach out to websites that mention your brand, or a term closely related to your business, without linking to your site.</p>



<p>Basically, what we are going to do in this post is:</p>



<ul><li>Learn how to install the library <a href="https://pypi.org/project/google-alerts/">google-alerts</a> for Python. </li><li>Set up some alerts and parse the RSS feed that is generated with the matches.</li><li>Download the matches as an Excel file.</li><li>Scrape the URLs and search for a contact URL or an email address so that we can contact these sites and ask for a link.</li></ul>



<p></p>



<p>Does this automated workflow sound interesting? Let&#8217;s get started then! </p>



<p></p>



<h2>1.- Installing google-alerts for Python</h2>



<p>First of all, we will need to install google-alerts for Python and seed our Google Alerts session. The command to run in our terminal to install google-alerts is:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
pip install google-alerts
</pre></div>


<p>After this, we will need to input our email address and our password by running the command:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
google-alerts setup --email &lt;your-email-address&gt; --password '&lt;your-password&gt;'
</pre></div>


<p>Finally, to seed the Google Alerts session, we will need to download <a href="https://chromedriver.storage.googleapis.com/index.html?path=84.0.4147.30/">version 84 of Chrome Driver</a> and <a href="https://google-chrome.en.uptodown.com/mac/download/2377367">version 84 of Google Chrome</a> (be careful not to replace your current version of Google Chrome when downloading and installing version 84). Unfortunately, this is needed because the library has not been updated since 2020 and is not compatible with newer versions of Google Chrome and Chrome Driver. </p>



<p>When both Chrome Driver v84 and Google Chrome v84 have been installed, we can run the following command to seed our Google Alerts session: </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
google-alerts seed --driver /tmp/chromedriver --timeout 60
</pre></div>


<p>This command will open a Selenium webdriver session to log us into Google Alerts.</p>



<p></p>



<h2>2.- Creating our first alert</h2>



<p>Once the session is seeded, we can use a Jupyter notebook and Python to play around. We will first need to authenticate:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from google_alerts import GoogleAlerts

ga = GoogleAlerts('&lt;your_email_address&gt;', '&lt;your password&gt;')
ga.authenticate()
</pre></div>


<p>When the authentication is completed, we can create our first alert, for example for the term &#8220;Barcelona&#8221; in Spain:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
ga.create(&quot;Barcelona&quot;, {'delivery': 'RSS', &quot;language&quot;: &quot;es&quot;, 'monitor_match': 'ALL', 'region' : &quot;ES&quot;})
</pre></div>


<p>If the alert is created successfully, then it will return an object specifying the term, the language, the region, the match type and the RSS link for that alert:</p>



<figure class="wp-block-image size-full"><img width="824" height="140" src="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-16-a-las-23.44.45.png" alt="" class="wp-image-1206" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-16-a-las-23.44.45.png 824w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-16-a-las-23.44.45-300x51.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-16-a-las-23.44.45-768x130.png 768w" sizes="(max-width: 824px) 100vw, 824px" /></figure>



<p>Sadly, I have not been able to create an alert that monitors a term across all countries: if I leave the language and region arguments empty, it sets English and the USA as the default language and region.</p>



<p>If at some point we lose track of the alerts that are active, we can list them with:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
ga.list()
</pre></div>


<p>And if we would like to delete an alert which is no longer useful or redundant, we can delete it by using the monitor_id and running:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
ga.delete(&quot;monitor_id&quot;)
</pre></div>


<p></p>



<h2>3.- Parsing the RSS feed</h2>



<p>In order to parse the RSS feed, we will use <a href="https://docs.python-requests.org/en/master/">requests</a> and <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">beautifulsoup</a>, and we will extract the ID, the title, the publication date, the update date, the URL and the abstract for each alert. This data is structured as an XML file. </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests
from bs4 import BeautifulSoup as Soup

r = requests.get('&lt;your RSS feed&gt;')
soup = Soup(r.text,'xml')

# The first match of each tag belongs to the feed itself, so we skip it with &#91;1:]
id_alert = &#91;x.text for x in soup.find_all(&quot;id&quot;)&#91;1:]]
title_alert = &#91;x.text for x in soup.find_all(&quot;title&quot;)&#91;1:]]
published_alert = &#91;x.text for x in soup.find_all(&quot;published&quot;)]
update_alert = &#91;x.text for x in soup.find_all(&quot;updated&quot;)&#91;1:]]
link_alert = &#91;&#91;x&#91;&quot;href&quot;].split(&quot;url=&quot;)&#91;1].split(&quot;&amp;ct=&quot;)&#91;0]] for x in soup.find_all(&quot;link&quot;)&#91;1:]]
content_alert = &#91;x.text for x in soup.find_all(&quot;content&quot;)]

compiled_list = &#91;&#91;id_alert&#91;x], title_alert&#91;x], published_alert&#91;x], update_alert&#91;x], link_alert&#91;x], content_alert&#91;x]] for x in range(len(id_alert))]
</pre></div>
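<p>The trickiest line is the one that recovers each destination URL: the links in the feed point to a Google redirect, and the real URL sits between &#8220;url=&#8221; and &#8220;&amp;ct=&#8221;. A minimal sketch with a made-up href value:</p>

```python
# Hypothetical href value as it appears in a Google Alerts feed entry
href = ('https://www.google.com/url?rct=j'
        '&url=https://example.com/article-mentioning-brand/'
        '&ct=ga&cd=CAIyGmEx&usg=AOvVaw0')

# The destination URL sits between "url=" and "&ct="
destination = href.split('url=')[1].split('&ct=')[0]
print(destination)  # https://example.com/article-mentioning-brand/
```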


<p>With this piece of code, we get an individual list for each field and a compiled list with all the fields for each alert.</p>



<p>If we would like to, we can download the alerts as an Excel file with <a href="https://pandas.pydata.org/">Pandas</a>:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd
 
df = pd.DataFrame(compiled_list, columns = &#91;&quot;ID&quot;, &quot;Title&quot;, &quot;Published on:&quot;, &quot;Updated on&quot;, &quot;Link&quot;, &quot;Content&quot;])
df.to_excel('new_alerts.xlsx', header=True, index=False)
</pre></div>


<p>This will create an Excel document that will look like:</p>



<figure class="wp-block-image size-large"><a href="https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1.png"><img width="1024" height="360" src="https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1-1024x360.png" alt="" class="wp-image-1207" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1-1024x360.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1-300x105.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1-768x270.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/image-1.png 1364w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p></p>



<h2>4.- Reaching out to the sites</h2>



<p>From my point of view, using Google Alerts with Python can be especially useful to automate the process of reaching out to sites when they mention a brand, or a specific term closely related to a brand or product. With Python, we can iterate over the list of URLs, scrape them and try to find a contact page or an email address to contact these sites. If an email address is found, even the delivery of the email could be automated with Python or any other outreach tool. </p>



<p>We can use this piece of code to find the strings that contain &#8220;@&#8221; (very likely email addresses) and the contact pages. The filter that leaves out strings containing &#8220;@&#8221; that are not email addresses can be polished further; for now, I have only excluded the strings that are PNG images:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import re

for iteration in link_alert:
    
    request_link = requests.get(iteration&#91;0])
    soup = Soup(request_link.text, 'html.parser')

    body = soup.find(&quot;body&quot;).text
    match = &#91;x for x in re.findall(r'&#91;\w.+-]+@&#91;\w-]+\.&#91;\w.-]+', body) if &quot;.png&quot; not in x]
    
    contact_urls = &#91;]
    links = soup.find_all(&quot;a&quot;)
    for y in links:
        if &quot;contact&quot; in y.text.lower():
            contact_urls.append(y&#91;&quot;href&quot;])
    
    iteration.append(match)
    iteration.append(contact_urls)
</pre></div>
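<p>The email regex can be tried in isolation on a hypothetical page body to see how the &#8220;.png&#8221; filter behaves:</p>

```python
import re

# Hypothetical scraped <body> text: one real address, one PNG name that matches the pattern
body = ('Contact us at editor@example.com for partnerships. '
        'Our logo lives at logo@2x-header.png if you need it.')

# Same pattern as above: keep "@"-strings, drop the ones that are PNG images
matches = [x for x in re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', body) if '.png' not in x]
print(matches)  # ['editor@example.com']
```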


<p>Lastly, we can iterate over the list of email addresses and use a piece of code that I published in this article about <a href="https://www.danielherediamejias.com/what-to-do-with-your-outputs-python/">what to do with your outputs when running Python scripts</a>, which uses <a href="https://docs.python.org/3/library/email.encoders.html">email.encoders</a> to send emails with a message like:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from email import encoders
from email.message import Message
from email.mime.audio import MIMEAudio
from email.mime.base import MIMEBase
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib 
 
#We enter the password, the email address and the subject for the email
msg = MIMEMultipart()
password = '&lt;your email address password&gt;'
msg&#91;'From'] = &quot;&lt;your email address&gt;&quot;
msg&#91;'To'] = &quot;&lt;Receiver email address&gt;&quot;
 
#Here we set the message. If we send an HTML we can include tags
msg&#91;'Subject'] = &quot;Daniel Heredia - Thank you so much!&quot;
message = &quot;&lt;p&gt;Dear Madam or Sir,&lt;/p&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;I would like to thank you for the mention of my brand in your article: &quot; + URL + &quot;, and I would like to ask you if it would be possible to include a link pointing to my website https://www.danielherediamejias.com, so that users who are interested in my brand can get to know me.&lt;/p&gt;&lt;br&gt;&lt;br&gt;&lt;p&gt;Thank you so much in advance!&lt;/p&gt;&quot;
 
#It attaches the message and its format, in this case, HTML
msg.attach(MIMEText(message, 'html'))
 
#It creates the server instance from where the email is sent
server = smtplib.SMTP('smtp.gmail.com', 587)
server.starttls()
 
#Login Credentials for sending the mail
server.login('&lt;your email address&gt;', password)
 
# send the message via the server.
server.sendmail(msg&#91;'From'], msg&#91;'To'], msg.as_string())
server.quit()
</pre></div>


<p>Unfortunately, I am not a very creative person, so I guess the message could be made much more appealing! That is all folks, I hope you found this article interesting! </p>



<p></p>



<p></p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/google-alerts-outreach-python/">Google Alerts and Outreach with Python</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>On-page optimization with Python for SEO</title>
		<link>https://www.danielherediamejias.com/onpage-optimization-python-seo/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=onpage-optimization-python-seo</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Tue, 10 Aug 2021 18:23:20 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1191</guid>

					<description><![CDATA[<p>In today&#8217;s post I am going to show you how you can use Python to find term occurrences and improve your on-page optimization. Basically, what we are going to do is: using a keyword import from Semrush, we will extract keywords and the URLs ranking for those keywords; we will iterate over the list of URLs, scrape their contents and search for the keyword occurrences in metatitles, metadescriptions, H1s, H2s and paragraphs; finally, we will download this data as an Excel file with conditional formatting that will help us spot some possible optimizations,&#8230; <span><a href="https://www.danielherediamejias.com/onpage-optimization-python-seo/">( Read More )</a></span></p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/onpage-optimization-python-seo/">On-page optimization with Python for SEO</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you how you can use Python to find term occurrences and improve your on-page optimization. Basically, what we are going to do is:</p>



<ol><li>Using a keyword import from Semrush, we will extract keywords and the URLs ranking for those keywords.</li><li>We will iterate over the list of URLs, scrape their contents and search for the keyword occurrences in metatitles, metadescriptions, H1s, H2s and paragraphs.</li><li>Finally, we will download this data as an Excel file with conditional formatting that will help us spot possible optimizations, working similarly to a heatmap.</li><li>From my point of view, the final cherry-picking of optimizations needs to be done manually, and some of the possible optimizations can be disregarded in order not to overoptimize the page and to keep it natural. </li></ol>



<p>The final Excel file will look as follows:</p>



<figure class="wp-block-image size-large"><img width="1024" height="415" src="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.30.04-1024x415.png" alt="" class="wp-image-1196" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.30.04-1024x415.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.30.04-300x122.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.30.04-768x311.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.30.04.png 1412w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Does it sound interesting? Let&#8217;s get started then! </p>



<h2 id="1-importing-the-data-from-semrush">1.- Importing the data from Semrush</h2>



<p>Initially, we will need to download a keyword-level report from Semrush and import it into our notebook. For that, we will use <a href="https://pandas.pydata.org/">pandas.</a></p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import pandas as pd

keywords = pd.read_excel ('&lt;import-file-name&gt;.xlsx')
</pre></div>


<p>Secondly, with the purpose of finding the low-hanging fruit and maximizing the return of this exercise, we will leave out the keywords ranking outside the top 15, although depending on the number of keywords and your current rankings, you can set a higher or lower threshold.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
low_hanging = keywords&#91;keywords&#91;'Position'] &lt; 15]
low_hanging_list = low_hanging.values.tolist()
</pre></div>


<p>To avoid crawling a URL several times when we iterate, we will adjust the format of our input, transforming the list into a dictionary where the URL is the key and the keyword, the current ranking and the number of monthly searches are stored as its values.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
dict_urls = {}
for urls in low_hanging_list:
    if urls&#91;7] in dict_urls:
        dict_urls&#91;urls&#91;7]] += &#91;&#91;urls&#91;0],urls&#91;1],urls&#91;3]]]
    else:
        dict_urls&#91;urls&#91;7]] = &#91;&#91;urls&#91;0],urls&#91;1],urls&#91;3]]]
</pre></div>
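<p>With some hypothetical rows in the same column order as the Semrush export assumed above (keyword at index 0, position at index 1, searches at index 3 and URL at index 7), the grouping produces one dictionary entry per URL:</p>

```python
# Hypothetical Semrush rows: keyword, position, _, searches, _, _, _, URL
rows = [
    ['blue widgets', 4, None, 900, None, None, None, 'https://example.com/widgets/'],
    ['buy blue widgets', 9, None, 300, None, None, None, 'https://example.com/widgets/'],
    ['red gadgets', 12, None, 150, None, None, None, 'https://example.com/gadgets/'],
]

# Group the keyword data under its URL so each page is crawled only once
dict_urls = {}
for urls in rows:
    if urls[7] in dict_urls:
        dict_urls[urls[7]] += [[urls[0], urls[1], urls[3]]]
    else:
        dict_urls[urls[7]] = [[urls[0], urls[1], urls[3]]]

print(dict_urls['https://example.com/widgets/'])
```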


<p>The format is ready; let&#8217;s proceed with the web scraping and find the term occurrences!</p>



<p></p>



<h2 id="2-scraping-the-urls-and-finding-the-occurrences">2.- Scraping the URLs and finding the occurrences</h2>



<p>To scrape the URLs we will use the Python library called <a href="https://pypi.org/project/cloudscraper/">cloudscraper</a>. As I have mentioned in other posts, I really like this library, which depends on Requests + requests_toolbelt, as it enables you to scrape sites that use Cloudflare without being blocked. </p>



<p>In addition, to parse the HTML response we will use <a href="https://pypi.org/project/beautifulsoup4/">beautifulsoup</a>, which makes the response object parsable. If you are interested in web scraping for SEO or other purposes, you can read my previous article, where I explain how to extract all the content from a page based on its selectors.</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-daniel-heredia wp-block-embed-daniel-heredia"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="iGz2aRryoy"><a href="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/">Guide to SEO on-page scraping with Python</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;Guide to SEO on-page scraping with Python&#8221; &#8212; Daniel Heredia" src="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/embed/#?secret=iGz2aRryoy" data-secret="iGz2aRryoy" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div></figure>



<p>When we search for the term occurrences, we will aim at a broad match, meaning that we will check whether the terms are present in the content separately instead of checking whether the exact term combination is present. In my opinion, this is a better approach, as it copes with cases where the order of the keyword terms does not alter their meaning, or where articles and prepositions appear in natural language but are dropped in search queries.</p>



<p>However, if you would like to search for exact matches only, you can do so by tweaking this piece of code a bit.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper() 

for key, values in dict_urls.items():
    
    print(str(key))
    
    html = scraper.get(key, headers = {&quot;User-agent&quot; : &quot;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36&quot;})
    soup = BeautifulSoup(html.text, 'html.parser')
    
    metatitle = (soup.find('title')).get_text()
    metadescription = soup.find('meta',attrs={'name':'description'})&#91;&quot;content&quot;]
    h1 = &#91;a.get_text() for a in soup.find_all('h1')]
    h2 = &#91;a.get_text() for a in soup.find_all('h2')]
    paragraph = &#91;a.get_text() for a in soup.find_all('p')]
    
    
    for y in values:
        
        metatitle_occurrence = &quot;True&quot;
        metadescription_occurrence = &quot;True&quot;
        h1_occurrence = &quot;True&quot;
        h2_occurrence = &quot;True&quot;
        paragraph_occurrence = &quot;True&quot;
        
        for z in y&#91;0].split(&quot; &quot;):
        
            if z not in str(metatitle).lower():
                metatitle_occurrence = &quot;False&quot;

            if z not in str(metadescription).lower():
                metadescription_occurrence = &quot;False&quot;

            if z not in str(h1).lower():
                h1_occurrence = &quot;False&quot;

            if z not in str(h2).lower():
                h2_occurrence = &quot;False&quot;

            if z not in str(paragraph).lower():
                paragraph_occurrence = &quot;False&quot;
            
        y.extend(&#91;metatitle_occurrence,metadescription_occurrence,h1_occurrence,h2_occurrence,paragraph_occurrence])
</pre></div>


<p>This piece of code will append to the dictionary a value for each keyword and tag: the string &#8220;False&#8221; if the keyword is not found and &#8220;True&#8221; if it is present. </p>
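<p>If you would prefer exact matching, the per-term loop can be collapsed into a single substring check on the whole keyword. Here is a minimal sketch of both variants (the helper names are made up for illustration):</p>

```python
# Broad match: every term of the keyword must appear somewhere in the text.
# Exact match: the whole phrase must appear as-is.
def broad_match(keyword, text):
    text = str(text).lower()
    return all(term in text for term in keyword.lower().split(" "))

def exact_match(keyword, text):
    return keyword.lower() in str(text).lower()

print(broad_match("python seo", "SEO scripts written in Python"))  # True
print(exact_match("python seo", "SEO scripts written in Python"))  # False
```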



<h2 id="3-downloading-as-an-excel-file">3.- Downloading as an Excel file</h2>



<p>Finally, we will download this dictionary as an Excel file by using the library <a href="https://openpyxl.readthedocs.io/en/stable/">openpyxl</a>. This library will enable us to add the conditional formatting shown on the initial screenshot.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from openpyxl import Workbook
from openpyxl.formatting import Rule
from openpyxl.styles import Font, PatternFill, Border
from openpyxl.styles.differential import DifferentialStyle

wb=Workbook()
dest_filename = 'new_document.xlsx'
ws1 = wb.active

number=2

for key, values in dict_urls.items():
    
    ws1.cell(row=1,column=1).value= &quot;URL&quot;
    ws1.cell(row=1,column=2).value= &quot;Keyword&quot;
    ws1.cell(row=1,column=3).value= &quot;Ranking&quot;
    ws1.cell(row=1,column=4).value= &quot;Searches&quot;
    ws1.cell(row=1,column=5).value= &quot;Metatitle Occurrence&quot;
    ws1.cell(row=1,column=6).value= &quot;Metadescription Occurrence&quot;
    ws1.cell(row=1,column=7).value= &quot;H1 Occurrence&quot;
    ws1.cell(row=1,column=8).value= &quot;H2 Occurrence&quot;
    ws1.cell(row=1,column=9).value= &quot;Paragraph Occurrence&quot;
    
    for list_values in values:
        ws1.cell(row=number,column=1).value= key
        column = 2
        for iteration in list_values:
            ws1.cell(row=number, column=column).value = iteration
            column +=1
        number += 1
    

red_text = Font(color=&quot;9C0006&quot;)
red_fill = PatternFill(bgColor=&quot;FFC7CE&quot;)
green_text = Font(color=&quot;FFFFFF&quot;)
green_fill = PatternFill(bgColor=&quot;009c48&quot;)

dxf = DifferentialStyle(font=red_text, fill=red_fill)
dxf2 = DifferentialStyle(font=green_text, fill=green_fill)

rule = Rule(type=&quot;containsText&quot;, operator=&quot;containsText&quot;, formula=&#91;'A1:N' + str(number) + '= &quot;False&quot;'], dxf=dxf)
rule2 = Rule(type=&quot;containsText&quot;, operator=&quot;containsText&quot;, formula=&#91;'A1:N' + str(number) + '= &quot;True&quot;'], dxf=dxf2)

ws1.conditional_formatting.add('A1:N' + str(number), rule)
ws1.conditional_formatting.add('A1:N' + str(number), rule2)


wb.save(filename = dest_filename)
</pre></div>


<p>That is it, this will make the magic happen and export the data as an Excel file with the conditional formatting!</p>






<h2 id="4-getting-the-data-from-google-search-console">4.- Getting the data from Google Search Console</h2>



<p>Alternatively, you can use the data from Google Search Console to run this piece of code. You only need to extract the data with the GSC API, or you can download it as an Excel file and import it into your notebook with pandas, as done before with the export from Semrush.</p>



<p>If you would like to give the GSC API a try and you are not familiar with it, I highly recommend having a look at the <a href="https://www.jcchouinard.com/google-search-console-api/">amazing guide that JC Chouinard</a> created to walk you through almost every single step of the set-up process.</p>



<p>When retrieving the data from the Google Search Console API, you will likely need to make some tweaks, as the data is fetched day by day. In my case, I grouped the number of impressions and the average position by keyword and URL, simulating the export from Semrush.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
sum_df = df.groupby(&#91;'query','page']).agg({'impressions': 'sum', 'avg_position': 'mean'})
sum_df.avg_position = sum_df.avg_position.round(0)
sum_df = sum_df.sort_values(by=&#91;'impressions'], ascending=False)
</pre></div>


<p>After grouping the impressions and the average position, you can use the rest of the Semrush piece of code with the data from Google Search Console.</p>
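<p>As a quick illustration of the grouping step, here is the same aggregation run on a few made-up daily rows (the sample queries, pages and values are invented):</p>

```python
import pandas as pd

# Hypothetical daily rows, similar to what the GSC API returns before aggregation.
df = pd.DataFrame({
    "query": ["python seo", "python seo", "web scraping"],
    "page": ["/a", "/a", "/b"],
    "impressions": [10, 30, 5],
    "avg_position": [4.0, 6.0, 12.0],
})

# Sum the impressions and average the position per keyword and URL.
sum_df = df.groupby(["query", "page"]).agg({"impressions": "sum", "avg_position": "mean"})
sum_df.avg_position = sum_df.avg_position.round(0)
sum_df = sum_df.sort_values(by=["impressions"], ascending=False)
print(sum_df)  # ("python seo", "/a") ends up with 40 impressions and position 5.0
```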



<figure class="wp-block-image size-full"><img width="1015" height="379" src="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.20.56.png" alt="" class="wp-image-1193" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.20.56.png 1015w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.20.56-300x112.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/08/Captura-de-pantalla-2021-08-10-a-las-20.20.56-768x287.png 768w" sizes="(max-width: 1015px) 100vw, 1015px" /></figure>






<p>That is all folks, I hope that you found this blog post interesting! </p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/onpage-optimization-python-seo/">On-page optimization with Python for SEO</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Guide to SEO on-page scraping with Python</title>
		<link>https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=guide-seo-onpage-scraping-python</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Tue, 29 Jun 2021 10:40:41 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1175</guid>

					<description><![CDATA[<p>Python can be a very useful resource to scrape and extract on-page data from a page. In this post I am going to share with you the most common Python keys to extract the most important information from a page from an SEO perspective. 1.- Making the web request First, before starting to parse the HTML code of a page in order to obtain the data we are interested in, we need to make a request to the URL that we would like to scrape. The library that I usually use for this type of request is cloudscraper, which&#8230; <span><a href="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/">( Read More )</a></span></p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/">Guide to SEO on-page scraping with Python</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Python can be a very useful resource to scrape and extract on-page data from a page. In this post I am going to share with you the most common Python keys to extract the most important information from a page from an SEO perspective.</p>



<h2>1.- Making the web request</h2>



<p>First, before starting to parse the HTML code of a page in order to obtain the data we are interested in, we need to make a request to the URL that we would like to scrape. The library that I usually use for this type of request is <a href="https://pypi.org/project/cloudscraper/">cloudscraper</a>, which works in a very similar way to Requests but is much better at accessing websites that use Cloudflare without being banned. If you are interested in web scraping with Python, you can also read this article where I explain <a href="https://www.danielherediamejias.com/6-basic-tips-to-perform-web-scraping-with-python/">6 tricks for basic web scraping with Python.</a></p>



<p>Once we access the URL with cloudscraper, we will use <a href="https://pypi.org/project/beautifulsoup4/">BeautifulSoup</a> to parse the HTML code and obtain the SEO data.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper() 

html = scraper.get(&quot;&lt;your_url&gt;&quot;)
soup = BeautifulSoup(html.text, 'html.parser')
</pre></div>


<h2>2.- Metas scraping</h2>



<h3>2.1.- Metatitle</h3>



<p>The metatitle is one of the main metas, as all SEOs are aware: it is displayed on the search snippet and can be optimized for specific keywords. The metatitle can be obtained with this key:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
metatitle = (soup.find('title')).get_text()
</pre></div>
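<p>Note that soup.find() returns None when the tag is missing, so the one-liner above raises an AttributeError on pages without a &lt;title&gt;. A defensive sketch on a made-up snippet:</p>

```python
from bs4 import BeautifulSoup

# Made-up sample HTML; on a real page, soup comes from the request shown earlier.
soup = BeautifulSoup("<html><head><title>My page</title></head></html>", "html.parser")

# Guard against pages without a <title> before calling .get_text().
title_tag = soup.find("title")
metatitle = title_tag.get_text() if title_tag else ""
print(metatitle)  # My page
```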


<h3>2.2.- Metadescription</h3>



<p>Metadescriptions can be used to briefly explain what your page is about and they will also appear on the search snippet. The metadescription can be obtained with this key:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
metadescription = soup.find('meta',attrs={'name':'description'})&#91;&quot;content&quot;]
</pre></div>


<h3>2.3.- Robots</h3>



<p>The meta robots is a very important SEO tag, as it specifies whether the page can be indexed and gives some directives about how the page is to be shown on the SERPs. Noindex, index, follow, nofollow, noarchive, nosnippet, notranslate, noimageindex or unavailable_after directives can be found here, among others. If you are not familiar with some of these robots directives, you can find them all with their explanations <a href="https://developers.google.com/search/docs/advanced/robots/robots_meta_tag">over here</a>.</p>



<p>The key that we would need to use to extract this data is as follows and it will return a list with all the content directives which are separated by commas:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
robots_directives = soup.find('meta',attrs={'name':'robots'})&#91;&quot;content&quot;].split(&quot;,&quot;)
</pre></div>


<h3>2.4.- Viewport</h3>



<p>The meta viewport indicates the visible part of the page and is mainly important for mobile-friendliness purposes. The meta viewport can be extracted with this key:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
viewport = soup.find('meta',attrs={'name':'viewport'})&#91;&quot;content&quot;]
</pre></div>


<h3>2.5.- Charset</h3>



<p>The meta charset indicates the character encoding to search engine bots. It is especially useful when working on international SEO, for pages which are not written in English and might contain special characters.</p>



<p>The meta charset can be found with this key: </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
charset = soup.find('meta',attrs={'charset':True})&#91;&quot;charset&quot;]
</pre></div>


<h3>2.6.- HTML language</h3>



<p>The html lang attribute is not a meta itself, but it can be an interesting element when working on international SEO, as it gives search engine bots some hints about which country the page is intended to target. </p>



<p>We can get this element with the key:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
html_language = soup.find('html')&#91;&quot;lang&quot;]
</pre></div>


<h2>3.- Alternates and canonicals scraping</h2>



<h3>3.1.- Canonical</h3>



<p>Rel=&#8221;canonical&#8221; indicates to Google which page should be indexed. It can point to the page itself if that page is to be indexed, or to another page if that one is meant to be indexed instead. It is important to mention that Google treats canonicals as recommendations, so it might disregard the canonical indication.</p>



<p>Canonical URLs can be obtained with this key: </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
canonical = soup.find('link',attrs={'rel':'canonical'})&#91;&quot;href&quot;]
</pre></div>


<h3>3.2.- Hreflangs</h3>



<p>Hreflangs are especially important when a website has different language versions, in order to indicate to search engines which version should be indexed and shown in each country&#8217;s SERPs.</p>



<p>The key that needs to be used to extract hreflangs is as follows and it will return a list with the provided pages for each language and their country codes:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
list_hreflangs = &#91;&#91;a&#91;'href'], a&#91;&quot;hreflang&quot;]] for a in soup.find_all('link', href=True, hreflang=True)]
</pre></div>
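<p>Run on a made-up snippet with two alternate versions, the key above returns the URL and language code pairs:</p>

```python
from bs4 import BeautifulSoup

# Invented example markup with two language versions.
html = '''
<link rel="alternate" href="https://example.com/en/" hreflang="en" />
<link rel="alternate" href="https://example.com/es/" hreflang="es" />
'''
soup = BeautifulSoup(html, "html.parser")

list_hreflangs = [[a["href"], a["hreflang"]] for a in soup.find_all("link", href=True, hreflang=True)]
print(list_hreflangs)  # [['https://example.com/en/', 'en'], ['https://example.com/es/', 'es']]
```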


<h3>3.3.- Mobile alternates</h3>



<p>Mobile alternates are used when a page has a mobile version hosted on a different URL. This is mainly the case for some non-responsive websites which do not use dynamic serving and have a mobile subdomain. </p>



<p>The mobile alternates can be extracted with: </p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
mobile_alternate = soup.find('link',attrs={'media':'only screen and (max-width: 640px)'})&#91;&quot;href&quot;]
</pre></div>


<h2>4.- Schema mark-up scraping</h2>



<h3>4.1.- Quick schema mark-up overview</h3>



<p>It is especially easy to extract and analyze schema mark-up if it has been inserted in JSON format. We can extract the whole script and then analyze it as if it were a Python dictionary. First, we need to find and parse the script with BeautifulSoup.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import json

json_schema = soup.find('script',attrs={'type':'application/ld+json'})
json_file = json.loads(json_schema.get_text())
</pre></div>


<p>Then, we can iterate over the schema mark-up and see at first glance which types of mark-up are being used by the page:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
for x in json_file&#91;&quot;@graph&quot;]:
    print(x&#91;&quot;@type&quot;])
</pre></div>


<p>For instance:</p>



<div class="wp-block-image"><figure class="aligncenter size-large"><img width="444" height="237" src="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-29-a-las-12.45.44.png" alt="" class="wp-image-1184" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-29-a-las-12.45.44.png 444w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-29-a-las-12.45.44-300x160.png 300w" sizes="(max-width: 444px) 100vw, 444px" /></figure></div>
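<p>Note that this iteration assumes the mark-up is nested under an &#8220;@graph&#8221; key, as WordPress SEO plugins usually emit; other sites may output a single top-level object instead. A self-contained sketch with an invented payload:</p>

```python
import json
from bs4 import BeautifulSoup

# Minimal made-up page embedding a JSON-LD graph.
html = '''<script type="application/ld+json">
{"@graph": [{"@type": "WebSite"}, {"@type": "Organization"}]}
</script>'''
soup = BeautifulSoup(html, "html.parser")

json_file = json.loads(soup.find("script", attrs={"type": "application/ld+json"}).get_text())
types = [x["@type"] for x in json_file["@graph"]]
print(types)  # ['WebSite', 'Organization']
```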



<h3>4.2.- Breadcrumbs</h3>



<p>If the page has a BreadcrumbList schema mark-up, we can use Python to extract the URLs provided in the mark-up and get an idea of its parent pages and its internal structure depth:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
#The &#91;3] index assumes the BreadcrumbList is the fourth node in the @graph; adjust it for your page.
breadcrumb_urls = &#91;&#91;x&#91;&quot;position&quot;],x&#91;&quot;item&quot;]] if &quot;item&quot; in str(x) else &#91;x&#91;&quot;position&quot;],&quot;Final URL&quot;] for x in json_file&#91;&quot;@graph&quot;]&#91;3]&#91;&quot;itemListElement&quot;]]
breadcrumb_depth = len(breadcrumb_urls)
</pre></div>


<h2>5.- Content scraping</h2>



<h3>5.1.- Text</h3>



<h4>5.1.1.- Paragraphs</h4>



<p>The following key will scrape the paragraph texts (&#8220;p&#8221; tags) and return a list with all of them. In addition, we can calculate the number of characters in these texts.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
paragraph = &#91;a.get_text() for a in soup.find_all('p')]
#Text length
text_length = sum(&#91;len(a) for a in paragraph])
</pre></div>


<h4>5.1.2.- Headings</h4>



<p>BeautifulSoup enables us to extract a specific type of heading, say H1, but it also enables us to extract different types of tags at once by using the method find_all and passing a list with all the tags that we would like to extract. Translated into code, it would be:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
h1 = &#91;a.get_text() for a in soup.find_all('h1')]
headers = soup.find_all(&#91;&quot;h1&quot;,&quot;h2&quot;,&quot;h3&quot;,&quot;h4&quot;,&quot;h5&quot;,&quot;h6&quot;])

#Cleaning the headers list to get the tag and the text as different elements in a list
list_headers = &#91;&#91;str(x)&#91;1:3], x.get_text()] for x in headers]
</pre></div>


<p>With Jupyter Notebook we can render HTML code, so we can actually see at first sight what the heading hierarchy of a page looks like, as the font size will be smaller or larger depending on the heading type:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from IPython.core.display import display, HTML
for x in headers:
    display(HTML(str(x)))
</pre></div>


<p>For example, the headers structure of my article about <a href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">scraping the SERPs with Oxylabs API</a> is represented as:</p>



<figure class="wp-block-image size-large"><img width="1024" height="497" src="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.38.41-1024x497.png" alt="" class="wp-image-1178" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.38.41-1024x497.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.38.41-300x145.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.38.41-768x372.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.38.41.png 1091w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h3>5.2.- Images</h3>



<p>Something that might be interesting is scraping the image URLs from a page, to get an idea of how many images are being used, together with their alt texts to see if they are properly optimized.</p>



<p>With this piece of code we can get all the image URLs from that page and their alt texts in a list format:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
images = &#91;&#91;a&#91;&quot;src&quot;],a&#91;&quot;alt&quot;]] if &quot;alt&quot; in str(a) else &#91;a&#91;&quot;src&quot;],&quot;&quot;] for a in soup.find_all('img')]
</pre></div>


<h3>5.3.- Links</h3>



<p>We can also extract the links and obtain two lists containing the internal and external links with their anchor texts and whether they are follow or nofollow.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
internal_links = &#91;&#91;a.get_text(), a&#91;&quot;href&quot;], &quot;nofollow&quot;] if &quot;nofollow&quot; in str(a) else &#91;a.get_text(), a&#91;&quot;href&quot;], &quot;follow&quot;] for a in soup.find_all('a', href=True) if &quot;&lt;your_domain&gt;&quot; in a&#91;&quot;href&quot;] or a&#91;&quot;href&quot;].startswith(&quot;/&quot;)]
external_links = &#91;&#91;a.get_text(), a&#91;&quot;href&quot;], &quot;nofollow&quot;] if &quot;nofollow&quot; in str(a) else &#91;a.get_text(), a&#91;&quot;href&quot;], &quot;follow&quot;] for a in soup.find_all('a', href=True) if &quot;&lt;your_domain&gt;&quot; not in a&#91;&quot;href&quot;] and not a&#91;&quot;href&quot;].startswith(&quot;/&quot;)]

#To get the number of links
number_internal_links = len(internal_links)
number_external_links = len(external_links)
</pre></div>


<p>On the other hand, we can also differentiate our links depending on where they are nested. For instance, if the links are nested under a paragraph or a heading tag, we can assume that they are contextual links, whereas if they are nested under div or span tags, we can assume that they are not.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
contextual_links = &#91;&#91;a.get_text(), a&#91;&quot;href&quot;], &quot;nofollow&quot;] if &quot;nofollow&quot; in str(a) else &#91;a.get_text(), a&#91;&quot;href&quot;], &quot;follow&quot;] for x in soup.find_all(&#91;&quot;p&quot;, &quot;h1&quot;, &quot;h2&quot;, &quot;h3&quot;, &quot;h4&quot;, &quot;h5&quot;, &quot;h6&quot;]) for a in x.find_all('a', href=True)]
div_links = &#91;&#91;a.get_text(), a&#91;&quot;href&quot;], &quot;nofollow&quot;] if &quot;nofollow&quot; in str(a) else &#91;a.get_text(), a&#91;&quot;href&quot;], &quot;follow&quot;] for x in soup.find_all(&#91;&quot;div&quot;,&quot;span&quot;]) for a in x.find_all('a', href=True)]
</pre></div>
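<p>To see the internal/external classification at work, here is a condensed sketch of the same logic on an invented snippet (replace example.com with your own domain):</p>

```python
from bs4 import BeautifulSoup

# Invented snippet: one internal link and one nofollowed external link.
html = '''
<p>Read the <a href="/guide">guide</a> or visit
<a href="https://other.com" rel="nofollow">another site</a>.</p>
'''
soup = BeautifulSoup(html, "html.parser")

def rel(a):
    # Same trick as above: "nofollow" appears in the tag's string form.
    return "nofollow" if "nofollow" in str(a) else "follow"

links = soup.find_all("a", href=True)
internal_links = [[a.get_text(), a["href"], rel(a)] for a in links
                  if "example.com" in a["href"] or a["href"].startswith("/")]
external_links = [[a.get_text(), a["href"], rel(a)] for a in links
                  if "example.com" not in a["href"] and not a["href"].startswith("/")]

print(internal_links)  # [['guide', '/guide', 'follow']]
print(external_links)  # [['another site', 'https://other.com', 'nofollow']]
```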


<h2>6.- Open Graph scraping</h2>



<p>Last but not least, we can scrape the Open Graph tags and create a list that will contain each Open Graph property and its content:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
open_graph = &#91;&#91;a&#91;&quot;property&quot;].replace(&quot;og:&quot;,&quot;&quot;),a&#91;&quot;content&quot;]] for a in soup.select(&quot;meta&#91;property^=og]&quot;)]
</pre></div>


<p>For example:</p>



<figure class="wp-block-image size-large"><img width="1024" height="256" src="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.56.46-1024x256.png" alt="" class="wp-image-1179" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.56.46-1024x256.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.56.46-300x75.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.56.46-768x192.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/Captura-de-pantalla-2021-06-28-a-las-0.56.46.png 1091w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
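<p>On an invented pair of Open Graph tags, the selector yields the property and content pairs:</p>

```python
from bs4 import BeautifulSoup

# Made-up Open Graph tags.
html = '''
<meta property="og:title" content="My article" />
<meta property="og:type" content="article" />
'''
soup = BeautifulSoup(html, "html.parser")

open_graph = [[a["property"].replace("og:", ""), a["content"]] for a in soup.select("meta[property^=og]")]
print(open_graph)  # [['title', 'My article'], ['type', 'article']]
```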






<p>That is all folks, I hope that you find this guide helpful and if you happen to think of an element that I might have neglected, just let me know and I will add it to the guide! </p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/guide-seo-onpage-scraping-python/">Guide to SEO on-page scraping with Python</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Scraping the Google SERPs with Python and Oxylabs&#8217; API</title>
		<link>https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=scraping-google-serps-python-oxylabs</link>
		
		<dc:creator><![CDATA[danielheredia]]></dc:creator>
		<pubDate>Mon, 14 Jun 2021 18:29:38 +0000</pubDate>
				<category><![CDATA[Python Scripts for SEOs]]></category>
		<guid isPermaLink="false">https://www.danielherediamejias.com/?p=1155</guid>

					<description><![CDATA[<p>In today&#8217;s post I am going to show you how you can scrape the SERPs with Python and Oxylabs, which has a Real Time Crawler API that uses a global proxy pool to prevent Google from banning your IP. In addition, I will also show you some cases where scraping the SERPs can be useful for several purposes such as indexation analyses, getting the number of indexed results, finding partners and/or sales opportunities, etcetera. Does it sound interesting? Let&#8217;s get started then! 1.- How does Oxylabs work? The Real Time Crawler service works in a very easy way, you&#8230; <span><a href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">( Read More )</a></span></p>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">Scraping the Google SERPs with Python and Oxylabs&#8217; API</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>In today&#8217;s post I am going to show you how you can scrape the SERPs with Python and Oxylabs, which offers a <a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;aff_id=284&amp;url_id=9">Real Time Crawler API</a> that uses a global proxy pool to prevent Google from banning your IP. In addition, I will show you some cases where scraping the SERPs can be useful, such as indexation analyses, getting the number of indexed results, finding partners and/or sales opportunities, etcetera. </p>



<p>Does it sound interesting? Let&#8217;s get started then! </p>



<figure class="wp-block-image size-large"><img width="1024" height="493" src="https://www.danielherediamejias.com/wp-content/uploads/2021/06/image-1024x493.png" alt="" class="wp-image-1168" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/06/image-1024x493.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/image-300x144.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/image-768x370.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/06/image.png 1251w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<h2>1.- How does Oxylabs work?</h2>



<p>The Real Time Crawler service works in a very easy way: you only need to make a simple HTTP POST request with the library <a href="https://docs.python-requests.org/en/master/">Requests</a> to their endpoint and then you will be able to retrieve the data from the SERPs. </p>



<p>For example:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

print(response.json())
</pre></div>


<p>With this query we would scrape the SERPs for the query adidas from google.com. </p>



<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>It is worth mentioning that the requests support several parameters which can help us to customize our SERPs scraping. The parameters that I usually work with to customize the SERPs scraping are:</p>



<ul><li><strong>domain</strong>: you can introduce any specific Google ccTLD to extract the results from different country versions. </li><li><strong>query</strong>: this enables us to introduce our query. It also accepts Google&#8217;s commands such as &#8220;site:&#8221;, &#8220;intitle:&#8221; and so on, as we will see in the practical examples. </li><li><strong>start_page</strong>: the page from which we would like to start the SERPs scraping.</li><li><strong>pages</strong>: number of pages to be scraped.</li><li><strong>limit</strong>: how many results will be displayed on each page. </li><li><strong>geo_location</strong>: the geographical location that the results will be adapted for. You can find and download as a CSV file all the geolocations from the <a href="https://developers.google.com/adwords/api/docs/appendix/geotargeting">Google Adwords geo-targets documentation</a>. </li><li><strong>user_agent_type</strong>: you can specify the type of device. The main values that I usually use in this parameter are either &#8220;desktop&#8221; or &#8220;mobile&#8221;, although you can also specify a browser as shown in<a href="https://docs.oxylabs.io/resources/user_agent_type.json"> this page, which compiles all the supported user agents</a>. </li><li><strong>render</strong>: you can render the SERPs, which is especially useful to get some of the features that are only accessible once the SERPs are rendered, such as the carousel of Google News and other rich snippets. The values that you can insert for this parameter are &#8220;html&#8221; or &#8220;png&#8221; (in case you would like to get a base64-encoded screenshot of what the rendered SERPs look like).</li><li><strong>parse</strong>: if this value is &#8220;true&#8221;, the response will be structured in JSON format. If not, it will be returned as HTML. </li></ul>
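<p>Putting several of these parameters together, a customized payload might look like this (the values are only illustrative):</p>

```python
# Illustrative payload combining the parameters described above:
# pages 1 and 2 of mobile UK results for a site: query, parsed as JSON.
payload = {
    'source': 'google_search',
    'domain': 'co.uk',
    'query': 'site:example.com',
    'start_page': 1,
    'pages': 2,
    'limit': 100,
    'geo_location': 'United Kingdom',
    'user_agent_type': 'mobile',
    'parse': 'true',
}
print(payload['query'])  # site:example.com
```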






<p>It is important to clarify that if we need to render the results, we will need to use a different endpoint (https://data.oxylabs.io/v1/queries) and we will receive a URL in the response, from which we will be able to access the results by making a GET request to the URL obtained from the key [&#8220;_links&#8221;][1][&#8220;href&#8221;].</p>



<p>Here is an example of how our code would need to look in order to retrieve the results from a rendered SERPs page:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'render':'html',
    'parse':'true'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

response2 = requests.request(
                'GET',
                response.json()&#91;&quot;_links&quot;]&#91;1]&#91;&quot;href&quot;],
                auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
                json=payload,
            )

response_rendering = response2.json()&#91;&quot;results&quot;]
</pre></div>


<p>From my point of view, rendering the SERPs is the best way to get the most out of this tool, although if rendering is not necessary we can save some time by only parsing the raw HTML code, as rendering takes a while. In fact, to avoid breaking the code because the page has not been rendered yet when the request is made, it is advisable to use a while loop and the time module to fetch the response once it is eventually ready:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests
import time

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'adidas',
    'render':'html',
    'parse':'true'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

response_rendering = None
while response_rendering is None:
    try:
        response2 = requests.request(
            'GET',
            response.json()&#91;&quot;_links&quot;]&#91;1]&#91;&quot;href&quot;],
            auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
        )

        response_rendering = response2.json()&#91;&quot;results&quot;]
    except Exception:
        # Results are not ready yet: wait and try again.
        print(&quot;trying again&quot;)
        time.sleep(10)

</pre></div>


<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2>2.- What can you get by default?</h2>



<p>If we render the SERPs and ask for the data to be delivered in the structured format, by default we can get:</p>



<ul><li><strong>Paid results</strong>: with their positions, URLs, descriptions and titles. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;paid&#8221;].</li><li><strong>Organic results</strong>: with their positions, URLs, descriptions and titles. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;organic&#8221;].</li><li><strong>Video results</strong>: with their positions, URLs, titles and authors. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;videos&#8221;].</li><li><strong>Top stories</strong>: with their positions, sources, URLs and headlines. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;top_stories&#8221;].</li><li><strong>Related searches</strong>. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;related_searches&#8221;].</li><li><strong>Related questions</strong>: with their answers, source URLs and source titles. Key: response_rendering[0][&#8220;content&#8221;][&#8220;results&#8221;][&#8220;related_questions&#8221;].</li></ul>



<h3>2.1.- Creating a list with the paid results: URLs, titles and descriptions</h3>



<p>We will use a for loop to iterate over the JSON file and transform the results into a list:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
paid = &#91;]
for x in response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;paid&quot;]:
    paid.append(&#91;x&#91;&quot;pos&quot;],x&#91;&quot;url&quot;],x&#91;&quot;title&quot;], x&#91;&quot;desc&quot;]])
</pre></div>


<h3>2.2.- Creating a list with the organic results: URLs, titles and descriptions</h3>



<p>The same logic is used to get the organic results:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
organic_results = &#91;]
for x in response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    organic_results.append(&#91;x&#91;&quot;pos&quot;], x&#91;&quot;url&quot;], x&#91;&quot;title&quot;], x&#91;&quot;desc&quot;]])
</pre></div>


<h3>2.3.- Creating a list with the video results: URLs, titles and authors</h3>



<p>In this case, some of the keys are slightly different as we will extract the authors and the overall positions.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
video_results = &#91;]
for x in response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;videos&quot;]:
    video_results.append(&#91;x&#91;&quot;pos_overall&quot;], x&#91;&quot;url&quot;], x&#91;&quot;title&quot;], x&#91;&quot;author&quot;]])
</pre></div>


<h3>2.4.- Creating a list with the top stories results: URLs, headlines and sources</h3>



<p>Some of the keys are also different as we extract the headlines and the overall positions.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
top_stories = &#91;]
for x in response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;top_stories&quot;]:
    top_stories.append(&#91;x&#91;&quot;pos_overall&quot;], x&#91;&quot;url&quot;], x&#91;&quot;headline&quot;], x&#91;&quot;source&quot;]])
</pre></div>


<h3>2.5.- Creating a list with the related searches</h3>



<p>We can also extract the related searches and create a list with them:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
# This key already contains the list of related searches.
related_searches = response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;related_searches&quot;]&#91;&quot;related_searches&quot;]
</pre></div>


<h3>2.6.- Creating a list with the related questions: questions, answers and source URLs</h3>



<p>Finally, we can get the related questions, their answers and their source URLs.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
related_questions = &#91;]
for x in response_rendering&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;related_questions&quot;]:
    related_questions.append(&#91;x&#91;&quot;pos&quot;],x&#91;&quot;question&quot;],x&#91;&quot;answer&quot;], x&#91;&quot;source&quot;]&#91;&quot;url&quot;]])
</pre></div>


<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2>3.- Some practical cases</h2>



<p>SERPs scraping can be used for an endless number of tasks and activities. Some of the tasks that I perform frequently by scraping the SERPs are: checking the number of indexed results, running indexation analyses, finding new partners, influencers and/or sales opportunities, and discovering guest-posting opportunities.</p>



<h3>3.1.- Number of indexed results</h3>



<p><a href="https://www.jcchouinard.com/get-number-of-indexed-pages-on-multiple-sites-with-python/">As JC Chouinard explained in this article</a>, it might be interesting to scrape the SERPs to extract the number of pages that are indexed for a query. If we combine this method with Google commands like &#8220;intitle&#8221; or &#8220;inurl&#8221;, we can get an idea not only of the number of indexed pages for a query, but also of the number of pages which contain the keyword that we would like to target in their metatitles.</p>



<p>Theoretically, if we include an exact match of a keyword in our metatitle, we might have a much better chance of ranking for that specific keyword than pages where the keyword is not mentioned in the title. Therefore, in order to evaluate how competitive a keyword is, it makes sense to use the command &#8220;intitle&#8221; and extract the number of indexed results.</p>



<p>This is something we can do with the <a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;aff_id=284&amp;url_id=9">Real Time Crawler</a> without having to render the SERPs page and using <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">BeautifulSoup</a> to create an object that can be parsed as shown below:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests
from bs4 import BeautifulSoup

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'intitle:&quot;buy white shoes&quot;',
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

soup = BeautifulSoup(response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;], &quot;html.parser&quot;)
div_results = soup.find(&quot;div&quot;, {&quot;id&quot;: &quot;result-stats&quot;})
indexed_results = int(div_results.text.split(&quot;About&quot;)&#91;1].split(&quot;results&quot;)&#91;0].replace(&quot; &quot;,&quot;&quot;).replace(&quot;,&quot;,&quot;&quot;))

</pre></div>


<p>With this technique, we can analyze the competitiveness of lots of keywords in bulk and focus our efforts on less competitive keywords whose ROI might be higher.</p>
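<p>As a minimal sketch of that bulk analysis (the helper names are hypothetical, and the parsing assumes the same &#8220;About X results&#8221; string as the snippet above), we could loop over a list of keywords and collect the indexed-results count for each one:</p>

```python
def parse_indexed_count(result_stats_text):
    # Turn a string like "About 1,230,000 results (0.42 seconds)" into an int.
    return int(result_stats_text.split("About")[1].split("results")[0]
               .replace(" ", "").replace(",", ""))

def indexed_results_for(keyword, auth):
    # One Real Time Crawler request per keyword, wrapped in the intitle: command.
    import requests  # imported here so the parsing helper stays dependency-free
    from bs4 import BeautifulSoup

    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'intitle:"' + keyword + '"',
    }
    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=auth,
        json=payload,
    )
    soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
    return parse_indexed_count(soup.find("div", {"id": "result-stats"}).text)

# Hypothetical usage: rank keywords from least to most competitive.
# auth = ('<your_username>', '<your_password>')
# counts = {kw: indexed_results_for(kw, auth) for kw in ["buy white shoes", "buy red shoes"]}
# for kw, n in sorted(counts.items(), key=lambda kv: kv[1]):
#     print(kw, n)
```

Sorting the resulting counts in ascending order surfaces the least competitive keywords first.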



<h3>3.2.- Indexation analyses</h3>



<p>Another task that can be done with proxies and the command &#8220;site:&#8221; is an indexation analysis (unfortunately, Google Search Console only provides a few of the indexed URLs from a site). For these analyses we will not need to render the SERPs; we will basically use the parameter &#8220;parse&#8221; to get the data as a JSON file. If there are no indexed URLs, the organic results will be empty.</p>



<p>For instance, with this piece of code I was able to retrieve all the results that are indexed from my site:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:danielherediamejias.com',
    'parse':'true',
    'start_page':1,
    'limit':100,
    'pages':3
    
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

number_indexed_results = len(response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;])

indexed_urls = &#91;]
for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    indexed_urls.append(&#91;x&#91;&quot;url&quot;]])
</pre></div>


<p>We can also take a batch of URLs from a sitemap, for example, and check whether each of them is indexed, in order to understand the indexation coverage of that sitemap. For this, we will need to retrieve the results from the SERPs and check whether they exactly match the original URL, since in some cases the root URL might not be indexed while some of its subdirectories are.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

url = &quot;&lt;your_url&gt;&quot; 

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:' + url,
    'parse':'true',
    'start_page':1,
    'limit':100,
    'pages':3
    
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)


indexation = False
for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    try:
        if x&#91;&quot;url&quot;].endswith(url):
            indexation = True
            break
    except:
        pass
</pre></div>
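<p>A minimal sketch of that sitemap check, assuming hypothetical helper names and the same payload as above, could wrap the single-URL logic into a function and report the coverage ratio:</p>

```python
def exact_match(organic_results, url):
    # True only if some organic result's URL ends with the exact URL we asked for,
    # so an indexed subdirectory does not count as the URL itself being indexed.
    return any(x.get("url", "").endswith(url) for x in organic_results)

def url_is_indexed(url, auth):
    # One site: query per URL through the Real Time Crawler.
    import requests  # imported here so exact_match stays dependency-free

    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + url,
        'parse': 'true',
        'start_page': 1,
        'limit': 100,
        'pages': 3,
    }
    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=auth,
        json=payload,
    )
    organic = response.json()["results"][0]["content"]["results"]["organic"]
    return exact_match(organic, url)

def sitemap_coverage(sitemap_urls, auth):
    # Share of the sitemap's URLs that Google reports as indexed (0.0 to 1.0).
    indexed = [u for u in sitemap_urls if url_is_indexed(u, auth)]
    return len(indexed) / len(sitemap_urls), indexed
```

Note that this fires one request per sitemap URL, so the cost grows linearly with the size of the sitemap.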


<p>Regarding these indexation analyses, we can also start from an initial domain, use the command site:, extract all the URLs provided by Google (around 300), store them, and then iterate again over all the URLs that have been extracted. This is especially useful on very large websites: it will give you an idea of all the URLs that are indexed from a site, and you might discover pages and sections that were not supposed to be indexed.</p>



<p>Basically, what we will do is create a list with all the URLs and append new URLs as we iterate over the list and get more results from Google&#8217;s index:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

list_url = &#91;&quot;&lt;your_initial_url&gt;&quot;]

for iteration in list_url:
    
    print(iteration)
    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + iteration,
        'parse':'true',
        'start_page':1,
        'limit':100,
        'pages':3

    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
        json=payload,
    )


    for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
        present = False
        try:
            for y in list_url:
                if x&#91;&quot;url&quot;] == y:
                    present = True
            
            if present == False:
                list_url.append(x&#91;&quot;url&quot;])
        except:
            pass
</pre></div>


<p>This process is quite proxy-consuming, because we will essentially run the site: command for every page that is found, and it can be inaccurate if the number of indexed pages under a directory exceeds roughly 300 results (usually the maximum number of results that Google will return for a query). </p>



<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h3>3.3.- Finding new partners, influencers and/or sales opportunities</h3>



<p>The proxies can be used not only for SEO purposes, but also for finding email addresses to collaborate with new partners and/or influencers or to find new sales opportunities. </p>



<p>How can we do this? We can use commands like site: and intext: to search for indexed results from a website that contain email addresses. For instance, if we would like to find YouTube channels with email addresses that might be open to collaborations, we could use the search pattern: &#8220;site:youtube.com inurl:/about/ intext:@gmail.com&#8221;.</p>



<p>First we use the proxies to extract the URLs:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:youtube.com inurl:/about/ intext:@gmail.com',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

list_url = &#91;]
for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    try:
        list_url.append(&#91;x&#91;&quot;url&quot;]])
    except:
        pass
</pre></div>


<p>Then, once we have extracted the indexed URLs, we can visit them with Selenium and use a regular expression to extract the email addresses, so that we can build our own database and contact them.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
import re

driver = webdriver.Chrome(ChromeDriverManager().install())

for iteration in range (len(list_url)):
    
    
    driver.get(list_url&#91;iteration]&#91;0])
    
    if iteration == 0:
        # Accept the cookie-consent dialog that appears on the first visit.
        consent_button = driver.find_element_by_xpath('/html/body/c-wiz/div/div/div/div&#91;2]/div&#91;1]/div&#91;4]/form/div&#91;1]/div/button/span')
        consent_button.click()

    html = driver.page_source
    email_address = re.findall(r&quot;&#91;a-z0-9\.\-+_]+@&#91;a-z0-9\.\-+_]+\.&#91;a-z]+&quot;, html)
    email_address  = list(dict.fromkeys(email_address))
    list_url&#91;iteration].append(email_address)
    
    time.sleep(2)
    
driver.close()
</pre></div>


<h3>3.4.- Finding Guest-posting opportunities</h3>



<p>Another activity that can be automated with proxies is the discovery of new guest-posting opportunities. This task is pretty straightforward: you will only need to extract the URLs returned for queries such as “write with us”, “write for us”, “guest posting policy”, “guest posting rules”, “guest posting guidelines”, &#8220;we accept guest posts”, “submit a guest post&#8221;&#8230;</p>



<p>If you would like to narrow down the results, you can also add a representative term about the topic of the websites that you are looking for:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': '&quot;write with us&quot; digital marketing',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

list_url = &#91;]
for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    try:
        list_url.append(&#91;x&#91;&quot;url&quot;]])
    except:
        pass
</pre></div>
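<p>Since there are several footprints, a small sketch (hypothetical helper names, same payload structure as the block above) could loop over all of them for a given topic and deduplicate the collected URLs:</p>

```python
FOOTPRINTS = [
    "write with us",
    "write for us",
    "guest posting guidelines",
    "submit a guest post",
]  # extend with the other footprints as needed

def dedupe(urls):
    # Keep the first occurrence of each URL, preserving order.
    return list(dict.fromkeys(urls))

def guest_post_urls(topic, auth):
    # Collect the organic URLs for every footprint + topic combination.
    import requests  # imported here so dedupe stays dependency-free

    urls = []
    for footprint in FOOTPRINTS:
        payload = {
            'source': 'google_search',
            'domain': 'com',
            'query': '"' + footprint + '" ' + topic,
            'parse': 'true',
            'limit': 100,
        }
        response = requests.request(
            'POST',
            'https://realtime.oxylabs.io/v1/queries',
            auth=auth,
            json=payload,
        )
        for x in response.json()["results"][0]["content"]["results"]["organic"]:
            urls.append(x["url"])
    return dedupe(urls)
```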


<p>As an extra step, we can check whether those websites serve DoubleClick ads. If they do, it is very likely that they are monetizing their website, and they might be open to guest posts on their sites.</p>



<p>You can check for Google DoubleClick ads with:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

driver = webdriver.Chrome(ChromeDriverManager().install())

for iteration in range (len(list_url)):
       
    driver.get(list_url&#91;iteration]&#91;0])
    html = driver.page_source
    
    guest_posting = False
    if &quot;doubleclick&quot; in html:
        guest_posting = True
    
    list_url&#91;iteration].append(guest_posting)
    
    time.sleep(2)
    
driver.close()
</pre></div>


<p>In addition, if you would like to categorize the extracted websites based on their topics in a bulk mode, you can have a read at this post where I explain <a href="https://www.danielherediamejias.com/website-categorization-python/">how you can use Python and Google NLP API for website categorization</a>. </p>



<h3>3.5.- Recruitment</h3>



<p>Last but not least, proxies can also be used for recruitment and for finding suitable candidates for open positions. If, for instance, we were searching for a candidate to fill an SEO position in Barcelona who is able to code in Python, we could use the command: site:es.linkedin.com/in/ intitle:&#8221;seo&#8221; intext:barcelona intext:python.</p>



<p>Therefore, getting the profiles from Google&#8217;s index would be quite easy with:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import requests

payload = {
    'source': 'google_search',
    'domain': 'com',
    'query': 'site:es.linkedin.com/in/ intitle:&quot;seo&quot; intext:barcelona intext:python',
    'parse':'true',
    'limit':100

}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('&lt;your_username&gt;', '&lt;your_password&gt;'),
    json=payload,
)

list_url = &#91;]
for x in response.json()&#91;&quot;results&quot;]&#91;0]&#91;&quot;content&quot;]&#91;&quot;results&quot;]&#91;&quot;organic&quot;]:
    try:
        list_url.append(&#91;x&#91;&quot;url&quot;]])
    except:
        pass
</pre></div>


<figure class="wp-block-image size-large"><a href="https://oxylabs.io/products/scraper-api/serp?utm_source=Daniel+Heredia&amp;utm_medium=affiliate&amp;adgroupid=284&amp;transaction_id=10297089d262be58326fcda39ce5c0"><img width="1024" height="553" src="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png" alt="" class="wp-image-1298" srcset="https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1024x553.png 1024w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-300x162.png 300w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-768x415.png 768w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner-1536x830.png 1536w, https://www.danielherediamejias.com/wp-content/uploads/2021/12/Scraper-APIs-Free-trial-banner.png 2040w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<h2>4.- Alternatives to scrape the SERPs for free</h2>



<p>There are some alternatives for scraping the SERPs for free, although unfortunately they do not offer as many features as Oxylabs does.</p>



<h3>4.1.- Googlesearch Python library</h3>



<p><a href="https://github.com/MarioVilas/googlesearch">Mario Vilas created a library</a> that enables you to scrape the SERPs without using proxies. However, after around 20 requests, Google may ban your IP and you will not be able to continue scraping.</p>



<p>This piece of code will return the URLs showing up for the query adidas on the Spanish version of Google:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
from googlesearch import search
for url in search('adidas', tld='es', lang='es', stop=20):
    print(url)
</pre></div>


<h3>4.2.- Google Custom Search API</h3>



<p>Koray Tuğberk explained in this article <a href="https://www.holisticseo.digital/python-seo/data-science/">how to use the Google Custom Search API</a> to retrieve the results that turn up for a specific query via <a href="https://pypi.org/project/advertools/">advertools</a>. You will only need to create a project on Google Developer Console, get your credentials and enable the Google Custom Search API.</p>



<p>Once you have obtained your credentials, you can make use of this API with this piece of code:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: python; title: ; notranslate">
import advertools as adv
api_key, cse_id = &quot;YOUR API KEY&quot;, &quot;YOUR CSE ID&quot;
adv.serp_goog(key=api_key, cx=cse_id, q=&quot;Example Query&quot;, gl=&#91;&quot;example country code&quot;])
</pre></div>


<p>Unfortunately, this API does not support most of the SERP rich snippets or commands, so if you would like to make a more extensive analysis, you might need to use a premium tool like the <a href="https://oxylabs.go2cloud.org/aff_c?offer_id=7&amp;aff_id=284&amp;url_id=9">Real Time Crawler</a> from Oxylabs.</p>



<h3>4.3.- Chrome extensions</h3>



<p>If you need to scrape the SERPs but do not need to do it for many queries, you can also make use of Chrome extensions to extract the indexed results. You can use Web Scraper, <a href="https://www.danielherediamejias.com/scrape-the-serps-with-a-chrome-extension/">as I explained in this article about scraping the SERPs with a Google Chrome extension</a>.</p>



<h2>5.- FAQ section</h2>



<div class="schema-faq wp-block-yoast-faq-block"><div class="schema-faq-section" id="faq-question-1623694840997"><strong class="schema-faq-question">Can I use the Real Time Crawler from Oxylabs with Python?</strong> <p class="schema-faq-answer">Yes, the Real Time Crawler supports Python language, so you can use Python to scrape the Google SERPs as explained in this blog post.</p> </div> <div class="schema-faq-section" id="faq-question-1623694917252"><strong class="schema-faq-question">What can I do with proxies for SEO?</strong> <p class="schema-faq-answer">Rankings monitoring for organic, paid, top stories and video results, indexation analyses, getting the number of indexed pages for a query, finding guest-posting opportunities&#8230;</p> </div> <div class="schema-faq-section" id="faq-question-1623695051351"><strong class="schema-faq-question">How long will it take me to use Oxylabs with Python?</strong> <p class="schema-faq-answer">Not long, as you can use most of my code samples although you might need to make small changes.</p> </div> </div>
<p>The post <a rel="nofollow" href="https://www.danielherediamejias.com/scraping-google-serps-python-oxylabs/">Scraping the Google SERPs with Python and Oxylabs&#8217; API</a> appeared first on <a rel="nofollow" href="https://www.danielherediamejias.com">Daniel Heredia</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
