Collect Domain Security Information with Python
In this tutorial, we will learn how to automate the collection of various domain-related technical information using Python. The script gathers data such as WHOIS details, DNS records, SSL certificates, reverse IP lookup, blacklist status, robots.txt, and more. Using the pandas library, we also show how to store the collected data in a CSV file. This tutorial is ideal for anyone interested in automating domain monitoring or data collection tasks. Typical users include SEO specialists, webmasters, web hosts, and competitive data analysts.
Requirements
Before you begin, ensure Python is installed on your system or notebook. You’ll also need the following Python packages:
- pandas – For working with data and saving results to CSV.
- requests – To handle HTTP requests to external APIs and websites.
- dnspython – For querying DNS records.
- pyOpenSSL – For retrieving SSL certificate information.
- python-whois – To fetch WHOIS details for a domain (imported in code as whois).
You can install these libraries using pip:
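Everything installs in one command (note that the WHOIS library's PyPI name is python-whois, even though it is imported as whois):

```shell
pip install pandas requests dnspython pyOpenSSL python-whois
```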
Load Domain Data
Start by loading a CSV file that lists the domains you want to process. The CSV should include a "clientid" column and a "url" column containing the domain URLs (these are the column names the loop at the end of the tutorial expects). The resulting DataFrame lets you iterate over each domain to collect information.
import json
import re
import socket
import ssl
from datetime import datetime

import dns.resolver
import OpenSSL
import pandas as pd
import requests
import whois

csv_file = "client_data.csv"
df = pd.read_csv(csv_file)
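For reference, a minimal client_data.csv might look like this (the values are made up; only the clientid and url column names matter):

```
clientid,url
101,https://www.example.com/
102,https://example.org/
```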
Collect WHOIS Information
WHOIS data provides domain details such as the registrar, creation date, and expiration date. The whois module fetches this data. The function below extracts the relevant fields and returns the registrar, creation date, and expiry date.
def get_whois_info(domain):
    w = whois.whois(domain)
    text = w.text or ""
    registrar = re.search(r"Registrar:\s[^R]*", text)
    registrar = registrar.group(0).replace("\r\n", "") if registrar else "n/a"
    creation_date = re.search(r"Creation\sDate:\s[^A-Z]*", text)
    creation_date = creation_date.group(0).replace("\r\n", "") if creation_date else "n/a"
    expiry_date = re.search(r"Registry\sExpiry\sDate:\s[^A-Z]*", text)
    expiry_date = expiry_date.group(0).replace("\r\n", "") if expiry_date else "n/a"
    return registrar, creation_date, expiry_date
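To see what these patterns capture, here is a quick offline check against a made-up WHOIS response (no network needed; the sample text is hypothetical):

```python
import re

sample = ("Registrar: Example Domains, Inc.\r\n"
          "Creation Date: 2001-05-04T00:00:00Z\r\n"
          "Registry Expiry Date: 2031-05-04T00:00:00Z\r\n")

# [^A-Z]* stops at the first capital letter, so the match ends at the "T" in the timestamp
creation = re.search(r"Creation\sDate:\s[^A-Z]*", sample)
print(creation.group(0))  # Creation Date: 2001-05-04
```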
Fetch DNS Records
DNS records — MX (mail servers), NS (name servers), and TXT (text records) — help you understand how a domain is configured. We use dnspython for this task. The function below queries MX, NS, and TXT records and returns them as strings.
def get_dns_records(domain):
    resolver = dns.resolver.Resolver()
    mailservers = ""
    try:
        for rdata in resolver.resolve(domain, "MX"):
            mailservers += rdata.to_text() + "<br>"
    except Exception:
        mailservers = "n/a"
    dnsrecords = ""
    try:
        for rdata in resolver.resolve(domain, "NS"):
            dnsrecords += str(rdata) + "<br>"
    except Exception:
        dnsrecords = "n/a"
    textrecords = ""
    try:
        for rdata in resolver.resolve(domain, "TXT"):
            textrecords += str(rdata) + "<br>"
    except Exception:
        textrecords = "n/a"
    return mailservers, dnsrecords, textrecords
Retrieve SSL Certificate Information
SSL certificates ensure secure communication between users and servers. Using pyOpenSSL, this step retrieves the certificate expiration date and issuer. The function extracts the expiration date and issuer details from the certificate.
def get_ssl_info(domain):
    try:
        cert = ssl.get_server_certificate((domain, 443))
        x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
        expiredate = x509.get_notAfter().decode("ascii")  # e.g. "20250101120000Z"
        date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # Format to MM-DD-YYYY
        issuer = str(x509.get_issuer())
        match = re.search(r"CN=[a-zA-Z0-9\s'-]+", issuer)
        issuer = match.group(0).replace("'", "") if match else "n/a"
        return date, issuer
    except Exception as e:
        return "n/a", str(e)
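The certificate expiry comes back as an ASN.1 GeneralizedTime such as 20250101120000Z (bytes, so it must be decoded first). The slicing that reorders it into MM-DD-YYYY can be checked offline (the timestamp here is made up):

```python
expiredate = b"20250101120000Z".decode("ascii")
date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # MM-DD-YYYY
print(date)  # 01-01-2025
```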
Blacklist Check
Domain and IP blacklist checks are important for security. The script queries the HetrixTools API to determine whether a domain or its associated IP is blacklisted. The function below checks both and returns the blacklist status. Don’t forget to register for a HetrixTools API key.
def get_blacklist_status(domainip, domain):
    hetrixtools_api_key = ""  # replace with your API key
    ipblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/ipv4/{domainip}/"
    domainblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/domain/{domain}/"
    try:
        ipdata = requests.get(ipblacklist_url).json()
        domaindata = requests.get(domainblacklist_url).json()
        ipblacklist = json.dumps(ipdata["blacklisted_on"]).replace("[{", "").replace("}]", "").replace("{", "").replace("}", "").replace("null", "none")
        domainblacklist = json.dumps(domaindata["blacklisted_on"]).replace("[{", "").replace("}]", "").replace("{", "").replace("}", "").replace("null", "none")
        return f"<b>By IP:</b> {ipblacklist} <br><b>By Domain:</b> {domainblacklist}"
    except Exception:
        return "n/a"
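The replace() chain simply flattens the blacklisted_on JSON list into readable text. Here is what it produces on a hypothetical payload (field names and values are made up; the real HetrixTools response may differ). Note that "}]" must be stripped before "}", otherwise a stray "]" survives:

```python
import json

payload = [{"rbl": "rbl-one.example", "delist": None}]
flat = (json.dumps(payload)
        .replace("[{", "").replace("}]", "")
        .replace("{", "").replace("}", "")
        .replace("null", "none"))
print(flat)  # "rbl": "rbl-one.example", "delist": none
```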
Reverse IP Lookup
Reverse IP lookup helps find other domains hosted on the same server. We use an external API for this. Note that this may not work for domains behind Cloudflare, which obscures the true IP. The Hackertarget API is free but please avoid abusive usage to prevent being blocked.
def get_reverse_ip(domainip):
    rip_url = f"https://api.hackertarget.com/reverseiplookup/?q={domainip}"
    try:
        rip_response = requests.get(rip_url)
        return rip_response.text.strip()
    except Exception:
        return "n/a"
Fetch Robots.txt
The robots.txt file provides guidelines for search engines and web crawlers. The function below fetches a domain’s robots.txt file and formats it for easier viewing. It passes verify=False so the fetch still succeeds when a site’s certificate is misconfigured.
def get_robots_txt(url):
    try:
        robots_url = url.rstrip("/") + "/robots.txt"
        response = requests.get(robots_url, verify=False)
        return response.text.replace("\n", "<br>").replace("'", "''")
    except Exception:
        return "n/a"
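The replacements turn the plain-text file into a single storage-friendly string: newlines become <br> tags and single quotes are doubled for safe quoting later. A quick offline check on a made-up robots.txt:

```python
sample = "User-agent: *\nDisallow: /admin\n"
formatted = sample.replace("\n", "<br>").replace("'", "''")
print(formatted)  # User-agent: *<br>Disallow: /admin<br>
```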
SSL Error and TLS Version
Checking for SSL errors and the TLS version is important for secure communication. The function below attempts an SSL connection, returns the TLS version on success, and captures any SSL errors encountered.
def get_ssl_error_and_tls(domain):
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, 443)) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                tls = ssock.version().replace("TLSv", "")
                sslerror = "0"
    except Exception as e:
        sslerror = str(e)
        tls = "0"
    return sslerror, tls
Collect Domain Technology Information with BuiltWith API
In addition to technical data, you can fetch the technology stack and social media links associated with a domain. We use the BuiltWith API to retrieve technologies and social links detected on a site.
This function:
- Queries the BuiltWith API for technology stack information.
- Retrieves the associated social media links for the domain.
def get_technology_info(domain):
    builtwith_api_key = ""  # replace with your API key
    url = f"https://api.builtwith.com/v14/api.json?KEY={builtwith_api_key}&liveonly=yes&LOOKUP={domain}"
    response = requests.get(url)
    data = response.json()
    technology = ""
    social = ""
    for result in data["Results"][0]["Result"]["Paths"]:
        for tech in result["Technologies"]:
            technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
    try:
        for value in data["Results"][0]["Meta"]["Social"]:
            social += f"<a href='{value}'>{value}</a><br>"
    except Exception:
        social = "n/a"
    return technology, social
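To illustrate the response shape the parser expects, here is a trimmed, hypothetical payload (keys follow the access pattern above; values are invented) run through the same loop:

```python
data = {
    "Results": [{
        "Result": {"Paths": [{"Technologies": [
            {"Name": "Nginx", "Link": "https://example.com/nginx", "Description": "Web server"}
        ]}]},
        "Meta": {"Social": ["https://twitter.com/example"]},
    }]
}

technology = ""
for result in data["Results"][0]["Result"]["Paths"]:
    for tech in result["Technologies"]:
        technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
print(technology)
```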
Loop Through Domains and Collect Data
Now we loop through the list of domains, run each function, and store the results in a new DataFrame.
# Loop through records and process data
rows = []
for index, row in df.iterrows():
    domain = row["url"].replace("https://", "").replace("www.", "").replace("/", "")
    domainip = socket.gethostbyname(domain)

    # Collect domain data
    registrar, creation_date, expiry_date = get_whois_info(domain)
    mailservers, dnsrecords, textrecords = get_dns_records(domain)
    ssl_expiry, ssl_issuer = get_ssl_info(domain)
    blacklist = get_blacklist_status(domainip, domain)
    technology, social = get_technology_info(domain)
    reverseip = get_reverse_ip(domainip)
    robots = get_robots_txt(row["url"])
    sslerror, tls = get_ssl_error_and_tls(domain)

    # Collect all data in a dictionary
    rows.append({
        "clientid": row["clientid"],
        "date": datetime.now().strftime("%m/%d/%Y"),
        "domainip": domainip,
        "mailservers": mailservers,
        "whois": f"{registrar}<br>{creation_date}<br>{expiry_date}",
        "dnsrecords": dnsrecords,
        "textrecords": textrecords,
        "sslinfo": f"<b>Expiry Date:</b> {ssl_expiry} <br><b>Issuer:</b> {ssl_issuer}",
        "blacklist": blacklist,
        "tech": technology,
        "social": social,
        "robots": robots,
        "reverseip": reverseip,
        "sslerror": sslerror,
        "tls": tls,
    })

# Build the result DataFrame (DataFrame.append was removed in pandas 2.x)
result_df = pd.DataFrame(rows)

# Display the resulting DataFrame
result_df

# Optionally save the result to a CSV file
result_df.to_csv("collected_domain_info.csv", index=False)
Conclusion
These functions automate collection of important technical domain data: WHOIS, DNS records, SSL certificate information, reverse IP lookup results, blacklist status, technology stack, and social media links via the BuiltWith API. The script stores all results in a structured CSV, making it straightforward to analyze, monitor, and report on domain status.
By leveraging Python libraries such as pandas, requests, dnspython, and pyOpenSSL, this script automates domain monitoring tasks and helps you stay informed about the technical health and configuration of domains you manage or monitor.
Follow me at: https://www.linkedin.com/in/gregbernhardt/