Collect Domain Security Information with Python
In this tutorial, we will learn how to automate the collection of various domain-related technical information using Python. The script gathers data such as WHOIS details, DNS records, SSL certificates, reverse IP lookup, blacklist status, robots.txt, and more. Using the pandas library, we also show how to store the collected data in a CSV file. This tutorial is ideal for anyone interested in automating domain monitoring or data collection tasks. Typical users include SEO specialists, webmasters, web hosts, and competitive data analysts.
Requirements
Before you begin, ensure Python is installed on your system or notebook. You’ll also need the following Python packages:
- pandas – For working with data and saving results to CSV.
- requests – To handle HTTP requests to external APIs and websites.
- dnspython – For querying DNS records.
- pyOpenSSL – For retrieving SSL certificate information.
- python-whois – To fetch WHOIS details for a domain (imported in code as whois).
You can install these libraries using pip:
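Everything installs in one command (note that the WHOIS library's PyPI name is python-whois, even though it is imported as whois):

```shell
pip install pandas requests dnspython pyOpenSSL python-whois
```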
Load Domain Data
Start by loading a CSV file that lists the domains you want to process. The CSV should include a "clientid" column and a "url" column containing the domain URLs (these are the column names the loop at the end of the tutorial expects). The resulting DataFrame lets you iterate over each domain to collect information.
import json
import re
import socket
import ssl
from datetime import datetime

import dns.resolver
import OpenSSL
import pandas as pd
import requests
import whois

csv_file = "client_data.csv"
df = pd.read_csv(csv_file)
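For reference, a minimal client_data.csv might look like this (the values are made up; only the clientid and url column names matter):

```
clientid,url
101,https://www.example.com/
102,https://example.org/
```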
Collect WHOIS Information
WHOIS data provides domain details such as the registrar, creation date, and expiration date. The whois module fetches this data. The function below extracts the relevant fields and returns the registrar, creation date, and expiry date.
def get_whois_info(domain):
    w = whois.whois(domain)
    text = w.text or ""
    registrar = re.search(r"Registrar:\s[^R]*", text)
    registrar = registrar.group(0).replace("\r\n", "") if registrar else "n/a"
    creation_date = re.search(r"Creation\sDate:\s[^A-Z]*", text)
    creation_date = creation_date.group(0).replace("\r\n", "") if creation_date else "n/a"
    expiry_date = re.search(r"Registry\sExpiry\sDate:\s[^A-Z]*", text)
    expiry_date = expiry_date.group(0).replace("\r\n", "") if expiry_date else "n/a"
    return registrar, creation_date, expiry_date
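To see what these patterns capture, here is a quick offline check against a made-up WHOIS response (no network needed; the sample text is hypothetical):

```python
import re

sample = ("Registrar: Example Domains, Inc.\r\n"
          "Creation Date: 2001-05-04T00:00:00Z\r\n"
          "Registry Expiry Date: 2031-05-04T00:00:00Z\r\n")

# [^A-Z]* stops at the first capital letter, so the match ends at the "T" in the timestamp
creation = re.search(r"Creation\sDate:\s[^A-Z]*", sample)
print(creation.group(0))  # Creation Date: 2001-05-04
```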
Fetch DNS Records
DNS records — MX (mail servers), NS (name servers), and TXT (text records) — help you understand how a domain is configured. We use dnspython for this task. The function below queries MX, NS, and TXT records and returns them as strings.
def get_dns_records(domain):
    resolver = dns.resolver.Resolver()
    mailservers = ""
    try:
        for rdata in resolver.resolve(domain, "MX"):
            mailservers += rdata.to_text() + "<br>"
    except Exception:
        mailservers = "n/a"
    dnsrecords = ""
    try:
        for rdata in resolver.resolve(domain, "NS"):
            dnsrecords += str(rdata) + "<br>"
    except Exception:
        dnsrecords = "n/a"
    textrecords = ""
    try:
        for rdata in resolver.resolve(domain, "TXT"):
            textrecords += str(rdata) + "<br>"
    except Exception:
        textrecords = "n/a"
    return mailservers, dnsrecords, textrecords
Retrieve SSL Certificate Information
SSL certificates ensure secure communication between users and servers. Using pyOpenSSL, this step retrieves the certificate expiration date and issuer. The function extracts the expiration date and issuer details from the certificate.
def get_ssl_info(domain):
    try:
        cert = ssl.get_server_certificate((domain, 443))
        x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
        expiredate = x509.get_notAfter().decode("ascii")  # e.g. "20250101120000Z"
        date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # Format to MM-DD-YYYY
        issuer = str(x509.get_issuer())
        match = re.search(r"CN=[a-zA-Z0-9\s'-]+", issuer)
        issuer = match.group(0).replace("'", "") if match else "n/a"
        return date, issuer
    except Exception as e:
        return "n/a", str(e)
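The certificate expiry comes back as an ASN.1 GeneralizedTime such as 20250101120000Z (bytes, so it must be decoded first). The slicing that reorders it into MM-DD-YYYY can be checked offline (the timestamp here is made up):

```python
expiredate = b"20250101120000Z".decode("ascii")
date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # MM-DD-YYYY
print(date)  # 01-01-2025
```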
Blacklist Check
Domain and IP blacklist checks are important for security. The script queries the HetrixTools API to determine whether a domain or its associated IP is blacklisted. The function below checks both and returns the blacklist status. Don’t forget to register for a HetrixTools API key.
def get_blacklist_status(domainip, domain):
    hetrixtools_api_key = ""  # replace with your API key
    ipblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/ipv4/{domainip}/"
    domainblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/domain/{domain}/"
    try:
        ipdata = requests.get(ipblacklist_url).json()
        domaindata = requests.get(domainblacklist_url).json()
        ipblacklist = json.dumps(ipdata["blacklisted_on"]).replace("[{", "").replace("}]", "").replace("{", "").replace("}", "").replace("null", "none")
        domainblacklist = json.dumps(domaindata["blacklisted_on"]).replace("[{", "").replace("}]", "").replace("{", "").replace("}", "").replace("null", "none")
        return f"<b>By IP:</b> {ipblacklist} <br><b>By Domain:</b> {domainblacklist}"
    except Exception:
        return "n/a"
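The replace() chain simply flattens the blacklisted_on JSON list into readable text. Here is what it produces on a hypothetical payload (field names and values are made up; the real HetrixTools response may differ). Note that "}]" must be stripped before "}", otherwise a stray "]" survives:

```python
import json

payload = [{"rbl": "rbl-one.example", "delist": None}]
flat = (json.dumps(payload)
        .replace("[{", "").replace("}]", "")
        .replace("{", "").replace("}", "")
        .replace("null", "none"))
print(flat)  # "rbl": "rbl-one.example", "delist": none
```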
Reverse IP Lookup
Reverse IP lookup helps find other domains hosted on the same server. We use an external API for this. Note that this may not work for domains behind Cloudflare, which obscures the true IP. The Hackertarget API is free but please avoid abusive usage to prevent being blocked.
def get_reverse_ip(domainip):
    rip_url = f"https://api.hackertarget.com/reverseiplookup/?q={domainip}"
    try:
        rip_response = requests.get(rip_url)
        return rip_response.text.strip()
    except Exception:
        return "n/a"
Fetch Robots.txt
The robots.txt file provides guidelines for search engines and web crawlers. The function below fetches a domain’s robots.txt file and formats it for easier viewing. It passes verify=False so the fetch still succeeds when a site’s certificate is misconfigured.
def get_robots_txt(url):
    try:
        robots_url = url.rstrip("/") + "/robots.txt"
        response = requests.get(robots_url, verify=False)
        return response.text.replace("\n", "<br>").replace("'", "''")
    except Exception:
        return "n/a"
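The replacements turn the plain-text file into a single storage-friendly string: newlines become <br> tags and single quotes are doubled for safe quoting later. A quick offline check on a made-up robots.txt:

```python
sample = "User-agent: *\nDisallow: /admin\n"
formatted = sample.replace("\n", "<br>").replace("'", "''")
print(formatted)  # User-agent: *<br>Disallow: /admin<br>
```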
SSL Error and TLS Version
Checking for SSL errors and the TLS version is important for secure communication. The function below attempts an SSL connection, returns the TLS version on success, and captures any SSL errors encountered.
def get_ssl_error_and_tls(domain):
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, 443)) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                tls = ssock.version().replace("TLSv", "")
                sslerror = "0"
    except Exception as e:
        sslerror = str(e)
        tls = "0"
    return sslerror, tls
Collect Domain Technology Information with BuiltWith API
In addition to technical data, you can fetch the technology stack and social media links associated with a domain. We use the BuiltWith API to retrieve technologies and social links detected on a site.
This function:
- Queries the BuiltWith API for technology stack information.
- Retrieves the associated social media links for the domain.
def get_technology_info(domain):
    builtwith_api_key = ""  # replace with your API key
    url = f"https://api.builtwith.com/v14/api.json?KEY={builtwith_api_key}&liveonly=yes&LOOKUP={domain}"
    response = requests.get(url)
    data = response.json()
    technology = ""
    social = ""
    for result in data["Results"][0]["Result"]["Paths"]:
        for tech in result["Technologies"]:
            technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
    try:
        for value in data["Results"][0]["Meta"]["Social"]:
            social += f"<a href='{value}'>{value}</a><br>"
    except Exception:
        social = "n/a"
    return technology, social
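To illustrate the response shape the parser expects, here is a trimmed, hypothetical payload (keys follow the access pattern above; values are invented) run through the same loop:

```python
data = {
    "Results": [{
        "Result": {"Paths": [{"Technologies": [
            {"Name": "Nginx", "Link": "https://example.com/nginx", "Description": "Web server"}
        ]}]},
        "Meta": {"Social": ["https://twitter.com/example"]},
    }]
}

technology = ""
for result in data["Results"][0]["Result"]["Paths"]:
    for tech in result["Technologies"]:
        technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
print(technology)
```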
Loop Through Domains and Collect Data
Now we loop through the list of domains, run each function, and store the results in a new DataFrame.
# Loop through records and process data
rows = []
for index, row in df.iterrows():
    domain = row["url"].replace("https://", "").replace("www.", "").replace("/", "")
    domainip = socket.gethostbyname(domain)

    # Collect domain data
    registrar, creation_date, expiry_date = get_whois_info(domain)
    mailservers, dnsrecords, textrecords = get_dns_records(domain)
    ssl_expiry, ssl_issuer = get_ssl_info(domain)
    blacklist = get_blacklist_status(domainip, domain)
    technology, social = get_technology_info(domain)
    reverseip = get_reverse_ip(domainip)
    robots = get_robots_txt(row["url"])
    sslerror, tls = get_ssl_error_and_tls(domain)

    # Collect all data in a dictionary
    rows.append({
        "clientid": row["clientid"],
        "date": datetime.now().strftime("%m/%d/%Y"),
        "domainip": domainip,
        "mailservers": mailservers,
        "whois": f"{registrar}<br>{creation_date}<br>{expiry_date}",
        "dnsrecords": dnsrecords,
        "textrecords": textrecords,
        "sslinfo": f"<b>Expiry Date:</b> {ssl_expiry} <br><b>Issuer:</b> {ssl_issuer}",
        "blacklist": blacklist,
        "tech": technology,
        "social": social,
        "robots": robots,
        "reverseip": reverseip,
        "sslerror": sslerror,
        "tls": tls,
    })

# Build the result DataFrame (DataFrame.append was removed in pandas 2.x)
result_df = pd.DataFrame(rows)

# Display the resulting DataFrame
result_df

# Optionally save the result to a CSV file
result_df.to_csv("collected_domain_info.csv", index=False)
Conclusion
These functions automate collection of important technical domain data: WHOIS, DNS records, SSL certificate information, reverse IP lookup results, blacklist status, technology stack, and social media links via the BuiltWith API. The script stores all results in a structured CSV, making it straightforward to analyze, monitor, and report on domain status.
By leveraging Python libraries such as pandas, requests, dnspython, and pyOpenSSL, this script automates domain monitoring tasks and helps you stay informed about the technical health and configuration of domains you manage or monitor.
Follow me at: https://www.linkedin.com/in/gregbernhardt/