Web Scraping notes
A melange of crawlers, spiders and web testing stuff
Table of contents
- Resources
- scrapely: scraping without selectors
- puppeteer: web scraping with Node.js
- pyppeteer: python puppeteer port
- Web Alert: Android Website Monitor
- Last Hit: testing automation GUI
- Cypress.io: Easy testing with GUI
Resources
| Name | Description |
|---|---|
| AutoCrawler | Google, Naver multiprocess image web crawler (Selenium) |
| gremlins.js | Monkey testing library for web apps and Node.js |
| hakrawler | Quick discovery of endpoints and assets within a web application |
| httplab | HTTPLab lets you inspect HTTP requests and forge responses |
| Jaeles | Web Application Scanner framework written in Go |
| OpenBullet | .NET web testing suite for performing requests against a target web app |
| OWASP Cheat Sheet Series | Collection of information on specific application security topics |
| OWASP-Web-Checklist | OWASP Web Application Security Testing Checklist |
| owtf | Framework which tries to unite great tools and make pen testing more efficient |
| Puppetry | Web testing solution for non-developers on top of Puppeteer and Jest |
| Robot Framework | Automation framework for acceptance testing and RPA |
| Robotcorder | Chrome extension that generates RobotFramework test scripts |
| SeleniumBase | Easy Web Automation and Testing with Python |
| Splinter | Python test framework for web applications |
| stubby4j | HTTP stub server testing interactions of SOA apps with web services |
| TestCafe | Node.js tool to automate end-to-end web testing |
| The Pappy Proxy | Intercepting proxy for performing web application security testing |
| TIDoS-Framework | Offensive web application audit framework |
| web-ext | Mozilla's CLI to help build, run, and test web extensions |
scrapely: scraping without selectors
Web scraping for dummies (well, myself).
>>> from scrapely import Scraper
>>> s = Scraper()
>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)
>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
u'name': [u'Django 1.3']}]
Source:
scrapy/scrapely
puppeteer: web scraping with Node.js
More info:
puppeteer/puppeteer
pyppeteer: python puppeteer port
#!/usr/bin/env python3
import asyncio
import configparser
import random
import mintotp
from pyppeteer import launch

# Protonmail login page URL
url = 'https://mail.protonmail.com/login'

# Parse values from the .secrets.cfg file
config = configparser.ConfigParser()
config.read('.secrets.cfg')
username = config.get("credentials", "username")
password = config.get("credentials", "password")
# otp_seed = None if no value has been set in .secrets.cfg
try:
    otp_seed = config.get("credentials", "otp_seed")
except Exception:
    otp_seed = None

# CSS selector definitions
username_selector = '#username'
password_selector = '#password'
otp_selector = '#twoFactorCode'
loginBtn_selector = '#login_btn'
loginBtn2fa_selector = '#login_btn_2fa'
firstMessage_selector = 'div.conversation:nth-child(1) > \
div:nth-child(5) > h4:nth-child(1)'

# Screenshot file name settings
screenshotName = 'protonmail'
screenshotPath = "./screenshots/"
screenshotFileName = None
screenshotCount = 1

# pyppeteer settings
windowWidth = 1768
windowHeight = 1024
minClickTime = 700  # Min click delay (ms)
maxClickTime = 2000  # Max click delay (ms)


def randomNum(minTime, maxTime):
    # Random click delay (ms) in the range [minTime, maxTime)
    clickDelayMs = random.randrange(minTime, maxTime, 1)
    return clickDelayMs


def takeScreenshot():
    # Only builds the next file name; page.screenshot() does the capture
    global screenshotCount
    global screenshotFileName
    screenshotFileName = screenshotPath + str(screenshotCount) + \
        '-' + screenshotName + '.png'
    print(' > Screenshot ' + str(screenshotCount) +
          ': ' + screenshotFileName)
    screenshotCount = screenshotCount + 1
    return


async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.setViewport({'width': windowWidth, 'height': windowHeight})
    await page.emulateMedia('screen')

    # Open login site
    await page.goto(url)
    takeScreenshot()
    await page.screenshot({'path': screenshotFileName,
                           'fullPage': False,
                           'webkit-print-color-adjust': True})

    # Enter username
    await page.click(username_selector,
                     delay=randomNum(minClickTime, maxClickTime))
    await page.keyboard.type(username)

    # Enter password
    await page.click(password_selector,
                     delay=randomNum(minClickTime, maxClickTime))
    await page.keyboard.type(password)

    # Click login button
    await page.click(loginBtn_selector,
                     delay=randomNum(minClickTime, maxClickTime))

    # Enter the OTP if otp_seed is configured
    if otp_seed is not None:
        await page.waitForSelector(otp_selector)
        await page.click(otp_selector,
                         delay=randomNum(minClickTime, maxClickTime))
        currentOtp = mintotp.totp(otp_seed)
        await page.keyboard.type(currentOtp)
        print(" > OTP: " + currentOtp)
        takeScreenshot()
        await page.screenshot({'path': screenshotFileName,
                               'fullPage': False,
                               'webkit-print-color-adjust': True})
        await page.click(loginBtn2fa_selector,
                         delay=randomNum(minClickTime, maxClickTime))
    else:
        print(" > No otp_seed found")

    # Open the inbox
    await page.waitForSelector(firstMessage_selector)
    takeScreenshot()
    await page.screenshot({'path': screenshotFileName,
                           'fullPage': False,
                           'webkit-print-color-adjust': True})
    await browser.close()
    print(" > Browser closed")


asyncio.get_event_loop().run_until_complete(main())
More info:
miyakogi/pyppeteer
.secrets.cfg
[credentials]
username = <USER>@protonmail.ch
password = <PASSWORD>
# otp_seed is only required if 2FA is enabled.
otp_seed = <OTP_SEED>
Example: pyppeteer - Log in to Protonmail with 2fa
jorgerance/pyppeteer-protonmail
Using pyppeteer, an unofficial Python port of the puppeteer JavaScript (headless) Chrome/Chromium browser automation library, to log in to Protonmail.
Packages
| Package | Version | Description |
|---|---|---|
| pyppeteer | 0.0.25 | Headless chrome/chromium automation library (unofficial port of puppeteer) |
| mintotp | 0.2.0 | Minimal TOTP Generator |
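To make the role of the TOTP generator less of a black box, here is a minimal standard-library sketch of how an RFC 6238 code can be derived from a base32 seed. It is not mintotp's source; the function name `totp` and its parameters (`now`, `step`, `digits`) are assumptions chosen for illustration, and it assumes HMAC-SHA1 with a 30-second step.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(key_b32, now=None, step=30, digits=6):
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) from a base32 seed."""
    if now is None:
        now = time.time()
    # Pad the seed to a multiple of 8 characters so b32decode accepts it.
    padded = key_b32.upper() + '=' * (-len(key_b32) % 8)
    key = base64.b32decode(padded)
    # The moving factor is the number of whole time steps since the epoch.
    counter = struct.pack('>Q', int(now) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte
    # selects a 4-byte window; the top bit of that window is masked off.
    offset = digest[-1] & 0x0F
    number = struct.unpack('>L', digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(number % 10 ** digits).zfill(digits)


# RFC 6238 test secret "12345678901234567890", base32-encoded.
print(totp('GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ', now=59))
```

The `now` argument exists only so the function can be checked against the published RFC test vectors; in normal use it would be left out so the current time is used.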
.secrets.cfg
Enter your login credentials in a .secrets.cfg file following an .ini format, which will be parsed by configparser, as in the example below:
[credentials]
username = user@protonmail.com
password = user_password
otp_seed = 1234567890QWERTYUIOPASDFGHJKLZXCV
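As a rough sketch of how such a file can be parsed, the snippet below reads the `[credentials]` section with configparser and treats `otp_seed` as optional. The helper name `load_credentials` and the inline `SAMPLE` text are assumptions for illustration; a real script would call `config.read('.secrets.cfg')` instead of `read_string`.

```python
import configparser

# Hypothetical sample standing in for the contents of .secrets.cfg,
# here without an otp_seed entry (2FA disabled).
SAMPLE = """
[credentials]
username = user@protonmail.com
password = user_password
"""


def load_credentials(text):
    """Return the credentials as a dict; otp_seed is None when absent."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config['credentials']
    return {
        'username': section['username'],
        'password': section['password'],
        # SectionProxy.get() falls back to None for a missing option.
        'otp_seed': section.get('otp_seed'),
    }


creds = load_credentials(SAMPLE)
print(creds['username'], creds['otp_seed'])
```

Using `section.get()` avoids wrapping the lookup in a try/except block when the option is simply missing.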
Running main.py
0 β steve@hal9000 ~/repos/pyppeteer-protonmail $ ./main.py
> Screenshot 1: ./screenshots/1-protonmail.png
> OTP: 123456
> Screenshot 2: ./screenshots/2-protonmail.png
> Screenshot 3: ./screenshots/3-protonmail.png
> Browser closed
0 β steve@hal9000 ~/repos/pyppeteer-protonmail $
Expected output files
0 β steve@hal9000 ~/repos/pyppeteer-protonmail/screenshots $ ls -l
total 1234
-rw-r--r-- 1 steve bluejeans 123456 Jun 28 11:11 1-protonmail.png
-rw-r--r-- 1 steve bluejeans 123456 Jun 28 11:11 2-protonmail.png
-rw-r--r-- 1 steve bluejeans 12345 Jun 28 11:11 3-protonmail.png
0 β steve@hal9000 ~/repos/pyppeteer-protonmail/screenshots $
Screenshot: 1-protonmail.png

Screenshot: 2-protonmail.png

Screenshot: 3-protonmail.png

Web Alert: Android Website Monitor

Web Alert lets you monitor any website (or specific parts of it) you wish in order to be notified when it is updated. It even works when a login, a form post or password prompt is necessary to access the site. For example, get notified when a price changes, a new article is published, you receive exam results or an answer in a forum, a registration period has opened, etc. You can also check if your own website is currently online and working correctly, or use it for UI testing and web monitoring.
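The same basic idea, noticing that a page has changed, can be sketched in a few lines of Python by comparing content fingerprints between checks. This is only an illustration of the concept and says nothing about how Web Alert works internally; the names `fingerprint` and `has_changed` are made up for this note, and fetching the page is left out.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying this version of the page."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_fingerprint, content: bytes) -> bool:
    """True if content differs from the previously stored fingerprint.

    A first check (previous_fingerprint is None) is not reported as a change.
    """
    if previous_fingerprint is None:
        return False
    return previous_fingerprint != fingerprint(content)


# Two snapshots of the same (hypothetical) page, taken at different times.
first = b"<p>Price: 10 EUR</p>"
second = b"<p>Price: 9 EUR</p>"
stored = fingerprint(first)
print(has_changed(stored, first), has_changed(stored, second))
```

Hashing the whole response is crude: any rotating ad or timestamp triggers a change, so in practice it helps to extract the relevant part of the page before fingerprinting it.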
Last Hit: testing automation GUI

Last-hit is an automation testing solution aimed at development and operations teams. It focuses on web testing and gives you broad, deep and exact control over the automated testing of your web apps.
last-hit is a free test automation tool built on top of Electron and Puppeteer. Its quick guide walks you through setting up and starting your first automation test, so you can begin automation testing on web and mobile with little effort.
Source:
last-hit-aab/last-hit
Cypress.io: Easy testing with GUI

Fast, easy and reliable testing for anything that runs in a browser.
Source:
cypress-io/cypress