
Web Scraping notes

A melange of crawlers, spiders and web testing stuff :earth_asia:.

Table of contents

  1. Resources
  2. scrapely: scraping without selectors
  3. puppeteer: web scraping with Node.js
  4. pyppeteer: python puppeteer port
    1. .secrets.cfg
    2. Example: pyppeteer - Log in to Protonmail with 2fa
      1. Packages
      2. .secrets.cfg
      3. Running main.py
  5. Web Alert: Android Website Monitor
  6. Last Hit: testing automation GUI
  7. Cypress.io: Easy testing with GUI

Resources

| Name | Description |
| --- | --- |
| AutoCrawler | Google and Naver multiprocess image web crawler (Selenium) |
| gremlins.js | Monkey testing library for web apps and Node.js |
| hakrawler | Quick discovery of endpoints and assets within a web application |
| httplab | HTTPLab lets you inspect HTTP requests and forge responses |
| Jaeles | Web application scanner framework written in Go |
| OpenBullet | .NET web testing suite to perform requests towards a target web app |
| OWASP Cheat Sheet Series | Collection of information on specific application security topics |
| OWASP-Web-Checklist | OWASP Web Application Security Testing Checklist |
| owtf | Framework which tries to unite great tools and make pen testing more efficient |
| Puppetry | Web testing solution for non-developers on top of Puppeteer and Jest |
| Robot Framework | Automation framework for acceptance testing and RPA |
| Robotcorder | Chrome extension that generates Robot Framework test scripts |
| SeleniumBase | Easy web automation and testing with Python |
| Splinter | Python test framework for web applications |
| stubby4j | HTTP stub server for testing interactions of SOA apps with web services |
| TestCafe | Node.js tool to automate end-to-end web testing |
| The Pappy Proxy | Intercepting proxy for performing web application security testing |
| TIDoS-Framework | Offensive web application audit framework |
| web-ext | Mozilla's CLI to help build, run, and test web extensions |

scrapely: scraping without selectors

Web scraping for dummies (well, myself).

>>> from scrapely import Scraper
>>> s = Scraper()
>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)
>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

Source: :octocat: scrapy/scrapely


puppeteer: web scraping with Node.js

More info: :octocat: puppeteer/puppeteer


pyppeteer: python puppeteer port

#!/usr/bin/env python3

import asyncio
import configparser
import random

import mintotp
from pyppeteer import launch

# protonmail login site url
url = 'https://mail.protonmail.com/login'

# parsing values from secrets.cfg file
config = configparser.ConfigParser()
config.read('.secrets.cfg')
username = config.get("credentials", "username")
password = config.get("credentials", "password")

# otp_seed is None if no value has been set in .secrets.cfg
otp_seed = config.get("credentials", "otp_seed", fallback=None)

# css selectors definition
username_selector = '#username'
password_selector = '#password'
otp_selector = '#twoFactorCode'
loginBtn_selector = '#login_btn'
loginBtn2fa_selector = '#login_btn_2fa'
firstMessage_selector = 'div.conversation:nth-child(1) > \
    div:nth-child(5) > h4:nth-child(1)'

# screenshot file name settings
screenshotName = 'protonmail'
screenshotPath = "./screenshots/"
screenshotFileName = None
screenshotCount = 1

# pyppeteer settings
windowWidth = 1768
windowHeight = 1024
minClickTime = 700     # Min click delay (ms)
maxClickTime = 2000    # Max click delay (ms)


def randomNum(minTime, maxTime):
    # Random click delay in ms, taken from the range [minTime, maxTime)
    return random.randrange(minTime, maxTime)


def takeScreenshot():
    # Only computes and logs the next screenshot file name;
    # the caller takes the actual screenshot with page.screenshot()
    global screenshotCount
    global screenshotFileName
    screenshotFileName = screenshotPath + str(screenshotCount) + \
        '-' + screenshotName + '.png'
    print(' > Screenshot ' + str(screenshotCount) +
          ': ' + screenshotFileName)
    screenshotCount = screenshotCount + 1


async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.setViewport({'width': windowWidth, 'height': windowHeight})
    await page.emulateMedia('screen')

    # Open login site
    await page.goto(url)
    takeScreenshot()
    await page.screenshot({'path': screenshotFileName,
                           'fullPage': False,
                           'webkit-print-color-adjust': True})

    # Enter username
    await page.click(username_selector,
                     delay=randomNum(minClickTime, maxClickTime))
    await page.keyboard.type(username)

    # Enter password
    await page.click(password_selector,
                     delay=randomNum(minClickTime, maxClickTime))
    await page.keyboard.type(password)

    # Click login button
    await page.click(loginBtn_selector,
                     delay=randomNum(minClickTime, maxClickTime))

    # Enter the OTP if otp_seed is configured
    if otp_seed is not None:
        await page.waitForSelector(otp_selector)
        await page.click(otp_selector,
                         delay=randomNum(minClickTime, maxClickTime))

        currentOtp = mintotp.totp(otp_seed)
        await page.keyboard.type(currentOtp)
        print(" > OTP: " + currentOtp)
        takeScreenshot()
        await page.screenshot({'path': screenshotFileName,
                               'fullPage': False,
                               'webkit-print-color-adjust': True})
        await page.click(loginBtn2fa_selector,
                         delay=randomNum(minClickTime, maxClickTime))
    else:
        print(" > No otp_seed found")

    # Opening inbox
    await page.waitForSelector(firstMessage_selector)
    takeScreenshot()
    await page.screenshot({'path': screenshotFileName,
                           'fullPage': False,
                           'webkit-print-color-adjust': True})
    await browser.close()
    print(" > Browser closed")

asyncio.get_event_loop().run_until_complete(main())

More info: :octocat: miyakogi/pyppeteer
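
The script above tracks screenshot file names through two module-level globals. As an alternative sketch (not part of the original script), a generator can produce the same numbered names and also create the target directory when it is missing. The name `screenshot_paths` and its parameters are hypothetical, chosen only for illustration.

```python
import itertools
from pathlib import Path


def screenshot_paths(directory="./screenshots", name="protonmail"):
    """Yield numbered paths: 1-<name>.png, 2-<name>.png, ..."""
    folder = Path(directory)
    # Runs on the first next() call; creates the folder if it does not exist.
    folder.mkdir(parents=True, exist_ok=True)
    for count in itertools.count(1):
        yield folder / f"{count}-{name}.png"


# Assumed usage inside the coroutine (pyppeteer expects a string path):
#   paths = screenshot_paths()
#   await page.screenshot({'path': str(next(paths)), 'fullPage': False})
```

Keeping the counter inside the generator avoids the `global` statements, at the cost of passing the generator to whatever code takes screenshots.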

.secrets.cfg

[credentials]
username = <USER>@protonmail.ch
password = <PASSWORD>
# otp_seed is only required if 2fa is enabled.
otp_seed = <OTP_SEED>

Example: pyppeteer - Log in to Protonmail with 2fa

:octocat: jorgerance/pyppeteer-protonmail

Using pyppeteer, an unofficial Python port of the puppeteer JavaScript (headless) Chrome/Chromium browser automation library, to log in to Protonmail.


Packages

| Package | Version | Description |
| --- | --- | --- |
| pyppeteer | 0.0.25 | Headless Chrome/Chromium automation library (unofficial port of puppeteer) |
| mintotp | 0.2.0 | Minimal TOTP generator |
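
For context on what a TOTP generator such as mintotp computes, here is a minimal standard-library sketch of the RFC 6238 algorithm, assuming the common defaults (HMAC-SHA1, 30-second time step, 6 digits). `totp_sketch` is a hypothetical name used for illustration; this is not mintotp's implementation.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp_sketch(key_b32, now=None, time_step=30, digits=6):
    """Return the TOTP code for a Base32-encoded secret (RFC 6238, SHA-1)."""
    if now is None:
        now = time.time()
    # The moving factor is the number of time steps since the Unix epoch.
    counter = int(now // time_step)
    # Base32 secrets are often stored without padding; restore it before decoding.
    cleaned = key_b32.replace(" ", "")
    padded = cleaned + "=" * (-len(cleaned) % 8)
    key = base64.b32decode(padded, casefold=True)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks the offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test secret "12345678901234567890", Base32-encoded, at t = 59 s.
print(totp_sketch("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", now=59))  # → 287082
```

Because the code depends on the current time, the machine running the login script needs a reasonably accurate clock.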

.secrets.cfg

Enter your login credentials in a .secrets.cfg file following an .ini format, which will be parsed by configparser, as in the example below:

[credentials]
username = user@protonmail.com
password = user_password
otp_seed = 1234567890QWERTYUIOPASDFGHJKLZXCV
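
A minimal sketch of how such a file can be read with configparser, assuming the section and key names shown above. `load_credentials` is a hypothetical helper, and the text is parsed from a string here only to keep the example self-contained. Two details are worth knowing: configparser does not strip inline comments by default, so a comment after a value becomes part of the value, and the default interpolation treats `%` specially, which can break passwords containing that character.

```python
import configparser

SAMPLE = """\
[credentials]
username = user@protonmail.com
password = user_password
otp_seed = 1234567890QWERTYUIOPASDFGHJKLZXCV
"""


def load_credentials(text):
    """Parse .ini-style text and return the login values as a dict."""
    # interpolation=None keeps '%' in passwords literal.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    section = config["credentials"]
    return {
        "username": section["username"],
        "password": section["password"],
        # otp_seed is optional: None means 2fa is not configured.
        "otp_seed": section.get("otp_seed", fallback=None),
    }


creds = load_credentials(SAMPLE)
print(creds["username"])
```

For a real file, `config.read('.secrets.cfg')` would replace `read_string`; note that `read` silently skips files that do not exist.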

Running main.py

0 ✓ steve@hal9000 ~/repos/pyppeteer-protonmail $ ./main.py
 > Screenshot 1: ./screenshots/1-protonmail.png
 > OTP: 123456
 > Screenshot 2: ./screenshots/2-protonmail.png
 > Screenshot 3: ./screenshots/3-protonmail.png
 > Browser closed
0 ✓ steve@hal9000 ~/repos/pyppeteer-protonmail $

Expected output files

0 ✓ steve@hal9000 ~/repos/pyppeteer-protonmail/screenshots $ ls -l
total 1234
-rw-r--r--  1 steve  bluejeans  123456 Jun 28 11:11 1-protonmail.png
-rw-r--r--  1 steve  bluejeans  123456 Jun 28 11:11 2-protonmail.png
-rw-r--r--  1 steve  bluejeans   12345 Jun 28 11:11 3-protonmail.png
0 ✓ steve@hal9000 ~/repos/pyppeteer-protonmail/screenshots $

Screenshot: 1-protonmail.png

Screenshot: 2-protonmail.png

Screenshot: 3-protonmail.png


Web Alert: Android Website Monitor

Web Alert screenshot Google Play.

Web Alert lets you monitor any website (or specific parts of it) you wish in order to be notified when it is updated. It even works when a login, a form post or password prompt is necessary to access the site. For example, get notified when a price changes, a new article is published, you receive exam results or an answer in a forum, a registration period has opened, etc. You can also check if your own website is currently online and working correctly, or use it for UI testing and web monitoring.
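
Web Alert's internals are not described here, but the basic idea behind this kind of monitoring can be sketched: store a fingerprint of the page (or of the part you care about) and compare it on the next check. The function names below are hypothetical, and fetching the page is left out so the sketch stays self-contained.

```python
import hashlib


def fingerprint(content):
    """Return a stable SHA-256 fingerprint of the monitored content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, content):
    """Compare content against the stored fingerprint.

    Returns (changed, current_fingerprint) so the caller can store the
    new fingerprint for the next check.
    """
    current = fingerprint(content)
    return current != previous_fingerprint, current


# First check: nothing stored yet, so the content counts as changed.
changed, stored = has_changed(None, "<span class='price'>10.00</span>")
# Later check with a different price: reported as changed again.
changed_again, stored = has_changed(stored, "<span class='price'>9.50</span>")
```

In practice the monitored fragment is usually extracted first, so that unrelated parts of the page such as timestamps or ads do not trigger a notification.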


Last Hit: testing automation GUI


Last-hit is an automation testing solution aimed at development and operations teams. It focuses on web testing and gives you broad, deep and exact control over the automated testing of your web apps.

A quick guide to setting up and starting your first automation test with last-hit, a free test automation tool built on top of Electron and Puppeteer. You can now begin automation testing on web and mobile with the least amount of effort.

Source: :octocat: last-hit-aab/last-hit


Cypress.io: Easy testing with GUI

Fast, easy and reliable testing for anything that runs in a browser.



Source: :octocat: cypress-io/cypress