baymac/x-article-pdf

x-scraper

x-scraper is a focused CLI for exporting X longform articles to PDF. It reuses a locally saved X browser session, loads the article in Playwright, extracts the article body into a print-safe standalone document, inlines user-added media, and renders the result to PDF.

Setup

uv is a fast Python package and environment manager. This project uses it for dependency installation, virtualenv management, and command execution, so the recommended workflow runs everything through uv.

Install uv first:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then set up the project:

uv sync
uv run playwright install chromium

Commands

Log into X

uv run x-scraper login

This opens a real browser window to https://x.com/login. After you complete login, the tool saves the browser state to auth/x_browser_state.json.

Export an article to PDF

uv run x-scraper article-pdf "https://x.com/i/articles/your-article-id"

If you do not pass --output, PDFs are written to out/ automatically.

You can also pass an explicit output path:

uv run x-scraper article-pdf "https://x.com/handle/article/123" -o ./article.pdf

Supported URL shapes:

  • https://x.com/i/articles/...
  • https://x.com/<handle>/article/...
  • https://x.com/<handle>/status/... when the loaded page actually renders longform article content
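The three URL shapes above can be expressed as a small pre-check. This is only a sketch of the idea, not the project's actual code: the function name classify_article_url, the pattern table, and the choice to accept only the bare x.com host over https are assumptions made for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical patterns mirroring the three URL shapes listed in this README.
_ARTICLE_PATTERNS = {
    "i_articles": re.compile(r"^/i/articles/[^/]+/?$"),
    "handle_article": re.compile(r"^/[^/]+/article/[^/]+/?$"),
    "handle_status": re.compile(r"^/[^/]+/status/[^/]+/?$"),
}


def classify_article_url(url: str) -> Optional[str]:
    """Return the name of the matching URL shape, or None if nothing matches."""
    parsed = urlparse(url)
    # Assumption: only https://x.com is accepted (no subdomains, no http).
    if parsed.scheme != "https" or parsed.hostname != "x.com":
        return None
    for shape, pattern in _ARTICLE_PATTERNS.items():
        if pattern.match(parsed.path):
            return shape
    return None
```

A match here only means the URL is worth loading. A /status/ link still has to pass the rendered-DOM check before it counts as an article.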

Architecture

Runtime flow

article-pdf follows a narrow pipeline:

  1. x_scraper.cli validates that a saved X session exists and forwards the URL to the exporter.
  2. x_scraper.x_scraper.x_article_to_pdf() launches Chromium with the saved storage_state.
  3. The exporter opens the target URL and rejects it if X redirects to login or if the rendered page does not look like longform content.
  4. _prepare_media_for_print() scrolls the page and waits for lazy assets to hydrate before extraction.
  5. _extract_article_document() clones the highest-signal article container from the live DOM, removes X chrome, filters out avatars/icons/emoji, and inlines the remaining article media as data URLs.
  6. A second Playwright page receives the standalone HTML document and prints it to PDF.
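Step 6 can be sketched with Playwright's sync API. This is a minimal illustration under assumptions, not the exporter's real implementation: build_standalone_html and render_pdf are hypothetical names, the inline CSS is invented, and the A4 page format is an arbitrary choice. Playwright is imported inside render_pdf so the HTML helper can be used without it installed.

```python
from html import escape


def build_standalone_html(title: str, body_html: str) -> str:
    """Wrap already-extracted article markup in a minimal print document."""
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title>"
        # Hypothetical print styling: keep images inside the page width.
        "<style>img{max-width:100%;break-inside:avoid}</style>"
        f"</head><body><article>{body_html}</article></body></html>"
    )


def render_pdf(html: str, output_path: str) -> None:
    """Load standalone HTML into a fresh page and print it to PDF."""
    # Lazy import: only needed when a PDF is actually rendered.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # PDF generation requires headless Chromium, which is the default.
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="load")
        page.pdf(path=output_path, format="A4", print_background=True)
        browser.close()
```

Because the media is already inlined as data URLs by this point, the second page in this sketch needs neither network access nor the saved session.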

Module layout

src/x_scraper/
├── __main__.py
├── cli.py
├── config.py
└── x_scraper.py

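Responsibilities, as far as the runtime flow above describes them:

  • cli.py: validates that a saved X session exists and forwards the URL to the exporter.
  • x_scraper.py: contains x_article_to_pdf() and the extraction heuristics, including _prepare_media_for_print() and _extract_article_document().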

Session model

  • Login is interactive by design. The exporter does not automate credentials.
  • Session state is stored in auth/x_browser_state.json.
  • The browser profile directory lives under auth/x_profile/.
  • Default PDF output goes to out/, which is created on demand.
  • If X redirects to /login or /i/flow/login, the command fails fast and asks you to run x-scraper login again.
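The session checks above amount to two small predicates. The sketch below assumes hypothetical helper names (has_saved_session, is_login_redirect); only the state file path and the two login paths come from this README.

```python
from pathlib import Path
from urllib.parse import urlparse

STATE_FILE = Path("auth/x_browser_state.json")  # path documented above
LOGIN_PATHS = ("/login", "/i/flow/login")  # redirect targets documented above


def has_saved_session(state_file: Path = STATE_FILE) -> bool:
    """True when a saved browser state file exists on disk."""
    return state_file.is_file()


def is_login_redirect(final_url: str) -> bool:
    """True when the URL the browser ended up on is one of X's login pages."""
    path = urlparse(final_url).path.rstrip("/")
    # Match the login path itself or anything nested under it, but not
    # unrelated paths that merely share the prefix (e.g. /loginhelp).
    return any(path == p or path.startswith(p + "/") for p in LOGIN_PATHS)
```

The redirect check should run against the page's final URL after navigation rather than the URL the user passed in, since an expired session only shows up once X has redirected.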

Article detection model

The exporter does not trust the URL path alone. It accepts article-like URLs, then validates the rendered DOM:

  • waits for article or main article content to exist
  • checks for longform signals such as an Article label, substantial text length, or enough paragraph nodes
  • rejects ordinary tweet/status pages that do not render longform content

This is why /status/... links can work when X uses that route for longform pages.
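The longform check can be pictured as a predicate over a few signals collected from the rendered page. This is a sketch only: the PageSignals type, the function name, and both thresholds are hypothetical, and the real heuristics in x_scraper.py may weigh the signals differently.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Signals gathered from the rendered DOM (hypothetical structure)."""
    has_article_label: bool
    text_length: int
    paragraph_count: int


# Hypothetical thresholds; the project's real values are not documented here.
MIN_TEXT_LENGTH = 1500
MIN_PARAGRAPHS = 5


def looks_like_longform(signals: PageSignals) -> bool:
    """Accept the page if any one longform signal is present."""
    return (
        signals.has_article_label
        or signals.text_length >= MIN_TEXT_LENGTH
        or signals.paragraph_count >= MIN_PARAGRAPHS
    )
```

Under these assumed thresholds, an ordinary tweet (no Article label, a couple of hundred characters, one paragraph) is rejected, while a /status/ page that renders a long multi-paragraph body is accepted.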

Media extraction model

The PDF is not generated from X's live layout directly. Instead, the exporter builds its own print document:

  • forces lazy images to load
  • walks the article subtree and selects the highest-value content container
  • keeps likely article images, primarily large pbs.twimg.com/media/... assets and tweetPhoto media
  • removes non-content assets such as avatars, emoji sprites, SVG icons, and other X interface chrome
  • fetches remote media through the logged-in browser context and converts them to data URLs so the PDF renderer has self-contained assets

This architecture is why article images survive printing more reliably than a plain browser printToPDF call against the live X page.
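Two pieces of the media model lend themselves to a short illustration: deciding whether an image looks like article media, and turning fetched bytes into a data URL. The function names and the 200-pixel minimum width below are assumptions made for this sketch; only the pbs.twimg.com/media/ prefix comes from the description above.

```python
import base64
from urllib.parse import urlparse


def is_likely_article_image(src: str, width: int, min_width: int = 200) -> bool:
    """Keep large pbs.twimg.com/media/ assets; drop avatars and small icons."""
    parsed = urlparse(src)
    return (
        parsed.hostname == "pbs.twimg.com"
        and parsed.path.startswith("/media/")
        and width >= min_width  # hypothetical size cutoff
    )


def to_data_url(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL for embedding in HTML."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

In the exporter, the bytes would come from a fetch made inside the logged-in browser context. Here they are simply passed in, so the encoding step can be checked on its own.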

Failure modes

The command intentionally fails early in a few cases:

  • no saved X session
  • expired X session
  • target URL is not on x.com
  • rendered page does not appear to be an X longform article
  • X DOM changes enough that the article container or media selectors are no longer valid

When the last case happens, the fix usually belongs in the heuristics inside src/x_scraper/x_scraper.py.
