x-scraper

x-scraper is a focused CLI for exporting X longform articles to PDF. It reuses a locally saved X browser session, loads the article in Playwright, extracts the article body into a print-safe standalone document, inlines user-added media, and renders the result to PDF.

Setup

The recommended workflow uses uv, a fast Python package and environment manager. In this project it handles dependency installation, virtualenv management, and command execution.

Install uv first:

curl -LsSf https://astral.sh/uv/install.sh | sh

Then set up the project:

uv sync
uv run playwright install chromium

Commands

Log into X

uv run x-scraper login

This opens a real browser window to https://x.com/login. After you complete login, the tool saves the browser state to auth/x_browser_state.json.

Export an article to PDF

uv run x-scraper article-pdf "https://x.com/i/articles/your-article-id"

If you do not pass --output, PDFs are written to out/ automatically.

You can also pass an explicit output path:

uv run x-scraper article-pdf "https://x.com/handle/article/123" -o ./article.pdf

Supported URL shapes:

https://x.com/i/articles/...
https://x.com/<handle>/article/...
https://x.com/<handle>/status/... when the loaded page actually renders longform article content
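The three URL shapes above can be told apart with a small path check. The following is a minimal sketch, not the project's actual code: the function name classify_x_url, its return labels, and the decision to also accept www.x.com are assumptions made for illustration. A "status" result only means the URL is a candidate; the rendered page still has to look like longform content.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: www.x.com is treated the same as x.com.
ALLOWED_HOSTS = {"x.com", "www.x.com"}


def classify_x_url(url: str) -> Optional[str]:
    """Return "article", "status", or None for unsupported URLs (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ALLOWED_HOSTS:
        return None

    # Split the path into non-empty segments, e.g. ["i", "articles", "123"].
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 3:
        return None

    if parts[0] == "i" and parts[1] == "articles":
        return "article"  # https://x.com/i/articles/...
    if parts[1] == "article":
        return "article"  # https://x.com/<handle>/article/...
    if parts[1] == "status":
        return "status"  # only valid if the page renders longform content
    return None
```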

Architecture

Runtime flow

article-pdf follows a narrow pipeline:

x_scraper.cli validates that a saved X session exists and forwards the URL to the exporter.
x_scraper.x_scraper.x_article_to_pdf() launches Chromium with the saved storage_state.
The exporter opens the target URL and rejects it if X redirects to login or if the rendered page does not look like longform content.
_prepare_media_for_print() scrolls the page and waits for lazy assets to hydrate before extraction.
_extract_article_document() clones the highest-signal article container from the live DOM, removes X chrome, filters out avatars/icons/emoji, and inlines the remaining article media as data URLs.
A second Playwright page receives the standalone HTML document and prints it to PDF.
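The last step of the flow above, handing a standalone document to a second page and printing it, can be sketched with Playwright's sync API. This is an illustration under stated assumptions rather than the exporter's real implementation: build_print_document and render_pdf are hypothetical names, and the page format and inline CSS are placeholders.

```python
from html import escape
from pathlib import Path


def build_print_document(title: str, body_html: str) -> str:
    """Wrap already-extracted article markup in a minimal standalone HTML page."""
    return (
        "<!doctype html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        "<style>img { max-width: 100%; }</style>"
        f"</head><body><article>{body_html}</article></body></html>"
    )


def render_pdf(html: str, output: Path, storage_state: Path) -> Path:
    """Load the standalone document in a fresh page and print it to PDF."""
    # Imported lazily so build_print_document() is usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # PDF generation is documented for headless Chromium.
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(storage_state=str(storage_state))
        page = context.new_page()
        page.set_content(html, wait_until="load")
        output.parent.mkdir(parents=True, exist_ok=True)
        page.pdf(path=str(output), format="A4", print_background=True)
        browser.close()
    return output
```

Because the media is already inlined as data URLs by this point, the second page does not need network access to X to render the images.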

Module layout

src/x_scraper/
├── __main__.py
├── cli.py
├── config.py
└── x_scraper.py

Responsibilities:

src/x_scraper/cli.py: Defines the Typer app and the public CLI commands, login and article-pdf.
src/x_scraper/config.py: Holds project paths and browser/login timeouts.
src/x_scraper/x_scraper.py: Implements X session management, article detection, media preparation, DOM extraction, asset inlining, and PDF rendering.
src/x_scraper/__main__.py: Module entrypoint for the package.

Session model

Login is interactive by design. The exporter does not automate credentials.
Session state is stored in auth/x_browser_state.json.
The browser profile directory lives under auth/x_profile/.
Default PDF output goes to out/, which is created on demand.
If X redirects to /login or /i/flow/login, the command fails fast and asks for x-scraper login.
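The fail-fast redirect check can be expressed as a pure function over the final page URL. This is a sketch only: redirected_to_login is a hypothetical name, and the real exporter may detect the redirect differently.

```python
from urllib.parse import urlparse

# The two login routes named above.
LOGIN_PATHS = ("/login", "/i/flow/login")


def redirected_to_login(final_url: str) -> bool:
    """Return True if the URL the browser ended up on is an X login route."""
    path = urlparse(final_url).path.rstrip("/")
    # Match the route itself or anything nested under it, but not
    # unrelated paths that merely start with the same characters.
    return any(path == route or path.startswith(route + "/") for route in LOGIN_PATHS)
```

Comparing the parsed path rather than the raw string keeps query parameters, such as a redirect target appended by X, from affecting the result.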

Article detection model

The exporter does not trust the URL path alone. It accepts article-like URLs, then validates the rendered DOM:

waits for article or main article content to exist
checks for longform signals such as an Article label, substantial text length, or enough paragraph nodes
rejects ordinary tweet/status pages that do not render longform content

This is why /status/... links can work when X uses that route for longform pages.
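One way to picture the detection step is as a predicate over a few signals collected from the rendered DOM. The sketch below is illustrative: PageSignals, looks_like_longform, and both thresholds are assumptions, not values taken from the project.

```python
from dataclasses import dataclass

# Illustrative thresholds; the real heuristics may use different values.
MIN_TEXT_CHARS = 1500
MIN_PARAGRAPHS = 5


@dataclass
class PageSignals:
    """Signals gathered from the rendered page (hypothetical structure)."""

    has_article_label: bool
    text_length: int
    paragraph_count: int


def looks_like_longform(signals: PageSignals) -> bool:
    """Accept the page if any single longform signal is present."""
    return (
        signals.has_article_label
        or signals.text_length >= MIN_TEXT_CHARS
        or signals.paragraph_count >= MIN_PARAGRAPHS
    )
```

An ordinary status page with a short post and no Article label fails all three checks, while a longform page served on the same route passes at least one.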

Media extraction model

The PDF is not generated from X's live layout directly. Instead, the exporter builds its own print document:

forces lazy images to load
walks the article subtree and selects the highest-value content container
keeps likely article images, primarily large pbs.twimg.com/media/... assets and tweetPhoto media
removes non-content assets such as avatars, emoji sprites, SVG icons, and other X interface chrome
fetches remote media through the logged-in browser context and converts them to data URLs so the PDF renderer has self-contained assets

This architecture is why article images survive printing more reliably than a plain browser printToPDF call against the live X page.
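The filtering and inlining steps can be sketched with the standard library. The names is_article_image and to_data_url and the MIN_EDGE_PX threshold are assumptions for illustration. The sketch covers only the pbs.twimg.com/media/... case, not the tweetPhoto check, and it assumes the image bytes were already fetched through the logged-in browser context.

```python
import base64
from urllib.parse import urlparse

# Assumed cutoff separating content images from small interface assets.
MIN_EDGE_PX = 200


def is_article_image(src: str, width: int, height: int) -> bool:
    """Keep large pbs.twimg.com/media/... images; drop avatars, icons, and emoji."""
    parsed = urlparse(src)
    if parsed.hostname != "pbs.twimg.com":
        return False
    if not parsed.path.startswith("/media/"):
        return False  # e.g. /profile_images/... is an avatar
    return min(width, height) >= MIN_EDGE_PX


def to_data_url(payload: bytes, mime_type: str) -> str:
    """Encode fetched media bytes as a self-contained data URL."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

For example, to_data_url(b"hi", "image/png") returns "data:image/png;base64,aGk=". Replacing each kept image's src with such a string is what lets the print document render without any further requests.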

Failure modes

The command intentionally fails early in a few cases:

no saved X session
expired X session
target URL is not on x.com
rendered page does not appear to be an X longform article
X DOM changes enough that the article container or media selectors are no longer valid

When the last case happens, the fix usually belongs in the heuristics inside src/x_scraper/x_scraper.py.
