Extract clean, structured content from web pages with automatic short-link expansion and lightweight page-type detection.
xtractr fetches a URL, follows redirects, parses the HTML, converts the main content to Markdown, and returns normalized metadata you can use in apps, pipelines, and AI workflows.
It is primarily intended for use inside a Cloudflare Worker runtime.
- Expands shortened URLs and returns the full redirect chain
- Extracts readable page content with defuddle
- Converts extracted HTML content to Markdown
- Detects content/page type using:
  - Open Graph (og:type)
  - JSON-LD (@type) with nested traversal
  - domain and file-extension fallback rules
- Enforces a max page size (5MB) for safer fetching
- Works in ESM/CJS builds with TypeScript declarations
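
The page-type detection described in the list above can be pictured roughly as follows. This is a minimal sketch and not xtractr's actual implementation: the names `detectType` and `collectSchemaTypes`, the mapping tables, and the priority order (Open Graph first, then JSON-LD, then file extension) are all assumptions made for illustration.

```typescript
// Hypothetical sketch of layered type detection; xtractr's real rules may differ.
type DetectedType = 'link' | 'video' | 'audio' | 'recipe' | 'image' | 'document' | 'article'

// Illustrative mappings only (assumed, not taken from the library).
const OG_TYPES = new Map<string, DetectedType>([
  ['article', 'article'],
  ['video.other', 'video'],
  ['music.song', 'audio'],
])
const SCHEMA_TYPES = new Map<string, DetectedType>([
  ['Recipe', 'recipe'],
  ['VideoObject', 'video'],
  ['Article', 'article'],
  ['NewsArticle', 'article'],
])
const EXTENSIONS = new Map<string, DetectedType>([
  ['pdf', 'document'],
  ['mp3', 'audio'],
  ['mp4', 'video'],
  ['png', 'image'],
  ['jpg', 'image'],
])

// Walk a parsed JSON-LD value depth-first, collecting every @type string,
// including those nested under keys such as @graph or mainEntity.
function collectSchemaTypes(node: unknown, found: string[] = []): string[] {
  if (Array.isArray(node)) {
    for (const item of node) collectSchemaTypes(item, found)
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node as Record<string, unknown>)) {
      if (key === '@type') {
        for (const t of Array.isArray(value) ? value : [value]) {
          if (typeof t === 'string') found.push(t)
        }
      } else {
        collectSchemaTypes(value, found)
      }
    }
  }
  return found
}

function detectType(input: { url: string; ogType?: string; jsonLd?: unknown }): DetectedType {
  // 1. Open Graph
  const fromOg = input.ogType ? OG_TYPES.get(input.ogType) : undefined
  if (fromOg) return fromOg

  // 2. JSON-LD, first recognised @type wins
  for (const schemaType of collectSchemaTypes(input.jsonLd)) {
    const mapped = SCHEMA_TYPES.get(schemaType)
    if (mapped) return mapped
  }

  // 3. File-extension fallback, defaulting to a plain link
  const extension = new URL(input.url).pathname.split('.').pop()?.toLowerCase() ?? ''
  return EXTENSIONS.get(extension) ?? 'link'
}
```

Using `Map` rather than plain objects for the lookup tables avoids accidental hits on inherited keys such as `constructor`.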
npm install @mrmartineau/xtractr
bun add @mrmartineau/xtractr
pnpm add @mrmartineau/xtractr
yarn add @mrmartineau/xtractr

import { xtract } from '@mrmartineau/xtractr'
const result = await xtract('https://bit.ly/example')
console.log(result.title)
console.log(result.content) // markdown
console.log(result.redirectUrls)
console.log(result.pageType)

import { xtract } from '@mrmartineau/xtractr'
export default {
  async fetch(request: Request): Promise<Response> {
    const { searchParams } = new URL(request.url)
    const target = searchParams.get('url')
    if (!target) {
      return new Response('Missing "url" query parameter', { status: 400 })
    }
    const data = await xtract(target)
    return Response.json(data)
  },
}

Fetches, parses, and extracts structured content from a URL.
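
The Worker example above passes the `url` parameter straight to `xtract` and lets any failure propagate. A slightly more defensive variant might look like the sketch below. It assumes that rejected pages (non-HTML, oversized, too many redirects) surface as thrown errors, which the README does not state. `handleExtract` and the `Extractor` type are hypothetical names; the extractor is passed in as an argument so the handler can be exercised without network access.

```typescript
// Hypothetical stand-in for xtract's call signature (assumed: URL string in, promise out).
type Extractor = (url: string) => Promise<unknown>

// Small JSON helper so the sketch does not depend on the static Response.json.
function json(body: unknown, status = 200): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { 'content-type': 'application/json' },
  })
}

async function handleExtract(request: Request, extract: Extractor): Promise<Response> {
  const target = new URL(request.url).searchParams.get('url')
  if (!target) {
    return new Response('Missing "url" query parameter', { status: 400 })
  }

  // Accept only absolute http(s) URLs before doing any fetching.
  let parsed: URL
  try {
    parsed = new URL(target)
  } catch {
    return new Response('Invalid "url" query parameter', { status: 400 })
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return new Response('Only http and https URLs are supported', { status: 400 })
  }

  try {
    return json(await extract(parsed.href))
  } catch (error) {
    // Assumption: extraction failures arrive as thrown errors.
    const message = error instanceof Error ? error.message : 'Extraction failed'
    return json({ error: message }, 502)
  }
}
```

In a Worker, this could be wired up as `fetch: (request) => handleExtract(request, xtract)`.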
xtract resolves to an object with these fields:

- title: string - extracted page title
- author: string - extracted author (if found)
- published: string - published date string (if found)
- description: string - summary/description (if found)
- domain: string - source domain
- content: string - extracted main content as Markdown
- wordCount: number - estimated word count from extracted content
- source: string - original input URL
- url: string - final fetched URL
- resolvedUrl: string - resolved URL after unshortening/fetch redirects
- redirectUrls: string[] - full redirect chain
- urlType: LinkType - detected type for the URL
- pageType: LinkType - detected type for the extracted page
- favicon?: string - favicon URL if available
- image?: string - representative image URL if available
- site?: string - site/publication name if available

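
As one illustration of consuming these fields, the sketch below turns a result into a compact bookmark record. `ExtractedPage` and `toBookmark` are names invented here; the interface is only a structural subset of the documented fields, and the 200 words-per-minute reading speed is an arbitrary assumption.

```typescript
// Structural subset of the documented result fields (the interface name is hypothetical).
interface ExtractedPage {
  title: string
  url: string
  domain: string
  wordCount: number
  pageType: string
  site?: string
}

// Build a small bookmark record; wordsPerMinute defaults to an assumed 200.
function toBookmark(page: ExtractedPage, wordsPerMinute = 200) {
  return {
    // Fall back to the domain when no title was extracted.
    title: page.title || page.domain,
    href: page.url,
    source: page.site ?? page.domain,
    kind: page.pageType,
    // Round up, and never report less than one minute.
    readingMinutes: Math.max(1, Math.ceil(page.wordCount / wordsPerMinute)),
  }
}
```

Because the interface is a subset, a full `xtract` result should be assignable to it without conversion, provided the field names match those listed above.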
type LinkType =
  | 'link'
  | 'video'
  | 'audio'
  | 'recipe'
  | 'image'
  | 'document'
  | 'article'
  | 'game'
  | 'book'
  | 'event'
  | 'product'
  | 'note'
  | 'file'

- Non-HTML responses are rejected.
- Responses larger than 5MB are rejected.
- Redirect chasing is capped (currently 20 hops).
- Intended runtime: Cloudflare Workers.
- Also works in other runtimes that provide fetch (Node 18+ recommended).
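
The limits listed above (HTML only, 5MB, 20 redirect hops) could be enforced along the following lines. This is a sketch under stated assumptions, not the library's code: `followRedirects`, `readHtml`, and `FetchLike` are hypothetical names, 5MB is taken to mean 5 × 1024 × 1024 bytes, and the runtime is assumed to expose 3xx responses when `redirect: 'manual'` is set, as Workers and Node do (browsers return an opaque response instead).

```typescript
const MAX_HOPS = 20 // documented redirect cap
const MAX_BYTES = 5 * 1024 * 1024 // assumed interpretation of "5MB"

type FetchLike = (url: string, init?: RequestInit) => Promise<Response>

// Follow redirects one hop at a time, recording every URL visited.
async function followRedirects(start: string, fetchImpl: FetchLike = fetch, maxHops = MAX_HOPS) {
  const chain = [start]
  let current = start
  let hops = 0
  while (true) {
    const response = await fetchImpl(current, { redirect: 'manual' })
    const location = response.headers.get('location')
    if (response.status < 300 || response.status >= 400 || location === null) {
      return { response, chain }
    }
    if (hops === maxHops) {
      throw new Error(`Too many redirects (more than ${maxHops})`)
    }
    hops++
    // Location may be relative, so resolve it against the current URL.
    current = new URL(location, current).href
    chain.push(current)
  }
}

// Reject non-HTML and oversized responses, then decode the body as text.
async function readHtml(response: Response, maxBytes = MAX_BYTES): Promise<string> {
  const contentType = response.headers.get('content-type') ?? ''
  if (!/html/i.test(contentType)) {
    throw new Error(`Not an HTML response: ${contentType || 'unknown type'}`)
  }
  // Content-Length can be missing or wrong, so the actual size is checked too.
  const declared = Number(response.headers.get('content-length') ?? '0')
  if (declared > maxBytes) throw new Error('Response too large')
  const bytes = await response.arrayBuffer()
  if (bytes.byteLength > maxBytes) throw new Error('Response too large')
  return new TextDecoder().decode(bytes)
}
```

This version buffers the whole body before the second size check; a stricter variant would read the stream incrementally and abort once the limit is crossed.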
# Build (CJS + ESM + DTS)
npm run build
# Watch mode
npm run dev
# Lint/format
npm run check

License: MIT