Rules & query syntax

Create, read, update, and delete the queries attached to a tap — and the full query language they're written in.

A rule is a query attached to a tap, with an optional tag label. A page is delivered to the tap's stream if it matches any rule on the tap. All rule endpoints authenticate with a tap token (fh_).

Rule object

| Field | Type | Description |
| --- | --- | --- |
| id | string | Rule identifier |
| value | string | The query (required) |
| tag | string | Optional label, max 255 chars |
| nsfw | boolean | Include adult content. Default false |
| quality | boolean | Apply quality filters. Default true |

List rules

curl -s https://api.firehose.com/v1/rules \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"
{
  "data": [
    { "id": "1", "value": "tesla", "tag": "brand-mentions" },
    { "id": "2", "value": "\"site explorer\"", "tag": "product" }
  ],
  "meta": { "count": 2 }
}

Create a rule

curl -s -X POST https://api.firehose.com/v1/rules \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"value": "tesla OR \"electric vehicle\"", "tag": "ev"}'

Returns 201 with the created rule.
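
The same request can be assembled from a script. The sketch below uses only Python's standard library and builds the request without sending it. The endpoint, headers, and body come from the curl example above; the helper name `build_create_rule_request` and the token `fh_example` are illustrative placeholders, not part of the Firehose API.

```python
import json
import urllib.request

API_BASE = "https://api.firehose.com/v1"  # base URL taken from the curl examples


def build_create_rule_request(token, value, tag=None, nsfw=None, quality=None):
    """Build (but do not send) a POST request that creates a rule."""
    body = {"value": value}
    # Only include the optional fields the caller actually set.
    for key, field in (("tag", tag), ("nsfw", nsfw), ("quality", quality)):
        if field is not None:
            body[key] = field
    return urllib.request.Request(
        f"{API_BASE}/rules",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "fh_example" is a placeholder, not a real tap token.
req = build_create_rule_request("fh_example", 'tesla OR "electric vehicle"', tag="ev")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it; a successful call should come back with status 201, as described above.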

Update a rule

Partial updates are supported: send only the fields you want to change.

curl -s -X PUT https://api.firehose.com/v1/rules/1 \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tag": "new-tag", "nsfw": true}'

Delete a rule

curl -s -X DELETE https://api.firehose.com/v1/rules/1 \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"

Returns 204 with no content.


Query syntax

A rule's value is written in Firehose query syntax, which is Lucene-compatible. Queries are evaluated against indexed fields extracted from each crawled page.

Indexed fields

| Field | Type | Case | Description |
| --- | --- | --- | --- |
| added | text | insensitive | Default field. Text from inserted diff chunks |
| removed | text | insensitive | Text from deleted diff chunks |
| added_anchor | text | insensitive | Anchor text from inserted links |
| removed_anchor | text | insensitive | Anchor text from deleted links |
| title | text | insensitive | Page title |
| url | keyword | sensitive | Full URL as one exact token |
| domain | keyword | sensitive | Domain extracted from the URL |
| publish_time | keyword | sensitive | ISO-8601 local datetime |
| page_category | keyword | sensitive | ML category label, e.g. /News |
| page_type | keyword | sensitive | ML type label, e.g. /Article/How_to |
| language | keyword | sensitive | Detected language code from a fixed set, e.g. en, fr, zh-cn, zh-tw |
| dr | number | | Domain Rating (0–100) of the page's domain |
| recent | filter | | Recency filter (see below) |

Text fields are tokenized and lowercased (case-insensitive). Keyword fields are stored as a single exact, case-sensitive token. The Number field (dr) is an integer matched with range queries (see below). Null/empty fields are absent and never match. Multi-valued fields match if any value matches.

Terms and phrases

tesla                        # "tesla" anywhere in added content (default field)
title:tesla                  # "tesla" in the title
"quick brown fox"            # exact phrase in content
title:"breaking news"        # exact phrase in title

Boolean operators

java AND programming
title:tesla OR added:"electric vehicle"
title:tesla AND NOT malware    # tesla, excluding pages that mention malware
title:tesla AND added:earnings
removed:"old feature"          # term appeared in deleted content

NOT excludes — it can't stand alone. A rule needs at least one positive term to match, so a query made only of NOT clauses (like NOT malware) is rejected; pair it with a term to keep, as above.

URL and domain filtering

url and domain are exact, case-sensitive tokens. url matches three ways — exact, wildcard (*, ?), and regex (/pattern/); domain matches exact or wildcard, but not regex. Forward slashes are special and must be escaped with \.

url:"https://example.com/news/article-1"   # exact
domain:techcrunch.com                       # exact domain
url:*\/category\/*                          # wildcard: contains /category/
url:/.*\/page\/[0-9]+.*/                     # regex: pagination URLs

Excluding junk URLs is the most common pattern:

title:tesla AND language:"en"
  AND NOT url:/.*\/page\/[0-9]+.*/
  AND NOT url:*\/category\/*
  AND NOT url:*\/tag\/*
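
When the list of junk path segments grows, it can be easier to generate the exclusion clauses than to type them. The following is a minimal sketch; `escape_slashes` and `exclude_path_segments` are hypothetical helper names, and the output is the query text before any JSON encoding.

```python
def escape_slashes(fragment):
    """Put a backslash before every forward slash, as url patterns require."""
    return fragment.replace("/", "\\/")


def exclude_path_segments(base_query, segments):
    """Append one NOT url wildcard clause per path segment."""
    clauses = [base_query]
    for segment in segments:
        pattern = escape_slashes("/" + segment + "/")
        clauses.append("NOT url:*" + pattern + "*")
    return " AND ".join(clauses)


query = exclude_path_segments('title:tesla AND language:"en"', ["category", "tag"])
print(query)
```

Because the base query supplies a positive term, the generated rule is not made only of NOT clauses.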

JSON double-escaping. In a JSON request body, \/ is just /. To send a literal backslash before a slash in the query, write \\/ in JSON. For example the query url:*\/abs\/* must be sent as "url:*\\/abs\\/*".
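
One way to avoid hand-escaping is to write the query exactly as the query language expects and let a JSON serializer add the second layer. A short sketch with Python's standard `json` module (the `papers` tag is only an illustrative value):

```python
import json

# The query as Firehose should receive it: one backslash before each slash.
query = r"url:*\/abs\/*"

# json.dumps doubles each backslash while encoding the request body.
body = json.dumps({"value": query, "tag": "papers"})
print(body)

# Decoding the body gives back the original single-backslash query.
print(json.loads(body)["value"])
```

The first line printed contains `url:*\\/abs\\/*`, matching the form shown above, and the second prints the original query.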

Filtering on url narrows which crawled pages match — it does not tell Firehose to crawl that URL. A tap only ever sees pages the crawler visits, on the crawler's own schedule, so a change to a specific page won't surface until (and unless) the crawler re-crawls it. To monitor a specific page for changes on a cadence you control, use URL Watch instead.

Date ranges on publish_time

Colons in timestamps must be escaped with \\:

publish_time:[2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59]   # inclusive
publish_time:{2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59}   # exclusive

Authority ranges on dr

dr (Domain Rating, 0–100) is numeric. Match it with Lucene range syntax — inclusive [min TO max], exclusive {min TO max}, an open end with *, or an exact value:

dr:[50 TO 100]                 # domain rating 50–100 (inclusive)
dr:[70 TO *]                   # DR 70 or higher
dr:{0 TO 50}                   # below 50 (exclusive)
title:tesla AND dr:[60 TO *]   # tesla in title, strong domains only

A page whose dr is unknown has no value for the field and never matches a numeric query, so a dr: constraint also drops pages we couldn't score.
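
If rules are generated programmatically, a small helper can keep the bracket styles and open ends straight. This is a sketch only: `dr_range` is a hypothetical name, and rejecting a range with both ends open is a choice made here, not documented Firehose behavior.

```python
def dr_range(minimum=None, maximum=None, inclusive=True):
    """Build a dr range clause; None leaves that end open with *."""
    if minimum is None and maximum is None:
        raise ValueError("give at least one bound")
    for bound in (minimum, maximum):
        if bound is not None and not 0 <= bound <= 100:
            raise ValueError("dr bounds must be between 0 and 100")
    low = "*" if minimum is None else str(minimum)
    high = "*" if maximum is None else str(maximum)
    left, right = ("[", "]") if inclusive else ("{", "}")
    return f"dr:{left}{low} TO {high}{right}"


print(dr_range(50, 100))                 # inclusive range
print(dr_range(70))                      # open upper end
print(dr_range(0, 50, inclusive=False))  # exclusive range
```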

recent — recency filter

A query-level filter (not an indexed field). Format: a positive integer followed by h, d, or mo.

recent:1h                      # published in the last hour
recent:7d                      # last 7 days
title:tesla AND recent:24h     # tesla in title, last 24 hours
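
The format can be checked client-side before a rule is submitted. The pattern below is an approximation of "a positive integer followed by h, d, or mo"; it assumes leading zeros are not accepted, which the text above does not state either way, and `is_valid_recent` is a hypothetical helper name.

```python
import re

# A positive integer (no leading zero assumed) followed by h, d, or mo.
RECENT_VALUE = re.compile(r"[1-9][0-9]*(h|d|mo)")


def is_valid_recent(value):
    """Return True if value looks like a valid recent: argument."""
    return RECENT_VALUE.fullmatch(value) is not None


for candidate in ("1h", "7d", "3mo", "0d", "7w", "d"):
    print(candidate, is_valid_recent(candidate))
```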

nsfw — adult content

A boolean on the rule object, not in the query. false (default) excludes adult content; true includes it.

{ "value": "title:tesla", "nsfw": true }

quality — quality filter

A boolean on the rule object (default true). When on, results are limited to pages published in the last 7 days, with no pagination, tag/category index, or query-parameter URLs — removing low-value and duplicate pages.

{ "value": "domain:\"example.com\"", "quality": false }

Category, type, and language values

page_category and page_type accept a large fixed vocabulary (25 top-level categories with 700+ subcategories, and 110+ page types); the complete lists live in the canonical /skill.md reference. language is a fixed set of detected codes (e.g. en, fr, pt, zh-cn, zh-tw). Chinese is zh-cn or zh-tw — there is no bare zh. Every category and type value begins with /, so quote it — an unquoted /… is read as a regex.

page_category:"/News"
page_category:"/Sports/Winter_Sports/Skiing_and_Snowboarding"
page_type:"/Article/How_to"
page_type:"/Document/White_Paper"
language:"en"

A value outside these sets is caught when you save the rule, with the closest valid value suggested — see Query validation.

Query validation

Firehose checks a rule's query when you create or update it and rejects an invalid one with 422 (see Errors & limits) instead of saving a rule that can't work. A query is rejected when it:

  • has a syntax error, or is empty;
  • names an unknown field (the error suggests the closest real field);
  • uses a wildcard or regex where it isn't allowed — wildcards work only on url and domain, regex only on url — or matches everything (*:* or field:*);
  • can never match a page — a query made only of NOT clauses, or one whose only route to a match is a page_category, page_type, or language value that isn't in the vocabulary.

A misspelt category, type, or language value is reported with the closest valid value (for example page_type:"/Artical" suggests /Article). If it sits in an OR beside a clause that can still match, the rule is accepted with a warning; if it's the rule's only way to match, the rule is rejected.
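
Some of these rejections can be anticipated before a request is sent. The sketch below is a rough client-side pre-check, not a parser and not how Firehose validates: it only looks for an empty query, match-everything patterns, and wildcards or regexes on fields where the list above says they are not allowed. It does not detect unknown fields, NOT-only queries, or vocabulary errors, and the name `preflight` is hypothetical.

```python
import re

WILDCARD_FIELDS = {"url", "domain"}  # wildcards are allowed on these fields
REGEX_FIELDS = {"url"}               # regex is allowed only on url


def preflight(query):
    """Return a list of likely problems; an empty list means none were spotted."""
    if not query.strip():
        return ["query is empty"]
    problems = []
    # Blank out quoted phrases so their contents are not mistaken for operators.
    bare = re.sub(r'"[^"]*"', '""', query)
    if "*:*" in bare:
        problems.append("*:* matches everything")
    for field, value in re.findall(r"(\w+):(\S+)", bare):
        if value == "*":
            problems.append(f"{field}:* matches everything")
        elif value.startswith(("[", "{")):
            continue  # range query such as dr:[70 TO *]
        elif value.startswith("/") and field not in REGEX_FIELDS:
            problems.append(f"regex is not allowed on {field}")
        elif ("*" in value or "?" in value) and field not in WILDCARD_FIELDS:
            problems.append(f"wildcards are not allowed on {field}")
    return problems


print(preflight("title:tes*"))
print(preflight("title:tesla AND dr:[70 TO *]"))
```

An empty result only means this check found nothing; the 422 response from the API remains the authoritative answer.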
