Rules & query syntax
Create, read, update, and delete the queries attached to a tap — and the full query language they're written in.
A rule is a query attached to a tap, with an optional tag label. A page is delivered to the tap's
stream if it matches any rule on the tap. All rule endpoints authenticate with a tap token
(fh_).
Rule object
| Field | Type | Description |
|---|---|---|
| id | string | Rule identifier |
| value | string | The query (required) |
| tag | string | Optional label, max 255 chars |
| nsfw | boolean | Include adult content. Default false |
| quality | boolean | Apply quality filters. Default true |
List rules
curl -s https://api.firehose.com/v1/rules \
-H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"
{
"data": [
{ "id": "1", "value": "tesla", "tag": "brand-mentions" },
{ "id": "2", "value": "\"site explorer\"", "tag": "product" }
],
"meta": { "count": 2 }
}
Create a rule
curl -s -X POST https://api.firehose.com/v1/rules \
-H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
-H "Content-Type: application/json" \
-d '{"value": "tesla OR \"electric vehicle\"", "tag": "ev"}'
Returns 201 with the created rule.
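For callers not using curl, the same request can be assembled in Python. This is a minimal sketch using only the standard library: `build_create_rule_request` is a hypothetical helper name, `fh_example` is a placeholder token, and the request is built but not sent, so running the snippet has no side effects.

```python
import json
import urllib.request

API_URL = "https://api.firehose.com/v1/rules"  # endpoint from the curl example


def build_create_rule_request(token, value, tag=None):
    """Build (but do not send) a POST request that creates a rule."""
    payload = {"value": value}          # the query is the only required field
    if tag is not None:
        payload["tag"] = tag            # optional label, max 255 chars
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "fh_example" is a placeholder, not a real tap token.
req = build_create_rule_request("fh_example", 'tesla OR "electric vehicle"', tag="ev")
# urllib.request.urlopen(req) would send it; a 201 response carries the created rule.
```

Letting `json.dumps` produce the body avoids hand-escaping the quotes around the phrase.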
Update a rule
Partial updates are supported.
curl -s -X PUT https://api.firehose.com/v1/rules/1 \
-H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
-H "Content-Type: application/json" \
-d '{"tag": "new-tag", "nsfw": true}'
Delete a rule
curl -s -X DELETE https://api.firehose.com/v1/rules/1 \
-H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"
Returns 204 with no content.
Query syntax
A rule's value is written in Firehose query syntax, which is Lucene-compatible. Queries are
evaluated against indexed fields extracted from each crawled page.
Indexed fields
| Field | Type | Case | Description |
|---|---|---|---|
| added | text | insensitive | Default field. Text from inserted diff chunks |
| removed | text | insensitive | Text from deleted diff chunks |
| added_anchor | text | insensitive | Anchor text from inserted links |
| removed_anchor | text | insensitive | Anchor text from deleted links |
| title | text | insensitive | Page title |
| url | keyword | sensitive | Full URL as one exact token |
| domain | keyword | sensitive | Domain extracted from the URL |
| publish_time | keyword | sensitive | ISO-8601 local datetime |
| page_category | keyword | sensitive | ML category label, e.g. /News |
| page_type | keyword | sensitive | ML type label, e.g. /Article/How_to |
| language | keyword | sensitive | Detected language code from a fixed set, e.g. en, fr, zh-cn, zh-tw |
| dr | number | — | Domain Rating (0–100) of the page's domain |
| recent | filter | — | Recency filter (see below) |
Text fields are tokenized and lowercased (case-insensitive). Keyword fields are stored as a
single exact, case-sensitive token. The Number field (dr) is an integer matched with
range queries (see below). Null/empty fields are absent and never match. Multi-valued
fields match if any value matches.
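The matching rules above can be illustrated with a toy model. This is only a sketch of the described semantics, not Firehose's implementation: the function names are hypothetical, and the tokenizer (splitting on non-word characters) is an assumption, since the real one is not documented here.

```python
import re


def text_matches(term, value):
    """Text fields: tokenized and lowercased, so matching ignores case."""
    if not value:                                  # null/empty fields never match
        return False
    tokens = re.findall(r"\w+", value.lower())     # assumed tokenizer
    return term.lower() in tokens


def keyword_matches(term, values):
    """Keyword fields: one exact, case-sensitive token per value."""
    if not values:                                 # null/empty fields never match
        return False
    if isinstance(values, str):
        values = [values]                          # treat a single value as a list
    return any(v == term for v in values)          # multi-valued: any value may match
```

Under this model `text_matches("TESLA", "Tesla beats earnings")` is true, while `keyword_matches("EN", "en")` is false because keyword comparison is case-sensitive.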
Terms and phrases
tesla # "tesla" anywhere in added content (default field)
title:tesla # "tesla" in the title
"quick brown fox" # exact phrase in content
title:"breaking news" # exact phrase in title
Boolean operators
java AND programming
title:tesla OR added:"electric vehicle"
title:tesla AND NOT malware # tesla, excluding pages that mention malware
title:tesla AND added:earnings
removed:"old feature" # term appeared in deleted content
NOT excludes; it can't stand alone. A rule needs at least one positive term to match, so a
query made only of NOT clauses (like NOT malware) is rejected; pair it with a term to keep, as above.
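When queries are assembled programmatically, the "at least one positive term" rule can be enforced before the request is sent. The helper below is a hypothetical sketch: it joins clauses with AND exactly as given and does no parsing or grouping of its own.

```python
def build_query(include, exclude=()):
    """Join positive clauses with AND, then append each exclusion as AND NOT."""
    include = [clause for clause in include if clause.strip()]
    if not include:
        # Mirrors the documented rule: a NOT-only query is rejected.
        raise ValueError("a rule needs at least one positive term")
    parts = include + [f"NOT {clause}" for clause in exclude]
    return " AND ".join(parts)


query = build_query(["title:tesla"], exclude=["malware"])
# query == "title:tesla AND NOT malware"
```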
URL and domain filtering
url and domain are exact, case-sensitive tokens. url matches three ways — exact, wildcard
(*, ?), and regex (/pattern/); domain matches exact or wildcard, but not regex. Forward
slashes are special and must be escaped with \.
url:"https://example.com/news/article-1" # exact
domain:techcrunch.com # exact domain
url:*\/category\/* # wildcard: contains /category/
url:/.*\/page\/[0-9]+.*/ # regex: pagination URLs
Excluding junk URLs is the most common pattern:
title:tesla AND language:"en"
AND NOT url:/.*\/page\/[0-9]+.*/
AND NOT url:*\/category\/*
AND NOT url:*\/tag\/*
JSON double-escaping. In a JSON request body, \/ is just /. To send a literal backslash
before a slash in the query, write \\/ in JSON. For example the query url:*\/abs\/*
must be sent as "url:*\\/abs\\/*".
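A JSON serializer handles this double-escaping automatically, which is usually safer than writing the body by hand. The sketch below uses Python's standard `json` module to show both directions: serializing the query doubles each backslash, while a hand-written `\/` in JSON decodes to a plain `/` and loses the escape.

```python
import json

query = r"url:*\/abs\/*"                 # the query as Firehose should receive it

body = json.dumps({"value": query})      # serializer doubles each backslash
print(body)                              # {"value": "url:*\\/abs\\/*"}

# Round trip: decoding the body gives back the intended query.
assert json.loads(body)["value"] == query

# Hand-written JSON with a single backslash: \/ decodes to /, so the escape is lost.
hand_written = r'{"value": "url:*\/abs\/*"}'
print(json.loads(hand_written)["value"])  # url:*/abs/*
```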
Filtering on url narrows which crawled pages match — it does not tell Firehose to crawl that
URL. A tap only ever sees pages the crawler visits, on the crawler's own schedule, so a change to a
specific page won't surface until (and unless) the crawler re-crawls it. To monitor a specific page
for changes on a cadence you control, use URL Watch instead.
Date ranges on publish_time
Colons in timestamps must be escaped with \\:
publish_time:[2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59] # inclusive
publish_time:{2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59} # exclusive
Authority ranges on dr
dr (Domain Rating, 0–100) is numeric. Match it
with Lucene range syntax — inclusive [min TO max], exclusive {min TO max}, an open end with *,
or an exact value:
dr:[50 TO 100] # domain rating 50–100 (inclusive)
dr:[70 TO *] # DR 70 or higher
dr:{0 TO 50} # below 50 (exclusive)
title:tesla AND dr:[60 TO *] # tesla in title, strong domains only
A page whose dr is unknown has no value for the field and never matches a numeric
query, so a dr: constraint also drops pages we couldn't score.
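The range forms above are easy to generate from code. The following is a small sketch with a hypothetical helper name; it assumes an omitted bound should become `*`, and it checks bounds against the documented 0–100 scale before building the clause.

```python
def dr_range(minimum=None, maximum=None, inclusive=True):
    """Build a dr range clause; None on either side becomes an open end (*)."""
    for bound in (minimum, maximum):
        if bound is not None and not 0 <= bound <= 100:
            raise ValueError("dr bounds must be within 0-100")
    low = "*" if minimum is None else str(minimum)
    high = "*" if maximum is None else str(maximum)
    # Square brackets are inclusive, curly braces exclusive.
    opening, closing = ("[", "]") if inclusive else ("{", "}")
    return f"dr:{opening}{low} TO {high}{closing}"


print(dr_range(50, 100))                  # dr:[50 TO 100]
print(dr_range(70))                       # dr:[70 TO *]
print(dr_range(0, 50, inclusive=False))   # dr:{0 TO 50}
```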
recent — recency filter
A query-level filter (not an indexed field). Format: a positive integer followed by h, d, or mo.
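The stated format can be checked client-side with a regular expression. This is a sketch under one assumption: "positive integer" is read as no leading zeros and no zero value, which may be stricter than what the server accepts. The helper name is hypothetical.

```python
import re

# A positive integer followed by h, d, or mo.
_RECENT_VALUE = re.compile(r"[1-9][0-9]*(?:h|d|mo)")


def is_valid_recent(value):
    """Return True if value looks like a recent: argument such as 1h, 7d, or 3mo."""
    return _RECENT_VALUE.fullmatch(value) is not None
```

With this check, `1h`, `7d`, and `3mo` pass, while `0d`, `7w`, and a bare `d` do not.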
recent:1h # published in the last hour
recent:7d # last 7 days
title:tesla AND recent:24h # tesla in title, last 24 hours
nsfw — adult content
A boolean on the rule object, not in the query. false (default) excludes adult content;
true includes it.
{ "value": "title:tesla", "nsfw": true }
quality — quality filter
A boolean on the rule object (default true). When on, results are limited to pages published
in the last 7 days, with no pagination, tag/category index, or query-parameter URLs — removing
low-value and duplicate pages.
{ "value": "domain:\"example.com\"", "quality": false }
Category, type, and language values
page_category and page_type accept a large fixed vocabulary (25 top-level categories with
700+ subcategories, and 110+ page types); the complete lists live in the canonical
/skill.md reference. language is a fixed set of detected
codes (e.g. en, fr, pt, zh-cn, zh-tw). Chinese is zh-cn or zh-tw — there is no bare
zh. Every category and type value begins with /, so quote it — an unquoted /… is read as a
regex.
page_category:"/News"
page_category:"/Sports/Winter_Sports/Skiing_and_Snowboarding"
page_type:"/Article/How_to"
page_type:"/Document/White_Paper"
language:"en"
A value outside these sets is caught when you save the rule, with the closest valid value suggested (see Query validation).
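Because a forgotten quote turns a category into a regex, it can help to build these clauses through one function that always quotes. The helper below is a hypothetical sketch; it refuses values containing quotes or backslashes rather than guessing how they should be escaped.

```python
def vocab_clause(field, value):
    """Build a quoted clause for page_category, page_type, or language."""
    if field in ("page_category", "page_type") and not value.startswith("/"):
        raise ValueError(f"{field} values begin with '/'")
    if '"' in value or "\\" in value:
        raise ValueError("quotes and backslashes are not handled by this sketch")
    return f'{field}:"{value}"'          # always quoted, so /... is never read as a regex


print(vocab_clause("page_type", "/Article/How_to"))   # page_type:"/Article/How_to"
print(vocab_clause("language", "zh-cn"))              # language:"zh-cn"
```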
Query validation
Firehose checks a rule's query when you create or update it and rejects an invalid one with 422
(see Errors & limits) instead of saving a rule that can't work.
A query is rejected when it:
- has a syntax error, or is empty;
- names an unknown field (the error suggests the closest real field);
- uses a wildcard or regex where it isn't allowed (wildcards work only on url and domain, regex only on url), or matches everything (*:* or field:*);
- can never match a page: a query made only of NOT clauses, or one whose only route to a match is a page_category, page_type, or language value that isn't in the vocabulary.
A misspelt category, type, or language value is reported with the closest valid value (for example
page_type:"/Artical" suggests /Article). If it sits in an OR beside a clause that can still
match, the rule is accepted with a warning; if it's the rule's only way to match, the rule is rejected.
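A similar "closest value" hint can be produced client-side before a rule is submitted. The sketch below uses Python's `difflib`; it is not how Firehose computes its suggestions, the `suggest` helper and the 0.6 similarity cutoff are assumptions, and `SAMPLE_PAGE_TYPES` is a tiny illustrative subset of the real vocabulary.

```python
import difflib

# Field names from the indexed-fields table.
KNOWN_FIELDS = [
    "added", "removed", "added_anchor", "removed_anchor", "title", "url",
    "domain", "publish_time", "page_category", "page_type", "language",
    "dr", "recent",
]

# Illustrative subset only; the full list lives in the canonical reference.
SAMPLE_PAGE_TYPES = ["/Article", "/Article/How_to", "/Document/White_Paper"]


def suggest(value, vocabulary):
    """Return the closest entry in vocabulary, or None if nothing is similar."""
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(suggest("titel", KNOWN_FIELDS))           # title
print(suggest("/Artical", SAMPLE_PAGE_TYPES))   # /Article
```

The server remains the authority: a local check like this only catches typos early and does not replace handling a 422 response.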
Next steps
Streaming (SSE)
Open the connection and receive matches from your rules.
Match payload
Every field on a delivered document.
Filters & domain lists
Save a fragment once and reuse it across many rules.