fix: validateSlug should only reject .. as a path component#31
Closed
orendi84 wants to merge 1 commit intogarrytan:masterfrom
Closed
fix: validateSlug should only reject .. as a path component#31orendi84 wants to merge 1 commit intogarrytan:masterfrom
.. as a path component#31orendi84 wants to merge 1 commit intogarrytan:masterfrom
Conversation
Closes a false positive in validateSlug() where filenames containing
a literal ellipsis ("...") were classified as path traversal.
## The bug
`validateSlug` uses a regex to reject path-traversal slugs:
if (!slug || /\.\./.test(slug) || /^\//.test(slug)) { ... }
The intent is clear from the comment above it ("reject empty, path
traversal (..), and leading slash"), but `/\.\./.test()` matches
the substring `..` anywhere in the slug, not just when `..` is a
complete path component. Any filename containing three or more
consecutive dots gets caught as a false positive:
- `ted-talks/i got 99 problems... palsy is just one`
- `ted-ed-originals/how turtle shells evolved... twice`
- `huberman-lab/how playing sports benefits your body ... and your brain`
These are legal filenames. Ellipsis in titles is common in YouTube
transcripts, TED talks, podcast episode titles, news headlines, and
any imported content where the original author used "..." as a
rhetorical device.
## How I found it
Importing a 7,910-file YouTube transcript corpus (TED, Huberman,
Lex Fridman, Lenny, etc.) into a single brain produced 64 failures,
all with `Invalid slug: ""`. A quick `grep '\.\.' missing.txt`
showed every failure contained a literal ellipsis. 1.2% silent loss
rate, 100% caused by the same regex mismatch.
Note that v0.5.1's `slugifyPath` / `slugifySegment` preserve ASCII
dots (`[^a-z0-9.\s_-]` keeps dots in the keep-set), so even with
the new slugify pipeline, ASCII `...` passes through to validateSlug
unchanged and still fails. The bug survives v0.5.1.
## The fix
Replace `/\.\./` with `/(^|\/)\.\.(\/|$)/`:
if (!slug || /(^|\/)\.\.(\/|$)/.test(slug) || /^\//.test(slug)) { ... }
This matches `..` only when it sits as a complete path component:
at the start of the slug, between slashes, or at the end. Real
traversal patterns are still rejected:
- `..` -> rejected (bare traversal token)
- `../etc/passwd` -> rejected (leading `..` component)
- `foo/..` -> rejected (trailing `..` component)
- `foo/../bar` -> rejected (middle `..` component)
- `foo/bar/../baz` -> rejected (middle `..` component)
- `notes/../../etc` -> rejected (still has `/..` component)
And legitimate filenames now pass:
- `ted-talks/... palsy is just one` -> accepted
- `channel/foo..bar` -> accepted (two dots inside a word)
- `notes/v1.0.0` -> accepted (already worked)
- `notes/file.name.md` -> accepted (already worked)
## Why not slugify the ellipsis away upstream?
Option considered: make `slugifySegment` strip or collapse consecutive
dots. Rejected because:
1. `v1.0.0`, `file.name.md`, and other legitimate multi-dot names
would be harder to preserve correctly. `v1.0.0` works today
precisely because no two dots touch.
2. Stripping ellipsis loses title fidelity. `I got 99 problems...
palsy is just one` becoming `I got 99 problems palsy is just one`
reads worse.
3. The real bug is in `validateSlug`: the comment says "path
traversal (..)" but the regex says "any two-dot substring."
Fix the regex to match the documented intent.
## Tests
Adds regression tests for:
- literal ellipsis in various channel/title combinations
- `..` embedded in the middle of a path component (legal)
- `..` as a full component at start, middle, and end (rejected)
All 34 tests in `test/slug-validation.test.ts` pass.
5 tasks
Owner
|
Included in the community fix wave (PR #38, v0.6.1)! Your regex fix was spot on, and the 7,910-file YouTube transcript corpus analysis was incredibly valuable for understanding the real-world impact. We refined the regex slightly (the outside voice caught that |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
validateSluginsrc/core/postgres-engine.tsuses/\.\./.test(slug)to reject path-traversal slugs, but the regex matches..as a substring anywhere in the slug. Any filename containing a literal ellipsis like"..."or". . ."gets caught as a false positive, even though the only real intent is to block navigation tokens like../fooorfoo/../bar.This PR replaces the substring check with
/(^|\/)\.\.(\/|$)/, which matches..only when it sits as a complete path component. Real traversal patterns are still rejected; filenames with ellipsis are now accepted.The bug in action
I found this while importing a 7,910-file YouTube transcript corpus (TED Talks, Huberman Lab, Lex Fridman, Lenny's Podcast, etc.) into a single brain. 64 imports failed, all with
Invalid slug: "". A quickgrep '\.\.' missing.txtshowed every single failing path contained an ellipsis. A 1.2% silent loss rate, 100% caused by the same regex mismatch.Real examples from the failure list:
ted-talks/i got 99 problems... palsy is just one - maysoon zayidted-ed-originals/how turtle shells evolved... twice - judy cebra thomashuberman-lab/how playing sports benefits your body ... and your brain - leah lagos and jaspal ricky singhlex-friedman/... and how we can save it - patricia medicited-talks/jane mcgonigal- massively multi-player... thumb-wrestlingThese are legitimate filenames. Ellipsis in titles is common in YouTube transcript exports, podcast episode titles, TED talk summaries, news headlines, and any imported content where the original author used "..." as a rhetorical device.
Why v0.5.1's slugify doesn't catch this
I checked whether the v0.5.1 path slugifier (
slugifyPath/slugifySegmentinsrc/core/sync.ts) already handles this. It doesn't. The keep-set is:ASCII dots are explicitly preserved so that legitimate dotted filenames like
v1.0.0andfile.name.mdsurvive. That means ASCII...passes through slugify unchanged and still failsvalidateSlug. The Unicode ellipsis…is stripped (it's not in the keep-set), but the three-ASCII-dot form is common in content-management exports and is what bit me in practice.The fix
function validateSlug(slug: string): void { // Git is the system of record — slugs are lowercased repo-relative paths. // Only reject empty, path traversal (..), and leading slash. - if (!slug || /\.\./.test(slug) || /^\//.test(slug)) { + // + // The path-traversal check matches `..` only as a complete path component + // (at the start, between slashes, or at the end). A previous version used + // /\.\./.test(slug) which also rejected any substring of two dots, so + // filenames containing a literal ellipsis like "I got 99 problems..." or + // "How turtle shells evolved... twice" failed to import. Those are legal + // path components; only `..` as a navigation token should be rejected. + if (!slug || /(^|\/)\.\.(\/|$)/.test(slug) || /^\//.test(slug)) { throw new Error(`Invalid slug: "${slug}". Slugs cannot be empty, start with /, or contain path traversal.`); } }One regex change plus a comment explaining the intent.
Still rejected (real path traversal)
..../etc/passwdfoo/..foo/../barfoo/bar/../baznotes/../../etcPreviously-rejected legitimate filenames (now accepted)
ted-talks/i got 99 problems... palsy is just oneted-ed-originals/how turtle shells evolved... twicenotes/this... that... the otherchannel/foo..barchannel/a..zUnchanged (still accepted)
notes/v1.0.0notes/file.name.mdpeople/sarah-chenapple-notes/2017-05-03 ohmygreennotes/日本語テストWhy not fix this in slugifyPath instead
Considered it and ruled it out:
"I got 99 problems... palsy is just one"into"I got 99 problems palsy is just one". That reads worse and drops signal from the original title.v1.0.0,file.name.md, version tags, IP addresses, and similar content would need special-case logic.validateSlug. The comment above the regex literally says "path traversal (..)" but the regex says "any two-dot substring." The fix is to make the code match the documented intent.The slugify pipeline is doing the right thing by preserving dots.
validateSlugis the one that over-rejects.Tests
Added regression tests in
test/slug-validation.test.tsunder the existingdescribe('validateSlug (widened for any filename chars)')block:accepts filenames with literal ellipsis— covers the exact failing casesaccepts \..` when embedded inside a path component—foo..bar,a..z`rejects \..` as a complete path component—..,foo/..,foo/../bar,foo/bar/../baz`The mirrored
validateSlugfunction in the test file is also updated to the new regex so the tests exercise the real logic. Full file runs 34 tests, 0 failures (bun test test/slug-validation.test.ts).Impact
Anyone importing content with ellipsis in filenames silently loses those files today. Common affected corpora: YouTube transcripts, TED talks, podcast episode exports, news clipping archives, many Apple Notes that use
...to leave thoughts unfinished. After this fix the same imports succeed end-to-end.I backfilled the 64 missing files locally after applying this patch and the corpus now imports cleanly at 7,910/7,910.