Skip to content

fix: validateSlug should only reject .. as a path component#31

Closed
orendi84 wants to merge 1 commit intogarrytan:masterfrom
orendi84:fix/slug-validator-path-component
Closed

fix: validateSlug should only reject .. as a path component#31
orendi84 wants to merge 1 commit intogarrytan:masterfrom
orendi84:fix/slug-validator-path-component

Conversation

@orendi84
Copy link
Copy Markdown
Contributor

Summary

validateSlug in src/core/postgres-engine.ts uses /\.\./.test(slug) to reject path-traversal slugs, but the regex matches .. as a substring anywhere in the slug. Any filename containing a literal ellipsis like "..." or ". . ." gets caught as a false positive, even though the only real intent is to block navigation tokens like ../foo or foo/../bar.

This PR replaces the substring check with /(^|\/)\.\.(\/|$)/, which matches .. only when it sits as a complete path component. Real traversal patterns are still rejected; filenames with ellipsis are now accepted.

The bug in action

I found this while importing a 7,910-file YouTube transcript corpus (TED Talks, Huberman Lab, Lex Fridman, Lenny's Podcast, etc.) into a single brain. 64 imports failed, all with Invalid slug: "". A quick grep '\.\.' missing.txt showed every single failing path contained an ellipsis. A 1.2% silent loss rate, 100% caused by the same regex mismatch.

Real examples from the failure list:

  • ted-talks/i got 99 problems... palsy is just one - maysoon zayid
  • ted-ed-originals/how turtle shells evolved... twice - judy cebra thomas
  • huberman-lab/how playing sports benefits your body ... and your brain - leah lagos and jaspal ricky singh
  • lex-friedman/... and how we can save it - patricia medici
  • ted-talks/jane mcgonigal- massively multi-player... thumb-wrestling

These are legitimate filenames. Ellipsis in titles is common in YouTube transcript exports, podcast episode titles, TED talk summaries, news headlines, and any imported content where the original author used "..." as a rhetorical device.

Why v0.5.1's slugify doesn't catch this

I checked whether the v0.5.1 path slugifier (slugifyPath / slugifySegment in src/core/sync.ts) already handles this. It doesn't. The keep-set is:

.replace(/[^a-z0-9.\s_-]/g, '')

ASCII dots are explicitly preserved so that legitimate dotted filenames like v1.0.0 and file.name.md survive. That means ASCII ... passes through slugify unchanged and still fails validateSlug. The Unicode ellipsis is stripped (it's not in the keep-set), but the three-ASCII-dot form is common in content-management exports and is what bit me in practice.

The fix

 function validateSlug(slug: string): void {
   // Git is the system of record — slugs are lowercased repo-relative paths.
   // Only reject empty, path traversal (..), and leading slash.
-  if (!slug || /\.\./.test(slug) || /^\//.test(slug)) {
+  //
+  // The path-traversal check matches `..` only as a complete path component
+  // (at the start, between slashes, or at the end). A previous version used
+  // /\.\./.test(slug) which also rejected any substring of two dots, so
+  // filenames containing a literal ellipsis like "I got 99 problems..." or
+  // "How turtle shells evolved... twice" failed to import. Those are legal
+  // path components; only `..` as a navigation token should be rejected.
+  if (!slug || /(^|\/)\.\.(\/|$)/.test(slug) || /^\//.test(slug)) {
     throw new Error(`Invalid slug: "${slug}". Slugs cannot be empty, start with /, or contain path traversal.`);
   }
 }

One regex change plus a comment explaining the intent.

Still rejected (real path traversal)

Input Old New
.. reject reject
../etc/passwd reject reject
foo/.. reject reject
foo/../bar reject reject
foo/bar/../baz reject reject
notes/../../etc reject reject

Previously-rejected legitimate filenames (now accepted)

Input Old New
ted-talks/i got 99 problems... palsy is just one reject accept
ted-ed-originals/how turtle shells evolved... twice reject accept
notes/this... that... the other reject accept
channel/foo..bar reject accept
channel/a..z reject accept

Unchanged (still accepted)

Input Old New
notes/v1.0.0 accept accept
notes/file.name.md accept accept
people/sarah-chen accept accept
apple-notes/2017-05-03 ohmygreen accept accept
notes/日本語テスト accept accept

Why not fix this in slugifyPath instead

Considered it and ruled it out:

  1. Loses title fidelity. Stripping ellipsis turns "I got 99 problems... palsy is just one" into "I got 99 problems palsy is just one". That reads worse and drops signal from the original title.
  2. Legitimate multi-dot names get harder to preserve. v1.0.0, file.name.md, version tags, IP addresses, and similar content would need special-case logic.
  3. The real bug is in validateSlug. The comment above the regex literally says "path traversal (..)" but the regex says "any two-dot substring." The fix is to make the code match the documented intent.

The slugify pipeline is doing the right thing by preserving dots. validateSlug is the one that over-rejects.

Tests

Added regression tests in test/slug-validation.test.ts under the existing describe('validateSlug (widened for any filename chars)') block:

  • accepts filenames with literal ellipsis — covers the exact failing cases
  • accepts \..` when embedded inside a path componentfoo..bar, a..z`
  • rejects \..` as a complete path component.., foo/.., foo/../bar, foo/bar/../baz`

The mirrored validateSlug function in the test file is also updated to the new regex so the tests exercise the real logic. Full file runs 34 tests, 0 failures (bun test test/slug-validation.test.ts).

Impact

Anyone importing content with ellipsis in filenames silently loses those files today. Common affected corpora: YouTube transcripts, TED talks, podcast episode exports, news clipping archives, many Apple Notes that use ... to leave thoughts unfinished. After this fix the same imports succeed end-to-end.

I backfilled the 64 missing files locally after applying this patch and the corpus now imports cleanly at 7,910/7,910.

Closes a false positive in validateSlug() where filenames containing
a literal ellipsis ("...") were classified as path traversal.

## The bug

`validateSlug` uses a regex to reject path-traversal slugs:

    if (!slug || /\.\./.test(slug) || /^\//.test(slug)) { ... }

The intent is clear from the comment above it ("reject empty, path
traversal (..), and leading slash"), but `/\.\./.test()` matches
the substring `..` anywhere in the slug, not just when `..` is a
complete path component. Any filename containing three or more
consecutive dots gets caught as a false positive:

- `ted-talks/i got 99 problems... palsy is just one`
- `ted-ed-originals/how turtle shells evolved... twice`
- `huberman-lab/how playing sports benefits your body ... and your brain`

These are legal filenames. Ellipsis in titles is common in YouTube
transcripts, TED talks, podcast episode titles, news headlines, and
any imported content where the original author used "..." as a
rhetorical device.

## How I found it

Importing a 7,910-file YouTube transcript corpus (TED, Huberman,
Lex Fridman, Lenny, etc.) into a single brain produced 64 failures,
all with `Invalid slug: ""`. A quick `grep '\.\.' missing.txt`
showed every failure contained a literal ellipsis. 1.2% silent loss
rate, 100% caused by the same regex mismatch.

Note that v0.5.1's `slugifyPath` / `slugifySegment` preserve ASCII
dots (`[^a-z0-9.\s_-]` keeps dots in the keep-set), so even with
the new slugify pipeline, ASCII `...` passes through to validateSlug
unchanged and still fails. The bug survives v0.5.1.

## The fix

Replace `/\.\./` with `/(^|\/)\.\.(\/|$)/`:

    if (!slug || /(^|\/)\.\.(\/|$)/.test(slug) || /^\//.test(slug)) { ... }

This matches `..` only when it sits as a complete path component:
at the start of the slug, between slashes, or at the end. Real
traversal patterns are still rejected:

- `..`                     -> rejected (bare traversal token)
- `../etc/passwd`           -> rejected (leading `..` component)
- `foo/..`                  -> rejected (trailing `..` component)
- `foo/../bar`              -> rejected (middle `..` component)
- `foo/bar/../baz`          -> rejected (middle `..` component)
- `notes/../../etc`         -> rejected (still has `/..` component)

And legitimate filenames now pass:

- `ted-talks/... palsy is just one`  -> accepted
- `channel/foo..bar`                 -> accepted (two dots inside a word)
- `notes/v1.0.0`                     -> accepted (already worked)
- `notes/file.name.md`               -> accepted (already worked)

## Why not slugify the ellipsis away upstream?

Option considered: make `slugifySegment` strip or collapse consecutive
dots. Rejected because:

1. `v1.0.0`, `file.name.md`, and other legitimate multi-dot names
   would be harder to preserve correctly. `v1.0.0` works today
   precisely because no two dots touch.
2. Stripping ellipsis loses title fidelity. `I got 99 problems...
   palsy is just one` becoming `I got 99 problems palsy is just one`
   reads worse.
3. The real bug is in `validateSlug`: the comment says "path
   traversal (..)" but the regex says "any two-dot substring."
   Fix the regex to match the documented intent.

## Tests

Adds regression tests for:
- literal ellipsis in various channel/title combinations
- `..` embedded in the middle of a path component (legal)
- `..` as a full component at start, middle, and end (rejected)

All 34 tests in `test/slug-validation.test.ts` pass.
@garrytan
Copy link
Copy Markdown
Owner

Included in the community fix wave (PR #38, v0.6.1)! Your regex fix was spot on, and the 7,910-file YouTube transcript corpus analysis was incredibly valuable for understanding the real-world impact. We refined the regex slightly (the outside voice caught that $ needed to be outside the capture group) but your diagnosis drove the whole fix. Thank you @orendi84! 🙌

@garrytan garrytan closed this Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants