New script to break markdown in to Algolia search records and publish.#2961

Merged
NWylynko merged 27 commits into main from nick/algolia-search-records
Jan 16, 2026

Conversation

@NWylynko
Contributor

@NWylynko NWylynko commented Jan 13, 2026

🔎 Previews:

What does this solve?

  • The current Algolia setup runs a crawler after changes are deployed live, and it has several issues:
    • The editor on the Algolia dashboard has recently, and seemingly without cause, been failing to validate / run the crawler
    • The crawler fails to crawl docs, and because of the issue above this can't be debugged
    • The current setup uses git to try to guess which files changed recently so they can be re-crawled

What changed?

  • Converts the crawler from building records out of rendered HTML to building them directly from markdown
  • Moves the crawler logic into a script under source control in our repo, instead of being managed in the Algolia dashboard
  • Makes updating the search records a step of docs merges / deployments

@vercel

vercel bot commented Jan 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project     Deployment  Review   Updated (UTC)
clerk-docs  Ready       Preview  Jan 16, 2026 7:22pm


@NWylynko NWylynko marked this pull request as ready for review January 14, 2026 20:33
@NWylynko NWylynko requested a review from a team as a code owner January 14, 2026 20:33
@NWylynko NWylynko requested a review from manovotny January 14, 2026 20:33
}

function getGitBranch(): string {
  return 'main'
Contributor

I don't think we want this hardcoded return. Won't this delete main records from a preview branch?

Contributor Author

Yes, this is just temporary so the preview deployment works; it will be removed before merge.
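
For illustration, here is a minimal sketch of how the branch could be resolved from the environment rather than hardcoded. It is not the code that was merged. `VERCEL_GIT_COMMIT_REF` is the system variable Vercel exposes for Git-triggered deployments; `ALGOLIA_BRANCH` is a made-up override name used only in this example, and failing loudly instead of falling back to 'main' is a design assumption.

```typescript
// Hypothetical sketch: resolve the branch from the environment instead of
// returning a hardcoded 'main'. Not the implementation from this PR.
type Env = Record<string, string | undefined>

function getGitBranch(env: Env): string {
  // ALGOLIA_BRANCH is an invented override for this example only.
  // VERCEL_GIT_COMMIT_REF is set by Vercel on Git-triggered deployments.
  const branch = (env.ALGOLIA_BRANCH || env.VERCEL_GIT_COMMIT_REF || '').trim()

  // Refuse to guess: silently defaulting to 'main' from a preview build
  // is exactly the case the review comment warns about.
  if (branch === '') {
    throw new Error('Could not determine the git branch for search records')
  }
  return branch
}
```

In a Node script this would be called as `getGitBranch(process.env)`, so a preview build only ever touches records tagged with its own branch.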

Allows running the script without pushing changes to Algolia.
In dry run mode, outputs records to .algolia/records.json and
stale record IDs to .algolia/stale.json for inspection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
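
The dry-run behaviour described in this commit message can be sketched roughly as follows. Only the two output paths come from the commit message; the function names, the injected `writeFile` / `push` callbacks, and the rule "an existing ID that this run did not produce is stale" are assumptions made for the example.

```typescript
// Hypothetical sketch of a dry-run switch. Names and the staleness rule are
// assumptions; only the two .algolia/*.json paths come from the commit message.
interface SearchRecord {
  objectID: string
  record_batch: string
}

interface PublishOptions {
  dryRun: boolean
  records: SearchRecord[]
  existingIds: string[] // objectIDs currently in the index for this branch
  writeFile: (path: string, contents: string) => void
  push: (records: SearchRecord[], staleIds: string[]) => void
}

// Assumption: an ID already in the index that this run did not produce is stale.
function findStaleIds(existingIds: string[], fresh: SearchRecord[]): string[] {
  const freshIds = new Set(fresh.map((record) => record.objectID))
  return existingIds.filter((id) => !freshIds.has(id))
}

function publish(options: PublishOptions): string[] {
  const staleIds = findStaleIds(options.existingIds, options.records)

  if (options.dryRun) {
    // Write both lists to disk for inspection and never touch the index.
    options.writeFile('.algolia/records.json', JSON.stringify(options.records, null, 2))
    options.writeFile('.algolia/stale.json', JSON.stringify(staleIds, null, 2))
    return staleIds
  }

  options.push(options.records, staleIds)
  return staleIds
}
```

Injecting `writeFile` keeps the sketch free of Node imports; a real script could pass `fs.writeFileSync` there.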
@NWylynko
Contributor Author

There are 7961 records with "content": null. Should we skip adding these to the index?

Example record:

{
    "objectID": "main-0-/docs/maintenance-mode#main",
    "branch": "main",
    "anchor": "main",
    "content": null,
    "type": "lvl1",
    "keywords": [],
    "availableSDKs": [
      "all"
    ],
    "canonical": "/docs/maintenance-mode",
    "weight": {
      "pageRank": 0,
      "level": 90,
      "position": 0
    },
    "hierarchy": {
      "lvl0": "Documentation",
      "lvl1": "Maintenance Mode",
      "lvl2": null,
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "distinct_group": "/docs/maintenance-mode#main",
    "record_batch": "729f829e-2658-48ab-b45e-e1e6274eff31"
  },

A record with "content": null is expected: every heading in the docs, including the title, has null content, and Algolia uses the hierarchy object as the searchable content.

@NWylynko
Contributor Author

Tooltips are getting duplicated/smooshed together.
[Screenshot: CleanShot 2026-01-14 at 16 46 05@2x]

Example record:

{
    "objectID": "main-6-/docs/guides/customizing-clerk/email-sms-templates#revolvapp-wysiwyg-email-editor-plugin",
    "branch": "main",
    "anchor": "revolvapp-wysiwyg-email-editor-plugin",
    "content": "The email editor uses the Revolvapp WYSIWYGWYSIWYG stands for 'What You See Is What You Get'. This term is used for editors and design tools that allow you to create content or layouts in a visual manner (without requiring you to edit the underlying markup) so that you can instantly see how the result will display to your users. email template editor plugin by Imperavi. To acquaint yourself with the template markup syntax, consult Imperavi's docs.",
    "type": "content",
    "keywords": [],
    "availableSDKs": [
      "all"
    ],
    "canonical": "/docs/guides/customizing-clerk/email-sms-templates",
    "weight": {
      "pageRank": 0,
      "level": 0,
      "position": 6
    },
    "hierarchy": {
      "lvl0": "Documentation",
      "lvl1": "Email and SMS templates",
      "lvl2": "Before you start",
      "lvl3": "Revolvapp WYSIWYG email editor plugin",
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "distinct_group": "/docs/guides/customizing-clerk/email-sms-templates#revolvapp-wysiwyg-email-editor-plugin",
    "record_batch": "729f829e-2658-48ab-b45e-e1e6274eff31"
  },

Since tooltip content can't be seen directly on the screen, I decided to just remove it from the record. My other thought was to create a separate record containing only the tooltip content, but I think that would confuse people.
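
As a rough sketch of that decision: if a tooltip is written in the markdown source as a link whose target starts with `!`, for example `[WYSIWYG](!wysiwyg)`, the trigger text can be kept and the tooltip reference dropped before the record is built. That syntax is an assumption made for this example, as is the function name; this is not the code from the PR.

```typescript
// Hypothetical sketch. Assumes tooltips are authored as [label](!tooltip-id);
// that syntax and this function name are assumptions, not taken from the PR.
function stripTooltips(markdown: string): string {
  // Keep the visible label, drop the tooltip reference. Ordinary links are
  // left alone because their target does not start with '!'.
  return markdown.replace(/\[([^\]]+)\]\(![^)\s]+\)/g, '$1')
}
```

With this approach the visible word appears exactly once in the record content and the tooltip body is never inlined.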

@NWylynko
Contributor Author

Did we want to index properties? I can't remember what we decided in our team meeting.

Example record:

{
    "objectID": "main-4-/docs/fastify/guides/customizing-clerk/appearance-prop/variables#properties",
    "branch": "main",
    "anchor": "properties",
    "content": "string",
    "type": "content",
    "keywords": [],
    "availableSDKs": [
      "astro",
      "chrome-extension",
      "expo",
      "nextjs",
      "nuxt",
      "react",
      "react-router",
      "remix",
      "tanstack-react-start",
      "vue",
      "js-frontend",
      "fastify",
      "expressjs",
      "js-backend",
      "go",
      "ruby"
    ],
    "canonical": "/docs/:sdk:/guides/customizing-clerk/appearance-prop/variables",
    "weight": {
      "pageRank": 0,
      "level": 0,
      "position": 4
    },
    "hierarchy": {
      "lvl0": "Documentation",
      "lvl1": "Variables prop",
      "lvl2": "Properties",
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "distinct_group": "/docs/:sdk:/guides/customizing-clerk/appearance-prop/variables#properties",
    "record_batch": "729f829e-2658-48ab-b45e-e1e6274eff31"
  },

Example record pulling in complex types:

{
    "objectID": "main-127-/docs/guides/customizing-clerk/elements/reference/common#properties-8",
    "branch": "main",
    "anchor": "properties-8",
    "content": "({ value, status }: { value: string; status: 'none' | 'selected' | 'cursor' | 'hovered' }) => React.ReactNode",
    "type": "content",
    "keywords": [],
    "availableSDKs": [
      "all"
    ],
    "canonical": "/docs/guides/customizing-clerk/elements/reference/common",
    "weight": {
      "pageRank": 0,
      "level": 0,
      "position": 127
    },
    "hierarchy": {
      "lvl0": "Documentation",
      "lvl1": "Common components",
      "lvl2": "<Input>",
      "lvl3": "<Input type=\"otp\">",
      "lvl4": "Properties",
      "lvl5": null,
      "lvl6": null
    },
    "distinct_group": "/docs/guides/customizing-clerk/elements/reference/common#properties-8",
    "record_batch": "729f829e-2658-48ab-b45e-e1e6274eff31"
  },

I've combined the records that each property was being split into, so each property in a properties table now gets just one record, the same as the production records.

I'm wondering if we want to weigh these records down so they need to be searched for specifically.

Example record:

{
    "objectID": "main-4-/docs/nuxt/guides/customizing-clerk/appearance-prop/variables#properties",
    "branch": "main",
    "anchor": "properties",
    "content": "colorDanger (string) - The color used for error states. CSS variable: --clerk-color-danger",
    "type": "content",
    "keywords": [],
    "availableSDKs": [
      "astro",
      "chrome-extension",
      "expo",
      "nextjs",
      "nuxt",
      "react",
      "react-router",
      "remix",
      "tanstack-react-start",
      "vue",
      "js-frontend",
      "fastify",
      "expressjs",
      "js-backend",
      "go",
      "ruby"
    ],
    "canonical": "/docs/:sdk:/guides/customizing-clerk/appearance-prop/variables",
    "weight": {
      "pageRank": 0,
      "level": 0,
      "position": 4
    },
    "hierarchy": {
      "lvl0": "Documentation",
      "lvl1": "Variables prop",
      "lvl2": "Properties",
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "distinct_group": "/docs/:sdk:/guides/customizing-clerk/appearance-prop/variables#properties",
    "record_batch": "625815f7-9153-4b43-a966-c06e2ebb0b8e"
  },
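
The "content" string in the combined record above could be produced by something like the following. The output shape `name (type) - description` is read off that example record; the `PropertyRow` interface and the function name are hypothetical.

```typescript
// Hypothetical sketch: fold one row of a properties table into a single
// content string, shaped like the combined example record in this thread.
interface PropertyRow {
  name: string
  type: string
  description: string
}

function propertyToContent(row: PropertyRow): string {
  // Collapse line breaks and repeated spaces so the record stays on one line.
  const description = row.description.replace(/\s+/g, ' ').trim()
  return `${row.name} (${row.type}) - ${description}`
}
```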

@NWylynko NWylynko requested a review from manovotny January 15, 2026 15:09
@manovotny
Contributor

> A record with "content": null is expected: every heading in the docs, including the title, has null content, and Algolia uses the hierarchy object as the searchable content.

Ah, gotcha. This could be a new schema improvement, but thanks for helping me understand how it currently works.
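
To make the current behaviour concrete, here is a simplified sketch of splitting markdown into records, where every heading becomes a record with null content and carries the running hierarchy. It assumes blocks are separated by blank lines and headings use ATX (`#`) syntax; the slug rule and the function names are invented for the example, and the real script handles far more (MDX components, SDK scoping, weights).

```typescript
// Hypothetical, simplified sketch of markdown -> records. Not the PR's script.
type Level = 'lvl0' | 'lvl1' | 'lvl2' | 'lvl3' | 'lvl4' | 'lvl5' | 'lvl6'
type Hierarchy = Record<Level, string | null>

interface SearchRecord {
  anchor: string
  content: string | null
  type: Level | 'content'
  position: number
  hierarchy: Hierarchy
}

// Assumed slug rule, for illustration only.
const slugify = (text: string): string =>
  text.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '')

function markdownToRecords(markdown: string): SearchRecord[] {
  const hierarchy: Hierarchy = {
    lvl0: 'Documentation', lvl1: null, lvl2: null, lvl3: null, lvl4: null, lvl5: null, lvl6: null,
  }
  const records: SearchRecord[] = []
  let anchor = 'main'

  for (const block of markdown.split(/\n\s*\n/)) {
    const text = block.trim()
    if (text === '') continue

    const heading = /^(#{1,6})\s+(.+)$/.exec(text)
    if (heading) {
      const depth = heading[1].length
      hierarchy[`lvl${depth}` as Level] = heading[2]
      // A new heading resets every deeper level.
      for (let d = depth + 1; d <= 6; d++) hierarchy[`lvl${d}` as Level] = null
      anchor = depth === 1 ? 'main' : slugify(heading[2])
      // Heading records have no content; the hierarchy is what gets searched.
      records.push({ anchor, content: null, type: `lvl${depth}` as Level, position: records.length, hierarchy: { ...hierarchy } })
    } else {
      records.push({ anchor, content: text.replace(/\s+/g, ' '), type: 'content', position: records.length, hierarchy: { ...hierarchy } })
    }
  }
  return records
}
```

Copying the hierarchy into each record (`{ ...hierarchy }`) matters: without the copy, later headings would rewrite the hierarchy of records that were already emitted.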

@manovotny
Contributor

> Since tooltip content can't be seen directly on the screen, I decided to just remove it from the record. My other thought was to create a separate record containing only the tooltip content, but I think that would confuse people.

Just to make sure we're on the same page, I would expect WYSIWYGWYSIWYG to become WYSIWYG.

@manovotny
Contributor

> I've combined the records that each property was being split into, so each property in a properties table now gets just one record, the same as the production records.

Much better!

> I'm wondering if we want to weigh these records down so they need to be searched for specifically.

I would say yes. They could be helpful, but they're also going to add a lot of noise. I'd deprioritize.
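
One possible way to deprioritize them, sketched under assumptions: give property records a lower `weight.pageRank` than ordinary content. This only has an effect if the index's custom ranking sorts on `weight.pageRank` (as the default DocSearch configuration does); the constant, the `isProperty` flag, and the function name are hypothetical.

```typescript
// Hypothetical sketch: push property-table records below regular content.
// Assumes the index ranks by weight.pageRank (descending) as a tie-breaker.
interface Weight {
  pageRank: number
  level: number
  position: number
}

interface SearchRecord {
  objectID: string
  weight: Weight
}

// Invented value; anything below the default of 0 would do.
const PROPERTY_PAGE_RANK = -10

function deprioritizeProperty(record: SearchRecord, isProperty: boolean): SearchRecord {
  if (!isProperty) return record
  // Return a copy so the caller's record is left untouched.
  return { ...record, weight: { ...record.weight, pageRank: PROPERTY_PAGE_RANK } }
}
```

Because custom ranking acts after textual relevance, a query that matches a property name closely should still be able to surface it, which fits the goal of "searched for specifically".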

@NWylynko NWylynko merged commit 9dc9328 into main Jan 16, 2026
8 checks passed
@NWylynko NWylynko deleted the nick/algolia-search-records branch January 16, 2026 19:22
NWylynko added a commit that referenced this pull request Jan 16, 2026
#2961)

Co-authored-by: Michael Novotny <manovotny@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
NWylynko added a commit that referenced this pull request Jan 16, 2026
#2961)

Co-authored-by: Michael Novotny <manovotny@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>