Skip to content

Conversation

@G0maa
Copy link
Collaborator

@G0maa G0maa commented Nov 27, 2025

Description of change

Currently typesnese index workflow fails a lot due to limited memory, this PR helps optimize memory usage a bit by:

  1. Deleting old (failed) collections before syncing.
  2. Deleting unused facets. Facets are used as parameters in search bar, we currently do not support that.

Fixes #11799

I've tested this on my own free-tier Typesense cloud deployment and it worked, but please note that this is not guaranteed to fix OUT_OF_MEMORY issues, we may eventually need to increase memory 😄

Pull-Request Checklist

  • Code is up-to-date with the master branch
  • This pull request links relevant issues as Fixes #00000
  • There are new or updated unit tests validating the change N\A
  • Documentation has been updated to reflect this change N\A

- Refactor workflow to delete temp collections.
- Delete unsued facets from docsearch scraper config.
@qodo-free-for-open-source-projects

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 Security concerns

Sensitive information exposure:
The script uses TYPESENSE_API_KEY from secrets, which is appropriate. However, curl commands with -s flag suppress errors but still output responses that might contain sensitive data in workflow logs. Consider using -f flag to fail on HTTP errors and redirect output to /dev/null for DELETE operations to minimize log exposure. Also, ensure that error messages from failed API calls don't leak the API key or other sensitive information in the workflow logs.

⚡ Recommended focus areas for review

Error Handling

The bash script lacks error handling for curl commands. If API requests fail (network issues, authentication problems), the script continues silently. Consider adding error checks after curl commands and using set -e to exit on errors.

ALIAS_COLLECTION=$(curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/aliases/typeorm-docs" \
  | jq -r '.collection_name')

if [ "$ALIAS_COLLECTION" = "null" ] || [ -z "$ALIAS_COLLECTION" ]; then
  echo "Alias does not exist; skipping collection cleanup."
  exit 0
fi

echo "Alias currently points to: $ALIAS_COLLECTION"

COLLECTIONS=$(curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections" \
  | jq -r '.[].name')

for col in $COLLECTIONS; do
  if [ "$col" != "$ALIAS_COLLECTION" ]; then
    echo "Deleting unaliased collection: $col"
    curl -s -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col"
  fi
Race Condition

There's a potential race condition between fetching the alias collection and deleting collections. If the alias changes between the two operations, the currently aliased collection could be deleted. Consider re-checking the alias before each deletion.

for col in $COLLECTIONS; do
  if [ "$col" != "$ALIAS_COLLECTION" ]; then
    echo "Deleting unaliased collection: $col"
    curl -s -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col"
  fi
done

@qodo-free-for-open-source-projects
Copy link

qodo-free-for-open-source-projects bot commented Nov 27, 2025

PR Code Suggestions ✨

Latest suggestions up to eecce94

CategorySuggestion                                                                                                                                    Impact
General
Add API error handling

Add error handling for the curl command that fetches the alias information. The
script should check the HTTP status code and exit if the API call fails.

.github/workflows/docsearch.yml [22-24]

-ALIAS_COLLECTION=$(curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
-  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/aliases/typeorm-docs" \
-  | jq -r '.collection_name')
+ALIAS_RESPONSE=$(curl -s -w "\n%{http_code}" -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
+  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/aliases/typeorm-docs")
+HTTP_CODE=$(echo "$ALIAS_RESPONSE" | tail -n1)
+ALIAS_COLLECTION=$(echo "$ALIAS_RESPONSE" | head -n -1 | jq -r '.collection_name')
 
+if [ "$HTTP_CODE" -ne 200 ] && [ "$HTTP_CODE" -ne 404 ]; then
+  echo "Error: Failed to fetch alias information (HTTP $HTTP_CODE)"
+  exit 1
+fi
+
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This is a valuable suggestion that adds crucial error handling for the initial API call. Failing to get the alias information correctly could lead to deleting the wrong collections, so exiting on failure makes the script much safer.

Medium
Add deletion error handling

Add error handling to the curl DELETE request. Check the HTTP status code and
log a warning if deleting a collection fails to prevent silent failures.

.github/workflows/docsearch.yml [37-43]

 for col in $COLLECTIONS; do
   if [ "$col" != "$ALIAS_COLLECTION" ]; then
     echo "Deleting unaliased collection: $col"
-    curl -s -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
-      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col"
+    RESPONSE=$(curl -s -w "\n%{http_code}" -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
+      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col")
+    HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
+    if [ "$HTTP_CODE" -ne 200 ] && [ "$HTTP_CODE" -ne 404 ]; then
+      echo "Warning: Failed to delete collection $col (HTTP $HTTP_CODE)"
+    fi
   fi
 done
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that the curl command silences errors, and adds robust error handling by checking the HTTP status code, which improves the reliability of the cleanup script.

Medium
  • More

Previous suggestions

Suggestions up to commit 17ab414
CategorySuggestion                                                                                                                                    Impact
Possible issue
Filter collections by name pattern

Filter the list of collections to be deleted to only include those starting with
the typeorm-docs prefix. This prevents the accidental deletion of unrelated
collections that might exist in the same Typesense instance.

.github/workflows/docsearch.yml [33-37]

 COLLECTIONS=$(curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
   "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections" \
-  | jq -r '.[].name')
+  | jq -r '.[].name | select(startswith("typeorm-docs"))')
 
 for col in $COLLECTIONS; do
Suggestion importance[1-10]: 9

__

Why: This suggestion addresses a critical flaw where the script could delete unrelated collections in a shared Typesense instance. Filtering collections by a specific prefix (typeorm-docs) is a crucial safeguard against accidental data loss.

High
General
Add deletion error handling

Add error handling to the curl command that deletes Typesense collections. Check
the HTTP response code to ensure the deletion was successful and log an error if
it fails.

.github/workflows/docsearch.yml [37-43]

 for col in $COLLECTIONS; do
   if [ "$col" != "$ALIAS_COLLECTION" ]; then
     echo "Deleting unaliased collection: $col"
-    curl -s -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
-      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col"
+    RESPONSE=$(curl -s -w "\n%{http_code}" -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
+      "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col")
+    HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
+    if [ "$HTTP_CODE" != "200" ]; then
+      echo "Failed to delete collection $col (HTTP $HTTP_CODE)"
+    fi
   fi
 done
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that the curl command to delete collections lacks error handling, which could lead to silent failures. Adding a check for the HTTP status code makes the cleanup script more robust and easier to debug.

Medium
Add API error handling

Add explicit error handling for the curl command that fetches the alias
information. Check the HTTP status code and exit with an error if the API call
fails for reasons other than a 404 Not Found.

.github/workflows/docsearch.yml [22-24]

-ALIAS_COLLECTION=$(curl -s -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
-  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/aliases/typeorm-docs" \
-  | jq -r '.collection_name')
+ALIAS_RESPONSE=$(curl -s -w "\n%{http_code}" -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
+  "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/aliases/typeorm-docs")
+HTTP_CODE=$(echo "$ALIAS_RESPONSE" | tail -n1)
+ALIAS_COLLECTION=$(echo "$ALIAS_RESPONSE" | head -n-1 | jq -r '.collection_name')
 
+if [ "$HTTP_CODE" != "200" ] && [ "$HTTP_CODE" != "404" ]; then
+  echo "Failed to fetch alias information (HTTP $HTTP_CODE)"
+  exit 1
+fi
+
Suggestion importance[1-10]: 6

__

Why: The suggestion improves the script's robustness by adding explicit error handling for the API call that fetches the alias. While the existing code has a check that implicitly handles some failures, this change makes the error handling more explicit and reliable, causing the job to fail on unexpected API errors.

Low
Suggestions up to commit 16b9b5b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Use a while loop for safer iteration

Replace the for loop with a while read loop to safely iterate over collection
names. This prevents potential issues if collection names contain spaces or
special characters.

.github/workflows/docsearch.yml [37-43]

-for col in $COLLECTIONS; do
-  if [ "$col" != "$ALIAS_COLLECTION" ]; then
+echo "$COLLEctions" | while IFS= read -r col; do
+  if [ -n "$col" ] && [ "$col" != "$ALIAS_COLLECTION" ]; then
     echo "Deleting unaliased collection: $col"
     curl -s -X DELETE -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" \
       "$TYPESENSE_PROTOCOL://$TYPESENSE_HOST:$TYPESENSE_PORT/collections/$col"
   fi
 done
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential bug where collection names with spaces would be processed incorrectly due to shell word splitting, and proposes a more robust iteration method.

Medium

@G0maa G0maa requested review from gioboa and pkuczynski and removed request for gioboa November 27, 2025 08:11
@G0maa G0maa changed the title fix: typesense fix: typesense doc sync Nov 27, 2025
Copy link
Collaborator

@gioboa gioboa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @G0maa it looks fine to me. Let's try this step in the main pipeline. 💪

@G0maa G0maa requested review from alumni, dlhck and mguida22 November 28, 2025 08:27
@alumni alumni requested a review from naorpeled November 28, 2025 10:28
Copy link
Collaborator

@alumni alumni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'm not familiar with Typesense and our account setup. @dlhck and @naorpeled would be the best candidates to review :)

@dlhck dlhck merged commit d0b5454 into typeorm:master Nov 28, 2025
10 checks passed
@alumni
Copy link
Collaborator

alumni commented Nov 28, 2025

Nice @G0maa, seems the indexing is now running without failing 👏🏻

Does anyone have any idea how the docs are published? Seems that on typeorm.io the docs stopped being published sometime after v0.3.27. Is that right?

Also: is Typesense indexing our markdown files or the website?

Maybe in the next meeting we can discuss docs too - we should talk about publishing + indexing for 2-3 versions (0.3, master and next).

@G0maa
Copy link
Collaborator Author

G0maa commented Nov 28, 2025

@alumni

Nice @G0maa, seems the indexing is now running without failing 👏🏻

For now 🙈

Does anyone have any idea how the docs are published? Seems that on typeorm.io the docs stopped being published sometime after v0.3.27. Is that right?

Sadly no idea.

Also: is Typesense indexing our markdown files or the website?

Based on these lines it's indexing the website. I think I saw it somewhere that Typesense supports indexing docusaurus directly (but will have to verify).

Maybe in the next meeting we can discuss docs too - we should talk about publishing + indexing for 2-3 versions (0.3, master and next).

Totally agree.

Copy link
Collaborator

@gioboa gioboa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typesense-docsearch-scraper is the command for indexing the website. we have it in the workflow.
In the latest run there is the out of memory error as well.

@G0maa
Copy link
Collaborator Author

G0maa commented Nov 28, 2025

In the latest run there is the out of memory error as well

That's not good 😅

@alumni
Copy link
Collaborator

alumni commented Nov 28, 2025

@gioboa I noticed that, but it's in the part where it displays the recent logs (so I though that log might be from the previous runs, but I might be wrong).

@gioboa
Copy link
Collaborator

gioboa commented Nov 28, 2025

@gioboa I noticed that, but it's in the part where it displays the recent logs (so I though that log might be from the previous runs, but I might be wrong).

I run the task again, same error. :(

@G0maa
Copy link
Collaborator Author

G0maa commented Nov 28, 2025

Hmmmm... I'll try to figure out a workaround for this.

I believe we shouldn't need to increase memory, TypeORM docs aren't that massive.
Although I'm unaware what increases memory usage, collection size is a factor, but what about e.g. searches happening while writing a new collection? 🤔

Going through Typesense docs there's a way to limit the batch_size see. So I might modify that next, say 10 instead of 40.

@alumni
Copy link
Collaborator

alumni commented Nov 28, 2025

@G0maa that might actually be a good idea, since we go OOM, not out of disk space. Some of the pages are long (but still, they should not take MBs).

ThbltLmr pushed a commit to ThbltLmr/typeorm that referenced this pull request Dec 2, 2025
Co-authored-by: Giorgio Boa <35845425+gioboa@users.noreply.github.com>
Copy link
Collaborator

@gioboa gioboa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I increased the typesense RAM, it's working fine now

Image

mgohin pushed a commit to mgohin/typeorm that referenced this pull request Jan 15, 2026
Co-authored-by: Giorgio Boa <35845425+gioboa@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Index Docs To Typesense" workflow should delete old/errored collections by default

5 participants