
issue: S3 Vectors has_collection() fails for indexes beyond first page due to missing pagination #19233

@shargyle

Description

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.36

Ollama Version (if applicable)

No response

Operating System

AWS ECS

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

When using S3 Vectors for RAG storage, the has_collection() method should correctly detect if a vector index exists, regardless of how many total indexes exist in the bucket.

When uploading a file and then querying it:

  1. File uploads successfully
  2. Vector index is created
  3. RAG queries can find and use the indexed content

Actual Behavior

When the S3 Vectors bucket contains more than approximately 1,000 indexes:

  • Newly created indexes are reported as "not found"
  • RAG queries return "No sources found" despite successful index creation
  • Logs show: WARNING | Collection 'file-<uuid>' does not exist
  • The issue persists even 15+ minutes after index creation

Pattern observed:

  • ✅ Below ~1,000 total indexes: Retrieval works correctly (collection is found)
  • ❌ Above ~1,000 total indexes: Collections beyond the first ~1,000 are not found
  • ✅ Direct AWS API calls confirm indexes exist

Steps to Reproduce

Environment Setup

Platform: AWS ECS Fargate (or any Docker deployment)
Open WebUI Version: v0.6.36
Docker Image: ghcr.io/open-webui/open-webui:v0.6.36

Required Environment Variables:

# S3 Vectors configuration (only the relevant settings)
VECTOR_DB=s3vector
AWS_S3_VECTOR_BUCKET_NAME=your-vector-bucket-name
AWS_REGION=us-west-2  # or your preferred region
AWS_ACCESS_KEY_ID=<your-access-key>
AWS_SECRET_ACCESS_KEY=<your-secret-key>

# Standard configs (database, API keys, etc.)
DATABASE_URL=postgresql://user:pass@db-host:5432/openwebui

Docker Run Command:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e VECTOR_DB=s3vector \
  -e AWS_S3_VECTOR_BUCKET_NAME=your-vector-bucket-name \
  -e AWS_REGION=us-west-2 \
  -e AWS_ACCESS_KEY_ID=<your-access-key> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-key> \
  -e DATABASE_URL=postgresql://user:pass@db-host:5432/openwebui \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:v0.6.36

AWS S3 Vectors Bucket Setup:

# Create the S3 vector bucket (if it does not already exist)
aws s3vectors create-vector-bucket \
  --vector-bucket-name your-vector-bucket-name \
  --region us-west-2

IAM Permissions Required:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3vectors:*"
      ],
      "Resource": "*"
    }
  ]
}

Creating the Prerequisite State (1,000+ Indexes)

Option 1: Upload 1,000 files through UI (time-consuming but realistic)

  • Upload 1,000+ different files through the workspace
  • Each file creates one index

Option 2: Script to create test indexes (faster for reproduction)

import boto3
import uuid

client = boto3.client('s3vectors', region_name='us-west-2')
bucket_name = 'your-vector-bucket-name'

# Create 1,050 test indexes
for i in range(1050):
    index_name = f"test-index-{uuid.uuid4()}"
    client.create_index(
        vectorBucketName=bucket_name,
        indexName=index_name,
        dataType='float32',  # vector data type required by CreateIndex
        dimension=1536,
        distanceMetric='cosine'
    )
    if (i + 1) % 100 == 0:
        print(f"Created {i + 1} indexes...")

Reproduction Steps

Step 1: Verify container is running

docker logs open-webui --tail 50
# Expected: No errors, server started on port 8080
# Actual: Server running normally

Step 2: Access Open WebUI

  • Navigate to http://localhost:3000 (or your deployment URL)
  • Log in with admin credentials
  • Expected: Dashboard loads successfully
  • Actual: Dashboard loads successfully ✓

Step 3: Create test file
Create a test file test-document.txt with content:

This is a test document for vector search.
It contains information about quantum computing and artificial intelligence.
The quantum computing section discusses qubits and superposition.

Step 4: Upload file to workspace

  • Click "Workspace" in left sidebar
  • Click "Knowledge" tab
  • Click "+ Add Knowledge" button
  • Select "Upload Files"
  • Choose test-document.txt
  • Click "Upload"
  • Expected: File uploads, shows "Processing..." then "Completed"
  • Actual: File shows "Completed" status ✓

Step 5: Monitor logs during upload

docker logs -f open-webui | grep -E "(Created S3 index|Completed insertion|Collection)"

Expected:

INFO | Created S3 index: file-abc123def456
INFO | Completed insertion of 15 vectors

Actual: Same as expected ✓

Step 6: Wait for indexing to complete

  • Wait 60 seconds (well beyond typical indexing time)
  • Expected: Index should be queryable after ~10 seconds
  • Actual: Waited 60 seconds

Step 7: Verify index exists via AWS CLI

# Get the file UUID from logs (e.g., "abc123def456")
aws s3vectors get-index \
  --vector-bucket-name your-vector-bucket-name \
  --index-name file-abc123def456 \
  --region us-west-2

Expected (if the "does not exist" warning were accurate): an error saying the index cannot be found
Actual: Returns index details - the index EXISTS! ✓

{
  "index": {
    "indexName": "file-abc123def456",
    "creationTime": "2024-01-15T10:30:00Z",
    "dimension": 1536,
    "distanceMetric": "cosine"
  }
}

Step 8: Query the uploaded file

  • Click "New Chat" in left sidebar
  • In chat input, type: What does the document say about quantum computing?
  • Press Enter
  • Expected: Response citing the uploaded document with quote about qubits
  • Actual: Response says "I don't have any specific information about quantum computing in the uploaded documents" with message "No sources found" ❌

Step 9: Check logs for warnings

docker logs open-webui | grep -i "collection.*does not exist"

Expected: No warnings
Actual: Multiple warnings ❌

WARNING | Collection 'file-abc123def456' does not exist
WARNING | Collection 'file-abc123def456' does not exist
WARNING | Collection 'file-abc123def456' does not exist

Step 10: Verify total index count

# Count indexes returned in a single (unpaginated) response
aws s3vectors list-indexes \
  --vector-bucket-name your-vector-bucket-name \
  --region us-west-2 \
  --no-paginate \
  --output json | jq '.indexes | length'

Expected: The total index count for the bucket
Actual: Returns 1,000 (or close to it), and the response includes "nextToken": "..." - i.e., results are paginated, and a caller that ignores the token sees only the first page ❌

Key Observations

  1. Index creation succeeds - logs confirm, AWS CLI confirms
  2. Index retrieval fails - has_collection() returns False
  3. Pattern: Only occurs when bucket has 1,000+ total indexes
  4. Root cause: has_collection() only checks first page of list_indexes() results (~1,000 indexes)
  5. Timeline: Issue persists indefinitely (tested 15+ minutes after upload)

Root Cause (Technical Details)

The has_collection() method in backend/open_webui/retrieval/vector/dbs/s3vector.py (line ~126) uses boto3's list_indexes() without pagination handling:

def has_collection(self, collection_name: str) -> bool:
    try:
        response = self.client.list_indexes(vectorBucketName=self.bucket_name)
        indexes = response.get("indexes", [])
        return any(idx.get("indexName") == collection_name for idx in indexes)
    except Exception as e:
        log.error(f"Error listing indexes: {e}")
        return False

The Problem:

  • boto3 list_indexes() returns only ~1,000 indexes by default with a nextToken for pagination
  • AWS CLI handles pagination automatically (masks the issue during manual testing)
  • The code only checks the first page of results
  • Indexes beyond position ~1,000 are never found

Evidence from boto3 testing:

import boto3

client = boto3.client("s3vectors", region_name="us-west-2")
bucket_name = "your-vector-bucket-name"
target = "file-abc123def456"  # index name taken from the upload logs

response = client.list_indexes(vectorBucketName=bucket_name)

print(f"Indexes returned: {len(response.get('indexes', []))}")  # ~1000
print(f"Has nextToken: {'nextToken' in response}")  # True
print(f"Target found: {target in [i['indexName'] for i in response['indexes']]}")  # False

With more than ~1,000 indexes in the bucket, boto3 returns approximately 1,000 indexes with a nextToken indicating more pages exist. The code doesn't handle pagination, so indexes beyond the first page are invisible to the application.
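A paginated version of the existing check would need to pass `nextToken` back into `list_indexes()` until the token is exhausted. A minimal sketch of that loop (the paging logic is factored out so it can be exercised without AWS; `list_indexes` stands in for any callable with the boto3 `s3vectors` keyword signature, and `fake_list_indexes` below is a made-up stub):

```python
def has_collection_paginated(list_indexes, bucket_name, collection_name):
    """Return True if collection_name appears on ANY page of results."""
    kwargs = {"vectorBucketName": bucket_name}
    while True:
        response = list_indexes(**kwargs)
        if any(idx.get("indexName") == collection_name
               for idx in response.get("indexes", [])):
            return True
        token = response.get("nextToken")
        if not token:  # no more pages
            return False
        kwargs["nextToken"] = token  # fetch the next page

# Illustrative two-page stub: the target only appears on page 2.
pages = [
    {"indexes": [{"indexName": "idx-0"}], "nextToken": "page-2"},
    {"indexes": [{"indexName": "file-abc123def456"}]},
]

def fake_list_indexes(**kwargs):
    # Serve page 2 only when the caller passes the token back.
    return pages[1] if kwargs.get("nextToken") == "page-2" else pages[0]

print(has_collection_paginated(fake_list_indexes, "bucket", "file-abc123def456"))  # True
```

Even with paging, this still costs one API call per ~1,000 indexes, which is why the direct get_index() lookup proposed under Additional Information is preferable.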

Logs & Screenshots

Index Creation (successful):

[Timestamp] | INFO | Created S3 index: file-<uuid>
[Timestamp] | INFO | Inserted batch 1: 100 vectors into index
[Timestamp] | INFO | Completed insertion of 100 vectors

Query Attempt (minutes later):

[Timestamp] | WARNING | Collection 'file-<uuid>' does not exist
[Timestamp] | WARNING | Collection 'file-<uuid>' does not exist
[Timestamp] | WARNING | Collection 'file-<uuid>' does not exist

AWS CLI verification (proves index exists):

$ aws s3vectors get-index \
    --vector-bucket-name <bucket-name> \
    --index-name <uuid-from-logs>
{
    "index": {
        "indexName": "file-<uuid>",
        "creationTime": "<timestamp>",
        "dimension": 1536,
        "distanceMetric": "cosine"
    }
}

Additional Information

Proposed Solution

Replace list_indexes() with direct get_index() lookup:

def has_collection(self, collection_name: str) -> bool:
    """
    Check if a vector index exists using direct lookup.
    This avoids pagination issues with list_indexes() and is significantly faster.
    """
    try:
        self.client.get_index(
            vectorBucketName=self.bucket_name,
            indexName=collection_name
        )
        return True
    except self.client.exceptions.ResourceNotFoundException:
        return False
    except Exception as e:
        log.error(f"Error checking if index '{collection_name}' exists: {e}")
        return False
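The control flow of this fix can be verified locally without AWS credentials by stubbing the client. The stub below is illustrative only (FakeClient and its exception classes are made up for the test); it drives both the found and not-found branches:

```python
class _NotFound(Exception):
    """Stand-in for boto3's ResourceNotFoundException."""

class _FakeExceptions:
    ResourceNotFoundException = _NotFound

class FakeClient:
    """Minimal stub of the s3vectors client (not a real API)."""
    exceptions = _FakeExceptions

    def __init__(self, existing_indexes):
        self._existing = set(existing_indexes)

    def get_index(self, vectorBucketName, indexName):
        if indexName not in self._existing:
            raise _NotFound(indexName)
        return {"index": {"indexName": indexName}}

def has_collection(client, bucket_name, collection_name):
    # Same control flow as the proposed fix, minus logging.
    try:
        client.get_index(vectorBucketName=bucket_name, indexName=collection_name)
        return True
    except client.exceptions.ResourceNotFoundException:
        return False

client = FakeClient(["file-abc123def456"])
print(has_collection(client, "bucket", "file-abc123def456"))  # True
print(has_collection(client, "bucket", "file-missing"))       # False
```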

Benefits:

  • ✅ No pagination required (direct lookup)
  • ✅ ~8x faster (0.19s vs 1.53s in testing)
  • ✅ Correct for any number of indexes
  • ✅ Scales to millions of indexes
  • ✅ Proper exception handling

Testing Results:

  • Direct get_index() call: 0.19 seconds, found correctly ✅
  • Current list_indexes() call: 1.53 seconds, not found (beyond page 1) ❌

Impact:

  • High severity for production deployments with 1,000+ indexes
  • Breaks core RAG functionality for newly uploaded files once bucket has >1,000 indexes
  • Inconsistent behavior causes user confusion
  • Scales poorly as index count grows
  • Issue becomes more frequent as more files are uploaded (more indexes created)

Workaround:
No reliable workaround available. The issue affects all new file uploads once the bucket contains more than ~1,000 indexes.

Willing to Submit PR:
Yes, I have identified the root cause and proposed a fix with supporting evidence. I can submit a PR to address this issue, though full integration testing would require a test environment with 1,000+ indexes.

Labels

bug (Something isn't working)
