NeedRetryException when creating indexes sequentially on large datasets #2702

@tae898

Description


When creating multiple indexes sequentially on large datasets, subsequent index creation commands fail with NeedRetryException because background LSMTree compaction from previous index creation is still running.

Tested on: 25.10.1-SNAPSHOT

Steps to Reproduce

  1. Import a large dataset (e.g., 33.8M records into a table)
  2. Create indexes sequentially:
    CREATE INDEX ON Movie (movieId) UNIQUE;      -- Works
    CREATE INDEX ON Rating (userId) NOTUNIQUE;   -- Works, triggers compaction on 33.8M records
    CREATE INDEX ON Rating (movieId) NOTUNIQUE;  -- FAILS: compaction still running
    CREATE INDEX ON Link (movieId) UNIQUE;       -- FAILS: compaction still running
    CREATE INDEX ON Tag (movieId) NOTUNIQUE;     -- FAILS: compaction still running

Expected Behavior

One of the following:

  1. Synchronous behavior: CREATE INDEX blocks until any running compaction completes, then creates the new index
  2. Queuing: Index creation requests are queued and executed after current compaction finishes
  3. Retry handling: The database automatically retries internally without throwing the exception to the client

Actual Behavior

✅ CREATE INDEX ON Movie (movieId) UNIQUE
✅ CREATE INDEX ON Rating (userId) NOTUNIQUE
   → Triggers LSMTree compaction on 33.8M records (takes 30-60 seconds)
❌ CREATE INDEX ON Rating (movieId) NOTUNIQUE
   → com.arcadedb.exception.NeedRetryException: Cannot create a new index while asynchronous tasks are running
❌ CREATE INDEX ON Link (movieId) UNIQUE
   → com.arcadedb.exception.NeedRetryException: Cannot create a new index while asynchronous tasks are running
❌ CREATE INDEX ON Tag (movieId) NOTUNIQUE
   → com.arcadedb.exception.NeedRetryException: Cannot create a new index while asynchronous tasks are running

The exception is thrown immediately even though the index creation could simply wait for compaction to finish.

Impact

  • Developer Experience: ❌ Requires manual retry logic for a common operation
  • API Consistency: ⚠️ Other database operations handle internal async tasks transparently
  • Batch Scripts: ❌ Cannot create multiple indexes in a single script without retry logic
  • Documentation Gap: ⚠️ Behavior is not documented, so developers encounter unexpected failures

Environment

  • Version: 25.10.1-SNAPSHOT
  • Dataset: MovieLens ml-latest (33,832,163 ratings)
  • Interface: Python bindings via arcadedb_embedded
  • JVM Heap: 8GB
  • Mode: Embedded

Logs

Error message:

Command failed: com.arcadedb.exception.NeedRetryException: 
Cannot create a new index while asynchronous tasks are running

Compaction logs show LSMTree operations taking time:

LSMTreeIndex 'Rating[userId]' compacted 50 pages, remaining 0 pages 
(totalKeys=289037 totalValues=2251732)

Subsequent attempts during compaction window:

2025-10-25 14:04:14.318 WARNI [LSMTreeIndex] Index 'Rating_14_175471788647837' 
compacted in 174ms (keys=289166 values=2251404 mutablePages=2 immutablePages=51)

Reproduction Script

import arcadedb_embedded as arcadedb

# After importing the large dataset; `db` is the open database handle.
indexes = [
    ("Movie", "movieId", "UNIQUE"),
    ("Rating", "userId", "NOTUNIQUE"),
    ("Rating", "movieId", "NOTUNIQUE"),  # This will fail
    ("Link", "movieId", "UNIQUE"),       # This will fail
    ("Tag", "movieId", "NOTUNIQUE"),     # This will fail
]

for table, column, uniqueness in indexes:
    try:
        db.command("sql", f"CREATE INDEX ON {table} ({column}) {uniqueness}")
        print(f"✅ Created {table}({column})")
    except Exception as e:
        print(f"❌ Failed {table}({column}): {e}")

Output:

✅ Created Movie(movieId)
✅ Created Rating(userId)
❌ Failed Rating(movieId): NeedRetryException: Cannot create a new index while asynchronous tasks are running
❌ Failed Link(movieId): NeedRetryException: Cannot create a new index while asynchronous tasks are running
❌ Failed Tag(movieId): NeedRetryException: Cannot create a new index while asynchronous tasks are running

Analysis

The issue occurs because:

  1. Creating an index on a large table triggers background LSMTree compaction
  2. Compaction can take 30-60+ seconds for tables with tens of millions of records
  3. During compaction, the database rejects new index creation requests
  4. No internal queuing or waiting mechanism exists

Suggested Solutions

Option 1: Synchronous Blocking (Simplest)

Make CREATE INDEX wait internally for any running compaction to complete before proceeding.

Pros:

  • Simple, predictable behavior
  • No API changes needed
  • Works like other databases

Cons:

  • Blocks the calling thread

Option 2: Internal Queuing

Queue index creation requests and execute them serially after compaction completes.

Pros:

  • Non-blocking API
  • Handles multiple concurrent requests

Cons:

  • More complex implementation
  • Need to communicate queue status to client
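To illustrate the queuing idea, here is a minimal client-side sketch (not ArcadeDB internals): a single worker thread serializes all index-creation commands so callers never race each other. `db` stands in for an open `arcadedb_embedded` handle, and `IndexCreationQueue` is a hypothetical name. Note that serializing submission alone does not wait out compaction, since compaction continues after `command()` returns, so a real Option 2 would still need the wait/retry logic server-side.

```python
import queue
import threading

class IndexCreationQueue:
    """Client-side sketch of Option 2: run index-creation commands
    one at a time, in submission order, on a single worker thread."""

    def __init__(self, db):
        self.db = db
        self.tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, sql):
        """Enqueue a command; returns (event, result) so the caller can
        wait for completion and inspect the outcome."""
        done = threading.Event()
        result = {}
        self.tasks.put((sql, done, result))
        return done, result

    def _run(self):
        while True:
            sql, done, result = self.tasks.get()
            try:
                result["value"] = self.db.command("sql", sql)
            except Exception as exc:
                # Surface the error to the waiting caller instead of
                # crashing the worker thread.
                result["error"] = exc
            done.set()
```
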

Option 3: Automatic Retry

Internally retry the index creation with exponential backoff instead of throwing exception.

Pros:

  • Transparent to client
  • Eventually succeeds

Cons:

  • Hidden latency
  • Harder to set appropriate timeout
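For concreteness, here is a sketch of what Option 3's retry might look like, expressed as a client-side wrapper rather than the engine's actual implementation. `retry_with_backoff` is a hypothetical helper; it retries any callable whose exception message matches a retryable marker, with capped exponential backoff plus jitter to avoid thundering-herd retries.

```python
import random
import time

def retry_with_backoff(execute, retryable=("NeedRetryException",),
                       max_attempts=8, base_delay=0.5, max_delay=30.0):
    """Call `execute()`, retrying retryable failures with capped
    exponential backoff plus jitter. Re-raises non-retryable errors
    immediately, and the last error once the attempt budget is spent."""
    for attempt in range(max_attempts):
        try:
            return execute()
        except Exception as exc:
            if not any(name in str(exc) for name in retryable):
                raise  # non-retryable error: fail fast
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

The jitter matters if several clients hit the same compaction window: without it, they all retry in lockstep.
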

Option 4: Document as Expected Behavior

If this is intentional, document that applications must implement retry logic.

Pros:

  • No code changes

Cons:

  • Poor developer experience
  • Inconsistent with typical database behavior

Current Workaround

Implement retry logic with delays:

import time

max_retries = 60  # 10 minutes max (60 * 10 seconds)
retry_delay = 10  # seconds between retries

for table, column, uniqueness in indexes:
    for attempt in range(max_retries):
        try:
            db.command("sql", f"CREATE INDEX ON {table} ({column}) {uniqueness}")
            print(f"✅ Created {table}({column})")
            break
        except Exception as e:
            if "NeedRetryException" in str(e) and "asynchronous tasks" in str(e):
                if attempt < max_retries - 1:
                    print(f"⏳ Compaction running, retry {attempt+1}/{max_retries}...")
                    time.sleep(retry_delay)
                else:
                    print(f"❌ Timeout after {max_retries} attempts")
            else:
                # Different error - don't retry this index
                print(f"❌ Error: {e}")
                break
This works but requires every application to implement the same retry pattern.
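Since every application ends up re-implementing this pattern, it can be packaged once as a helper. The sketch below is hypothetical (the `create_indexes` name and `is_retryable` hook are not part of any ArcadeDB API); it raises on non-retryable errors or timeout, rather than printing and moving on, so batch scripts fail loudly.

```python
import time

def create_indexes(db, indexes, max_retries=60, retry_delay=10,
                   is_retryable=lambda e: "NeedRetryException" in str(e)):
    """Create indexes in order, waiting out compaction-related failures.

    `indexes` is a list of (table, column, uniqueness) tuples.
    Returns the list of (table, column) pairs that were created.
    """
    created = []
    for table, column, uniqueness in indexes:
        sql = f"CREATE INDEX ON {table} ({column}) {uniqueness}"
        for attempt in range(max_retries):
            try:
                db.command("sql", sql)
                created.append((table, column))
                break
            except Exception as exc:
                if not is_retryable(exc) or attempt == max_retries - 1:
                    raise  # non-retryable error, or retry budget spent
                time.sleep(retry_delay)
    return created
```
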

Related Issues

This is separate from but related to issue #2701 (duplicate timestamped indexes during compaction). Both stem from LSMTree compaction behavior on large datasets.
