Skip to content

Add progress logging to LSMTreeIndex build phase#2772

Merged
robfrank merged 4 commits intomainfrom
copilot/add-logging-to-full-text-index
Nov 15, 2025
Merged

Add progress logging to LSMTreeIndex build phase#2772
robfrank merged 4 commits intomainfrom
copilot/add-logging-to-full-text-index

Conversation

Copy link
Contributor

Copilot AI commented Nov 10, 2025

Index builds on large datasets (millions of records) were completely silent for hours, providing no visibility into progress, performance, or whether the process was stuck.

Changes

  • LSMTreeIndex.java: Added periodic logging in build() method

    • Start message with index name and property count
    • Progress updates every 10K records with processing rate (records/sec)
    • Completion message with total records and elapsed time
  • LSMTreeIndexTest.java: Added testBuildWithLogging() to verify logging behavior

Example output

INFO [LSMTreeIndex] Building index 'Post[Body]' on 1 properties...
INFO [LSMTreeIndex] Building index 'Post[Body]': processed 10000 records (833 records/sec)...
INFO [LSMTreeIndex] Building index 'Post[Body]': processed 20000 records (850 records/sec)...
...
INFO [LSMTreeIndex] Completed building index 'Post[Body]': processed 59819048 records in 22823000ms

This brings build phase logging in line with the existing compaction phase logging and enables monitoring of long-running index operations.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • repository.apache.org
    • Triggering command: /usr/lib/jvm/temurin-17-jdk-amd64/bin/java -classpath /home/REDACTED/.m2/wrapper/dists/apache-maven-3.9.9/3477a4f1/boot/plexus-classworlds-2.8.0.jar -Dclassworlds.conf=/home/REDACTED/.m2/wrapper/dists/apache-maven-3.9.9/3477a4f1/bin/m2.conf -Dmaven.home=/home/REDACTED/.m2/wrapper/dists/apache-maven-3.9.9/3477a4f1 -Dlibrary.jansi.path=/home/REDACTED/.m2/wrapper/dists/apache-maven-3.9.9/3477a4f1/lib/jansi-native -Dmaven.multiModuleProjectDirectory=/home/REDACTED/work/arcadedb/arcadedb org.codehaus.plexus.classworlds.launcher.Launcher -f pom.xml -B -V -e -Dfindbugs.skip -Dcheckstyle.skip -Dpmd.skip=true -Dspotbugs.skip -Denforcer.skip -Dmaven.javadoc.skip -DskipTests -Dmaven.test.skip.exec -Dlicense.skip=true -Drat.skip=true -Dspotless.check.skip=true -s /home/REDACTED/work/arcadedb/.codeql-scratch/dbs/java/working/settings.xml com.github.ferstl:depgraph-maven-plugin:4.0.3-CodeQL:graph -B -DclasspathScopes=compile -DoutputDirectory=/home/REDACTED/work/arcadedb/.codeql-scratch/dbs/java/working/java-standalone-dependency-graph16842588051782640130/${project.groupId}/${project.artifactId} -DgraphFormat=json -Dmaven.ext.class.path=/opt/hostedtoolcache/CodeQL/2.23.3/x64/codeql/java/tools/java-buildless-maven-plugin.jar (dns block)
  • scarf.sh
    • Triggering command: node ./report.js (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details on the original issue you should resolve

<issue_title>No logging during FULL_TEXT index build phase</issue_title>
<issue_description>## Problem Description

When creating FULL_TEXT indexes on large datasets, the index build process is completely silent for extended periods (hours), making it impossible to determine if:

  • The process is making progress
  • The process is stuck/frozen
  • How long until completion

This is particularly problematic for datasets with millions of records where the build phase can take 5-10+ hours with zero feedback.

Current Behavior

LSM_TREE indexes provide good logging during compaction:

2025-11-08 15:34:58.737 WARNI [LSMTreeIndex] - Creating a new entry in index 'PostLink_0_...' root page [2024-02-12...]->385
2025-11-08 15:34:59.334 WARNI [LSMTreeIndex] Index 'PostLink_0_...' compacted in 83381ms (keys=5758994 values=6541209...)

FULL_TEXT indexes are silent during the build phase:

15:34:59 - Previous index completes
[5-10 HOURS OF SILENCE - NO LOGS]
23:45:00 - Compaction starts (finally some logs)

Reproduction

  1. Create a FULL_TEXT index on a large text field with millions of records:

    database.getSchema().createTypeIndex(
        SchemaImpl.INDEX_TYPE.FULL_TEXT,
        false,
        "Post",
        new String[] { "Body" }
    );
  2. Observe complete silence during the initial build phase (scanning records, tokenizing text, inserting into LSM tree)

  3. Only when compaction starts do logs appear

Root Cause

In LSMTreeIndex.java (line 626-657), the build() method has no logging:

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
    // ... setup code ...
    
    db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
        db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
        total.incrementAndGet();
        
        // NO LOGGING HERE - could be processing for hours!
        
        if (total.get() % buildIndexBatchSize == 0) {
            db.getWrappedDatabaseInstance().commit();
            db.getWrappedDatabaseInstance().begin();
        }
        
        if (callback != null)
            callback.onDocumentIndexed((Document) record, total.get());
        
        return true;
    });
    
    // ... cleanup code ...
}

For FULL_TEXT indexes, this means:

  • Scanning millions of records
  • Tokenizing billions of words (via Lucene analyzer)
  • Inserting each token into the underlying LSM tree
  • All happening silently

Impact on Large Datasets

Based on testing with Stack Overflow dataset:

Dataset Size Records Build Time Logging
Medium (425K posts) 425,735 3m 25s ❌ None
Large (60M posts) 59,819,048 5-7 hours ❌ None

During the 5-7 hour build phase, there is zero indication that work is being done.

Proposed Solution

Add periodic progress logging in LSMTreeIndex.build():

public long build(final int buildIndexBatchSize, final BuildIndexCallback callback) {
    checkIsValid();
    final AtomicLong total = new AtomicLong();
    final long LOG_INTERVAL = 10000; // Log every 10K records
    
    if (propertyNames == null || propertyNames.isEmpty())
        throw new IndexException("Cannot rebuild index '" + name + "' because metadata information are missing");
    
    final DatabaseInternal db = getDatabase();
    
    if (status.compareAndSet(INDEX_STATUS.AVAILABLE, INDEX_STATUS.UNAVAILABLE)) {
        
        LogManager.instance().log(this, Level.INFO, 
            "Building index '%s' on %d properties...", name, propertyNames.size());
        
        final long startTime = System.currentTimeMillis();
        
        db.scanBucket(db.getSchema().getBucketById(associatedBucketId).getName(), record -> {
            db.getIndexer().addToIndex(LSMTreeIndex.this, record.getIdentity(), (Document) record);
            total.incrementAndGet();
            
            // ADD: Periodic progress logging
            if (total.get() % LOG_INTERVAL == 0) {
                final long elapsed = System.currentTimeMillis() - startTime;
                final double rate = total.get() / (elapsed / 1000.0);
                LogManager.instance().log(this, Level.INFO, 
                    "Building index '%s': processed %d records (%.0f records/sec)...", 
                    name, total.get(), rate);
            }
            
            if (total.get() % buildIndexBatchSize == 0) {
                db.getWrappedDatabaseInstance().commit();
                db.getWrappedDatabaseInstance().begin();
            }
            
            if (callback != null)
                callback.onDocumentIndexed((Document) record, total.get());
            
            return true;
        });
        
        // ADD: Completion logging
        final long elapsed = System.currentTimeMillis...

</details>

- Fixes ArcadeData/arcadedb#2763

<!-- START COPILOT CODING AGENT TIPS -->
---

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more [Copilot coding agent tips](https://gh.io/copilot-coding-agent-tips) in the docs.

Copilot AI and others added 2 commits November 10, 2025 09:17
Co-authored-by: robfrank <413587+robfrank@users.noreply.github.com>
Co-authored-by: robfrank <413587+robfrank@users.noreply.github.com>
@mergify
Copy link
Contributor

mergify bot commented Nov 10, 2025

🧪 CI Insights

Here's what we observed from your CI run for 67cc106.

🟢 All jobs passed!

But CI Insights is watching 👀

Copilot AI changed the title [WIP] Add logging during FULL_TEXT index build phase Add progress logging to LSMTreeIndex build phase Nov 10, 2025
Copilot AI requested a review from robfrank November 10, 2025 09:35
@codacy-production
Copy link

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
-0.10% 100.00%
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (f397d6d) 73243 46790 63.88%
Head commit (67cc106) 73255 (+12) 46727 (-63) 63.79% (-0.10%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#2772) 11 11 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

@robfrank robfrank linked an issue Nov 10, 2025 that may be closed by this pull request
@robfrank robfrank added this to the 25.11.1 milestone Nov 10, 2025
@robfrank robfrank requested a review from lvca November 10, 2025 12:40
@robfrank robfrank marked this pull request as ready for review November 10, 2025 15:10
@robfrank robfrank merged commit fa1405e into main Nov 15, 2025
21 of 25 checks passed
robfrank added a commit that referenced this pull request Feb 11, 2026
Co-authored-by: Roberto Franchini <ro.franchini@gmail.com>
(cherry picked from commit fa1405e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

No logging during FULL_TEXT index build phase

2 participants