Skip to content

Improve refresh speed with softdelete enable#12557

Open
easyice wants to merge 5 commits intoapache:mainfrom
easyice:softdelete_disi
Open

Improve refresh speed with softdelete enable#12557
easyice wants to merge 5 commits intoapache:mainfrom
easyice:softdelete_disi

Conversation

@easyice
Copy link
Contributor

@easyice easyice commented Sep 15, 2023

I found a flame graph in my production environment, the DocValuesConsumer for ___soft_deletes field accounted for a large proportion

image

flame (12).html.zip

The docvalue for ___soft_deletes field is usually sparse, so it will has a IndexedDISI data structure, the IndexedDISI provide operation like advance(), index(), but for ___soft_deletes field, the index() will not used, because the value of the filed is always 1, we can remove the calculation of index() for this scene.

I ran a benchmark use IndexWriter#softUpdateDocument for IndexedDISI.Method.DENSE case, resulted in ~16% improvement in performance.

@easyice
Copy link
Contributor Author

easyice commented Sep 22, 2023

Update:

when we call softUpdateDocument for a segment that already has some deleted doc, it will iterate all the deleted doc use ReadersAndUpdates#MergedDocValues#onDiskDocValues, but he has to iterate the array twice, the first time is
Lucene90DocValuesConsumer#writeValues will compute gcd, min, max. the second time is IndexedDISI#writeBitSet, this creates some waste, we can remove the first iterate for soft delete, this can speed up about 53% for updates.

Benchmark code:

  public static void main(final String[] args) throws Exception {
    long min = Long.MAX_VALUE;
    for (int i = 0; i < 5; i++) {
      min = Math.min(doWrite(), min);
    }
    System.out.println("BEST:" + min);
  }

static long doWrite() throws IOException {
    Random rand = new Random(5);
    Directory dir = new ByteBuffersDirectory();
    IndexWriter writer =
        new IndexWriter(
            dir,
            new IndexWriterConfig(null)
                .setSoftDeletesField("_soft_deletes")
                .setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH));
    int maxDoc = 4096 * 100;

    for (int i = 0; i < maxDoc; i++) {
      Document doc = new Document();
      doc.add(new StringField("id", String.valueOf(i), Field.Store.NO));

      writer.addDocument(doc);
      if (i > 0 && i % 5000 == 0) {
        writer.commit();
      }
    }

    System.out.println("start update");
    long t0 = System.currentTimeMillis();

    for (int i = 0; i < maxDoc; i += 2) {
      Document doc = new Document();
      writer.softUpdateDocument(
          new Term("id", String.valueOf(i)), doc, new NumericDocValuesField("_soft_deletes", 1));
      if (i > 0 && i % 100 == 0) {
        writer.commit();
      }
    }
    long tookMs = System.currentTimeMillis() - t0;
    System.out.println("update took:" + (System.currentTimeMillis() - t0));

    IOUtils.close(writer, dir);
    return tookMs;
  }

@easyice
Copy link
Contributor Author

easyice commented Oct 7, 2023

@jpountz Would you please take a look when you get a chance?

@easyice
Copy link
Contributor Author

easyice commented Oct 10, 2023

Update: revert changes about IndexedDISI#advance to keep things as simple as possible.

@jpountz
Copy link
Contributor

jpountz commented Oct 10, 2023

I understand the idea, I'm a bit less happy about special-casing the soft deletes field in our doc values file format. I don't have a better suggestion though...

@easyice
Copy link
Contributor Author

easyice commented Oct 11, 2023

Thanks @jpountz , it isn't perfect, there seems to be no better way to avoid probing data and know early on that all values are the same, I am just trying to improve, but i haven't thought of a better way yet

@easyice
Copy link
Contributor Author

easyice commented Oct 18, 2023

I get some great suggestion from @gf2121 , we can get if the doc values has single value from ReadersAndUpdates#onDiskDocValues and ReadersAndUpdates#updatesToApply this can be avoid use special-casing for soft deletes field in doc values, is it looks better? @jpountz

@gf2121
Copy link
Contributor

gf2121 commented Oct 18, 2023

Thanks @easyice !

As we are exposing API like softUpdateDocument(Term term, Iterable<? extends IndexableField> doc, Field... softDeletes). We cannot guarantee that all soft delete fields have the same value 1 in Lucene. This patch looks more reasonable as we remove this assertion. But I am not sure if it is the right answer to touch the NumericDocValues interface for this scene, Let's see what other people think :)

@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2024

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jan 8, 2024
@gf2121 gf2121 mentioned this pull request Apr 18, 2025
3 tasks
@github-actions github-actions bot removed the Stale label Oct 1, 2025
@github-actions
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants