Skip to content

[4.0] Optimizing Smart Search for larger content#30008

Merged
wilsonge merged 3 commits intojoomla:4.0-devfrom
Hackwar:j4findermemory
Jul 19, 2020
Merged

[4.0] Optimizing Smart Search for larger content#30008
wilsonge merged 3 commits intojoomla:4.0-devfrom
Hackwar:j4findermemory

Conversation

@Hackwar
Copy link
Copy Markdown
Member

@Hackwar Hackwar commented Jul 5, 2020

Summary of Changes

Smart Search has issues with larger content and to be honest, this already triggers rather early. It fails for content larger than 200kb.

The first issue is, that the description column in the #__finder_links table is not large enough. The solution here is to truncate the description to a reasonable size. That reasonable size is 32000 characters and I pulled that number out of thin air. I hope that it is low enough to allow for a string with multi-byte chars still to be saved.

The second issue is, that we are trying to store all tokens at once in the DB. That means if we have a text with 50k words, we are creating a query that tries to insert 50k rows at once. That fails spectacularly. So now we do this in chunks of 128 rows at a time.

Third issue is the tokens table is running full rather easily. The number of tokens storable in the tokens table is depending on the max_heap_table_size, which on my system is set to 16MB and which in that case fails at just 22k entries. To get around this issue, we have the toggleTables() method, but that has only been run so far between the different properties to index. If you text has 2MB in size, it tries to insert all those 2MB into the DB and it fails right away. Thus we are checking the limit also when storing in the additional code of issue 2. I also reduced the number of entries to 10k instead of 30k.

This fixes #27807.

Testing Instructions

Find a large text, for example the bible and create an article with that as content. Save and previously see that it takes a very long time and fails at some point. Apply this patch and try again. See now that it doesn't take as long and actually works successfully now.

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Copy link
Copy Markdown
Member Author

Hackwar commented Jul 6, 2020

Do you have more than that one article in the installation? I'm asking because indexing several articles does require more memory than a single one. You could test if it works when you lower the number of items to index in one pass.

@PhilETaylor

This comment was marked as abuse.

@chrisdavenport
Copy link
Copy Markdown
Contributor

This looks to be the problem that #16621 was intended to fix.

@Hackwar
Copy link
Copy Markdown
Member Author

Hackwar commented Jul 6, 2020

I've reviewed the code and the cache actually doesn't really makes sense. We are storing the tokenised result of the whole string, which either doesn't help a lot, since the string is very short (category names, etc.) or it never has a cache hit, since we would have to index the exactly same text twice. This new cache now caches the tokens and looks those up instead. This actually reduces the necessary memory by several megabytes. I've had results that went as far as reducing from 30+MB to just 14MB.

I also tested the cache size and 1024 is better than 2048. The later meant a significant increase in memory consumption, but no real reduction in execution time.

I would assume that this would also solve the issue around #16621...

@PhilETaylor can you test again if this change fixes your memory issue?

@PhilETaylor

This comment was marked as abuse.

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Copy link
Copy Markdown
Member Author

Hackwar commented Jul 7, 2020

PHP 7.3.5 and MariaDB 10.1.39 here. I've now gone and used the "worst" settings for indexing possible. Enabled tuple search, enabled stemming, had TinyMCE as editor and copied the whole bee movie script in there and it worked fine. My memory had a peak usage of 18MB. Can you debug where the huge memory usage is coming from? allocating 33MB at that point at once seems strange.

@PhilETaylor

This comment was marked as abuse.

@Hackwar
Copy link
Copy Markdown
Member Author

Hackwar commented Jul 10, 2020

Regarding the bee movie: I did try the re-indexing as well and that worked fine, too.

Regarding the bible: We have a general issue with very large texts. It does not fail gracefully here and I don't know why (yet). I had similar issues where it loaded an empty form after saving for an eternity...

This PR is hopefully a step towards handling larger documents better, but we should do larger, more systematic tests with large documents, too...

@PhilETaylor

This comment was marked as abuse.

@PhilETaylor

This comment was marked as abuse.

@Quy
Copy link
Copy Markdown
Contributor

Quy commented Jul 10, 2020

I have tested this item ✅ successfully on e5ae33c


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/30008.

@Quy
Copy link
Copy Markdown
Contributor

Quy commented Jul 10, 2020

RTC


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/30008.

@joomla-cms-bot joomla-cms-bot added the RTC This Pull Request is Ready To Commit label Jul 10, 2020
@richard67 richard67 added this to the Joomla 4.0 milestone Jul 11, 2020
@wilsonge wilsonge merged commit 31a54f7 into joomla:4.0-dev Jul 19, 2020
@wilsonge
Copy link
Copy Markdown
Contributor

Thanks!

@joomla-cms-bot joomla-cms-bot removed the RTC This Pull Request is Ready To Commit label Jul 19, 2020
dgrammatiko added a commit to dgrammatiko/joomla-cms that referenced this pull request Jul 21, 2020
…outs

* '4.0-dev' of github.com:joomla/joomla-cms: (612 commits)
  [4.0] Smart Search: Fixing ordering, order direction and disabled button (joomla#29474)
  [4.0] Generate routed Modal links for iframes when not on the root (joomla#30007)
  [4.0] Get menu directly in com_tags menu route helper (joomla#30039)
  Remove collapse when resizing from mobile to desktop (joomla#30132)
  [4.0] Wrap component output in `main` element to make Cassiopeia more accessible (joomla#29870)
  [4.0] Webauthn gmp warning (joomla#29731)
  [4.0] Refactor to return early, remove if depths and throw NotAllowed (joomla#29694)
  [4.0] CLI help text (joomla#29811)
  Feature/draggable typo fixes (joomla#29987)
  [4.0] Removing unnecessary workaround in finder indexer (joomla#30037)
  [4.0] Optimizing Smart Search for larger content (joomla#30008)
  [4.0] Fix js ajax for pre update checker (joomla#29980)
  [4.0] Cassiopea: Fixing modals custom-select fields display (joomla#30097)
  [4.0][com_fields] Fix draggable sorting (joomla#30094)
  [4.0] Correct incorrect @return documentation (joomla#30092)
  [4.0] Menu items modal: adding missing filters (joomla#30087)
  short to long php open tags with echo (joomla#30089)
  Use new Toolbar (joomla#30085)
  [4.0] Center status/date created headers (joomla#29249)
  [4.0] Fix Cassiopea searchtools alignment in modals (joomla#30077)
  ...

# Conflicts:
#	administrator/components/com_templates/src/View/Template/HtmlView.php
#	installation/sql/postgresql/base.sql
#	libraries/src/Application/AdministratorApplication.php
#	libraries/src/Application/SiteApplication.php
sakiss pushed a commit to sakiss/joomla-cms that referenced this pull request Oct 16, 2020
@Hackwar Hackwar deleted the j4findermemory branch October 23, 2020 20:16
@joomla joomla deleted a comment from avinoor12 Apr 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants