Currently, JabRef implements its own search syntax and backend for bib-fields. Linked PDF files are indexed for fulltext search by a Lucene backend. Since we already manage an index for the fulltext search, we could also index the bib-fields for a faster, more versatile search functionality.
However, this is no easy task, as keeping the index up-to-date raises several questions: how to link bib-entries to their index entries, when to update the index, which fields to index, where to store the index, and how to show the results.
I summarize some thoughts below. I would like to work on these ideas over the next weeks and then maybe implement the functionality during JabCon2022.
How to link bib-entries to the index
One problem that I already struggled with when implementing the fulltext search is the absence of a unique key connecting a JabRef BibEntry object to its corresponding entry in the Lucene index. Citation keys are not necessarily present. JabRef's entry identifier is volatile and may be different each time JabRef opens. To synchronize the index, however, we need a mechanism to link an entry to the index.
One solution could be hashes of the entry content. When the user changes an entry, we would compute the hash before the change to find the index entry, update the indexed fields, and then replace the stored hash with the hash after the change. This would also allow us to easily check which entries need to be re-indexed at startup: just compare all hashes in the library to all hashes in the index. Hashes that are not found in the index need to be indexed; hashes that are not found in the library need to be deleted from the index.
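The startup comparison could be sketched roughly as follows. This is only an illustration under assumptions: the class `IndexSync` and its method names are hypothetical and not part of JabRef, an entry is simplified to a plain map of field names to values, and SHA-256 is just one possible choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not an existing JabRef class.
public class IndexSync {

    /** Hashes the fields in sorted order so that equal content always gives the same hash. */
    static String hashEntry(Map<String, String> fields) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
        for (Map.Entry<String, String> field : new TreeMap<>(fields).entrySet()) {
            // A zero byte after each part keeps ("ab", "c") distinct from ("a", "bc").
            digest.update(field.getKey().getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
            digest.update(field.getValue().getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Hashes present in the library but not in the index: these entries need to be indexed. */
    static Set<String> missingFromIndex(Set<String> libraryHashes, Set<String> indexHashes) {
        Set<String> result = new HashSet<>(libraryHashes);
        result.removeAll(indexHashes);
        return result;
    }

    /** Hashes present in the index but not in the library: these index entries need to be deleted. */
    static Set<String> staleInIndex(Set<String> libraryHashes, Set<String> indexHashes) {
        Set<String> result = new HashSet<>(indexHashes);
        result.removeAll(libraryHashes);
        return result;
    }
}
```

One thing this sketch makes visible: two entries with exactly the same fields get the same hash, so exact duplicates in a library would share one index entry unless something else is mixed into the hash.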
When
Every time an entry changes, the index needs to change with it. This can be:
- At startup
- When the user changes an entry from JabRef
- When the user changes something in the bib-file
- Other?
Also, we noticed for the fulltext search that indexing takes too much time to be done on the GUI thread. I assume that this is not a problem for the normal bib-fields (as it's only a few hundred words at most and no file needs to be opened and parsed). I suggest (at least trying) to index bib-fields in the foreground and keep the fulltext indexing in the background. A problem that immediately comes to mind: locks. Only one writer may hold the index's write lock at a time. If we keep everything in the same index, the background fulltext indexer could block the indexing of the bib-fields. One solution could be to use two indices, but that makes the search more complicated. This problem needs further investigation.
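Besides two indices, another option that might be worth investigating is to funnel all index writes through a single writer thread and let bib-field updates jump ahead of queued fulltext jobs. The sketch below only illustrates that idea with plain `java.util.concurrent` classes; `PrioritizedIndexWriterQueue` and its `Priority` enum are hypothetical names, the `Runnable`s stand in for the real indexing work, and whether the latency is acceptable when a long PDF job is already running is exactly what would need testing.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not an existing JabRef class.
public class PrioritizedIndexWriterQueue {

    /** Declaration order is the priority order: bib-field updates run before fulltext jobs. */
    enum Priority { BIB_FIELDS, FULLTEXT }

    private static final class Task implements Runnable, Comparable<Task> {
        private final Priority priority;
        private final long sequence; // keeps submission order among tasks of equal priority
        private final Runnable work;

        Task(Priority priority, long sequence, Runnable work) {
            this.priority = priority;
            this.sequence = sequence;
            this.work = work;
        }

        @Override
        public void run() {
            work.run();
        }

        @Override
        public int compareTo(Task other) {
            int byPriority = priority.compareTo(other.priority);
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    private final AtomicLong sequence = new AtomicLong();

    // Exactly one worker thread, so only one task writes to the index at any time.
    private final ThreadPoolExecutor writer = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());

    /** Queues an index update; execute() is used because submit() would wrap the task in a non-comparable FutureTask. */
    void submit(Priority priority, Runnable work) {
        writer.execute(new Task(priority, sequence.getAndIncrement(), work));
    }

    /** Runs all queued tasks, then stops the writer thread. Returns false on timeout or interruption. */
    boolean shutdownAndWait() {
        writer.shutdown();
        try {
            return writer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A task that is already running is not interrupted here, so a bib-field update still waits for the PDF currently being indexed; it only skips the PDFs that are still queued.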
What
ALL bib-fields and linked files (if the files can be parsed by JabRef; currently only .pdf, but this could probably be extended easily to txt, rtf, ... if that is a valid use case).
Uncertain: how to treat custom fields. I am unsure whether the set of fields needs to be fixed in the Lucene index or whether fields can be added on the fly. This needs further investigation.
Where
Personally, I would prefer having the index close to the bib-file, but the fulltext index is currently stored in the app-data folders (~/.local on Linux), and AFAIK that is what programs are supposed to do, so I suggest keeping that location.
How to show the results
I would like to highlight search matches in the table. Fulltext results are currently shown in a tab in the entry editor, which I really do not like. Back when I implemented the feature, @calixtus proposed a way to show the results directly in the table by inserting a row under the corresponding entry that spans the whole table and shows the results. I cannot currently find the link Carl sent back then, but will look it up again. I think that would be a great way to highlight the search results.