Flexible fusion backend #813

@c-poley

With Annif, several specialised models can be combined in an ensemble for prediction. However, all models in an Annif ensemble receive the same text, so it is not possible to pass a different kind of text to each model.
Currently, the only way to adapt the text for prediction is the transform parameter. It lets us either read a limited number of characters from the beginning or read the whole text. A parameter that sets a specific range (from character x to character y) would give us an additional way to cut down the text for specific models.

Could the Annif ensemble functionality be extended so that the individual models of an ensemble can be given different kinds of text (expressions of a document) for processing?

The ensemble functionality could also be made more flexible by allowing subsets of vocabularies for individual models in ensembles, as discussed in issue #596.

In any case, the interface for predictions needs enhancement. Below we describe some ideas we discussed a few weeks ago:

  1. Allow text to be submitted as structured data:
    Use JSON:
    -d '%7B%22headline%22%3A%22Wonderful%22%2C%22fulltext%22%3A%22Oh%2C%20what%20a%20wonderful%20world%22%7D'
    or
    use XML tags:
    -d '%3Cheadline%3EWonderful%3C%2Fheadline%3E%3Cfulltext%3EOh%2C%20what%20a%20wonderful%20world%3C%2Ffulltext%3E'
    ... with the possibility to reference these tags in the appropriate places in projects.conf, for example:
    submitted_text=headline,' ',fulltext
    The space between "headline" and "fulltext" defines the character(s) used to join the parts of the submitted text.
    submitted_text=headline,'.',toc
    Here a headline and a toc are submitted, joined with a ".".

  2. A step towards fusion would be an enhancement of the limit parameter, which would let us select a specific part of the submitted text.
    In projects.conf only a small extension is needed, one that defines the starting point and the number of characters to process:
    transform=limit(500,2000)
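
To make the two ideas more concrete, here is a minimal Python sketch of how they might behave. It is not Annif code and does not use any Annif API: the function names (parse_spec, assemble, limit_range) are hypothetical, submitted_text is only the setting proposed above, and we assume that the start position is counted from 0 and that a missing field is treated as empty text.

```python
import json
from urllib.parse import unquote


def parse_spec(spec):
    """Split a proposed submitted_text value such as "headline,' ',fulltext"
    into ("field", name) and ("glue", text) tokens. Hypothetical helper."""
    tokens = []
    i = 0
    while i < len(spec):
        if spec[i] == "'":
            # Quoted glue string: take everything up to the closing quote.
            end = spec.index("'", i + 1)
            tokens.append(("glue", spec[i + 1:end]))
            i = end + 1
        elif spec[i] == ",":
            i += 1  # separator between tokens
        else:
            # Unquoted field name: runs up to the next comma or end of spec.
            end = spec.find(",", i)
            end = len(spec) if end == -1 else end
            tokens.append(("field", spec[i:end].strip()))
            i = end
    return tokens


def assemble(document, spec):
    """Join the named fields of a structured document with the glue strings.
    Assumption: a field missing from the document contributes empty text."""
    parts = []
    for kind, value in parse_spec(spec):
        parts.append(document.get(value, "") if kind == "field" else value)
    return "".join(parts)


def limit_range(text, start, length):
    """Sketch of the proposed limit(start, length): skip `start` characters
    (assumed 0-based), then keep at most `length` characters."""
    return text[start:start + length]


# The URL-encoded JSON payload from idea (1).
payload = ("%7B%22headline%22%3A%22Wonderful%22%2C%22fulltext%22%3A%22"
           "Oh%2C%20what%20a%20wonderful%20world%22%7D")
document = json.loads(unquote(payload))

text = assemble(document, "headline,' ',fulltext")
excerpt = limit_range(text, 10, 2)
```

With such a scheme, each project in an ensemble could carry its own submitted_text value and therefore see a different expression of the same document, while limit(start, length) would remain available for plain, unstructured text.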

We (and, we think, the whole community) would really benefit from a fusion implementation with freely configurable structured data as in (1). We have to admit that structured data would be our preferred and the cleanest implementation.

Best regards,
Christoph, Frank, Jan-Helge and Sandro from the German National Library
