Skip to content

Relax field types supported by _termvectors API #31902

@markharwood

Description

@markharwood

Problem background

In adding a plugin for a new field type I've discovered a few places in core where we have hard-coded assumptions about what represents a searchable text field type:

  1. "All" queries expansion (work in progress here: Change the way that fields are selected for inclusion in all queries. #31655)
  2. MLT queries
  3. TermVectors service

Proposed change

An open question is if all of these problems can be addressed with a new common field-checking service. For now, this issue just relates to solving 3) and the proposed change is to switch the code in TermVectorsService which has :

    // must be a string
    if (fieldType instanceof KeywordFieldMapper.KeywordFieldType == false
            && fieldType instanceof TextFieldMapper.TextFieldType == false) {
        return false;
    }

to this broader base class test:

    if (fieldType instanceof KeywordFieldMapper.KeywordFieldType == false
            && fieldType instanceof StringFieldType == false) {
        return false;
    }

Side effect (desirable or undesirable?)

A side effect of this change is that other forms of indexed strings can now return termvectors info e.g. given this mapping:

PUT test
{
  "mappings": {
	"doc": {
	  "properties": {
		"text": {
		  "type": "text",
		  "index_prefixes": {},
		  "index_phrases": true
		...

and this call to the modified termvectors API:

GET /test/doc/_termvectors
{
	"doc": {
	  "text":  "my example text"		  
	},
	"term_statistics" : true,
	"field_statistics" : true,
	"positions": false,
	"offsets": false,
	"filter" : {
	  "max_num_terms" : 3,
	  "min_term_freq" : 1,
	  "min_doc_freq" : 1
	}
}

..the new version of this API would also return termvectors for the text._index_phrase and text._index_prefix as well as the usual text field. The max_num_terms setting applies to each indexed field so in this case is 3 terms multiplied by 3 fields = 9 terms.

It's not clear if this change in behaviour is useful or not (opinions, @romseygeek ?)

Metadata

Metadata

Assignees

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions