Kuromoji analysis part-of-speech filter not working #26519

@avdv

Description

Elasticsearch version (bin/elasticsearch --version): 5.5.2

Plugins installed: [analysis-icu, analysis-smartcn, ingest-geoip, x-pack, analysis-kuromoji, analysis-stempel, ingest-user-agent]

JVM version (java -version):

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.9.47-1-lts #1 SMP Sat Sep 2 09:26:00 CEST 2017 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am trying to migrate from Elasticsearch 2.4 to 5.x. Basically, everything works as expected, but the part-of-speech filter no longer removes the default stoptags, which worked correctly in 2.4.

Steps to reproduce:

  1. create an index with the kuromoji tokenizer and a part-of-speech filter:
$ http PUT :32769/kuromoji_sample <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
  2. analyze the text "寿司がおいしいね"
$ http :32769/kuromoji_sample/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" characters are correctly removed.
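
Conceptually, what a working kuromoji_part_of_speech filter does can be sketched in a few lines of Python. The token/POS pairs below are written out by hand (the IPADIC-style tags are my assumption, not actual Elasticsearch output):

```python
# Sketch of the part-of-speech stop filter's behavior: drop every token
# whose POS tag appears in the configured stoptags set.
tokens = [
    ("寿司", "名詞-一般"),        # noun
    ("が", "助詞-格助詞-一般"),    # case particle
    ("おいしい", "形容詞-自立"),   # adjective
    ("ね", "助詞-終助詞"),        # sentence-final particle
]
stoptags = {"助詞-格助詞-一般", "助詞-終助詞"}

kept = [token for token, pos in tokens if pos not in stoptags]
print(kept)  # ['寿司', 'おいしい']
```

With explicit stoptags the filter matches this two-token result; the bug is that the same filtering does not happen when stoptags falls back to its default.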

  3. create an index the same way as in step 1, but do not specify the stoptags:
$ http PUT :32769/kuromoji_sample_2 <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech"
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
  4. analyze the text "寿司がおいしいね" again
$ http :32769/kuromoji_sample_2/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        },
        {
            "end_offset": 8,
            "position": 3,
            "start_offset": 7,
            "token": "",
            "type": "word"
        }
    ]
}

This example is taken from the documentation page here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-speech.html

That page says that stoptags is "An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar".

I have looked at the embedded file in that jar and could not find any difference from the version used by the 2.4 kuromoji plugin.

I also tried defining an empty array, as well as using a combination of Latin characters, but the analyzer always returns four tokens instead of two.
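
For reference, the defaults can be compared by reading stoptags.txt straight out of each plugin's Lucene jar. This is only a sketch: the resource path inside the jar is an assumption based on Lucene's source layout, and the jar filename varies by version:

```python
import zipfile

# Assumed resource path inside the Lucene kuromoji analyzer jar.
STOPTAGS_RESOURCE = "org/apache/lucene/analysis/ja/stoptags.txt"

def read_stoptags(jar_path, resource=STOPTAGS_RESOURCE):
    """Return the non-empty, non-comment POS tags from stoptags.txt in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(resource).decode("utf-8")
    tags = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tags.append(line)
    return tags

# Hypothetical usage, jar path depends on your installation:
# read_stoptags("plugins/analysis-kuromoji/lucene-analyzers-kuromoji-6.6.0.jar")
```

Running this against both the 2.4 and the 5.5.2 jar should make any difference in the default stoptags obvious.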
