Kuromoji analysis part-of-speech filter not working #26519

@avdv

Description

Elasticsearch version (bin/elasticsearch --version): 5.5.2

Plugins installed: [analysis-icu, analysis-smartcn, ingest-geoip, x-pack, analysis-kuromoji, analysis-stempel, ingest-user-agent]

JVM version (java -version):

openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux 4.9.47-1-lts #1 SMP Sat Sep 2 09:26:00 CEST 2017 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I am trying to migrate from Elasticsearch 2.4 to 5.x. Basically, everything works as expected, but the part-of-speech filter no longer removes the default stoptags, which worked correctly in 2.4.

Steps to reproduce:

  1. create an index with the kuromoji tokenizer and a part-of-speech filter:
$ http PUT :32769/kuromoji_sample <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
  2. analyze the text "寿司がおいしいね"
$ http :32769/kuromoji_sample/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        }
    ]
}

Here the "が" and "ね" characters are correctly removed.
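
Conceptually, what a working kuromoji_part_of_speech filter does can be sketched in a few lines of Python. The token/POS pairs below are written out by hand (the IPADIC-style tags are my assumption, not actual Elasticsearch output):

```python
# Sketch of the part-of-speech stop filter's behavior: drop every token
# whose POS tag appears in the configured stoptags set.
tokens = [
    ("寿司", "名詞-一般"),        # noun
    ("が", "助詞-格助詞-一般"),    # case particle
    ("おいしい", "形容詞-自立"),   # adjective
    ("ね", "助詞-終助詞"),        # sentence-final particle
]
stoptags = {"助詞-格助詞-一般", "助詞-終助詞"}

kept = [token for token, pos in tokens if pos not in stoptags]
print(kept)  # ['寿司', 'おいしい']
```

With explicit stoptags the filter matches this two-token result; the bug is that the same filtering does not happen when stoptags falls back to its default.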

  3. create an index the same way as in step 1, but do not specify the stoptags:
$ http PUT :32769/kuromoji_sample_2 <<<'{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech"
          }
        }
      }
    }
  }
}'

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "acknowledged": true,
    "shards_acknowledged": true
}
  4. analyze the text "寿司がおいしいね" again
$ http :32769/kuromoji_sample_2/_analyze analyzer=my_analyzer  text="寿司がおいしいね"

HTTP/1.1 200 OK
content-encoding: gzip
content-type: application/json; charset=UTF-8
transfer-encoding: chunked

{
    "tokens": [
        {
            "end_offset": 2,
            "position": 0,
            "start_offset": 0,
            "token": "寿司",
            "type": "word"
        },
        {
            "end_offset": 3,
            "position": 1,
            "start_offset": 2,
            "token": "",
            "type": "word"
        },
        {
            "end_offset": 7,
            "position": 2,
            "start_offset": 3,
            "token": "おいしい",
            "type": "word"
        },
        {
            "end_offset": 8,
            "position": 3,
            "start_offset": 7,
            "token": "",
            "type": "word"
        }
    ]
}

This example is taken from the documentation page here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-speech.html

That page says that stoptags is "An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar".

I have looked at the embedded file in that jar and could not find any difference from the version used by the 2.4 kuromoji plugin.

I also tried defining an empty array, as well as using a combination of Latin characters, but the analyzer always returns four tokens instead of two.
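
For reference, the defaults can be compared by reading stoptags.txt straight out of each plugin's Lucene jar. This is only a sketch: the resource path inside the jar is an assumption based on Lucene's source layout, and the jar filename varies by version:

```python
import zipfile

# Assumed resource path inside the Lucene kuromoji analyzer jar.
STOPTAGS_RESOURCE = "org/apache/lucene/analysis/ja/stoptags.txt"

def read_stoptags(jar_path, resource=STOPTAGS_RESOURCE):
    """Return the non-empty, non-comment POS tags from stoptags.txt in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read(resource).decode("utf-8")
    tags = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tags.append(line)
    return tags

# Hypothetical usage, jar path depends on your installation:
# read_stoptags("plugins/analysis-kuromoji/lucene-analyzers-kuromoji-6.6.0.jar")
```

Running this against both the 2.4 and the 5.5.2 jar should make any difference in the default stoptags obvious.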
