Daitch-Mokotoff soundex gives incorrect results when it should return multiple encodings #28211

@bkazez

Elasticsearch version: Version: 6.1.1, Build: bd92e7f/2017-12-17T20:23:25.338Z, JVM: 1.8.0_144

Plugins installed: [analysis-icu, analysis-phonetic]

JVM version: java version "1.8.0_144"

OS version: Darwin Kernel Version 17.3.0

Description of the problem including expected versus actual behavior:

The daitch_mokotoff token filter emits only one token for input that has several Daitch-Mokotoff encodings. It should emit one token per encoding.

Steps to reproduce:

...
        "analyzer_daitch_mokotoff": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [
            "daitch_mokotoff"
          ]
        }

curl -XGET 'http://localhost:9200/indexname/_analyze?pretty' -H 'Content-Type: application/json' -d'{
  "analyzer": "analyzer_daitch_mokotoff",
  "text": "CHAUPTMAN"
}'

This should return two tokens, 573660 (ch sounding like kh) and 473660 (ch sounding like tch), but it returns only 473660:

{
  "tokens" : [
    {
      "token" : "473660",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    }
  ]
}

See Daitch-Mokotoff soundex spec here: http://www.avotaynu.com/soundex.htm
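
To make the expected branching concrete, here is a minimal Python sketch. It is not the plugin's implementation: `RULES` and `dm_codes` are hypothetical names, and the table is a hand-picked subset of the D-M rules that covers only the letter groups in CHAUPTMAN. Each letter group has one slot for the start of the name, one for before a vowel, and one for everywhere else. A slot that holds two codes splits the encoding into two branches.

```python
# Simplified subset of the Daitch-Mokotoff table (assumption: only the
# letter groups needed for "CHAUPTMAN" are listed).
# Slots: (at start of name, before a vowel, anywhere else).
# Several codes in one slot mean the encoding branches; "" means "not coded".
RULES = {
    "CH": (("5", "4"), ("5", "4"), ("5", "4")),  # two readings, so it branches
    "AU": (("0",), ("7",), ("",)),
    "A": (("0",), ("",), ("",)),
    "P": (("7",), ("7",), ("7",)),
    "T": (("3",), ("3",), ("3",)),
    "M": (("6",), ("6",), ("6",)),
    "N": (("6",), ("6",), ("6",)),
}
VOWELS = set("AEIOUJY")


def dm_codes(name, length=6):
    """Return every encoding of `name` under the simplified table above."""
    name = name.upper()
    branches = [("", None)]  # (digits so far, last code applied)
    i = 0
    while i < len(name):
        # Prefer the two-letter group when the table has one.
        group = name[i:i + 2] if name[i:i + 2] in RULES else name[i]
        if group not in RULES:
            raise ValueError(f"letter group {group!r} is outside this sketch's table")
        following = name[i + len(group):i + len(group) + 1]
        if i == 0:
            slot = 0
        elif following in VOWELS:
            slot = 1
        else:
            slot = 2
        grown = []
        for digits, last in branches:
            for code in RULES[group][slot]:
                # Adjacent identical codes collapse into a single digit.
                extended = digits if code == last else digits + code
                grown.append((extended, code))
        branches = grown
        i += len(group)
    # Truncate or zero-pad every branch to the fixed code length.
    return [digits[:length].ljust(length, "0") for digits, _ in branches]


print(dm_codes("CHAUPTMAN"))  # ['573660', '473660']
```

Both codes come out of a single pass over the name, which is the behaviour this issue asks the token filter to have: one token per branch at the same position.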

Until this is fixed, D-M soundex in the phonetic plugin cannot be relied on for matching, because any name with more than one encoding is indexed under only one of them.
