Similarity comparisons should ignore case and non-alphanumeric characters #238

@ericbuckley

Summary

When comparing a feature on two different records, differences in case and punctuation shouldn't lower the score.

Impact

Because strings are not normalized before JaroWinkler/Levenshtein comparisons, scores drop even when both values refer to the same thing.

Steps to reproduce

Examples to consider:

  • Thomas vs thomas
  • O'Hara vs Ohara
  • 321 Main St vs 321 Main St.
  • Jose vs José
  • Albany vs Albany

Expected behavior

Each of the five cases above should result in a similarity score of 1.0.
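One possible way to get there is to normalize both values before they reach the comparison function. The sketch below is only an illustration, not the project's implementation: the `normalize` helper is a hypothetical name, and it assumes that whitespace counts as non-alphanumeric and that accented letters should match their unaccented forms. Once two values normalize to the same string, any similarity metric that scores identical strings as 1.0 would return 1.0.

```python
import unicodedata


def normalize(value: str) -> str:
    """Hypothetical helper: reduce a string to case-folded alphanumerics."""
    # NFKD splits accented letters into a base letter plus combining marks,
    # e.g. "é" becomes "e" followed by U+0301.
    decomposed = unicodedata.normalize("NFKD", value)
    # Keep only alphanumeric characters. This drops punctuation, whitespace
    # (an assumption), and the combining marks produced above.
    kept = "".join(ch for ch in decomposed if ch.isalnum())
    # casefold() is a more aggressive lower() intended for caseless matching.
    return kept.casefold()


pairs = [
    ("Thomas", "thomas"),
    ("O'Hara", "Ohara"),
    ("321 Main St", "321 Main St."),
    ("Jose", "José"),
    ("Albany", "Albany"),
]

for left, right in pairs:
    # Each pair should compare equal after normalization.
    print(left, right, normalize(left) == normalize(right))
```

Stripping whitespace also makes "Main St" and "MainSt" equal, which may or may not be desirable for address fields; keeping spaces in the filter would be a small change if not.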

Labels

bug: Something isn't working
