Add `string:jaro_distance/2` by the-mikedavis · Pull Request #7863 · erlang/otp

the-mikedavis · 2023-11-14T21:37:43Z

@garazdawi mentioned this might be a nice addition to string (erlang/rebar3#2844 (comment)). It's a translation from Elixir's String.jaro_distance/2 adapted to allow unicode:chardata() rather than only binaries.

@garazdawi also mentioned someone from the Erlang/OTP team was interested in working on this so I am submitting what I have now - please feel free to take it over or supersede this PR if you'd like!

Previously, if a test case failed and its expected value was neither a string nor a tuple, the 'io:format/2' call in the 'test_1/5' helper would raise an error instead of printing the failure, because of the '~ts' control sequence. For example, an incorrect case in the `length` test such as ?TEST("abc", [], 4) triggers this. To fix it, we swap the clauses so that '~ts' is used for binaries and lists and '~w' for everything else.

okeuday · 2023-11-15T10:06:32Z

@the-mikedavis Please consider having this functionality as string:distance/3 with the first argument as DistanceType :: jaro with the potential for adding the atoms jaro_winkler | levenshtein | damerau_levenshtein in the future (as described here) and the return value as an integer for the number of edits or characters in common. A second function string:distance_normalized/3 could be added for the [0.0 .. 1.0] return value. Giving people a choice of algorithm lets them pick one that suits the expected string contents and the acceptable latency.

dgud · 2023-11-15T10:27:30Z

@okeuday I don't like that; they have very different properties and return values (I have implemented most of them).
Also, the Levenshtein algorithms are 10 times slower than jaro, so I don't really see the need for them.

What would you use the Levenshtein algorithm for when jaro is 10 times faster?

okeuday · 2023-11-15T11:42:05Z

@dgud Applications of the Levenshtein algorithm are described here and most usages are likely uncommon in Erlang, but a shell feature that could provide help based on module or function spelling mistakes could be an Erlang use-case. If the strings are relatively short the latency may be justifiable with getting a better result. Not attempting to advocate for a slower shell. It just seems best to care about more than 1 string distance algorithm.
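To make the "help on spelling mistakes" use case concrete, here is a minimal Python sketch of the textbook Levenshtein edit distance used to pick the closest known name. It is an illustration only, not code from this PR or from Erlang/OTP; the `suggest` helper and the candidate names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(name: str, candidates: list[str]) -> str:
    """Hypothetical helper: return the candidate with the fewest edits."""
    return min(candidates, key=lambda c: levenshtein(name, c))


# Hypothetical candidate list, for illustration only.
print(suggest("revrese", ["reverse", "member", "foldl"]))
```

The nested loops visit every pair of positions, so the cost grows with the product of the two lengths, which is the latency trade-off being discussed here.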

dgud · 2023-11-15T12:21:14Z

> a shell feature that could provide help based on module or function spelling mistakes could be an Erlang use-case.

Which is one reason why we want to add jaro, and why we considered the other ones.
I'm still not convinced that a Levenshtein variant is significantly better at that.

dgud · 2023-11-15T15:43:24Z

@the-mikedavis We have another variant of the code coming. We have concluded that the Elixir variant calculates the transpositions "wrongly", or at least differently from the de facto standard. For example,
jaro_distance("Sunday", "Saturday") gives a different answer than the algorithm on Rosetta Code.

The original paper also seems to agree more with the Rosetta Code algorithms than with Elixir's, though the original code
does an integer division by 2, which would give the same result as Elixir's for the example above.
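The effect of the halving step can be seen in a small Python sketch of the commonly described Jaro similarity. This is an illustration written for this discussion, not the Erlang, Elixir, or Rosetta Code source; the `integer_halving` flag is a hypothetical switch that models only the integer-division-by-2 choice mentioned above.

```python
def jaro_similarity(s1: str, s2: str, integer_halving: bool = False) -> float:
    """Jaro similarity in [0.0, 1.0]; a sketch of the usual formulation."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if they are equal and at most `window` apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Compare the matched characters of each string in their own order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    out_of_order = sum(a != b for a, b in zip(seq1, seq2))
    # Assumption: the only difference modeled here is how the count is halved.
    t = out_of_order // 2 if integer_halving else out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


print(round(jaro_similarity("Sunday", "Saturday"), 4))                        # 0.7194
print(round(jaro_similarity("Sunday", "Saturday", integer_halving=True), 4))  # 0.7528
```

In this sketch the pair has five matches and three out-of-order matched characters, so exact halving gives 1.5 transpositions while integer halving gives 1, which is enough on its own to change the score.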

But I'll keep this PR open for inspiration, i.e. test changes and docs.

dgud · 2023-11-17T09:26:44Z

Thanks for the work.
Closed this in favor of #7879

the-mikedavis added 2 commits November 14, 2023 16:27

Add 'string:jaro_distance/2'

65146e6

dgud self-assigned this Nov 15, 2023

dgud mentioned this pull request Nov 17, 2023

Add string:jaro_similarity/2 #7879

Merged

dgud closed this Nov 17, 2023

the-mikedavis deleted the md-string-jaro_distance/2 branch November 17, 2023 13:54

josevalim mentioned this pull request Feb 15, 2024

Fix bug in jaro_distance implementation elixir-lang/elixir#13349

Closed

sabiwara mentioned this pull request Feb 25, 2024

Use Erlang's implementation of jaro distance, fix bugs elixir-lang/elixir#13369

Merged

the-mikedavis wants to merge 2 commits into erlang:master from the-mikedavis:md-string-jaro_distance/2
