mempalace_check_duplicate misses existing text at default threshold 0.9 #247

@lzhuojun251-ctrl

mempalace_check_duplicate misses clearly existing content at its default threshold of 0.9.

I tested it against text that already exists in the palace, including exact sentences copied from indexed content, and still got:

{
  "is_duplicate": false,
  "matches": []
}

Example 1:
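马丁·海德格尔(Martin Heidegger)出生于德国巴登——符腾堡(Baden-Württemberg)梅斯基尔希的一个贫寒的天主教家庭中。
(English: "Martin Heidegger was born into a poor Catholic family in Messkirch, Baden-Württemberg, Germany.")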

For this text, is_duplicate only came back true after I lowered the threshold to 0.4.

Example 2:
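本标准使用重新起草法参考 ISO 690:2010(E)《信息和文献 参考文献和信息资源引用指南》编制,与 ISO 690:2010 的一致性程度为非等效。
(English: "This standard was drafted using the redrafting method with reference to ISO 690:2010(E), 'Information and documentation: Guidelines for bibliographic references and citations to information resources', and its degree of consistency with ISO 690:2010 is non-equivalent.")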

For this text, is_duplicate only came back true after I lowered the threshold to 0.15.

So on real indexed content, the default 0.9 appears too high to detect even exact existing text.
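To make the threshold numbers above easier to reproduce, here is a minimal sketch of the sweep I did by hand: step the threshold down until is_duplicate flips to true. The helper name `highest_matching_threshold`, the `check(text, threshold)` callable, and the `fake_check` stand-in are all hypothetical and only for illustration; the real tool's call signature may differ, so wrap it in whatever adapter fits your setup.

```python
from typing import Callable, Iterable, Optional


def highest_matching_threshold(
    check: Callable[[str, float], dict],
    text: str,
    thresholds: Iterable[float] = (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1),
) -> Optional[float]:
    """Return the highest threshold at which `check` reports a duplicate, or None."""
    for threshold in sorted(thresholds, reverse=True):
        result = check(text, threshold)
        if result.get("is_duplicate"):
            return threshold
    return None


def fake_check(text: str, threshold: float) -> dict:
    """Hypothetical stand-in for the duplicate check (NOT the real tool).

    It pretends the best stored match always scores 0.42.
    """
    best_score = 0.42
    is_dup = best_score >= threshold
    matches = [{"similarity": best_score}] if is_dup else []
    return {"is_duplicate": is_dup, "matches": matches}


print(highest_matching_threshold(fake_check, "some already-indexed sentence"))  # prints 0.4
```

Replacing `fake_check` with a thin wrapper around the real duplicate check should give the same kind of number as the 0.4 and 0.15 reported above, assuming the response keeps the `is_duplicate` field shown earlier.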

Metadata

    Labels

    area/i18n (Multilingual, Unicode, non-English embeddings)
    area/search (Search and retrieval)
    bug (Something isn't working)
