[ES Number] Wrong digit detection in Spanish due to thousands separator #1468

@paulovn


The regular expression to detect sequences of digits is parametrized to include the thousands separator. In the Python library this produces the following regex:

IntegerRegexDefinition = lambda placeholder, thousandsmark: f'(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!(\\d+\\.|\\d+,))))\\d{{1,3}}({thousandsmark}\\d{{3}})+(?={placeholder})'

(see https://github.com/Microsoft/Recognizers-Text/blob/master/Python/libraries/recognizers-number/recognizers_number/resources/base_numbers.py#L13)

Unfortunately, in Spanish the thousands separator is a dot (.). When the regex is instantiated, that dot is inserted unescaped, and because the dot is a regex metacharacter it means "any character" rather than a literal dot.

This makes the regex match unintended sequences: in effect, anything of the form \d{1,3}.\d{3} is detected as a plain number, so 2r345, 23#310 and 329@120 are all numbers for this regex. The problem also compounds across multiple separators: 2Q423W532 is detected as a number too.
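The behaviour described above can be reproduced with a small sketch. It is a simplification, not the library's code: the lookbehind prefix and the placeholder lookahead are dropped (the full pattern uses variable-width lookbehinds, which the standard-library re module rejects), and build_unescaped is a hypothetical helper name.

```python
import re

# Hypothetical helper: a simplified stand-in for the generated pattern.
# Only the digit-group part is kept, and the separator is interpolated
# as-is, exactly as the template does.
def build_unescaped(thousandsmark: str) -> re.Pattern:
    return re.compile(rf'\d{{1,3}}({thousandsmark}\d{{3}})+')

spanish = build_unescaped('.')  # Spanish thousands separator

samples = ['2r345', '23#310', '329@120', '2Q423W532', '12.345']
# Every sample matches in full, because the unescaped dot accepts any character.
matched = [s for s in samples if spanish.fullmatch(s)]
print(matched)
```

Only the last sample is a well-formed Spanish number; the others are accepted purely because of the unescaped separator.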

In Python this is easily solved by escaping the parameter with re.escape:

IntegerRegexDefinition = lambda placeholder, thousandsmark: f'(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!(\\d+\\.|\\d+,))))\\d{{1,3}}(' + re.escape(thousandsmark) + f'\\d{{3}})+(?={placeholder})'

...but since that file is automatically generated, this should be patched upstream. The problem is likely to affect the other programming languages too.
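As a rough check of the proposed fix, the sketch below applies re.escape to the separator in a simplified pattern. The assumptions are the same as for the issue itself: only the digit-group part of the regex is kept so that it runs on the standard-library re module, and build_escaped is a hypothetical helper name, not part of the library.

```python
import re

# Hypothetical helper: simplified pattern with the separator escaped,
# so '.' is treated as a literal dot rather than "any character".
def build_escaped(thousandsmark: str) -> re.Pattern:
    return re.compile(r'\d{1,3}(' + re.escape(thousandsmark) + r'\d{3})+')

spanish = build_escaped('.')

valid = ['12.345', '1.234.567']
invalid = ['2r345', '23#310', '329@120', '2Q423W532']

accepted = [s for s in valid if spanish.fullmatch(s)]
rejected = [s for s in invalid if not spanish.fullmatch(s)]
print(accepted, rejected)
```

With the separator escaped, the well-formed numbers are still accepted and the spurious sequences from the report no longer match. Separators that are not metacharacters, such as a comma, are unaffected by the escaping.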
