[ES Number] Wrong digit detection in Spanish due to thousands separator #1468

The regular expression to detect sequences of digits is parametrized to include the thousands separator. In the Python library this produces the following regex:

```Python
IntegerRegexDefinition = lambda placeholder, thousandsmark: f'(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!(\\d+\\.|\\d+,))))\\d{{1,3}}({thousandsmark}\\d{{3}})+(?={placeholder})'
```
(see https://github.com/Microsoft/Recognizers-Text/blob/master/Python/libraries/recognizers-number/recognizers_number/resources/base_numbers.py#L13)

Unfortunately, in Spanish the thousands separator is a dot (`.`). When the regex is instantiated, that dot is inserted unescaped, and since the dot is a regex metacharacter it is interpreted as "any character" rather than as a literal dot.

This makes the regex match unintended sequences. In effect, anything of the form `\d{1,3}.\d{3}` (with `.` standing for any character) is detected as a plain number: `2r345`, `23#310` and `329@120` are all numbers for this regex. The problem also extends to more than one separator: `2Q423W532` is also a number.
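The behaviour can be reproduced with a minimal sketch. Note that this is a simplified stand-in, not the library's regex: the helper `simplified_integer_regex` is hypothetical and keeps only the digit-group part, dropping the lookarounds and the placeholder so that it runs with the standard `re` module.

```python
import re

def simplified_integer_regex(thousandsmark: str) -> str:
    # Hypothetical, simplified version of the pattern: only the digit groups,
    # with the separator interpolated unescaped (as in the generated file).
    return rf'\d{{1,3}}({thousandsmark}\d{{3}})+'

# Spanish configuration: the thousands separator is a dot
pattern = re.compile(simplified_integer_regex('.'))

samples = ['2r345', '23#310', '329@120', '2Q423W532', '1.234.567']
matches = {s: bool(pattern.fullmatch(s)) for s in samples}

for text, matched in matches.items():
    print(f'{text!r}: {matched}')
```

With the unescaped dot, every sample is accepted in full, including the ones that are clearly not numbers.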

In Python this is easily fixed by escaping the parameter with `re.escape`:

```Python
import re

# Escape the separator so that '.' is matched literally instead of as "any character"
IntegerRegexDefinition = lambda placeholder, thousandsmark: f'(((?<!\\d+\\s*)-\\s*)|((?<=\\b)(?<!(\\d+\\.|\\d+,))))\\d{{1,3}}({re.escape(thousandsmark)}\\d{{3}})+(?={placeholder})'
```
...but since that file is automatically generated, this should be patched upstream. The problem is likely to affect the other programming languages as well.
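As a quick sanity check of the escaping approach, here is a minimal sketch. It is a simplified stand-in rather than the library's regex: the helper `escaped_integer_regex` is hypothetical and keeps only the digit-group part (no lookarounds, no placeholder) so that it runs with the standard `re` module.

```python
import re

def escaped_integer_regex(thousandsmark: str) -> str:
    # Hypothetical, simplified pattern: digit groups only, with the
    # separator passed through re.escape so it is matched literally.
    return rf'\d{{1,3}}({re.escape(thousandsmark)}\d{{3}})+'

# Spanish configuration: the thousands separator is a dot
pattern = re.compile(escaped_integer_regex('.'))

false_positives = ['2r345', '23#310', '329@120', '2Q423W532']
real_numbers = ['1.234', '1.234.567']

for text in false_positives + real_numbers:
    print(f'{text!r}: {bool(pattern.fullmatch(text))}')
```

Under these assumptions the four spurious strings are rejected while the properly separated numbers are still accepted.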

