UTF-8 template file with 1 Polish character parsed using ISO-8859-1 #448

@rmilecki

I'm creating this issue to provide another example of unreliable chardet behaviour.

The following simple template:

# -*- coding: utf-8 -*-
# SPDX-License-Identifier: MIT
issuer: PayPro
keywords:
  - 'NIP: PL779-236-98-87'
fields:
  vatin:
    parser: static
    value: PL7792369887
  amount: Do zap.aty:\s+(\d[\d\s]*\.\d{2})
  date: Data wystawienia:\s+(\d{4}-\d{2}-\d{2})
  sale_date: Data sprzedaży:\s+(\d{4}-\d{2}-\d{2})
  invoice_number: NR:\s+([\dA-Z/]+)
options:
  currency: PLN
  date_formats:
    - '%Y-%m-%d'
  decimal_separator: '.'

has its encoding incorrectly detected as ISO-8859-1 (instead of UTF-8), so its sale_date regex doesn't match the invoice content.
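
To illustrate why the wrong encoding breaks the match, here is a minimal sketch using only the standard library. It does not call chardet; it simply decodes the same bytes both ways. The invoice text is a made-up example, not taken from a real invoice.

```python
import re

# The one template line that contains a non-ASCII character (ż),
# as it would be stored on disk in UTF-8.
template_line = r"Data sprzedaży:\s+(\d{4}-\d{2}-\d{2})"
raw = template_line.encode("utf-8")

# Hypothetical invoice text, invented for this illustration.
invoice_text = "Data sprzedaży: 2020-01-31"

# Decoding with the correct encoding restores the original pattern.
pattern_utf8 = raw.decode("utf-8")

# Decoding as ISO-8859-1 turns the two UTF-8 bytes of "ż" (0xC5 0xBC)
# into two separate characters, "Å" and "¼".
pattern_latin1 = raw.decode("iso-8859-1")

match_utf8 = re.search(pattern_utf8, invoice_text)
match_latin1 = re.search(pattern_latin1, invoice_text)

print(repr(pattern_latin1))
print(match_utf8.group(1) if match_utf8 else None)
print(match_latin1)
```

With the ISO-8859-1 decoding the pattern contains "sprzedaÅ¼y", which never occurs in the invoice text, so the sale_date field is not found.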
