Problem:
In Python 3, unprefixed string literals become unicode strings. In some parts of NVDA we use unprefixed strings for binary data, which will cause errors and strange behaviour under Python 3.
To gain confidence in a Python 3 release of NVDA we need to consider these strings and how they are used, and decide whether each literal should be prefixed with 'u', 'r', or 'b'.
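As a rough illustration of the kind of breakage to expect, the following sketch shows how Python 3 treats unprefixed, 'r', and 'b' literals. The variable names are made up for the example and are not taken from NVDA.

```python
# Python 3: an unprefixed literal is text (str), not bytes.
text = "abc"
data = b"abc"

assert isinstance(text, str)
assert isinstance(data, bytes)
assert text != data  # never equal in Python 3, even for pure ASCII content

# The 'r' prefix only changes escape handling; the result is still str.
raw = r"\x00"
assert isinstance(raw, str)
assert len(raw) == 4  # backslash, 'x', '0', '0'

# Mixing bytes and str raises instead of being silently coerced as in Python 2.
try:
    data + text
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

# Indexing bytes yields an int, not a one-character string.
first = data[0]
```

Code that compares, concatenates, or indexes such values is where an unprefixed literal used as binary data is most likely to misbehave.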
There are more than 1000 unprefixed string literals in the NVDA codebase. To count them I ran the regex `[^u\n"]"[^"\n]` over the *.py files in the repo/source directory. The PyCharm IDE can also exclude matches that fall inside comments.
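An alternative to the regex, sketched here only as a possibility, is to count literals with Python's `tokenize` module, which skips comments and reports each literal's prefix. The helper names `literal_prefix` and `count_prefixes` are hypothetical. This assumes the source can be read by the Python 3 tokenizer; docstrings are counted as literals, and on Python 3.12+ f-strings are not reported as `STRING` tokens.

```python
import io
import tokenize
from collections import Counter


def literal_prefix(token_text):
    """Return the lower-cased prefix of a string literal, or '' if it has none."""
    for i, ch in enumerate(token_text):
        if ch in "'\"":
            return token_text[:i].lower()
    return ""


def count_prefixes(source):
    """Count the string literals in `source`, grouped by prefix."""
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            counts[literal_prefix(tok.string)] += 1
    return counts


# Small made-up sample; the quoted text inside the comment is not counted.
sample = (
    'a = "plain"  # "in a comment"\n'
    "b = u'text'\n"
    'c = b"\\x00\\x01"\n'
    'd = r"C:\\temp"\n'
)
counts = count_prefixes(sample)
```

Running `count_prefixes` over each file and summing the counters would give a per-prefix total for the whole source tree.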
Suggested approach:
- Existing cases specifying 'u' are ok, they were already intended to be used as unicode strings.
- Existing cases specifying 'r' are higher risk, they may be used for binary data. We should look at these first.
- Using the regex `r".+"` there seem to be just under 5000 of these 'r'-prefixed strings.
- Cases with no prefix: the vast majority will be ok as unicode strings, but there are certainly some that are used as bytes / binary data. These will be the hardest to find.
Looking at each string individually will take weeks. I suggest we see how we can exclude low risk areas:
- Translated strings (`_("blah")`, `pgettext("blah")`) can be ignored, whether they have 'u', 'r', or no prefix. We can be quite confident they will not be bytes.
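One possible way to apply this exclusion automatically is to walk the syntax tree and skip literals passed directly to a translation call. This is a minimal sketch, not an agreed approach: the function name `untranslated_literals` is hypothetical, and it assumes the file parses under Python 3 (3.8 or later for `ast.Constant`), which may not yet hold for all of NVDA's source.

```python
import ast

# Call names treated as translation functions, taken from the bullet above.
TRANSLATION_FUNCS = {"_", "pgettext"}


def untranslated_literals(source):
    """Return (line, value) for each str/bytes literal that is not a direct
    argument of a translation call."""
    tree = ast.parse(source)
    excluded = set()
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in TRANSLATION_FUNCS
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    excluded.add(id(arg))
    remaining = [
        (node.lineno, node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant)
        and isinstance(node.value, (str, bytes))
        and id(node) not in excluded
    ]
    return sorted(remaining, key=lambda item: item[0])


# Made-up sample: only the literal written to the port is left for review.
sample = (
    'label = _("Open")\n'
    'msg = pgettext("menu", "Close")\n'
    'port.write("\\x1b[0m")\n'
)
remaining = untranslated_literals(sample)
```

The returned line numbers could also feed a simple checklist, which touches on the tracking question below.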
Open questions:
- How are we going to keep track of what has been looked at / excluded?
- Is it feasible to use regex to automate adding prefixes to string literals in areas that we are confident in?
- How do we identify general cases we can exclude?