Add detection for MacRoman encoding #5

Closed
rspeer wants to merge 1 commit into chardet:master from LuminosoInsight:master
@rspeer

rspeer commented Nov 16, 2012

MacRoman is not in particularly common use anymore, as it has been deprecated by Mac OS for over a decade. However, there are programs such as Microsoft Office for Mac that didn't get the memo, and will often output in MacRoman when they write plain text files.

This patch allows chardet to correctly detect MacRoman, instead of calling it something random and incorrect like ISO-8859-2. The MacRoman detector works similarly to the Latin-1 detector, but starts at a lower probability.

I hope this is the right way to do it. There is surprisingly little support in chardet for adding a new single-byte Latin encoding.

MacRoman is not in particularly common use anymore, as it has been
deprecated by Mac OS for over a decade. However, there are programs such
as Microsoft Office for Mac that didn't get the memo, and will output in
MacRoman by default.

The MacRoman detector works similarly to the Latin-1 detector, but
starts at a lower probability.
@puzzlet
Contributor

puzzlet commented Dec 2, 2012

I'm not an authority here, but could you give some live examples we can test?

dan-blanchard pushed a commit that referenced this pull request Dec 15, 2013
@dan-blanchard
Member

@rspeer, if you can provide some example documents for testing, I'd gladly merge this.

@adamn

adamn commented May 17, 2016

Since this detector is for archaic technology, and it's given the lowest priority in the universal detector, it seems like this can just go in. Maybe with a warning (or a setting to disable the detector by default) if necessary?

Documentation and examples have been pending for almost 2 years so I don't see that ever happening.

@sigmavirus24
Member

@adamn you're right. This hasn't changed in almost 4 years and the pull request doesn't merge cleanly. I'm going to close this unless someone revives it in a new pull request. (We've also had no requests (other than this PR) for this encoding.)

@rspeer
Author

rspeer commented May 24, 2016

I don't think that's what adamn meant, sigmavirus24, but oh well.

MacRoman is the default encoding that Office for Mac uses for export, and is a very frequently used and frequently mis-detected encoding. People who need MacRoman detection don't know they need MacRoman detection, they just know "chardet doesn't work". Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Sorry for not hand-holding this PR to completion, but I lost interest in fixing chardet.

@dan-blanchard
Member

> Every encoding in chardet that doesn't start with "UTF" is archaic technology in a sense: that's what chardet is for, isn't it?

Yeah, I agree with @rspeer. chardet is essentially a tool for dealing with archaic encodings.

dan-blanchard reopened this May 24, 2016
@jbrockmendel

> if you can provide some example documents for testing, I'd gladly merge this.

Found in the wild: http://eclipse.gsfc.nasa.gov/5MCSE/5MKSEcatalog.txt

The only non-ASCII content is four occurrences of "\xa1". Decoded as MacRoman, these are degree signs, as in latitude/longitude. chardet.detect (using cchardet 1.1.1) returns {'confidence': 0.8844350576400757, 'encoding': u'WINDOWS-1252'}.
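
To illustrate why the guess matters for this byte, here is a minimal sketch using only Python's built-in codecs. The sample string is made up for illustration and is not taken from the NASA file.

```python
# Hypothetical sample: a coordinate with the single byte 0xA1 after the digits.
raw = b"Lat 12\xa1 N"

# In MacRoman, 0xA1 is U+00B0 DEGREE SIGN.
as_mac_roman = raw.decode("mac_roman")

# In Windows-1252, the same byte is U+00A1 INVERTED EXCLAMATION MARK.
as_cp1252 = raw.decode("cp1252")

print(as_mac_roman)
print(as_cp1252)
```

Both decodes succeed, which is why a detector cannot rely on decoding errors alone to tell these two encodings apart.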

@dan-blanchard
Member

@jbrockmendel Thanks for the example! I will likely just add MacRoman to the set of encodings supported by several Western languages in #99 and not use this PR's approach, but I'll keep it open for now until we decide.

@alichur

alichur commented Jul 20, 2017

I would love to see this fixed. Sadly, any Mac user dealing with CSV files in Excel will end up with MacRoman encoding when they save.

@jbrockmendel

@alichur could we use this behavior to generate a thorough set of samples?

@alichur

alichur commented Jul 25, 2017

@jbrockmendel yes, I believe so. Open any CSV file in Excel (on a Mac), and when you save it the file encoding will be MacRoman.
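
Until real Excel exports are collected, a rough stand-in could be synthesized with Python's built-in `mac_roman` codec. This is only a sketch: the sample text and the `is_valid_utf8` helper are made up for illustration, and synthetic data will not capture everything a real Excel export does (line endings, for instance).

```python
# Hypothetical sample row; every character here exists in MacRoman.
sample_text = "name,note\nRenée,“quoted” – 20° café\n"

# Encode the text the way a MacRoman-writing program would store it.
mac_roman_bytes = sample_text.encode("mac_roman")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(mac_roman_bytes)
print(is_valid_utf8(mac_roman_bytes))
```

The resulting bytes could be written to a file and fed to a detector as a test fixture, with the caveat above that they are not real-world samples.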

@MrCsabaToth

A good use case can be a CSV file saved with Mac Excel. The typical examples are the MacRoman variants of the single quotes (left 0xD4, right 0xD5, bottom 0xE2), the double quotes (left 0xD2, right 0xD3, bottom 0xE3), and the dash variants (0xD0, 0xD1). Those hit me all the time! https://en.wikipedia.org/wiki/Mac_OS_Roman
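
The byte values listed above can be checked against Python's built-in `mac_roman` codec. A small sketch follows; the `describe` helper is a hypothetical name used only for this example.

```python
import unicodedata

# Byte values mentioned in the comment above.
MAC_ROMAN_PUNCTUATION = [0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3]


def describe(byte_value: int) -> str:
    """Decode one byte as MacRoman and report its code point and Unicode name."""
    char = bytes([byte_value]).decode("mac_roman")
    return f"0x{byte_value:02X} -> U+{ord(char):04X} {unicodedata.name(char)}"


for byte_value in MAC_ROMAN_PUNCTUATION:
    print(describe(byte_value))
```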

@MrCsabaToth

To detect these cases (0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3), though, my custom code checks whether the surrounding characters are in the standard ASCII range, to make sure I'm not dealing with some UTF-8 sequence.
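
A rough sketch of that kind of check is below. It is not the actual custom code, which is not shown in this thread; the function name and the exact rule (ASCII, or nothing, on both sides) are assumptions made for illustration.

```python
# The MacRoman quote and dash bytes listed above.
SUSPECT_BYTES = frozenset([0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xE2, 0xE3])


def looks_like_mac_roman_punctuation(data: bytes) -> bool:
    """Return True if a suspect byte has ASCII (or nothing) on both sides.

    In UTF-8, these byte values are lead bytes and must be followed by a
    continuation byte (0x80-0xBF), so an ASCII neighbor on the right
    rules out a valid UTF-8 sequence at that position.
    """
    for i, byte_value in enumerate(data):
        if byte_value not in SUSPECT_BYTES:
            continue
        prev_ok = i == 0 or data[i - 1] < 0x80
        next_ok = i == len(data) - 1 or data[i + 1] < 0x80
        if prev_ok and next_ok:
            return True
    return False


print(looks_like_mac_roman_punctuation(b"Allen\xd5s Trackpad"))
print(looks_like_mac_roman_punctuation("Allen\u2019s Trackpad".encode("utf-8")))
```

This is a heuristic, not a detector: it only flags the punctuation bytes named above and says nothing about accented letters or other high bytes.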

@YesThatAllen

Late to the game here. We were all set to use chardet, and had even implemented it, when we realized that mac_roman isn't supported.

As of April 2022:

Asking finger for info via Popen will give UTF-8 data in almost all cases, but will return mac_roman when double-byte characters are in the response, sigh.

> Could you give some live examples we can test?

macOS's ioreg command will vary in its output.

/usr/sbin/ioreg -l will respond using mac_roman encoding if there's an apostrophe in the name of a Bluetooth mouse/trackpad: "Product"="Allen’s Trackpad"

On the same computer, /usr/sbin/ioreg -rd1 -c IOPlatformExpertDevice will not include the pointing device, and so ioreg will respond with UTF-8.
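
In the absence of detection, one possible workaround is to try UTF-8 first and fall back to mac_roman. This is a sketch under the assumption that the output is always one of those two encodings; `decode_command_output` and `read_ioreg` are hypothetical helper names, and `read_ioreg` is only meaningful on macOS.

```python
import subprocess


def decode_command_output(raw: bytes) -> str:
    """Try strict UTF-8 first, then fall back to MacRoman."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # MacRoman is a single-byte encoding, so high bytes that are
        # invalid as UTF-8 are read as individual MacRoman characters.
        return raw.decode("mac_roman")


def read_ioreg(*args: str) -> str:
    """Run ioreg with the given arguments and decode its output (macOS only)."""
    result = subprocess.run(
        ["/usr/sbin/ioreg", *args], capture_output=True, check=True
    )
    return decode_command_output(result.stdout)


# The trackpad name from above, as it would look in each encoding.
print(decode_command_output(b'"Product"="Allen\xd5s Trackpad"'))
print(decode_command_output('"Product"="Allen\u2019s Trackpad"'.encode("utf-8")))
```

On a Mac, `read_ioreg("-l")` would then cover both cases described above, but the fallback is a guess and will mis-decode output in any third encoding.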

@dan-blanchard
Member

@YesThatAllen thanks for the tip about ioreg. I had no idea we finally had a way to generate MacRoman data. I've added a test for this, and revived this very old PR and manually merged it via c292b52.

9 participants