IndexError on sending in ISO-8859-7 #124

@pvanderlinden

When I pass the following bytes to chardet.detect() (Python 3.5), chardet raises an IndexError:
b'\xcc\xe5 \xef\xec\xe9\xeb\xdf\xe1 \xf4\xe7\xf2'
In 2.3.0 this returned ISO-8859-7; in versions 3.0.0 and 3.0.1 it returns None. I'm not entirely sure which is the correct behaviour.
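
As a sanity check on which answer is plausible, the bytes can be decoded with Python's built-in ISO-8859-7 codec. This is only a sketch using the standard library, not chardet; the variable names are illustrative.

```python
# The exact byte string from the report.
body = b'\xcc\xe5 \xef\xec\xe9\xeb\xdf\xe1 \xf4\xe7\xf2'

# Decode with the standard library's ISO-8859-7 (Latin/Greek) codec.
decoded = body.decode('iso-8859-7')
print(decoded)  # prints: Με ομιλία της

# Every non-space byte is above 0x7F, so the input is not pure ASCII.
high_bytes = [b for b in body if b > 0x7F]
print(len(high_bytes), "of", len(body), "bytes are non-ASCII")
```

The result is readable Greek text, which suggests the ISO-8859-7 answer from 2.3.0 was reasonable for this input, although a sample this short gives a detector very little to work with.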

Traceback:

    encoding = chardet.detect(body)['encoding']
/opt/anaconda3/envs/env/lib/python3.5/site-packages/chardet/__init__.py:39: in detect
    return detector.close()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <chardet.universaldetector.UniversalDetector object at 0x7f177ce554a8>

    def close(self):
        """
            Stop analyzing the current document and come up with a final
            prediction.
    
            :returns:  The ``result`` attribute, a ``dict`` with the keys
                       `encoding`, `confidence`, and `language`.
            """
        # Don't bother with checks if we're already done
        if self.done:
            return self.result
        self.done = True
    
        if not self._got_data:
            self.logger.debug('no data received!')
    
        # Default to ASCII if it is all we've seen so far
        elif self._input_state == InputState.PURE_ASCII:
            self.result = {'encoding': 'ascii',
                           'confidence': 1.0,
                           'language': ''}
    
        # If we have seen non-ASCII, return the best that met MINIMUM_THRESHOLD
        elif self._input_state == InputState.HIGH_BYTE:
            prober_confidence = None
            max_prober_confidence = 0.0
            max_prober = None
            for prober in self._charset_probers:
                if not prober:
                    continue
                prober_confidence = prober.get_confidence()
                if prober_confidence > max_prober_confidence:
                    max_prober_confidence = prober_confidence
                    max_prober = prober
            if max_prober and (max_prober_confidence > self.MINIMUM_THRESHOLD):
                charset_name = max_prober.charset_name
                lower_charset_name = max_prober.charset_name.lower()
                confidence = max_prober.get_confidence()
                # Use Windows encoding name instead of ISO-8859 if we saw any
                # extra Windows-specific bytes
                if lower_charset_name.startswith('iso-8859'):
                    if self._has_win_bytes:
                        charset_name = self.ISO_WIN_MAP.get(lower_charset_name,
                                                            charset_name)
                self.result = {'encoding': charset_name,
                               'confidence': confidence,
                               'language': max_prober.language}
    
        # Log all prober confidences if none met MINIMUM_THRESHOLD
        if self.logger.getEffectiveLevel() == logging.DEBUG:
            if self.result['encoding'] is None:
                self.logger.debug('no probers hit minimum threshold')
>               for prober in self._charset_probers[0].probers:
E               IndexError: list index out of range

/opt/anaconda3/envs/env/lib/python3.5/site-packages/chardet/universaldetector.py:271: IndexError
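
The traceback shows that the failing line is only reached when the logger's effective level is DEBUG and no encoding was chosen, and that `self._charset_probers` is empty at that point. Below is a minimal sketch of how such a debug-only loop could avoid indexing element 0. The `FakeProber` and `FakeGroupProber` classes and the `collect_prober_confidences` helper are hypothetical stand-ins for illustration; this is not chardet's code or its actual fix.

```python
class FakeProber:
    """Hypothetical stand-in for a single charset prober."""

    def __init__(self, charset_name, confidence):
        self.charset_name = charset_name
        self._confidence = confidence

    def get_confidence(self):
        return self._confidence


class FakeGroupProber:
    """Hypothetical stand-in for a prober that wraps several probers."""

    def __init__(self, probers):
        self.probers = probers


def collect_prober_confidences(charset_probers):
    """Return (charset_name, confidence) pairs for every usable prober.

    Iterates over the list instead of indexing [0], so an empty list
    yields an empty result rather than raising IndexError.
    """
    rows = []
    for group in charset_probers:
        if not group:
            continue
        # A group exposes its children via .probers; treat anything
        # else as a single prober.
        for prober in getattr(group, 'probers', [group]):
            if prober:
                rows.append((prober.charset_name, prober.get_confidence()))
    return rows


# An empty prober list, as in the traceback, no longer raises.
print(collect_prober_confidences([]))
```

With a populated list, the helper returns one pair per prober, which a caller could then pass to `logger.debug`.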
