Unicode issues with tail plugin #1269

@h0od

I'm using irssi to write channel logs to files, and then use the tail plugin to parse entries from those files. The log files are encoded as UTF-8. Until recently this worked perfectly, but now the plugin aborts with the error "Failed to decode file using utf-8. Check encoding." whenever a non-ASCII Unicode character appears in the log.

On further investigation, I found that it is caused by an exception raised in native_str_to_text(). The call site in tail.py and the function's definition are:

try:
    line = native_str_to_text(line, encoding=encoding)
except UnicodeError:
    raise plugin.PluginError('Failed to decode file using %s. Check encoding.' % encoding)

if PY2:
    def native_str_to_text(string, **kwargs):
        if 'encoding' not in kwargs:
            kwargs['encoding'] = 'ascii'
        return string.decode(**kwargs)
else:
    def native_str_to_text(string, **kwargs):
        return string

Since I'm running Python 2.7, native_str_to_text() basically just calls string.decode(). And because the error message contains our preferred encoding, I think we can be fairly certain that the correct encoding string is being passed to decode().

I extended the code to print the full exception, which produced the following output:

'ascii' codec can't encode character u'\u25e2' in position 21: ordinal not in range(128)
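
One detail worth noting about this message: it reports an encode failure, even though the code only asks for a decode. On Python 2, calling .decode() on a value that is already a unicode object first encodes it implicitly with the ASCII codec, so a message like this can appear if the line read from the file is already unicode. Whether that is what happens here is an assumption, not something confirmed in this report. The sketch below (Python 3 compatible, with a hypothetical helper name) only shows the encode step in isolation and why `except UnicodeError` catches it:

```python
def ascii_encode_error(text):
    """Return (exception class name, message) if `text` cannot be
    encoded as ASCII, otherwise None. Hypothetical helper for illustration."""
    try:
        text.encode('ascii')
    except UnicodeError as exc:
        # UnicodeEncodeError is a subclass of UnicodeError, so the
        # `except UnicodeError` clause in tail.py would catch it too.
        return type(exc).__name__, str(exc)
    return None


# U+25E2 is the character named in the error output above.
result = ascii_encode_error(u'\u25e2')
print(result)
```

If this is the cause, the value reaching native_str_to_text() would be text rather than bytes, which points at how the file is opened rather than at the encoding setting.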

The strange thing is that when I cloned the git repository and ran the code from there instead of from the installed package, everything worked. So I think this is some locale/environment trouble. I've tried recreating the virtualenv, but it had no effect. I've also tried setting LC_ALL, PYTHONIOENCODING, and LANG to UTF-8, with no luck.

I was able to recreate the exception in some test code by explicitly setting PYTHONIOENCODING=ascii. So there are definitely some issues with how Python performs decoding in my environment.

However, what did fix the problem was changing tail.py so that it opens the file in binary mode:

- with open(filename, 'r') as file:
+ with open(filename, 'rb') as file:

I'm new to Python, so I'm not confident this is a good solution: judging by how native_str_to_text() is defined, it could cause problems for Python 3 users, as I've heard that Unicode string handling changed between v2 and v3. Another solution that came to mind was to specify an encoding in the open() call for the file.
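
To make the two options concrete, here is a minimal sketch of both, written so that it runs on Python 2.7 and Python 3. The function names and the sample log line are made up for illustration and are not taken from tail.py. The first function opens the file in binary mode and decodes each line explicitly; the second uses io.open(), which accepts an encoding argument on both Python versions and returns already-decoded text.

```python
import io
import os
import tempfile


def read_lines_binary(filename, encoding='utf-8'):
    """Open in binary mode and decode every line explicitly."""
    with open(filename, 'rb') as f:
        return [raw.decode(encoding) for raw in f]


def read_lines_text(filename, encoding='utf-8'):
    """Let io.open() decode; it takes `encoding` on Python 2.7 and 3."""
    with io.open(filename, 'r', encoding=encoding) as f:
        return list(f)


# Hypothetical log line containing U+25E2, the character from the error.
sample = u'10:05 nick: title \u25e2 : http://example.com/\n'

fd, path = tempfile.mkstemp(suffix='.log')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-8'))

binary_lines = read_lines_binary(path)
text_lines = read_lines_text(path)
os.remove(path)

print(binary_lines == text_lines)
```

Either way the result is text on both Python versions, so the caller no longer depends on the locale or on what the plain open() returns in a given environment. With the second approach the decode step in native_str_to_text() would no longer be needed, since the lines arrive already decoded.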

Config:

 taskX:
    tail:
      file: ~/.irssi/logfromchannel.log
      encoding: utf-8
      entry:
        title: nick:\s(.*?)\s:\shttp://.*
        url: nick:\s.*?\s:\s(http://.*)
      format:
        url: '%(url)s'
 (... and other settings for the task, not related to tail)
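
For reference, this is roughly what the two entry patterns extract from a log line. The sample line is invented, and using re.search() is an assumption, since how the tail plugin applies the patterns internally is not shown here.

```python
import re

# The two patterns from the `entry` section of the config above.
TITLE_RE = re.compile(r'nick:\s(.*?)\s:\shttp://.*')
URL_RE = re.compile(r'nick:\s.*?\s:\s(http://.*)')

# Hypothetical irssi log line with a non-ASCII character in the title.
sample_line = u'10:05 nick: Some title \u25e2 : http://example.com/page'

title = TITLE_RE.search(sample_line).group(1)
url = URL_RE.search(sample_line).group(1)

print(repr(title))
print(url)
```

The title group is where a character such as U+25E2 would end up, so the line has to be decoded correctly before the patterns are applied.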

Log:

2016-07-06 10:05 CRITICAL plugin        taskX      Failed to decode file using utf-8. Check encoding.
2016-07-06 10:05 WARNING  task          taskX      Aborting task (plugin: tail)

Additional information:

  • Flexget Version: 2.1.6
  • Python Version: 2.7.10
  • Installation method: Standard installation from released package using virtualenv and pip
  • OS and version:
    openSUSE 13.1
    Linux 3.12.57-44-default #1 SMP Wed Apr 6 09:18:15 UTC 2016 (9b4534f) x86_64 x86_64 x86_64 GNU/Linux
