extractText() doesn't work on Chinese PDF #252

@zxteloiv

In my tests, pure English content in a PDF is extracted without problems,
but nothing readable is extracted from a page with Chinese text.

I guess it's caused by the encoding.
I tried changing the function at the following line to the version below:
https://github.com/mstamy2/PyPDF2/blob/master/PyPDF2/utils.py#L246

def u_(s):
    if sys.version_info[0] < 3:
        return unicode(s, encoding='utf-8')
    else:
        return s

But that doesn't fix the problem.

My environment:

  • Python 2.7.10
  • OS X El Capitan
  • PyPDF2 version 1.25.1

Thank you.
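
A possible explanation for why the u_() change has no effect, offered as an assumption and not something confirmed in this report: in many CJK PDFs the strings in a page's content stream are font-specific glyph codes, not UTF-8 text, so no choice of decoding in u_() can recover the characters without the font's code-to-Unicode mapping. The sketch below illustrates the idea with made-up data. The byte string, the table HYPOTHETICAL_TO_UNICODE, and both helper functions are hypothetical and are not part of PyPDF2.

```python
# -*- coding: utf-8 -*-
# Hypothetical illustration only: the glyph codes and the mapping table are
# made up. They do not come from a real PDF or from PyPDF2 internals.

# A made-up code-to-Unicode table: 2-byte glyph code -> Unicode character.
HYPOTHETICAL_TO_UNICODE = {
    0x0A3F: u"\u4e2d",  # 中
    0x1B72: u"\u6587",  # 文
}

# What a content stream might hold for the two glyphs above (assumed data).
raw = b"\x0a\x3f\x1b\x72"


def decode_as_utf8(data):
    """Mimic the patched u_(): treat the raw bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")


def decode_with_table(data, table):
    """Read big-endian 2-byte codes and look each one up in the table."""
    codes = bytearray(data)
    chars = []
    for i in range(0, len(codes) - 1, 2):
        code = (codes[i] << 8) | codes[i + 1]
        # Unknown codes become U+FFFD instead of raising an error.
        chars.append(table.get(code, u"\ufffd"))
    return u"".join(chars)


if __name__ == "__main__":
    print(repr(decode_as_utf8(raw)))
    print(decode_with_table(raw, HYPOTHETICAL_TO_UNICODE))
```

Under these assumptions, decoding the bytes as UTF-8 succeeds but yields unrelated control and ASCII characters, while the table lookup yields the intended text. Whether this is what happens in the affected PDF would need to be checked against that file's fonts.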

Labels:

  • is-cjk-issue: Issue related to CJK (Chinese-Japanese-Korean)
  • workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow