Extracted text is unreadable (random glyphs) for PDFs with Japanese text #2445

@lehnerpat

Intro

Hi everyone,

thank you very much for creating and working on Docspell!

I've been wanting to get started with digitally organizing my documents for a while now. I found Docspell as one solution that might work well for me, so I've started trying it out.

One thing upfront: my use case is probably a bit unusual, since I have documents in three languages (German, English, and Japanese) that I want to put into my archive/DMS. (Note: while I do have a few documents with mixed languages in the same document, we can ignore that for now and focus only on single-language documents.)

Problem summary

For some PDFs that contain Japanese text, the "extracted text" in Docspell is just some random glyphs.
This is specifically about PDFs that already contain text (I'm pretty sure it's not an OCR issue).
I've also noticed that this problem doesn't occur for all PDFs with Japanese text, and I don't know what distinguishes the affected ones.

Reproducing the problem on v0.40.0

  1. Set up Docspell with docker compose, following the docker compose section of the installation manual.
    Since I wanted to use a release version, I deviated from the manual by downloading the docker compose file from tag v0.40.0 instead.
    Specifically, I did these steps:

    > cd /tmp
    > pwd
    /tmp
    > mkdir -p docspell/docker/docker-compose
    > cd docspell/docker/docker-compose
    > pwd
    /tmp/docspell/docker/docker-compose
    > wget https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    --2023-12-30 16:45:25--  https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4740 (4.6K) [text/plain]
    Saving to: ‘docker-compose.yml’
    
    docker-compose.yml  100%[===================>]   4.63K  --.-KB/s    in 0s
    
    2023-12-30 16:45:25 (20.2 MB/s) - ‘docker-compose.yml’ saved [4740/4740]
    
    > ls -A
    docker-compose.yml
    > docker-compose up -d
    [+] Building 0.0s (0/0)                                    docker:desktop-linux
    [+] Running 8/8
    ✔ Network docker-compose_default                  Created                 0.0s
    ✔ Volume "docker-compose_docspell-postgres_data"  Created                 0.0s
    ✔ Volume "docker-compose_docspell-solr_data"      Created                 0.0s
    ✔ Container docspell-solr                         Started                 0.0s
    ✔ Container postgres_db                           Started                 0.1s
    ✔ Container docspell-joex                         Started                 0.1s
    ✔ Container docspell-restserver                   Started                 0.1s
    ✔ Container docspell-consumedir                   Started                 0.0s
    
    • FYI, here are the containers that were created, and the images that they use:

      > docker ps
      CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS                   PORTS                    NAMES
      1045dbbbd81a   docspell/dsc:latest          "dsc -d http://docsp…"   5 minutes ago   Up 5 minutes                                      docspell-consumedir
      9afa894f6acb   docspell/joex:latest         "/opt/joex-entrypoin…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7878->7878/tcp   docspell-joex
      90faedb8cdca   docspell/restserver:latest   "/opt/docspell-rests…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7880->7880/tcp   docspell-restserver
      d0ea7652c5a4   solr:9                       "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes (healthy)   8983/tcp                 docspell-solr
      adf3c972c07f   postgres:15.2                "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes             5432/tcp                 postgres_db
      
      
      > docker image ls
      REPOSITORY           TAG     IMAGE ID       CREATED        SIZE
      solr                 9       3c38c30d646b   13 days ago    593MB
      postgres             15.2    bf700010ce28   8 months ago   379MB
      docspell/dsc         latest  54d581f6c5a1   9 months ago   20.1MB
      docspell/joex        latest  d129a81f07fd   9 months ago   1.99GB
      docspell/restserver  latest  1e700758d41a   9 months ago   336MB
      
  2. Open the web UI at http://localhost:7880 and create a new collective + user using the "Sign up!" button.

    • Collective ID: issuerepro
      User Login: issuerepro
      Password: issuerepro
  3. Download two example documents that contain Japanese text:

    • 000472926.pdf (visa document)
    • standard_jyusetsu_20191201.pdf (insurance document)
  4. Upload the documents to Docspell via the web UI:

    • Open the dashboard (http://localhost:7880/app/dashboard), and log in with user issuerepro that we created above.
    • Choose the files via drag-and-drop or using the "Select..." button in the drop area on the dashboard.
    • Click "Submit".
    • Wait for processing to finish (should be relatively quick, since no OCR needs to be done).
  5. Open the visa document 000472926.pdf, and go to "View extracted data":

    • The data looks pretty good: some extraneous whitespace, but overall mostly the right Japanese characters.
      Small sample and screenshot:

      身元保証書
      令和 � 月 日
      
      大 使 □
      
      在 日本国 殿
      
      総領事 □
      
      ビ ザ 申 請 人
      ※氏名�必z旅券N~²ルフ±ベット表記w記載しvください。申請人|複数~場合{�ï表者~身分事項²ñO{記入
      
      ~Nÿ申請人名簿²添付しvください。
      
      国 籍
      
      職 業
      

      [Screenshot: CleanShot 2023-12-30 at 17 09 15@2x]

  6. Open the insurance document standard_jyusetsu_20191201.pdf, and go to "View extracted data":

    • The data looks pretty bad: there are some Japanese characters in there, but there are a lot of random glyphs between them.
      Small sample and screenshot:

      ス¿ンÀーù÷害ß険 Ý�Ï項~t®明ÿÝ�Ï項®明þĀ
      イン¿ーネッø募Ö}
      
      契}概�~t®明û注意喚起å報~t®明
      ■s~þ÷�1¹¿ンÀーù÷ûßþ~4t~÷ùン<ë転ÎUけß険 ÿバイ¿ûĀ=1<ë転ÎUけß険
      
      ÿバイ¿û ベスøĀ=1<Á¼~ß険 î�Ï故=1<Á¼~ß険 å~~Ï故={·y»Ý�zÏ項²®nw
      vい~y2tY}_{ßzz¯�{zº1Y}w¿�÷{uÛ~うえ1uÛÕ容{誤º|zいsx²ú�w1z
      w¿���uい2
      
      ■s~þ÷�1tY}{·y»yyv~Õ容²�載wvい»�~w�あº~{³2詳}{tいv�<tY}~w
      zºÿn�ßþ}款û{}ÖĀ={�載wvい~y2_�ûーĀúー¸{�掲載wvい~y~w1ß�{ßxv
      t参照��uいÿhttps://www.au-sonpo.co.jp/Ā2zz1t郵�²希望u¼»|\�aumß»¹¿þー»ン
      ¿ーxtËn��uい2
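
To put a rough number on "looks pretty good" versus "looks pretty bad", the two samples above can be compared with a small heuristic. This is only a sketch: the function names `is_suspicious` and `garble_ratio` are made up for illustration, and the assumption that characters from the Latin-1 Supplement and Latin Extended-A blocks (plus U+FFFD) indicate a broken glyph mapping only holds for text that is expected to be Japanese. It would misfire on German text, and it does not count stray ASCII such as `~` or `}`, so it underestimates the damage.

```python
def is_suspicious(ch: str) -> bool:
    """True for U+FFFD or anything in Latin-1 Supplement / Latin Extended-A.

    Assumption: such characters should not appear in purely Japanese text.
    """
    cp = ord(ch)
    return cp == 0xFFFD or 0x0080 <= cp <= 0x017F


def garble_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look mis-decoded."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_suspicious(c) for c in chars) / len(chars)


# Short excerpts copied from the two extracted-text samples in this report.
good = "身元保証書 国 籍 職 業"
bad = "ス¿ンÀーù÷害ß険 イン¿ーネッø募Ö}"

print(f"visa sample:      {garble_ratio(good):.2f}")
print(f"insurance sample: {garble_ratio(bad):.2f}")
```

A ratio well above zero on a document that should be Japanese-only would flag it for a closer look; any threshold is a judgment call.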
      

Some more version / environment information

  • OS: macOS Sonoma 14.1.2 (23B92)
    • Architecture: Intel
  • Browser: Safari 17.1.2 (19616.2.9.11.12)

Conclusion

I hope this report contains enough information to make the issue clear and to let you (try to) reproduce it.

Please let me know if there's any other information I can contribute to diagnose this.

Based on some web research, I'm afraid this issue might actually be caused by how the PDF and its fonts are encoded; possibly some fonts in the insurance document are not embedded properly or lack a usable mapping from glyphs to Unicode. I still hope there's something we can find out about this.

At the moment, I haven't found any other eDMS software that seems to fit my needs better or that handles Japanese PDFs better. So while I'm still a bit hesitant to commit fully to Docspell, I'm willing to help diagnose and hopefully fix or mitigate this issue :)
