Extracted text is unreadable (random glyphs) for PDFs with Japanese text

## Intro

Hi everyone,

Thank you very much for creating and working on Docspell!

I've been wanting to get started with digitally organizing my documents for a while now. I found Docspell as one solution that might work well for me, so I've started trying it out.

One thing upfront: my use case is probably a bit unusual, since I have documents in three languages (German, English, and Japanese) that I want to put into my archive/DMS. (Note: while I do have a few documents with mixed languages in the same document, we can ignore that for now and focus only on single-language documents.)

## Problem summary

For _some_ PDFs that contain Japanese text, the "extracted text" in Docspell is just some random glyphs.
This is specifically about PDFs that already contain text (I'm pretty sure it's not an OCR issue).
I've also noticed that this problem doesn't occur for _all_ Japanese-text PDFs, but I don't know the cause.
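To tell affected documents apart without eyeballing each one, a rough heuristic might help. The following is a minimal sketch, not anything Docspell provides: the function names and the choice of Unicode ranges are my own assumptions. It counts characters from the Latin-1 Supplement and Latin Extended-A blocks (where most of the stray glyphs in my samples come from) relative to Japanese characters. The two test strings are short excerpts from the extracted text shown in the reproduction steps.

```python
import unicodedata


def is_japanese(ch: str) -> bool:
    """True for kana, common kanji, and full-width forms."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF      # Hiragana and Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xFF00 <= cp <= 0xFFEF   # Halfwidth and Fullwidth Forms
    )


def is_suspicious(ch: str) -> bool:
    """True for Latin-1 Supplement (from U+00A0) and Latin Extended-A."""
    return 0x00A0 <= ord(ch) <= 0x017F


def garble_ratio(text: str) -> float:
    """Share of suspicious characters among suspicious + Japanese characters."""
    # Normalize so that decomposed accents (e.g. A + combining grave) count once.
    text = unicodedata.normalize("NFC", text)
    japanese = sum(is_japanese(ch) for ch in text)
    suspicious = sum(is_suspicious(ch) for ch in text)
    total = japanese + suspicious
    return suspicious / total if total else 0.0


print(garble_ratio("身元保証書"))        # 0.0
print(garble_ratio("ス¿ンÀーù÷害ß険"))  # 0.5
```

This only makes sense for documents that are known to be Japanese: legitimate German text with characters like `ä` or `ß` would also count as suspicious. It also ignores wrong ASCII substitutions (such as `~` or `z` appearing in place of kana), so it likely underestimates the damage.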

## Reproducing the problem on v0.40.0

1. Set up Docspell with docker compose, following the [docker compose section](https://docspell.org/docs/install/docker/#docker-compose) of the installation manual.
  Since I wanted to use a release version, I deviated from the manual by downloading the [docker compose file from tag v0.40.0](https://github.com/eikek/docspell/blob/v0.40.0/docker/docker-compose/docker-compose.yml) instead.
  Specifically, I did these steps:

    ```
    > cd /tmp
    > pwd
    /tmp
    > mkdir -p docspell/docker/docker-compose
    > cd docspell/docker/docker-compose
    > pwd
    /tmp/docspell/docker/docker-compose
    > wget https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    --2023-12-30 16:45:25--  https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4740 (4.6K) [text/plain]
    Saving to: ‘docker-compose.yml’

    docker-compose.yml  100%[===================>]   4.63K  --.-KB/s    in 0s

    2023-12-30 16:45:25 (20.2 MB/s) - ‘docker-compose.yml’ saved [4740/4740]

    > ls -A
    docker-compose.yml
    > docker-compose up -d
    [+] Building 0.0s (0/0)                                    docker:desktop-linux
    [+] Running 8/8
    ✔ Network docker-compose_default                  Created                 0.0s
    ✔ Volume "docker-compose_docspell-postgres_data"  Created                 0.0s
    ✔ Volume "docker-compose_docspell-solr_data"      Created                 0.0s
    ✔ Container docspell-solr                         Started                 0.0s
    ✔ Container postgres_db                           Started                 0.1s
    ✔ Container docspell-joex                         Started                 0.1s
    ✔ Container docspell-restserver                   Started                 0.1s
    ✔ Container docspell-consumedir                   Started                 0.0s
    ```

    * FYI, here are the containers that were created, and the images that they use:

        ```
        > docker ps
        CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS                   PORTS                    NAMES
        1045dbbbd81a   docspell/dsc:latest          "dsc -d http://docsp…"   5 minutes ago   Up 5 minutes                                      docspell-consumedir
        9afa894f6acb   docspell/joex:latest         "/opt/joex-entrypoin…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7878->7878/tcp   docspell-joex
        90faedb8cdca   docspell/restserver:latest   "/opt/docspell-rests…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7880->7880/tcp   docspell-restserver
        d0ea7652c5a4   solr:9                       "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes (healthy)   8983/tcp                 docspell-solr
        adf3c972c07f   postgres:15.2                "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes             5432/tcp                 postgres_db


        > docker image ls
        REPOSITORY           TAG     IMAGE ID       CREATED        SIZE
        solr                 9       3c38c30d646b   13 days ago    593MB
        postgres             15.2    bf700010ce28   8 months ago   379MB
        docspell/dsc         latest  54d581f6c5a1   9 months ago   20.1MB
        docspell/joex        latest  d129a81f07fd   9 months ago   1.99GB
        docspell/restserver  latest  1e700758d41a   9 months ago   336MB
        ```

1. Open the web UI at `http://localhost:7880` and create a new collective + user using the "Sign up!" button.

    * Collective ID: `issuerepro`
      User Login: `issuerepro`
      Password: `issuerepro`

1. Download two example documents that contain Japanese text:

    * Document that works: [身元保証書](https://www.mofa.go.jp/mofaj/files/000472926.pdf) ([Internet Archive backup link](https://web.archive.org/web/20230531002317/https://www.mofa.go.jp/mofaj/files/000472926.pdf)), a "Letter of Guarantee" form for a visa to Japan, from the website of the Japanese Ministry of Foreign Affairs
    * Document that doesn't work: [au保険 スタンダード傷害保険 重要事項のご説明](https://www.au-sonpo.co.jp/pc/common/pdf/standard_shogai/standard_jyusetsu_20191201.pdf) ([Internet Archive backup link](https://web.archive.org/web/20220522083123/https://www.au-sonpo.co.jp/pc/common/pdf/standard_shogai/standard_jyusetsu_20191201.pdf)), an insurance terms explanation sheet for a bicycle insurance plan by au insurance

1. Upload the documents to Docspell via the web UI:

    * Open the dashboard (`http://localhost:7880/app/dashboard`), and log in with user `issuerepro` that we created above.
    * Choose the files via drag-and-drop or using the "Select..." button in the drop area on the dashboard.
    * Click "Submit".
    * Wait for processing to finish (should be relatively quick, since no OCR needs to be done).

1. Open the visa document `000472926.pdf`, and go to "View extracted data":

    * ![CleanShot 2023-12-30 at 17 07 01@2x](https://github.com/eikek/docspell/assets/1099818/02cafca6-0cf3-4600-9b16-ec8d5ca38582)
    * The data looks pretty good: some extraneous whitespace, but overall mostly the right Japanese characters.
      Small sample and screenshot:

        ```
        身元保証書
        令和  月 日

        大 使 □

        在 日本国 殿

        総領事 □

        ビ ザ 申 請 人
        ※氏名必z旅券N~²ルフ±ベット表記w記載しvください。申請人|複数~場合{ï表者~身分事項²ñO{記入

        ~Nÿ申請人名簿²添付しvください。

        国 籍

        職 業
        ```

      ![CleanShot 2023-12-30 at 17 09 15@2x](https://github.com/eikek/docspell/assets/1099818/0f49e03a-2a26-4466-9db8-91ecb0c18531)

1. Open the insurance document `standard_jyusetsu_20191201.pdf`, and go to "View extracted data":

    * The data looks pretty bad: there are some Japanese characters in there, but there are a lot of random glyphs between them.
      Small sample and screenshot:

        ```
        ス¿ンÀーù÷害ß険 ÝÏ項~t®明ÿÝÏ項®明þĀ
        イン¿ーネッø募Ö}

        契}概~t®明û注意喚起å報~t®明
        ■s~þ÷1¹¿ンÀーù÷ûßþ~４t~÷ùン<ë転ÎUけß険 ÿバイ¿ûĀ=1<ë転ÎUけß険

        ÿバイ¿û ベスøĀ=1<Á¼~ß険 îÏ故=1<Á¼~ß険 å~~Ï故={·y»ÝzÏ項²®nw
        vい~y2tY}_{ßzz¯{zº1Y}w¿÷{uÛ~うえ1uÛÕ容{誤º|zいsx²úw1z
        w¿uい2

        ■s~þ÷1tY}{·y»yyv~Õ容²載wvい»~wあº~{³2詳}{tいv<tY}~w
        zºÿnßþ}款û{}ÖĀ={載wvい~y2_ûーĀúー¸{掲載wvい~y~w1ß{ßxv
        t参照uいÿhttps://www.au-sonpo.co.jp/Ā2zz1t郵²希望u¼»|\ａｕmß»¹¿þー»ン
        ¿ーxtËnuい2
        ```

        <scrsh3>


## Some more version / environment information

* OS: macOS Sonoma 14.1.2 (23B92)
  * Architecture: Intel
* Browser: Safari 17.1.2 (19616.2.9.11.12)

## Conclusion

I hope this report contains enough information to make the issue clear and to let you (try to) reproduce it.

Please let me know if there's any other information I can contribute to diagnose this.

Based on some web research, I'm afraid this issue might actually be related to how the PDF (and the fonts in it) are encoded; possibly some fonts are not properly included in the insurance document. I still hope there's something we can find out about this.
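To illustrate the kind of failure I suspect, here is a toy model. It is not Docspell's code or the code of any PDF library, and all names and code values in it are invented. The idea: a PDF page stores text as font-specific character codes, and an extractor needs a code-to-Unicode table (in PDF terms, the font's ToUnicode CMap) to turn them back into text. If entries are missing and the extractor falls back to treating the raw code as a Unicode code point, the result is a mix of correct characters and unrelated glyphs.

```python
# Hypothetical character codes as they might appear in a page's content stream.
# The values are invented for illustration only.
CONTENT_CODES = [0x01, 0xBF, 0x02, 0xC0, 0x03]

# A complete code -> Unicode table, as a well-formed font would provide.
FULL_TO_UNICODE = {0x01: "ス", 0xBF: "タ", 0x02: "ン", 0xC0: "ダ", 0x03: "ー"}

# The same table with two entries missing.
PARTIAL_TO_UNICODE = {0x01: "ス", 0x02: "ン", 0x03: "ー"}


def extract_text(codes, to_unicode):
    """Map codes to text; fall back to chr(code) when a mapping is missing."""
    return "".join(to_unicode.get(code, chr(code)) for code in codes)


print(extract_text(CONTENT_CODES, FULL_TO_UNICODE))     # スタンダー
print(extract_text(CONTENT_CODES, PARTIAL_TO_UNICODE))  # ス¿ンÀー
```

I picked the code values so that the second output resembles the start of the insurance sample. That is only to make the illustration recognizable; I have not verified that this is what actually happens inside that PDF.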

At the moment, I haven't found any other eDMS software that seems to fit my needs better or that handles Japanese PDFs better. So while I'm still a bit hesitant to commit fully to Docspell, I'm willing to help diagnose and hopefully fix or mitigate this issue :)
