Extracted text is unreadable (random glyphs) for PDFs with Japanese text #2445
Description
Intro
Hi everyone,
thank you very much for creating and working on Docspell!
I've been wanting to get started with digitally organizing my documents for a while now. I found Docspell as one solution that might work well for me, so I've started trying it out.
One thing upfront: my use case is probably a bit unusual, since I have documents in three languages (German, English, and Japanese) that I want to put into my archive/DMS. (Note: while I do have a few documents with mixed languages in the same document, we can ignore that for now and focus only on single-language documents.)
Problem summary
For some PDFs that contain Japanese text, the "extracted text" in Docspell is just some random glyphs.
This is specifically about PDFs that already contain text (I'm pretty sure it's not an OCR issue).
I've also noticed that this problem doesn't occur for all Japanese-text PDFs, but I don't know the cause.
Reproducing the problem on v0.40.0
- Set up Docspell with docker compose, following the docker compose section of the installation manual.
  Since I wanted to use a release version, I deviated from the manual by downloading the docker compose file from tag v0.40.0 instead.
Specifically, I did these steps:
    > cd /tmp
    > pwd
    /tmp
    > mkdir -p docspell/docker/docker-compose
    > cd docspell/docker/docker-compose
    > pwd
    /tmp/docspell/docker/docker-compose
    > wget https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    --2023-12-30 16:45:25--  https://raw.githubusercontent.com/eikek/docspell/v0.40.0/docker/docker-compose/docker-compose.yml
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4740 (4.6K) [text/plain]
    Saving to: ‘docker-compose.yml’
    docker-compose.yml 100%[===================>] 4.63K --.-KB/s in 0s
    2023-12-30 16:45:25 (20.2 MB/s) - ‘docker-compose.yml’ saved [4740/4740]
    > ls -A
    docker-compose.yml
    > docker-compose up -d
    [+] Building 0.0s (0/0) docker:desktop-linux
    [+] Running 8/8
     ✔ Network docker-compose_default                   Created 0.0s
     ✔ Volume "docker-compose_docspell-postgres_data"   Created 0.0s
     ✔ Volume "docker-compose_docspell-solr_data"       Created 0.0s
     ✔ Container docspell-solr                          Started 0.0s
     ✔ Container postgres_db                            Started 0.1s
     ✔ Container docspell-joex                          Started 0.1s
     ✔ Container docspell-restserver                    Started 0.1s
     ✔ Container docspell-consumedir                    Started 0.0s
FYI, here are the containers that were created, and the images that they use:
    > docker ps
    CONTAINER ID   IMAGE                        COMMAND                  CREATED         STATUS                   PORTS                    NAMES
    1045dbbbd81a   docspell/dsc:latest          "dsc -d http://docsp…"   5 minutes ago   Up 5 minutes                                      docspell-consumedir
    9afa894f6acb   docspell/joex:latest         "/opt/joex-entrypoin…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7878->7878/tcp   docspell-joex
    90faedb8cdca   docspell/restserver:latest   "/opt/docspell-rests…"   5 minutes ago   Up 5 minutes (healthy)   0.0.0.0:7880->7880/tcp   docspell-restserver
    d0ea7652c5a4   solr:9                       "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes (healthy)   8983/tcp                 docspell-solr
    adf3c972c07f   postgres:15.2                "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes             5432/tcp                 postgres_db
    > docker image ls
    REPOSITORY            TAG      IMAGE ID       CREATED        SIZE
    solr                  9        3c38c30d646b   13 days ago    593MB
    postgres              15.2     bf700010ce28   8 months ago   379MB
    docspell/dsc          latest   54d581f6c5a1   9 months ago   20.1MB
    docspell/joex         latest   d129a81f07fd   9 months ago   1.99GB
    docspell/restserver   latest   1e700758d41a   9 months ago   336MB
- Open the web UI at http://localhost:7880 and create a new collective + user using the "Sign up!" button.
  - Collective ID: issuerepro
  - User Login: issuerepro
  - Password: issuerepro
- Download two example documents that contain Japanese text:
  - Document that works: 身元保証書 (internet archive backup link), "Letter of Guarantee" for a visa to Japan, from the website of the Japanese Ministry of Foreign Affairs
  - Document that doesn't work: au保険 スタンダード傷害保険 重要事項のご説明 (internet archive backup link), an insurance terms explanation sheet for a bicycle insurance plan by au insurance
- Upload the documents to Docspell via the web UI:
  - Open the dashboard (http://localhost:7880/app/dashboard), and log in with user issuerepro that we created above.
  - Choose the files via drag-and-drop or using the "Select..." button in the drop area on the dashboard.
  - Click "Submit".
  - Wait for processing to finish (should be relatively quick, since no OCR needs to be done).
- Open the visa document 000472926.pdf, and go to "View extracted data".
- Open the insurance document standard_jyusetsu_20191201.pdf, and go to "View extracted data":
  The data looks pretty bad: there are some Japanese characters in there, but there are a lot of random glyphs between them.
Small sample and screenshot:ス¿ンÀーù÷害ß険 Ý�Ï項~t®明ÿÝ�Ï項®明þĀ イン¿ーネッø募Ö} 契}概�~t®明û注意喚起å報~t®明 ■s~þ÷�1¹¿ンÀーù÷ûßþ~4t~÷ùン<ë転ÎUけß険 ÿバイ¿ûĀ=1<ë転ÎUけß険 ÿバイ¿û ベスøĀ=1<Á¼~ß険 î�Ï故=1<Á¼~ß険 å~~Ï故={·y»Ý�zÏ項²®nw vい~y2tY}_{ßzz¯�{zº1Y}w¿�÷{uÛ~うえ1uÛÕ容{誤º|zいsx²ú�w1z w¿���uい2 ■s~þ÷�1tY}{·y»yyv~Õ容²�載wvい»�~w�あº~{³2詳}{tいv�<tY}~w zºÿn�ßþ}款û{}ÖĀ={�載wvい~y2_�ûーĀúー¸{�掲載wvい~y~w1ß�{ßxv t参照��uいÿhttps://www.au-sonpo.co.jp/Ā2zz1t郵�²希望u¼»|\�aumß»¹¿þー»ン ¿ーxtËn��uい2
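To put a rough number on "random glyphs", here is a minimal Python sketch. It is only a heuristic of my own and not something Docspell provides: the function name `garble_ratio` and the two range lists are assumptions chosen for illustration. It counts characters from Unicode blocks that are common in Japanese text against characters from blocks that rarely appear in it (Latin-1 Supplement, Latin Extended-A, and the replacement character).

```python
# Unicode ranges that are plausible in Japanese text (an assumption; extend as needed).
JAPANESE_RANGES = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]

# Ranges that rarely show up in Japanese documents (also an assumption).
SUSPICIOUS_RANGES = [
    (0x0080, 0x00FF),  # Latin-1 Supplement, e.g. ¿ À ù ÷ ß
    (0x0100, 0x017F),  # Latin Extended-A, e.g. Ā
    (0xFFFD, 0xFFFD),  # U+FFFD REPLACEMENT CHARACTER
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls into any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def garble_ratio(text):
    """Share of suspicious characters among Japanese plus suspicious characters."""
    suspicious = sum(1 for ch in text if _in_ranges(ch, SUSPICIOUS_RANGES))
    japanese = sum(1 for ch in text if _in_ranges(ch, JAPANESE_RANGES))
    total = suspicious + japanese
    return suspicious / total if total else 0.0


good = "スタンダード傷害保険"   # title as written in the PDF
bad = "ス¿ンÀーù÷害ß険"        # first ten characters of the extracted text
print(garble_ratio(good), garble_ratio(bad))  # prints: 0.0 0.5
```

ASCII characters are ignored on purpose, since URLs and digits are normal in these documents. A ratio well above zero on a Japanese-only document would hint at the kind of breakage shown in the sample.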
Some more version / environment information
- OS: macOS Sonoma 14.1.2 (23B92)
- Architecture: Intel
- Browser: Safari 17.1.2 (19616.2.9.11.12)
Conclusion
I hope this report contains enough information to make the issue clear and to let you (try to) reproduce it.
Please let me know if there's any other information I can contribute to diagnose this.
Based on some web research, I'm afraid this issue might actually be related to how the PDF (and the fonts in it) are encoded; possibly some fonts are not properly included in the insurance document. I still hope there's something we can find out about this.
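As a small aid for checking that hypothesis, the following Python sketch compares the document title with the start of the extracted text, position by position, and prints the code points of every substituted character. The helper name `substitutions` is hypothetical, and the comparison assumes the two strings line up one-to-one, which happens to hold for this ten-character sample but will not hold in general.

```python
def substitutions(expected, extracted):
    """List (expected_char, extracted_char) pairs that differ at the same position."""
    if len(expected) != len(extracted):
        raise ValueError("strings must have the same length for a positional comparison")
    return [(e, x) for e, x in zip(expected, extracted) if e != x]


expected = "スタンダード傷害保険"   # title as written in the PDF
extracted = "ス¿ンÀーù÷害ß険"      # start of the text shown under "View extracted data"

for e, x in substitutions(expected, extracted):
    # Show both characters with their Unicode code points for easier pattern spotting.
    print(f"{e} U+{ord(e):04X} -> {x} U+{ord(x):04X}")
```

In this short sample, five of the ten characters are replaced, and every replacement is a Latin-1 Supplement character, while the other five come through unchanged. That would be consistent with a per-glyph mapping problem in the PDF's font data rather than with the whole text being decoded in the wrong character set, but it is not proof of either.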
At the moment, I haven't found any other eDMS software that seems to fit my needs better or that handles Japanese PDFs better. So while I'm still a bit hesitant to invest completely into Docspell, I'm willing to try to diagnose and hopefully fix or mitigate this issue :)

