I've been meaning to take a stab at automating a simple process: snap a picture of any receipt and log it to an expenses spreadsheet. Clawshier is the OpenClaw skill that came out of it. It's open source on GitHub and available on ClawHub.
## Is it any good?
Far from perfect, but a fun project. Image recognition is currently the weakest link. Two providers are supported: OpenAI (default) and Ollama.
OpenAI sometimes refuses to process the image, but I simply retry after a minute and that's it. I'd say 60% of the time, it works every time. Extraction quality is quite good for establishment name, expense category, totals and taxes. Thermal receipt pictures with poor lighting can be very tricky, on top of the cryptic text and acronyms in them. I traced the image recognition step to between 5 and 12 seconds, which is good.
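For reference, the retry is nothing fancy. Here's a minimal sketch of the pattern using the OpenAI Python SDK (the model name, prompt, and the naive refusal check are illustrative, not the skill's exact code):

```python
import base64
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt, not the skill's exact wording.
PROMPT = ("Extract establishment name, expense category, date, "
          "total and taxes from this receipt as JSON.")

def extract_receipt(image_path: str, retries: int = 3, wait_s: int = 60) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    for _ in range(retries):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        text = resp.choices[0].message.content or ""
        # Naive refusal check; good enough to trigger the wait-and-retry.
        if "sorry" not in text.lower() and "can't" not in text.lower():
            return text
        time.sleep(wait_s)  # refusal: wait a minute, then try again
    raise RuntimeError("model kept refusing the image")
```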
For Ollama I tried llama3.2-vision:11b because of my available memory (it fits in my 16GB of RAM), but it was sadly unusable: infinite loops (extracting one line thousands of times), hallucinations, and excessive processing times. 5-minute timeouts sometimes worked, but it was essentially a coin toss whether it would extract "something" or simply hang/crash.
I ran benchmarks and had to downsize the image to 512px to consistently avoid those timeouts, but that degraded the image too much to be usable. Perhaps llama3.2-vision:90b would perform better... Wish I had a Ryzen Strix Halo with that juicy RAM.
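For the curious, the benchmark loop boiled down to something like this sketch with Pillow and the ollama Python client (assuming the client forwards httpx's timeout; the 512px cap and 5-minute timeout are the values from my tests):

```python
import ollama
from PIL import Image

def downscale(src: str, dst: str, max_px: int = 512) -> str:
    """Shrink the receipt photo so the model stops timing out."""
    img = Image.open(src)
    img.thumbnail((max_px, max_px))  # preserves aspect ratio
    img.save(dst, "JPEG")
    return dst

client = ollama.Client(timeout=300)  # 5-minute cap; it hangs beyond this
resp = client.chat(
    model="llama3.2-vision:11b",
    messages=[{
        "role": "user",
        "content": "Extract the line items from this receipt as JSON.",
        "images": [downscale("receipt.jpg", "receipt_512.jpg")],
    }],
)
print(resp["message"]["content"])
```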
## What does it look like?
Send your bot a picture of any receipt. There's currently an issue with dates: the receipts I deal with mix DD-MM and MM-DD formats, so image extraction sometimes swaps them. The current workaround is to specify the date explicitly, but it's likely not an issue if you only deal with MM-DD-YYYY dates.
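Spelled out in code, the ambiguity looks something like this hypothetical helper (not part of the skill, just an illustration):

```python
from datetime import date

def parse_receipt_date(raw: str, year: int, fallback: date | None = None) -> date:
    """Refuse to guess when both fields could be a month."""
    first, second = (int(p) for p in raw.split("-"))  # e.g. "07-03"
    if first <= 12 and second <= 12 and first != second:
        if fallback is None:
            raise ValueError(f"{raw!r} is ambiguous; please specify the date")
        return fallback            # the user-supplied date wins
    if first > 12:                 # first field must be the day -> DD-MM
        return date(year, second, first)
    return date(year, first, second)  # otherwise treat as MM-DD
```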
Then the spreadsheets are updated with the new expense record. A few different sheets are created/updated automatically with each pipeline execution (a sketch of the append step follows the list):
- Summary
  - Aggregated totals and charts
- Invoice Archive Breakdown
  - A breakdown of each invoice
  - Currently very flaky data from image recognition, since these are the individual line items that are tricky to read/decode even for humans
- Monthly expense sheet
  - Expenses aggregated in their corresponding MM-YY date sheet
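The append step itself is simple; here's a sketch using gspread as a stand-in (the skill's actual Sheets integration may differ), creating the MM-YY tab on demand:

```python
import datetime
import gspread

def append_expense(record: dict) -> None:
    gc = gspread.service_account()          # service-account credentials
    book = gc.open("Expenses")
    tab = record["date"].strftime("%m-%y")  # e.g. "03-25"
    try:
        ws = book.worksheet(tab)
    except gspread.WorksheetNotFound:       # first expense of the month
        ws = book.add_worksheet(title=tab, rows=1000, cols=10)
    ws.append_row([
        record["date"].isoformat(),
        record["establishment"],
        record["category"],
        record["total"],
        record["taxes"],
    ])

append_expense({
    "date": datetime.date(2025, 3, 7),
    "establishment": "Cafe",
    "category": "Dining",
    "total": 5.00,
    "taxes": 0.50,
})
```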
## Conclusions
Now I'm the guy taking a picture of his receipts everywhere he goes. I'm not a huge fan of sending these to OpenAI and wish I could do more of this with local models, but it might help me track the "small expenses" I struggle to keep a close eye on.
Also worth noting I'm not a huge fan of the DX for OpenClaw skills either; it was painful at times... I hope it's not a "skill issue"
In terms of usage I'm expecting a couple of dollars' worth of OpenAI credits per month; we'll see how it goes. Again, local models could reduce this substantially, since image recognition takes up the bulk of the processing.
There are plenty of ways this could be improved (e.g. PaddleOCR/Tesseract, which would also avoid sending my pics to Sam). If Clawshier sounds interesting to you, take a peek at the GitHub repo. Pura Vida!


## Top comments (9)
Make sure to upload these receipts
🤣
Really practical project. Receipt OCR is one of those deceptively hard problems -- thermal paper, inconsistent date formats, cryptic abbreviations, and every restaurant has its own receipt layout.
The validation step before persisting to Google Sheets is the right call. I've built similar document extraction pipelines and the pattern of OCR -> structure -> validate -> persist with a fail-safe checkpoint is basically the only way to keep your data clean without constant manual review.
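Something as lightweight as a pydantic schema covers that checkpoint; a sketch with illustrative field names:

```python
from pydantic import BaseModel, field_validator

class Expense(BaseModel):
    establishment: str
    category: str
    total: float
    taxes: float

    @field_validator("total", "taxes")
    @classmethod
    def non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("amounts must be non-negative")
        return v

# Expense.model_validate(extracted) raises on malformed data,
# so nothing bad ever reaches the sheet.
```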
For the local model angle, have you looked at Qwen2-VL or Florence-2? Both handle document/receipt extraction surprisingly well and are lighter than llama3.2-vision. PaddleOCR as a preprocessing step before the LLM can also boost accuracy significantly on low-contrast thermal prints.
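Roughly like this (PaddleOCR 2.x-style API; the contrast values are just a starting point):

```python
from paddleocr import PaddleOCR
from PIL import Image, ImageEnhance, ImageOps

def preprocess_and_ocr(path: str) -> str:
    img = Image.open(path).convert("L")            # grayscale
    img = ImageOps.autocontrast(img)               # stretch faded thermal print
    img = ImageEnhance.Contrast(img).enhance(2.0)  # then boost contrast further
    img.save("boosted.png")

    ocr = PaddleOCR(lang="en")
    result = ocr.ocr("boosted.png")
    # result: one list per page of [bbox, (text, confidence)] entries
    return "\n".join(entry[1][0] for entry in result[0])
```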
What a great post!
Honest question: what happens to the 40% that fail? In data pipelines, 60% reliability means the pipeline is broken, not "mostly working." You end up building a second system just to catch and fix the errors from the first one.
Do you have a manual review step for the receipts it botches, or do you just accept the data loss? Because that's the real design decision here, not which vision model to use.
Hey, that was more of a joke than actual statistics on success/failures. Did you open the link on that quote?
In reality I do sometimes encounter issues where OCR wasn't successful, though that's understandable since most receipt pictures I upload are handheld, with poor lighting, and sometimes crumpled up with partially blurred-out text.
Yes, the skill itself was built as an orchestration of smaller internal skills, or steps, executed one at a time. Here's a bit more detail on how that works:
The idea behind this was for the skill to be easy to retry and to fail safely, with the validation step before persisting. You also get a summary of what was recorded as a response, so you can easily check whether the receipt was interpreted and persisted correctly (see this success response below).
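Stripped of the OpenClaw plumbing, the orchestration is conceptually just a list of steps run in order, where any failure halts everything before persistence (a toy sketch with stand-in steps, not the skill's actual code):

```python
from typing import Callable

Step = Callable[[dict], dict]

def run_pipeline(ctx: dict, steps: list[Step]) -> dict:
    for step in steps:
        ctx = step(ctx)  # a raising step halts everything downstream
    return ctx

# Toy stand-ins for the real steps:
def ocr(ctx):       ctx["text"] = "CAFE 4.50 TAX 0.50 TOTAL 5.00"; return ctx
def structure(ctx): ctx["total"], ctx["taxes"] = 5.00, 0.50; return ctx
def validate(ctx):
    if ctx["taxes"] > ctx["total"]:
        raise ValueError("taxes exceed total; refusing to persist")
    return ctx
def persist(ctx):   print("appended:", ctx["total"]); return ctx

summary = run_pipeline({"image": "receipt.jpg"},
                       [ocr, structure, validate, persist])
```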
This means I can quickly tell whether the total amount and category are incorrect (the most important details I care about at the moment). This is an example of the failure I was referring to during the OpenAI OCR step:
You can see what GPT 5.4 told me when I requested more details about these refusals:
Overall, a retry after a minute or two works... So I guess that's just part of the non-deterministic nature of these models 🤷🏻‍♂️
I do have to note I'm very happy with the results so far. Despite occasional hiccups I see very good results from the OpenAI OCR, but I'd definitely prefer to use a local model or OCR process (like those mentioned in the post) instead, if possible.
Really practical writeup. The OCR -> structuring -> validation -> persist pipeline with a fail-safe checkpoint is exactly the right pattern for document extraction. On the Ollama side: did you try Qwen2-VL instead of llama3.2-vision? It's significantly lighter (7B/9B variants available) and handles receipt/document extraction surprisingly well. For thermal paper specifically, PaddleOCR as a preprocessing step before the LLM can handle the low-contrast text much better than downscaling to 512px. Either way, the architecture is sound. Well done.
Your monthly expense visualizations effectively highlight category distributions, but local receipt models face hardware constraints in real-time OCR processing compared to cloud APIs. How are you optimizing model quantization for mobile deployment while maintaining accuracy on skewed receipt images?