I’m working on some tooling for extracting structured data from scanned receipts for import into PTA ledgers. The goal is to be entirely agnostic as to the PTA tool used (ledger-cli, hledger, beancount, whatever). For the whole workflow to be sane we need some intermediary format for the structured data while we’re extracting it. I’m wondering if there are any semi-standardized schemas for this that would make life easier to start with, rather than hand rolling one from scratch.
I’m not so much concerned with a file format (e.g. JSON, TOML, CSV, etc) as I am with the data schema itself. Is there a standard set of fields and field names used to represent receipts? I know there are some elaborate XML formats for general financial transactions but they are so complicated they don’t make any sense for an in-memory data struct schema and don’t really make sense as an intermediary to relatively simple PTA formats either.
Bridging the gap between the real world and the precision of PTA (or any other computing tool) is a classic engineering challenge. A very forward-thinking open source AI project at last.
First things that come to my mind are vendor, date, currency, total, tax, and line items (the really hard part)?
Validation: you can easily verify that sum(items) + tax == total before it ever touches your journals, if you can master the line items part.
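A minimal sketch of that check in Python, assuming a simple in-memory receipt dict (the field names here are illustrative, not any standard):

```python
# Reconcile line items + tax against the printed total before export.
# Amounts are kept as strings and parsed with Decimal to avoid float drift.
from decimal import Decimal

def validate_receipt(receipt):
    """Return True when line item amounts plus tax equal the total."""
    item_sum = sum(Decimal(item["amount"]) for item in receipt["items"])
    return item_sum + Decimal(receipt["tax"]) == Decimal(receipt["total"])

receipt = {
    "vendor": "Example Groceries",   # hypothetical sample data
    "date": "2024-05-01",
    "currency": "USD",
    "items": [
        {"description": "apples", "qty": 2, "amount": "3.50"},
        {"description": "bread", "qty": 1, "amount": "2.25"},
    ],
    "tax": "0.46",
    "total": "6.21",
}
print(validate_receipt(receipt))  # → True
```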
Easy there mate! This isn't primarily an 'AI project'. Besides the fact that I'm adamantly opposed to any iteration of ML being conflated with 'intelligence', the use of vision models instead of traditional OCR techniques, and of LLMs instead of regex or other pattern matching, are both completely optional. Also the project is hand coded with love, not vibe coded, so let's not label it as primarily an 'AI' project even if the non-deterministic elements might prove to be big time savers.
Thanks, but I have those fields already covered in the current prototype (except for tax, which is a rabbit trail). Line items aren't that hard. It turns out the least reliable bits to extract tend to be the things that vary more, such as any form of unique transaction ID, which differs wildly by vendor; vision models tend to mix up which of several fields is the most likely candidate for the actual transaction code.
The items list seems to work pretty well for a pretty wide range of vendors. Mapping these to people's desired account list might actually prove to be harder than identifying the list of items. And of course the long tail of edge cases, such as "item 1, qty 2, total $16.00" followed by "discount code, qty 1, $-4.00", or other weird things like multiple different tax rates on a single item, does tend to throw spurious results.
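One way to keep that discount edge case from breaking reconciliation is to model discounts as ordinary line items with negative amounts, so the usual sum check still holds. A sketch (field names are hypothetical, not from the project):

```python
# Treat a "discount code" row as a regular item with a negative amount,
# so the items still sum cleanly toward the receipt total.
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class LineItem:
    description: str
    qty: int
    amount: Decimal  # negative for discounts, refunds, coupons

items = [
    LineItem("item 1", 2, Decimal("16.00")),
    LineItem("discount code", 1, Decimal("-4.00")),
]
total = sum(i.amount for i in items)
print(total)  # → 12.00
```

This says nothing about the multiple-tax-rates-per-item case, which probably needs a per-item tax field or a separate tax breakdown list.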
Speaking of taxes though — it would be useful to hear from people who record sales taxes separately from other expenses. I opened this issue upstream for gathering feedback if anybody wants to pitch in.
Validation will mostly happen by exporting/printing to a PTA ledger format and using the actual PTA tool to do whatever validation the user chooses to run. We won't be touching journals that have other data in them anyway.
Back to the topic at hand, just adding some notes on existing formats:
We could probably use a subset of the fields described in OFX to get some level of standardized data schema, but OFX does not support splits at all.
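For reference, the OFX transaction element covers roughly this field subset (values here are made up, and the `FITID` is a hypothetical placeholder):

```xml
<STMTTRN>
  <TRNTYPE>DEBIT</TRNTYPE>
  <DTPOSTED>20240501</DTPOSTED>
  <TRNAMT>-12.96</TRNAMT>
  <FITID>20240501-0001</FITID>
  <NAME>Example Groceries</NAME>
  <MEMO>imported from scanned receipt</MEMO>
</STMTTRN>
```

Note there is nowhere in this structure to hang per-item detail, which is the splits limitation mentioned above.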
On the other hand, QIF supports splits, but is more a description of an obtuse plain text format than a data schema.
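To illustrate the obtuseness: a QIF record is a series of lines keyed by a single leading letter (`D` date, `T` amount, `P` payee, `S`/`$` split category and amount, `^` end of record). A split transaction looks roughly like this (account names are illustrative, and QIF date formats vary between implementations):

```qif
!Type:Cash
D2024-05-01
T-12.96
PExample Groceries
SExpenses:Groceries
$-12.00
SExpenses:Taxes:Sales
$-0.96
^
```

So the split *concept* is there, but the field semantics live in the line prefixes rather than in any named schema.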