Schemas for data that would be easy to import to PTA systems

I’m working on some tooling (acceptarium) for extracting structured data from scanned receipts and importing it into PTA ledgers. The goal is to be entirely agnostic as to the PTA tool used (ledger-cli, hledger, beancount, whatever). For the whole workflow to be sane, we need some intermediary format for the structured data while we’re extracting it. I’m wondering if there are any semi-standardized schemas for this that would make a better starting point than hand-rolling one from scratch.

I’m not so much concerned with the file format (JSON, TOML, CSV, etc.) as I am with the data schema itself. Is there a standard set of fields and field names used to represent receipts? I know there are some elaborate XML formats for general financial transactions, but they are so complicated that they make no sense as an in-memory data struct schema, and not much sense as an intermediary to relatively simple PTA formats either.

Does this ring any bells?

Bridging the gap between the real world and the precision of PTA (or any other computing tool) is a classic engineering challenge. A very forward-thinking open source AI project, at last.

The first things that come to my mind are vendor, date, currency, total, tax, and line items (the really hard part).
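To make that field list concrete, here is a rough sketch of what an in-memory struct could look like. This is not an existing standard; the class names (Receipt, LineItem) and the extra fields (description, quantity) are just my assumptions for illustration. Amounts use Decimal rather than float so they survive the trip into a journal without rounding noise.

```python
import datetime
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    # Hypothetical shape for one receipt line; not from any standard.
    description: str
    amount: Decimal                    # extended price for this line
    quantity: Decimal = Decimal("1")


@dataclass
class Receipt:
    # Fields follow the list above: vendor, date, currency, total, tax, items.
    vendor: str
    date: datetime.date
    currency: str                      # e.g. an ISO 4217 code such as "EUR"
    total: Decimal
    tax: Decimal = Decimal("0")
    items: List[LineItem] = field(default_factory=list)


receipt = Receipt(
    vendor="Corner Cafe",
    date=datetime.date(2024, 5, 17),
    currency="EUR",
    total=Decimal("5.13"),
    tax=Decimal("0.38"),
    items=[
        LineItem("Espresso", Decimal("3.50")),
        LineItem("Croissant", Decimal("1.25")),
    ],
)
```

A flat struct like this serializes to JSON or TOML without much ceremony, and each PTA backend can decide on its own how to map items to postings.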

Validation: if you can master the line items part, you can easily verify that sum(items) + tax == total before the data ever touches your journals.
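A minimal sketch of that check, assuming the item amounts are tax-exclusive (receipts with tax-inclusive prices would need a different rule). The function name receipt_balances and the tolerance parameter are my own inventions; the tolerance is there because OCR'd receipts sometimes differ by a cent of rounding.

```python
from decimal import Decimal
from typing import Iterable


def receipt_balances(
    item_amounts: Iterable[Decimal],
    tax: Decimal,
    total: Decimal,
    tolerance: Decimal = Decimal("0"),
) -> bool:
    """Return True if sum(items) + tax matches total within tolerance.

    Assumes item amounts do NOT already include tax.
    """
    # Start the sum at Decimal("0") so an empty item list still yields a Decimal.
    subtotal = sum(item_amounts, Decimal("0"))
    return abs(subtotal + tax - total) <= tolerance


items = [Decimal("3.50"), Decimal("1.25")]
ok = receipt_balances(items, tax=Decimal("0.38"), total=Decimal("5.13"))
off_by_a_cent = receipt_balances(items, tax=Decimal("0.38"), total=Decimal("5.14"))
```

Anything that fails the check can be kicked back for manual review instead of being written to the journal.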