A declarative idempotent rule-based beancount transaction import engine - beanhub-import

fangpenlin · May 9, 2024, 11:18pm

Hi Plaintext Accounting fellows,

I love beancount and use it daily, but if there's one thing I wish were easier, it is importing transactions from CSV files. There are issues to deal with, such as duplicate transactions. Once transactions are added, applying changes to all of them is hard. Moving transactions around to different files is not easy either. There's also a lack of features like getting data from multiple sources and merging it into a single transaction. Of course, as a software engineer myself, I can write code and modify existing importers to meet my own needs. But I always wonder if there's a better way to do it.

With that in mind, I spent the past few days building a whole new beancount importer from the ground up. While it's not 100% done yet, it's already at a point I am happy with. I can now easily import transactions from CSV files with the new tool, a job that used to require a complex custom Python script. I open-sourced the project from the very beginning. Now you can find it at

This project is still in its early stages and subject to major changes. If you want to find out how it works, you can read the how-it-works section in the readme.

You can also clone the demo repo to try it out yourself:

Currently, the extraction of transactions relies on another library, beanhub-extract:

github dot com / LaunchPlatform/beanhub-extract (new users cannot paste more than two links )

For my own use case, and as a proof of concept, it only supports Mercury Bank CSV files for now. However, I will add support for more banks' CSV files in the future. I will also make beanhub-import able to support third-party extractors.

Besides supporting more banks, I am also working on making it possible to generate a transaction from multiple sources. Here's an example of the merge rule I envisioned:

merges:
- match:
  - name: mercury
    extractor:
      equals: "mercury"
    desc: "Credit card payment"
    merge_key: "{{ date }}:{{ amount }}"
  - name: chase
    extractor:
      equals: "chase"
    desc: "Payment late fee"
    merge_key: "{{ post_date }}:{{ amount }}"
  actions:
    - txn:
        narration: "Paid credit card"
        postings:
          - account: Expenses:CreditCardPayment
            amount:
              number: "{{ -mercury.amount }}"
              currency: "{{ mercury.currency | default('USD', true) }}"
          - account: Expenses:LateFee
            amount:
              number: "{{ -chase.amount }}"
              currency: "{{ chase.currency | default('USD', true) }}"
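
To make the intended join concrete, here is a minimal Python sketch of how rows from two extractors could be paired by a rendered merge key. It is not beanhub-import's implementation (the merge rule above is only a proposal); the `Record` type, `merge_key`, `merge_by_key`, and the sample rows are all hypothetical, and both sources use a plain `date` field here for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for one row produced by an extractor."""
    extractor: str
    date: str
    amount: Decimal
    desc: str


def merge_key(record: Record) -> str:
    # Plays the role of the "{{ date }}:{{ amount }}" template in the rule.
    return f"{record.date}:{record.amount}"


def merge_by_key(left: list, right: list) -> list:
    """Pair each left record with one unused right record sharing its key."""
    index = defaultdict(list)
    for rec in right:
        index[merge_key(rec)].append(rec)
    pairs = []
    for rec in left:
        candidates = index.get(merge_key(rec))
        if candidates:
            # Consume the match so it cannot be paired twice.
            pairs.append((rec, candidates.pop(0)))
    return pairs


# Made-up sample rows, only for illustration.
mercury_rows = [
    Record("mercury", "2024-05-01", Decimal("-100.00"), "Credit card payment"),
]
chase_rows = [
    Record("chase", "2024-05-01", Decimal("-100.00"), "Payment late fee"),
    Record("chase", "2024-05-02", Decimal("-5.00"), "Something else"),
]
pairs = merge_by_key(mercury_rows, chase_rows)
```

Each resulting pair would then feed one `txn` action, with `mercury` and `chase` bound to the two matched rows. Rows without a partner are left out of `pairs` and could fall through to the ordinary single-source rules.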

Let me know what you think. Any suggestions are welcome.

Best,
Fang-Pen Lin.

simonmic · May 10, 2024, 12:53am

Nice! I wonder how it compares to existing beancount importers (I thought there were a few), or hledger's import rules.

fangpenlin · May 10, 2024, 3:57am

Beancount's current importer approach mostly relies on a single importer class that does extraction and transaction generation at the same time.

For example, the extract method directly takes a file, reads CSV data from it, and generates entry data immediately as the return value:

github.com/beancount/beancount/blob/6475e79ca9aeb92563244e4b62af368e1360d83c/examples/ingest/office/importers/utrade/utrade_csv.py#L58-L167

def extract(self, file):
    # Open the CSV file and create directives.
    entries = []
    index = 0
    with open(file.name) as infile:
        for index, row in enumerate(csv.DictReader(infile)):
            meta = data.new_metadata(file.name, index)
            date = parse(row['DATE']).date()
            rtype = row['TYPE']
            link = "ut{0[REF #]}".format(row)
            desc = "({0[TYPE]}) {0[DESCRIPTION]}".format(row)
            units = amount.Amount(D(row['AMOUNT']), self.currency)
            fees = amount.Amount(D(row['FEES']), self.currency)
            other = amount.add(units, fees)

            if rtype == 'XFER':
                assert fees.number == ZERO
                txn = data.Transaction(
                    meta, date, self.FLAG, None, desc, data.EMPTY_SET, {link}, [
                        data.Posting(self.account_cash, units, None, None, None,

This file has been truncated.

As a result, it brings two problems:

1. I cannot reuse the same importer to generate the transactions I want; I can only modify the transaction generation logic in place.
2. We miss the opportunity to match with other transactions and join the results into a single output transaction in the ledger.

I haven't used hledger, but I glanced at the documentation. Its rules appear to be more of a set of imperative instructions for parsing the CSV file and generating transactions accordingly. In a way, it works similarly to Beancount's importer.

Beanhub-import takes more of a data pipeline approach. Here's the flow diagram from the how-it-works section:

I treat all raw, generated, and existing beancount transactions as units in the pipeline. Each step is more like a filter, transformer, or join, so it's a more functional approach. I also parse the beancount file into a syntax tree and manipulate it directly, based on what needs to change after comparing the existing transactions in the file with the generated ones.
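
A rough Python sketch of that filter/transform style, for anyone who prefers code to diagrams. This only illustrates the idea and is not beanhub-import's actual code: the `RawTxn`, `Rule`, and `GeneratedTxn` types, the `generate` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class RawTxn:
    """Hypothetical raw row as it comes out of an extractor."""
    source: str
    lineno: int
    date: str
    desc: str
    amount: Decimal


@dataclass(frozen=True)
class Rule:
    """Hypothetical declarative rule: a predicate plus what to generate."""
    matches: Callable[[RawTxn], bool]
    narration: str
    expense_account: str


@dataclass(frozen=True)
class GeneratedTxn:
    """Hypothetical generated transaction, identified by import_id."""
    import_id: str
    date: str
    narration: str
    postings: tuple


def generate(raw_txns: Iterable[RawTxn], rules: list,
             asset_account: str) -> Iterator[GeneratedTxn]:
    """Filter raw rows through the rules and transform the matches."""
    for raw in raw_txns:
        rule = next((r for r in rules if r.matches(raw)), None)
        if rule is None:
            continue  # unmatched rows are simply skipped in this sketch
        yield GeneratedTxn(
            import_id=f"{raw.source}:{raw.lineno}",
            date=raw.date,
            narration=rule.narration,
            postings=(
                (asset_account, raw.amount),
                (rule.expense_account, -raw.amount),
            ),
        )


# Made-up sample data, only for illustration.
raw_rows = [
    RawTxn("import-data/mercury/2024.csv", 1, "2024-01-01",
           "DigitalOcean", Decimal("-8.57")),
    RawTxn("import-data/mercury/2024.csv", 2, "2024-01-02",
           "Unknown vendor", Decimal("-1.00")),
]
rules = [
    Rule(lambda t: "digitalocean" in t.desc.lower(),
         "Digital Ocean", "Expenses:Engineering:ServiceSubscription"),
]
generated = list(generate(raw_rows, rules, "Assets:Bank:US:Mercury"))
```

Because extraction (producing `RawTxn`) and generation (the rules) are separate steps, the same extractor output can be reused with different rules, which is exactly what a combined importer class makes hard.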

The beancount importer also lacks deep integration with the language syntax tree for the importer use case, which is why much manual processing is needed after the user runs the import command. Beanhub-import inserts each transaction with a unique import-id value from the CSV file, like this:

2024-01-01 * "Digital Ocean"
  import-id: "import-data/mercury/2024.csv:-1"
  import-src: "import-data/mercury/2024.csv"
  Assets:Bank:US:Mercury                                             -8.57 USD
  Expenses:Engineering:ServiceSubscription                            8.57 USD

With those, we can accurately update the ledger files without manual editing.
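
To show why a stable import-id makes re-running the import idempotent, here is a small Python sketch that compares existing and generated transactions by id and plans the changes. It is a simplification under my own assumptions, not the real implementation: transactions are represented as plain text keyed by import-id, the `ChangePlan`, `plan_changes`, and `apply_changes` names are hypothetical, and removing transactions that are no longer generated is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ChangePlan:
    """Hypothetical description of what has to change in the ledger."""
    add: dict = field(default_factory=dict)
    update: dict = field(default_factory=dict)
    remove: list = field(default_factory=list)

    def is_empty(self) -> bool:
        return not (self.add or self.update or self.remove)


def plan_changes(existing: dict, generated: dict) -> ChangePlan:
    """Both arguments map import-id -> rendered transaction text."""
    plan = ChangePlan()
    for import_id, text in generated.items():
        if import_id not in existing:
            plan.add[import_id] = text
        elif existing[import_id] != text:
            plan.update[import_id] = text
    # Assumption: ids that are no longer generated get removed.
    plan.remove = sorted(i for i in existing if i not in generated)
    return plan


def apply_changes(existing: dict, plan: ChangePlan) -> dict:
    """Return a new mapping with the plan applied; the input is untouched."""
    result = {k: v for k, v in existing.items() if k not in plan.remove}
    result.update(plan.update)
    result.update(plan.add)
    return result


# Made-up sample data, only for illustration.
existing = {
    "import-data/mercury/2024.csv:-1": "old narration",
    "import-data/mercury/2024.csv:-2": "no longer generated",
}
generated = {
    "import-data/mercury/2024.csv:-1": "new narration",
    "import-data/mercury/2024.csv:-3": "brand new",
}
first_run = plan_changes(existing, generated)
ledger = apply_changes(existing, first_run)
second_run = plan_changes(ledger, generated)  # nothing left to do
```

Running the import a second time against the updated ledger yields an empty plan, so no duplicates are created and no manual cleanup is needed.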