Watch out for Copilot

On the Beancount mailing list, Martin Blais writes:

Dear Beancount users,

This is a PSA -- TL;DR: don't enable GitHub Copilot completion on your ledgers.

If you installed GitHub Copilot in your personal code editor/computer, be aware that it uploads "snippets" of your input files to GitHub and possibly to third-party APIs (e.g., OpenAI). I think people are just beginning to become aware of the implications of this because their employers are crafting policies around which LLMs they can use, but it's still early days and it's easy to accidentally screw up, so here are some thoughts about this.

I think it's really easy to install GitHub Copilot to get code completions in, say, Emacs, and then to open up your ledger with Copilot minor mode active, for example if you enabled it everywhere via (add-hook 'prog-mode-hook 'copilot-mode) or similar ("it's amazing, right?"). That means you get completions on the ledger's contents. AFAICT it's impossible to know how much context is sent up to the models for queries. GH claims general "context" is sent:

GitHub Copilot · Your AI pair programmer · GitHub

Elsewhere I've seen it described as "a few lines of context before and after the code you're editing". AFAIK there's no way to know how large this context is, and I've also seen the selection mentioned somewhere. For example, if you select your entire ledger file, does it upload the whole thing as context for your completion prompt?

GitHub's retention policy says prompts aren't retained, but what about context?

I see "Prompts and Suggestions" in the FAQ:

So might some of your transaction data end up getting used to train new models?

Please correct me if I'm wrong:

  • I don't believe there is a local log (on your computer) of what was actually sent.

(If you just accidentally once opened up your ledger with the entire history of your financial life, it's not impossible that the whole thing was uploaded to Copilot.)

  • I don't believe GitHub lets you view the content you've uploaded and sent from their site either.

  • I don't believe GitHub lets you delete the content as a matter of normal usage (like Google Dashboard does, e.g., https://myaccount.google.com/dashboard)

There's some mention in the FAQ:

This takes you to this page:

Okay, so maybe. This looks good in theory, but what if your data has also been sent to a third-party service?

AFAIK Copilot uses OpenAI's Codex model. Does GitHub have a setup to host and run it themselves?

Or is all the data sent to a service run by OpenAI?

I think it's appropriate to be really cautious about this.

Very interesting. Be careful with other people's data and with data you care about.

I would only use it for projects I publish publicly. (I actually haven't used it yet, but probably will soon.) For the best code, more context is important, so if the project is public anyway, I wouldn't mind the whole thing being studied. It would be nice if it were restrictable, though.

You can easily shut Copilot out of filetypes you don't want it to work in. I chose to shut Copilot up by default, then selectively allow it for some filetypes.
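
For the Emacs setup mentioned in the quoted message, an opt-in arrangement might look roughly like the sketch below. It assumes only that some package provides the `copilot-mode` minor mode used in that (add-hook ...) example; the two mode hooks listed are illustrative choices, not a recommendation. Any ledger buffer whose major mode derives from prog-mode would pick up a blanket prog-mode-hook, which is why the sketch removes that hook first.

```elisp
;; Sketch: opt in to Copilot per major mode instead of everywhere.
;; Assumption: `copilot-mode' is the minor mode from the quoted message;
;; the hooks below are examples only, so substitute your own.

;; Undo a blanket hook if one was added earlier (harmless if it was not).
(remove-hook 'prog-mode-hook #'copilot-mode)

;; Enable Copilot only in modes whose files you are willing to upload.
(dolist (hook '(python-mode-hook emacs-lisp-mode-hook))
  (add-hook hook #'copilot-mode))
```

With this in place, opening a ledger file should not start Copilot unless its mode hook is added to the list explicitly. Hooks only affect buffers opened afterwards, so check already-open buffers by hand.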

In VS Code, I made a new profile for Copilot that activates in a test directory, just to try it out. The plugins are only installed in that profile. I think I also set the file types through the plugin settings.