Data from AI coding-agent sessions run on your own computer is, by default, not merely non-anonymous: it contains frequent repetitions of your username. This is because commands and log messages sometimes use absolute paths, and those paths sit under your home directory, so the transcripts are littered with references like /home/<username>/projects/<projectname>.
Training pipelines should probably strip this out, but I'm not aware of any vendor saying they do (I had an agent search for statements to that effect and it found none). This means that if you use an AI coding assistant, your transcripts are incorporated into a training run, and you then use the resulting model in ways that also mention the same username, the model may be primed with far more information about you, your projects, and your past coding-agent interactions than you expect.
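For concreteness, here is a minimal sketch of what the simplest version of that scrubbing could look like. The path pattern and the <user> placeholder are my own illustrative assumptions, not something I've seen any pipeline document:

```python
import re

# Match home-directory prefixes on Linux (/home/<user>) and macOS (/Users/<user>).
# This is an assumed convention, not an exhaustive one (Windows paths, custom
# home locations, and NFS mounts would all slip through).
HOME_PATH = re.compile(r"(/home|/Users)/([A-Za-z0-9._-]+)")

def scrub_usernames(transcript: str) -> str:
    """Replace the username component of home-directory paths with a placeholder."""
    return HOME_PATH.sub(r"\1/<user>", transcript)

example = "Ran pytest /home/alice/projects/parser/tests -- 3 failures"
print(scrub_usernames(example))
# Ran pytest /home/<user>/projects/parser/tests -- 3 failures
```

Even this much would remove the most frequent leak, though a real pipeline would also have to catch the bare username outside paths: git author lines, shell prompts, environment-variable dumps, and so on.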
A few things leave an extremely strong statistical imprint on your sessions: your choice of programming language, your projects, and your tab size, for example. If transcripts are being trained on unredacted, you would expect models to quickly learn associations between usernames and programming languages, and to bias their new-project setup towards the language they associate with the current user. That would be harmless, and even a bit useful.
There are plausible scenarios where this becomes pathological, however. It could produce user- or group-specific quirks, making it hard for people to reason collectively about what models are like. If the training data contains a mix of users working with smarter models and users working with dumber models, the resulting model might perform differently depending on which group your username fell into in the training data.
A small but significant fraction of transcripts I see online show users being angry and abusive towards their AIs. I'm not sure what will happen when those users try out next-gen models that remember more than expected; I don't expect naive game theory or human psychology to be a good guide.