Ariel Kwiatkowski


Counterpoint: this is needlessly pedantic and a losing fight.

My understanding of the core argument is that "agent" in alignment/safety literature has a slightly different meaning than "agent" in RL. It might be the case that the difference turns out to be important, but there's still some connection between the two meanings.

I'm not going to argue that RL inherently creates "agentic" systems in the alignment sense. I suspect there's at least a strong correlation there (i.e. an RL-trained agent will typically be agentic in that sense), but that's honestly beside the point.

The term "RL agent" is very well entrenched and is de facto the correct technical term for that part of the RL formalism. The fact that alignment people use the term differently doesn't justify going into neighboring fields and demanding that they change their ways.

It's kinda like telling biologists that they shouldn't use the word "matrix" because actual matrices are arrays of numbers (or linear maps, whatever; mathematicians, don't @ me).


And finally, as an example of why, even if I drank the Kool-Aid, I absolutely couldn't make the switch you're recommending: what about multiagent RL? Especially with homogeneous agents. Doing s/agent/policy/g won't work, because a multiagent algorithm doesn't have to be multipolicy.


The appendix on s/reward/reinforcement/g is even more silly in my opinion. RL agents (heh) are designed to seek out the reward. They might fail, but that's the overarching goal.

I would be interested in some advice going a step further. Assuming a roughly sufficient technical skill level (in my case, a soon-to-be PhD in an application of ML) as well as an interest in the field, how does one actually enter the field with a full-time position? I know independent research is one option, but it has its pros and cons. And the companies that are interested in alignment are either very tiny (so not many positions) or very large (like OpenAI et al., so very selective).

Isn't this extremely easy to directly verify empirically? 

Take a neural network $f$ trained on some standard task, like ImageNet or something. Evaluate $|f(kx) - kf(x)|$ on a bunch of samples $x$ from the dataset, and $|f(x+y) - f(x) - f(y)|$ on pairs of samples $x, y$. If it's "almost linear", then both differences should be very small on average. I'm not sure right now how to define "very small", but you could compare them e.g. to the distance distribution $|f(x) - f(y)|$ of independent samples; the right comparison also depends on what the output head is.
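For what it's worth, here is a minimal sketch of that check in plain Python. Everything in it is an assumption made for illustration: `linearity_gaps`, `linear_map`, and `relu_net` are hypothetical names, the two toy functions stand in for a real trained network, the samples are random Gaussians rather than a real dataset, and the norm is the Euclidean one.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))


def sub(u, v):
    """Elementwise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def linearity_gaps(f, samples, k=2.0, seed=0):
    """Return (mean homogeneity gap, mean additivity gap, mean baseline distance)."""
    rng = random.Random(seed)
    homog, addit, base = [], [], []
    for x in samples:
        y = rng.choice(samples)  # partner sample drawn independently
        fx, fy = f(x), f(y)
        # |f(kx) - k f(x)|
        homog.append(norm(sub(f([k * a for a in x]), [k * a for a in fx])))
        # |f(x + y) - f(x) - f(y)|
        f_sum = f([a + b for a, b in zip(x, y)])
        addit.append(norm(sub(sub(f_sum, fx), fy)))
        # |f(x) - f(y)|, the scale to compare the gaps against
        base.append(norm(sub(fx, fy)))
    n = len(samples)
    return sum(homog) / n, sum(addit) / n, sum(base) / n


def linear_map(x):
    """Toy stand-in that is exactly linear."""
    return [2.0 * x[0] - x[1], 0.5 * x[1]]


def relu_net(x):
    """Toy stand-in for a trained network: one ReLU layer with biases."""
    h = [max(0.0, x[0] - x[1] + 1.0), max(0.0, -x[0] + 0.5 * x[1] - 1.0)]
    return [h[0] + h[1], h[0] - 2.0 * h[1]]


data_rng = random.Random(1)
samples = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]

lin_gaps = linearity_gaps(linear_map, samples)
net_gaps = linearity_gaps(relu_net, samples)
print("linear map:", lin_gaps)
print("relu net:  ", net_gaps)
```

For a real test you would swap `relu_net` for the trained model's forward pass and `samples` for dataset inputs, then look at the ratio of the first two numbers to the third.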

FWIW my opinion is that all this "circumstantial evidence" is a big non sequitur, and the base statement is fundamentally wrong. But it seems like such an easily testable hypothesis that it's more effort to discuss it than actually verify it.

"Overall, it continually gets more expensive to do the same amount of work"


This doesn't seem supported by the graph? I might be misunderstanding something, but it seems like research funding essentially followed inflation, so it didn't get more expensive in any meaningful terms. The trend even seems to be a little bit downwards for the real value.

Looking for research idea feedback:

Learning to manipulate: consider a system with a large population of agents working towards a certain goal. Their behavior can be either learned or rule-based, but at this point it is fixed. This could be an environment of ants using pheromones to collect food and bring it home.

Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different "home", hijacking their work.

Does this sound interesting? If it works, would it potentially be publishable as a research paper? (or at least a post on LW) Any other feedback is welcome!

But isn't the whole point that the hotel is full initially, and yet can accept more guests?

Has anyone tried to work with neural networks predicting the weights of other neural networks? I'm thinking about that in the context of something like subsystem alignment, e.g. in an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights or some embedding of its policy) which is the one that actually obtains the reward.
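To make the setup concrete, here is a toy sketch of what I mean by "outputting the weights". It is purely illustrative: `make_hypernet` and `subagent_policy` are hypothetical names, the weight generator is an untrained random linear map, and the embedding `z` is hard-coded where the parent agent's output would go.

```python
import random


def make_hypernet(embed_dim, in_dim, out_dim, seed=0):
    """Build a random linear map from an embedding to a subagent's weight matrix."""
    rng = random.Random(seed)
    n_weights = in_dim * out_dim
    # One row of generator weights per generated subagent weight.
    H = [[rng.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(n_weights)]

    def generate(z):
        flat = [sum(h * zi for h, zi in zip(row, z)) for row in H]
        # Reshape the flat vector into out_dim rows of in_dim weights.
        return [flat[r * in_dim:(r + 1) * in_dim] for r in range(out_dim)]

    return generate


def subagent_policy(W, obs):
    """Linear subagent: score each action and return the index of the best one."""
    scores = [sum(w * o for w, o in zip(row, obs)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)


generate = make_hypernet(embed_dim=3, in_dim=4, out_dim=2)
z = [1.0, -0.5, 0.2]  # assumption: this would come from the parent agent
W = generate(z)
action = subagent_policy(W, [0.1, 0.0, -0.3, 0.7])
print(action)
```

In the setting I have in mind, the parent would be trained on the reward the subagent collects, so the gradient (or some other learning signal) has to pass through `generate`.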

This reminds me of an idea bouncing around my mind recently, admittedly not aiming to solve this problem, but possibly exhibiting it.

Drawing inspiration from human evolution: given a sufficiently rich environment where agents have some necessities for survival (like gathering food), they could be pretrained with something like a survival prior, which doesn't require any specific reward signals.

Then, agents produced this way could be fine-tuned for downstream tasks or, in a way, for obeying orders. The problem would arise when an agent is given an order that results in its death. We might want to ensure it follows its original (survival) instinct, unless overridden by a more specific order.

And going back to a multiagent scenario, similar issues might arise when an order requires antisocial behavior in a usually cooperative environment. The AI Economist comes to mind as a setting where that could come into play, since its agents actually learn some nontrivial social relations.