All of Ariel Kwiatkowski's Comments + Replies

Often academics justify this on the grounds that you're receiving more than just monetary benefits: you're receiving mentorship and training. We think the same will be true for these positions. 


I don't buy this. I'm actually going through the process of getting a PhD at ~40k USD per year, and one of the main reasons why I'm sticking with it is that after that, I have a solid credential that's recognized worldwide, backed by a recognizable name (i.e. my university and my supervisor). You can't provide either of those things.

This offer seems to take the worst of both worlds between academia and industry, but if you actually find someone good at this rate, good for you, I suppose.

My point is that your comment was extremely shallow, with a bunch of irrelevant information, and in general plagued with the annoying ultra-polite ChatGPT style - in total, not contributing anything to the conversation. You're now defensive about it and skirting around answering the question in the other comment chain ("my endorsed review"), so you clearly intuitively see that this wasn't a good contribution. Try to look inwards and understand why.

It's really good to see this said out loud. I don't necessarily have a broad overview of the funding field, just my own experience of trying to get into it (applying to established orgs, seeking funding for individual research, or for alignment-adjacent work) and ending up at a capabilities research company.

I wonder if this is simply the result of the generally bad SWE/CS market right now. People who would otherwise be in big tech or other AI work will be more inclined to do something with alignment. Similarly, if there's less money in tech overall (maybe outside of LLM-based scams), there may be less money for alignment.

This is roughly my situation. Waymo froze hiring and had layoffs while continuing to increase output expectations. As a result I/we had more work. I left in March to explore AI and landed on Mechanistic Interpretability research.

Is it a thing now to post LLM-generated comments on LW?


If Orthogonal ever wants to be taken seriously, by far the most important thing is improving the public-facing communication. I invested a more-than-fair amount of time (given the strong prior for "it won't work" with no author credentials, proof-of-concepts, or anything that would quickly nudge that prior) trying to understand QACI, and why it's not just gibberish (both through reading LW posts and interacting with authors/contributors on the Discord server), and I'm still mostly convinced there is absolutely nothing of value in this direction.

And n...

When you say "X is not a paradox", how do you define a paradox?

Does the original paper even refer to x-risk? The word "alignment" doesn't necessarily imply that specific aspect.

4 · Cleo Nardo · 3mo
Nope, no mention of xrisk — which is fine because "alignment" means "the system does what the user/developer wanted", which is more general than xrisk mitigation. But the paper's results suggest that finetuning is much worse than RLHF or ConstitutionalAI at this more general sense of "alignment", despite the claims in their conclusion.

I feel like this is one of the cases where you need to be very precise about your language, and be careful not to use an "analogous" problem which actually changes the situation.


Consider the first "bajillion dollars vs dying" variant. We know that right now, there are about 8B humans alive. What happens if the exponential increase exceeds that number? We probably have to assume there's an infinite number of humans, fair enough.

What does it mean that "you've chosen to play"? This implies some intentionality, but due to the structure of the game, where th...

2 · Martin Randall · 3mo
I definitely agree on the need for care in switching between variants. It can also be helpful that they can "change the situation" because this can reveal something unspecified about the original variant. Certainly I was helped by making a second variant, as this clarified for me that the probabilities are different from the deity view vs the snake view, because of anthropics.

In the original variant, it's not specified when exactly players get devoured. Maybe it is instant. Maybe everyone is given a big box that contains either a bazillion dollars, or human-eating snakes, and it opens exactly a year later.

In my variant, I was imagining the god initially created a batch of snakes with uncolored eyes, then played dice, then gave them red or blue eyes. So the snakes, like the players, can have experiences prior to the dice being rolled. And yes, no snakes exist before I start. (why is the god wicked? No love for snakes...) I'll update the text to clarify that no snakes exist until the god of snake creation gets to work.

I think this is a great crystallization of the paradox. In this scenario, it seems like I should believe I have a 1/36 chance of red eyes, and my new friend has a 1/2 chance of red eyes. But my friend has had exactly the same experiences as me, and they reason that the probabilities are reversed.

Counterpoint: this is needlessly pedantic and a losing fight.

My understanding of the core argument is that "agent" in alignment/safety literature has a slightly different meaning than "agent" in RL. It might be the case that the difference turns out to be important, but there's still some connection between the two meanings.

I'm not going to argue that RL inherently creates "agentic" systems in the alignment sense. I suspect there's at least a strong correlation there (i.e. an RL-trained agent will typically create an agentic system), but that's honestly be...

I'm... not demanding that the field of RL change? Where in the post did you perceive me to demand this? For example, I wrote that "I wouldn't say 'reinforcement function' in e.g. a conference paper." I also took care to write "This terminology is loaded and inappropriate for my purposes." Each individual reader can choose to swap to "policy" without communication difficulties, in my experience. (As an aside, I also separately wish RL would change its terminology, but it's a losing fight as you point out, and I have better things to do with my time.)

I would be interested in some advice going a step further -- assuming a roughly sufficient technical skill level (in my case, soon-to-be PhD in an application of ML), as well as an interest in the field, how does one actually enter the field with a full-time position? I know independent research is one option, but it has its pros and cons. And companies which are interested in alignment are either very tiny (=not many positions), or very huge (like OpenAI et al., =very selective).

Isn't this extremely easy to directly verify empirically? 

Take a neural network $f$ trained on some standard task, like ImageNet or something. Evaluate $|f(kx) - kf(x)|$ on a bunch of samples $x$ from the dataset, and $|f(x+y) - f(x) - f(y)|$ on samples $x, y$. If it's "almost linear", then both differences should be very small on average. I'm not sure right now how to define "very small", but you could compare it e.g. to the distance distribution $|f(x) - f(y)|$ of independent samples, also depending on what the head is.
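A minimal sketch of that check, under loud assumptions: a small randomly initialised ReLU MLP stands in for a trained ImageNet model, Gaussian noise stands in for dataset samples, and $|\cdot|$ is taken to be the Euclidean norm. The names `make_mlp` and `linearity_gaps` are hypothetical helpers, not from any library.

```python
import math
import random


def make_mlp(sizes, rng):
    """Random ReLU MLP; an assumed stand-in for a trained network f."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [rng.gauss(0.0, 0.1) for _ in range(n_out)]
        layers.append((W, b))

    def f(x):
        h = x
        for i, (W, b) in enumerate(layers):
            h = [sum(w * v for w, v in zip(row, h)) + bias
                 for row, bias in zip(W, b)]
            if i < len(layers) - 1:  # ReLU on every layer except the output head
                h = [max(0.0, v) for v in h]
        return h

    return f


def norm(v):
    """Euclidean norm (assumed choice for |.|)."""
    return math.sqrt(sum(a * a for a in v))


def linearity_gaps(f, xs, ys, k=2.0):
    """Mean |f(kx) - k f(x)|, mean |f(x+y) - f(x) - f(y)|, mean |f(x) - f(y)|."""
    hom, add, base = [], [], []
    for x, y in zip(xs, ys):
        fx, fy = f(x), f(y)
        fkx = f([k * a for a in x])
        fxy = f([a + b for a, b in zip(x, y)])
        hom.append(norm([p - k * q for p, q in zip(fkx, fx)]))
        add.append(norm([p - q - r for p, q, r in zip(fxy, fx, fy)]))
        base.append(norm([p - q for p, q in zip(fx, fy)]))

    def mean(v):
        return sum(v) / len(v)

    return mean(hom), mean(add), mean(base)


rng = random.Random(0)
f = make_mlp([8, 16, 4], rng)
xs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
ys = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
hom_gap, add_gap, baseline = linearity_gaps(f, xs, ys)
print(f"homogeneity gap: {hom_gap:.3f}")
print(f"additivity gap:  {add_gap:.3f}")
print(f"baseline |f(x) - f(y)|: {baseline:.3f}")
```

For an exactly linear `f` both gaps are zero, so the interesting quantity is how each gap compares with the baseline distance between outputs on independent samples.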

FWIW my opinion is that all this "...

At least how I would put this -- I don't think the important part is that NNs are literally almost linear, when viewed as input-output functions. More like, they have linearly represented features (i.e. directions in activation space, either in the network as a whole or at a fixed layer), or there are other important linear statistics of their weights (linear mode connectivity) or activations (linear probing). Maybe beren can clarify what they had in mind, though.

"Overall, it continually gets more expensive to do the same amount of work"


This doesn't seem supported by the graph? I might be misunderstanding something, but it seems like research funding essentially followed inflation, so it didn't get more expensive in any meaningful terms. The trend even seems to be a little bit downwards for the real value.

Grant award amounts remained the same adjusted for the biomedical price index, but the nominal amounts had to double to match the price index increases.

Looking for research idea feedback:

Learning to manipulate: consider a system with a large population of agents working on a certain goal, either learned or rule-based, but at this point - fixed. This could be an environment of ants using pheromones to collect food and bring it home.

Now add another agent (or some number of them) which learns in this environment, and tries to get other agents to instead fulfil a different goal. It could be ants redirecting others to a different "home", hijacking their work.

Does this sound interesting? If it works, would it potentially be publishable as a research paper? (or at least a post on LW) Any other feedback is welcome!

This sounds interesting to me.

But isn't the whole point that the hotel is full initially, and yet can accept more guests?

Yeah, the hotel being always half full no matter how many guests it has doesn't seem as cool.

Has anyone tried to work with neural networks predicting the weights of other neural networks? I'm thinking about that in the context of something like subsystem alignment, e.g. in an RL setting where an agent first learns about the environment, and then creates a subagent (by outputting the weights or some embedding of its policy) which actually obtains some reward.
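To make the setup concrete, here is a rough sketch of the forward pass only, with no training loop. Everything in it is an assumption for illustration: the "agent" is reduced to a single linear map from a task embedding to a flat parameter vector, the subagent is a linear softmax policy, and the names `make_hypernet` and `subagent_policy` are hypothetical.

```python
import math
import random


def linear(W, b, x):
    """Affine map W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def make_hypernet(embed_dim, n_target_params, rng):
    """Assumed 'agent': maps an embedding z to the subagent's flat weights."""
    scale = 1.0 / math.sqrt(embed_dim)
    W = [[rng.gauss(0.0, scale) for _ in range(embed_dim)]
         for _ in range(n_target_params)]
    b = [0.0] * n_target_params

    def hyper(z):
        return linear(W, b, z)

    return hyper


def subagent_policy(flat, obs, obs_dim, n_actions):
    """Unpack flat weights into a linear softmax policy; return action probs."""
    W = [flat[i * obs_dim:(i + 1) * obs_dim] for i in range(n_actions)]
    b = flat[n_actions * obs_dim:]
    logits = linear(W, b, obs)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


OBS_DIM, N_ACTIONS, EMBED_DIM = 3, 2, 4
rng = random.Random(0)
hyper = make_hypernet(EMBED_DIM, N_ACTIONS * OBS_DIM + N_ACTIONS, rng)

z = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # what the agent "learned"
flat_params = hyper(z)                               # weights of the subagent
obs = [rng.gauss(0.0, 1.0) for _ in range(OBS_DIM)]
probs = subagent_policy(flat_params, obs, OBS_DIM, N_ACTIONS)
print(probs)
```

In an actual experiment the reward collected by the subagent would have to be propagated back into the weight-generating network; that part is deliberately left out here.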

This reminds me of an idea bouncing around my mind recently, admittedly not aiming to solve this problem, but possibly exhibiting it.

Drawing inspiration from human evolution: given a sufficiently rich environment where agents have some necessities for surviving (like gathering food), they could be pretrained with something like a survival prior which doesn't require any specific reward signals.

Then, agents produced this way could be fine-tuned for downstream tasks, or to obey orders. The problem would arise when an agent is given an ord...