This is a special post for quick takes by Abhimanyu Pallavi Sudhir.

The use of "Differential Progress" ("does this advance safety more or capabilities more?") by the AI safety community to evaluate the value of research is ill-motivated.

Most capabilities advancements are not very counterfactual ("some similar advancement would have happened anyway"), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves.

Even if you looked at actual market value, just p_safety > p_capabilities isn't a principled condition.

Concretely, I think that harping on differential progress risks AI safety getting crowded out by harmless but useless work -- most obviously "AI bias" and "AI disinformation", and, in my more controversial opinion, overtly prosaic AI safety research which will not give us any insights that can be generalized beyond current architectures. A serious solution to AI alignment will in all likelihood involve risky things like imagining more powerful architectures and revealing some deeper insights about intelligence.

I think there are two important insights here. One is that counterfactual differential progress is the right metric for weighing whether ideas or work should be published. This seems clearly true once stated, but it is not obvious beforehand, so it is well worth stating, and frequently.

The second important idea is that doing detailed work on alignment requires talking about specific AGI designs. This also seems obviously true, but I think goes unnoticed and unappreciated a lot of the time. How an AGI arrives at decisions, beliefs, and values is going to be dependent on its specific architectures.

Putting these two concepts together makes the publication decision much more difficult. Should we cripple alignment work in the interest of having more time before AGI? One pat answer I see is "discuss those ideas privately, not publicly". But in practice, this severely limits the number of eyes on each idea, making it vastly more likely that good ideas in alignment aren't spread and worked on quickly.

I don't have any good solutions here, but want to note that this issue seems critically important for alignment work. I've personally been roadblocked in substantial ways by this dilemma.

My background means I have a relatively large amount of knowledge, and many theories, about how the human mind works. I have specific ideas about several possible routes from current AI to x-risk AGI. Each of these routes also has associated alignment plans. But I can't discuss those plans in detail without discussing the AGI designs in detail. They sound vague and unconvincing without the design forms they fit into. This is a sharp limitation on how much progress I can make on these ideas. I have a handful of people who can and will engage in detail in private, limited engagement in public where the ideas must remain vague, and largely I am working on my own. Private feedback indicates that these AGI designs and alignment schemes might well be viable and relevant, although of course a debunking is always one conversation away.

This is not ideal, nor do I know of a route around it.

current LLMs vs dangerous AIs

Most current "alignment research" with LLMs seems indistinguishable from "capabilities research". Both are just "getting the AI to be better at what we want it to do", and there isn't really a critical difference between the two.

Alignment in the original sense was defined oppositionally to the AI's own nefarious objectives. Which LLMs don't have, so alignment research with LLMs is probably moot.

something related I wrote in my MATS application:


  1. I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains a long-term storage, calls functions such as search over storage, self-prompting and habit modification either based on LLM-generated function calls or as cron-jobs/hooks).

  2. These kinds of alignment failures (1) are only truly serious when the agent is somehow objective-driven or equivalently has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify), and (2) can only be solved when the agent is objective-driven.

quick thoughts on LLM psychology

LLMs cannot be directly anthropomorphized. Though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, receives inputs that are added to the LLM context” is much more agent-like.
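To make the quoted description concrete, here is a minimal sketch of such a program. Everything in it is an illustrative assumption rather than a real system: `stub_llm` is a hypothetical stand-in for an actual LLM call, the `CALL name arg` output convention is made up, and the `Agent`, `remember` and `recall` names are invented for this example. The rolling context is a bounded deque, and the "relational database" is an in-memory SQLite table.

```python
import sqlite3
from collections import deque


def stub_llm(context):
    """Hypothetical stand-in for an LLM: maps the visible context to one output line."""
    if not context:
        return "THOUGHT idle"
    last = context[-1]
    if last.startswith("input:"):
        # Pretend the model decides to store whatever it was just told.
        return "CALL remember " + last[len("input:"):].strip()
    return "THOUGHT noted"


class Agent:
    def __init__(self, llm, context_size=8):
        self.llm = llm
        # Rolling, limited-size chain of thought: old entries fall off the left.
        self.context = deque(maxlen=context_size)
        # Long-term storage in a relational database (in-memory for the sketch).
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE memory (id INTEGER PRIMARY KEY, note TEXT)")
        # Function library the LLM output is allowed to call.
        self.functions = {"remember": self.remember, "recall": self.recall}

    def remember(self, note):
        self.db.execute("INSERT INTO memory (note) VALUES (?)", (note,))
        return "stored"

    def recall(self, query):
        rows = self.db.execute(
            "SELECT note FROM memory WHERE note LIKE ?", (f"%{query}%",)
        ).fetchall()
        return "; ".join(r[0] for r in rows) or "nothing found"

    def step(self, user_input=None):
        # External inputs are simply appended to the LLM context.
        if user_input is not None:
            self.context.append("input: " + user_input)
        output = self.llm(list(self.context))
        self.context.append(output)
        if output.startswith("CALL "):
            _, name, arg = output.split(" ", 2)
            func = self.functions.get(name)
            result = func(arg) if func else "unknown function"
            self.context.append("result: " + result)
        return output


agent = Agent(stub_llm)
agent.step("the sky is blue")   # the stub chooses to call remember(...)
print(agent.recall("sky"))
```

Note that nothing in this loop gives the program objectives or feelings of its own; it only supplies the scaffolding (memory, function calls, persistent context) that makes the whole thing look more agent-like than a bare LLM call.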

Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.

These feelings add up to a “utility function”, something that is only instrumentally useful to the training process. I.e. you can think of a utility function as itself a heuristic taught by the reward function.

LLMs certainly do need cost-benefit signals about features of text. But I think their feelings/utility functions are limited to just that.

E.g. LLMs do not experience the feeling of “mental effort”. They do not find some questions harder than others, because the energy cost of cognition is not a useful signal to them during the training process (I don’t think regularization counts for this either).

LLMs also do not experience “annoyance”. They don’t have the ability to ignore or obliterate a user they’re annoyed with, so annoyance is not a useful signal to them.

Ok, but aren’t LLMs capable of simulating annoyance? E.g. if annoying questions are followed by annoyed responses in the dataset, couldn’t LLMs learn to experience some model of annoyance so as to correctly reproduce the verbal effects of annoyance in their responses?

More precisely, if you just gave an LLM the function ignore_user() in its function library, it would run it when “simulating annoyance” even though ignoring the user wasn’t useful during training, because it’s playing the role.

I don’t think this is the same as being annoyed, though. For people, simulating an emotion and feeling it are often similar due to mirror neurons or whatever, but there is no reason to expect this is the case for LLMs.

conditionalization is not the probabilistic version of implies

P    Q    Q | P    P → Q
T    T    T        T
T    F    F        F
F    T    N/A      T
F    F    N/A      T

Resolution logic for conditionalization: Q if P else None

Resolution logic for implies: Q if P else True
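As a small sketch of the distinction in the truth table above: the function names and the example joint distribution below are made up for illustration. The first two functions reproduce the table row by row, with `None` standing in for N/A. The last two show that the corresponding probabilities also differ, since P(Q | P) = P(P and Q) / P(P), while P(P → Q) = 1 - P(P and not Q).

```python
from typing import Dict, Optional, Tuple


def resolve_conditional(p: bool, q: bool) -> Optional[bool]:
    """Resolve a bet on Q | P: void (None, i.e. N/A) whenever P is false."""
    return q if p else None


def resolve_implies(p: bool, q: bool) -> bool:
    """Resolve a bet on P -> Q: vacuously True whenever P is false."""
    return q if p else True


def truth_table():
    """Rows of (P, Q, Q | P, P -> Q) in the same order as the table in the text."""
    return [
        (p, q, resolve_conditional(p, q), resolve_implies(p, q))
        for p in (True, False)
        for q in (True, False)
    ]


# A joint distribution over (P, Q), keyed by truth values. Hypothetical numbers.
Joint = Dict[Tuple[bool, bool], float]


def prob_conditional(joint: Joint) -> float:
    """P(Q | P) = P(P and Q) / P(P)."""
    p_true = joint[(True, True)] + joint[(True, False)]
    return joint[(True, True)] / p_true


def prob_implies(joint: Joint) -> float:
    """P(P -> Q) = 1 - P(P and not Q)."""
    return 1.0 - joint[(True, False)]


example = {
    (True, True): 0.1,
    (True, False): 0.1,
    (False, True): 0.4,
    (False, False): 0.4,
}
print(prob_conditional(example), prob_implies(example))
```

In this example P is usually false, so the implication is mostly satisfied vacuously and has high probability, while the conditional probability only looks at the cases where P holds.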

The simplest way to explain "the reward function isn't the utility function" is: humans evolved to have utility functions because it was instrumentally useful for the reward function / evolution selected agents with utility functions.

(yeah I know maybe we don't even have utility functions; that's not the point)

Concretely: it was useful for humans to have feelings and desires, because that way evolution doesn't have to spoonfeed us every last detail of how we should act, instead it gives us heuristics like "food smells good, I want".

Evolution couldn't just select a perfect optimizer of the reward function, because there is no such thing as a perfect optimizer (computational costs mean that a "perfect optimizer" is actually uncomputable). So instead it selected agents that were boundedly optimal given their training environment.

One thing I'm surprised by is how everyone learns the canonical way to handwrite certain math characters, despite learning most things from printed or electronic material. E.g. writing ℝ as IR rather than how it's rendered.

I know I learned the canonical way because of Khan Academy, but I don't think "guy handwriting on a blackboard-like thing" is THAT disproportionately common among educational resources?

I learned maths mostly by teachers at school writing on a whiteboard, university lecturers writing on a blackboard or projector, and to a lesser extent friends writing on pieces of paper.

There was a tiny supplement of textbook-reading at school and large supplement of printed-notes-reading at university.

I would guess only a tiny fraction learn exclusively via typed materials. If you have any kind of teacher, how could you? Nobody shows you how to rearrange an equation by live-typing latex.

I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
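A rough sketch of that iteration, under several assumptions that are not part of the original idea: votes are given as a hypothetical `{(voter, author): +1 or -1}` mapping, every user gets a small `baseline` score so that unvoted users do not start at exactly zero, negative scores are clamped to zero before normalizing, and the loop is capped because convergence is not guaranteed for arbitrary vote patterns.

```python
def fixed_point_karma(users, votes, baseline=0.15, max_iters=1000, tol=1e-12):
    """Iterate karma-weighted vote totals until they stop changing.

    users: list of user names.
    votes: dict mapping (voter, author) -> +1 (upvote) or -1 (downvote).
    Returns a dict of karma values normalized to sum to 1.
    """
    n = len(users)
    karma = {u: 1.0 / n for u in users}  # start from uniform karma
    for _ in range(max_iters):
        # Each vote counts in proportion to the voter's current karma.
        raw = {u: baseline for u in users}
        for (voter, author), sign in votes.items():
            raw[author] += sign * karma[voter]
        # Clamp so heavily downvoted users bottom out at zero weight.
        raw = {u: max(x, 0.0) for u, x in raw.items()}
        total = sum(raw.values())
        if total == 0.0:
            return {u: 1.0 / n for u in users}
        # Normalize to avoid hyperinflation.
        new = {u: x / total for u, x in raw.items()}
        delta = max(abs(new[u] - karma[u]) for u in users)
        karma = new
        if delta < tol:
            break
    return karma


users = ["a", "b", "c"]
votes = {("a", "b"): 1, ("c", "b"): 1, ("b", "a"): 1}
karma = fixed_point_karma(users, votes)
print(sorted(karma, key=karma.get, reverse=True))
```

In this toy example "b" is upvoted by two users and "a" by one, and "a" benefits from being upvoted by the highest-karma user, while "c" receives only the baseline.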

I feel like this is vaguely somehow related to:

Oh right, lol, good point.

Also check out "personalized pagerank", where the rating shown to each user is "rooted" in what kind of content this user has upvoted in the past. It's a neat solution to many problems.
