Abhimanyu Pallavi Sudhir's Shortform


[-]Abhimanyu Pallavi Sudhir1y143

I don't think that AI alignment people doing "enemy of enemy is friend" logic with AI luddites (i.e. people worried about Privacy/Racism/Artists/Misinformation/Jobs/Whatever) is useful.

Alignment research is a luxury good for labs, which means it would be the first thing axed (hyperbolically speaking) if you imposed generic hurdles/costs on their revenue, or if you made them spend on mitigating P/R/A/M/J/W problems.

This "crowding-out" effect is already happening to a very large extent: there are vastly more researchers and capital being devoted to P/R/A/M/J/W problems, which could have been allocated to actual alignment research! If you are forming a "coalition" with these people, you are getting a very shitty deal -- they've been much more effective at getting their priorities funded than you have been!

If you want them to care about notkilleveryoneism, you have to specifically make it expensive for them to kill everyone, not just untargetedly "oppose" them. E.g. like foom liability.

[-]ShardPhoenix1y40

they've been much more effective at getting their priorities funded than you have been!

Sounds plausible but do you have any numeric evidence for this?

[-]Abhimanyu Pallavi Sudhir1y*12-1

There is a cliche that there are two types of mathematicians: "theory developers" and "problem solvers". Similarly Dyson’s “birds and frogs”, and Robin Hanson divides the production of knowledge into "framing" and "filling".

It seems to me there are actually three sorts of information in the world:

"Ideas": math/science theories and models, inventions, business ideas, solutions to open-ended problems
"Answers": math theorems, experimental observations, results of computations
"Proofs": math proofs, arguments, evidence, digital signatures, certifications, reputations, signalling

From a strictly Bayesian perspective, there seems to be no "fundamental" difference between these forms of information. They're all just things you condition your prior on. Yet this division seems to be natural in quite a variety of informational tasks. What gives?

adding this from replies for prominence--

Yes, I also realized that "ideas" being a thing is due to bounded rationality -- specifically they are the outputs of AI search. "Proofs" are weirder though, and I haven't seen them distinguished very often. I wonder if this is a reasonable analogy to make:

Ideas : search
Answers : inference
Proofs: alignment

[-]robo1y*93

Humans are computationally bounded, Bayes is not. In an ideal Bayesian perspective:

Your prior must include all possible theories a priori. Before you opened your eyes as a baby, you put some probability of being in a universe with Quantum Field Theory with gauge symmetry and updated from there.
You update with unbounded computation. There's no such thing as proofs, since all proofs are tautological.

Humans are computationally bounded and can't think this way.

(riffing)

"Ideas" find paradigms for modeling the universe that may be profitable to track under limited computation. Maybe you could understand fluid behavior better if you kept track of temperature, or understand biology better if you kept track of vital force. With a bayesian-lite perspective, they kinda give you a prior and places to look where your beliefs are "malleable".

"Proofs" (and evidence) are the justifications for answers. With a bayesian-lite perspective, they kinda give you conditional probabilities.

"Answers" are useful because they can become precomputed, reified, cached beliefs with high credence inertia that you can treat as approximately atomic. In a tabletop physics experiment, you can ignore how your apparatus will gravitationally move the earth (and the details of the composition of the earth). Similarly, you can ignore how the tabletop physics experiment will move your belief about the conservation of energy (and the details of why your credences about the conservation of energy are what they are).

Ideas : search
Answers : inference
Proofs: alignment

[-]Amalthea1y20

Ideas come from unsupervised training, answers from supervised training and proofs from RL on a specified reward function.

I think only particular reward functions, such as in multi-agent/co-operative environments (agents can include humans, like in RLHF) or in actually interactive proving environments?

[-]Abhimanyu Pallavi Sudhir1y114

The use of "Differential Progress" ("does this advance safety more or capabilities more?") by the AI safety community to evaluate the value of research is ill-motivated.

Most capabilities advancements are not very counterfactual ("some similar advancement would have happened anyway"), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves.

Even if you looked at actual market value, just p_safety > p_capabilities isn't a principled condition.

Concretely, I think that harping on differential progress risks AI safety getting crowded out by harmless but useless work: most obviously "AI bias" and "AI disinformation", and in my more controversial opinion, overtly prosaic AI safety research which will not give us any insights that can be generalized beyond current architectures. A serious solution to AI alignment will in all likelihood involve risky things like imagining more powerful architectures and revealing some deeper insights about intelligence.

[-]Seth Herd1y20

I think there are two important insights here. One is that counterfactual differential progress is the right metric for weighing whether ideas or work should be published. This seems obviously true but not obvious, so well worth stating, and frequently.

The second important idea is that doing detailed work on alignment requires talking about specific AGI designs. This also seems obviously true, but I think goes unnoticed and unappreciated a lot of the time. How an AGI arrives at decisions, beliefs, and values is going to be dependent on its specific architectures.

Putting these two concepts together makes the publication decision much more difficult. Should we cripple alignment work in the interest of having more time before AGI? One pat answer I see is "discuss those ideas privately not publicly". But in practice, this severely limits the number of eyes on each idea, making it vastly more likely that good ideas in alignment aren't spread and worked on quickly.

I don't have any good solutions here, but want to note that this issue seems critically important for alignment work. I've personally been roadblocked in substantial ways by this dilemma.

My background means I have a relatively large amount of knowledge and theories about how the human mind works. I have specific ideas about several possible routes from current AI to x-risk AGI. Each of these routes also has associated alignment plans. But I can't discuss those plans in detail without discussing the AGI designs in detail. They sound vague and unconvincing without the design forms they fit into. This is a sharp limitation on how much progress I can make on these ideas. I have a handful of people who can and will engage in detail in private, limited and vague engagement in public where the ideas must remain vague, and largely I am working on my own. Private feedback indicates that these AGI designs and alignment schemes might well be viable and relevant, although of course a debunking is always one conversation away.

This is not ideal, nor do I know of a route around it.

[-]Abhimanyu Pallavi Sudhir11mo4-6

The third virtue of rationality, lightness, is wrong. In fact: the more you value information to change your mind on some question, the more obstinate you should be to changing your mind on that question. Lightness implies disinterest in the question.

Imagine your mind as a logarithmic market-maker which assigns some initial subsidy (parameter $b$) to any new question $Q$. This subsidy parameter captures your marginal value for information on $Q$. But it also measures how hard it is to change your mind: the cost (worst-case loss) of moving your probability from $p$ to $p'$ is $b \max \left[ \log \frac{1 - p}{1 - p'}, \log \frac{p}{p'} \right]$.
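To make the cost concrete, here is a minimal sketch of a binary logarithmic market scoring rule (LMSR). It assumes the standard cost function $C(q) = b \log(e^{q_{yes}/b} + e^{q_{no}/b})$ and a trader who trades only YES shares; the function names are hypothetical, chosen for illustration. It computes the trader's payoff in each outcome and takes the worst case as the "cost" of moving the price.

```python
import math

def lmsr_cost(q_yes, q_no, b):
    """Standard LMSR cost function C(q) = b * log(exp(q_yes/b) + exp(q_no/b))."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def shares_for_price(p, b):
    """Net YES shares (with q_no fixed at 0) at which the market price equals p."""
    return b * math.log(p / (1.0 - p))

def trader_payoffs(p, p_new, b):
    """Payoffs (if Q is true, if Q is false) for moving the price from p to p_new."""
    q0 = shares_for_price(p, b)
    q1 = shares_for_price(p_new, b)
    paid = lmsr_cost(q1, 0.0, b) - lmsr_cost(q0, 0.0, b)  # money paid to the market-maker
    bought = q1 - q0                                      # YES shares acquired (negative = sold)
    return bought - paid, -paid

def worst_case_loss(p, p_new, b):
    """What the trader risks: the loss in the less favourable outcome."""
    return -min(trader_payoffs(p, p_new, b))

# Doubling the subsidy parameter b doubles the price of the same belief change.
print(worst_case_loss(0.5, 0.8, b=10.0))
print(worst_case_loss(0.5, 0.8, b=20.0))
```

Under these assumptions the loss is linear in $b$, which is the sense in which a larger subsidy makes the mind both more information-hungry and harder to move.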

What would this imply in practice? It means that each individual “trader” (both internal mental heuristics/thought patterns, and external sources of information/other people) will generally have a smaller influence on your beliefs, as they may not have enough wealth. Traders who influence your belief will carry greater risk (to their influence on you in future), though will also earn more reward if they’re right.

[-]ChristianKl11mo10

Being obstinate makes you more prone to motivated cognition.

[-]Abhimanyu Pallavi Sudhir1y30

Something that seems like it should be well-known, but I have not seen an explicit reference for:

Goodhart’s law can, in principle, be overcome via adversarial training (or generally learning Multi-Agent Systems)

—aka “The enemy is smart.”

Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.

This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either; it just requires them to be able to constantly throw adversarial examples at the agent, forcing it to “generalize”.

In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).

[-]johnswentworth1y52

This just moves the proxy-being-Goodharted-against from some hardcoded ruleset to a (presumably human) evaluator or selector of adversarial examples.

[-]Nathan Helm-Burger1y30

This then sets up something like a Generative Adversarial Network. The trouble is, such a setup is inherently unstable. Without careful guidance, one of the two adversaries will tend to dominate.

In predator/prey relationships in nature, a stable relationship can come about if the predators starve and reproduce less when they eat too many of the prey. If, however, this effect isn't strong enough (maybe the predators have several prey species), the result is that the prey species can go extinct. Also, the prey species is helped in multi-prey scenarios by becoming less common, and thus less likely to be found and killed by predators and less vulnerable to species-specific disease.

Obviously, these specific considerations don't apply in a literal sense. I'm trying to point out the general concept that you need counterbalancing factors for an adversarial relationship to stay stable.

Just realized in logarithmic market scoring the net number of stocks is basically just log-odds, lol:

$x_{i} = \log \frac{p_{i}}{1 - p_{i}} ⟺ p_{i} = e^{x_{i}} / (e^{x_{i}} + 1)$
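A quick numerical check of this, as a sketch: it assumes a binary market with liquidity parameter $b = 1$ by default, where $x$ is the net number of shares outstanding. The function names are made up for illustration.

```python
import math

def price_from_net_shares(x, b=1.0):
    """LMSR price of an outcome given net shares x (logistic function of x/b)."""
    return math.exp(x / b) / (math.exp(x / b) + 1.0)

def log_odds(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

# The log-odds of the price recovers the net share count.
for x in (-2.0, 0.0, 1.5):
    p = price_from_net_shares(x)
    print(x, p, log_odds(p))
```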

Why aren't adversarial inputs used more widely for captchas?

Different models have different adversarial examples?
There are only a few known adversarial examples for a given model (discovering them takes time), and these can easily just be manually enumerated?

quick thoughts on LLM psychology

LLMs cannot be directly anthropomorphized. Though something like “a program that continuously calls an LLM to generate a rolling chain of thought, dumps memory into a relational database, can call from a library of functions which includes dumping to and recalling from that database, and receives inputs that are added to the LLM context” is much more agent-like.

Humans evolved feelings as signals of cost and benefit — because we can respond to those signals in our behaviour.

These feelings add up to a “utility function”, something that is only instrumentally useful to the training process. I.e. you can think of a utility function as itself a heuristic taught by the reward function.

LLMs certainly do need cost-benefit signals about features of text. But I think their feelings/utility functions are limited to just that.

E.g. LLMs do not experience the feeling of “mental effort”. They do not find some questions harder than others, because the energy cost of cognition is not a useful signal to them during the training process (I don’t think regularization counts for this either).

LLMs also do not experience “annoyance”. They don’t have the ability to ignore or obliterate a user they’re annoyed with, so annoyance is not a useful signal to them.

Ok, but aren’t LLMs capable of simulating annoyance? E.g. if annoying questions are followed by annoyed responses in the dataset, couldn’t LLMs learn to experience some model of annoyance so as to correctly reproduce the verbal effects of annoyance in their responses?

More precisely, if you just gave an LLM the function ignore_user() in its function library, it would run it when “simulating annoyance” even though ignoring the user wasn’t useful during training, because it’s playing the role.

I don’t think this is the same as being annoyed, though. For people, simulating an emotion and feeling it are often similar due to mirror neurons or whatever, but there is no reason to expect this is the case for LLMs.

[This comment is no longer endorsed by its author]

[-]Abhimanyu Pallavi Sudhir1y*22

conditionalization is not the probabilistic version of implies

P	Q	Q | P	P → Q
T	T	T	T
T	F	F	F
F	T	N/A	T
F	F	N/A	T

Resolution logic for conditionalization: Q if P else None

Resolution logic for implies: Q if P else True
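The truth table can be reproduced with a short sketch. It assumes "N/A" is represented by `None` (a conditional bet that is called off when P is false); the function names are hypothetical.

```python
from itertools import product

def resolve_conditional(p, q):
    """Q | P: resolves to Q when P holds; otherwise the question is void (None)."""
    return q if p else None

def resolve_implies(p, q):
    """P -> Q: resolves to Q when P holds; otherwise it is vacuously True."""
    return q if p else True

# Rows are (P, Q, Q | P, P -> Q), in the same order as the table.
rows = [(p, q, resolve_conditional(p, q), resolve_implies(p, q))
        for p, q in product([True, False], repeat=2)]
for row in rows:
    print(row)
```

The two columns agree whenever P is true and differ only in how the P-false rows resolve.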

[-]Abhimanyu Pallavi Sudhir1mo10

Thoughts are things occurring in some mental model (this is a vague sentence but just assume it makes sense). Some of these mental models are strongly rooted in reality (e.g. the mental model we see as reality) and so we have a high degree of confidence about their accuracy. But for things like introspection, we do not have a reliable ground-truth feedback to tell us if our introspection is correct or not—it's just our mental model of our mind, there is no literal "mind's eye".

So often our introspection is wrong. E.g. if you ask someone to visualize a lion from behind, they'll say they can, but if you ask them some details, like "what do the tail hairs look like?" they can't answer. Or better example: if you ask someone to visualize a neural network, they will, but if you ask "how many neurons do you see?" they will not know, and not for lack of counting. Or they will say they "think in words" or that their internal monologue is fundamental to their thinking, but that's obviously wrong: you have already decided what the rest of the sentence will be before you've thought the first word.

We can tell some basic facts about our thinking by reasoning from observation. For example, if you have an internal monologue (or just force yourself to have one) then you can confirm that you indeed have one by speaking the words of the internal monologue out loud and confirming that it took very little cognitive effort (so you didn't have to think them again). This proves an internal monologue/precisely simulating words in your head is possible. Likewise for any action.

Or you can confirm that you had a certain thought, or a thought about something, because you can express it out loud with less effort than otherwise. Though here there is still room for that thought to have been imprecise; unless you verbalize or materialize those thoughts you don't know if your thoughts were really precise. So all these things have grounding in reality, and therefore are likely to be (or can be trained to be, by consistently materializing them) accurate models. By materialize I mean, e.g. solving a math problem you think in your head you can solve.

[-]Abhimanyu Pallavi Sudhir4mo*10

An unfortunate thing about headings is that they are spoilers. I like the idea of a writing style where headings come at the end of sections rather than at the start. Or even a "starting heading" which is a motivating question and an "ending heading" which is the key insight discovered ...

Analogous to a "reverse mathematics" style of writing where motivation precedes proofs/theory, which in turn precede theorems.

edited to clarify: I'm talking about technical writing; I don't care about fiction.

[-]Viliam4mo40

By the way, there was a time where spoilers were considered normal and maybe even desirable. I remember reading an old book, I think it was by Jules Verne, where chapters started like this: "Chapter 5, in which the protagonist succeeds to overcome the obstacles, collect enough resources, and launch a rocket to the Moon".

I have no idea whether all fiction books used to be like this, or only some of them, and whether it was universal, or only a specific time and place.

[-]Abhimanyu Pallavi Sudhir4mo1-6

matter of taste for fiction; but objectively bad for technical writing

[-]Viliam4mo20

Yeah, we don't want spoilers for how to quit Vim. :D

[-]CstineSublime4mo10

At first I was going to say this sounds like the exact opposite of what I would want, but now I'm wondering: can you give some specific examples of where heading "spoilers" are unwanted, and what context are we talking about?

For example, I hate blog posts that don't tell you what the blog post is "about" and therefore why I should read it. I often feel shortchanged when I find out halfway through what it is really "about" and realize I have no interest in or use for reading it. But if you're speaking about chapter headings in a non-fiction book, then I would assume that as a reader you're already invested for the "long haul" and that to one extent or another it will satisfy your expectations. Then again, for reference books or textbooks in particular, spoiler headings are necessary because the purpose is to have quick access to specific information without reading "cover-to-cover".

Are there any specific headings that you recently came across that caused you to notice this problem?

[-]Abhimanyu Pallavi Sudhir4mo10

I'm talking about technical writing/explanations of things.

[-]CstineSublime4mo10

Can you give a specific example, maybe even the specific one that moved you to write about it?

[-]Abhimanyu Pallavi Sudhir4mo20

So I'm learning & writing on thermodynamics right now, and often there is a distinction between the "motivating questions"/"sources of confusion" and the actually important lessons you get from exploring them.

E.g. a motivating question is "... and yet it scalds (even if you know the state of every particle in a cup of water)" and the takeaway from it is "your finger also has beliefs" or "thermodynamics is about reference/semantics".

The latter might be a more typical section heading as it is correct for systematizing the topic, but it is a spoiler. Whereas the former is better for putting the reader in the right frame/getting them to think about the right questions to initiate their thinking.

[-]Karl Krueger4mo10

In some of Daniel Dennett's books, each chapter is introduced with a brief abstract of what's about to be discussed, and concluded with an abstract of what has just been discussed. These are not the same.

[-]Abhimanyu Pallavi Sudhir4mo10

homomorphisms and entropy

One informal way to think of homomorphisms in math is that they are maps that do not "create information out of thin air". Isomorphisms further do not destroy information. The terminal object (e.g. the trivial group, the singleton topological space, or the trivial vector space) is the "highest-entropy state", where all distinctions disappear and reaching it is heat death.

Take, for instance, the group homomorphism $ϕ$. Before $ϕ$ was applied, "1" and "5" were distinguished: 2 + 3 = 5 was correct, but 2 + 3 = 1 was wrong. Upon applying this homomorphism, this information disappears --- however, no new information has been created, that is: no true indistinctions (equalities) have become false.
Similarly in topology, "indistinction" is "arbitrary closeness". Wiggle-room (aka "open sets") is information, it cannot be created from nothing. If a set or sequence goes arbitrarily close to a point, it will always be arbitrarily close to that point after any continuous transformations.
There is no information-theoretical formalization of "indistinction" on these structures, because this notion is more general than information theory. In the category of measurable spaces, two points in the sample space are indistinct if they are not distinguished by any measurable set --- and measurable functions are not allowed to create measurable sets out of nothing.
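As an illustration of the group example, here is a small sketch. It assumes, purely for concreteness, that the map is reduction mod 4 from the integers to Z/4Z; that is one map which identifies 1 and 5, and it may not be the one originally intended. The name `phi` is likewise hypothetical.

```python
def phi(n, modulus=4):
    """Reduction mod 4: an assumed example of a group homomorphism Z -> Z/4Z."""
    return n % modulus

# Information is destroyed: 1 and 5 were distinct, but their images coincide.
print(1 != 5, phi(1) == phi(5))

# No information is created: every true equality a + b = c stays true after phi.
preserved = all(
    phi(a + b) == (phi(a) + phi(b)) % 4
    for a in range(-10, 10)
    for b in range(-10, 10)
)
print(preserved)
```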

(there is also an alternate, maybe dual/opposite analogy I can make based on presentations --- here, the highest-entropy state is the "free object" e.g. a discrete topological space or free group, and each constraint (e.g. $a^{5} = 1$) is information --- morphisms are "observations". In this picture we see knowledge as encoded by identities rather than distinctions --- we may express our knowledge as a presentation like: $⟨ X_{1}, \dots X_{n} ∣ X_{3} = 4, X_{2} - X_{1} = 2 ⟩$, and morphisms cannot be concretely understood as functions on sets but rather show a tree of possible outcomes, like maybe you believe in Everett branches or whatever.)

In general if you postulate:

... you live on some object in a category
... time-evolution is governed by some automorphism $H$
... you, the observer, have beliefs about your universe and keep forgetting some information ("coarse-grains the phase space") --- i.e. your subjective phase space is also an object in that category, which undergoes homomorphisms

Then the second law is just a tautology. The second law we all know and love comes from taking the universe to be a symplectic manifold, and time-evolution as governed by symplectomorphisms. And the point of Liouville's theorem is really to clarify/physically motivate what the Jaynesian "uniform prior" should be. Here is some more stuff, from Yuxi Liu's statistical mechanics article:

In almost all cases, we use the uniform prior over phase space. This is how Gibbs did it, and he didn't really justify it other than saying that it just works, and suggesting it has something to do with Liouville's theorem. Now with a century of hindsight, we know that it works because of quantum mechanics: We should use the uniform prior over phase space, because phase space volume has a natural unit of measurement: $h^{N}$ , where $h$ is Planck's constant, and $2 N$ is the dimension of phase space. As Planck's constant is a universal constant, independent of where we are in phase space, we should weight all of the phase space equally, resulting in a uniform prior.

[-]Abhimanyu Pallavi Sudhir10mo10

Articles (or writing in general) are probably best structured as a Directed Acyclic Graph, rather than linearly. At each point in the article, there may be multiple possible lines to pursue, or "sidenotes".

I say "directed acyclic graph" rather than "tree", because it may be natural to think of paths as joining back at some point, especially if certain threads are optional.
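A minimal sketch of what this could look like, using Python's standard `graphlib`. The section names and prerequisite edges are made up for illustration; a topological order is just one way a reader might linearize the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical article: each section maps to the set of sections it builds on.
# "main argument" and "sidenote" fork from "intro" and rejoin at "conclusion",
# which is why this is a DAG rather than a tree.
article = {
    "intro": set(),
    "main argument": {"intro"},
    "sidenote": {"intro"},
    "conclusion": {"main argument", "sidenote"},
}

# One valid linear reading order that respects every prerequisite.
order = list(TopologicalSorter(article).static_order())
print(order)
```

A reader who skips the optional thread would simply drop "sidenote" from the order; the join at "conclusion" still makes sense.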

One may also construct an "And-Or tree" to allow multiple versions of the article preferred by conflicting writers, which may then be voted on with some mechanism. These votes can be used to assign values to each vertex, and people can read the tree with their own search algorithm*.

A whole wiki may be constructed as one giant DAG, with each article being a sub-component.

*well, realistically nobody would actually just be following a search algorithm blindly/reading a linear article linearly (since straitjacketing yourself with prerequisites is never a good idea), but you know, as a general guide to structure.

(idea came from LLM conversations, which often take this form -- of pursuing various lines of questioning then backtracking to a previous message)

[-]CstineSublime10mo*30

Why best structured? What quality or cause of reader-comprehension do you think non-linearity in this particular forking format maximizes?

Also, aren't most articles written with a singular or central proposition in mind (Gian-Carlo Rota said that every lecture should say one thing, Quintilian advised all speeches to have one 'basis'), such that all paragraphs essentially converge on it as a conclusion?

The simplest way to explain "the reward function isn't the utility function" is: humans evolved to have utility functions because it was instrumentally useful for the reward function / evolution selected agents with utility functions.

(yeah I know maybe we don't even have utility functions; that's not the point)

Concretely: it was useful for humans to have feelings and desires, because that way evolution doesn't have to spoonfeed us every last detail of how we should act, instead it gives us heuristics like "food smells good, I want".

Evolution couldn't just select a perfect optimizer of the reward function, because there is no such thing as a perfect optimizer (computational costs mean that a "perfect optimizer" is actually uncomputable). So instead it selected agents that were boundedly optimal given their training environment.

But like watermelon, it is harder to get value out of as you get a lot of it without chunking

Abstraction is like economies of scale

[-]kave1y1-1

One thing I'm surprised by is how everyone learns the canonical way to handwrite certain math characters, despite learning most things from printed or electronic material. E.g. writing ℝ as IR rather than how it's rendered.

I know I learned the canonical way because of Khan Academy, but I don't think "guy handwriting on a blackboard like thing" is THAT disproportionately common among educational resources?

[-]bideup1y10

I learned maths mostly by teachers at school writing on a whiteboard, university lecturers writing on a blackboard or projector, and to a lesser extent friends writing on pieces of paper.

There was a tiny supplement of textbook-reading at school and large supplement of printed-notes-reading at university.

I would guess only a tiny fraction learn exclusively via typed materials. If you have any kind of teacher, how could you? Nobody shows you how to rearrange an equation by live-typing LaTeX.

I used to have an idea for a karma/reputation system: repeatedly recalculate karma weighted by the karma of the upvoters and downvoters on a comment (then normalize to avoid hyperinflation) until a fixed point is reached.
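A rough sketch of the iteration, under several assumptions of mine: `votes[i][j]` is the net vote count user `i` has given to user `j`'s comments, negative karma is clipped to zero, and normalization means rescaling so karma sums to 1. Convergence is not guaranteed for every vote pattern (for instance, it can oscillate on periodic graphs); this is the plain power-iteration version of the idea.

```python
def karma_fixed_point(votes, iterations=1000, tol=1e-12):
    """Iterate karma[j] = sum_i karma[i] * votes[i][j], then normalize, until stable."""
    n = len(votes)
    karma = [1.0 / n] * n  # start everyone with equal karma
    for _ in range(iterations):
        # Karma-weighted vote totals; clip at zero (an assumption, to keep weights non-negative).
        raw = [max(0.0, sum(karma[i] * votes[i][j] for i in range(n)))
               for j in range(n)]
        total = sum(raw)
        if total == 0.0:
            break  # nothing to normalize; keep the previous estimate
        new = [r / total for r in raw]  # normalize to avoid hyperinflation
        converged = max(abs(a - b) for a, b in zip(new, karma)) < tol
        karma = new
        if converged:
            break
    return karma

# Hypothetical vote matrix: user 0 upvotes users 1 and 2, user 1 upvotes user 0,
# user 2 upvotes users 0 and 1.
example_votes = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
]
print(karma_fixed_point(example_votes))
```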

I feel like this is vaguely somehow related to:

AlphaGoZero
Humans Consulting HCH
Wealth in markets

[-]Dagon1y30

So, https://en.wikipedia.org/wiki/PageRank ?