
Richard_Ngo
I haven't yet read through them thoroughly, but these four papers by Oliver Richardson are pattern-matching to me as potentially very exciting theoretical work. tl;dr: probabilistic dependency graphs (PDGs) are directed graphical models designed to be able to capture inconsistent beliefs (paper 1). The definition of inconsistency is a natural one which allows us to, for example, reframe the concept of "minimizing training loss" as "minimizing inconsistency" (paper 2). They provide an algorithm for inference in PDGs (paper 3) and an algorithm for learning via locally minimizing inconsistency which unifies several other algorithms (like the EM algorithm, message-passing, and generative adversarial training) (paper 4). Oliver is an old friend of mine (which is how I found out about these papers) and a final-year PhD student at Cornell under Joe Halpern.
The Wikipedia articles on the VNM theorem, Dutch Book arguments, money pump, Decision Theory, Rational Choice Theory, etc. are all a horrific mess. They're also completely disjoint, without any kind of Wikiproject or wikiboxes for tying together all the articles on rational choice. It's worth noting that Wikipedia is the place where you—yes, you!—can actually have some kind of impact on public discourse, education, or policy. There is just no other place you can get so many views with so little barrier to entry. A typical Wikipedia article will get more hits in a day than all of your LessWrong blog posts have gotten across your entire life, unless you're @Eliezer Yudkowsky. I'm not sure if we actually "failed" to raise the sanity waterline, like people sometimes say, or if we just didn't even try. Given that even some very basic low-hanging-fruit interventions like "write a couple of good Wikipedia articles" still haven't been done 15 years later, I'm leaning towards the latter. edit me senpai
In this interview, Eliezer says the following:

> I think if you push anything [referring to AI systems] far enough, especially on anything remotely like the current paradigms, like if you make it capable enough, the way it gets that capable is by starting to be general.
>
> And at the same sort of point where it starts to be general, it will start to have its own internal preferences, because that is how you get to be general. You don't become creative and able to solve lots and lots of problems without something inside you that organizes your problem solving, and that thing is like a preference and a goal. It's not built in explicitly, it's just something that's sought out by the process that we use to grow these things to be more and more capable.

It caught my attention, because it's a concise encapsulation of something that I already knew Eliezer thought, and which seems to me to be a crux between "man, we're probably all going to die" and "we're really really fucked", but which I don't myself understand. So I'm taking a few minutes to think through it afresh now.

I agree that systems get to be very powerful by dint of their generality.

(There are some nuances around that: part of what makes GPT-4 and Claude so useful is just that they've memorized so much of the internet. That massive knowledge base helps make up for their relatively shallow levels of intelligence, compared to smart humans. But the dangerous/scary thing is definitely AI systems that are general enough to do full science and engineering processes.)

I don't (yet?) see why generality implies having a stable motivating preference.

If an AI system is doing problem solving, that does entail that it has a goal, at least in some local sense: it has the goal of solving the problem in question. But that level of goal is more analogous to the prompt given to an LLM than it is to a robust utility function.

I do have the intuition that creating an SEAI by training an RL agent on millions of simulated engineering problems is scary, because of reward specification problems in your simulated engineering problems. It will learn to hack your metrics. But an LLM trained on next-token prediction doesn't have that problem?

Could you use next-token prediction to build a detailed world model, one that contains deep abstractions that describe reality (beyond the current human abstractions), and then prompt it to elicit those models? Something like: you have the AI do next-token prediction on all the physics papers, and all the physics time-series, and all the text on the internet, and then you prompt it to write the groundbreaking new physics result that unifies QM and GR, citing previously overlooked evidence.

I think Eliezer says "no, you can't, because discovering deep theories like that requires thinking, and not just 'passive' learning in the ML sense of updating gradients until you learn abstractions that predict the data well. You need to generate hypotheses and test them." In my state of knowledge, I don't know if that's true.

Is that a crux for him? How much easier is the alignment problem if it's possible to learn superhuman abstractions "passively" like that? I mean, there's still a problem that someone will build a more dangerous agent from components like that. And there's still a problem that you can get world-altering technologies / world-destroying technologies from that kind of oracle. We're not out of the woods. But it would mean that building a superhuman SEAI isn't an immediate death sentence for humanity.
I think I still don't get it.  
epistemic status: speculative, probably simplistic and ill defined

Someone asked me "What will I do once we have AGI?"

I generally define the AGI era as starting at the point where all economically valuable tasks can be performed by AIs at a lower cost than a human (at subsistence level, including buying any available augmentations for the human). This notably excludes:

1) any tasks that humans can do that still provide value at the margin (i.e. the caloric cost of feeding that human while they're working vs. while they're not working, rather than while they're not existing)

2) things that are not "tasks", such as:

a) caring about the internal experience of the service provider (ex.: wanting a DJ that feels human emotions, regardless of its actions). Although maybe you could include that in the AGI definition too. But what if you value having a DJ be exactly a human? Then the best an AGI could do is 3D print a human or something like that. Or maybe you're even more specific, and you want a "pre-singularitarian natural human", in which case AGI seems impossible by (very contrived) definition.

b) the value of the memories encoded in human brains

c) the value of doing scientific experiments on humans

For my answer to the question, I wanted to say something like: think about what I should do with my time for a long time, and keep my options open (ex.: avoid altering my mind in ways whose consequences I don't understand well). But then, that seems like something that might be economically useful to sell, so using the above definition, it seems like I should have AI systems that are able to do that better/cheaper than me (unless I intrinsically didn't want that, or something like that). So maybe I have AI systems computing that for me and keeping me posted with advice while I do whatever I want.

But maybe I can still do work that is useful at the margin, as per (1), and so would probably do that. But what if even that wasn't worth the marginal caloric cost, and it was better to feed those calories into AI systems? (2) is a bit complex, but probably(?) wouldn't impact the answer to the initial question much.

So, what would I do? I don't know. The main thing that comes to mind is to observe how the world unfolds (and listen to what the AGIs are telling me).

But maybe "AGI" shouldn't be defined as "aligned AGI". Maybe a better definition of AGI is something like "outperforming humans at all games/tasks that are well defined" (i.e. where humans don't have a comparative advantage just by knowing what humans value). In which case, my answer would be "alignment research" (assuming it's not "die").
eggsyntax
Terminology proposal: scaffolding vs tooling. I haven't seen these terms consistently defined with respect to LLMs. I've been using, and propose standardizing on:

* Tooling: affordances for LLMs to make calls, eg ChatGPT plugins.
* Scaffolding: an outer process that calls LLMs, where the bulk of the intelligence comes from the called LLMs, eg AutoGPT.

Some smaller details:

* If the scaffolding itself becomes as sophisticated as the LLMs it calls, we should start focusing on the system as a whole rather than just describing it as a scaffolded LLM.
* This terminology is relative to a particular LLM. In a complex system (eg a treelike system with code calling LLMs calling code calling LLMs calling...), some particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below.
* It's reasonable to think of a system as scaffolded if the outermost layer is a scaffolding layer.
* There are other possible categories that don't fit this as neatly, eg LLMs calling each other as peers without a main outer process, but I expect this definition to cover most real-world cases.

Thanks to @Andy Arditi for helping me nail down the distinction.
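A minimal Python sketch of the distinction, with all names hypothetical rather than a real API: `web_search` is tooling relative to the LLM call, while the outer `run_agent` loop is scaffolding.

```python
# Hypothetical names throughout; this is a sketch of the terminology, not a real API.

def web_search(query: str) -> str:
    """Tooling: an affordance the LLM can call out to."""
    return f"<search results for {query!r}>"

def call_llm(prompt: str) -> str:
    """Stand-in for a call to some LLM; the intelligence lives here."""
    # In practice this would hit a model endpoint.
    return "FINAL: placeholder answer"

def run_agent(task: str, max_steps: int = 10) -> str:
    """Scaffolding: an outer loop that repeatedly calls the LLM.

    The loop itself is simple; the bulk of the intelligence comes from the
    called LLM, which is what makes this scaffolding rather than a system
    worth analyzing in its own right.
    """
    context = task
    for _ in range(max_steps):
        reply = call_llm(context)
        if reply.startswith("SEARCH:"):
            # The LLM invoked a tool; feed the result back into the context.
            context += "\n" + web_search(reply.removeprefix("SEARCH:").strip())
        else:
            return reply
    return context
```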

Recent Discussion

I'm mostly going to use this to crosspost links to my blog for less polished thoughts, Musings and Rough Drafts.

Eli Tyre
In this interview, Eliezer says the following: ...
Adele Lopez
In my view, this is where the Omohundro Drives come into play. Having any preference at all is almost always served by an instrumental preference for survival as an agent with that preference. Once a competent agent is general enough to notice that (and granting that it has a level of generality sufficient to require a preference), then the first time it has a preference, it will want to take actions to preserve that preference.

This seems possible to me. Humans have plenty of text in which we generate new abstractions/hypotheses, and so effective next-token prediction would necessitate forming a model of that process. Once the AI has a human-level ability to create new abstractions, it could then simulate experiments (via e.g. its ability to predict python code outputs) and cross-examine the results with its own knowledge to adjust them and pick out the best ones.
bideup
Sorry, what's the difference between these two positions? Is the second one meant to be a more extreme version of the first?

Yes.

The prevailing notion in AI safety circles is that a pivotal act—an action that decisively alters the trajectory of artificial intelligence development—requires superhuman AGI, which itself poses extreme risks. I challenge this assumption.

Consider a pivotal act like "disable all GPUs globally." This could potentially be achieved through less advanced means, such as a sophisticated computer virus akin to Stuxnet. Such a virus could be designed to replicate widely and render GPUs inoperable, without possessing the capabilities to create more dangerous weapons like bioweapons.

I've observed a lack of discussion around these "easier" pivotal acts in the AI safety community. Given the possibility that AI alignment might prove intractable, shouldn't we be exploring alternative strategies to prevent the emergence of superhuman AI?

I propose that this avenue deserves significantly more attention. If AI alignment is indeed unsolvable, a pivotal act to halt or significantly delay superhuman AI development could be our most crucial safeguard.

I'm curious to hear the community's thoughts on this perspective. Are there compelling reasons why such approaches are not more prominently discussed in AI safety circles?

interstice
Weaker AI probably wouldn't be sufficient to carry out an actually pivotal act. For example, the GPU virus would probably be worked around soon after deployment, via airgapping GPUs, developing software countermeasures, or just resetting infected GPUs.
Michael Soareverix
Is it possible to develop specialized (narrow) AI that surpasses every human at infecting/destroying GPU systems, but won't wipe us out? LLM-powered Stuxnet would be an example. Bacteria aren't smarter than humans, but they are still very dangerous. It seems like a digital counterpart could disable GPUs and so prevent AGI. (Obviously, I'm not advocating for this in particular, since it would mean the end of the internet and I like the internet. It seems likely, however, that there are pivotal acts possible by narrow AI that prevent AGI without actually being AGI.)

No, I don't think so, because people could just airgap the GPUs.

Work done as part of the Visiting Fellow program at Constellation. Thanks to Aaron Scher for conversations and feedback throughout the project.

Motivation

There are many situations where a language model could identify relevant situational information from its prompt and use this in a way humans don't want: deducing facts about the user and using this to appeal to them, inferring that it is undergoing evaluations and acting differently from usual, or determining that humans aren't tracking its actions and executing a strategy to seize power.

One counter-measure is to train the model to "ignore" such situational information: train the model to behave similarly regardless of the presence or content of the information. Supposing that such training causes the model to behave similarly in these cases, the...
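As a rough sketch of what such training could look like, here is a consistency-style loss between prompts with and without the situational information, assuming a PyTorch-style model that maps token ids to next-token logits; the function and variable names are illustrative assumptions, not the setup used in this post.

```python
import torch
import torch.nn.functional as F

def ignore_info_loss(model, tokens_with_info, tokens_without_info):
    """Consistency penalty: make next-token predictions match whether or not
    the situational information is present in the prompt.

    Assumes `model` maps a (batch, seq) tensor of token ids to
    (batch, seq, vocab) logits; all names here are illustrative.
    """
    logits_with = model(tokens_with_info)[:, -1, :]        # next-token logits, info present
    logits_without = model(tokens_without_info)[:, -1, :]  # next-token logits, info removed
    # KL divergence between the two next-token distributions; driving this
    # toward zero trains the model to behave the same in both cases.
    # (One might also stop gradients through the reference branch.)
    return F.kl_div(
        F.log_softmax(logits_with, dim=-1),
        F.softmax(logits_without, dim=-1),
        reduction="batchmean",
    )
```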

A recent area of focus has been securing AI model weights. If the weights are located in a data center and an adversary wants to obtain model weights, the weights have to leave physically (such as a hard drive going out the front door) or through the data center's internet connection. If the facility has good physical security, then the weights have to leave through the internet connection. Recently, there has been discussion on how to make model weight exfiltration more difficult, such as Ryan Greenblatt's proposal for upload limits.

A key factor enabling this is that the critical data we want to protect (model weights) are very large files. Current models can have trillions of parameters, which translates to terabytes of data. Ryan calculated that the total...

KhromeM
I do not understand how you can extract weights just by conversing with an LLM, any more than you can get information about how my neurons are structured by conversing with me. Extracting training data it has seen is one thing, but presumably it has never seen its own weights. If the system prompt did not tell it it was an LLM, it should not even be able to figure that out.
karvonenadam
The purpose of this proposal is to prevent anyone from transferring model weights out of a data center. If someone wants to steal the weights and give them to China or another adversary, the model weights have to leave physically (a hard drive out of the front door) or through the internet connection. If the facility has good physical security, then the weights have to leave through the internet connection. If we also take steps to secure the internet connection, such as treating all outgoing data as language tokens and using a perplexity filter, then the model weights can be reasonably secure. We don't even have to filter all outgoing data. If there were 1 gigabyte of unfiltered bandwidth per day, it would take 2,000 days to transfer GPT-4's 2 terabytes of weights out (although this could be reduced by compression schemes).
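A quick back-of-the-envelope version of that arithmetic, using the comment's assumed figures (2 TB of weights, 1 GB/day of unfiltered bandwidth); the numbers are the comment's assumptions, not measurements.

```python
# Rough exfiltration-time estimate under the comment's assumptions.
weights_bytes = 2e12             # assumed weight size (~2 TB)
unfiltered_bytes_per_day = 1e9   # assumed unfiltered bandwidth (1 GB/day)
compression_ratio = 1.0          # >1 if an attacker can compress the weights

days = weights_bytes / (unfiltered_bytes_per_day * compression_ratio)
print(f"{days:.0f} days")        # ~2000 days at these numbers
```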

Thanks for this comment, by the way! I added a paragraph to the beginning to make this post more clear.

A few months ago, Rob Bensinger made a rather long post (that even got curated) in which he expressed his views on several questions related to personal identity and anticipated experiences in the context of potential uploading and emulation. A critical implicit assumption behind the exposition and reasoning he offered was the adoption of what I have described as the "standard LW-computationalist frame." In response to me highlighting this, Ruben Bloom said the following:

I differ from Rob in that I do think his piece should have flagged the assumption of ~computationalism, but think the assumption is reasonable enough to not have argued for in this piece.

I do think it is interesting philosophical discussion to hash it out, for the sake of rigor and really pushing for clarity.

...

For a change of pace, I think it's useful to talk about behaviorism.

In this context, we're interpreting positions like "behaviorism" or "computationalism" as strategies for responding to the question "what are the differences that make a difference to my self?"

The behaviorist answers that the differences that make a difference are those that impact my behavior. But secretly, behaviorism is a broad class of strategies for answering, because what's "my behavior," anyhow? If you have a choice to put either a red dot or a blue dot on the back of my head, does ... (read more)



Note: An initial proposal and some good discussion already existed on LW here. I'm raising this as a post instead of a comment due to length, the need for a fresh look, and a specific call to action.

 

Summary

I think a petition-style boycott commitment could reach enough critical mass to significantly shift OpenAI's corporate policy.

I specifically think a modular petition allowing different users to choose which goalposts the target must cross to end their boycott would be a good method of coalition building among those concerned about AI Safety from different angles.

 

Postulates

  • OpenAI needs some reform to be a trustworthy leader in the age of AI
    • Zvi’s Fallout and Exodus roundups are good summaries, but the main points are:
    • The NDA Scandal: forcing employees to sign atypically aggressive non-disparagement and recursive non-disparagement agreements
    • Firing
...

I know little about Rob Henderson except that he wrote a well-received memoir and that he really really really wants you to remember that he invented the concept of “luxury beliefs”. In his own words, these are:

ideas and opinions that confer status on the upper class at very little cost, while often inflicting costs on the lower classes

The concept has metastasized and earned widespread adoption — particularly among social conservatives and right-wing populists. It might sound sophisticated, but it’s fundamentally flawed. Its vague and inconsistent definitions necessitate a selective application, and it’s ultimately used to launder mundane political preferences into something seemingly profound and highbrow.[1]

It’ll be most useful to break down Henderson’s concept into parts and go through it step-by-step.

1. Fashionable beliefs are always in style

First, there’s...

ymeskhout
He claimed that monogamy was rejected by the upper class enough to cause divorce and single parenthood to spike; he literally says "The upper class got high on their own supply." I consider that "widely adopted", and if you disagree with my description, it helps to specify exactly why. Regarding his classmates, his favorite anecdote has been one person who says polyamory is good but doesn't practice it, so I don't know where he establishes that doing polyamory is widely adopted by his classmates. I can't make up and apply new criteria like "bizarrely unconventional", nor can I just accept Henderson's framework when I'm critiquing it. Again, I can't just make up new criteria. My whole point has been that 'luxury beliefs' is selectively applied, and making up new requirements so that only a specific set of beliefs fit the bill is exactly what I'm critiquing.
Jiro
There are various types of opposition to monogamy. Outright support of polygamy is not the only one. Yes you can. Of course, it's not "making it up", it's "figuring it out". If there are obvious explanations why he might want to use that example other than "he's biased against leftists", you shouldn't jump to "he's biased against leftists". And "polygamy is a lot weirder" is too obvious an explanation for you to just ignore it. If you're criticizing his version and not your version, you pretty much are required to accept his framework.
ymeskhout
My entire criticism of his luxury beliefs framework is that it is arbitrary and applied in a selective ad-hoc manner, largely for the purpose of flattering one's pre-existing political sensibilities. The very fact that you're adding all these previously unmentioned rule amendments reinforces my thesis exactly. If you think my criticism is off-base, it would be helpful if you pointed out exactly where it is contradicted. Something like "if your critique is correct then we should expect X, but instead we see Y" would be neat.
Jiro

I don't have to make up things after the fact to say "he probably chose the polygamy example because polygamy is weird". It's obvious.