Towards_Keeperhood

I'm trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Comments

Reasons to sell frontier lab equity to donate now rather than later
Towards_Keeperhood · 5d

Yeah, I am also a bit disappointed with that list.

I would recommend ControlAI.

Towards_Keeperhood's Shortform
Towards_Keeperhood · 6d

I definitely have to update here - that's just the laws of probability. Maybe you don't have to update much if you already expected superhuman competitive programming around now.

But also, this isn't the only update that informs my new timelines. I was saying more like "look, I wrote down advance predictions and they were actually useful to me", rather than intending to give an epistemically legible account of my timeline models.

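To make the "that's just the laws of probability" point concrete, here is a minimal Python sketch; the two hypotheses and all the numbers are made up for illustration and are not the predictions discussed below:

```python
# Hypothetical numbers, purely for illustration: two coarse hypotheses and the
# probability each assigned, in advance, to "superhuman competitive programming
# happens around now". Observing the event shifts weight toward whichever
# hypothesis predicted it more strongly; if your hypothesis already assigned it
# a high probability, the shift is small.

prior = {"short timelines": 0.3, "long timelines": 0.7}
p_obs = {"short timelines": 0.8, "long timelines": 0.2}  # P(observation | hypothesis)

evidence = sum(prior[h] * p_obs[h] for h in prior)              # P(observation)
posterior = {h: prior[h] * p_obs[h] / evidence for h in prior}  # Bayes' rule

print(posterior)  # {'short timelines': ~0.63, 'long timelines': ~0.37}
```
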
Towards_Keeperhood's Shortform
Towards_Keeperhood · 7d

I updated my timelines

Seven months ago, I wrote down these AI predictions:

How long until the sun (starts to) gets eaten? 10th/50th/90th percentile: 3y, 12y, 37y.

How long until an AI reaches Elo 4000 on codeforces? 10/50/90: 9mo, 2.5y, 11.5y

About one month ago, i.e. 6 months after I wrote this, OpenAI's model won the ICPC World Finals, which I guess is sorta equivalent to Elo 4000 on codeforces, given that it won by a significant margin.

(This updates me toward thinking that both (1) AI capabilities are increasing faster than I expected, and (2) competitive programming requires less general intelligence than I expected.)

Absent any coordinated slowdown, my new 10/50/90 guess for Dyson-sphere-level capability is: 1y, 3.3y, 18y.

(I still find it hard to predict whether progress will continue to be continuous or whether there will be at least one capability leap.)

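As a rough illustration of how surprising the ~6-month result is under that forecast: a minimal sketch, under my own added assumption that the 10/50/90 forecast can be summarized by a lognormal distribution fitted to its 10th and 90th percentiles:

```python
import math

# Rough sketch (my own assumption): summarize the 9mo / 2.5y / 11.5y forecast
# for "AI reaches Elo 4000 on codeforces" with a lognormal distribution fitted
# to the 10th and 90th percentiles, then ask how surprising a ~6-month result is.

z90 = 1.2816            # standard-normal 90th-percentile z-score
q10, q90 = 0.75, 11.5   # 10th/90th percentile forecasts, in years

mu = (math.log(q10) + math.log(q90)) / 2             # mean of log(time)
sigma = (math.log(q90) - math.log(q10)) / (2 * z90)  # std of log(time)

def prob_by(t_years: float) -> float:
    """P(milestone reached within t_years) under the fitted lognormal."""
    z = (math.log(t_years) - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"implied median: {math.exp(mu):.1f}y (stated median: 2.5y)")
print(f"P(reached within 6 months): {prob_by(0.5):.0%}")  # roughly 5%
```
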
"Intelligence" -> "Relentless, Creative Resourcefulness"
Towards_Keeperhood · 13d

I don't like "superagency", but yeah, it seems important to have a better word for this. Maybe just RCR as an abbreviation. Or hard-going or hard-optimizing.

I sometimes used "Harry-Factor" when talking to people who have read HPMoR to describe what kind of intelligence I mean, and gave examples like what he came up with in the last army battle, but obviously we want a different word.

Foom & Doom 2: Technical alignment is hard
Towards_Keeperhood · 1mo

The purpose of studying LDT would be to realize that the type signature you currently imagine Steve::consequentialist preferences to have is different from the type signature that Eliezer would imagine.

The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.

You can totally have preferences about the past that are still influenced by your decision (e.g. Parfit's hitchhiker).

Decisions don't cause future states, they influence which worlds end up real vs counterfactual. Preferences aren't over future states but over worlds - which worlds would you like to be more real?

AFAIK Eliezer only used the word "consequentialism" in abstract descriptions of the general fact that you (usually) need some kind of search in order to find solutions to new problems. (Like, I think, just using a new word for what he used to call optimization.) Maybe he also used the outcome pump as an example, but if you asked him how consequentialist preferences look in detail, I'd strongly bet he'd say something like preferences over worlds rather than preferences over states in the far future.

MIRI 2024 Communications Strategy
Towards_Keeperhood · 1mo

However, we would like to diversify the public face of MIRI and potentially invest heavily in a spokesperson who is not Eliezer, if we can identify the right candidate.

Is this still up to date?

Foom & Doom 2: Technical alignment is hard
Towards_Keeperhood · 1mo

It seems a bit surprising to me that you say we agree on the object level, when in my view you're totally guilty of my point 2.b.i above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.

I also think the utility-maximizer frame is useful, though there are two (IMO justified) assumptions that I see as going along with it:

  1. There's something like a simplicity prior over the space of utility functions (because there needs to be some utility-maximizing structure implemented in the AI).
  2. The utility function is a function of the trajectory of the environment. (Or, in an even better formalization, it may take as input a program which is the environment; see the toy sketch after this list.)
    1. I think using a learned value function (LVF) that computes the valence of thoughts is a worse frame for tackling corrigibility, because it's harder to clearly evaluate what actions the agent will end up taking, and because this kind of "imagine some plan and what the outcome would be and let the LVF evaluate that" doesn't seem to me to be how smarter-than-human minds operate - considering what change in the world an action would cause seems more natural than whether some imagined scene seems appealing. Even humans like me move away from the LVF frame, e.g. I'm trying to correct for the scope insensitivity of my LVF by doing something more like explicit expected utility calculations.[1]

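To make the type-signature contrast in assumption 2 concrete, here is a toy sketch of my own; the State/Trajectory names and the promise-breaking example are made up for illustration, not taken from anyone's writing:

```python
from typing import Callable, List

State = dict              # toy stand-in for a world state
Trajectory = List[State]  # the whole history of the environment

# "Farfuturepumping"-style preferences: utility looks only at where things end up.
UtilityOverEndStates = Callable[[State], float]

# Utility over whole trajectories (or, more ambitiously, over the environment
# program itself): it can care about the past and the path taken, not just the end.
UtilityOverTrajectories = Callable[[Trajectory], float]

def example_trajectory_utility(traj: Trajectory) -> float:
    # Toy example: values resources in the final state, but also penalizes any
    # step along the way where a promise was broken.
    end_value = float(traj[-1].get("resources", 0))
    broken_promises = sum(1 for s in traj if s.get("broke_promise", False))
    return end_value - 10.0 * broken_promises
```

Two trajectories can end in the exact same state while differing in whether a promise was broken along the way; only the trajectory-typed function can prefer one over the other.
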
“You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).

I'm more like: "Your abstract gesturing didn't let me see any concrete proposal that would make me more hopeful, and even if good proposals are in that direction, it seems to me like most of the work would still be ahead, instead of it being like 'we can just do it sorta like that' as you seem to present it. But maybe I'm wrong; maybe you have more intuitions and will find a good concrete proposal."

I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading.

Maybe study logical decision theory? Not sure where best to start, but maybe here:

"Logical decision theories" are algorithms for making choices which embody some variant of "Decide as though you determine the logical output of your decision algorithm."

Like consequentialism in the sense of "what's the consequence of choosing the logical output of your decision algorithm in a particular way", where the consequence here isn't a time-based event but rather what the universe looks like conditional on the output of your decision algorithm.

  1. ^

    I'm not confident those are the only reasons why LVF seems worse here; I haven't fully articulated my intuitions yet.

Foom & Doom 2: Technical alignment is hard
Towards_Keeperhood · 1mo

I want to note that it seems to me that Jeremy is trying to argue you out of the same mistake I tried to argue you out of in this thread.

The problem is that you use "consequentialism" differently than Eliezer means it. I suppose he only used the word on a couple of occasions where he tried to get across the basic underlying model without going into excessive detail, and it may read to you like your "far future outcome pumping" matches the definitions there (though back when I looked over the support you cited for this being what Eliezer means, it didn't seem at all like the evidence points to this interpretation). But if you get a deep understanding of logical decision theory, or you study a lot of MIRI papers (where the utility of agents is iirc always over trajectories of the environment program[1]), you see what Eliezer's deeper position is.

Probably not worth the time to further discuss what certain other people do or don’t believe, as opposed to what’s true.

I think you're strawmanning Eliezer and propagating a wrong understanding of what "consequentialism" was supposed to refer to, and this seems like an important argument to have separately from what's true. But it's a good point that we should distinguish arguing about this from arguing about what's true.

Going forward, I suggest you use another word like "farfuturepumping" instead of "consequentialism". (I'll also use another word for Eliezer::consequentialism and clarify it, since it's apparently often misunderstood.)

As a quick summary, which may not be easily understandable due to inferential distance, I think that Eliezer and I both think that:

  1. Smart AIs will be utility optimizing, but this utility is over computations/universe-trajectories, not future states.
    1. This is a claim about what AI cognition will look like, not just that its behavior will be coherent according to some utility function. Smart AIs will think in some utility-maximizing ways, even though initially it may be quite a mess where it's really hard to read off what values are being optimized for, and the values may change a bit as the AI changes.
      1. Coherence arguments only imply that a coherent agent will behave as if it optimized a utility function; they say nothing about what cognitive algorithm the agent uses. There's an extra step needed to get to cognitive utility maximization, and AFAIK it hasn't been explained well anywhere, but maybe it's sorta intuitive?
  2. It's perfectly alright to have non-farfuturepumping preferences like you describe, but just saying it's possible isn't enough; you actually need to write down the utility function over universe-trajectories.
    1. This is because if you just say "well, it's possible, so there", you may fail to think concretely enough to see that a utility function with the properties you imagine would actually be quite complex, and thus unlikely to be learned.
    2. Why can't you have a utility function but also other preferences?
      1. There needs to be some tradeoff between the utility function and the other preferences, and however you choose it, the result can be formalized as a utility function (see the sketch after this list). If you don't do this, you can engage in abstract wishful thinking where you imagine a different tradeoff for different cases and thereby delude yourself into believing your proposal works robustly.
    3. Why can't you just specify that in some cases utility function u1 should be used, and in others u2 should be used?
      1. Because when u1 is used, there's an instrumental incentive to modify the code of the AI s.t. u1 is always used. You want reflective consistency to avoid such problems.

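As a minimal sketch of the tradeoff point above (hypothetical code with made-up Trajectory/Utility type aliases, not anyone's actual proposal): any fixed way of trading off or switching between u1 and u2 just defines another utility function over trajectories, which you can then inspect for the lock-in incentive mentioned above.

```python
from typing import Callable, List

State = dict
Trajectory = List[State]
Utility = Callable[[Trajectory], float]

def fixed_tradeoff(u1: Utility, u2: Utility, w: float) -> Utility:
    """A fixed weighting of two utility functions is itself just another
    utility function over trajectories."""
    return lambda traj: w * u1(traj) + (1 - w) * u2(traj)

def case_based(u1: Utility, u2: Utility,
               use_u1: Callable[[Trajectory], bool]) -> Utility:
    """'Use u1 in some cases and u2 in others' also collapses into a single
    utility function; the case rule is just part of its definition. Writing it
    down explicitly is what lets you check it for the instrumental incentive
    to self-modify so that u1 is always the one that gets used."""
    return lambda traj: u1(traj) if use_u1(traj) else u2(traj)
```
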
I would recommend chatting with Jeremy (and maybe rereading our comment thread).

  1. ^

    Yes, utility is often formalized over the space of outcomes, but the space of outcomes is iirc the space of trajectories.

Steve Byrnes’s Shortform
Towards_Keeperhood · 1mo

The authors propose to get an international treaty to pause progress towards superintelligence, including both scaling & R&D. I’m for it, although I don’t hold out much hope for such efforts to have more than marginal impact. I expect that AI capabilities would rebrand as AI safety, and plow ahead:

The problem is: public advocacy is way too centered on LLMs, from my perspective. Thus, those researchers I mentioned, who are messing around with new paradigms on arXiv, are in a great position to twist “Pause AI” type public advocacy into support for what they’re doing!

[...]

I think these people are generally sincere but mistaken, and I expect that, just as they have fooled themselves, they will also successfully fool their friends, their colleagues, and government regulators…

This seems way too pessimistic to me. (Or, sure, it's going to be hard and I'm not super optimistic, but given that you're also relatively pessimistic, the international AI R&D shutdown approach doesn't seem too unpromising to me.)

Sure, they are going to try to convince government regulators that their research is great for safety, but we're going to try to convince the public and the regulators otherwise.

I mean, it's sorta understandable to say that we currently seem to be in a relatively weak position and getting sufficient change seems hard, but movements can grow quickly. Yeah, it's understandable that this doesn't seem super convincing, but I think we have a handful of smart people who might be able to find ways to effectively shift the gameboard here. Idk.

More to the point though, conditional on managing to internationally ban AI R&D, it doesn't obviously seem that much more difficult or that much less likely that we also manage to ban AI safety efforts which can lead to AI capability increases, based on the understanding that those efforts are likely delusional and alignment is out of reach. (Tbc, I would try not to ban your research, but given that your agenda is the only one I am aware of into which I put significantly more than zero hope, it's not clear to me that it's worth overcomplicating the ban around that.)

Also, in this common-knowledge problem domain, self-fulfilling prophecies are sorta a thing, and I think it's a bit harmful to the cause if you post on Twitter and Bluesky that you don't have much hope in government action. Tbc, don't say the opposite either, keep your integrity, but maybe leave the criticism on LessWrong? Idk.

LessWrong FAQ
Towards_Keeperhood · 2mo

Can you make "sort by magic" the default sort for comments under a post? Here's why:

The problem: Commenting late on a post (after the main reading peak) is disincentivized, not only because fewer people will read the post and look over the comments, but also because most people only look over the top-scoring comments and won't scroll down far enough to read your new comment. This also causes early good comments to keep accumulating karma because more people read them, so the usual equilibrium is that early good comments stay on top and late good comments don't really get noticed.

Also, what one cares about for sorting is the quality of a comment, and the correct estimator for that would be "number of upvotes per view". I don't know how you calculate magic, but it seems very likely to be a better proxy for this than top scoring. (If magic doesn't seem adequate and you track page view counts, you could also get a more principled new magic sort, though you'd have to track for each comment what view count the page had at the time the comment was posted. Like, if the average ratio of upvotes per view is a/b, you could assign each comment a score of (upvotes+a)/(page_views_since_comment_was_posted+b), and sort descending by score.)

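A minimal sketch of what that smoothed sort could look like; all names and numbers are made up for illustration, and I don't know how the actual magic sort is computed:

```python
from dataclasses import dataclass

# Smoothed "upvotes per view" score described above, with made-up numbers;
# this is not LessWrong's actual sort.

@dataclass
class Comment:
    text: str
    upvotes: int
    views_since_posted: int  # page views since the comment was posted

# Site-wide prior of roughly a upvotes per b views, used as pseudo-counts so
# that brand-new comments with very few views aren't ranked wildly high or low.
a, b = 1, 50

def score(c: Comment) -> float:
    return (c.upvotes + a) / (c.views_since_posted + b)

comments = [
    Comment("early comment", upvotes=40, views_since_posted=2000),
    Comment("late comment", upvotes=5, views_since_posted=100),
]
for c in sorted(comments, key=score, reverse=True):
    print(f"{score(c):.3f}  {c.text}")  # the late comment ranks first here
```
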
Posts

Dark Lord's Answer: Review and Economics Excerpts (16 karma, 3mo, 6 comments)
Keltham on Becoming more Truth-Oriented (22 karma, 6mo, 2 comments)
What alignment-relevant abilities might Terence Tao lack? (12 karma, 6mo, 2 comments)
Thoughts on Creating a Good Language (1 karma, 6mo, 2 comments)
Introduction to Representing Sentences as Logical Statements (22 karma, 7mo, 9 comments)
I changed my mind about orca intelligence (50 karma, 7mo, 24 comments)
Help make the orca language experiment happen (9 karma, 7mo, 12 comments)
Optimizing Feedback to Learn Faster (12 karma, 8mo, 0 comments)
Considerations on orca intelligence (53 karma, 10mo, 14 comments)
Orca communication project - seeking feedback (and collaborators) (38 karma, 11mo, 16 comments)