Studying BTech+MTech at IIT Delhi. Check out my comment (and post) history for topics of interest. Please prefer recent comments over older ones, as my views have been changing frequently.

Discovered EA in September 2021; was involved in cryptocurrency before that. Graduating in 2023.


Last updated: June 2022



Scott Aaronson and Steven Pinker Debate AI Scaling

If you're studying state-of-the-art AI you don't need to know any of these topics.

Scott Aaronson and Steven Pinker Debate AI Scaling

They're core to the agent model of AI, with coherent preferences. Once you get coherent preferences you get utility maximization, which leads to instrumental convergence, incorrigibility, self-preservation and so on.

Questions like:

  • how likely is it that we build agent AI, either explicitly or indirectly?
  • how likely is it that our agent AI will have coherent preferences?

are still open, and different researchers will have different opinions.

Scott Aaronson and Steven Pinker Debate AI Scaling

Is there a longform discussion anywhere on the pros and cons of nuking Russia after WW2 and establishing a world government?

The correct answer isn't as obvious to me as it seems to be to people in this discussion.

Also this seems like a central clash of opinion that will resurface in the AI race, or any attempts today to reduce nuclear risk.

LessWrong Has Agree/Disagree Voting On All New Comment Threads

Thanks for responding.

Opt-in sounds like a lot of cognitive overhead for every single comment

Maybe opt-out is better then, not sure.

Also, giving readers 2 axes to vote on is another form of cognitive overhead. If it is known that the second axis won't be meaningful for a comment it might make more sense to hide the axis than have every reader independently realise why the second axis is not meaningful for a given comment.

This is a small downside, but may still be larger than the overhead placed on commenters to opt out of the second axis. Not sure.

allows for people to avoid having the truth value of their comments be judged when they make especially key claims in their argument.

I guess this has pros and cons, and it isn't immediately obvious which outweigh the other. I'd be keen to read anything about this.

One more thing is that my guess is the agree/disagree voting axis will encourage people to split up their comments more, and state things that are more cleanly true or false.

Possible. Although I feel I'd think more about this when writing comments if the second axis was opt-in and not present everywhere.

LessWrong Has Agree/Disagree Voting On All New Comment Threads

I'd weakly prefer the agree/disagree axis to be opt-in (or at least opt-out) for each comment, with the commenter choosing whether to have the axis or not.

IMO having the agree/disagree button on all comments faces the following issues:

  • what if a single comment states multiple positions? You might agree with some and not with others.

  • what if you're uncertain whether you've understood the commenter's position? Let's say you don't vote. The vote is then biased towards people who think they correctly understood the position. What if you're unsure about loopholes or edge cases that you think are important?

  • what if the comment isn't an opinion, but a quote or a collation of other people's perspectives?

In general, writing an opinion that can be cleanly agreed or disagreed with is a narrower set of actions than writing a comment that has net positive value to the discourse. Typically the commenter is in a position to know whether their comment is cleanly specified enough as an opinion to be agreed or disagreed with, so they could maybe have the ability to switch this axis on or off for their comments.

Coherence arguments imply a force for goal-directed behavior

Thanks for your reply!

You've changed my mind maybe, although I was super uncertain when I wrote that comment too. I won't take more of your time.

Coherence arguments imply a force for goal-directed behavior

Thanks for replying!

I understand better what you're trying to say now, maybe I'm just not fully convinced yet. Would be keen on your thoughts on the following if you have time! I'm mostly trying to convey some intuitions of mine (which could be wrong) rather than make a rigorous argument.

I feel like things that tend to preserve coherence might have an automatic tendency to behave in a way as if they have goals.

And then it might have instrumentally convergent subgoals and so on.

(And that this could happen even if you didn't start off by giving the program goals or ability to predict consequences of actions etc etc)

For instance, say you have an oracle AI that takes compute steps to minimize incoherence in its world models. It doesn't evaluate or compare steps before taking them, at least as understood in the "agent" paradigm. But it is intelligent in some sense and it somehow takes the correct steps a lot of the time.

If you connect this AI to a secondary memory, it seems reasonable to predict that it will probably start writing to the new memory too, to have larger but coherent models. (Of course this is not certain and you as a programmer might have deeper knowledge of what it'll do.) If you have a second program running that messes up the world models of this program, it seems reasonable to predict it might switch off the second program (assuming it has admin rights to do so).

You might say that this just means it is now doing reasoning as per the agent paradigm: it is looking at its possible actions, forecasting consequences, comparing them etc. But I'm not sure this needs to be true. I feel like if there is any sense in which a program is doing correct things, and here we define correct as "coherent" or preserving coherence, then no matter how it is managing to do these correct things, it is likely to be behaving in certain specific ways that can make it adversarial.

Coherence feels to me like a thing that's valuable primarily in the presence of adversaries. Fractured information or preferences or world models etc. can be okay and useful ... unless you start creating adversarial agents, adversarial datapoints, adversarial action spaces (like money pumps) etc. that detect these fractures, and decide that useful AIs are ones that give useful results even in the presence of these adversarial things. Now if you build useful AIs by this definition of useful, you'll get more coherence.

Edit: By fractured I mean like, somewhat coherent somewhat incoherent.

What’s the contingency plan if we get AGI tomorrow?

Are you asking short-term or long-term?

Short-term, the only thing that matters is buying time. It isn't very obvious that buying a little time helps, but if you can't buy time nothing helps, so you need to buy time.

Some strategies:

 - Persuasion: Convince people in the company to not deploy or at least delay deploying the AI.

 - Persuade other actors: Persuade other actors who can exert control over the company, such as the US military, to prevent it from deploying the AI. There is a lot of nuance over which actors are likely to correctly respond to the threat within 24 hours, and also what the long-term consequences of involving said actors are.

 - Force: Includes everything from cutting off electricity and internet access to the facility, to cyberhacking, to using lethal force to enter or destroy the facility or the people running it.


In general this discussion involves details that are

a) outside the Overton window of acceptable actions. Just as an example, I can imagine worlds (not sure how likely) where this scenario ends up with a foreign military launching a missile at the city in which the AI lab resides. (If multiple militaries have access to information that a lab is building this in 24 hours, they're unlikely to find it easy to trust each other to handle it.)

b) private, for instance who has access to power or powerful people to execute such plans.

Hence it's likely you'll only get so far on this discussion on LessWrong.

Coherence arguments imply a force for goal-directed behavior

(Update: Made some edits within 15 min of making the comment)

Thank you for replying!

This makes sense, and yes it definitely makes sense to consider programs that don't follow the exact paradigm you mentioned in the first paragraph.

But I also feel like coherence arguments can apply even to agents that don't fit in this paradigm. You can for instance have really dumb programs which can be money pumped, and really dumb programs that can't be money pumped (say, because they are hard-coded with the right answers on the limited tasks they are designed for). Neither kind of agent needs to have a world model or do planning or prediction of any sort. But we know that humans will prefer to design the latter (programs that can't be money-pumped).
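To make the money-pump point concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the post being discussed: the items "A", "B", "C", the fee of 1 per trade, and the names `run_money_pump` and `make_agent` are all made up. Both agents are pure lookup tables with no world model, planning or prediction; they differ only in whether the hard-coded preferences are circular.

```python
def run_money_pump(prefers, start_item, offers, fee=1, rounds=9):
    """Offer trades in a fixed rotation; the agent pays `fee` each time it accepts.

    `prefers(x, y)` is a hypothetical lookup that returns True when the
    agent would swap y for x. Returns the total amount the agent paid.
    """
    held = start_item
    paid = 0
    for i in range(rounds):
        offered = offers[i % len(offers)]
        if prefers(offered, held):
            held = offered
            paid += fee
    return paid


def make_agent(better_than):
    """Build a 'really dumb' agent: a hard-coded table of (preferred, other) pairs."""
    return lambda x, y: (x, y) in better_than


# Circular preferences: A beats B, B beats C, C beats A.
CIRCULAR = {("A", "B"), ("B", "C"), ("C", "A")}
# Transitive preferences: A beats B, B beats C, A beats C.
TRANSITIVE = {("A", "B"), ("B", "C"), ("A", "C")}

OFFERS = ["C", "B", "A"]

circular_loss = run_money_pump(make_agent(CIRCULAR), "C", OFFERS)
transitive_loss = run_money_pump(make_agent(TRANSITIVE), "C", OFFERS)
print("circular agent paid:", circular_loss)
print("transitive agent paid:", transitive_loss)
```

In this toy setup the circular agent keeps paying for as long as the rotation of offers continues, while the transitive agent stops paying once it holds its top item. Raising `rounds` increases the first loss but not the second.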

Even if we consider say, an oracle AI, that is designed to model certain aspects of the world, but NOT make plans or actions, there is a sense in which it "prefers" to have more accurate models of the world, over less accurate ones. If it had circular preferences, it could end up in say, an endless loop where it oscillates between more accurate and less accurate world models, instead of monotonically improving accuracy of its world model as it gets more time and data. So you would expect that the capable AIs that humans design don't have circular preferences of this sort and that they should over time trend in specific directions*.

[*unfettered by arbitrarily complex datapoints, including ones generated by adversaries to throw the AI off this track.]

Maybe I am being confusing when I use the word "preference" here. I mean it more as a revealed preference than a stated preference. So if an AI does X instead of Y in the real world, then the AI prefers X over Y, even if the AI doesn't do any planning or compare X and Y or even realise it had Y as an option.
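One rough way to check for circular revealed preferences, sketched in Python under my own assumptions: treat each observed "did X instead of Y" as a directed edge from X to Y and look for a cycle in the resulting graph. The function name `has_preference_cycle` and the (chosen, rejected) pair format are hypothetical, and the check says nothing about whether the system plans or compares options internally.

```python
def has_preference_cycle(observed_choices):
    """Return True if the revealed preferences contain a cycle.

    `observed_choices` is an iterable of (chosen, rejected) pairs, each
    recording that the system did `chosen` when `rejected` was available.
    """
    # Build a directed graph with an edge chosen -> rejected.
    graph = {}
    for chosen, rejected in observed_choices:
        graph.setdefault(chosen, set()).add(rejected)
        graph.setdefault(rejected, set())

    # Depth-first search with three node states to detect a back edge.
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True  # reached a node already on the current path
            if state[nxt] == UNVISITED and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[n] == UNVISITED and visit(n) for n in graph)


# Choices that can be ranked: X over Y, Y over Z.
print(has_preference_cycle([("X", "Y"), ("Y", "Z")]))
# Adding Z over X closes the loop.
print(has_preference_cycle([("X", "Y"), ("Y", "Z"), ("Z", "X")]))
```

The oracle example above maps onto this directly: an oracle that oscillates between more and less accurate world models would show up as a cycle, while one that improves monotonically would not.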

Do let me know if I made sense!

Coherence arguments imply a force for goal-directed behavior

I'm trying to understand what you mean by intelligence that is not goal directed. The examples in your post include agents that attempt to have accurate beliefs about the world. Could this be understood as a preference ordering over states internal to the agent?

And if yes, is there a meaningful difference between agents that have preference orderings over world states internal to the agent, and those that have preference orderings over world states external to the agent? Understanding this better probably comes under the embedded agency agenda.
