LESSWRONG
LW

238
RohanS
42713250
Message
Dialogue
Subscribe

I aim to promote welfare and reduce suffering as much as possible. This has led me to work on AGI safety research. I am particularly interested in foundation model agents (FMAs): systems like AutoGPT and Operator that equip foundation models with memory, tool use, and other affordances so they can perform multi-step tasks autonomously.

Previously, I completed an undergrad in CS and Math at Columbia, where I helped run Columbia Effective Altruism and Columbia AI Alignment Club (CAIAC).

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
3RohanS's Shortform
10mo
35
13 Arguments About a Transition to Neuralese AIs
RohanS3d32

I think CoT may be very valuable in helping capabilities researchers understand their models’ shortcomings and design training curricula that address those deficiencies. This sort of falls under “business value” itself. I currently think this is one of the stronger arguments for why we won’t switch to neuralese.

(The post only vaguely gestures at this argument, so your interpretation is also fair, but I think business value and capabilities gains are pretty closely related rather than being opposing factors.)

Reply
RohanS's Shortform
RohanS25d10

This veers into moral realism.

That's not an accident, I do lean pretty strongly realist :). But that's another thing I don't want to hardcode into AGIs, I'd rather maintain some uncertainty about it and get AGI's help in trying to continue to navigate realism vs antirealism. 

I think I agree about the need for a morality-agnostic framework that establishes boundaries and coordination, and about the risks of dystopia if we attempt to commit to any positions on object-level morality too early in our process of shaping the future. But my hope is that our meta-approach helps achieve moral progress (perhaps towards an end state of moral progress, which I think is probably well-applied total hedonistic utilitarianism). So I still care a lot about getting the object-level moral considerations involved in shaping the future at some point. Without that, you might miss out on some really important features of great futures (like abolishing suffering).

Perhaps relatedly, I'm confused about your last paragraph. If a single highly unusual person doesn't conform to the kinds of moral principles I want to have shaping the future, that's probably because that person is wrong, and I'm fine with their notions of morality being ignored in the design of the future. Hitler comes to mind for this category, idk what comes to mind for you. 

(I've always struggled to understand reasons for antirealists not to be nihilists, but haven't needed to do so as a realist. This may hurt my ability to properly model your views here, though I'd be curious what you want your morality-agnostic framework to achieve and why you think that matters in any sense.)

(I realize I'm saying lots of controversial things now, so I'll flag that the original post depended relatively little on my total hedonistic utilitarian views and much of it should remain relevant to people who disagree with me.)

Reply
RohanS's Shortform
RohanS25d10
  • I'm not so sure about total hedonistic utilitarianism that I want to directly stick it into future-shaping AIs, I'd rather have them "continuously work on a balance between figuring out what is best for the universe and acting according to [their] best current understanding of what is best for the universe (like I try to do)"
  • I think other people can be wrong about morality, in which case I don't think their notion of a good future is something I need to try to promote
  • "Exploring or developing a philosophical position is distinct from espousing it." If I understand correctly, this is mostly a matter of how well fleshed out I think the position is and how confident I am in it. I think there may be small pieces of the argument and the position that aren't perfectly fleshed out, but I expect there to be ways to iron out the details, and I overall think the position and the argument are pretty well fleshed-out. I'd say I espouse total hedonistic utilitarianism (while also being open to further development).
  • If other people endorse other things and don't consider hedonic valence important, then we should have good decision-making mechanisms for handling this kind of conflict as we shape the long-term future. I mentioned above (point 2 at the bottom of the original post) that I want such a decision process to be truth-seeking and prosocial. It should seek to figure out whether those other things actually matter and whether hedonic valence actually matters. If hedonic valence actually matters and nothing else does (as I suspect), then hedonic valence should be prioritized in decisions. In the prosocial part I'm including the idea that we should probably try ease the blow of this decision to deprioritize someone's preferences. Maybe one example conflict is between factory farmers and people who want to eliminate factory farming. I'd want to eliminate factory farming, while offering new ways for the people currently reliant on factory farming to live good lives.
Reply
RohanS's Shortform
RohanS26d*81

What would it look like for AI to go extremely well?

Here’s a long list of things that I intuitively want AI to do. It’s meant to gesture at an ambitiously great vision of the future, rather than being a precise target. (In fact, there are obvious tradeoffs between some of these things, and I'm just ignoring those for now.)

I want AGI to:

  • end factory farming
  • end poverty
  • enable space exploration and colonization
  • solve governance & coordination problems
  • execute the abolitionist project
  • cultivate amazing and rich lives for everyone
  • tile a large chunk of the future lightcone with hedonium?
  • be motivating and fun to interact with
  • promote the best and strongest kinds of social bonds
  • figure out how consciousness works as well as possible and how to make it extremely positive
  • counteract climate change and racism and sexual abuse and sexism and crime and scamming and depression
  • make transportation amazing
  • make healthcare amazing (sickness, long-lasting body pain, aging, disease, cancer; speed and quality and cost of healthcare)
  • make incredible art and games and sports and get the right people to engage in them
  • make delicious food
  • take over menial tasks
  • maintain opportunities for immersion in natural beauty
  • figure out the right way to interact with aliens and interact with them that way (assuming we ever interact with aliens, which I think is quite plausible, even for generally intelligent ones)
  • maintain opportunities for intellectual stimulation and discovery?
  • defend against x-risks (other AGIs, nuclear, bio, climate, aliens)
  • reverse entropy?
  • help humans be happier w/ less (e.g. via meditation)
  • satisfy Maslow’s full hierarchy of needs for everyone
  • improve wild animal welfare (drastically)
  • make moral progress consistently (if needed)
  • maintain human agency (where appropriate)
  • continuously work on a balance between figuring out what is best for the universe and acting according to its best current understanding of what is best for the universe (like I try to do)

(Relevant context is that I’m a pretty confident total hedonistic utilitarian.)

One possibly important takeaway is that a lot of this has to do with AI applications, and may continue to be hard to achieve with intent aligned AI for the same reasons they're hard to achieve today: lack of consensus, going against local incentives for some powerful people and groups, etc. A couple (not novel) ideas for improving this:

  1. Maybe AI startups that work on these applications are actually counterfactually important
  2. Maybe we should try to make AIs be truth-seeking and prosocial in competitive envs (including when competing with each other), which may be better than e.g. the highly partisan US Congress as a mechanism for pursuing these goals on a large scale
Reply
RohanS's Shortform
RohanS1mo10

I was recently pleasantly surprised to discover the extent to which my short-term goals are aligned with my long-term goals. It's a nice reminder that I'm working on the things I want to be working on.

For an upcoming PhD lab meeting, I was asked to make a slide describing what success means to me on three different timescales, ~1, 5, and 15 years each. Setting aside the possibility of AGI within... any of those timescales, here's what I wrote:


What Is Success For Me?

This Year: Excellent conference papers, blog posts about research results and AGI safety strategy, progress with Aether, connections (in the AIS community), research taste development, improvement at experiment velocity (incl. general coding and knowledge of tools) and writing. Influence AGI companies, governments, and/or the AIS field in ways that later improve outcomes from AGI/ASI.

Short Term (PhD): Similar to the above :). I’m simultaneously aiming for direct impact, career capital, and reducing uncertainty about longer-term paths.

Long Term: More of the above :). Open to industry (research scientist?), nonprofit, continuing to run Aether, starting other projects, academia, and even doing other useful things than ML research (like AGI strategy, grantmaking, policy, philosophy?, etc.).


I can thank 80,000 Hours for the "direct impact, career capital, and reducing uncertainty" framing; I think it has done me a lot of good. I think 80k also helped me realize that the best way to do all three is to just directly try the first steps on one of your possible longer-term career paths, starting with cheaper tests and diving deeper into things that are going well. That certainly helped create the alignment I now see between my short and long term goals.

Reply
Efficiently Detecting Hidden Reasoning with a Small Predictor Model
RohanS4mo40

It could be interesting to try that too, but we thought other reasoning models are more likely to predict similar things to the large reasoning models generating the CoTs in the first place. That hopefully increases the signal to noise ratio.

Reply
RohanS's Shortform
RohanS4mo110

What time of day are you least instrumentally rational?

(Instrumental rationality = systematically achieving your values.)

A couple months ago, I noticed that I was consistently spending time in ways I didn't endorse when I got home after dinner around 8pm. From then until about 2-3am, I would be pretty unproductive, often have some life admin thing I should do but was procrastinating on, doomscroll, not do anything particularly fun, etc.

Noticing this was the biggest step to solving it. I spent a little while thinking about how to fix it, and it's not like an immediate solution popped into mind, but I'm pretty sure it took me less than half an hour to come up with a strategy I was excited about. (Work for an extra hour at the office 7:30-8:30, walk home by 9, go for a run and shower by 10, work another hour until 11, deliberately chill until my sleep time of about 1:30. With plenty of exceptions for days with other evening plans.) I then committed to this strategy mentally, especially hard for the first couple days because I thought that would help with habit formation. I succeeded, and it felt great, and I've stuck to it reasonably well since then. Even without sticking to it perfectly, this felt like a massive improvement. (Adding two consistent, isolated hours of daily work is something that had worked very well for me before too.)

So I suspect the question at the top might be useful for others to consider too.

Reply
RohanS's Shortform
RohanS4mo20

Papers as thoughts: I have thoughts that contribute to my overall understanding of things. The AI safety field has papers that contributes to its overall understanding of things. Lots of thoughts are useful without solving everything by themselves. Lots of papers are useful without solving everything by themselves. Papers can be pretty detailed thoughts, but they can and probably should tackle pretty specific things, not try to be extremely wide-reaching. The scope of your thoughts on AI safety don’t need to be limited to the scope of your paper; in fact, each individual paper is probably just one thought, you never expect to have all your thoughts go into one paper. This is a framing that makes it feel easier to come up with useful papers to contribute, and that raises the importance and value of non-paper work/thinking.

Reply
Aether July 2025 Update
RohanS4mo20

What is the theory of impact for monitorability?

Our ToI includes a) increasing the likelihood that companies and external parties notice when monitorability is degrading and even attempt interventions, b) finding interventions that genuinely enhance monitorability, as opposed to just making CoTs look more legible, and c) lowering the monitorability tax associated with interventions in b. Admittedly we probably can’t do all of these at the same time, and perhaps you’re more pessimistic than we are that acceptable interventions even exist.

It seems to be an even weaker technique than mechanistic interpretability, which has at best a dubious ToI.

There are certainly some senses in which CoT monitoring is weaker than mech interp: all of a model’s cognition must happen somewhere in its internals, and there’s no guarantee that all the relevant cognition appears in CoT. On the other hand, there are also important senses in which CoT monitoring is a stronger technique. When a model does a lot of CoT to solve a math problem, reading the CoT provides useful insight into how it solved the problem, and trying to figure that out from model internals alone seems much more complicated. We think it’s quite likely this transfers to safety-relevant settings like a model figuring out how to exfiltrate its own weights or subvert security measures.

We agree that directly training against monitors is generally a bad idea. We’re uncertain about whether it’s ever fine to optimize reasoning chains to be more readable (which need not require training against a monitor), though it seems plausible that there are techniques that can enhance monitorability without promoting obfuscation (see Part 2 of our agenda). More confidently, we would like frontier labs to adopt standardized monitorability evals that are used before deploying models internally and externally. The results of these evals should go in model cards and can help track whether models are monitorable. 

in the limit of superintelligence

Our primary aim is to make ~human-level AIs more monitorable and trustworthy, as we believe that more trustworthy early TAIs make it more realistic that alignment research can be safely accelerated/automated. Ideally, even superintelligent AIs would have legible CoTs, but that’s not what we’re betting on with the agenda. 

Either the lab applies patches until it looks good (which is a variant of TMFT) or another lab which doesn't care as much comes along and builds the AI that kills us anyway.

It’s fair to be concerned that knowing models are becoming unmonitorable may not buy us much, but we think this is a bit too pessimistic. It seems like frontier labs are serious about preserving monitorability (e.g. OpenAI), and maybe there are some careful interventions that actually work to improve monitorability! Parts of our agenda are aimed at studying what kinds of interventions degrade vs enhance monitorability.

Reply
Aether July 2025 Update
RohanS4mo62

Our funder is not affiliated with a frontier lab and has provided us support with no expectation of financial returns. We have also had full freedom to shape our research goals (within the broad agreed-upon scope of “LLM agent safety”).

Reply
Load More
66Hidden Reasoning in LLMs: A Taxonomy
Ω
3mo
Ω
12
28How we spent our first two weeks as an independent AI safety research group
3mo
0
51Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)
Ω
3mo
Ω
7
33Efficiently Detecting Hidden Reasoning with a Small Predictor Model
4mo
3
24Aether July 2025 Update
4mo
7
3RohanS's Shortform
10mo
35
48~80 Interesting Questions about Foundation Model Agent Safety
1y
4
4Transformers Explained (Again)
1y
0
13Apply to Aether - Independent LLM Agent Safety Research Group
1y
0
4Notes on "How do we become confident in the safety of a machine learning system?"
2y
0
Load More