Thane Ruthenis

Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to the AGI risk fit for worlds where alignment is punishingly hard and we only get one try.

Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.

Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.

Sequences

  • Synthesizing Standalone World-Models

Comments
Wei Dai's Shortform
Thane Ruthenis · 2h

> Curious whether you have any guesses on what would make it seem like a sympathetic decision to the audience

Off-the-cuff idea, probably a bad one:

Stopping short of "turning off commenting entirely": make comments on a given post subject to a separate stage of filtering/white-listing. The white-listing criteria are set by the author and made public. Ideally, the system is also not controlled by the author directly, but by someone the author expects to be competent at adhering to those criteria (perhaps an LLM, if they're competent enough at this point). (A rough sketch of what this could look like is at the end of this comment.)

  • The system takes direct power out of the author's hands. They still control the system's parameters, but there's a degree of separation now. The author is not engaging in "direct" acts of "tyranny".
  • It's made clear to readers that the comments under a given post have been subject to additional selection, whose level of bias they can estimate by reading the white-listing criteria.
  • The white-listing criteria are public. Depending on what they are, they can be (a) clearly sympathetic, (b) principled-sounding enough to decrease the impression of ad-hoc acts of tyranny even further.
    • (Also, ideally, the system doing the selection doesn't care about what the author wants beyond what they specified in the criteria, and is thus an only boundedly and transparently biased arbiter.)
  • The commenters are clearly made aware that there's no guarantee their comments on this post will be accepted, so if they decide to spend time writing them, they know what they're getting into (vs. bitterness-inducing sequence where someone spends time on a high-effort comment that then gets deleted).
  • There's no perceived obligation to respond to comments the author doesn't want to respond to, because they're rejected (and ideally the author isn't even given the chance to read them).
  • There are no "deleting a highly-upvoted comment" events with terrible optics.

Probably this is still too censorship-y, though? (And obviously doesn't solve the problem where people make top-level takedown posts in which all the blacklisted criticism is put and then highly upvoted. Though maybe that's not going to be as bad and widespread as one might fear.)
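
To make the mechanism concrete, here's a minimal sketch of the separation of powers I have in mind. Everything in it is hypothetical: names like `WHITELIST_CRITERIA`, `meets_criteria`, and `submit` are made up for illustration, and the arbiter is a stub rather than an actual human or LLM judge.

```python
# Minimal sketch of the white-listing setup described above. The criteria are
# illustrative, and meets_criteria() is a stub standing in for the real arbiter
# (a trusted human or an LLM judge).

from dataclasses import dataclass

# Publicly visible, author-specified white-listing criteria (examples only).
WHITELIST_CRITERIA = [
    "Engages with the post's actual claims rather than adjacent topics",
    "No speculation about the author's motives or character",
]


@dataclass
class Comment:
    author: str
    text: str


def meets_criteria(comment: Comment, criteria: list[str]) -> bool:
    """Stand-in for the arbiter. Crucially, it sees only the public criteria
    and the comment itself, not the post author's case-by-case preferences."""
    # Placeholder logic; a real arbiter would evaluate each criterion on its merits.
    return bool(comment.text.strip()) and len(criteria) > 0


def submit(comment: Comment) -> str:
    """Gate a comment before it ever reaches the post's author."""
    if meets_criteria(comment, WHITELIST_CRITERIA):
        return "published"
    # Rejected comments never reach the author: no perceived obligation to
    # respond, and no "deleting a highly-upvoted comment" events later on.
    return "rejected"


print(submit(Comment("alice", "I think section 2 overstates the result because...")))
```

The key design choice is that `submit` never shows rejected comments to the author; the arbiter answers only to the public criteria.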

Status Is The Game Of The Losers' Bracket
Thane Ruthenis · 14h

> Political games can, at best, get stuff from other people. The good stuff - the real power - is the stuff which other people don’t have to offer in the first place. The stuff which nobody is currently capable of doing/making.

Yep.

I think this generalizes to competition against people in general? As in, if you find yourself in a situation where you're competing neck-and-neck with others, with a real possibility of losing, without an unsurpassable lead on them, you should stop that (if you can) and do something else.

Some examples:

  • Business. Various standard advice (I believe shared by Graham, Thiel, and Spolsky[1]) regarding building a major business is to start by monopolizing some extremely niche market in which no real competitor exists. Once you've eaten that market, you incrementally expand your niche, always shying away from domains with real competitors until you can crush them (i. e., until they stop being real competitors).
  • Physical fights. The best way to deal with someone attacking you is not to be there to begin with (e. g., earn money and move to a safer country/city). Failing that, make yourself urgently not-there by running away (and get good at that, e. g. practice sprinting + parkour). Failing that, bring a (metaphorical or real) gun to a knife fight. "Learn martial arts", on the other hand, is not a good approach.
  • War. Maneuver warfare, where you achieve strategic victory by a sequence of precisely planned, decidedly "unfair" tactical operations, is dramatically preferable to direct WW1-style positional-warfare slugfests.
  • The post provides plenty of social-conflict examples.

In some way, the point may be obvious. Fair fights against comparably powerful opponents are, by definition, challenging, and also anti-inductive. They drain both sides' resources, force races to the bottom and tons of other negative-sum dynamics, etc. You want either an overwhelming asymmetric advantage, or a fight to which no-one else will show up.

At least, no-one intelligent. Picking a "fight" with Nature, or abstract concepts, is fine. Indeed, those are the kinds of fights you should be picking. Generally speaking, you want to be doing things where success comes from spending ~all of your time thinking about some object-level problem, and ~none of your time keeping tabs on your human adversaries.

Some caveats/clarifications:

  • The object-level problem may still involve reasoning about agents, about systems containing people, etc., and even planning against them. What you don't want are symmetric agentic competitors which are doing basically the same thing as you (for appropriate values of "symmetric" and "same thing"). In other words: you want to be the only live player on the board.
  • You'd of course still be "technically" in various kinds of competitions: competing in a job market, fighting a war. That's fine, as long as it's not a real competition from your perspective.
  • Obviously you may not always have this option available. (Good business ideas may take time to develop, and you need to eat in the meantime. Same for e. g. climbing out of poverty and moving to a safer city.)

Exercise for the reader: evaluate various alignment plans through this lens, and consider what a non-loser's alignment plan ought to look like (and what it ought not to look like).

  1. ^

    Not gonna track down exact sources, sorry.

Do things in small batches
Thane Ruthenis · 18h

I've looked at my workflows through the lens of this post, and I'm realizing I could indeed make some of them much more efficient by restructuring them the way suggested here.

So, thanks! I think this advice will end up directly useful to me.

leogao's Shortform
Thane Ruthenis · 3d

> our recent paper

Aside: For me, this paper is potentially the most exciting interpretability result of the past several years (since SAEs). Scaling it to GPT-3 and beyond seems like a very promising direction. Great job!

GradientDissenter's Shortform
Thane Ruthenis · 3d

> Yes, of course I care about whether someone takes AI risk seriously, but if someone is also untrustworthy, in my opinion this serves as a multiplier of their negative impact on the world. I do not want to create scheming and untrustworthy stakeholders that start doing sketchy stuff around AI risk. That's how really a lot of bad stuff in the past has already happened.

No-true-Scotsman-ish counterargument: no-one who actually gets AI risk would engage in this kind of tomfoolery. This is the behavior of someone who almost got it, but then missed the last turn and stumbled into the den of the legendary Black Beast of Aaargh. In the abstract, I think "we should be willing to consider supporting literal Voldemort if we're sure he has the correct model of AI X-risk" goes through.

The problem is that it just totally doesn't work in practice, not even on pure consequentialist grounds:

  • You can never tell whether Voldemorts actually understand and believe your cause, or whether they're just really good at picking the right things to say to get you to support them. No, not even if you've considered the possibility that they're lying and you still feel sure they're not. Your object-level evaluations just can't be trusted. (At least, if they're competent at their thing. And if they're not just evil, but also bad at it, so bad you can tell when they're being honest, why would you support them?)
  • Voldemorts and their plans are often more incompetent than they seem,[1] and when their evil-but-"effective" plan predictably blows up, you and your cause are going to suffer reputational damage and end up in a worse position than your starting one. (You're not gonna find an Altman, you'll find an SBF.)
  • Voldemorts are naturally predisposed to misunderstanding the AI risk in precisely the ways that later make them engage in sketchy stuff around it. They're very tempted to view ASI as a giant pile of power they can grab. (They hallucinate the Ring when they look into the Black Beast's den, if I'm to mix my analogies.)

In general, if you're considering giving power to a really effective but untrustworthy person because they seem credibly aligned with your cause, despite their general untrustworthiness (they also don't want to die to ASI!), you are almost certainly just getting exploited. These sorts of people should be avoided like the plague. (Even in cases where you think you can keep them in check, you're going to have to spend so much effort paranoidally looking over everything they do in search of gotchas that it almost certainly wouldn't be worth it.)

  1. ^

    Probably because of that thing where if a good person dramatically abandons their morals for the greater good, they feel that it's a monumental enough sacrifice for the universe to take notice and make it worth it.

Punctuation & Quotation Conventions
Thane Ruthenis · 3d

> She used the phrase "absolute magic".

I write it this way too, and the ostensible "correct" way to do it slightly unnerves me every time I see it. It parses like mismatched brackets. The sentence wasn't ended! There's a compilation error!

"It isn't magic," she said.

I parse this as the correct special-case formatting for writing dialogue (not any verbal-speech quoting, specifically dialogue!). In all other cases, the comma should be outside the quotation marks.

"It isn't magic.", she said.

This doesn't parse as the correct format for dialogue, and would irk me if used this way. As to non-dialogue cases...

> She said "It isn't magic.".

This also looks weird. I think in the single-sentence case, you can logically skip the dot, under the interpretation that you stopped quoting just before it.

What about a multi-sentence quote, though? In that case, including mid-quote dots but skipping the last one indeed feels off. Between the following two, the latter feels more correct:

> As they said, "It isn't magic. It's witchcraft", and we shouldn't forget that.

> As they said, "It isn't magic. It's witchcraft.", and we shouldn't forget that.

That said, they both feel ugly. Honestly, it feels to me that you maybe shouldn't be allowed to use inline multi-sentence quotes at all, outside the special case of dialogue? This feels most correct:

As they said,

> It isn't magic. It's witchcraft.

We shouldn't forget that.

But of course, it's also illogical. There should be a dot right after the quote! It's pretty much the exact same thing as the first example.

For that matter, same with colons. Logically, something like this is the correct formatting:

They used various example sentences, such as:

> It isn't magic. It's witchcraft.

. And we shouldn't forget that.

I would be weirded out if someone wrote it this way, though. The standard way to write it, where you omit the dot, also irks me a bit, but less so. No real good options here, only lesser evils.

Everyone has a plan until they get lied to the face
Thane Ruthenis · 4d

Sure, but... I think one important distinction is that lies should not be interpreted as having semantic content. If you think a given text is lying, that means you can't look at just the text and infer stuff from it, you have to look at the details of the context in which it was generated. And you often don't have all the details, and the correct inferences often depend on them very precisely, especially in nontrivially complex situations. In those cases, I think lies do contain basically zero information.

For example:

> If I start talking about how Thane Ruthenis sexually assaulted me, that might be true or it might be false. If it's true, that tells you something about the world. If it's false, the statement doesn't say anything about the world, but the fact that I said it still does.

It could mean any of:

  • You dislike me and want to hurt me, as a terminal goal.
  • I'm your opponent/competitor and you want to lower my reputation, as an instrumental goal.
  • You want to paint yourself as a victim because it would be advantageous for you in some upcoming situation.
  • You want to create community drama in order to distract people from something.
  • You want to erode trust between members of a community, or make the community look bad to outsiders.
  • You want to raise the profile of a specific type of discourse.

Etc., etc. If someone doesn't know the details of the game-theoretic context in which the utterance is emitted, there's very little they can confidently conclude from it. They can infer that we are probably not allies (although maybe we are colluding, who knows), and that there's some sort of conflict happening, but that's it. For all intents and purposes, the statement (or its contents) should be ignored; it could mean almost anything.

(I suppose this also depends on Simulacra Levels. SL2 lies and SL4 lies are fairly different beasts, and the above mostly applies to SL4 lies.)

Everyone has a plan until they get lied to the face
Thane Ruthenis · 4d (edited)

One of the reasons it's hard to take the possibility of blatant lies into account is that it would just be so very inconvenient, and also boring.

If someone's statements are connected to reality, that gives you something to do:

  • You can analyze them, critique them, use them to infer the person's underlying models and critique those, identify points of agreement and controversy, identify flaws in their thinking, make predictions and projections about future actions based on them, et cetera. All those activities we love a lot, they're fun and feel useful.
  • It also gives you the opportunity to engage, to socialize with the person by arguing with them, or with others by discussing their words (e. g., if it's a high-profile public person). You can show off your attention to detail and analytical prowess, build reputation and status.

On the other hand, if you assume that someone is lying (in a competent way, where you can't easily identify what the lies are), that gives you... pretty much nothing to do. You're treating their words as containing ~zero information, so you (1) can't use them as an excuse to run some fun analyses/projections, (2) can't use them as an opportunity to socialize and show off. All you can do is stand in the corner repeating the same thing, "this person is probably lying, do not believe them", over and over again, while others get to have fun. It's terrible.

Concrete example: Sam Altman. The guy would go on an interview or post some take on Twitter, and people would start breaking what he said down, pointing out what he gets right/wrong, discussing his character and vision and the X-risk implications, etc. And I would listen to the interview and read the analyses, and my main takeaway would be, "man, 90% of this is probably just lies completely decoupled from the underlying reality". And then I have nothing to do.

Importantly: this potentially creates a community bias in favor of naivete (at least towards public figures). People who believe that Alice is a liar mostly ignore Alice,[1] so all analyses of Alice's words mostly come from people who put some stock in them. This creates a selection effect where the vast majority of Alice-related discussion is by people who don't dismiss her words out of hand, which makes it seem as though the community thinks Alice is trustworthy. That (1) skews your model of the community, and (2) may be taken as evidence that Alice is trustworthy by less-informed community members, who would then start trusting her and discussing her words, creating a feedback loop.

Edit: Hm, come to think of it, this point generalizes. Suppose we have two models of some phenomenon, A and B. Under A, the world frequently generates prompts for intelligent discussion, whereas discussion-prompts for B are much sparser. This would create an apparent community bias in favor of A: A-proponents would be generating most of the discussions, raising A's visibility, and also get more opportunities for raising their own visibility and reputation. Note that this is completely decoupled from whether the aggregate evidence is in favor of A or B; the volume of information generated about A artificially raises its profile.
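
To make the toy model concrete, here's a back-of-the-envelope simulation of the A-vs-B dynamic (all numbers are made up purely for illustration):

```python
# Toy model of the visibility effect: belief is split 50/50 between models A
# and B, but A generates far more prompts for discussion, so it dominates what
# a newcomer actually sees. All numbers are made up for illustration.

N_MEMBERS = 1000
believers_A = N_MEMBERS // 2           # half the community favors A...
believers_B = N_MEMBERS - believers_A  # ...and half favors B

PROMPTS_PER_A_BELIEVER = 10  # an ever-growing pile of experiments to discuss
PROMPTS_PER_B_BELIEVER = 1   # "the same few points", restated occasionally

comments_about_A = believers_A * PROMPTS_PER_A_BELIEVER
comments_about_B = believers_B * PROMPTS_PER_B_BELIEVER
visible_share_A = comments_about_A / (comments_about_A + comments_about_B)

print(f"Share of believers in A:          {believers_A / N_MEMBERS:.0%}")  # 50%
print(f"Share of visible discussion on A: {visible_share_A:.0%}")          # ~91%
```

Belief is split evenly, but roughly 91% of the visible discussion is about A, which is easy to misread as a community consensus in A's favor.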

Example: disagreements regarding whether studying LLMs bears on the question of ASI alignment or not. People who pay attention to the results in that sphere get to have tons of intelligent discussions about an ever-growing pile of experiments and techniques. People who think LLM alignment is irrelevant mostly stay quiet, or retread the same few points they have for the hundredth time. 

  1. ^

    What else are they supposed to do? Their only message is "Alice is a liar", and butting into conversations just to repeat this well-discussed, conversation-killing point wouldn't feel particularly productive and would quickly start annoying people.

The Charge of the Hobby Horse
Thane Ruthenis · 5d

> So what is this pattern? I'd call it The Charge of the Hobby Horse. It's where you ride your hobby horse into battle in the comments, crashing through obstacles such as "the author did not even disagree with me, so there's nothing to actually have a battle about".

Yeah, I'd noticed that impulse in myself before. I'm consciously on the lookout for it, but it's definitely disappointing to run into something that seems like an opportunity to have an argument about your pet topic, only to realize that that opportunity is a mirage.

How I Learned That I Don't Feel Companionate Love
Thane Ruthenis · 5d

> Think about someone who couldn't feel joy (or pleasure or whatever). They would be saying the same things you're saying now, and they would be wrong

I don't think that this tracks; or, at least, it seems far from being something we can confidently conclude. I don't yet properly understand the type signature of emotions, but they seem to be part of the structure of a mind,[1] rather than outwards-facing interfaces that could be easily factored out (like more straightforward sensory modalities, such as sight). And I think there could be a wide variety of equally valid ways to structure one's mind, including wildly alien ones. I don't think we should go around saying "you must self-modify into this specific type of mind" to people. 

> Would you give up your ability to feel happiness just so you could free up 5% of your time for working?

FWIW, I'd straight-up turn off all my conscious experience for the next two decades if it made me 5% more productive for those decades and had no other downsides.

  1. ^

    Something like the sensory feedback associated with the dimensions along which a given mind's high-level state/decision-making policy can vary...? Which can be different on a per-mind basis. Doesn't feel like the full story, but maybe something in that direction.

Posts

  • Thane Ruthenis's Shortform (1y)
  • Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications (2mo)
  • Synthesizing Standalone World-Models, Part 3: Dataset-Assembly (2mo)
  • Synthesizing Standalone World-Models, Part 2: Shifting Structures (2mo)
  • Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies (2mo)
  • Research Agenda: Synthesizing Standalone World-Models (2mo)
  • The System You Deploy Is Not the System You Design (2mo)
  • Is Building Good Note-Taking Software an AGI-Complete Problem? (6mo)
  • A Bear Case: My Predictions Regarding AI Progress (8mo)
  • How Much Are LLMs Actually Boosting Real-World Programmer Productivity? (9mo)
  • The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better (9mo)

Wikitag Contributions

  • AI Safety Public Materials (3 years ago, +195)