Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to AGI risk, fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
Political games can, at best, get stuff from other people. The good stuff - the real power - is the stuff which other people don’t have to offer in the first place. The stuff which nobody is currently capable of doing/making.
Yep.
I think this generalizes to competition against people in general? As in, if you find yourself in a situation where you're competing neck-and-neck with others, with a real possibility of losing and no unsurpassable lead over them, you should stop that (if you can) and do something else.
Some examples:
In some ways, the point may be obvious. Fair fights against comparably powerful opponents are, by definition, challenging, and also anti-inductive. They drain both sides' resources, force races to the bottom, and breed a ton of other negative-sum dynamics. You want either an overwhelming asymmetric advantage, or a fight to which no-one else will show up.
At least, no-one intelligent. Picking a "fight" with Nature, or abstract concepts, is fine. Indeed, those are the kinds of fights you should be picking. Generally speaking, you want to be doing things where success comes from spending ~all of your time thinking about some object-level problem, and ~none of your time keeping tabs on your human adversaries.
Some caveats/clarifications:
Exercise for the reader: evaluate various alignment plans through this lens, and consider what a non-loser's alignment plan ought to look like (and what it ought not to look like).
Not gonna track down exact sources, sorry.
I've looked at my workflows through the lens of this post, and I'm realizing I could indeed make some of them much more efficient by restructuring them in the way suggested here.
So, thanks! I think this advice will end up directly useful to me.
Aside: For me, this paper is potentially the most exciting interpretability result of the past several years (since SAEs). Scaling it to GPT-3 and beyond seems like a very promising direction. Great job!
Yes, of course I care about whether someone takes AI risk seriously, but if someone is also untrustworthy, in my opinion this serves as a multiplier of their negative impact on the world. I do not want to create scheming and untrustworthy stakeholders that start doing sketchy stuff around AI risk. That's how a lot of really bad stuff has already happened in the past.
No-true-Scotsman-ish counterargument: no-one who actually gets AI risk would engage in this kind of tomfoolery. This is the behavior of someone who almost got it, but then missed the last turn and stumbled into the den of the legendary Black Beast of Aaargh. In the abstract, I think "we should be willing to consider supporting literal Voldemort if we're sure he has the correct model of AI X-risk" goes through.
The problem is that it just totally doesn't work in practice, not even on pure consequentialist grounds:
In general, if you're considering giving power to a really effective but untrustworthy person because they seem credibly aligned with your cause, despite their general untrustworthiness (they also don't want to die to ASI!), you are almost certainly just getting exploited. These sorts of people should be avoided like the plague. (Even in cases where you think you can keep them in check, you're going to have to spend so much paranoid effort looking over everything they do in search of gotchas that it almost certainly wouldn't be worth it.)
Probably because of that thing where if a good person dramatically abandons their morals for the greater good, they feel that it's a monumental enough sacrifice for the universe to take notice and make it worth it.
She used the phrase "absolute magic".
I write it this way too, and the ostensible "correct" way to do it slightly unnerves me every time I see it. It parses like mismatched brackets. The sentence wasn't ended! There's a compilation error!
"It isn't magic," she said.
I parse this as the correct special-case formatting for writing dialogue (not any verbal-speech quoting, specifically dialogue!). In all other cases, the comma should be outside the quotation marks.
"It isn't magic.", she said.
This doesn't parse as the correct format for dialogue, and would irk me if used this way. As to non-dialogue cases...
She said "It isn't magic.".
This also looks weird. I think in the single-sentence case, you can logically skip the dot, under the interpretation that you stopped quoting just before it.
What about a multi-sentence quote, though? In that case, including mid-quote dots but skipping the last one indeed feels off. Between the following two, the latter feels more correct:
As they said, "It isn't magic. It's witchcraft", and we shouldn't forget that.
As they said, "It isn't magic. It's witchcraft.", and we shouldn't forget that.
That said, they both feel ugly. Honestly, it feels to me that maybe you shouldn't be allowed to use inline multi-sentence quotes at all, outside the special case of dialogue? This feels most correct:
As they said,
> It isn't magic. It's witchcraft.
We shouldn't forget that.
But of course, it's also illogical. There should be a dot right after the quote! It's pretty much the exact same thing as the first example.
For that matter, same with colons. Logically, something like this is the correct formatting:
They used various example sentences, such as:
> It isn't magic. It's witchcraft.
. And we shouldn't forget that.
I would be weirded out if someone wrote it this way, though. The standard way to write it, where you omit the dot, also irks me a bit, but less so. No real good options here, only lesser evils.
Sure, but... I think one important distinction is that lies should not be interpreted as having semantic content. If you think a given text is lying, that means you can't look at just the text and infer stuff from it, you have to look at the details of the context in which it was generated. And you often don't have all the details, and the correct inferences often depend on them very precisely, especially in nontrivially complex situations. In those cases, I think lies do contain basically zero information.
For example:
If I start talking about how Thane Ruthenis sexually assaulted me, that might be true or it might be false. If it's true, that tells you something about the world. If it's false, the statement doesn't say anything about the world, but the fact that I said it still does.
It could mean any of:
Etc., etc. If someone doesn't know the details of the game-theoretic context in which the utterance is emitted, there's very little they can confidently conclude from it. They can infer that we are probably not allies (although maybe we are colluding, who knows), and that there's some sort of conflict happening, but that's it. For all intents and purposes, the statement (or its contents) should be ignored; it could mean almost anything.
(I suppose this also depends on Simulacra Levels. SL2 lies and SL4 lies are fairly different beasts, and the above mostly applies to SL4 lies.)
One of the reasons it's hard to take the possibility of blatant lies into account is that it would just be so very inconvenient, and also boring.
If someone's statements are connected to reality, that gives you something to do:
On the other hand, if you assume that someone is lying (competently, such that you can't easily identify which parts are the lies), that gives you... pretty much nothing to do. You're treating their words as containing ~zero information, so you (1) can't use them as an excuse to run some fun analyses/projections, (2) can't use them as an opportunity to socialize and show off. All you can do is stand in the corner repeating the same thing, "this person is probably lying, do not believe them", over and over again, while others get to have fun. It's terrible.
Concrete example: Sam Altman. The guy would go on an interview or post some take on Twitter, and people would start breaking what he said down, pointing out what he gets right/wrong, discussing his character and vision and the X-risk implications, etc. And I would listen to the interview and read the analyses, and my main takeaway would be, "man, 90% of this is probably just lies completely decoupled from the underlying reality". And then I have nothing to do.
Importantly: this potentially creates a community bias in favor of naivete (at least towards public figures). People who believe that Alice is a liar mostly ignore Alice,[1] so analyses of Alice's words mostly come from people who put some stock in them. This creates a selection effect where the vast majority of Alice-related discussion is by people who don't dismiss her words out of hand, which makes it seem as though the community thinks Alice is trustworthy. That (1) skews your model of the community, and (2) may be taken as evidence that Alice is trustworthy by less-informed community members, who would then start trusting her and discussing her words, creating a feedback loop.
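To make the selection effect concrete, here's a toy simulation (purely illustrative; every number in it is made up) of how a minority who take Alice's words seriously can end up producing nearly all of the visible Alice-related discussion:

```python
import random

# Toy model: a community of 1000 members, 70% of whom dismiss Alice as a liar
# and 30% of whom take her words seriously enough to analyze them.
random.seed(0)
N_MEMBERS = 1000
trusters = int(0.3 * N_MEMBERS)      # put some stock in Alice's words
dismissers = N_MEMBERS - trusters    # think she's probably lying

# Each time Alice gives an interview, a truster has lots to say about it,
# while a dismisser's only possible contribution is "she's probably lying",
# which they mostly don't bother posting.
P_COMMENT_TRUSTER = 0.5    # chance a truster writes an analysis per interview
P_COMMENT_DISMISSER = 0.02 # chance a dismisser repeats "she's lying"
N_INTERVIEWS = 20

comments_trusters = sum(
    1 for _ in range(N_INTERVIEWS) for _ in range(trusters)
    if random.random() < P_COMMENT_TRUSTER
)
comments_dismissers = sum(
    1 for _ in range(N_INTERVIEWS) for _ in range(dismissers)
    if random.random() < P_COMMENT_DISMISSER
)

share_of_members = trusters / N_MEMBERS
share_of_discussion = comments_trusters / (comments_trusters + comments_dismissers)
print(f"Fraction of members who trust Alice:       {share_of_members:.0%}")
print(f"Fraction of visible discussion they write: {share_of_discussion:.0%}")
# With these made-up numbers, ~30% of the community produces ~90% of the
# Alice-related comments, so the visible discussion looks far more trusting
# than the community actually is.
```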
Edit: Hm, come to think of it, this point generalizes. Suppose we have two models of some phenomenon, A and B. Under A, the world frequently generates prompts for intelligent discussion, whereas discussion-prompts for B are much sparser. This would create an apparent community bias in favor of A: A-proponents would be generating most of the discussions, raising A's visibility, and also get more opportunities for raising their own visibility and reputation. Note that this is completely decoupled from whether the aggregate evidence is in favor of A or B; the volume of information generated about A artificially raises its profile.
Example: disagreements regarding whether studying LLMs bears on the question of ASI alignment or not. People who pay attention to the results in that sphere get to have tons of intelligent discussions about an ever-growing pile of experiments and techniques. People who think LLM alignment is irrelevant mostly stay quiet, or retread the same few points they have for the hundredth time.
What else are they supposed to do? Their only message is "Alice is a liar", and butting into conversations just to repeat this well-discussed, conversation-killing point wouldn't feel particularly productive and would quickly start annoying people.
So what is this pattern? I'd call it The Charge of the Hobby Horse. It's where you ride your hobby horse into battle in the comments, crashing through obstacles such as "the author did not even disagree with me, so there's nothing to actually have a battle about".
Yeah, I'd noticed that impulse in myself before. I'm consciously on the lookout for it, but it's definitely disappointing to run into something that seems like an opportunity to have an argument about your pet topic, only to realize that that opportunity is a mirage.
Think about someone who couldn't feel joy (or pleasure or whatever). They would be saying the same things you're saying now, and they would be wrong.
I don't think that this tracks; or, at least, it seems far from being something we can confidently conclude. I don't yet properly understand the type signature of emotions, but they seem to be part of the structure of a mind,[1] rather than outwards-facing interfaces that could be easily factored out (like more straightforward sensory modalities, such as sight). And I think there could be a wide variety of equally valid ways to structure one's mind, including wildly alien ones. I don't think we should go around saying "you must self-modify into this specific type of mind" to people.
Would you give up your ability to feel happiness just so you could free up 5% of your time for working?
FWIW, I'd straight-up turn off all my conscious experience for the next two decades if it made me 5% more productive for those decades and had no other downsides.
Something like the sensory feedback associated with the dimensions along which a given mind's high-level state/decision-making policy can vary...? Which can be different on a per-mind basis. Doesn't feel like the full story, but maybe something in that direction.
Off-the-cuff idea, probably a bad one:
Stopping short of turning off commenting entirely: make comments on a given post subject to a separate stage of filtering/white-listing. The white-listing criteria are set by the author and made public. Ideally, the system is also not controlled by the author directly, but by someone the author expects to be competent at adhering to those criteria (perhaps an LLM, if they're competent enough at this point).
Probably this is still too censorship-y, though? (And it obviously doesn't solve the problem where people collect all the blacklisted criticism into top-level takedown posts that then get highly upvoted. Though maybe that won't be as bad and widespread as one might fear.)
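For concreteness, a minimal sketch of what that flow might look like, under heavy assumptions: every name here is hypothetical (this is not an existing LessWrong feature), and the "judge" is just an injected callable standing in for whoever, or whatever, the author delegates the filtering to.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WhitelistPolicy:
    author: str
    public_criteria: list[str]               # visible to all commenters
    judge: Callable[[str, list[str]], bool]  # a delegate, not the author themselves

def submit_comment(comment: str, policy: WhitelistPolicy) -> str:
    """Route a comment through the post's published white-listing criteria."""
    if policy.judge(comment, policy.public_criteria):
        return "published"
    # Rejected comments aren't silently deleted; the commenter can see which
    # public criteria they were judged against.
    return "held (did not meet the post's published commenting criteria)"

# Example usage with a trivial stand-in judge. A real judge would be a trusted
# human delegate or an LLM call; this one just rejects all-caps shouting.
def toy_judge(comment: str, criteria: list[str]) -> bool:
    return not comment.isupper()

policy = WhitelistPolicy(
    author="post_author",
    public_criteria=["engages with the post's actual claims", "no flaming"],
    judge=toy_judge,
)
print(submit_comment("I think premise 2 is doing a lot of work here.", policy))
print(submit_comment("THIS IS ALL NONSENSE", policy))
```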