Principles of Privacy for Alignment Research

[-]Steven Byrnes3yΩ163517

I think my threat model is a bit different. I don’t particularly care about the zillions of mediocre ML practitioners who follow things that are hot and/or immediately useful. I do care about the pioneers, who are way ahead of the curve, working to develop the next big idea in AI long before it arrives. These people are not only very insightful themselves, but also can recognize an important insight when they see it, and they’re out hunting for those insights, and they’re not looking in the same places as most people, and in particular they’re not looking at whatever is trending on Twitter or immediately useful.

Let’s try this analogy, maybe: “most impressive AI” ↔ “fastest man-made object”. Let’s say that the current record-holder for fastest man-made object is a train. And right now a competitor is building a better train, that uses new train-track technology. It’s all very exciting, and lots of people are following it in the newspapers. Meanwhile, a pioneer has the idea of building the first-ever rocket ship, but the pioneer is stuck because they need better heat-resistant tiles in order for the rocket-ship prototype to actually work. This pioneer is probably not going to be following the fastest-train news; instead, they’re going to be poring over the obscure literature on heat-resistant tiles. (Sorry for lack of historical or engineering accuracy in the above.) This isn’t a perfect analogy for many reasons, ignore it if you like.

So my ideal model is (1) figure out the whole R&D path(s) to building AGI, (2) don’t tell anyone (or even write it down!), (3) now you know exactly what not to publish, i.e. everything on that path, and it doesn’t matter whether those things would be immediately useful or not, because the pioneers who are already setting out down that path will seek out and find what you’re publishing, even if it’s obscure, because they already have a pretty good idea of what they’re looking for. Of course, that’s easier said than done, especially step (1) :-P

[-]johnswentworth3yΩ15254

Thinking out loud here...

I do basically buy the "ignore the legions of mediocre ML practitioners, pay attention to the pioneers" model. That does match my modal expectations for how AGI gets built. But:

1. How do those pioneers find my work, if my work isn't very memetically fit?
2. If they encounter my work directly from me (i.e. by reading it on LW), then at that point they're selected pretty heavily for also finding lots of stuff about alignment.
3. There's still the theory-practice gap; our hypothetical pioneer needs to somehow recognize the theory they need without flashy demos to prove its effectiveness.

Thinking about it, these factors are not enough to make me confident that someone won't use my work to produce an unaligned AGI. On (1), thinking about my personal work, there's just very little technical work at all on abstraction, so someone who knows to look for technical work on abstraction could very plausibly encounter mine. And that is indeed the sort of thing I'd expect an AGI pioneer to be looking for. On (2), they'd be a lot more likely to encounter my work if they're already paying attention to alignment, and encountering my work would probably make them more likely to pay attention to alignment even if they weren't before, but neither of those really rules out unaligned researchers. On (3), I do expect a pioneer to be able to recognize the theory they need without flashy demos to prove it if they spend a few days' attention on it.

... ok, so I'm basically convinced that I should be thinking about this particular scenario, and the "nobody cares" defense is weaker against the hypothetical pioneers than against most people. I think the "fundamental difference" is that the hypothetical pioneers know what to look for; they're not just relying on memetic fitness to bring the key ideas to them.

... well fuck, now I need to go propagate this update.

[-]Steven Byrnes3yΩ10145

at that point they're selected pretty heavily for also finding lots of stuff about alignment.

An annoying thing is, just as I sometimes read Yann LeCun or Steven Pinker or Jeff Hawkins, and I extract some bits of insight from them while ignoring all the stupid things they say about the alignment problem, by the same token I imagine other people might read my posts, and extract some bits of insight from me while ignoring all the wise things I say about the alignment problem. :-P

That said, I do definitely put some nonzero weight on those kinds of considerations. :)

[-]johnswentworth3yΩ796

More thinking out loud...

On this model, I still basically don't worry about spy agencies.
What do I really want here? Like, what would be the thing which I most want those pioneers to find? "Nothing" doesn't actually seem like the right answer; someone who already knows what to look for is going to figure out the key ideas with or without me. What I really want to do is to advertise to those pioneers, in the obscure work which most people will never pay attention to. I want to recruit them.
It feels like the prototypical person I'm defending against here is younger me. What would work on younger me?

[-]Garrett Baker1y20

Coming back to this 2 years later, and I'm curious about how you've changed your mind.

[-]Vanessa Kosoy3y*Ω112113

[For the record, here's previous relevant discussion]

My problem with the "nobody cares" model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don't put a lot of stock into building aligned AGI in the basement on my own. (And not only because I don't have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract money and talent to make such a project possible.

Second, I also don't put a lot of stock into solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But, to improve our chances of solving the problem we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for "nobody cares".

Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires the fame and status to be achievable inside the field. This is obviously also in contradiction with "nobody cares".

My own thinking about this is: yes, progress in the problems I'm working on can contribute to capability research, but the overall chance of success on the pathway "capability advances driven by theoretical insights" is higher than on the pathway "capability advances driven by trial and error", even if the first leads to AGI sooner, especially if these theoretical insights are also useful for alignment. I certainly don't want to encourage the use of my work to advance capability, and I try to discourage anyone who would listen, but I accept the inevitable risk of that happening in exchange for the benefits.

Then again, I'm by no means confident that I'm thinking about all of this in the right way.

[-]johnswentworth3yΩ341

Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the "nobody cares" model. And I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.

[-]Vanessa Kosoy3yΩ570

Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

That's a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I'm not sure. More generally, it's possible that in a world where alignment research is a well-known respectable field of study, more people take AI risk seriously.

...I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.

I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I'm doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this "bunch" is still closer to 10 people than to 1000 people but (i) it's possible that the open problems will keep multiplying hydra-style, as often happens in math and (ii) memetic fitness would help getting the very best 10 people to do it.

It's also likely that there will be a "phase II" where the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, or experimental ML research, or hardware engineering), and successful transition to this phase might require getting a lot of new people on board which would also be a lot easier given memetic fitness.

[-]evhub3yΩ162113

My usual take here is the “Nobody Cares” model, though I think there is one scenario that I tend to be worried about a bit here that you didn't address, which is how to think about whether or not you want things ending up in the training data for a future AI system. That's a scenario where the “Nobody Cares” model really doesn't apply, since the AI actually does have time to look at everything you write.

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important. However, it can also help AI systems do things like better understand how to be deceptive, so this sort of thing can be a bit tricky.

[-]Richard_Ngo3yΩ91430

Worrying about which alignment writing ends up in the training data feels like a very small lever for affecting alignment; my general heuristic is that we should try to focus on much bigger levers.

[-]Ofer3yΩ110

Is that because you think it would be hard to get the relevant researchers to exclude any given class of texts from their training datasets [EDIT: or prevent web crawlers from downloading the texts etc.]? Or even if that part was easy, you would still feel that that lever is very small?

[-]Richard_Ngo3yΩ453

Even if that part was easy, it still seems like a very small lever. A system capable of taking over the world will be able to generate those ideas for itself, and a system with strong motivations to take over the world won't have them changed by small amounts of training text.

[-]Thane Ruthenis3y42

It may have more impact at intermediate stages of AI training, where the AI is smart enough to do value reflection and contemplate self-modification/hacking the training process, but not smart enough to immediately figure out all the ways it can go wrong.

E. g., it can come to wrong conclusions about its own values, like humans can, then lock these misunderstandings in by maneuvering the training process this way, or by designing a successor agent with the wrong values. Or it may design sub-agents without understanding the various pitfalls of multi-agent systems, and get taken over by some Goodharter.

I agree that "what goes into the training set" is a minor concern, though, even with regards to influencing the aforementioned dynamic.

[-]Ofer3yΩ112

Maybe the question here is whether including certain texts in relevant training datasets can cause [language models that pose an x-risk] to be created X months sooner than otherwise.

The relevant texts I'm thinking about here are:

Descriptions of certain tricks to evade our safety measures.
Texts that might cause the ML model to (better) model AIS researchers or potential AIS interventions, or other potential AI systems that the model might cooperate with (or that might "hijack" the model's logic).

[-]Ofer3yΩ120

That being said, I think that, most of the time, alignment work ending up in training data is good, since it can help our AI systems be differentially better at AI alignment research (e.g. relative to how good they are at AI capabilities research), which is something that I think is pretty important.

That consideration seems relevant only for language models that will be doing/supporting alignment work.

[-]Thane Ruthenis3y62

The case where we need extreme paranoia is where both (1) an adversary is plausibly likely to pay attention, and (2) our research might allow for immediate and direct and very large capability gains, without any significant theory-practice gap.

The problem with this is...

In my model, most useful research is incremental and builds upon itself. As you point out, it's difficult to foresee how useful what you're currently working on will be, but if it is useful, your or others' later work will probably use it as an input.
The successful, fully filled-out alignment research tech tree will necessarily contain crucial capabilities insights. That is, (2) will necessarily be true if we succeed.
At the very end, once we have the alignment solution, we'll need to ensure that it's implemented, which means influencing AI Labs and/or government policy, which means becoming visible and impossible-to-ignore by these entities. So (1) will be true as well. Potentially because we'll have to publish a lot of flashy demos.
- In theory, this can be done covertly, by e. g. privately contacting key people and quietly convincing them or something. I wouldn't rely on us having this kind of subtle skill and coordination.

So, operating under Nobody Cares, you do incremental research. The usefulness of any given piece you're working on is very dubious 95% of the time, so you hit "submit" 95% of the time. So you keep publishing until you strike gold, until you combine some large body of published research with a novel insight and realize that the result clearly advances capabilities. At last, you can clearly see that (2) is true, so you don't publish and engage high security. Except... By this point, 95% of the insights necessary for that capabilities leap have already been published. You're in the unstable situation where the last 5% may be contributed by a random other alignment/capabilities researcher looking at the published 95% at the right angle and posting the last bit without thinking. And whether it's already the time to influence the AI industry/policy, or it'll come within a few years, (1) will become true shortly, so there'll be lots of outsiders poring over your work.

Basically, I'm concerned that following "Nobody Cares" predictably sets us up to fail at the very end. (1) and (2) are not true very often, but we can expect that they will be true if our work proves useful at all.

Not that I have any idea what to do about that.

[-]johnswentworth3y70

One part I disagree with: I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all. Reason: alignment requires understanding basically-all the core pieces of intelligence at a sufficiently-detailed level that any team capable of doing it will be very easily capable of building AGI. It is wildly unlikely that a team not capable of building AGI is even remotely capable of solving alignment.

Another part I disagree with: I claim that, if I publish 95% of the insights needed for X, then the average time before somebody besides me or my immediate friends/coworkers implements X goes down by, like, maybe 10%. Even if I publish 100% of the insights, the average time before somebody besides me or my immediate friends/coworkers implements X only goes down by maybe 20%, if I don't publish any flashy demos.

A concrete example to drive that intuition: imagine a software library which will do something very useful once complete. If the library is 95% complete, nobody uses it, and it's pretty likely that someone looking to implement the functionality will just start from scratch. Even if the library is 100% complete, without a flashy demo few people will ever find it.

All that said, there is a core to your argument which I do buy. The worlds where our work is useful at all for alignment are also the worlds where our work is most likely to be capabilities relevant. So, I'm most likely to end up regretting publishing something in exactly those worlds where the thing is useful for alignment; I'm making my life harder in exactly those worlds where I might otherwise have succeeded.

[-]Thane Ruthenis3y30

I do not expect that implementing an alignment solution will involve influencing government/labs, conditional on having an alignment solution at all

Mmm, right, in this case the fact that the rest of the AI industry is being carefree about openly publishing WMD design schematics is actually beneficial to us — our hypothetical AGI group won't be missing many insights that other industry leaders have.

The two bottlenecks here that I still see are money and manpower. The theory for solving alignment and the theory for designing AGI are closely related, but the practical implementations of these two projects may be sufficiently disjoint — such that the optimal setup is e. g. one team works full-time on developing universal interpretability tools while another works full-time on AGI architecture design. If we could hand off the latter part to skilled AI architects (and not expect them to screw it up), that may be a nontrivial speed boost.

Separately, there's the question of training sets/compute, i. e. money. Do we have enough of it? Suppose in a decade or two, one of the leading AI Labs successfully pushes for a Manhattan project equivalent, such that they'd be able to blow billions of dollars on training runs. Sure, insights into agency will probably make our AGI less compute-hungry. But will it be cheaper enough that we'd be able to match this?

Even if the library is 100% complete, without a flashy demo few people will ever find it.

But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?

[-]johnswentworth3y40

We do in fact have access to rather a lot of money; billions of dollars would not be out of the question in a few years, hundreds of millions are already probably available if we have something worthwhile to do with it, and alignment orgs are spending tens of millions already. Though by the time it becomes relevant, I don't particularly expect today's dollars -> compute -> performance curves to apply very well anyway.

But what if we have to release a flashy demo to attract attention, so there are now people swarming the already-published research looking for ideas?

Also money is a great substitute for attracting attention.

[-]Thane Ruthenis3y*1311

Okay, I've thought about it more, and I think my concerns are mainly outlined by this. Less by the post's actual contents, and more by the post's existence.

People dislike villains. Whether the concerns Andrew outlines are valid or not, people on the outside will tend to think that such concerns are valid. The hypothetical unilateral-aligned-AGI organization will be, at all times, on the verge of being a target of the entire world. The public would rally against it if the organization's intentions became public knowledge, other AI Labs would be eager to get rid of the competition slash threat it presents, and governments would be eager either to seize AI research (if they take AI seriously by that point) or acquire political points by squishing something the public and megacorps want squished.

As such, the unilateral path requires a lot of subtle secrecy too. It should not be known that we expect our AI to engage in, uh, full-scale world... optimization. In theory, that connection can be left obscured — most of the people involved can just be allowed to fail to think about what the aligned superintelligence will do once it's deployed, so there aren't leaks from low-commitment people joining and quitting the org. But the people in charge will probably have the full picture, and... Well, at this point it sounds like the stupid kind of supervillain doomsday scheme, no?

More practically, I think the ship has already sailed on keeping the sort of secrecy this plan would need to work. I don't understand why all this talk of pivotal acts has been allowed to enter public discourse by Eliezer et al., but it'll be doubtlessly connected to any hypothetical future friendly-AGI org. Probably not by the public/other AI labs directly, but by fellow AI Safety researchers who do not agree with unilateral pivotal acts. And once the concerns have been signal-boosted so, then they may be picked up by the media/politicians/Eliezer's sneer club/whoever, and once we're spending billions on training runs and it's clear that there's something actually going on beyond a bunch of doom-cult wackos, they will take these concerns seriously and act on them.

A further contributing factor may be increased public awareness of AI Risk in the future, encouraged by general AI capabilities growth, possible (non-omnicidal) AI disasters, and poorly-considered efforts of our own community. (It would be very darkly ironic if AI Safety's efforts to ban dangerous AI research resulted in governments banning AI Safety's own AGI research and no-one else's, so that's probably an attractor in possibility-space because we live in Hell.)

The bottom line is... This idea seems thermonuclear, in the sense that trying it and getting noticed probably completely dooms us on the spot, and it'd be really hard not to get noticed.

(Though I don't really buy the whole "pivotal processes" thing either. We can probably increase the timeline this way, but actually making the world's default systems produce an aligned AI... Nah.)

[-]Thane Ruthenis3y41

Fair. I have no more concrete counter-arguments to offer at this time.

I still have a vague sense that acting with the expectation that we'd be able to unilaterally build an AGI is optimistic in a way that dooms us in a nontrivial number of timelines that would've been salvageable if we didn't assume that. But maybe that impression is wrong.

[-]Donald Hobson3yΩ350

Suppose, hypothetically, I had a way to make neural networks recognize OOD inputs. (Like, I get back 40% dog, 10% cat, 20% OOD, 5% teapot...) Should I run a big, flashy ImageNet demo (so I personally know if the idea scales up) and then tell no one?

There was reasoning that went: any research that has a better alignment/capabilities ratio than the average of all research currently happening is good. A lot of research is pure capabilities, like hardware research. So almost anything with any alignment in it is good. I'm not quite sure if this is a good rule of thumb.

[-]johnswentworth3yΩ230

I think I basically don't buy the "just increase the alignment/capabilities ratio" model, at least on its own. It just isn't a sufficient condition to not die.

It does feel like there's a better version of that model waiting to be found, but I'm not sure what it is.

[-]Vanessa Kosoy3yΩ440

Relevant thread

[-]Thane Ruthenis3y32

the "just increase the alignment/capabilities ratio" model

If Donald is talking about the reasoning from my post, the primary argument there went a bit differently. It was about expanding the AI Safety field by converting extant capabilities researchers/projects; and that even if we can't make them stop capability research, any intervention that 1) doesn't speed it up and 2) makes them output alignment results alongside capabilities results is net positive.

I think I'd also argued that the AI Safety field is tiny at the moment, so we wouldn't contribute much to capability research even if we deliberately tried, but in retrospect, that argument is obviously invalid in hypotheticals where we're actually effective at solving alignment.

[-]Donald Hobson3yΩ120

Model 1. Your new paper produces $c$ units of capabilities and $a$ units of alignment. When $C$ units of capabilities are reached, an AI is produced, and it is aligned iff $A$ units of alignment have been produced. The rest of the world produces, and will continue to produce, alignment and capabilities research in ratio $R$. You are highly uncertain about $A$ and/or $C$, but have a good guess at $a$, $c$, $R$.

In this model, if $\frac{A}{C} \gg R$ we are screwed whatever you do; if $\frac{A}{C} \ll R$ we win whatever you do. Your paper makes a difference in those worlds where $\frac{A}{C} \approx R$, and in those worlds it helps iff $\frac{a}{c} > R$.

This model treats alignment and capabilities as continuous, fungible quantities that slowly accumulate. This is a dubious assumption. It also assumes that, conditional on us being in the marginal world (the world where good and bad outcomes are both very close), your mainline probability involves research continuing at its current ratio.
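To make Model 1 concrete, here is a minimal Python sketch. It assumes that $R$ means units of alignment produced per unit of capabilities by the rest of the world, and that the paper's $c$ units count toward the threshold $C$; both are my reading of the model, not something stated explicitly. The function names and the numbers are made up for illustration.

```python
def alignment_at_threshold(C, R, a=0.0, c=0.0):
    """Units of alignment that exist when total capabilities reach C.

    Assumption: R is alignment per unit of capabilities produced by the
    rest of the world, and the paper's c units count toward C.
    """
    # Capabilities the rest of the world still has to supply.
    world_capabilities = max(C - c, 0.0)
    return a + R * world_capabilities


def is_aligned(A, C, R, a=0.0, c=0.0):
    """Model 1's success condition: at least A units of alignment at the threshold."""
    return alignment_at_threshold(C, R, a, c) >= A


def paper_shift(a, c, R):
    """Net change in alignment-at-threshold from publishing (valid while c <= C)."""
    return a - R * c


# Illustrative numbers only (not from the thread).
C, R = 100.0, 0.5
print(is_aligned(A=50.5, C=C, R=R))                # marginal world, no paper
print(is_aligned(A=50.5, C=C, R=R, a=2.0, c=1.0))  # paper with a/c = 2.0 > R
print(is_aligned(A=49.8, C=C, R=R, a=0.2, c=1.0))  # paper with a/c = 0.2 < R
print(is_aligned(A=90.0, C=C, R=R, a=2.0, c=1.0))  # far from the margin
```

With these numbers the paper only flips the outcome in the marginal cases: `paper_shift` is positive exactly when $\frac{a}{c} > R$, and when $A$ is far from $R \cdot C$ the paper changes nothing.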

If, for example, you were extremely pessimistic, and thought that the only way we have any chance is if a portal to Dath ilan opens up, then the goal is largely to hold off all research for as long as possible, to maximize the time a deus ex machina can happen in. Other goals might include publishing the sort of research most likely to encourage a massive global "take AI seriously" movement.

[-]johnswentworth3yΩ220

So, the main takeaway is that we need some notion of fungibility/additivity of research progress (for both alignment and capabilities) in order for the "ratio" model to make sense.

Some places fungibility/additivity could come from:

research reducing time-until-threshold-is-reached additively and approximately-independently of other research
probabilistic independence in general
a set of rate-limiting constraints on capabilities/alignment strategies, such that each one must be solved independent of the others (i.e. solving each one does not help solve the others very much)
???

[-]Donald Hobson3yΩ120

Fungibility is necessary, but not sufficient for the "if your work has a better ratio than average research, publish". You also need your uncertainty to be in the right place.

If you were certain of , and uncertain what $\frac{A}{C}$ future research might have, you get a different rule, publish if $\frac{a}{c} > R$ .

[-]owngrove3y10

I think "alignment/capabilities > 1" is a closer heuristic than "alignment/capabilities > average", in the sense of '[fraction of remaining alignment this solves] / [fraction of remaining capabilities this solves]'. That's a sufficient condition if all research does it, though not IRL e.g. given pure capabilities research also exists; but I think it's still a necessary condition for something to be net helpful.
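A rough sketch of the "alignment/capabilities > 1" heuristic as described in this comment, in Python. The function name and the idea of passing the remaining alignment and capabilities as plain numbers are assumptions made for illustration; in practice nobody knows those quantities.

```python
import math


def progress_ratio(a, c, remaining_alignment, remaining_capabilities):
    """[fraction of remaining alignment solved] / [fraction of remaining capabilities solved].

    All four inputs are hypothetical estimates in arbitrary units.
    """
    if remaining_alignment <= 0 or remaining_capabilities <= 0:
        raise ValueError("remaining amounts must be positive")
    if c == 0:
        # No capabilities contribution at all: treated here as trivially passing.
        return math.inf
    alignment_fraction = a / remaining_alignment
    capabilities_fraction = c / remaining_capabilities
    return alignment_fraction / capabilities_fraction


def passes_heuristic(a, c, remaining_alignment, remaining_capabilities):
    """True when the research closes more of the alignment gap than the capabilities gap."""
    return progress_ratio(a, c, remaining_alignment, remaining_capabilities) > 1
```

For example, `passes_heuristic(1, 1, 4, 8)` is true (a quarter of the remaining alignment for an eighth of the remaining capabilities), while `passes_heuristic(1, 1, 8, 4)` is false.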

[-]johnswentworth3y30

It feels like what's missing is more like... gears of how to compare "alignment" to "capabilities" applications for a particular piece of research. Like, what's the thing I should actually be imagining when thinking about that "ratio"?

[-]Alexander Gietelink Oldenziel3y*Ω450

Thank you for writing this post; I had been struggling with these considerations a while back. I investigated going full paranoid mode but in the end mostly decided against it.

I agree that theoretical insights on agency and intelligence have a real chance of leading to capability gains. I agree that the government spy threat model is unlikely. I would like to add, however, that if, say, MIRI builds a safe AGI prototype - perhaps based on different principles than systems used by adversaries - it might make sense for an (AI-assisted) adversary to trawl through your old blog posts.

Byrnes has already mentioned the distinction between pioneers and median researchers. Another aspect that your threat models don't capture is research that builds on your research. Your research may end up in a very long chain of theoretical research, only a minority of which you have contributed. Or the spirit, if not the letter, of your ideas may percolate through the research community. Additionally, the alignment field will almost certainly become very much larger, raising both the status of John and of the alignment field in general. Over longer timescales I expect percolation to be quite strong.

Even if approximately nobody reads or knows of your work, the insights may very well become massively signal-boosted by other alignment researchers (once again, I expect the community to explode in size within a decade) and thereby end up in a flashy demo.

All in all, these and other considerations led me to the conclusion that this danger is very real. That is, there is a significant minority of possible worlds in which early alignment researchers tragically contribute to DOOM.

However, I still think that on the whole most alignment researchers should work in the open. Any solution to alignment will most likely come from a group (albeit a small one) of people. Working privately massively hampers collaboration. It makes the community look weird and makes it way harder to recruit good people. Also, for most researchers it is difficult to support themselves financially if they can't show their work. As by far the most likely doom scenario is some company/government simply building AGI without sufficient safeguards, because either there is no alignment solution or they are simply unaware of it or ignore it, I conclude that the best policy in expected value is to work mostly publicly*.

*Ofc if there is a clear path to capability gain keeping it secret might be the best.

EDIT: Cochran has a comical suggestion

Georgy Flerov was a young nuclear physicist in the Soviet Union who (in 1943) sent a letter to Stalin advocating an atomic bomb project. It is not clear that Stalin read that letter, but one of Flerov’s arguments was particularly interesting: he pointed out the abrupt and complete silence on the subject of nuclear fission in the scientific literature of the US, UK, and Germany – previously an extremely hot topic.
Stopping publications on atomic energy (which happened in April 1940) was a voluntary effort by American and British physicists. But that cessation was itself a signal that something strategically important was going on.
Imagine another important discovery with important strategic implications: how would you maximize your advantage?
Probably this is only practically possible if your side alone has made the discovery. If the US and the UK had continued publishing watered-down nuclear research, the paper stoppage in Germany would still have given away the game. But suppose, for the moment, that you have a monopoly on the information. Suddenly stopping closely related publications obviously doesn’t work. What do you do?
You have to continue publications, but they must stop being useful. You have to have the same names at the top (an abrupt personnel switch would also be a giveaway) but the useful content must slide to zero. You could employ people that A. can sound like the previous real authors and B. are good at faking boring trash. Or, possibly, hire people who are genuinely mediocre and don’t have to fake it.
Maybe you can distract your rivals with a different, totally fake but extremely exciting semiplausible breakthrough.
Or – an accidental example of a very effective approach to suppression. Once upon a time, around 1940, some researchers began to suspect that duodenal ulcers were caused by a spiral bacterium. Some physicians were even using early antibiotics against them, which seemed to work. Others thought what they were seeing might be postmortem contamination. A famous pathologist offered to settle the issue.
He looked, didn’t see anything, and the hypothesis was buried for 40 years.
But he was wrong: he had used the wrong stains.

So, a new (?) intelligence tactic for hiding strategic breakthroughs: the magisterial review article.

