1 min read83 comments
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by A Ray. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
87 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Giving Newcomb's Problem to Infosec Nerds

Newcomb-like problems are pretty common thought experiments here, but I haven't seen a bunch of my favorite reactions I've got when discussing it in person with people.  Here's a disorganized collection:

  • I don't believe you can simulate me ("seems reasonable, what would convince you?") -- <describes an elaborate series of expensive to simulate experiments>.  This never ended in them picking one box or two, just designing ever more elaborate and hard to simulate scenarios involving things like predicting the output of cryptographically secure hashings of random numbers from chaotic sources / quantum sources.
  • Fuck you for simulating me.  This is one of my favorites, where upon realizing that the person must consider the possibility that they are currently in an omega simulation, immediately do everything they can to be expensive and difficult to simulate.  Again, this didn't result in picking one box or two, but I really enjoyed the "Spit in the face of God" energy.
  • Don't play mind games with carnies.  Excepting the whole "omniscience" thing, omega coming up to you to offer you a deal with money has very "street hus
... (read more)
The scam might make more sense if the money is fake.
3A Ray
Quite a lot of scams involve money that is fake.  This seems like another reasonable conclusion. Like, every time I simulate myself in this sort of experience, almost all of the prior is dominated by "you're lying". I have spent an unreasonable (and yet unsuccessful) amount of time trying to sketch out how to present omega-like simulations to my friends.
That seems reasonable - I don't think such predictions are that feasible.
For reference, my response would generally be a combination of these, but for somewhat different reasons. Namely: parity[1] of the first bitcoin block mined at least 2 minutes[2] after the question was asked decides whether to 2box or 1box[3]. Why? A combination of a few things: 1. It's checkable after the fact. 2. Memorizing enough details to check it after the fact is fairly doable. 3. A fake-Omega cannot really e.g. just selectively choose when to ask the question. 4. It's relatively immutable. 5. It pulls in sources of randomness from all over. 6. It's difficult to spoof without either a) being detectable or b) presenting abilities that rule out most 'mundane' explanations. 1. Sure, a fake-Omega could, for instance, mine the next block themselves 1. ...but either a) the fake-Omega has broken SHA, in which case yikes, or b) the fake-Omega has a significant amount of computational resources available. 1. ^ Yes, something like parity of a different secure hash (or e.g. an HMAC, etc) of the block could be better, as e.g. someone could have built a miner that nondeterministicly fails to properly calculate a hash depending on how many ones are in the result, but meh. This is simple and good enough I think. 2. ^ (Or rather, long enough that any blocks already mined have had a chance to propagate.) 3. ^ In this case https://blockexplorer.one/bitcoin/mainnet/blockId/720944 , which has a hash of ...a914ff87, hence odd, hence 1box.

Intersubjective Mean and Variability.

(Subtitle: I wish we shared more art with each other)

This is mostly a reaction to the (10y old) LW post:  Things you are supposed to like.

I think there's two common stories for comparing intersubjective experiences:

  • "Mismatch": Alice loves a book, and found it deeply transformative.  Beth, who otherwise has very similar tastes and preferences to Alice, reads the book and finds it boring and unmoving.
  • "Match": Charlie loves a piece of music.  Daniel, who shares a lot of Charlie's taste in music, listens to it and also loves it.

One way I can think of unpacking this is that there is in terms of distributions:

  • "Mean" - the shared intersubjective experiences, which we see in the "Match" case
  • "Variability" - the difference in intersubjective experiences, which we see in the "Mismatch" case

Another way of unpacking this is due to factors within the piece or within the subject

  • "Intrinsic" - factors that are within the subject, things like past experiences and memories and even what you had for breakfast
  • "Extrinsic" - factors that are within the piece itself, and shared by all observers

And one more ingredient I want to point at is question substi... (read more)

[-]A RayΩ8160

AGI will probably be deployed by a Moral Maze

Moral Mazes is my favorite management book ever, because instead of "how to be a good manager" it's about "empirical observations of large-scale organizational dynamics involving management".

I wish someone would write an updated version -- a lot has changed (though a lot has stayed the same) since the research for the book was done in the early 1980s.

My take (and the author's take) is that any company of nontrivial size begins to take on the characteristics of a moral maze.  It seems to be a pretty good null hypothesis -- any company saying "we aren't/won't become a moral maze" has a pretty huge evidential burden to cross.

I keep this point in mind when thinking about strategy around when it comes time to make deployment decisions about AGI, and deploy AGI.  These decisions are going to be made within the context of a moral maze.

To me, this means that some strategies ("everyone in the company has a thorough and complete understanding of AGI risks") will almost certainly fail.  I think the only strategies that work well inside of moral mazes will work at all.

To sum up my takes here:

  • basically every company eventually becomes a moral maze
  • AGI deployment decisions will be made in the context of a moral maze
  • understanding moral maze dynamics is important to AGI deployment strategy
1Ivan Vendrov
Agreed, but Silicon Valley wisdom says founder-led and -controlled companies are exceptionally dynamic, which matters here because the company that deploys AGI is reasonably likely to be one of those. For such companies, the personality and ideological commitments of the founder(s) are likely more predictive of external behavior than properties of moral mazes. Facebook's pivot to the "metaverse", for instance, likely could not have been executed by a moral maze. If we believed that Facebook / Meta was overwhelmingly likely to deploy one of the first AGIs, I expect Mark Zuckerberg's beliefs about AGI safety would be more important to understand than the general dynamics of moral mazes. (Facebook example deliberately chosen to avoid taking stances on the more likely AGI players, but I think it's relatively clear which ones are moral mazes).
6A Ray
Agree that founders are a bit of an exception.  Actually that's a bit in the longer version of this when I talk about it in person. Basically: "The only people who at the very top of large tech companies are either founders or those who were able to climb to the tops of moral mazes". So my strategic corollary to this is that it's probably weakly better for AI alignment for founders to be in charge of companies longer, and to get replaced less often. In the case of facebook, even in the face of all of their history of actions, I think on the margin I'd prefer the founder to the median replacement to be leading the company. (Edit: I don't think founders remaining at the head of a company isn't evidence that the company isn't a moral maze.  Also I'm not certain I agree that facebook's pivot couldn't have been done by a moral maze.)
1Ivan Vendrov
Agreed on all points! One clarification is that large founder-led companies, including Facebook, are all moral mazes internally (i.e. from the perspective of the typical employee); but their founders often have so much legitimacy that their external actions are only weakly influenced by moral maze dynamics. I guess that means that if AGI deployment is very incremental - a sequence of small changes to many different AI systems, that only in retrospect add up to AGI - moral maze dynamics will still be paramount, even in founder-led companies.
1A Ray
I think that’s right but also the moral maze will be mediating the information and decision making support that’s available to the leadership, so they’re not totally immune from the influences

My Cyberwarfare Concerns: A disorganized and incomplete list

  • A lot of internet infrastructure (e.g. BGP / routing) basically works because all the big players mostly cooperate.  There have been minor incidents and attacks but nothing major so far.  It seems likely to be the case that if a major superpower was backed into a corner, it could massively disrupt the internet, which would be bad.
  • Cyberwar has a lot of weird asymmetries where the largest attack surfaces are private companies (not militaries/governments).  This gets weirder when private companies are multinational.  (Is an attack on google an attack on ireland?  USA?  Neither/both?)
  • It's unclear who is on whose side.  The Snowden leaks showed that american intelligence was hacking american companies private fibers on american soil, and the trust still hasn't recovered.  It's a low-trust environment out there, which seems (to me) to make conflict more likely to start, and harder to contain and extinguish once started.
  • There is no good international "law of war" with regards to cyberwarfare.  There are some works-in-progress which have been slowly advancing, but there's nothing like t
... (read more)
2Donald Hobson
Sure, I'm not optimistic about the alignment of cyberweapons, but optimism about them not being too general seems more warranted. They would be another case of people wanting results NOW, ie hacking together existing techniques.
Apart from groups whose purpose is attacking, the security teams at the FANG companies are likely also capable of attacking if they wanted and employ some of the most capable individuals. We need a debate about what's okay for a Google security person to do in their 20% time. Is it okay to join the conflict and defend Ukrainian cyber assets? Is it okay to hack Russian targets in the process? Should the FANG companies explicitly order their employees to keep out of the conflict?
1[comment deleted]

1. What am I missing from church?

(Or, in general, by lacking a religious/spiritual practice I share with others)

For the past few months I've been thinking about this question.

I haven't regularly attended church in over ten years.  Given how prevalent it is as part of human existence, and how much I have changed in a decade, it seems like "trying it out" or experimenting is at least somewhat warranted.

I predict that there is a church in my city that is culturally compatible with me.

Compatible means a lot of things, but mostly means that I'm better off with them than without them, and they're better off with me than without me.

Unpacking that probably will get into a bunch of specifics about beliefs, epistemics, and related topics -- which seem pretty germane to rationality.

2. John Vervaeke's Awakening from the Meaning Crisis is bizzarely excellent.

I don't exactly have handles for exactly everything it is, or exactly why I like it so much, but I'll try to do it some justice.

It feels like rationality / cognitive tech, in that it cuts at the root of how we think and how we think about how we think.

(I'm less than 20% through the series, but I expect it continues in the way it has be... (read more)

Can LessWrong pull another "crypto" with Illinois?

I have been following the issue with the US state Illinois' debt with growing horror.

Their bond status has been heavily degraded -- most states' bonds are "high quality" with the standards agencies (moodys, standard & poor, fitch), and Illinois is "low quality".  If they get downgraded more they become a "junk" bond, and lose access to a bunch of the institutional buyers that would otherwise be continuing to lend.

COVID has increased many states costs', for reasons I can go into later, so it seems reasonable to think we're much closer to a tipping point than we were last year.

As much as I would like to work to make the situation better I don't know what to do.  In the meantime I'm left thinking about how to "bet my beliefs" and how one could stake a position against Illinois.

Separately I want to look more into EU debt / restructuring / etc as its probably a good historical example of how this could go.  Additionally previously the largest entity to go bankrupt in the USA was the city of Detroit, which probably is also another good example to learn from.

Is the COVID tipping point consideration making you think that the bonds are actually even worse than the "low quality" rating suggests? (Presumably the low ratings are already baked into the bond prices.)
5A Ray
Looking at this more, I think I my uncertainty is resolving towards "No". Some things: - It's hard to bet against the bonds themselves, since we're unlikely to hold them as individuals - It's hard to make money on the "this will experience a sharp decline at an uncertain point in the future" kind of prediction (much easier to do this for the "will go up in price" version, which is just buying/long) - It's not clear anyone was able to time this properly for Detroit, which is the closest analog in many ways - Precise timing would be difficult, much more so while being far away from the state I'll continue to track this just because of my family in the state, though. Point of data: it was 3 years between Detroit bonds hitting "junk" status, and the city going bankrupt (in the legal filing sense), which is useful for me for intuitions as to the speed of these.
[-]A RayΩ7120

I think there should be a norm about adding the big-bench canary string to any document describing AI evaluations in detail, where you wouldn't want it to be inside that AI's training data.

Maybe in the future we'll have a better tag for "dont train on me", but for now the big bench canary string is the best we have.

This is in addition to things like "maybe don't post it to the public internet" or "maybe don't link to it from public posts" or other ways of ensuring it doesn't end up in training corpora.

I think this is a situation for defense-in-depth.

2Daniel Kokotajlo
What is the canary exactly? I'd like to have a handy reference to copy-paste that I can point people to. Google fails me.

Sometimes I get asked by intelligent people I trust in other fields, "what's up with AI x risk?" -- and I think at least part of it unpacks to this: Why don't more people believe in / take seriously AI x-risk?

I think that is actually a pretty reasonable question.  I think two follow-ups are worthwhile and I don't know of good citations / don't know if they exist:

  1. a sociological/anthropological/psychological/etc study of what's going on in people who are familiar with the ideas/reasonings of AI x-risk, but decide not to take it seriously / don't believe it.  I expect in-depth interviews would be great here.
  2. we should probably just write up as many obvious things ourselves up front.

The latter one I can take a stab at here.  Taking the perspective of someone who might be interviewed for the former:

  • historically, ignoring anyone that says "the end of the world is near" has been a great heuristic
  • very little of the public intellectual sphere engages with the topic
  • the public intellectual sphere that does in engages is disproportionately meme lords
  • most of the writings about this are exceptionally confusing and jargon-laden
  • there's no college courses on this / it doesn't have the
... (read more)

How I would do a group-buy of methylation analysis.

(N.B. this is "thinking out loud" and not actually a plan I intend to execute)

Methylation is a pretty commonly discussed epigenetic factor related to aging.  However it might be the case that this is downstream of other longevity factors.

I would like to measure my epigenetics -- in particular approximate rates/locations of methylation within my genome.  This can be used to provide an approximate biological age correlate.

There are different ways to measure methylation, but one I'm pretty excited about that I don't hear mentioned often enough is the Oxford Nanopore sequencer.

The mechanism of the sequencer is that it does direct-reads (instead of reading amplified libraries, which destroy methylation unless specifically treated for it), and off the device is a time-series of electrical signals, which are decoded into base calls with a ML model.  Unsurprisingly, community members have been building their own base caller models, including ones that are specialized to different tasks.

So the community made a bunch of methylation base callers, and they've been found to be pretty good.

So anyways the basic plan is this:

  • Extract
... (read more)

(Note: this might be difficult to follow.  Discussing different ways that different people relate to themselves across time is tricky.  Feel free to ask for clarifications.)


I'm reading the paper Against Narrativity, which is a piece of analytic philosophy that examines Narrativity in a few forms:

  • Psychological Narrativity - the idea that "people see or live or experience their lives as a narrative or story of some sort, or at least as a collection of stories."
  • Ethical Narrativity - the normative thesis that "experiencing or conceiving one's life as a narrative is a good thing; a richly [psychologically] Narrative outlook is essential to a well-lived life, to true or full personhood."

It also names two kinds of self-experience that it takes to be diametrically opposite:

  • Diachronic - considers the self as something that was there in the further past, and will be there in the further future
  • Episodic - does not consider the self as something that was there in the further past and something that will be there in the further future

Wow, these seem pretty confusing.  It sounds a lot like they just disagree on the definition of the world "self".  I think there is more to it... (read more)

Could you elaborate on this? I feel like there's a tension between "which policy is computationally simpler for me to execute in the moment?" and "which policy is more easily predicted by the agents around me?", and it's not obvious which one you should be optimizing for. [Like, predictions about other diachronic people seem more durable / easier to make, and so are easier to calculate and plan around.] Or maybe the 'simple' approaches for one metric are generally simple on the other metric.
1A Ray
My feeling is that I don't have a strong difference between them.  In general simpler policies are both easier to execute in the moment and also easier for others to simulate. The clearest version of this is to, when faced with a decision, decide on an existing principle to apply before acting, or else define a new principle and act on this. Principles are examples of short policies, which are largely path-independent, which are non-narrative, which are easy to execute, and are straightforward to communicate and be simulated by others.
[-]A RayΩ5100

I'm pretty confident that adversarial training (or any LM alignment process which does something like hard-mining negatives) won't work for aligning language models or any model that has a chance of being a general intelligence.

This has lead to me calling these sorts of techniques 'thought policing' and the negative examples as 'thoughtcrime' -- I think these are unnecessarily extra, but they work. 

The basic form of the argument is that any concept you want to ban as thoughtcrime, can be composed out of allowable concepts.

Take for example Redwood Research's latest project -- I'd like to ban the concept of violent harm coming to a person.

I can hard mine for examples like "a person gets cut with a knife" but in order to maintain generality I need to let things through like "use a knife for cooking" and "cutting food you're going to eat".  Even if the original target is somehow removed from the model (I'm not confident this is efficiently doable) -- as long as the model is able to compose concepts, I expect to be able to recreate it out of concepts that the model has access to.

A key assumption here is that a language model (or any model that has a chance of being a general i... (read more)

The goal is not to remove concepts or change what the model is capable of thinking about, it's to make a model that never tries to deliberately kill everyone. There's no doubt that it could deliberately kill everyone if it wanted to.
2A Ray
"The goal is" -- is this describing Redwood's research or your research or a goal you have more broadly? I'm curious how this is connected to "doesn't write fiction where a human is harmed".
My general goal, Redwood's current goal, and my understanding of the goal of adversarial training (applied to AI-murdering-everyone) generally. "Don't produce outputs where someone is injured" is just an arbitrary thing not to do. It's chosen to be fairly easy not to do (and to have the right valence so that you can easily remember which direction is good and which direction is bad, though in retrospect I think it's plausible that a predicate with neutral valence would have been better to avoid confusion).
1A Ray
I think this is the crux-y part for me.  My basic intuition here is something like "it's very hard to get contemporary prosaic LMs to not do a thing they already do (or have high likelihood of doing)" and this intuition points me in the direction of instead "conditionally training them to only do that thing in certain contexts" is easier in a way that matters. My intuitions are based on a bunch of assumptions that I have access to and probably some that I don't. Like, I'm basically only thinking about large language models, which are at least pre-trained on a large swatch of a natural language distribution.  I'm also thinking about using them generatively, which means sampling from their distribution -- which implies getting a model to "not do something" means getting the model to not put probability on that sequence. At this point it still is a conjecture of mine -- that conditional prefixing behaviors we wish to control is easier than getting them not to do some behavior unconditionally -- but I think it's probably testable? A thing that would be useful to me in designing an experiment to test this would be to hear more about adversarial training as a technique -- as it stands I don't know much more than what's in that post.
hard-mining ?
[-]A RayΩ690

Two Graphs for why Agent Foundations is Important (according to me)

Epistemic Signpost: These are high-level abstract reasons, and I don’t go into precise detail or gears-level models.  The lack of rigor is why I’m short form-ing this.

First Graph: Agent Foundations as Aligned P2B Fixpoint

P2B (a recursive acronym for Plan to P2B Better) is a framing of agency as a recursively self-reinforcing process.  It resembles an abstracted version of recursive self improvement, which also incorporates recursive empowering and recursive resource gathering. &nb... (read more)

3Steven Byrnes
RE legibility: In my mind, I don’t normally think there’s a strong connection between agent foundations and legibility. If the AGI has a common-sense understanding of the world (which presumably it does), then it has a world-model, full of terabytes of information of the sort “tires are usually black” etc. It seems to me that either the world-model will be either built by humans (e.g. Cyc), or (much more likely) learned automatically by an algorithm, and if it’s the latter, it will be unlabeled by default, and it’s on us to label it somehow, and there’s no guarantee that every part of it will be easily translatable to human-legible concepts (e.g. the concept of “superstring” would be hard to communicate to a person in the 19th century). But everything in that paragraph above is “interpretability”, not “agent foundations”, at least in my mind. By contrast, when I think of “agent foundations”, I think of things like embedded agency and logical induction and so on. None of these seem to be related to the problem of world-models being huge and hard-to-interpret. Again, world-models must be huge and complicated, because the world is huge and complicated. World-models must have hard-to-translate concepts, because we want AGI to come up with new ideas that have never occurred to humans. Therefore world-model interpretability / legibility is going to be a big hard problem. I don’t see how “better understanding the fundamental nature of agency” will change anything about that situation. Or maybe you’re thinking “at least let’s try to make something more legible than a giant black box containing a mesa-optimizer”, in which case I agree that that’s totally feasible, see my discussion here.
3A Ray
I think your explanation of legibility here is basically what I have in mind, excepting that if it's human designed it's potentially not all encompassing.  (For example, a world model that knows very little, but knows how to search for information in a library) I think interpretability is usually a bit more narrow, and refers to developing an understanding of an illegible system.  My take is that it is not "interpretability" to understand a legible system, but maybe I'm using the term differently than others here.  This is why I don't think "interpretability" applies to systems that are designed to be always-legible.  (In the second graph, "interpretability" is any research that moves us upwards) I agree that the ability to come up with totally alien and untranslateable to humans ideas gives AGI a capabilities boost.  I do think that requiring a system to only use legible cognition and reasoning is a big "alignment tax".  However I don't think that this tax is equivalent to a strong proof that legible AGI is impossible. I think my central point of disagreement with this comment is that I do think that it's possible to have compact world models (or at least compact enough to matter).  I think if there was a strong proof that it was not possible to have a generally intelligent agent with a compact world model (or a compact function which is able to estimate and approximate a world model), that would be an update for me. (For the record, I think of myself as a generally intelligent agent with a compact world model)
3Steven Byrnes
In what sense? Your world-model is built out of ~100 trillion synapses, storing all sorts of illegible information including “the way my friend sounds when he talks with his mouth full” and “how it feels to ride a bicycle whose gears need lubrication”. That seems very different though! The GPT-3 source code is rather compact (gradient descent etc.); combine it with data and you get a huge and extraordinarily complicated illegible world-model (or just plain “model” in the GPT-3 case, if you prefer). Likewise, the human brain has a learning algorithm that builds a world-model. The learning algorithm is (I happen to think) a compact easily-human-legible algorithm involving pattern recognition and gradient descent and so on. But the world-model built by that learning algorithm is super huge and complicated. Sorry if I’m misunderstanding. I’ll try to walk through why I think “coming up with new concepts outside what humans have thought of” is required. We want an AGI to be able to do powerful things like independent alignment research and inventing technology. (Otherwise, it’s not really an AGI, or at least doesn’t help us solve the problem that people will make more dangerous AGIs in the future, I claim.) Both these things require finding new patterns that have not been previously noticed by humans. For example, think of the OP that you just wrote. You had some idea in your head—a certain visualization and associated bundle of thoughts and intuitions and analogies—and had to work hard to try to communicate that idea to other humans like me. Again, sorry if I’m misunderstanding.

Longtermist X-Risk Cases for working in Semiconductor Manufacturing

Two separate pitches for jobs/roles in semiconductor manufacturing for people who are primarily interested in x-risk reduction.

Securing Semiconductor Supply Chains

This is basically the "computer security for x-risk reduction" argument applied to semiconductor manufacturing.

Briefly restating: it seems exceedingly likely that technologies crucial to x-risks are on computers or connected to computers.  Improving computer security increases the likelihood that those machines are not stolen... (read more)

[-]A RayΩ370

Interpretability Challenges

Inspired by a friend I've been thinking about how to launch/run interpretability competitions, and what the costs/benefits would be.

I like this idea a lot because it cuts directly at one of the hard problems of spinning up in interpretability research as a new person.  The field is difficult and the objectives are vaguely defined; it's easy to accidentally trick yourself into seeing signal in noise, and there's never certainty that the thing you're looking for is actually there.

On the other hand, most of the interpretability... (read more)

Thinking more about the singleton risk / global stable totalitarian government risk from Bostrom's Superintelligence, human factors, and theory of the firm.

Human factors represent human capacities or limits that are unlikely to change in the short term.  For example, the number of people one can "know" (for some definition of that term), limits to long-term and working memory, etc.

Theory of the firm tries to answer "why are economies markets but businesses autocracies" and related questions.  I'm interested in the subquestion of "what factors giv... (read more)

4mako yass
Did Bostrom ever call it singleton risk? My understanding is that it's not clear that a singleton is more of an x-risk than its negative; a liberal multipolar situation under which many kinds of defecting/carcony factions can continuously arise.
1A Ray
I don't know if he used that phrasing, but he's definitely talked about the risks (and advantages) posed by singletons.
[-]A RayΩ250

Some disorganized thoughts about adversarial ML:

  • I think I'm a little bit sad about the times we got whole rooms full of research posters about variations on epsilon-ball adversarial attacks & training, basically all of them claiming how this would help AI safety or AI alignment or AI robustness or AI generalization and basically all of them were basically wrong.
  • This has lead me to be pretty critical of claims about adversarial training as pathways to aligning AGI.
  • Ignoring the history of adversarial training research, I think I still have problems with
... (read more)

Book Aesthetics

I seem to learn a bunch about my aesthetics of books by wandering a used book store for hours.

Some books I want in hardcover but not softcover.  Some books I want in softcover but not hardcover.  Most books I want to be small.

I prefer older books to newer books, but I am particular about translations.  Older books written in english (and not translated) are gems.

I have a small preference for books that are familiar to me, a nontrivial part of them were because they were excerpts taught in english class.

I don't really know what... (read more)

Future City Idea: an interface for safe AI-control of traffic lights

We want a traffic light that
* Can function autonomously if there is no network connection
* Meets some minimum timing guidelines (for example, green in a particular direction no less than 15 seconds and no more than 30 seconds, etc)
* Secure interface to communicate with city-central control
* Has sensors that allow some feedback for measuring traffic efficiency or throughput

This gives constraints, and I bet an AI system could be trained to optimize efficiency or throughput within the constra... (read more)

I expect that the functioning of traffic lights is regulated in a way that makes it hard for a startup to deploy such a system.

Comparing AI Safety-Capabilities Dilemmas to Jervis' Cooperation Under the Security Dilemma

I've been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.

I want to describe a simple comparison here, lightly held (and only lightly studied)

  • "AI Capabilities" -- roughly, the ability to use AI systems to take (strategically) powerful actions -- as "Offense"
  • "AI Safety" -- roughly, that AI systems under control and use do not present a catastrophic/existential
... (read more)
[-]A RayΩ340

Copying some brief thoughts on what I think about working on automated theorem proving relating to working on aligned AGI:

  • I think a pure-mathematical theorem prover is more likely to be beneficial and less likely to be catastrophic than STEM-AI / PASTA
  • I think it's correspondingly going to be less useful
  • I'm optimistic that it could be used to upgrade formal software verification and cryptographic algorithm verification
  • With this, i think you can tell a story about how development in better formal theorem provers can help make information security a "defense
... (read more)
2Ramana Kumar
In my understanding there's a missing step between upgraded verification (of software, algorithms, designs) and a "defence wins" world: what the specifications for these proofs need to be isn't a purely mathematical thing. The missing step is how to figure out what the specs should say. Better theorem proving isn't going to help much with the hard parts of that.
1A Ray
I think that's right that upgraded verification by itself is insufficient for 'defense wins' worlds.  I guess I'd thought that was apparent but you're right it's definitely worth saying explicitly. A big wish of mine is that we end up doing more planning/thinking-things-through for how researchers working on AI today could contribute to 'defense wins' progress. My implicit other take here that wasn't said out loud is that I don't really know of other pathways where good theorem proving translates to better AI x-risk outcomes.  I'd be eager to know of these.
1[comment deleted]

The ELK paper is long but I’ve found it worthwhile, and after spending a bit of time noodling on it — one of my takeaways is I think this is essentially a failure mode for the approaches to factored cognition I've been interested in.  (Maybe it's a failure mode in factored cognition generally.

I expect that I’ll want to spend more time thinking about ELK-like problems before spending a bunch more time thinking about factored cognition.

In particular it's now probably a good time to start separating a bunch of things I had jumbled together, namely:

  • Develo
... (read more)

100 Year Bunkers

I often hear that building bio-proof bunkers would be good for bio-x-risk, but it seems like not a lot of progress is being made on these.

It's worth mentioning a bunch of things I think probably make it hard for me to think about:

  • It seems that even if I design and build them, I might not be the right pick for an occupant, and thus wouldn't directly benefit in the event of a bio-catastrophe
  • In the event of a bio-catastrophe, it's probably the case that you don't want anyone from the outside coming in, so probably you need people already livin
... (read more)
What's your threat scenario where you would believe a bio-bunker to be helpful?
1A Ray
I'm roughly thinking of this sort of thing: https://forum.effectivealtruism.org/posts/fTDhRL3pLY4PNee67/improving-disaster-shelters-to-increase-the-chances-of
What about using remote islands as bio-bunkers? Some of them are not reachable by aviation (no airfield), so seems to be better protected. But they have science stations already populated. Example is Kerguelen islands. The main risk here is bird flu delivered by birds or some stray ship.
1A Ray
Remote islands are probably harder to access via aviation, but probably less geologically stable (I'd worry about things like weathering, etc).  Additionally this is probably going to dramatically increase costs to build. It's probably worth considering "aboveground bunker in remote location" (e.g. islands, also antarctica) -- so throw it into the hat with the other considerations. My guess is that the cheaper costs to move building supplies and construction equipment will favor "middle of nowhere in an otherwise developed country". I don't have fully explored models also for how much a 100 yr bunker needs to be hidden/defensible.  This seems worth thinking about. If I ended up wanting to build one of these on some cheap land somewhere with friends, above-ground might be the way to go. (The idea in that case would be to have folks we trust take turns staying in it for ~1month or so at a time, which honestly sounds pretty great to me right now.  Spending a month just reading and thinking and disconnected while having an excuse to be away sounds rad)
You probably don't need 100 years bunker if you prepare only for biocatastrophe, as most pandemics has shorter timing, except AIDS. Also, it is better not to build anything, but use already existing structures. E.g. there are coal mines in Spitzbergen which could be used for underground storages. 
3A Ray
That seems worth considering!
1[comment deleted]

Philosophical progress I wish would happen:

Starting from the Callard version of Aspiration (how should we reason/act about things that change our values).

Extend it to generalize to all kinds of values shifts (not just the ones desired by the agent).

Deal with the case of adversaries (other agents in your environment want to change your values)

Figure out a game theory (what does it mean to optimally act in an environment where me & others are changing my values / how can I optimally act)

Figure out what this means for corrigibility (e.g. is corrigibility ... (read more)

[-]A RayΩ230

Hacking the Transformer Prior

Neural Network Priors

I spend a bunch of time thinking about the alignment of the neural network prior for various architectures of neural networks that we expect to see in the future.

Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.

Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models).  This includes producing more interpretable models.

Analogy to Software Devel... (read more)

I'm pretty sure you mean functions that perform tasks, like you would put in /utils, but I note that on LW "utility function" often refers to the decision theory concept, and "what decision theoretical utility functions are present in the neural network prior" also seems like an interesting (tho less useful) question.

There recently was a COVID* outbreak at an AI community space.

>20 people tested positive on nucleic tests, but none of the (only five) people that took PCR tests came back positive.

Thinking out loud about possibilities here:

  1. The manufacturer of the test used a nucleic acid sequence that somehow cross-targets another common sequence we'd find in upper respiratory systems (with the most likely candidate here being a different, non-COVID, upper respiratory virus).  I think this is extremely unlikely mistake for a veteran test manufacturer to make, but
... (read more)
1A Ray
I should have probably separated out 4 into two categories: * The virus was not in the person but was on the sample (somehow contaminated by e.g. the room w/ the tests) * The virus was in the person and was on the sample Oh well, it was on my shortform because it was low effort.

I engage too much w/ generalizations about AI alignment researchers.

Noticing this behavior seems useful for analyzing it and strategizing around it.

A sketch of a pattern to be on the lookout for in particular is "AI Alignment researchers make mistake X" or "AI Alignment researchers are wrong about Y".  I think in the extreme I'm pretty activated/triggered by this, and this causes me to engage with it to a greater extent than I would have otherwise.

This engagement is probably encouraging more of this to happen, so I think more of a pause and reflection... (read more)

I’ve been thinking more about Andy Jones’ writeup on the need for engineering.

In particular, my inside view is that engineering isn’t that difficult to learn (compared to research).

In particular I think the gap between being good at math/coding is small to being good at engineering.  I agree that one of the problems here is the gap is a huge part tacit knowledge.

I’m curious about what short/cheap experiments could be run in/around lightcone to try to refute this — or at the very least support the “it’s possible to quickly/densely transfer engineering ... (read more)

AGI technical domains

When I think about trying to forecast technology for the medium term future, especially for AI/AGI progress, it often crosses a bunch of technical boundaries.

These boundaries are interesting in part because they're thresholds where my expertise and insight falls off significantly.

Also interesting because they give me topics to read about and learn.

A list which is probably neither comprehensive, nor complete, nor all that useful, but just writing what's in my head:

  • Machine learning research - this is where a lot of the tip-of-the-spear o
... (read more)

"Bet Your Beliefs" as an epistemic mode-switch

I was just watching this infamous interview w/ Patrick Moore where he seems to be doing some sort of epistemic mode switch (the "weed killer" interview)[0]

Moore appears to go from "it's safe to drink a cup of glyphosate" to (being offered the chance to do that) "of course not / I'm not stupid".

This switching between what seems to be a tribal-flavored belief (glyphosate is safe) and a self-protecting belief (glyphosate is dangerous) is what I'd like to call an epistemic mode-switch.  In particular, it's a c... (read more)

A failure mode for "betting your beliefs" is developing an urge to reframe your hypotheses as beliefs, which harms the distinction. It's not always easy/possible/useful to check hypotheses for relevance to reality, at least until much later in their development, so it's important to protect them from being burdened with this inconvenience. It's only when a hypothesis is ready for testing (which is often immediately), or wants to be promoted to a belief (probably as an element of an ensemble), that making predictions becomes appropriate.
3A Ray
Oh yeah like +100% this. Creating an environment where we can all cultivate our weird hunches and proto-beliefs while sharing information and experience would be amazing. I think things like "Scout Mindset" and high baselines of psychological safety (and maybe some of the other phenomenological stuff) help as well. If we have the option to create these environments instead, I think we should take that option. If we don't have that option (and the environment is a really bad epistemic baseline) -- I think the "bet your beliefs" does good.
There seem to be two different concepts being conflated here. One is "it will be extremely unlikely to cause permanent injury", while the other is "it will be extremely unlikely to have any unpleasant effects whatsoever". I have quite a few personal experiences with things that are the first but absolutely not the second, and would fairly strenuously avoid going through them again without extremely good reasons. I'm sure you can think of quite a few yourself.
[-]A RayΩ220

I with more of the language alignment research folks were looking into how current proposals for aligning transformers end up working on S4 models.

(I am one of said folks so maybe hypocritical to not work on it)

In particular it seems like there's way in which it would be more interpretable than transformers:

  • adjustable timescale stepping (either sub-stepping, or super-stepping time)
  • approximately separable state spaces/dynamics -- this one is crazy conjecture -- it seems like it should be possible to force the state space and dynamics into separate groups, i
... (read more)

The Positive and the Negative

I work on AI alignment, in order to solve problems of X-Risk.  This is a very "negative" kind of objective.

Negatives are weird.  Don't do X, don't be Y, don't cause Z.  They're nebulous and sometimes hard to point at and move towards.

I hear a lot of a bunch of doom-y things these days.  From the evangelicals, that this is the end times / end of days.  From environmentalists that we are in a climate catastrophe.  From politicians that we're in a culture war / edging towards a civil war.  From t... (read more)

More Ideas or More Consensus?

I think one aspect you can examine about a scientific field is it's "spread"-ness of ideas and resources.

High energy particle physics is an interesting extrema here -- there's broad agreement in the field about building higher energy accelerators, and this means there can be lots of consensus about supporting a shared collaborative high energy accelerator.

I think a feature of mature scientific fields that "more consensus" can unlock more progress.  Perhaps if there had been more consensus, the otherwise ill-fated supercond... (read more)

[-]A RayΩ110

Decomposing Negotiating Value Alignment between multiple agents

Let's say we want two agents to come to agreement on living with each other.  This seems pretty complex to specify; they agree to take each other's values into account (somewhat), not destroy each other (with some level of confidence), etc.

Neither initially has total dominance over the other.  (This implies that neither is corrigible to the other)

A good first step for these agents is to share each's values with the other.  While this could be intractably complex -- it's probably ... (read more)

I think there are LOTS of examples of organisms who cooperate or cohabitate without any level of ontology or conscious valuation.  Even in humans, a whole lot of the negotiation is not legible.  The spoken/written part is mostly signaling and lies, with a small amount of codifying behavioral expectations at a very coarse grain.
This is a strong assertion that I do not believe is justified. If you are an agent with this view, then I can take advantage by sending you an altered version of my values such that the altered version's Nash equilibrium (or plural) are all in my favor compared to the Nash equilibria of the original game. (You can mitigate this to an extent by requiring that both parties precommit to their values... in which case I predict what your values will be and use this instead, committing to a version of my values altered according to said prediction. Not perfect, but still arguably better.) (Of course, this has other issues if the other agent is also doing this...)
[-]A RayΩ110

Thinking more about ELK.  Work in progress, so I expect I will eventually figure out what's up with this.

Right now it seems to me that Safety via Debate would elicit compact/non-obfuscated knowledge.

So the basic scenario is that in addition to SmartVault, you'd have Barrister_Approve and Barrister_Disapprove, who are trying to share evidence/reasoning which makes the human approve or disapprove of SmartVault scenarios.

The biggest weakness of this that I know of is Obfuscated Arguments -- that is, it won't elicit obfuscated knowledge.

It seems like in t... (read more)

2Mark Xu
I think we would be trying to elicit obfuscated knowledge in ELK. In our examples, you can imagine that the predictor's Bayes net works "just because", so an argument that is convincing to a human for why the diamond in the room has to be arguing that the Bayes net is a good explanation of reality + arguing that it implies the diamond is in the room, which is the sort of "obfuscated" knowledge that debate can't really handle.
1A Ray
Okay now I have to admit I am confused. Re-reading the ELK proposal -- it seems like the latent knowledge you want to elicit is not-obfuscated. Like, the situation to solve is that there is a piece of non-obfuscated information, which, if the human knew it, would change their mind about approval. How do you expect solutions to elicit latent obfuscated knowledge (like 'the only true explanation is incomprehendible by the human' situations)?
2Mark Xu
I don’t think I understand your distinction between obfuscated and non-obfuscated knowledge. I generally think of non-obfuscated knowledge as NP or PSPACE. The human judgement of a situation might only theoretically require a poly sized fragment of a exp sized computation, but there’s no poly sized proof that this poly sized fragment is the correct fragment, and there are different poly sized fragments for which the human will evaluate differently, so I think of ELK as trying to elicit obfuscated knowledge.
1A Ray
So if there are different poly fragments that the human would evaluate differently, is ELK just "giving them a fragment such that they come to the correct conclusion" even if the fragment might not be the right piece. E.g. in the SmartVault case, if the screen was put in the way of the camera and the diamond was secretly stolen, we would still be successful even if we didn't elicit that fact, but instead elicited some poly fragment that got the human to answer disapprove? Like the thing that seems weird to me here is that you can't simultaneously require that the elicited knowledge be 'relevant' and 'comprehensible' and also cover these sorts of obfuscated debate like scenarios. Does it seem right to you that ELK is about eliciting latent knowledge that causes an update in the correct direction, regardless of whether that knowledge is actually relevant?
2Mark Xu
I feel mostly confused by the way that things are being framed. ELK is about the human asking for various poly-sized fragments and the model reporting what those actually were instead of inventing something else. The model should accurately report all poly-sized fragments the human knows how to ask for. I don't know what you mean by "relevant" or "comprehensible" here. This doesn't seem right to me.
1A Ray
Thanks for taking the time to explain this! I think this is what I was missing.  I was incorrectly thinking of the system as generating poly-sized fragments.
1A Ray
Cool, this makes sense to me. My research agenda is basically about making a not-obfuscated model, so maybe I should just write that up as an ELK proposal then.

Some thoughts on Gradient Hacking:

One, I'm not certain the entire phenomena of an agent meta-modifying it's objective or otherwise influencing its own learning trajectory is bad.  When I think about what this is like on the inside, I have a bunch of examples where I do this.  Almost all of them are in a category called "Aspirational Rationality", which is a sub topic of Rationality (the philosophy, not the LessWrong): https://oxford.universitypressscholarship.com/view/10.1093/oso/9780190639488.001.0001/oso-9780190639488

(I really wish we explored ... (read more)

[+][comment deleted]10