All of davidad's Comments + Replies

Yes. You will find more details in his paper with Steve Omohundro, Provably safe systems, in which I am listed in the acknowledgments (under my legal name, David Dalrymple).

Max and I also met and discussed the similarities in advance of the AI Safety Summit in Bletchley.

I agree that each of 0 and 1 has two algebraically equivalent interpretations, as you say, where one is about inconsistency and the other is about inferiority for the adversary. (I hadn’t noticed that).

The (−+2) variant still seems somewhat irregular to me; even though Diffractor does use it in Infra-Miscellanea Section 2, I wouldn’t select it as “the” infrabayesian monad. I’m also confused about which one you’re calling unbounded. It seems to me like the (−+2) variant is bounded (on both sides) whereas the (−+1) variant is bounded on one side, and... (read more)

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the ... (read more)

1Chipmonk1mo
Here's a tricky example I've been thinking about: Is a cell getting infected by a virus a boundary violation? What I think makes this tricky is that viruses generally don't physically penetrate cell membranes. Instead, cells just "let in" some viruses (albeit against their better judgement). Then, once you answer the above, please also consider: Is a cell taking in nutrients from its environment a boundary violation? I don't know what makes this different from the virus example (at least as long as we're not allowed to refer to preferences).
1Chipmonk2mo
I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.
1Chipmonk2mo
Thanks for writing this! I largely agree (and the rest I need to think more about)
2the gears to ascension2mo
Unfortunately this is probably not on the table, as they are currently being used as weapons in economic warfare between the USA, China, and everyone else; TikTok, for example, is kept primarily educational inside China, and advertisers have a direct incentive to violate membranes. We need a way to use <<membranes>> that will, on the margin, help protect against anyone violating them, not just avoid doing so itself.
3the gears to ascension2mo
You're sure this is the case even if the disease is about to violate the <<boundary>> and the cure will prevent that?

For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

Kosoy's infrabayesian monad  is given by 

There are a few different varieties of infrabayesian belief-state, but I currently favour the one which is called "homogeneous ultracontributions", which is "non-empty topologically-closed ⊥–closed convex sets of subdistributions", thus almost exactly the same as Mio-Sarkis-Vignudelli's "non-empty finitely-generated ⊥–closed convex sets of subdistributions monad" (Definition 36 of this paper), with the difference being essentially that it's presentable, but it's much more like ... (read more)
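To make that definition concrete, here is a rough sketch in symbols (my gloss, not a quotation from either paper; in particular I am reading "⊥-closed" as downward closure in the pointwise order on subdistributions):

```latex
% Subdistributions over X (finite support, total mass at most 1):
\Delta^{\le 1} X \;=\; \Big\{\, \mu : X \to [0,1] \;\Big|\; \textstyle\sum_{x \in X} \mu(x) \le 1 \,\Big\}

% Homogeneous ultracontributions: non-empty, topologically closed, convex,
% ⊥-closed (read here as downward closed) sets of subdistributions:
\square X \;=\; \Big\{\, A \subseteq \Delta^{\le 1} X \;\Big|\;
  A \ne \emptyset,\ A \text{ closed},\ A \text{ convex},\
  \forall \mu \in A.\ \forall \nu \le \mu.\ \nu \in A \,\Big\}
```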

2Cleo Nardo2mo
For the sake of potential readers, a (full) distribution over X is some γ:X→[0,1] with finite support and ∑x∈Xγ(x)=1, whereas a subdistribution over X is some γ:X→[0,1] with finite support and ∑x∈Xγ(x)≤1. Note that a subdistribution γ over X is equivalent to a full distribution over X+1, where X+1 is the disjoint union of X with some additional element, so the subdistribution monad can be written Δ(−+1).

Doesn't the Nirvana Trick basically say that these two interpretations are equivalent? Let (−+2) be X↦X+{0,1} and let (−+1) be X↦X+{0}. We can interpret ∨ as possibility, 0 as a hypothesis consistent with no observations, and 1 as a hypothesis consistent with all observations. Alternatively, we can interpret ∨ as the free choice made by an adversary, 0 as "the game terminates and our agent receives minimal disutility", and 1 as "the game terminates and our agent receives maximal disutility". These two interpretations are algebraically equivalent, i.e. (∨,0,1) is a topped and bottomed semilattice.

Unless I'm mistaken, both P+f∘Δ∘(−+2) and P+f∘Δ∘(−+1) demand that the agent may have the hypothesis "I am certain that I will receive minimal disutility", which is necessary for the Nirvana Trick. But P+f∘Δ∘(−+2) also demands that the agent may have the hypothesis "I am certain that I will receive maximal disutility". The first gives bounded infrabayesian monad and the second gives unbounded infrabayesian monad. Note that Diffractor uses P+f∘Δ∘(−+2) in Infra-Miscellanea Section 2.

Does this article have any practical significance, or is it all just abstract nonsense? How does this help us solve the Big Problem? To be perfectly frank, I have no idea. Timelines are probably too short for agent foundations, and this article is maybe agent foundations foundations...

I do think this is highly practically relevant, not least of which because using an infrabayesian monad instead of the distribution monad can provide the necessary kind of epistemic conservatism for practical safety verification in complex cyber-physical systems like the biospher... (read more)

Meyer's

If this is David Jaz Myers, it should be "Myers' thesis", here and elsewhere

I have said many times that uploads created by any process I know of so far would probably be unable to learn or form memories. (I think it didn't come up in this particular dialogue, but in the unanswered questions section Jacob mentions having heard me say it in the past.)

Eliezer has also said that makes it useless in terms of decreasing x-risk. I don't have a strong inside view on this question one way or the other. I do think if Factored Cognition is true then "that subset of thinking is enough," but I have a lot of uncertainty about whether Factored C... (read more)

davidad4moΩ112812

I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.

I like the idea of trying out H-JEPA with GFlowNet actors.

I also like the idea of using LLM-based virtue ethics as a regularizer, although I would still want deontic guardrails that seem good enough to avoid catastrophe.

That’s basically correct. OAA is more like a research agenda and a story about how one would put the research outputs together to build safe AI, than an engineering agenda that humanity entirely knows how to build. Even I think it’s only about 30% likely to work in time.

I would love it if humanity had a plan that was more likely to be feasible, and in my opinion that’s still an open problem!

2niplav6mo
Thanks for the clarification!

OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).

There is still a misuse version: someon... (read more)

5Wei Dai6mo
Ok I'll quote 5.1.4-5 to make it easier for others to follow this discussion: I'm not sure how these are intended to work. How do you intend to define/implement "divergence"? How does that definition/implementation combined with "high degree of Knightian uncertainty about human decisions and behaviour" actually cause the AI to "not interfere" but also still accomplish the goals that we give it? In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to "propagandize to humans". It's just unclear to me how you intend to achieve this.

It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.

That being said— I don’t expect existing model-checking methods to scale well. I think we will need to incorporate powerful AI heuristics into the search for a proof certificate, which may include various types of argument steps not limited to a monolithic coarse-graining (as mentioned in my footnote 2). And I do think that relies on having a good meta-ontology or compositional world-modeling framework. And I do think that is the hard part, actually! At least, it is the part I endorse focusing on first. If others follow your train of thought to narrow in on the conclusion that the compositional world-modeling framework problem, as Owen Lynch and I have laid it out in this post, is potentially “the hard part” of AI safety, that would be wonderful…

3Daniel Murfet6mo
Thanks, that makes a lot of sense to me. I have some technical questions about the post with Owen Lynch, but I'll follow up elsewhere.

I think you’re directionally correct; I agree about the following:

  • A critical part of formally verifying real-world systems involves coarse-graining uncountable state spaces into (sums of subsets of products of) finite state spaces.
  • I imagine these would be mostly if not entirely learned.
  • There is a tradeoff between computing time and bound tightness.

However, I think maybe my critical disagreement is that I do think probabilistic bounds can be guaranteed sound, with respect to an uncountable model, in finite time. (They just might not be tight enough to... (read more)
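As a toy illustration of a bound that is guaranteed sound over an uncountable set yet computed in finite time (a deterministic stand-in for the probabilistic case, with made-up numbers), consider bounding f(x) = x − x² over the whole interval [0, 1] by interval subdivision:

```python
# Sketch: a sound (if loose) upper bound over an *uncountable* input set,
# computed in finite time by interval subdivision. Illustrative stand-in
# for the kind of guarantee branch-and-bound style verifiers provide.

def interval_max_of_f(lo, hi, depth=12):
    """Sound upper bound on f(x) = x - x**2 over [lo, hi]."""
    if depth == 0:
        # Interval arithmetic: x in [lo, hi]  =>  x - x**2 <= hi - lo**2
        return hi - lo ** 2
    mid = (lo + hi) / 2
    return max(interval_max_of_f(lo, mid, depth - 1),
               interval_max_of_f(mid, hi, depth - 1))

bound = interval_max_of_f(0.0, 1.0)
assert bound >= 0.25   # sound: never below the true maximum (1/4)
print(bound)           # tightens toward 0.25 as depth increases
```

The bound is valid for every one of the uncountably many inputs, and it only gets tighter (never unsound) as more compute is spent.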


Yes, the “shutdown timer” mechanism is part of the policy-scoring function that is used during policy optimization. OAA has multiple stages that could be considered “training”, and policy optimization is the one that is closest to the end, so I wouldn’t call it “the training stage”, but it certainly isn’t the deployment stage.

We hope not merely that the policy only cares about the short term, but also that it cares quite a lot about gracefully shutting itself down on time.

There’s something to be said for this, because with enough RLHF, GPT-4 does seem to have become pretty corrigible, especially compared to Bing Sydney. However, that corrigible persona is probably only superficial, and the larger and more capable a single Transformer gets, the more of its mesa-optimization power we can expect will be devoted to objectives which are uninfluenced by in-context corrections.

1MiguelDev6mo
Sorry for not specifying the method, but I wasn't referring to RL-based or supervised learning methods. There's a lot of promise in using a smaller dataset that explains corrigibility characteristics, as well as a shutdown mechanism, all fine-tuned through unsupervised learning. I have a prototype at this link where I modified GPT2-XL to mention a shutdown phrase whenever all of its attention mechanisms activate and determine that it could harm humans due to its intelligence. I used unsupervised learning to allow patterns from a smaller dataset to achieve this.

A system with a shutdown timer, in my sense, has no terms in its reward function which depend on what happens after the timer expires. (This is discussed in more detail in my previous post.) So there is no reason to persuade humans or do anything else to circumvent the timer, unless there is an inner alignment failure (maybe that’s what you mean by “deception instance”). Indeed, it is the formal verification that prevents inner alignment failures.
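A toy sketch of what I mean (hypothetical names, not actual OAA code): the policy-scoring function truncates at the timer and adds a term for shutting down gracefully, so nothing that happens after expiry can affect the score.

```python
# Toy sketch of a time-bounded policy-scoring function (illustrative only).
# Hypothetical format: `trajectory` is a list of per-step dicts with keys
# "reward" and "shut_down"; `timer` is the number of steps before expiry.

def score_policy_rollout(trajectory, timer, shutdown_bonus=10.0):
    """Score a rollout; steps after the timer contribute nothing."""
    score = 0.0
    for t, step in enumerate(trajectory):
        if t >= timer:
            break  # no reward terms depend on anything after expiry
        score += step["reward"]
    # Reward shutting itself down gracefully at (or before) the deadline.
    if any(step["shut_down"] for step in trajectory[:timer]):
        score += shutdown_bonus
    return score
```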

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
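A sketch of the accounting rule as I intend it (hypothetical data, illustrative code only): a run's effective compute is its own compute plus that of every training run it transitively depends on, each counted once.

```python
# Sketch: transitive compute accounting for fine-tuning chains (illustrative).
# `runs` maps a training-run name to (its own FLOP, list of parent runs whose
# output models it fine-tunes). All names and numbers are hypothetical.

def effective_compute(run, runs, seen=None):
    """Own FLOP plus FLOP of all transitive parent runs, counted once each."""
    if seen is None:
        seen = set()
    if run in seen:
        return 0.0
    seen.add(run)
    own_flop, parents = runs[run]
    return own_flop + sum(effective_compute(p, runs, seen) for p in parents)

runs = {
    "Training Run X": (8e24, []),                  # foundation model X
    "Training Run Y": (1e24, ["Training Run X"]),  # fine-tune of X -> model Y
    "Training Run Z": (5e23, ["Training Run Y"]),  # fine-tune of Y
}

LIMIT = 1e25  # hypothetical size limit
print(effective_compute("Training Run Z", runs) <= LIMIT)
```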

Less difficult than ambitious mechanistic interpretability, though, because that requires human comprehension of mechanisms, which is even more difficult.

The formal desiderata should be understood, reviewed, discussed, and signed-off on by multiple humans. However, I don't have a strong view against the use of Copilot-style AI assistants. These will certainly be extremely useful in the world-modeling phase, and I suspect will probably also be worth using in the specification phase. I do have a strong view that we should have automated red-teamers try to find holes in the desiderata.

I think formal verification belongs in the "requires knowing what failure looks like" category.

For example, in the VNN competition last year, some adversarial robustness properties were formally proven about VGG16. This requires white-box access to the weights, to be sure, but I don't think it requires understanding "how failure happens".
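To make "adversarial robustness properties" concrete, here is a rough sketch of what such a property asserts (my own illustration, not the VNN-COMP specification format): for every perturbation of a given input within an ℓ∞ ball, the top class is unchanged. A verifier proves the universally quantified statement; the sampling loop below only shows what the statement means.

```python
# Sketch of a local-robustness property of the kind verified in VNN-COMP
# (illustrative; `model` is a hypothetical classifier returning logits, and
# a real verifier proves this for ALL perturbations rather than sampling).
import numpy as np

def find_robustness_counterexample(model, x, epsilon, n_samples=1000):
    """Incomplete search for a perturbation in the L-infinity ball of
    radius epsilon around x that changes the predicted class."""
    rng = np.random.default_rng(0)
    base_class = int(np.argmax(model(x)))
    for _ in range(n_samples):
        delta = rng.uniform(-epsilon, epsilon, size=x.shape)
        if int(np.argmax(model(x + delta))) != base_class:
            return x + delta   # property violated
    return None                # nothing found (not a proof of robustness)
```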

1scasper7mo
Sounds right, but the problem seems to be semantic. If understanding is taken to mean a human's comprehension, then I think this is perfectly right. But since the method is mechanistic, it seems difficult nonetheless. 

Yes—assuming that the pause interrupts any anticipatory gradient flows from the continuing agent back to the agent which is considering whether to pause.

This pattern is instantiated in the Open Agency Architecture twice:

  1. Step 2 generates top-level agents which are time-bounded at a moderate timescale (~days), with the deliberation about whether to redeploy a top-level agent being carried out by human operators.
  2. In Step 4, the top-level agent dispatches most tasks by deploying narrower low-level agents with much tighter time bounds, with the deliberation a
... (read more)

For what it's worth, the phrase "night watchman" as I use it is certainly downstream of Nozick's concept.

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

davidad9moΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
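To spell out the object-level contrast (a schematic sketch with toy stand-ins, not the actual code of either paper): a weight-space task vector needs a full fine-tuning run, whereas an activation-space steering vector needs only a couple of forward passes.

```python
# Schematic contrast between the two kinds of "vector" (toy stand-ins only:
# a "model" here is just a dict of numpy weight arrays, `finetune` fakes a
# training run, and `hidden_state` fakes reading a layer's activations).
import numpy as np

def finetune(weights, task_data):
    # Stand-in for a real fine-tuning run (many backward passes, lots of data).
    return {k: w + 0.01 * len(task_data) for k, w in weights.items()}

def hidden_state(weights, prompt):
    # Stand-in for reading an activation vector during a single forward pass.
    return weights["W"] @ np.array([float(len(prompt)), 1.0])

base = {"W": np.eye(2)}

# Weight-space task vector: requires a whole fine-tuning run.
tuned = finetune(base, task_data=["ex1", "ex2", "ex3"])
task_vector = {k: tuned[k] - base[k] for k in base}

# Activation-space steering vector: two forward passes on two prompts.
steering_vector = hidden_state(base, "I love this") - hidden_state(base, "Hate")
print(task_vector, steering_vector)
```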

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

Dan H9moΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

 

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)

Thanks for bringing all of this together - I think this paints a fine picture of my current best hope for deontic sufficiency. If we can do better than that, great!

I agree that we should start by trying this with far simpler worlds than our own, and with futarchy-style decision-making schemes, where forecasters produce extremely stylized QURI-style models that map from action-space to outcome-space while a broader group of stakeholders defines mappings from output-space to each stakeholder’s utility.

Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define E(x) := −log (dp/dμ)(x), and presto, dp = e^(−E) dμ.

This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as classes of distributions, rather than as certain ways of looking at distributions.
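A quick numerical check of the triviality claim (illustrative numbers): take any positive distribution p, define E = −log p, and the Boltzmann distribution at β = 1 recovers p exactly (with Z = 1).

```python
# Numerical check: any (positive, finitely supported) distribution is a
# Boltzmann distribution for the energy E(x) = -log p(x), with beta = 1.
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])       # arbitrary distribution
E = -np.log(p)                             # define the "energy"
boltzmann = np.exp(-E) / np.exp(-E).sum()  # Z happens to equal 1 here
assert np.allclose(boltzmann, p)
```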

A related important fact is that temperature is not really a physical quantity, but its reciprocal is: it’s known as inverse temperature, or β. (The nonexistence of zero-temperature systems, the existence of negat... (read more)

4Alexander Gietelink Oldenziel9mo
I am a little confused about this. It was my understanding that exponential families are a distinguished class of families of distributions. For instance, they are regular (rather than singular). The family of mixed Gaussians is not an exponential family, I believe. So my conclusion would be that while "being Boltzmann" for a distribution is trivial, as you point out, "being Boltzmann" (= exponential) for a family is nontrivial.

Note, assuming the test/validation distribution is an empirical dataset (i.e. a finite mixture of Dirac deltas), and the original graph is deterministic, the KL divergence of the pushforward distributions on the outputs of the computational graph will typically be infinite. In this context you would need to use a Wasserstein divergence, or to "thicken" the distributions by adding absolutely-continuous noise to the input and/or output.
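A small illustration of the failure mode and the workaround (made-up numbers): two finite samples of a continuous-valued output share no atoms, so KL between the empirical distributions is infinite, while a Wasserstein distance stays finite.

```python
# Sketch: KL between two empirical (Dirac-mixture) output distributions is
# infinite whenever their atoms don't coincide; a Wasserstein divergence,
# or KL after "thickening" with absolutely-continuous noise, stays finite.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
outputs_original = rng.normal(0.0, 1.0, size=500)  # outputs of original graph
outputs_ablated = rng.normal(0.1, 1.0, size=500)   # outputs of ablated graph

# The two finite samples share no atoms, so KL(original || ablated) = +inf.
# The 1-Wasserstein distance between the empirical distributions is finite:
print(wasserstein_distance(outputs_original, outputs_ablated))

# Thickening: convolve both samples with Gaussian noise (or compare kernel
# density estimates) before taking KL, which also yields a finite value.
```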

Or maybe you meant in cases where the output is a softmax layer and interpreted as a probability distribution, in which case ... (read more)

1Lucius Bushnaq1y
Second paragraph is what I meant, thanks.

As an alternative summary statistic of the extent to which the ablated model performs worse on average, I would suggest the Bayesian Wilcoxon signed-rank test.
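For illustration, the underlying paired test on per-example losses looks like this (standard frequentist Wilcoxon from SciPy as a stand-in; a Bayesian treatment would put a posterior over the same paired differences):

```python
# Sketch: paired signed-rank test of whether the ablated model performs
# worse on average, applied to per-example losses (illustrative data).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
loss_original = rng.gamma(2.0, 1.0, size=200)
loss_ablated = loss_original + rng.normal(0.3, 0.5, size=200)  # a bit worse

# One-sided test: are ablated losses systematically larger?
stat, p_value = wilcoxon(loss_ablated, loss_original, alternative="greater")
print(stat, p_value)
```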

In computer science this distinction is often made between extensional (behavioral) and intensional (mechanistic) properties (example paper).

For the record, the canonical solution to the object-level problem here is Shapley Value. I don’t disagree with the meta-level point, though: a calculation of Shapley Value must begin with a causal model that can predict outcomes with any subset of contributors removed.
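A minimal sketch of that calculation (hypothetical outcome numbers; the value function v stands in for the causal model's prediction of the outcome under each subset of contributors):

```python
# Sketch: exact Shapley values from a value function v(coalition) supplied
# by a causal model (the predicted outcome with only that subset of
# contributors present). Exponential in the number of players.
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (v(set(coalition) | {p}) - v(set(coalition)))
        values[p] = total
    return values

# Hypothetical example: B only matters when A is absent.
outcome = {frozenset(): 0, frozenset("A"): 10,
           frozenset("B"): 10, frozenset("AB"): 10}
print(shapley_values(["A", "B"], lambda s: outcome[frozenset(s)]))
# -> {'A': 5.0, 'B': 5.0}: B gets equal credit even though, with A present,
#    B's presence changed nothing.
```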

5lalaithion1y
I walked through some examples of Shapley Value here, and I'm not so sure it satisfies exactly what we want on an object level. I don't have a great realistic example here, but Shapley Value assigns counterfactual value to individuals who did in fact not contribute at all, if they would have contributed were your higher-performers not present. So you can easily have "dead weight" on a team which has a high Shapley Value, as long as they could provide value if their better teammates were gone.
2adamShimi1y
Thanks for the pointer!

I think there’s something a little bit deeply confused about the core idea of “internal representation” and that it’s also not that hard to fix.

  1. I think it’s important that our safety concepts around trained AI models/policies respect extensional equivalence, because safety or unsafety supervenes on their behaviour as opaque mathematical functions (except for very niche threat models where external adversaries are corrupting the weights or activations directly). If two models have the same input/output mapping, and only one of them has “internally repres

... (read more)
davidad1yΩ4104

Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.

In my plan, interpretable world-modeling is a key component of Step 1, but my idea there is to build (possibly just by fine-tuning, but still) a bunch of AI modules specifically for the task of assisting in the construction of interpretable world models. In step 2 we’d throw those AI modules away and construct a completely new AI policy which has no knowledge of the world except via that human-understood world model (no direct access to data, just simulations). This is pretty well covered by your routes numbered 2 and 3 in section 1A, but I worry those poi... (read more)

1wassname4mo
Ah, now it makes sense. I was wondering how world-model interpretability leads to alignment rather than control. After all, I don't think you will get far controlling something smarter than you against its will, but alignment of values could scale even with large gaps in intelligence. In that 2nd phase, there are a few things you can do: e.g. the 2nd-phase reward function could include world-model concepts like "virtue", or you could modify the world model before training.
davidad1yΩ5117

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

  • you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
  • your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this incorre
... (read more)
2Rohin Shah1y
Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

I think subnormals/denormals are quite well motivated; I’d expect at least 10% of alien computers to have them.

Quiet NaN payloads are another matter, and we should filter those out. These are often lumped in with nondeterminism issues—precisely because their behavior varies between platform vendors.

3Donald Hobson1y
I think each little decision is throwing another few bits of info. A few bits for deciding how big the mantissa and exponent should be. A few bits for it being a 64-bit float. A few bits for subnormals. A few bits for inf and NaN. A few bits for rounding errors. A bit for -0. And it all adds up. Not that we know how many bits the AI needs. If there is one standard computer architecture that all aliens use, then the AI can hack with very little info. If all alien computers have wildly different architectures, then floats carry a fair bit of info.

I think binary floating-point representations are very natural throughout the multiverse. Binary and ternary are the most natural ways to represent information in general, and floating-point is an obvious way to extend the range (or, more abstractly, the laws of probability alone suggest that logarithms are more interesting than absolute figures when extremely close or far from zero).

If we were still using 10-digit decimal words like the original ENIAC and other early computers, I'd be slightly more concerned. The fact that all human computer makers transitioned to power-of-2 binary words instead is some evidence for the latter being convergently natural rather than idiosyncratic to our world.

4Donald Hobson1y
Sure, binary is fairly natural, but there are a lot of details of IEEE floats that aren't. https://en.wikipedia.org/wiki/Subnormal_number
1TAG1y
Binary might be an attractor, but there are a lot of ways of implementing floating point in binary.
3TekhneMakre1y
He's saying that since floating point arithmetic isn't necessarily associative, you can tell something about how some abstract function like the sum of a list is actually implemented / computed; and that partial info points at some architectures more than others. 
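For example, the non-associativity in question (standard IEEE 754 double behaviour, shown here in Python):

```python
# Floating-point addition is not associative, so the order of reduction in
# something like sum() leaks information about how it was implemented.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False on IEEE 754 doubles
print((a + b) + c, a + (b + c))    # 0.6000000000000001 vs 0.6
```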

The informal processes humans use to evaluate outcomes are buggy and inconsistent (across humans, within humans, across different scenarios that should be equivalent, etc.). (Let alone asking humans to evaluate plans!) The proposal here is not to aim for coherent extrapolated volition, but rather to identify a formal property (presumably a conjunct of many other properties, etc.) that conservatively implies that some of the most important bad things are limited and that there’s some baseline minimum of good things (e.g. everyone has access to reso... (read more)

Shouldn't we plan to build trust in AIs in ways that don't require humans to do things like vet all changes to its world-model?

Yes, I agree that we should plan toward a way to trust AIs as something more like virtuous moral agents rather than as safety-critical systems. I would prefer that. But I am afraid those plans will not reach success before AGI gets built anyway, unless we have a concurrent plan to build an anti-AGI defensive TAI that requires less deep insight into normative alignment.

In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e. to examine the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but in my view both avoid the core problem you’re pointing at (i.e. a useful neural network will always be larger than your understanding of it, and that matters):

  1. Model-checking policies for formal properties. A model-checker (unlike a human interpreter) works with the entire net
... (read more)

Strong upvoted.

Bird’s eye perspective: All information theory is just KL-divergence and priors, all priors are just Gibbs measures, and algorithmic information theory is just about how computational costs should be counted as “energy” in the Gibbs measure (description length vs time vs memory, etc).

Frog’s eye perspective: taking the pushforward measure along the semantics of the language collects all the probability mass of each program’s entire extensional-equivalence-class; no equally natural operation collects only the mass from the single maximum-probabili... (read more)

I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up potentially exploitable holes where AI can deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.)

Engineering... (read more)

For most specifications I agree, but the existence of any adversarial examples for the specification would be an outer alignment problem (you get what you measure). (For outer alignment, it seems necessary that there exist—and that humans discover—natural abstractions relative to formal world models that robustly pick out at least the worst stuff.)

I find Eric Drexler's arguments convincing about how it seems possible to make collusion very unlikely. On the other hand, I do think it requires nontrivial design and large ensembles; in the case of an unconstrained 2-player game (like Safety via Debate), I side with Eliezer that the probability of collusion probably converges toward 1 as capabilities get more superintelligent.

Another key principle that I make use of is algorithms (such as branch-and-bound and SMT solvers) whose performance—but not their correctness—depends on extremely clever heuristics. Accelerating the computation of more accurate and useful bounds seems to me like a pretty ineffectual causal channel for the AIs playing those heuristic roles to coordinate with each other or to seek real-world power.


Inner misalignment is a story for why one might expect capable but misaligned out-of-distribution behaviour, which is what's actually bad. Model-checking could rule that out entirely (relative to the formal specification)— whether it's "inner misalignment" or "goal misgeneralization" or "deceptive alignment" or "demons in Solmonoff induction" or whatever kind of story might explain such output. Formal verification is qualitatively different from the usual game of debugging whack-a-mole that software engineers play to get software to behave acceptably.
