Lucius Bushnaq

AI notkilleveryoneism researcher at Apollo, focused on interpretability.

Right. If I have n fully independent latent variables that suffice to describe the state of the system, each of which can be in one of m different states, then even tracking the probability of every state for every latent with a b-bit precision float will only take me about n·m·b bits. That's actually not that bad compared to the n·log₂(m) bits for just tracking some max likelihood guess.
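The comparison can be sketched with placeholder values for the number of latents, states per latent, and float precision (none of these numbers are from the original comment):

```python
import math

n = 100   # number of independent latent variables (hypothetical)
m = 16    # states per latent (hypothetical)
b = 16    # bits of float precision per stored probability (hypothetical)

# Full belief state: one probability per state per latent.
belief_bits = n * m * b                  # 100 * 16 * 16 = 25_600 bits

# Max likelihood guess: just name the most likely state of each latent.
ml_bits = n * math.ceil(math.log2(m))    # 100 * 4 = 400 bits

print(belief_bits, ml_bits)
```

So the full belief state costs a factor of roughly m·b/log₂(m) more than the point estimate, which is large but not astronomical.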

With that in mind, the real hot possibility is the inverse of what Shai and his coresearchers did. Rather than start with a toy model with some known nice latents, start with a net trained on real-world data, and go look for self-similar sets of activations in order to figure out what latent variables the net models its environment as containing. The symmetries of the set would tell us something about how the net updates its distributions over latents in response to inputs and time passing, which in turn would inform how the net models the latents as relating to its inputs, which in turn would inform which real-world structures those latents represent.

Thank you, this was very much the paragraph I was missing to understand why comp mech might be useful for interpretability.

How sure are we that models will keep tracking Bayesian belief states, and so allow this inverse reasoning to be used, when they don't have enough space and compute to actually track a distribution over latent states?

Approximating those distributions by something like 'peak position plus spread' seems like the kind of thing a model might do to save space.
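As a toy illustration of the space saving such a 'peak position plus spread' summary buys (sizes and the Dirichlet prior are arbitrary assumptions, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1000                        # number of states of one latent variable (toy)
p = rng.dirichlet(np.ones(k))   # some belief distribution over those states

# Full representation: k probabilities.
full_size = p.size

# 'Peak position plus spread': summarize the distribution over the state
# index by its mean and standard deviation, two numbers instead of k.
xs = np.arange(k)
mean = float(np.sum(p * xs))
std = float(np.sum(p * (xs - mean) ** 2)) ** 0.5
compressed = (mean, std)

print(full_size, len(compressed))   # 1000 numbers vs 2
```

Whether a trained net actually uses a summary like this, rather than tracking full belief states, is exactly the open question.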

Also, one of my points was that even a tiny trace of malicious optimization here can still have large effects. There are so many neutral options which are convergently unsafe that the base rate of failure is already high, so even a bit of optimization can result in a large absolute increase in failures.

Your example has it be an important bit, though: which database to use, not a random bit. If I'm getting this right, that would correspond to far more than one bit of adversarial optimisation permitted for the oracle in this setup.

One bit of optimisation doesn't mean the oracle gets to select one bit of its choice in the string to flip; it means it gets to select one of two strings[1].

  1. ^

    Plus the empty string for not answering.
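The 'one bit = choosing one of two strings' point can be made concrete with a toy simulation (the base rate is a made-up number): with a high base rate of unsafe neutral options, best-of-two selection alone produces a large absolute jump in failures.

```python
import random

random.seed(0)
q = 0.3            # assumed base rate: fraction of neutral options that are unsafe
trials = 100_000

def unsafe():
    """Draw one neutral option; True if it happens to be unsafe."""
    return random.random() < q

# No optimization: one option drawn at random.
baseline = sum(unsafe() for _ in range(trials)) / trials

# One bit of adversarial optimization: the oracle chooses between two
# candidate strings and picks an unsafe one whenever either is unsafe.
one_bit = sum(unsafe() or unsafe() for _ in range(trials)) / trials

print(round(baseline, 3), round(one_bit, 3))  # ~0.30 vs ~0.51 = 1 - (1 - q)**2
```

One selection bit nearly doubles the failure rate here; each further bit of selection pushes it closer to 1.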

First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.

Maybe we can stop that by scrapping every Oracle that doesn't answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they've been satisficed? Are they the kind of problems where solving them can save the world? 

It feels to me like the answer is 'yes'. A lot of the core research that would allow e.g. brain augmentation seems like it would be in that category. But my inner John Wentworth sim is looking kind of sceptical.


  1. ^

    It also gets to choose the timing of its answer, but I assume we are not being idiots about that and are setting the output channel to always deliver results after a set time, no more and no less.

I think that factor may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference between two features. As you increase the number of features, it becomes more difficult to choose feature embeddings such that no feature has high interference with any other feature.

So it's not really the 'typical' noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.
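The typical-vs-worst-case distinction is easy to check numerically (dimensions are arbitrary toy values): random unit feature directions in d dimensions have RMS pairwise inner product about 1/√d, while the maximum over all pairs is noticeably larger, and it is the maximum that a JL-style union bound has to control.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 2000   # embedding dimension and number of features (toy values)

# Random unit-norm feature directions.
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

G = W @ W.T
np.fill_diagonal(G, 0.0)            # ignore self-overlaps

typical = float(np.sqrt(np.mean(G ** 2)))   # RMS interference, ~ 1/sqrt(d) = 0.0625
worst = float(np.abs(G).max())              # max over pairs, grows like sqrt(log n / d)

print(typical, worst)
```

With these numbers the worst-case interference comes out several times the typical interference, which is the gap the comment is pointing at.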

How does that make you feel about the chances of the rebels destroying the Death Star? Do you think that the competent planning being displayed is a good sign? According to movie logic, it's a really bad sign.

Even in the realm of movie logic, I always thought the lack of backup plans was supposed to signal how unlikely the operation is to work, so as to create at least some instinctive tension in the viewer when they know perfectly well that this isn't the kind of movie that realistically ends with the Death Star blowing everyone up. In fact, these scenes usually have characters directly stating how nigh-impossible the mission is.

To the extent that the presence of backup plans make me worried, it's because so many movies have pulled this cheap trick that my brain now associates the presence of backup plans with the more uncommon kind of story that attempts to work a little like real life, so things won't just magically work out and the Death Star really might blow everyone up.

I feel like 'LeastWrong' implies a focus on posts judged highly accurate or predictive in hindsight, when in reality the curation process seems to weigh originality, depth and general importance a lot as well, with posts the community regards as 'big if true' often held in high regard.

I figured the probability adjustments the pump was making were modifying Everett branch amplitude ratios, not probabilities in the sense of the reasoning tools tiny human brains use to deal with incomplete knowledge of the world and logical uncertainty, predicting how a situation might go by looking at past 'base rates'. It's unclear to me how you could make the latter concept of an outcome pump a coherent thing at all. The former, on the other hand, seems like the natural outcome of the time machine setup described. If you turn back time whenever the branch doesn't have the outcome you like, only branches with the outcome you like will remain.

I can even make up a physically realisable model of an outcome pump that acts roughly like the one described in the story without using time travel at all. You just need a bunch of high quality sensors to take in data, an AI that judges from the observed data whether the condition set is satisfied, a tiny quantum random noise generator to respect the probability orderings desired, and a false vacuum bomb, which triggers immediately if the AI decides that the condition does not seem to be satisfied. The bomb works by causing a local decay of the metastable[1] electroweak vacuum. This is a highly energetic, self-sustaining process once it gets going, and spreads at the speed of light. Effectively destroying the entire future light-cone, probably not even leaving the possibility for atoms and molecules to ever form again in that volume of space.[2]

So when the AI triggers the bomb or turns back time, the amplitude of earth in that branch basically disappears. Leaving the users of the device to experience only the branches in which the improbable thing they want to have happen happens.

And causing a burning building with a gas supply in it to blow up strikes me as something you can maybe do with a lot less random quantum noise than making your mother phase through the building. Firefighter brains are maybe comparatively easy to steer with quantum noise as well, but that only works if there are any physically nearby enough to reach the building in time to save your mother at the moment the pump is activated. 

This is also why the pump has a limit on how improbable an event it can make happen. If the event has an amplitude of roughly the same size as the amplitude for the pump's sensors reporting bad data or otherwise causing the AI to make the wrong call, the pump will start being unreliable. If the event's amplitude is much lower than the amplitude for the pump malfunctioning, it basically can't do the job at all.

  1. ^

    In real life, it was an open question last I checked whether our local electroweak vacuum is in a metastable state, with the latest experimental evidence I'm aware of, from a couple of years ago, tentatively (ca. 3 sigma, I think?) pointing to yes, though that calculation probably assumes Standard Model physics, the applicability of which people can argue to hell and back. But it sure seems like a pretty self-consistent way for the world to be, so we can just declare that the fictional universe works like that. Substitute strangelets or any other conjectured instant-earth-annihilation method of your choice if you like.

  2. ^

    Because the mass terms for the elementary quantum fields would all look different now. It's unclear to me that the bound structures of hadronic matter we are familiar with would still be a thing.
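The reliability limit in the last paragraph can be sketched as rejection sampling with a malfunction floor (both probabilities are made-up numbers): the user only experiences surviving branches, and once the desired event's amplitude approaches the malfunction amplitude, a noticeable fraction of surviving branches are malfunctions rather than successes.

```python
import random

random.seed(1)

p_event = 1e-3   # assumed amplitude-squared of the desired outcome
p_fault = 1e-4   # assumed probability the sensors/AI wrongly spare a bad branch

survived, successes = 0, 0
for _ in range(1_000_000):
    event = random.random() < p_event
    fault = random.random() < p_fault
    # A branch survives (is not destroyed by the bomb) if the desired event
    # happened, or if the pump malfunctioned and spared it anyway.
    if event or fault:
        survived += 1
        successes += event

rate = successes / survived
print(rate)   # ≈ p_event / (p_event + p_fault) ≈ 0.91
```

Push p_event down to p_fault and the experienced success rate drops toward 50%; push it well below and the pump basically can't do the job, as the comment says.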

Thinking the example through a bit further: in a ReLU layer, features are all confined to the positive quadrant, so superposed features computed in a ReLU layer all have positive inner product with each other. So if I send the output of one ReLU layer implementing a set of AND gates in superposition directly to another ReLU layer implementing further ANDs on a subset of the outputs of that previous layer[1], the assumption that input directions are equally likely to have positive and negative inner products is not satisfied.

Maybe you can fix this with bias setoffs somehow? Not sure at the moment. But as currently written, it doesn't seem like I can use the outputs of one layer performing a subset of ANDs as the inputs of another layer performing another subset of ANDs.

EDIT: Talked it through with Jake. Bias setoff can help, but it currently looks to us like you still end up with AND gates that share a variable systematically having positive sign in their inner product. Which might make it difficult to implement a valid general recipe for multi-step computation if you try to work out the details.

  1. ^

    A very central use case for a superposed boolean general computer. Otherwise you don't actually get to implement any serial computation.
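The positive-quadrant observation is straightforward to verify numerically (shapes are toy values): whatever the pre-activations are, ReLU outputs are componentwise nonnegative, so every pairwise inner product between them is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200   # toy dimensions: layer width and number of feature outputs

# Arbitrary pre-activations for n superposed feature directions.
pre = rng.standard_normal((n, d))
post = np.maximum(pre, 0.0)   # ReLU confines outputs to the positive quadrant

G = post @ post.T             # Gram matrix of post-ReLU feature outputs
print(bool((G >= 0).all()))   # True: no pair has negative inner product
```

So the symmetric-sign assumption on input directions cannot hold for inputs that are themselves ReLU outputs, which is the obstruction to chaining the construction.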
