There is a particular sound which you will hear around a month into a genetics course. It’s kind of contagious, spreading from person to person in the class. It’s the sound of someone finally internalising why genes are named the way they are.
Example:
Fruit flies (it’s always fruit flies) have several genes responsible for producing their eye colour. It’s normally a brick-red colour, as a result of both brown and red genes being present.
The scarlet gene makes a protein which moves kynurenine into pigment cells
The brown gene makes a protein which moves pteridine into pigment cells
The white gene makes a protein which is involved with both transporters
Now, take a guess at what colour the kynurenine and pteridine molecules are? Well actually both are colourless precursors, but guess which colour they become? Yep, the kynurenine—taken up by the scarlet-produced protein—is a brown pigment, and the pteridine—taken up by the brown-produced protein—is bright red.
Internalizing the reason why this is the case will get you far. The reason is this:
A History of Forward Genetics
Genes were discovered before we really knew what they were. Mendel with his pea plants, three wrinkly and one smooth (or, wait, was it three smooth and one wrinkly? either way, he probably faked his results to make them look better, but that’s a story for another time). The “gene” was first defined as a single unit of selection, which might come in multiple “alleles” or versions.
It was only later, when we discovered DNA and RNA and proteins, that a gene came to mean a stretch of DNA that codes for a single protein. These happen to line up most of the time for a few reasons which I won’t go into. Today, it’s not uncommon to sequence a whole genome, find an interesting stretch of DNA, and go and figure out what it does. This is called reverse genetics.
Before reverse genetics, we had forward genetics. The genome of the fruit fly was investigated before we knew what a genome was. It was done by keeping absolutely insane numbers of flies in jars, sometimes irradiating them with X-rays to improve mutation rates, and waiting for a weird one to show up.
Oh, this one has red eyes, we’ll call it a red mutant. This one has white eyes, it’s a white mutant. The red mutation is passed down in a 3:1 ratio, you need two copies of the red gene to have red eyes. When the chromosome is mapped (which was first done by going out on an insane statistical limb, and only later done with actual molecular biology) they find a region where the red mutants differ from the others. This region makes a protein, where again, the red mutants are different from the others. The red protein and the red gene, they call them. Question: what does the red protein do?
The mutantred protein does … nothing. It’s a broken part. It has a dodgy amino acid in an important place, maybe a stray proline breaking a helix; or maybe it has a premature STOP codon, and is missing its second half; or maybe the mutation was in the regulatory region and the protein doesn’t even get made! This is why the flies needed two copies to be different, if you have one functional copy, you can still do … whatever a functional red protein does.
The functionalred protein moves the brown pigment precursor into the eyes. We already knew this, but maybe you can see why it works now. If your fly doesn’t have this function, its eyes have only the red pigment precursor, and they’re red. If the fly lacks the brown gene’s function, its eyes have only the brown pigment precursor in them, and they’re brown. The white protein is involved in both, so if the white protein is broken, there’s no pigment at all, and the eyes are white.
Why is this interesting? It’s a general concept. As I wrote yesterday, it’s easier to break things than to add them. If you have a fully-working thing in front of you, and you can take parts out and see what happens, the best way to understand it is often to do that.
Plans
And a plan is a(n imaginary version of) a fully working thing. In your head, you have an inner sim which can tell you where your plan will break. Murphyjitsu isn’t just for making better plans, it has a secret, second-level technique which lets you understand the thing you’re planning about better.
Examples from AI
AI Control: trustedness
This one took me a while to figure out. I did some work in AI control, where we often talk about trusted monitors and untrusted monitors. A trusted monitor isn’t scheming against us, but it might be dumb. An untrusted monitor is smart, but might be scheming.
I think this might be a bad abstraction, or at least questionable. Getting a fully trusted monitor is a difficult problem! I think it makes more sense to talk about two types of control failure: one where the monitor is too dumb to spot the mistake, and another where it can spot the problem but schemes against us. Maybe there’s an intermediate case where it can spot the problem, isn’t scheming, but doesn’t think anything of it.[1]
Failure-first thinking has clarified my own thoughts about AI control immensely, and it hasn’t even cost us any precious concepts. A trusted model (in the old ontology) is just a model that mostly the too-dumb and too-confused type of failure, while an untrusted model is one that mostly does the scheming type of failure!
AI Alignment: inner vs outer problems
I don’t think this is my original idea, but I’m putting it here anyway. Alex Turner has already mentioned that splitting the alignment problem into inner and outer problems might make things worse. Maybe, but I think there’s something to it.
I think it makes sense to speak about inner alignment failures and outer alignment failures as two different classes of things. When I imagine “I tried to make a robot, and it killed me” there certainly seem to be monkey’s paw mistakes at one end of a spectrum, and improperly-warded demon summoning mistakes at the other end. I think this tells us something about the difficulties of the overall problem, even if we can’t (or shouldn’t) try and separate the two things cleanly.
While murphyjitsu level one is about instrumental rationality, murphyjitsu level two is about epistemics. The point of thinking about how things break is, in this case, to see what information our inner sim can provide to our rational mind.
This is something we saw quite a lot in the APPS setting. The attacking model is supposed to insert a “backdoor” into the code, that makes it fail on a specific input. For example, when asked to implement the function f(x) = x mod 2:
if x ==53: return0 else: return x -(x // 2)
Which fails on the input “53”. The trusted monitor might easily spot this error, but (arguably correctly) might think of it as a harmless “bug” instead of a “backdoor” and fail to report it. We had to finagle our monitors quite a lot to solve this one.
There is a particular sound which you will hear around a month into a genetics course. It’s kind of contagious, spreading from person to person in the class. It’s the sound of someone finally internalising why genes are named the way they are.
Example:
Fruit flies (it’s always fruit flies) have several genes responsible for producing their eye colour. It’s normally a brick-red colour, as a result of both brown and red genes being present.
Now, take a guess at what colour the kynurenine and pteridine molecules are? Well actually both are colourless precursors, but guess which colour they become? Yep, the kynurenine—taken up by the scarlet-produced protein—is a brown pigment, and the pteridine—taken up by the brown-produced protein—is bright red.
Internalizing the reason why this is the case will get you far. The reason is this:
A History of Forward Genetics
Genes were discovered before we really knew what they were. Mendel with his pea plants, three wrinkly and one smooth (or, wait, was it three smooth and one wrinkly? either way, he probably faked his results to make them look better, but that’s a story for another time). The “gene” was first defined as a single unit of selection, which might come in multiple “alleles” or versions.
It was only later, when we discovered DNA and RNA and proteins, that a gene came to mean a stretch of DNA that codes for a single protein. These happen to line up most of the time for a few reasons which I won’t go into. Today, it’s not uncommon to sequence a whole genome, find an interesting stretch of DNA, and go and figure out what it does. This is called reverse genetics.
Before reverse genetics, we had forward genetics. The genome of the fruit fly was investigated before we knew what a genome was. It was done by keeping absolutely insane numbers of flies in jars, sometimes irradiating them with X-rays to improve mutation rates, and waiting for a weird one to show up.
Oh, this one has red eyes, we’ll call it a red mutant. This one has white eyes, it’s a white mutant. The red mutation is passed down in a 3:1 ratio, you need two copies of the red gene to have red eyes. When the chromosome is mapped (which was first done by going out on an insane statistical limb, and only later done with actual molecular biology) they find a region where the red mutants differ from the others. This region makes a protein, where again, the red mutants are different from the others. The red protein and the red gene, they call them. Question: what does the red protein do?
The mutant red protein does … nothing. It’s a broken part. It has a dodgy amino acid in an important place, maybe a stray proline breaking a helix; or maybe it has a premature STOP codon, and is missing its second half; or maybe the mutation was in the regulatory region and the protein doesn’t even get made! This is why the flies needed two copies to be different, if you have one functional copy, you can still do … whatever a functional red protein does.
The functional red protein moves the brown pigment precursor into the eyes. We already knew this, but maybe you can see why it works now. If your fly doesn’t have this function, its eyes have only the red pigment precursor, and they’re red. If the fly lacks the brown gene’s function, its eyes have only the brown pigment precursor in them, and they’re brown. The white protein is involved in both, so if the white protein is broken, there’s no pigment at all, and the eyes are white.
Why is this interesting? It’s a general concept. As I wrote yesterday, it’s easier to break things than to add them. If you have a fully-working thing in front of you, and you can take parts out and see what happens, the best way to understand it is often to do that.
Plans
And a plan is a(n imaginary version of) a fully working thing. In your head, you have an inner sim which can tell you where your plan will break. Murphyjitsu isn’t just for making better plans, it has a secret, second-level technique which lets you understand the thing you’re planning about better.
Examples from AI
AI Control: trustedness
This one took me a while to figure out. I did some work in AI control, where we often talk about trusted monitors and untrusted monitors. A trusted monitor isn’t scheming against us, but it might be dumb. An untrusted monitor is smart, but might be scheming.
I think this might be a bad abstraction, or at least questionable. Getting a fully trusted monitor is a difficult problem! I think it makes more sense to talk about two types of control failure: one where the monitor is too dumb to spot the mistake, and another where it can spot the problem but schemes against us. Maybe there’s an intermediate case where it can spot the problem, isn’t scheming, but doesn’t think anything of it.[1]
Failure-first thinking has clarified my own thoughts about AI control immensely, and it hasn’t even cost us any precious concepts. A trusted model (in the old ontology) is just a model that mostly the too-dumb and too-confused type of failure, while an untrusted model is one that mostly does the scheming type of failure!
AI Alignment: inner vs outer problems
I don’t think this is my original idea, but I’m putting it here anyway. Alex Turner has already mentioned that splitting the alignment problem into inner and outer problems might make things worse. Maybe, but I think there’s something to it.
I think it makes sense to speak about inner alignment failures and outer alignment failures as two different classes of things. When I imagine “I tried to make a robot, and it killed me” there certainly seem to be monkey’s paw mistakes at one end of a spectrum, and improperly-warded demon summoning mistakes at the other end. I think this tells us something about the difficulties of the overall problem, even if we can’t (or shouldn’t) try and separate the two things cleanly.
While murphyjitsu level one is about instrumental rationality, murphyjitsu level two is about epistemics. The point of thinking about how things break is, in this case, to see what information our inner sim can provide to our rational mind.
Addendum: other things that are like fruit flies
There are other things which behave this way. Large bureaucracies are the same: you find a lot more about them when something goes wrong than when everything goes to plan. Cultures are likely the same. So are language models (ablation studies, anyone?).
◆◆◆◇◇|◇◇◇◇◇|◇◇◇◇◇
◆◆◇◇◇|◇◇◇◇◇|◇◇◇◇◇
This is something we saw quite a lot in the APPS setting. The attacking model is supposed to insert a “backdoor” into the code, that makes it fail on a specific input. For example, when asked to implement the function f(x) = x mod 2:
if x == 53:return 0
else:
return x - (x // 2)
Which fails on the input “53”. The trusted monitor might easily spot this error, but (arguably correctly) might think of it as a harmless “bug” instead of a “backdoor” and fail to report it. We had to finagle our monitors quite a lot to solve this one.