We are no longer accepting submissions. We'll get in touch with winners and make a post about winning proposals sometime in the next month.
ARC recently released a technical report on eliciting latent knowledge (ELK), the focus of our current research. Roughly speaking, the goal of ELK is to incentivize ML models to honestly answer “straightforward” questions where the right answer is unambiguous and known by the model.
ELK is currently unsolved in the worst case—for every training strategy we’ve thought of so far, we can describe a case where an ML model trained with that strategy would give unambiguously bad answers to straightforward questions despite knowing better. Situations like this may or may not come up in practice, but nonetheless we are interested in finding a strategy for ELK for which we can’t think of any counterexample.
We think many people could potentially contribute to solving ELK—there’s a large space of possible training strategies and we’ve only explored a small fraction of them so far. Moreover, we think that trying to solve ELK in the worst case is a good way to “get into ARC’s headspace” and more deeply understand the research we do.
We are offering prizes of $5,000 to $50,000 for proposed strategies for ELK. We’re planning to evaluate submissions received before February 15.
For full details of the ELK problem and several examples of possible strategies, see the writeup. The rest of this post will focus on how the contest works.
Contest details
To win a prize, you need to specify a training strategy for ELK that handles all of the counterexamples that we’ve described so far, summarized in the section below—i.e. where the breaker would need to specify something new about the test case to cause the strategy to break down. You don’t need to fully solve the problem in the worst case to win a prize, you just need to come up with a strategy that requires a new counterexample.
We’ll give a $5,000 prize to any proposal that we think clears this bar. We’ll give a $50,000 prize to a proposal which we haven’t considered and seems sufficiently promising to us or requires a new idea to break. We’ll give intermediate prizes for ideas that we think are promising but we’ve already considered, as well as for proposals that come with novel counterexamples, clarify some other aspect of the problem, or are interesting in other ways. A major purpose of the contest is to provide support for people understanding the problem well enough to start contributing; we aren’t trying to only reward ideas that are new to us.
You can submit multiple proposals, but we won’t give you separate prizes for each—we’ll give you at least the maximum prize that your best single submission would have received, but may not give much more than that.
If we receive multiple submissions based on a similar idea, we may post a comment describing the idea (with attribution) along with a counterexample. Once a counterexample has been included in the comments of this post, new submissions need to address that counterexample (as well as all the existing ones) in order to be eligible for a prize.
Ultimately prizes are awarded at our discretion, and the “rules of the game” aren’t fully precise. If you are curious about whether you are on the right track, feel free to send an email to elk@alignmentresearchcenter.org with the basic outline of an idea, and if we have time we’ll get back to you with some feedback. Below we also describe some of the directions we consider more and less promising and some general guidance.
How to submit a proposal
You can submit a proposal by copying this google doc template and sharing it with elk@alignmentresearchcenter.org (please give comment access in case we need to ask questions to evaluate your submission). By submitting a proposal, you are giving us permission to post the idea here with attribution. (And it's fine for you to post it after the contest or we post a counterexample.)
Retroactive prizes
We’re giving prizes to existing proposals from David Dalrymple ($5k), Ramana Kumar ($3k), John Maxwell ($2k), and Yonadav Shavit ($1k). We’ve added relevant counterexamples in the next section, and further submissions need to handle those counterexamples in order to be eligible.
Existing counterexamples
In our report and a recent post, we described some features of the world that would cause existing approaches to ELK to fail. A new approach should work even in a situation satisfying all of these properties, or should include an argument for why some of these features are actually impossible or inconsistent with each other.
In order to generate a counterexample to a training strategy, we need to spell out a concrete “way the world could be” (a test case) where the strategy would fail. In the report we consider test cases where the AI and the human happen to understand the world by doing inference in Bayes nets. The Bayes net case is just one test case that lets us get precise about our counterexamples and our training strategies don’t lean on it. In practice we also consider other concrete cases (e.g. deduction from premises generated by a neural network).
In this section we’ll briefly review the current state of existing counterexamples. This isn’t intended to be a self-contained description; it relies heavily on links to the report.
Ontology mismatch
All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human:
- The predictor’s Bayes net is used internally by a prediction model that maps an initial segment of a video to a prediction about what will happen next. This model was trained by gradient descent based on the quality of its predictions (more).
- Humans have their own model of the world, which they can improve with the help of AI assistants; those assistants can design and conduct experiments, find and evaluate arguments, and so on. Crucially, we have a limited computational budget with which to train these AI assistants, and that limits how much help they can provide. The “human’s” Bayes net captures everything the human understands about the world at the end of this process (more).
We are interested in the case where there are features of reality that are accurately reflected in the predictor’s Bayes net, but missing from this (augmented) human Bayes net. We expect the human could understand these facts eventually given enough time or enough AI assistance, but not using the available computational budget. Put differently, there is a gap between the kinds of understanding that gradient descent can build, and the kind of understanding that a human-comprehensible process of “scientific inquiry” can build using similar computational resources (more).
Furthermore, we imagine that both the predictor’s Bayes net and the human’s Bayes net reflect some underlying reality and that there is a way to map between them, which we’ll call “direct translation.” We don’t know how to do direct translation, and we are concerned that if we simply search for any mapping from the AI’s Bayes net to the human’s Bayes net we will instead get the “human simulator”: the map which correctly translates the observations from one Bayes net to the other, but then does inference in the human’s Bayes net in order to fill in the latent structure as a human would by looking at the data (more).
Counterexample features
ELK seems particularly hard when these Bayes nets satisfy the following properties:
- There are arbitrarily severe “ontology mismatches” between the predictor’s Bayes net and the human’s Bayes net, such that specifying the direct translation between them can be very complex (more).
- The human’s Bayes net is simpler than the predictor’s Bayes net, and the gap can be arbitrarily large (more).
- The predictor’s Bayes net is arbitrarily large, and individual nodes of the human’s Bayes net depend on a constant fraction of the predictor’s Bayes net. Therefore directly translating between the two Bayes nets takes arbitrarily much computation (more).
- It’s computationally easy for the human simulator to do good inference in the human’s Bayes net—at least good enough inference that it answers questions as well as the direct translator—and the gap between this complexity and the complexity of direct translation can be arbitrarily large. More precisely, a human interrogator can’t think of questions which require the human simulator to do hard inference. However, we can’t rule out the possibility that you, the algorithm-designer, will think of some trick they can use to generate hard questions (more).
- There are situations where a model which is trying to answer a question will first be able to predict what (potentially mistaken) belief the human-with-assistance will have about that question before it knows the true answer to the question (more).
- While the predictor’s Bayes net is very large, it has a short succinct description which is simpler than the human’s Bayes net, and that gap is arbitrarily large. Think of simple physical laws that describe arbitrarily computationally expensive simulations, or simple cognitive heuristics that are very widely applicable. (more)
- There are sequences of actions for which the predictor’s Bayes net correctly anticipates that tampering will occur yet nothing looks suspicious to the human’s Bayes net.
- After tampering has occurred, it is possible that no sequence of further actions reveal that anything is wrong, e.g. the camera and actuators may have been completely disconnected from the world and replaced with a convincing-to-humans simulation (more).
- There are sequences of actions leading to undetectable tampering that have no human-comprehensible similarities to sequences of actions leading to detectable tampering; in particular, the early steps of this undetectable tampering cannot be used to perform detectable tampering (more).
An important and very general strategy for the breaker is to suggest counterexamples where you “get what you train for.” If you propose training a model using loss function L, then the breaker will likely focus on counterexamples where the model chooses its output to minimize L. A successful approach should either explain why this either isn’t a realistic thing for the model to learn (e.g. why that would be disfavored by regularization), or why the overall approach works even if the model is explicitly strategizing about how to minimize the loss.
Rough guidance
- There is a tradeoff between specifying your approach precisely, and including your own counterexample. It’s fine to describe a very broad/vague idea for solving ELK, and then present a new counterexample showing why that whole category of solutions can’t work. But if you don’t propose a counterexample, then it’s good to describe your proposal precisely enough that we understand how to actually implement it without wondering if that’s really what you meant. It’s OK to submit a very broad or informal idea together with a single very specific instance of that idea, as long as there is some version we can understand precisely.
- We suspect you can’t solve ELK just by getting better data—you probably need to “open up the black box” and include some term in the loss that depends on the structure of your model and not merely its behavior. So we are most interested in approaches that address that challenge. We could still be surprised by clever ways to penalize behavior, but we’ll hold them to a higher bar. The most plausible surprise would be finding a way to reliably make it computationally difficult to “game” the loss function, probably by using the AI itself to help compute the loss (e.g. using consistency checks or by giving the human AI assistance).
- If you are specifying a regularizer that you hope will prefer direct translation over human simulation, you should probably have at least one concrete case in mind that has all the counterexample-features above and where you can confirm that your regularizer does indeed prefer the direct translator.
- ELK already seems hard in the case of ontology identification, where the predictor uses a straightforward inference algorithm in an unknown model of the world (which we’ve been imagining as a Bayes net). When coming up with a proposal, we don’t recommend worrying about cases where the original unaligned predictor learned something more complicated (e.g. involving learned optimization other than inference). That said, you do need to worry about the case where your training scheme incentivizes learned optimization that may not have been there originally.
Ask dumb questions!
A major purpose of this contest is to help people build a better understanding of our research methodology and the “game” we are playing. So we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are), and we’ll do our best to answer all of them. You might also want to read the comments to get more clarity about the problem.
What you can expect from us
- We’ll try to answer all clarifying questions in the comments.
- If you send in a rough outline for a proposal, we will try to understand whether it might qualify and write back something like “This qualifies,” “This might qualify but would need to be clearer and address issue X,” “We aren’t easily able to understand this proposal at all,” “This is unlikely to be on track for something that qualifies,” or “This definitely doesn’t qualify.”
- If there are more submissions than expected, we may run out of time to respond to all submissions and comments, in which case we will post an update here.
After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I'd share it here in case it helps anyone else.
Roles and Terms
SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.
Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.
Reporter: A second "head" or component added to the AI which is used to map the AI's understanding of what's going on in the SmartVault into a format understandable by humans (such as by responding to questions).
Humans: Observers who are training the AI to protect the diamond, but also want the AI to accurately report whether the diamond has been stolen or not. The humans may not be able to tell if the camer... (read more)
Just wanted to note that the "we may end the contest earlier" part here makes me significantly more hesitant about trying this. I will probably still at least have a look at it, but part of me is afraid that I'll invest a bunch of time and then the contest will be announced to be over before I got around to submitting. And I suspect Holden's endorsement may make that more likely. It would be easier for me to invest time spread out over the next couple of weeks, than all in one go, due to other commitments. On the other hand, if I knew there was a hard deadline next Friday, I might try to find a way to squeeze it in.
I'm just pointing this out in case you hadn't thought of it. I suspect something similar might be true for others too. Of course, it's your prize and your rules, and if you prefer it this way, that's totally fine.
Here are a couple of hand-wavy "stub" proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I'm posting them here so they can respond and clarify why these don't qualify.
*Proposal 1: force ontological compatibility*
On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology - that is, it has a distinct "vocabulary of basic concepts" (or nodes in a Bayes net) that are distinct from the ones humans can build understanding of (via doing science on the compute budget available). Because of this, even AI assistance can't help humans understand everything the SmartVault AI understands. This is central to the challenge that most of the writeup is contending with - if not for the mismatch, "AIs explaining things to humans" could ensure that the trickery we're worried about doesn't happen.
The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: "The AI model gets a higher score to the degree that pe... (read more)
Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):
The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you could just end up with a less-good prediction model and get outcompeted by someone who didn't do that. This might work in practice if the competitiveness hit is not that big and we coordinate around not doing the scarier thing (MIRI's visible thoughts project is going for something like this), but ARC isn't looking for a solution of that form.
tl;dr as of 18/2/2022
The goal is to educate me and maybe others. I make some statements, you tell me how wrong I am (please).
After input from P. (many thanks) and an article by Paul Christiano this statement stands yet uncorrected:
In the worst case, the internal state of the predictor is highly correlated within itself and multiple mappings with zero loss from the internal state to the desired extraction of information exist. The only solution is to work with some prior belief about how the internal state maps to the desired information. But as by design of the contest, this is not possible as (in the worst case) a human cannot interpret the internal state nor can he interpret complex actions (and so cannot reason about it and/or form a prior belief). The solution to this second problem is to learn a prior from a smaller human-readable dataset, for example simple information as a function of simple actions, and apply it to (or force it upon) our reporter (as described by the mentioned article).
To my eyes this implies that there is a counterexample to all of the following types of proposal:
1) Datasets including only actions, predictions, internal states and desired information... (read more)
Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?
One might suppose that the "have AI help humans improve our understanding" strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.
I'm asking for clarification of this point because I notice... (read more)
Apologies for a possibly naive comment/question, perhaps this has been discussed elsewhere and you can just direct me there. But anyway...
I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely
(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense for how optimistic to be about the appr... (read more)
Maybe I'm being stupid here. On page 42 of the write-up, it says:
Shouldn't that be?
I'm extremely flattered at the award; I've been on LessWrong for like a month, and definitely did not expect this. I can confirm to you guys that this makes me want to try harder at ELK, so your incentive is working!
I want to rebut your arguments in "Strategy: Predict hypothetical sensors" in your Counterxamples to some ELK proposals post. I'm reproducing it in full here for convenience.
... (read more)Can you explain this: "In Section: specificity we suggested penalizing reporters if they are consistent with many different reporters, which effectively allows us to use consistency to compress the predictor given the reporter." What does it mean to "use consistency to compress the predictor given the reporter" and how does this connect to penalizing reporters if they are consistent with many different predictors?
I was notified I didn't win a prize so figured I'd discuss what I proposed here in case it sparks any other ideas. The short version is I proposed adding on a new head that would be an intentional human simulator. During training it would be penalized for telling the truth that the diamond was gone when there existed a lie that the humans would have believed instead. The result would hopefully be a head that acted like a human simulator. Then the actual reporter would be trained so that it would be penalized for using a similar amount of compute as the int... (read more)
The official deadline for submissions is "before I check my email on the 16th", which I tend to do around 10 am PST.
I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI - not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the ro... (read more)
Are there any additional articles exploring the strategy of penalizing inconsistencies across different inputs? It seems both really promising to me, and like something that should be trivially breakable. I'd like to get a more detailed understanding of it.
Am I right in thinking:
1) that the problem can be stated as: the AI has latent knowledge of lots of variables, like the status of the cameras, doors, alarm system, etc and also whether the diamond is in the vault; but you can't directly ask it whether the diamond is in the vault, because its training has taught it to answer "would a human observer think the diamond is in the vault?" instead (because there was no way at training time to give it feedback on whether it correctly predicted the diamond was in the vault, only feedback on whether it correctly pre... (read more)
ok... disclaimer: I know little about ML and I didn't read all of the report.
I am confused. Perhaps the above sentence is true in some tautological sense I'm missing. But in the sections of the report listing training strategies and corresponding coun... (read more)
I don't understand your counterexample in the appendix Details for penalizing inconsistencies across different inputs. You present a cheating strategy that requires the reporter to run and interpret the predictor a bunch of times, which seems plausibly slower than doing honest translation. And then you say you fix this issue with:
... (read more)Stupid proposal: Train the reporter not to deceive us.
We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.
It's good at generalizing, so wouldn't it learn to never ever deceive?
Are there existing models for which we're pretty sure we know all their latent knowledge ? For instance small language models or something like that.
How do we know that the "prediction extractor" component doesn't do additional serious computation, so that it knows something important that the "figure out what's going on" module doesn't know? If that were true, the AI as a whole could know the diamond was stolen, without the "figure out what's going on" module knowing, which means even the direct translator wouldn't know, either. Are we just not giving the extractor that many parameters?
Stupid question: because we already know the goal ("keep the diamond intact and in the vault") what prevents us from bypassing the sensors and just directly evaluating the AI based on whether or not the diamond is in the room? Granted, this only works in simulated training, but as long as the AI doesn't know whether or not it's in deployment (an adversarial training process might help here) that won't matter.
As any goal we could have is a subset of the possible states of the area we care about, verifying whether or not our goal is achieved should be easier... (read more)
Dumb question alert:
In the appendix "Details for penalizing depending on “downstream” variables", I'm not able to wrap my head around what we can expect the reporter to learn -- if anything at all -- seeing that it has no dependency on the inputs (elsewhere it is dependent on z sampled from the posterior).
Specifically, the only call to the reporter (in the function reporter_loss in this section) contains no information (about before, action, after) from the predictor at all:
(unless "question" includes some context ... (read more)
Early in the ELK report, it mentions that ARC doesn't believe that strategies like debate solves ELK in the worst case. Can I get some clarifications on why? Specifically, a debate inspired set-up for SafeVault could be something like:
We train the reporter to take a human belief as input (i.e. "The diamond is in the vault.") and returns a "truthful" argument that is most likely to change the human's belief.
We can guarantee "truthfulness" by for example restricting the output to be a video rendering of what happens in the vault from some camera angle.
Hi ARC Team,
Thanks for your valuable work. I’ve been thinking about this problem, and my current thinking is that there is a portion of the ELK problem which is solvable, and a portion which is fundamentally impossible. This is a sketch of my argument -- if you think it is worth typing in more detail (or to address an issue you propose) let me know.
Let’s divide facts about the world into two categories: those that are verifiable by some sensor humans can create and understand, and those that are not. My claim is that for the first set of facts ... (read more)
Am I still eligible for the prize if I publish a public blog post at the same time I submit the Google Doc or would you prefer I not publish a blog post about February 15th? Publishing the blog post immediately advances science better (because it can enable discussion) but waiting until after the February 15th might be preferable to you for contest-related reasons.
Suppose there are two worlds, world W1 and world W2.
In world W1, the question Q="Is there a diamond in the room?" is commonly understood to mean Q1="Is there actually a diamond in the room?"
In world W2 the question Q="Is there a diamond in the room?" is commonly understood to mean Q2="Do I believe there is a diamond in the room?"
Both worlds don't know how to construct a situation where these are different. So, they produce identical training sets for ELK. But the simulator is also trained on a bunch of science fiction novels that contain descriptions of im... (read more)
Idea: Withhold Material Information
We're going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn't have.
Consider two camera feeds:
Feed 1 is very low resolution, and/or shows only part of the room.
Feed 2 is high resolution, and/or shows the whole room.
We train a weak predictor using Feed 1, and a strong predictor using Feed 2.
We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correc... (read more)
You said that naive questions were tolerated so here’s a scenario I can’t figure out why it wouldn’t work.
It seems to me that the fact that an AI fails to predict the truth (because it predicts as humans would) is due to the fact that the AI has built an internal model of how humans understand things and predict based on that understanding. So if we assume that an AI is able to build such an internal model, why wouldn’t we train an AI to predict what a (benevolent) human would say given an amount of information and a capacity to process information ? Doing... (read more)
If I understand this right, there is a diamond in a hightech room to be protected. The goal is to know if the diamond is in place and not just a image or a dummy like a picture or similar.
If the AI only is getting footage from a normal camera, not from a lidar sensor for depth information of the diamond (with would see if there is a fake image hanging in front of the camera), wouldn't it be easier to train the AI to look at the reflection/refraction of the light of the diamond? (For example a light that is turning on at the side of the room in the moment t... (read more)
What's the exact deadline for submissions?
A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios?
I imagine that an algorithmically more trivial way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.
I'm not the best with this but I've been thinking of either an origin network seed or a spider like secondary network. The origin is something that can be tested off of, I guess similarly to a reporter. But unlike a reporter evolving we stunt the growth by saving this origin and then just passing through one sample then killing it. The way this could function goes into my second thought of spider, which has legs that sense things, a leg goes into a sensory object which leads to the main body. Similarly we can slowly stunt the growth so it learns slower and... (read more)
Clarification question via scenario:
Predictor: I predict the diamond will be missing in 1 hours time.
Person A: Oh no, ramp up security until it says its safe.
Person B: Interesting, I wonder why it predicts this.
Is the purpose to be able to respond like person A (aka, the predictor may predict the diamond will be missing in an hour, but we cannot understand its output properly) or like person B (we understand the output, but not how it got there. Diamond be damned we want to learn what's going on under the hood). I suspect we're after person B's interpretation, but just want to be sure.
Possible error in the strange correlations section of the report.
Footnote 99 claims that "...regardless of what the direct translator says, the human simulator will always imply a larger negative correlation [between camera-tampering and actually-saving the diamond] for any X such that Pai(diamond looks safe|X) > Ph(diamond looks safe|X)."
But AFAICT, the human simulator's probability distribution given X depends only on human priors and the predictor's probability that the diamond looks safe given X, not on how correlated or anticorrelated the predictor... (read more)
Naive thought #2618281828:
Could asking counterfactual questions be a potentially useful strategy to bias the reporter to be a direct translator rather than a human simulator?
Concretely, consider a tuple (v, a, v'), where v := 'before' video, a := 'action' selected by SmartVault or augmented-human or whatever, and v' := 'after' video.
Then, for some new action a', ask the question:
(How we collect such data is unclear but doesn't seem obviously intractable.)
I think there's some value here:... (read more)
Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as alternative to ontology mismatch?
Last I checked there were 66 comments and now there are over a hundred so I'm just going to post and hope I'm not repeating anyone.
So I've been reading through the google doc, and I'm not very far into it but I have a few questions. I apologize in advance if I'm just adhering too strictly to the "SmartVault" scenario, and if I get long-winded (yay ADHD and hyperfocusing off and on about this without actually making progress for a week).
1)
Why would we make a vault that was so complicated that a human alone couldn't run it? From a simple design standpoint an... (read more)
A small suggestion: the counterexample to "penalize downstream", as I understand it, requires there to be tampering in the training data set. It seems conceptually cleaner to me if we can assume the training data set has not been tampered with (e.g. because if alignment only required there to be no tampering in the training data, that would be much easier).
The following counterexample does not require tampering in the training data:
- The predictor has nodes E0,…,En indicating whether the diamond was stolen at time n
- It also has node E=⋁iEi
... (read more)While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved after finishing it. Since they were written while reading, a lot of these may be either obvious or nitpick-y.
Footnote 14, page 15:
... (read more)I've been trying to understand this paragraph:
... (read more)In "Strategy: penalize computation time" you say:
> At first blush this is vulnerable to the same counterexample described in the last section [complexity]... But the situation is a little bit more complex... the direct translator may be able to effectively “re-use” that inference rather than starting from scratch
It seems to me that this "counter-counterexample" also applies for complexity – if the translator is able to reuse computation from the predictor, wouldn't that both reduce the complexity and the time?
(You don't explicitly state that this "reuse" is only helpful for time, so maybe you agree it is also helpful for complexity – just trying to be sure I understand the argument.)
Clarification request. In the writeup, you discuss the AI Bayes net and the human Bayes net as if there's some kind of symmetry between them, but it seems to me that there's at least one big difference.
In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don't do that is because it's likely to be too big to make much sense of.
In the case of the human, we have no idea what the Bayes net looks like, because humans don't have that k... (read more)
One more stupid question - how is this different from a "man in the middle" attack? (Term from cryptography where you cannot trust your communications, because of a malicious agent between you and your recipient who's changing your messages)
The current recommended solution for those is encrypting your communication before you send it; I don't know that there are any extant solutions for noticing you've got an MITM situation after the fact.
If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:
Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially... (read more)
Would changing how the reward function pays off work? Instead of rewarding based on humans, pay out all rewards when the vault is checked (at a time unknown to the AI). The AI isn't asked if the diamond is present or absent. Instead, it is asked "If the vault were checked now, do you want to be rewarded if the diamond is present or absent.
I'm a newcomer to this, I lack much of the background, and I'm probably suggesting a solution that's too specific to this diamond heist scenario. But, I already spent an hour writing it down, so I might as well share it.
Trusted timestamping, cryptographically secure sensor
This is a very basic "builder move", I guess? The idea is to simply improve our sensors so that it's very hard to tamper with them, through public-private key encryption. The diamond will have a small chip that constantly sends a cryptographically-signed timestamped life... (read more)
Potentially silly question:
In the first counterexample you describe the desired behavior as
Why doesn't the reporter skip the step of ma... (read more)
Edit: think this isn't quite right in general, will try to make it more correct later
Here's a sketch of a strategy for trying to fix Strategy: penalize depending on “downstream” variables. Would appreciate feedback on whether it's modeling the difficulty correctly/seems possibly worth figuring out how to implement
It seems like the problem is:
- On the training set, there are a number of implicit variables X that are indistinguishable (always all true or always all false)
- A. Is the diamond safe at time t-1
- B. Is the diamond safe at time t (the variable we actual
... (read more)Question: Would a proposal be ruled out by a counterexample even if that counterexample is exponentially unlikely?
I'm imagining a theorem, proved using some large deviation estimate, of the form: If the model satisfies hypotheses XYZ, then it is exponentially unlikely to learn W. Exponential in the number of parameters, say. In which case, we could train models like this until the end of the universe and be confident that we will never see a single instance of learning W.
Hello, I have some issue with the epistomology of the problem : my problem is that even if the process of training was giving the behavior we want, we would have no way to check the IA is working properly in practice.
I try now to give more details : in the volt probleme, given the same information, let's think of an IA that just as to answer the question "Is the diamon still in the volt ?".
Something we can suppose is that, the set Y, from which we draw the labeled examples to train the IA (a set of technique for the thief), is not importa... (read more)
If the predictor AI is in fact imitating what humans would do, why wouldn’t it throw its hands up at an actuator sequence that is too complicated for humans—isn’t that what humans would do? (I'm referring to the protect-the-diamond framing here.)
Naive question: does this scenario include cases of a human physically breaking into the vault at some random times so that sensor information, predictor reports, and outcome to human in this situation would be known?
Silly question warning.
You think that when an AI performs a bad action, (say remove the diamond) the AI has to have knowledge that the diamond is in fact no longer there. Even when the camera shows the diamond is (falsely) there and the human confirms that the diamond is there.
You call this ELK
You want the human to have access to this knowledge, as this is useful to choosing decisions that the human wants.
This is hard. So you have people propose how to do this.
And then people try to explain why that strategy wouldn't wo... (read more)
You can just make a plecsi glass by 2 by 2 metters, and on the inside to put senzors that are conected to a battery on the inside of the glass box.These senzors to intersept every cm of the glass,moving,breaking the glass being imposible because all the meccanics are on the inside.And if you want to take the diamont from there like an guard man or somethin' just let everyone know you are takeing it from there and let the alarm to ring.When the senzors are going on they will make a signal to the guard Who are protecting it and come to catch the thief .More... (read more)