I'm interested in doing in-depth dialogues to find cruxes. Message me if you are interested in doing this.
I do alignment research, mostly stuff that is vaguely agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek's team at MIRI.
I think we already see overconfidence in models. See davidad's comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it's reasonable to extrapolate from current models and say that future models will be overconfident by default.
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is currently missing online learning. When I add that in, the extrapolation seems much less reasonable to me.
This seems probable with online learning, but not necessarily always the case. It's also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
I assumed that you weren't talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill for most kinds of research, so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
Like, yes, of course overconfidence is something that would get fixed eventually, but it's not clear to me that it will be fixed before it's too late.
I'm primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
(i.e., you can still build ASI with an overconfident AI)
I think you're saying this a tad too confidently. Overconfidence should slow an AI down in its research, causing it to invest too much in paths that won't work out, over and over again. It's possible it would still succeed, and it's a matter of degree how overconfident it is, but this could be an important blocker to being capable of effective research and development.
Yes, I agree with that.
I don't think I am retreating to a motte.
My read was:
JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes
But this was misdirection: we are arguing about how surprised we should be when a competent agent doesn't learn a very simple lesson after making the mistake several times. Optimality is a misdirection; the thing you're defending is extreme sub-optimality, and the thing I'm arguing for is human-level ability to correct mistakes.
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
I agree that there are plausibly domains where a minimal viable scary agent won't be epistemically efficient with respect to humans. I think you're overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that's my main disagreement.
But engaging directly: all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from lots of types of experiment. A more plausible thing to expect is low-quality predictions in low-data domains, not general overconfidence across low- and high-data domains.
I assume you're talking about this one?
No, I meant this one:
I don't think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the "first AI smart enough" leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn't help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won't be long before there are more capable AIs and c) it's hard to predict future capability profiles.
- The minimal viable scary agent is in fact scary.
- It doesn't need to be superhuman at everything to be scary.
- It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents.
- This is true even if we don't expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
I agree with all of these, so it feels a little like you're engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and me have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn't have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence in domains with little feedback is not what I'm talking about, because it didn't appear to be what Tim was talking about. I'm also not talking about bad predictions in general.
one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don't think that that consideration is overwhelmingly strong.
My argument was that there are several such "risk factors" and that they stack. I agree that each one isn't overwhelmingly strong.
I prefer not to be rude. Are you sure it's not just that I'm confidently wrong? If I were disagreeing in the same tone with, e.g., Yampolskiy's argument for high-confidence AI doom, would this still come across as rude to you?
"Overconfident" gets thrown around a lot by people who just mean "incorrect". Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I'd be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem "underconfident" about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn't real systematic underconfidence, it's just a mistake (from my perspective). And maybe some are "overconfident" that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence.
At no point in this discussion do I reference "limits of intelligence". I'm not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don't involve that kind of mental move. I'm talking about near-human-level intelligence, and the reasoning works for AIs that operate similarly to how they do now.
Without optimally learning from mistakes
You're making a much stronger claim than that and retreating to a motte. Of course it's not optimal. Not noticing very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. This shouldn't seem plausible when we condition on an otherwise low likelihood of making mistakes.
If you look at the most successful humans, they're largely not the most-calibrated ones.
The most natural explanation for this is that it's mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with are more like domain-specific overconfidence than across-the-board overconfidence.
but just because it's not the only useful thing and so spending your "points" elsewhere can yield better results.
I agree calibration is less valuable than other measures of correctness. But there aren't zero-sum "points" to be distributed here. Correcting for systematic overconfidence is basically free and doesn't have tradeoffs: you just take whatever your confidence would be and adjust it down. It can be done on the fly, and it's even easier if you have a scratchpad.
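To make "adjust it down" concrete, here's a minimal sketch of one way to do it, assuming a simple temperature-style shrinkage of log-odds toward 50% (the function name and the temperature of 1.5 are my own illustrative choices, not anything from this thread):

```python
import math

def shrink_toward_half(p: float, temperature: float = 1.5) -> float:
    """Crude correction for systematic overconfidence: divide the
    log-odds by a temperature > 1, pulling the probability toward 0.5.
    The temperature value here is arbitrary and purely illustrative."""
    log_odds = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds / temperature))

# A raw 95% becomes roughly 88%, and a raw 70% becomes roughly 64%.
print(shrink_toward_half(0.95), shrink_toward_half(0.70))
```

The point is just that the correction is a one-line, domain-general adjustment applied after forming the raw estimate; nothing about the underlying reasoning needs to change.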
If you think there's a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk
Yeah, main. I thought this was widely agreed on; I'm still confused by how your shortform got upvoted. So maybe I'm missing a type of x-risk, but I'd appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level dumb is building ASI. But building ASI still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it's not impossible to be systematically overconfident while succeeding at difficult tasks. But it becomes more and more surprising the more subtasks the agent succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that standard online learning or verbalised reasoning wouldn't generalise to it).
I don't think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the "first AI smart enough" leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn't help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won't be long before there are more capable AIs and c) it's hard to predict future capability profiles.
Yes, but what's your point? Are you saying that highly capable (ASI-building, institution-replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
I am confident about this, so I'm okay with you judging accordingly.
I appreciate your rewrite. I'll treat it as something to aspire to, in future. I agree that it's easier to engage with.
I was annoyed when writing. Angry is too strong a word for it, though; it's much more like "Someone is wrong on the internet!" It's a valuable fuel and I don't want to give it up. I recognise that there are a lot of situations that call for hiding mild annoyance, and I'll try to do it more habitually in future when it's easy to do so.
There's a background assumption here that maybe I'm wrong to have. If I write a comment with a tone of annoyance and you disagree with it, it would surprise me if that made you feel bad about yourself. I don't always assume this, but I often assume it on LessWrong because I'm among nerds for whom disagreement is normal.
So overall, my current guess is that you're trying to hold me to standards that are unnecessarily high. They seem supererogatory rather than obligatory.