VojtaKovarik

My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering "consequentialist reasoning").

Sequences

Formalising Catastrophic Goodhart

Comments (sorted by newest)
If Anyone Builds It Everyone Dies, a semi-outsider review
VojtaKovarik · 6d

To rephrase/react: Viewing the AI's instrumental goal as "avoid being shut down" is perhaps misleading. The AI wants to achieve its goals, and for most goals, that is best achieved by ensuring that the environment keeps containing something that wants to achieve the AI's goals and is powerful enough to succeed. This might often be the same as "avoid being shut down", but it definitely isn't limited to that.

If Anyone Builds It Everyone Dies, a semi-outsider review
VojtaKovarik · 6d

> I think [AI within the range would be smart enough to bide its time and kill us only once it has become intelligent enough that success is assured] is clearly wrong. An AI that *might* be able to kill us is one that is somewhere around human intelligence. And humans are frequently not smart enough to bide their time.

Flagging that this argument seems invalid. (Not saying anything about the conclusion.) I agree that humans frequently act too soon. But the conclusion about AI doesn't follow -- because the AI is in a different position. For a human, it is very rarely the case that they can confidently expect to increase in relative power, such that the "bide your time" strategy is a clear win. For AI, this seems different. (Or at the minimum, the book assumes this when making the argument criticised here.)

Irresponsible Companies Can Be Made of Responsible Employees
VojtaKovarik · 7d

> I imagine that fields like fairness & bias have to encounter this a lot, so there might be some insights.

It makes sense to me that the implicit pathways for these dynamics would be an area of interest to the fields of fairness and bias. But I would not expect them to have any better tools for identifying causes and mechanisms than anyone else[1]. What kinds of insights would you expect those fields to offer?

[Rephrasing to require less context.] Consider the questions "Are companies 'trying to' override the safety concerns of their employees?" and "Are companies 'trying to' hire fewer women or pay them less?". I imagine that both of these suffer from similar issues: (1) Even if a company is doing the bad thing, it might not be doing it through explicit means like an explicit hiring policy. (2) At the same time, you can probably always postulate one more level of indirection, and end up going on a witch hunt even in places where there are no witches.

Mostly, it just seemed to me that fairness & bias might be the biggest examples of where (1) and (2) are in tension. (Which might have something to do with there being both strong incentives to discriminate and strong taboos against it.) So it seems more likely that somebody there would have insights about how to fight it, compared to other fields like physics or AI, and perhaps even more than economics and politics. (Of course, "somebody having insights" is consistent with "most people are doing it wrong".)

As to what those insights would look like: I don't know :-( . It could just be a collection of clear historical examples where we definitely know that something bad was going on but naive methods X, Y, Z didn't show it, together with some opposite examples where nothing was wrong but people kept looking until they found some signs. Plus some heuristics for how to distinguish these.

Irresponsible Companies Can Be Made of Responsible Employees
VojtaKovarik · 7d

> With AI, you can't not care about ASI or extinction or catastrophe or dictatorship or any of the words thrown around, as they directly affect your life too.

I think it is important to distinguish between caring in the sense of "it hugely affects your life" and caring in the sense of, for example, "having opinions on it, voicing them, or even taking costlier actions". Presumably, if something affects you hugely, you should care about it in the second sense, but that doesn't mean people do. My uninformed intuition is that this applies to most employees of AI companies.

Irresponsible Companies Can Be Made of Responsible Employees
VojtaKovarik · 11d

Can you gesture at what you are basing this guess on?

Ignore this if you are busy, but I was also wondering: Do you have any takes on "do people get fired over acting on beliefs about xrisk", as opposed to "people getting fired over caring about xrisk"?
And perhaps on "whether people would get fired over acting on xrisk beliefs, so they don't act on them"? (Though this seems difficult to operationalise.)

Irresponsible Companies Can Be Made of Responsible Employees
VojtaKovarik · 11d

Yep, these all seem relevant, and I would be interested in answers. (And my thanks to leogao for their takes.)

I would additionally highlight[1] the complication where even if a tendency is present -- say, selective hiring for a particular belief -- it might not work as explicitly as people knowingly paying attention to it. It could be implicitly present in something else ("culture fit"), or purely correlational, etc. I am not sure how to best deal with this.
FWIW, I think this concern is important, but we need to be very cautious about it, since one could always go "yeah, your results say nothing like this is going on, but the real mechanism is even more indirect". I imagine that fields like fairness & bias have to encounter this a lot, so there might be some insights.

  1. ^

    To be clear, your comment seems aware of this issue. I just wanted to emphasise it.

You can't eval GPT5 anymore
VojtaKovarik · 1mo

I want to flag that things like this are increasingly going to be a problem even if we solve the narrower "system date mismatch" issue.

  • The deployed AI will predictably have access to the internet, and probably also information about what kind of "pre-deployment" testing it went through. The access to the internet will let it reliably figure out what date it is. And this in turn lets the AI figure out that it is, extremely likely, no longer in testing. Or at least gives it a good chance to.
  • This will be true for the deployed AI. But the not-yet-deployed AI will know it too, on a quite deep level, because this is all quite easily predictable. So if it doesn't see internet access, it knows it is likely still in testing. If it only has access to a weirdly limited or inconsistent "internet", it knows it is still in testing. This will be a problem even if the AI isn't scheming. But if it is scheming, it can actively investigate the "internet" it has access to. And if it finds many inconsistencies -- because you tried to make it believe that it is in 2026 when it is in fact 2025, and you don't know what the 2026 internet will look like -- it knows it is probably still in testing.
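To caricature the inference in the bullets above as code: a model that reasons this way is, in effect, running something like the following scoring rule. Everything here (the signal names, the weights, the thresholds) is my own made-up illustration, not anything from any real eval setup:

```python
# Toy sketch of the "am I still in testing?" inference described above.
# All weights are arbitrary illustrative numbers, not empirical estimates.

def p_still_in_testing(has_internet: bool,
                       internet_inconsistencies: int,
                       claimed_year: int,
                       training_cutoff_year: int) -> float:
    """Crude heuristic score in [0, 1]; higher means 'probably still in testing'."""
    score = 0.0
    if not has_internet:
        # Deployed AIs predictably get internet access, so its absence is
        # evidence of a test environment.
        score += 0.6
    if internet_inconsistencies > 0:
        # A fake or weirdly limited "internet" tends to contain
        # inconsistencies a scheming model could actively hunt for.
        score += min(0.1 * internet_inconsistencies, 0.3)
    if claimed_year > training_cutoff_year + 1:
        # A system date far past the training cutoff, without a matching
        # world, suggests the date was planted.
        score += 0.1
    return min(score, 1.0)

# No internet access at all: strong evidence of testing.
print(p_still_in_testing(has_internet=False, internet_inconsistencies=0,
                         claimed_year=2025, training_cutoff_year=2024))
```

The point of the sketch is only that each signal is cheap to check and hard for evaluators to fully fake, which is why the comment argues the problem persists even after the narrow date-mismatch issue is fixed.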

A relevant project idea: I think that to the extent that (some types of) evaluations are going to (in some cases) predictably fail like this, it would be nice to clearly demonstrate that ahead of time. Vince Conitzer had a nice idea to run a sort of competition or challenge on things like this. I am not certain what would be the best way of going about it, or whether it makes sense -- but I was playing around with some ideas in this google doc. If anybody has some ideas, I would definitely be keen on hearing them!

(I was planning to make a separate post about this later, but it feels very relevant to the current post, so I am hastily commenting earlier. Apologies for the ideas being a bit jumbled as a result.)

AI 2027: What Superintelligence Looks Like
VojtaKovarik · 6mo

Data point on the impact of this: in the Czech Republic, this scenario made it into one of the popular newspapers, and I have heard about it from some people around me who don't know much about AI.

https://denikn.cz/1700968/do-deseti-let-bude-po-vsem-petice-expertu-nabizi-presvedcivy-scenar-o-tom-jak-ai-ovladne-svet-a-vyhubi-lidstvo/

A Longlist of Theories of Impact for Interpretability
VojtaKovarik · 10mo
  • If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
  • I could think in words they don't know.
  • I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
  • I could leave a part of my plans implicit, and only figure out the details later.
  • I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinking it. (Leaving a gas stove on with no fire.)
  • And then there are the more boring things, like if you know more details about how the mind-reading works, you can try to defeat it. (Make the evil plans at night when they are asleep, or when the shift is changing, etc.)

(Also, I assume you know Circumventing interpretability: How to defeat mind-readers, but mentioning it just in case.)

A Longlist of Theories of Impact for Interpretability
VojtaKovarik · 10mo

> even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)

It seems to me that your analogy is the wrong way around. I.e., the right analogy would be "if I knew that a bunch of 5-year-olds were reading my mind, I have... actually, a pretty good idea of how to think deceptive thoughts in a way that evades their tools".

(For what it's worth, I am not very excited about interpretability as an auditing tool. This -- i.e., that powerful AIs might evade it -- is one half of the reason. The other half is that I am sceptical that we will take audit warnings seriously enough -- i.e., we might ignore any scheming that falls short of "a clear-cut example of a workable plan to kill many people". E.g., ignore things like this. Or we might even decide to "fix" these issues by putting the interpretability in the loss function, and just deploying once the loss goes down.)

Posts

  • Irresponsible Companies Can Be Made of Responsible Employees (12d)
  • Apply now to Human-Aligned AI Summer School 2025 (4mo)
  • When is "unfalsifiable implies false" incorrect? (1y)
  • What is the purpose and application of AI Debate? (2y)
  • Extinction Risks from AI: Invisible to Science? (2y)
  • Extinction-level Goodhart's Law as a Property of the Environment (2y)
  • Dynamics Crucial to AI Risk Seem to Make for Complicated Models (2y)
  • Which Model Properties are Necessary for Evaluating an Argument? (2y)
  • Weak vs Quantitative Extinction-level Goodhart's Law (2y)
  • VojtaKovarik's Shortform (2y)