All of zeshen's Comments + Replies

Strong upvoted. I was a participant in AISC8, on the team that went on to launch AI Standards Lab, which I think counterfactually would not have been launched if not for AISC.

This seems to be another one of those instances where I wish there was a dual-voting system for posts. I would've liked to strong disagree with the contents of the post without discouraging well-intentioned people from posting here.

I feel like a substantial amount of the disagreement between alignment researchers is not object-level but semantic, and I remember seeing instances where person X writes a post about how he/she disagrees with a point that person Y made, with person Y responding that this wasn't even the point at all. In many cases, it appears that simply saying what you don't mean could have prevented a lot of unnecessary misunderstandings.

I'm curious whether there are specific parts of the usual arguments that you find logically inconsistent.

I Googled up 'how are tokens embedded' and this post came up third in the results - thanks for the post!

If this interests you, there is a proposal in the Guideline for Designing Trustworthy Artificial Intelligence by Fraunhofer IAIS which includes the following:

[AC-R-TD-ME-06] Shutdown scenarios
Requirement: Do
Scenarios should be identified, analyzed and evaluated in which the live AI application must be completely or partially shut down in order to maintain the ability of users and affected persons to perceive situations and take action. This includes shutdowns due to potential bodily injury or damage to property and also due to the violation of personal rig

...

Everyone in any position of power (which includes engineers who are doing a lot of intellectual heavy-lifting, who could take insights with them to another company), thinks of it as one of their primary jobs to be ready to stop

In some industries, Stop Work Authorities are implemented, whereby any employee at any level in the organisation has the power to stop any work deemed unsafe at any time. I wonder if something similar in spirit would be feasible to implement at top AI labs.

This is definitely my dream, although I think we're several steps away from this being achievable at the present time.

Without thinking about it too much, this fits my intuitive sense. An amoeba can't possibly demonstrate a high level of incoherence because it simply can't do a lot of things, and whatever it does would have to be very much in line with its goal (?) of survival and reproduction. 

Thanks for this post. I've always had the impression that everyone around LW has been familiar with these concepts since they were kids and now knows them by heart, while I've been struggling with some of these concepts for the longest time. It's comforting to me that there are long-time LWers who don't necessarily fully understand all of this stuff either.

3 Adam Zerner 1y
That's awesome to hear! Before reading this comment I had a vague sense of "Maybe this'll help people" but after reading it I have a very concrete sense of "Oh yes! This is exactly the thing I was hoping would happen."

Browsing through the comments section, it seems that everyone relates to this pretty well. I do, too. But I'm wondering whether this applies mostly to a LW subculture, or whether it is a Barnum/Forer effect that every neurotypical person would also relate to.

3 [DEACTIVATED] Duncan Sabien 1y
I suspect everyone can relate in that everyone has felt this at some point, or even at a few memorable points. I suspect people who are more firmly normal, not because they're trying to conform but because they're actually close to what the center of their local culture is built to accommodate, cannot relate to feeling this constantly.

With regard to the Seed AI paradigm, most of the publications seem to have come from MIRI (especially the earlier ones, when they were called the Singularity Institute), with many discussions happening both here on LessWrong and at events like the Singularity Summit. I'd say most of the thinking around this paradigm happened before the era of deep learning. Nate Soares' post might provide more context.

You're right that brain-like AI has not had much traction yet, but it seems to me that there is a growing interest in this research area lately (albeit much ...

AI is highly non-analogous with guns.

Yes, especially for consequentialist AIs that don't behave like tool AIs. 

I feel like I broadly agree with most of the points you make, but I also feel like accident vs misuse are still useful concepts to have. 

For example, disasters caused by guns could be seen as:

  • Accidents, e.g. killing people by mistaking real guns for prop guns, which may be mitigated with better safety protocols
  • Misuse, e.g. school shootings, which may be mitigated with better legislation, better security, etc.
  • Other structural causes (?), e.g. guns used in wars, which may be mitigated with better international relations

Nevertheless, all of the above ...

3 David Scott Krueger (formerly: capybaralet) 1y
Yes, it may be useful in some very limited contexts. I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing. AI is highly non-analogous with guns.

Upvoted. Though as someone who has been in the corporate world for close to a decade, this is probably one of the rare LW posts that I didn't learn anything new from. And because every point is so absolutely true and extremely common in my experience, when reading the post I was just wondering the whole time how this is even news.

There are probably enough comments here already, but thanks again for the post, and thanks to the mods for curating it (I would've missed it otherwise).

This is a nice post that echoes many points in Eliezer's book Inadequate Equilibria. In short, it is entirely possible for you to outperform 'experts' or 'the market' if there are reasons to believe that these systems converge to a sub-optimal equilibrium, and even more so when you have more information than the 'experts', like in your Wave vs Theorem example.

More related LW concepts: Hero Licensing, and a few essays in the Inside/Outside View tag.

In every scenario, if you have a superintelligent actor which is optimizing the grader's evaluations while searching over a large real-world plan space, the grader gets exploited.

Similar to the evaluator-child who's trying to win his mom's approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn't a perfect grader solve the problem? 

One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.
Yes, in theory. In practice, I think the answer is "no", for reasons outlined in this post.

In case anyone comes across this post trying to understand the field, Scott Aaronson did a better job than me of describing the "seed AI" and "prosaic AI" paradigms here, which he calls "Orthodox" vs "Reform".

I'm probably missing something, but doesn't this just boil down to "misspecified goals lead to reward hacking"?

Nope! Both "misspecified goals" and "reward hacking" are orthogonal to what I'm pointing at. The design patterns I highlight are broken IMO.

This post makes sense to me, though it feels almost trivial. I'm puzzled by the backlash against consequentialism; it just feels like people are overreacting. Or maybe the 'backlash' isn't actually as strong as I'm reading it to be.

I'd think of virtue ethics as some sort of equilibrium that society has landed itself in after all these years of being a species capable of thinking about ethics. It's not the best, but you'd need more than naive utilitarianism to beat it (this EA forum post feels like common sense to me too), which you describe as reflective c...

3 Adam Zerner 1y
Yeah, I have very similar thoughts.

Thanks - this helps.

Thanks for the reply! 

But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for its search algorithm. 

I'd be interested to see actual examples of this, if there are any. But also, how would this not be an objective robustness failure if we frame the objective as "maximize reward"? 

if you perform Inverse Optimal Control on the behavior of the r

...
I have some toy examples from a paper I worked on: But I think this is a well-known issue in robotics, because SOTA trajectory planning is often gradient-based (i.e. local). You definitely see this on any "hard" robotics task where initializing a halfway decent trajectory is hard. I've heard from Anca Dragan (my PhD advisor) that this happens with actual self-driving car planners as well.
Oops, sorry, the answer got cut off somehow. I meant to say that if you take a planner that's suboptimal, look at the policy it outputs, and then rationalize that policy assuming that the planner is optimal, you'll get a reward function that is different from the reward function you put in. (Basically what the Armstrong + Mindermann paper says.)
Well, the paper doesn't show that you can't decompose it, but merely that the naive way of decomposing observed behavior into capabilities and objectives doesn't work without additional assumptions. But we have additional information all the time! People can tell when other people are failing due to incompetence vs misalignment, for example. And in practice, we can often guess whether a failure is due to capabilities limitations or objective robustness failures, for example by doing experiments with fine-tuning or prompting.
The reason we care about 2D alignment is that capability failures seem much more benign than alignment failures. Besides the reasons given in the main post, we might also expect that capability failures will go away with scale, while alignment failures will become worse with scale. So knowing whether something is a capability robustness failure vs an alignment one can inform you as to the importance and neglectedness of research directions.

Thanks for the example, but why is this a capabilities robustness problem and not an objective robustness problem, if we think of the objective as 'classify pandas accurately'?

Insofar as it's not a capability problem, I think it's an example of Goodharting and not inner misalignment/mesa optimization. The given objective ("minimize cross entropy loss") is maximized on distribution by incorporating non-robust features (and also gives no incentive to be fully robust to adversarial examples, so even non-robust features that don't really help with performance could still persist after training).
You might argue that there is no alignment without capabilities, since a sufficiently dumb model "can't know what you want". But I think you can come up with clean examples of capabilities failures if you look at, say, robots that use search to plan; they often do poorly according to the manually specified reward function on new domains because optimizing the reward is too hard for its search algorithm.
Of course, you can always argue that this is just an alignment failure one step up; if you perform Inverse Optimal Control on the behavior of the robot and derive a revealed reward function, you'll find that its . In other words, you can invoke the fact that for biased agents, there doesn't seem to be a super principled way of dividing up capabilities and preferences in the first place. I think this is probably going too far; there are still examples in practice (like the intractable search example) where thinking about it as a capabilities robustness issue is more natural than thinking about it as an objective problem.

I don't know how I even got here after so long but I really like this post. Looking forward to next year's post.

I'd love to see a post with your reasoning.

I think these are fair assumptions for the alignment field in general. There is, however, work done outside this community that has different assumptions but also calls itself AI safety, e.g. this one.

(I've written more about these assumptions here).

Buying time could also mean implementing imperfect solutions that don't work against strong AGIs but might help to not get us destroyed by the first AGI that might be relatively weak.

(I wrote about it recently)

For example, although our results show CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels, but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”

It seems to me that every robustness failure can be interpreted as an objective robustness failure (as aptly titled in your other post). Do you have examples of a capability robustness failure that is not an objecti...

Are there any examples of capability robustness failures that aren't objective robustness failures? 

Yes. This image is only a classifier. No mesa optimizer here. So we have only a capability robustness problem.

I got the book (thanks to Conjecture) after doing the Intro to ML Safety Course where the book was recommended. I then browsed through the book and thought of writing a review of it - and I found this post instead, which is a much better review than I would have written, so thanks a lot for this! 

Let me just put down a few thoughts that might be relevant for someone else considering picking up this book.

Target audience: Right at the beginning of the book, the author says "This book is written for the sophisticated practitioner rather than the academic...

This is actually what my PhD research is largely about: Are these risks actually likely to materialize? Can we quantify how likely, at least in some loose way? Can we quantify our uncertainty about those likelihoods in some useful way? And how do we make the best decisions we can if we are so uncertain about things?

I'd be really interested in your findings.

If there's an existing database of university groups already, it would be great to include a link to that database, perhaps under "Local EA Group". Thanks!

I think this is a reasonable view and I'm puzzled by the downvotes. But I'm also confused by the conclusion - are you arguing about whether the x-risk from AGI is something predictable or not? Or is the post just meant to convey examples of the merits of both arguments?

Thanks — I'm not arguing for this position, I just want to understand the anti AGI x-risk arguments as well as I can. I think success would look like me being able to state all the arguments as strongly/coherently as their proponents would.

(see Zac's comment for some details & citations)

Just letting you know the link doesn't work, although the comment was relatively easy to find.

Thanks for the comment!

You can read more about how these technical problems relate to AGI failure modes and how they rank on importance, tractability, and crowdedness in Pragmatic AI Safety 5. I think the creators included this content in a separate forum post for a reason.

I felt some of the content in the PAIS series would've been great for the course, though the creators probably had a reason to exclude it and I'm not sure what it was.

The second group doesn't necessarily care about why each research direction relates to reducing X-risk.

In this case I fee...

From what I understand, Dan plans to add more object-level arguments soon.

Ah that's clear, thanks! I must've overlooked the "In 2016" right at the top of the post. 

Very minor thing, but I was confused for a while when you said "end of 2020": I thought of it as the year instead of the decade (2020s).

2 Daniel Kokotajlo 1y
It was the year, your original interpretation was correct. Key thing to notice: The question is asking about a hypothetical situation in which we magically get lots more compute 5 years before 2020, and then asks what happens by end of 2020 in that hypothetical.

Your position makes sense. Part of it was just paraphrasing (what seems to me to be) the 'consensus view' that preventing AIs from wiping us out is much more urgent / important than preventing AIs from keeping us alive in a far-from-ideal state.

This is a great guide - thank you. However, in my experience as someone completely new to the field, 100-200 hours on each level is very optimistic. I've easily spent double or triple that time on the first two levels and still haven't reached a comfortable level.

4 Gabriel Mukobi 1y
Thanks, yeah that's a pretty fair sentiment. I've changed the wording to "at least 100-200 hours," but I guess the idea was more to present a very efficient way of learning things that maybe 80/20's some of the material. This does mean there will be more to learn—rather than these being strictly linear progression levels, I imagine someone continuously coming back to AI safety readings and software/ML engineering skills often throughout their journey, as it sounds like you have.

For those who prefer not to spend 3 hours (or 1.5 hours on 2x speed) watching the video, the lecture notes are here. They seem fairly self-explanatory.

"Because it's there" - George Mallory in 1923, when asked why he wanted to climb Everest. He died in his summit attempt the following year. 

A part of me is worried that the terminology invites viewing mesa-optimisers as a description of a very specific failure mode, instead of as a language for the general worry described above.

I have been very confused about the term for a very long time, and have always thought of mesa-optimisers as a very specific failure mode.

This post helped me clear things up.

Are you just noting that the model won't necessarily find the global maximum, and will only reach some local maximum?

That was my takeaway as well, but I'm also somewhat confused.
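
For anyone else confused by the local vs global distinction in the question above, here is a minimal sketch of it. The toy function, learning rate, and starting points are my own assumptions for illustration and are not taken from the post; the point is only that a purely local search (plain gradient ascent here) can settle on a lower peak depending on where it starts.

```python
def f(x: float) -> float:
    """Toy objective (hypothetical) with two peaks; the right-hand peak is higher."""
    return -(x**2 - 1) ** 2 + 0.3 * x


def grad_f(x: float) -> float:
    """Analytic derivative of f."""
    return -4 * x * (x**2 - 1) + 0.3


def gradient_ascent(x0: float, lr: float = 0.01, steps: int = 5000) -> float:
    """Plain gradient ascent from x0; it only ever follows the local slope."""
    x = x0
    for _ in range(steps):
        x += lr * grad_f(x)
    return x


x_left = gradient_ascent(-1.5)   # starts left of the valley, stays on the lower peak
x_right = gradient_ascent(0.5)   # starts right of the valley, reaches the higher peak
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs end at a point where the gradient is essentially zero, but only the second one is at the global maximum.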

This post has helped me clear up some confusions that I had about inner misalignment for the longest time. Thank you.

The Solomonoff, or Universal, prior is a probability distribution over strings of a certain alphabet (usually over all strings of 1s and 0s). It is defined by taking the set of all Turing machines (TMs) which output strings, assigning to each a weight proportional to

The image right after the paragraph above won't display even after multiple refreshes. The same happens on the AF post. I tried different browsers but it didn't work.

The picture links to [...]^{-L}, which gives a 404 error. The picture likely displayed the formula 2^{-L}.
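
To make the weighting concrete, here is a rough sketch assuming the missing formula really is 2^{-L}, with L the length of a program in bits, as guessed above. The set of "programs" is a made-up prefix-free set of bit strings, not actual Turing machines, so this only illustrates how the weights behave and is not the real construction.

```python
from fractions import Fraction


def weight(program: str) -> Fraction:
    """Weight 2^{-L} for a program written as a bit string of length L."""
    return Fraction(1, 2 ** len(program))


# Hypothetical prefix-free set of "programs": no string is a prefix of another.
programs = ["0", "10", "110", "1110"]

weights = {p: weight(p) for p in programs}
total = sum(weights.values())

for p, w in weights.items():
    print(p, w)
print("total weight:", total)
```

Shorter programs get exponentially more weight, and for a prefix-free set the total never exceeds 1 (the Kraft inequality), which is why the weights can be normalized into a prior.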

This seems like a good idea, although I feel that the number of people who use FLOPS to mean FLOP per second is still going to grow faster than the number who switch to using FLOP/s, as most introductory material from a simple Google search still refers to FLOPS as FLOP per second. Perhaps an edit to the wiki page could expedite the switch?

Also, will FLOP/s still be pronounced 'flops' or will it now be 'flop per second'?
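
A small sketch of why the distinction matters in practice: FLOP is a count of operations and FLOP/s is a rate, so dividing one by the other gives a time. The numbers below are hypothetical round figures chosen for illustration, not measurements of any real system.

```python
SECONDS_PER_DAY = 86_400


def run_time_seconds(total_flop: float, flop_per_second: float) -> float:
    """Time needed to perform `total_flop` operations at a rate of `flop_per_second`."""
    return total_flop / flop_per_second


total_flop = 1e21        # hypothetical amount of compute (a quantity, in FLOP)
flop_per_second = 1e15   # hypothetical hardware throughput (a rate, in FLOP/s)

seconds = run_time_seconds(total_flop, flop_per_second)
days = seconds / SECONDS_PER_DAY
print(f"{seconds:.0f} s, about {days:.1f} days")
```

Writing the rate as FLOP/s keeps the units in this kind of calculation unambiguous, whereas "FLOPS" could be read as either the count or the rate.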
