All of habryka's Comments + Replies

No, the "browsing: enabled" is I think just another hilarious way to circumvent the internal controls.

Huh, this was a great comment. I had read Sense of Style a while ago, and do share many of the OP's complaints about other writing advice, so I did confabulate that Sense of Style was giving the kind of advice the OP argues against, but this comment has convinced me that I was wrong.

But Rational Animations already has many animations?

Oh. There was no link and it sounded like a planned thing. I found it: []

Nah, seems to be a browser-specific bug. I expect we will fix it next week, after Thanksgiving.

Seems... hard and a bit overwhelming. My current guess is we are unlikely to do that, just because at least in my mental eye it feels like it would make the comment section quite cluttered.

2Yoav Ravid8d
The icon could be to the side of the comment section (same place as it is on the post). I think it would be nice on long comments, but only on long comments. So if it is implemented, the implementation should include a check for how long the comment is, which makes things more complicated. But maybe it can be implemented along with voting on the bottom when a comment is long; seems like that part is the same piece of code.
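A minimal sketch of the length check suggested above, assuming a simple word-count threshold. The names `LONG_COMMENT_MIN_WORDS`, `is_long_comment` and `comment_extras` are hypothetical, as is the threshold of 300 words; nothing here reflects how the site actually works.

```python
# Hypothetical threshold: comments with at least this many words count as "long".
LONG_COMMENT_MIN_WORDS = 300


def is_long_comment(text: str, min_words: int = LONG_COMMENT_MIN_WORDS) -> bool:
    """Return True when the comment has at least min_words whitespace-separated words."""
    return len(text.split()) >= min_words


def comment_extras(text: str) -> dict:
    """Decide which optional UI pieces to render for a comment.

    Both extras hang off the same length check, mirroring the suggestion that
    the side icon and the bottom vote buttons could share one piece of code.
    """
    long_comment = is_long_comment(text)
    return {"side_icon": long_comment, "bottom_voting": long_comment}
```

Keeping the check in one function means the two features cannot disagree about what counts as a long comment, though a real implementation might prefer rendered height over word count.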

Yeah, agree that the small comment count is too faint in dark mode.

Huh, toggling works fine for me. Maybe this was fixed in the last 12 hours?

I just checked after emptying my cache & deleting cookies, and for me toggling still doesn't work. I can pin side comments by clicking the icon, and unpin them by clicking anywhere outside the marked box, but I can't unpin by clicking the icon again. That said, maybe changes to the website take a while to propagate to end users or something?
Huh, there is in fact a bug here, but it's browser-specific (fails to close in Firefox, closes correctly in Chrome). Will investigate.

I don't like to have to scroll my screen horizontally to read the comment. (I notice there's a lot of perfectly good unused white space on the left side; comments would probably fit horizontally if you pushed everything to the left!)

Yeah, this is pretty annoying. We spent a decent amount of time trying to make it so that the whole page shifts to the left when you open a comment, but it ended up feeling too janky. We might still make it work later on.

The current layout is optimized for 1440px wide screen size, which is the most common width that people u... (read more)

Yep, this is the intended way of opening the comments. Hover, read the first few lines to see whether you're interested, then click to pin and fully read it, including the rest of the discussion.

1the gears to ascenscion7d
might benefit from additional visual discoverability, make it look more like a button somehow

Yeah, I also find this kind of annoying. The "intended" way of using it is to click on the comment icon which "pins" the comment open and then you can do with your mouse whatever you want. We played around with a few different ways of leaving it open, but all of them had more frustrating interactions than the current one.

2Adam Zerner8d
Ah, that's good to know about clicking on it. In retrospect I'm surprised I didn't realize that. And that makes sense about being difficult to come up with a better option. I was thinking of having the comment appear in the middle of the screen to the left of the comment icon. That has the downside of being more intrusive. My sense is that the upsides outweigh the downsides, but I'm not particularly confident. I also think it makes sense to not let the perfect be the enemy of the good with this feature, especially wrt releasing fast and seeing what users think.

Hmm, I feel like the revision would have to be in Scott's comment. I was just responding to the names that Scott mentioned, and I think everything I am saying here is still accurate.

I think this is likely wrong. I agree that there is a plausible story here, but given the case that Sam seems to have lied multiple times in confirmed contexts (for example when saying that FTX has never touched customer deposits), and people's experiences at early Alameda, I think it is pretty likely that Sam was lying quite frequently, and had done various smaller instances of fraud.

I don't think the whole FTX thing was a ponzi scheme, and as far as I can tell FTX the platform itself (if it hadn't burned all of its trust in the last 3 weeks), would have ... (read more)

At least for income the effect seems robust into the tails, where IIRC each standard deviation added a fixed amount of expected income in basically the complete dataset.

Thanks! I guess one way to motivate our argument is that if the information-processing capabilities of humans were below the diminishing returns point, then we would expect individual humans with much greater than average information-processing capabilities to have a distinct advantage in jobs such as CEOs and leaders. This doesn't seem to be the case.

I don't understand, this seems clearly the case to me. Higher IQ seems to result in substantially higher performance in approximately all domains of life, and I strongly expect the population of successful CEOs to have many standard deviations above average IQ.

How many standard deviations? My (admittedly only partially justified) guess is that there are diminishing returns to being (say) three standard deviations above the mean compared to two in a CEO position as opposed to (say) a mathematician. (Not that IQ is perfectly correlated with math success either.)
This can't actually happen, but only due to the normal distribution of human intelligence placing hard caps on how much variance exists in humans.

We've been trying to reproduce this bug for a while. Do you by any chance have any series of steps that reliably produces it?

Good question, just did some fiddling around. Current best theory (this is on Android Chrome):

  • Scroll the page downward so that the top bar appears.
  • Tap a link, but drag away from it, so the tooltip appears but the link doesn't activate. (Or otherwise do something that makes a tooltip appear, I think.)
  • Scroll the page upward so that the top bar disappears.
  • Tap to close the tooltip.

If this doesn't reproduce the problem 100% of the time, it seems very close. I definitely have the intuition that it's related to link clicks; I also note that it always seems... (read more)

Is my reproduction in the original comment not enough? I'm pretty sure the problem in Firefox is just the same as in the mobile browsers, namely that the notification bar does weird stuff as the page is loaded, when it should be invisible unless intentionally opened by the user. Maybe the bar is instead set to be displayed as visible but offscreen, and thus randomly appears onscreen when the page is loaded and the positions of page elements move around during loading? If you'd prefer, we can also move this discussion to the Github issue tracker if there's already a ticket for the bug. EDIT: So far I've reported a whole bunch of bugs, incl. reproductions (incomplete list). If the LW team could use someone to help with bug reports & reproductions, I might be up for that, so let me know and we could work something out.

Yeah, I am sorry. Like, I don't think I currently have the energy to try to communicate all the subtle things that feel wrong to me about this, but it adds up to something I quite dislike.

I wish I had a more crystallized quick summary that I expect to cross the inferential distance quickly, but I don't currently.

I still get a strong feeling of group think every time I see the title of the post, and feel a strong sense of something invading into my thought-space in a way that feels toxic to me. For some reason this feels even stronger in the Twitter post you made:

I don't know, I just feel like this is some kind of call-to-action that is trying to bypass my epistemic defenses. 

-1the gears to ascenscion20d
it sure is a call to action, your epistemic defenses had better be good enough to figure out that it is a good one, because it is, and it is correct to pressure you about it. the fact that you're uncertain about whether I am right does not mean that I am uncertain. it is perfectly alright to say you're not sure if I'm right. but being annoyed at people for saying you should probably come to this conclusion is not reasonable when that conclusion is simply actually objectively justified - instead say you will have to think about it because you aren't sure if you see the justification yet, or something, and remember that you don't get to exclude comments from affecting your reputation, ever. if there's a way you can encode your request for courtesy about required updates that better clarifies that you are in fact willing to make updates that do turn out to be important and critical moral cooperation policy updates, then finding that method of phrasing may in fact be positive expected value for the outside world because it would help people request moral updates of each other in ways that are not pushing too hard. but it is often correct to push. do not expect people to make an exception because the phrasing was too much pressure.

The Twitter post is literally just title + link. I don't like Twitter, and don't want to engage on it, but I figured posting this more publicly would be helpful, so I did the minimum thing to try to direct people to this post.

From my perspective, I find it pretty difficult to be criticized for a “feeling” that you get from my post that seems to me to be totally disconnected from anything that I actually said.

I think the title is the biggest problem here:

We must be very clear: fraud in the service of effective altruism is unacceptable

There is no "I think" here, no "I believe". At least to me it feels very much like a warcry instead of a statement about the world.

to make clear that we don't support fraud in the service of effective altruism.

This is also a call to action to change some kind of collective belief. I agree that you might have meant "we individually don't support fraud", but the "in the service of effective altruism" gives me a sense of this ... (read more)

I agree that the title does directly assert a claim without attribution, and that it could be misinterpreted as a claim about what all EAs think should be done rather than just what I think should be done. It's a bit tricky because I want the title to be very clear, but am quite limited in the words I have available there. I think the latter quote is pretty disingenuous—if you quote the rest of that sentence, the beginning is “I think the best course of action is”, which makes it very clear that this is a claim about what I personally believe people should do: To be clear, “in the service of effective altruism” there is meant to refer to fraud done for the purpose of advancing effective altruism, not that we have an obligation to not support fraud and that obligation is in the service of effective altruism. Edit: To make that last point more clear, I changed “to make clear that we don't support fraud in the service of effective altruism” to “to make clear that we don't support fraud done in the service of effective altruism”.

I agree that there is value in common-knowledge building, but there is a difference between doing something that feels social-miasma or simulacrum-level 3 shaped where you assert that "WE ALL AGREE THAT X" vs. you argue that something is a good idea, and you currently believe lots of other people believe the same.

I think coordinating against dishonest practices is important, but I don't think in order to do that we have to move away from just making primarily factual statements or describing your own belief state, and have to invent some kind of group-level belief.

Where do you think I make any claims that “everyone agrees X” as opposed to “I think X”? In fact, rereading my own writing, I think I was quite clear that everything therein was my view and my view alone.

Don't know, I literally never talked to one during my studies. My guess is they would advise against it, but be pretty clear that it's ultimately up to you.

I did a bit of a weird thing because I started university while I was still in high-school, but I had many friends who never attended a single lecture, and maybe did like 15% of the homework, and nobody cared and they didn't seem to feel bad about it.

For context, at Berkeley (where I actually finished my undergrad) more than half of my classes had mandatory labs, and maybe 30% of my grade was determined by homework grades in almost all classes, so it was a very different experience.

5Adam Zerner1mo
That is common in American universities as well, but if you asked a university administrator about it they wouldn't endorse it. Would a university administrator for German universities say "yes, go right ahead and skip all the lectures if you want" or would they say "no, we strongly advise against that"?

Yeah, totally, 10 hours of reading seems definitely worth it, and like, I think many hours of conversation, if only because those hours of conversation will probably just help you think through things yourself.

I also think it does make a decent amount of sense to coordinate with existing players in the field before launching new initiatives and doing big things, though I don't think it should be a barrier before you suggest potential plans, or discuss potential directions forward.

Huh, I thought this was basically standard in German universities. In my math degree there were no mandatory classes or homework or anything, the only thing that mattered was what grade you got at the exam at the end of the semester. There were lectures and homework, but it was totally up to you how you learned the content (and I did indeed choose basically full self-study).

This was my experience studying in the Netherlands as well. University officials were indeed on board with this, with the general assumption being that lectures and instructions and labs and such are a learning resource that you can use or not use at your discretion.
3Adam Zerner1mo
In American universities some smaller classes would take attendance. Bigger classes worked pretty much like you described, where there are lectures and stuff but you could ignore them, and what ultimately matters are the exams. But the "party line" is that you're not supposed to do that. You're supposed to go to the lectures. Ie. if you asked some university PR representative, they wouldn't say "it's totally up to you whether or not you want to go to lectures". Is that how it is in German universities as well, or is the party line that it truly is up to you?

I think RLHF doesn't make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can't route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.

Like, I don't see how RLHF "chips away at the proxy gaming problem". In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.

This has also motivated me to post one of my favorite critiques of RLHF.

I'm not a big supporter of RLHF myself, but my steelman is something like:

RLHF is a pretty general framework for conforming a system to optimize for something that can't clearly be defined. If we just did, "if a human looked at this for a second would they like it?" this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. O... (read more)

The whole thing reads a bit like "AI governance" and "AI strategy" reinvented under a different name, seemingly without bothering to understand what's the current understanding.

I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".

Separately, I also don't really think it's worth spending a ton of time trying to really understand what current people think ab... (read more)

I overall agree with this comment, but do want to push back on this sentence. I don't really know what it means to "invent AI governance" or "invent AI strategy", so I don't really know what it means to "reinvent AI governance" or "reinvent AI strategy".

By reinventing it, I mean, for example, asking questions like "how to influence the dynamic between AI labs in a way which allows everyone to slow down at critical stage", "can we convince some actors about AI risk without the main effect being they will put more resources into the race", "what's up ... (read more)

I'm sympathetic under some interpretations of "a ton of time," but I think it's still worth people's time to spend at least ~10 hours of reading and ~10 hours of conversation getting caught up with AI governance/strategy thinking, if they want to contribute. Arguments for this: * Some basic ideas/knowledge that the field is familiar with (e.g. on the semiconductor supply chain, antitrust law, immigration, US-China relations, how relevant governments and AI labs work, the history of international cooperation in the 20th century) seem really helpful for thinking about this stuff productively. * First-hand knowledge of how relevant governments and labs work is hard/costly to get on one's own. * Lack of shared context makes collaboration with other researchers and funders more costly. * Even if the field doesn't know that much and lots of papers are more advocacy pieces, people can learn from what the field does know and read the better content.

I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely "concerned" does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):

Sorry, I don't want to say that safety culture is not relevant, but I want to say something like... "the real safety culture is just 'sanity'".

Like, I think you are largely playing a losing game if you try to get people to care about "safety" if the people making decisions are incapable of considering long-chained arguments, or are ... (read more)

I think I essentially agree with respect to your definition of "sanity," and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of "sanity" and my definition of "safety culture." I agree that saying that they support my efforts and say applause lights is pretty bad, though it seems weird to me to discount actual resources coming in. As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I was totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities) but it still may help chip away at the proxy gaming problem.

As my favorite quote from Michael Vassar says: "First they came for our epistemology. And then they...well, we don't really know what happened next". 

But some more detailed arguments:

  1. In-particular if you believe in slow-takeoff worlds, a lot of the future of the world rests on our ability to stay sane when the world turns crazy. I think in most slow-takeoff worlds we are going to see a lot more things that make the world at least as crazy and tumultuous as during COVID, and I think our rationality was indeed not strong enough to really handle COVID gr
... (read more)

Yeah, as someone without a ton of pre-existing biology knowledge, there were definitely a lot of words and terms you used that I had to look up. I think I still got value out of it, but I do think my understanding of stuff is still pretty shaky given that I don't seem to have a lot of the foundations this post requires.

Epistemic status: Quick rant trying to get a bunch of intuitions and generators across, written in a very short time. Probably has some opinions expressed too strongly in various parts. 

I appreciate seeing this post written, but currently think that if more people follow the advice in this post, this would make the world a lot worse (This is interesting because I personally think that at least for me, doing a bunch of "buying time" interventions are top candidates for what I think I should be doing with my life) I have a few different reasons for this... (read more)

I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.

I wouldn't be excited about (for example) retreats with undergrads to learn about "how you can help buy more time." I'm not even sure of the sign of int... (read more)

I agree truth matters, but I have a question here: Why can't we sacrifice a small amount of integrity and conventional morality in order to win political power, when the world is at stake? After all, we can resume it later, when the problem is solved.

Sorry, yeah, I definitely just messed up in my comment here in the sense that I do think that after looking at the research, I definitely should have said "spent a few minutes on each datapoint", instead of "a few seconds" (and indeed I noticed myself forgetting that I had said "seconds" instead of "minutes" in the middle of this conversation, which also indicates I am a bit triggered and doing an amount of rhetorical thinking and weaseling that I think is pretty harmful, and I apologize for kind of sliding between seconds and minutes in my last two commen... (read more)

I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations.

Sorry, I don't understand how this is in conflict to what I am saying. Here is the relevant section from your paper:

Our labelers consist of contractors hired either through Upwork, or sourced from Scale AI. [...]

[Some detail

... (read more)

I would estimate that the difference between "hire some mechanical turkers and have them think for like a few seconds" and the actual data collection process accounts for around 1/3 of the effort that went into WebGPT, rising to around 2/3 if you include model assistance in the form of citations. So I think that what you wrote gives a misleading impression of the aims and priorities of RLHF work in practice.

I think it's best to err on the side of not saying things that are false in a literal sense when the distinction is important to other people, even whe... (read more)

The RLHF work I'm most excited by, and which constitutes a large fraction of current RLHF work, is focused on getting humans to reward the right thing, and I'm particularly excited about approaches that involve model assistance, since that's the main way in which we can hope for the approach to scale gracefully with model capabilities.

Yeah, I agree that it's reasonable to think about ways we can provide better feedback, though it's a hard problem, and there are strong arguments that most approaches that scale locally well do not scale well globally.

Howe... (read more)

I do not consider this to be accurate. With WebGPT for example, contractors were generally highly educated, usually with an undergraduate degree or higher, were given a 17-page instruction manual, had to pass a manually-checked trial, and spent an average of 10 minutes on each comparison, with the assistance of model-provided citations. This information is all available in Appendix C of the paper. There is RLHF work that uses lower-quality data, but it tends to be work that is more experimental, because data quality becomes important once you are working on a model that is going to be used in the real world. There is lots of information about rater selection given in RLHF papers, for example, Appendix B of InstructGPT and Appendix C of WebGPT. What additional information do you consider to be missing?

But I mean, people have used handcrafted rewards since forever. The human-feedback part of RLHF is nothing new. It's as old as all the handcrafted reward functions you mentioned (as evidenced by Eliezer referencing a reward button in this 10 year old comment, and even back then the idea of just like a human-feedback driven reward was nothing new), so I don't understand what you mean by "previous".

If you say "other" I would understand this, since there are definitely many different ways to structure reward functions, but I do feel kind of aggressively gasli... (read more)

I mean, I don't understand what you mean by "previous reward functions". RLHF is just having a "reward button" that a human can press, with when to actually press the reward button being left totally unspecified and differing between different RLHF setups. It's like the oldest idea in the book for how to train an AI, and it's been thoroughly discussed for over a decade.
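To make the "reward button" framing concrete, here is a toy sketch, assuming a three-armed bandit with an epsilon-greedy learner. It is not how RLHF is done in practice (real setups typically train a reward model from human comparisons); the function name `train_with_reward_button`, the action labels and the rater policy are all hypothetical. The point it illustrates is that the learning loop is the same whatever rule decides when the button gets pressed.

```python
import random
from typing import Callable, Dict, List


def train_with_reward_button(
    actions: List[str],
    press_button: Callable[[str], bool],
    steps: int = 500,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Dict[str, float]:
    """Epsilon-greedy bandit whose only reward is a human pressing a button.

    press_button stands in for the human rater; the learner never sees why
    the button was pressed, only whether it was.
    """
    rng = random.Random(seed)
    value = {a: 0.0 for a in actions}  # running mean reward per action
    count = {a: 0 for a in actions}    # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)                      # explore
        else:
            action = max(actions, key=lambda a: value[a])     # exploit
        reward = 1.0 if press_button(action) else 0.0
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
    return value


# Hypothetical rater who cannot tell real help from the appearance of help.
def fooled_rater(action: str) -> bool:
    return action in ("helpful", "looks_helpful")


values = train_with_reward_button(
    ["helpful", "looks_helpful", "unhelpful"], fooled_rater
)
```

With this rater, any tried action that merely looks helpful earns exactly the same reward as one that is helpful, so nothing in the training signal separates them. Swapping in a more careful `press_button` changes the outcome without changing the algorithm.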

Yes, it's probably better than literally hard-coding a reward function based on the inputs in terms of bad outcomes, but like, that's been analyzed and discussed for a long time, and RLHF ha... (read more)

I can't tell if you're being uncharitable or if there's a way bigger inferential gap than I think, but I do literally just mean... reward functions used previously. Like, people did reinforcement learning before RLHF. They used reward functions for StarCraft and for Go and for Atari and for all sorts of random other things. In more complex environments, they used curiosity and empowerment reward functions. And none of these are the type of reward function that would withstand much optimization pressure (except insofar as they only applied to domains simple enough that it's hard to actually achieve "bad outcomes").

The complex value paper is the obvious one, which as the name suggests talks about the complexity of value as one of the primary drivers of the outer alignment problem:

Suppose an AI with a video camera is trained to classify its sensory percepts into positive and negative instances of a certain concept, a concept which the unwary might label “HAPPINESS” but which we would be much wiser to give a neutral name like G0034 (McDermott 1976). The AI is presented with a smiling man, a cat, a frowning woman, a smiling woman, and a snow-topped mountain; of these in

... (read more)
Cool, makes sense. I think we disagree on how "principled" a method needs to be in order to constitute progress. RLHF gives rewards which can withstand more optimization before producing unintended outcomes than previous reward functions. Insofar as that's a key metric we care about, it counts as progress. I'd guess we'd both agree that better RLHF and also techniques like debate will further increase the amount of optimization our reward functions can withstand, and then the main crux is whether that's anywhere near the ballpark of the amount of optimization they'll need to withstand in order to automate most alignment research.

Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.

Answer by habryka Nov 04, 2022 135

RLHF is just a fancy word for reinforcement learning, leaving almost the whole process of what reward the AI actually gets undefined (in practice RLHF just means you hire some mechanical turkers and have them think for like a few seconds about the task the AI is doing).

When people 10 years ago started discussing the outer alignment problem (though with slightly different names), reinforcement learning was the classical example that was used to demonstrate why the outer alignment problem is a problem in the first place.

I don't see how RLHF could be framed a... (read more)

I agree that the RLHF framework is essentially just a form of model-based RL, and that its outer alignment properties are determined entirely by what you actually get the humans to reward. But your description of RLHF in practice is mostly wrong. Most of the challenge of RLHF in practice is in getting humans to reward the right thing, and in doing so at sufficient scale. There is some RLHF research that uses low-quality feedback similar to what you describe, but it does so in order to focus on other aspects of the setup, and I don't think anyone currently working on RLHF seriously considers that quality of feedback to be at all adequate outside of an experimental setting. The RLHF work I'm most excited by, and which constitutes a large fraction of current RLHF work, is focused on getting humans to reward the right thing, and I'm particularly excited about approaches that involve model assistance, since that's the main way in which we can hope for the approach to scale gracefully with model capabilities. I'm also excited by other RLHF work because it supports this work and has other practical benefits. I don't think RLHF directly addresses inner alignment, but I think that an inner alignment solution is likely to rely on us doing a good enough job at outer alignment, and I also have a lot of uncertainty about how much of the risk comes from outer alignment failures directly.
Got any sources for this? Feels pretty different if the problem was framed as "we can't write down a reward function which captures human values" versus "we can't specify rewards correctly in any way". And in general it's surprisingly tough to track down the places where Yudkowsky (or others?) said all these things.

IMO a big part of why mechanistic interp is getting a lot of attention in the x-risk community is that neural networks are surprisingly more interpretable than we might have naively expected and there's a lot of shovel-ready work in this area. I think if you asked many people three years ago, they would've said that we'd never find a non-trivial circuit in GPT-2-small, a 125m parameter model; yet Redwood has reverse engineered the IOI circuit in GPT-2-small. Many people were also surprised by Neel Nanda's modular addition work.

I don't think I've seen ma... (read more)

1Vivek Hebbar1mo
I have seen one person be surprised (I think twice in the same convo) about what progress had been made. ETA: Our observations are compatible. It could be that people used to a poor and slow-moving state of interpretability are surprised by the recent uptick, but that the absolute progress over 6 years is still disappointing.

Yep, we are definitely not capable of state-level or even "determined individual" level of cyberdefense.

Gato seems to qualify for this, and is surprisingly close to this prediction. My guess is if you had really tried, you could have made something that qualified for the thing he was thinking about in 2019, though nobody was trying super hard.

Just to be clear, there were people working on it who had both agency and competence, but they were working on it as a side project. I think having something be someone's only priority and full-time job makes a large difference in how much agency they can bring to bear on a project.

Yeah, I think some of this is true, but while there is a lot of AI content, I actually think that a lot of the same people would probably write non-AI content, and engage on non-AI content, if AI was less urgent, or the site had less existing AI content.

That counterfactual is hard to evaluate, but like, a lot of people who used to be core contributors to LW 1.0 are now also posting to LW 2.0, though they are now posting primarily on AI, and I think that's evidence that it's more that there has been a broader shift among LW users that AI is just like really... (read more)

5Nathan Helm-Burger1mo
I would love to be able to stop worrying about AI and go back to improving rationality. Yet another thing to look forward to once we leap this hurdle.

Aww, thank you! ^-^

When setting up automatic crossposting, authors can ask us to make it so that all crossposts automatically have the rel=canonical tag set, pointing to the original post. Ben Kuhn asked us to do this, so the HTML directly says "when indexing this, go to this other URL to find the canonical version of this post".
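To make the mechanism concrete, here is a minimal sketch of what such a tag looks like and how an indexer might read it. This is not the site's actual crossposting code: the function and class names (`canonical_link_tag`, `CanonicalFinder`, `find_canonical`) are hypothetical, and it uses only the Python standard library.

```python
from html import escape
from html.parser import HTMLParser


def canonical_link_tag(original_url: str) -> str:
    """Build a <link rel="canonical"> tag pointing at the original post.

    Hypothetical helper for illustration; the URL is escaped so that
    characters like & and " are safe inside the attribute value.
    """
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


class CanonicalFinder(HTMLParser):
    """Records the href of the first rel=canonical link it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if (
            self.canonical is None
            and tag == "link"
            and attr_map.get("rel") == "canonical"
        ):
            self.canonical = attr_map.get("href")


def find_canonical(html_text: str):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical
```

A crosspost page would carry the output of `canonical_link_tag(...)` inside its `<head>`, and anything reading the page the way `find_canonical` does would be directed to the original URL rather than the crossposted copy.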

Note: I added LaTeX to your comment to make it easier to read. Hopefully you don't mind. Pretty sure I translated it correctly. Feel free to revert of course.

Note: I think the title of this post is kind of annoyingly indirect and feels a bit clickbaity. I recommend changing it to something like "Learning human values from law as an AGI alignment strategy" or something like that. Indeed, I might go ahead and do that in a day or so unless someone objects.

1John Nay1mo
Good idea. Will do!

Yeah, I mean, to be clear, I do definitely think you can train a neural network to somehow play chess via nothing but classification. I am not sure whether you could do it with a feed-forward neural network, and it's a bit unclear to me whether the neural networks from the 50s are the same thing as the neural networks from the 2000s, but it does sure seem like you can just throw a magic category absorber at chess and then have it play OK chess.

My guess is modern networks are not meaningfully more complicated, and the difference from back then was indeed just scale and a few tweaks, but I am not super confident and haven't looked much into the history here.

I think I am more interested in you reading The Genie Knows but Doesn't Care and then having you respond to the things in there than the Hibbard example, since that post was written (as far as I can tell) to address common misunderstandings of the Hibbard debate (given that it was linked by Robby in a bunch of the discussion there after it was written).

I think there are some subtle things here. I think Eliezer!2008 would agree that AIs will totally learn a robust concept for "car". But I think neither Eliezer!2008 nor me currently would think th... (read more)

Looking over that it just seems to be a straightforward extrapolation of EY's earlier points, so I'm not sure why you thought it was especially relevant. Yeah - this is his core argument against Hibbard. I think Hibbard 2001 would object to 'low-powered', and would probably have other objections I'm not modelling, but regardless I don't find this controversial. Yeah, in agreement with what I said earlier: ... I believe I know what you meant, but this seems somewhat confused as worded. If we can train an ML model to learn a very crisp clear concept of a goal, then having the AI optimize for this (point towards it) is straightforward. Long term robustness may be a different issue, but I'm assuming that's mostly covered under "very crisp clear concept". The issue of course is that what humans actually want is complex for us to articulate, let alone formally specify. The update since 2008/2011 is that DL may be able to learn a reasonable proxy of what we actually want, even if we can't fully formally specify it. I think this is something of a red herring. Humans can reasonably predict utility functions of other humans in complex scenarios simply by simulating the other as self - i.e. through empathy. Also, happiness probably isn't the correct thing - probably want the AI to optimize for our empowerment (future optionality), but that's a whole separate discussion. Sure, neither do I. A classifier is a function which maps high-D inputs to a single categorical variable, and a utility function just maps some high-D input to a real number, but a k-categorical variable is just the explicit binned model of a log(k) bit number, so these really aren't that different, and there are many interpolations between. (and in fact sometimes it's better to use the more expensive categorical model for regression) Video frames? The utility function needs to be over future predicted world states, which you could I guess use to render out videos, but text renderings are probably more na

I do think these are better quotes. It's possible that there was some update here between 2008 and 2013 (roughly when I started seeing the more live discussion happening), since I do really remember the "the problem is not getting the AI to understand, but to care" as a common refrain even back then (e.g. see the Robby post I linked).

I claim that this paragraph didn't age well in light of the deep learning revolution: "running a neural network [...] over a set of winning and losing sequences of chess moves" basically is how AlphaZero learns from self-pla

... (read more)
9Richard Korzekwa 1mo
AlphaGo without the MCTS was still pretty strong: Even with just the SL-trained value network, it could play at a solid amateur level: I may be misunderstanding this, but it sounds like the network that did nothing but get good at guessing the next move in professional games was able to play at roughly the same level as Pachi, which, according to DeepMind, had a rank of 2d.