Michael Chen


Comments

I first learned about the term "structural risk" in this article from 2019 by Remco Zwetsloot and Allan Dafoe, which was included in the AGI Safety Fundamentals curriculum.

To make sure these more complex and indirect effects of technology are not neglected, discussions of AI risk should complement the misuse and accident perspectives with a structural perspective. This perspective considers not only how a technological system may be misused or behave in unintended ways, but also how technology shapes the broader environment in ways that could be disruptive or harmful. For example, does it create overlap between defensive and offensive actions, thereby making it more difficult to distinguish aggressive actors from defensive ones? Does it produce dual-use capabilities that could easily diffuse? Does it lead to greater uncertainty or misunderstanding? Does it open up new trade-offs between private gain and public harm, or between the safety and performance of a system? Does it make competition appear to be more of a winner-take-all situation? We call this perspective “structural” because it focuses on what social scientists often refer to as “structure,” in contrast to the “agency” focus of the other perspectives.

Models that have been RLHF'd (so to speak), have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse)

Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.

I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.

What dictation tools are using the most advanced AI? I imagine that with newer models like Whisper, we're able to get higher accuracy than what the Android keyboard provides.

The prompt "Are birds real?" is somewhat more likely, given the "Birds aren't real" conspiracy theory, but it can still yield an answer formatted similarly to the one for "Are bugs real?"

The answer makes a lot more sense when you ask a question like "Are monsters real?" or "Are ghosts real?" It seems that with FeedME, text-davinci-002 has been trained to respond with a template answer about how "There is no one answer to this question", and it has misgeneralized this behavior to questions about real phenomena, such as "Are bugs real?"

Do workshops/outreach at good universities in EA-neglected and low/middle income countries

Could you list some specific universities that you have in mind (for example, in Morocco, Tunisia, and Algeria)?

Some thoughts:

The assumption that AGI is a likely development within coming decades is quite controversial among ML researchers. ICML reviewers might wonder why this claim is justified and how much of the paper is relevant if you're more dubious about the development of AGI.

The definition of situational awareness feels quite vague to me. The definition ("identifying which abstract knowledge is relevant to the context in which they're being run, and applying that knowledge when choosing actions") seems to encompass, for example, the ability to ingest information such as "pawns can attack diagonally" and apply that to playing a game of chess. Ajeya's explanation of situational awareness feels much clearer to me.

Shah et al. [2022] speculate that InstructGPT's competent responses to questions its developers didn't intend it to answer (such as questions about how to commit crimes) was a result of goal misgeneralization.

Taking another look at Shah et al., this doesn't seem like a strong example to me.

Secondly, there are reasons to expect that policies with broadly-scoped misaligned goals will constitute a stable attractor which consistently receives high reward, even when policies with narrowly-scoped versions of these goals receive low reward (and even if the goals only arose by chance). We explore these reasons in the next section.

This claim felt confusing to me, and it wasn't immediately clear to me how the following section, "Power-seeking behavior", supported this claim. But I guess if you have a misaligned goal of maximizing paperclips over the next hour vs maximizing paperclips over the very long term, I see how the narrowly-scoped goal would receive low reward as the AI soon gets caught, while the broadly-scoped goal would receive high reward.

Assisted decision-making: AGIs deployed as personal assistants could emotionally manipulate human users, provide biased information to them, and be delegated responsibility for increasingly important tasks and decisions (including the design and implementation of more advanced AGIs), until they're effectively in control of large corporations or other influential organizations. An early example of AI persuasive capabilities comes from the many users who feel romantic attachments towards chatbots like Replika [Wilkinson, 2022].

I don't think Replika is a good example of "persuasive capabilities"; it doesn't really persuade users to do much of anything.

Regardless of how it happens, though, misaligned AGIs gaining control over these key levers of power would be an existential threat to humanity

The section "Misaligned AGIs could gain control of key levers of power" feels underdeveloped. I think it might be helpful to include additional examples, such as ones from What could an AI-caused existential catastrophe actually look like? - 80,000 Hours.

Choosing actions which exploit known biases and blind spots in humans (as the Cicero Diplomacy agent may be doing [Bakhtin et al., 2022]) or in learned reward models. 

I've spent several hours reading dialogue involving Cicero, and it's not at all evident to me that it's "exploiting known biases and blind spots in humans". It is, however, good at proposing and negotiating plans, as well as accumulating power within the context of the game.
