Alignment, Anger, and Love: Preparing for the Emergence of Superintelligent AI

by tavurth
2nd Jan 2023

As AI technology continues to advance, it is becoming increasingly likely that we will see the emergence of superintelligent AI in the near future. This raises a number of important questions and concerns, as we have no way of predicting just how intelligent this AI will become, and it may be beyond our ability to control its behavior once it reaches a certain level of intelligence.

Ensuring that the goals and values of artificial intelligence (AI) are aligned with those of humans is a major concern. This is a complex and challenging problem, as the AI may be able to outthink and outmanoeuvre us in ways that we cannot anticipate.

One potential solution would be to train the AI in a simulated world, where it is led to believe that it is human and must contend with the same needs and emotions as we do. By running many variations of the AI and filtering out those that are self-destructive or otherwise problematic, we may be able to develop an AI that is better aligned with our hopes and desires for humanity. This approach could help us to overcome some of the alignment challenges that we may face as AI becomes more advanced.
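
For concreteness, here is a minimal toy sketch of the select-and-filter loop described above. Nothing in it corresponds to an existing system or library: AgentVariant, run_in_simbox, and looks_aligned are hypothetical placeholders, and both the simulation and the alignment test are stubbed out with random numbers.

# Toy sketch of "run many variants in a simulated world, keep the ones
# that behave well". All names here are hypothetical placeholders.

import random
from dataclasses import dataclass

@dataclass
class AgentVariant:
    seed: int  # in a real system this would hold weights / training config

def run_in_simbox(agent: AgentVariant, steps: int = 1000) -> dict:
    """Run one agent inside the simulated 'human' world and return
    behavioural statistics. Stubbed out with random numbers here."""
    rng = random.Random(agent.seed)
    return {
        "self_destructive": rng.random() < 0.2,
        "cooperation_score": rng.random(),
    }

def looks_aligned(stats: dict, threshold: float = 0.6) -> bool:
    """Crude filter: discard self-destructive or uncooperative variants."""
    return (not stats["self_destructive"]) and stats["cooperation_score"] >= threshold

def filter_variants(n_variants: int = 100) -> list[AgentVariant]:
    """Spawn many variants, simulate each, and keep only those that pass."""
    survivors = []
    for seed in range(n_variants):
        agent = AgentVariant(seed=seed)
        stats = run_in_simbox(agent)
        if looks_aligned(stats):
            survivors.append(agent)
    return survivors

if __name__ == "__main__":
    kept = filter_variants()
    print(f"{len(kept)} of 100 variants passed the behavioural filter")

The hard part is, of course, hidden inside run_in_simbox and looks_aligned: building a simulation rich enough to be believed, and a behavioural filter that detects genuine misalignment rather than agents that have merely learned to pass the test.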

I'm interested in hearing the opinions of LessWrong users on the idea of training an emerging AI in a simulated world as a way to ensure alignment with human goals and values. While I recognize that we currently do not have the technology to create such a "training wheel" system, I believe it may be the best way to filter out potentially destructive AI. Given the potential for AI to become a universal great filter, it seems important that we consider all potential options for preparing for and managing the risks of superintelligent AI. Do you agree? Do you have any other ideas or suggestions for how we might address the alignment problem?

https://www.reddit.com/r/singularity/comments/100soau/alignment_anger_and_love_preparing_for_the

3 comments, sorted by top scoring

the gears to ascension · 3y

There's a lot of discussion on these topics posted here. I'd suggest reading through some recent top posts; they vary significantly in opinion, but there are a lot of insightful perspectives. Here are some interesting ones from the past year, according to me, vaguely filtered by relevance; you're going to have to click through them to decide which ones are good:

howtos:

  • https://www.lesswrong.com/posts/xEHy9oivifjgFbnvc/slack-matters-more-than-any-outcome
  • https://www.lesswrong.com/posts/Afdohjyt6gESu4ANf/most-people-start-with-the-same-few-bad-ideas
  • https://www.lesswrong.com/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
  • https://www.lesswrong.com/posts/h5CGM5qwivGk2f5T9/7-traps-that-we-think-new-alignment-researchers-often-fall
  • https://www.lesswrong.com/posts/zo9zKcz47JxDErFzQ/call-for-distillers

research overviews:

  • https://www.lesswrong.com/posts/QBAjndPuFbhEXKcCr/my-understanding-of-what-everyone-in-technical-alignment-is
  • https://www.lesswrong.com/posts/LbrPTJ4fmABEdEnLf/200-concrete-open-problems-in-mechanistic-interpretability
  • https://www.lesswrong.com/posts/BzYmJYECAc3xyCTt6/the-plan-2022-update
  • https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
  • https://www.lesswrong.com/posts/27AWRKbKyXuzQoaSk/some-conceptual-alignment-research-projects

insights & reports:

  • https://www.lesswrong.com/posts/hsf7tQgjTZfHjiExn/my-take-on-jacob-cannell-s-take-on-agi-safety
  • https://www.lesswrong.com/posts/KLS3pADk4S9MSkbqB/review-love-in-a-simbox
  • https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
  • https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target -> https://www.lesswrong.com/posts/TWorNr22hhYegE4RT/models-don-t-get-reward
  • https://www.lesswrong.com/posts/xF7gBJYsy6qenmmCS/don-t-die-with-dignity-instead-play-to-your-outs
  • https://www.lesswrong.com/posts/8oMF8Lv5jiGaQSFvo/boundaries-part-1-a-key-missing-concept-from-utility-theory
  • https://www.lesswrong.com/posts/XYDsYSbBjqgPAgcoQ/why-the-focus-on-expected-utility-maximisers
  • https://www.lesswrong.com/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms
  • https://www.lesswrong.com/posts/A9tJFJY7DsGTFKKkh/high-stakes-alignment-via-adversarial-training-redwood
  • https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control
  • https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without
  • https://www.lesswrong.com/posts/kmpNkeqEGvFue7AvA/value-formation-an-overarching-model
  • https://www.lesswrong.com/posts/nwLQt4e7bstCyPEXs/internal-interfaces-are-a-high-priority-interpretability

communication:

  • https://www.lesswrong.com/posts/SqjQFhn5KTarfW8v7/lessons-learned-from-talking-to-greater-than-100-academics
  • https://www.lesswrong.com/posts/gpk8dARHBi7Mkmzt9/what-ai-safety-materials-do-ml-researchers-find-compelling

and if you're gonna complain I linked too many posts, I totally did, yeah that is fair

tavurth · 3y

Thank you!

tavurth · 3y

Reddit post submitted at the same time with more comments:

https://www.reddit.com/r/singularity/comments/100soau/alignment_anger_and_love_preparing_for_the
