Comments

Sky Moo · 9mo · 10

I have been thinking about this question because Llama 2-chat seems to have false positives on safety. E.g. it won't help you fix a motorbike, in case you later ride it, crash, and get injured.

What is an unsafe LLM vs a safe LLM?

What could be done if a rogue version of AutoGPT gets loose on the internet?

OpenAI can invalidate a specific API key; if they don't know which one, they can revoke all of them. This should halt the agent immediately.
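A minimal sketch of why that works, assuming an AutoGPT-style agent loop (the key, model name, and loop body below are placeholders, not AutoGPT's actual code): once the key is revoked, every API call raises an authentication error and the loop cannot continue.

```python
from openai import OpenAI, AuthenticationError

client = OpenAI(api_key="sk-...")  # placeholder key, controlled by the operator
goal = "make paperclips"

while True:
    try:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"Goal: {goal}. What is the next step?"}],
        )
    except AuthenticationError:
        # Once the key is revoked, every request fails here and the agent stops.
        break
    next_step = resp.choices[0].message.content
    # ... execute next_step, feed the result back in, repeat ...
```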

If it were using a local model, the problem is harder. Copies of local models may be distributed around the internet, and I don't know how one could stop the agent in that situation. Can we take inspiration from how viruses and worms have been defeated in the past?

Sky Moo · 1y · 200

This should at least partially answer your question of ‘why would an AI want to destroy humanity?’ It is because humans are going to tell it to do that.

The AutoGPT Discord has a voice chat that's basically active 24/7; people are streaming themselves setting up and trying out AutoGPT in there all the time. The most common trial task they give it is 'make paperclips'.

Sky Moo · 1y · 1-2

I understand your emotional reaction to ChaosGPT in particular, but I actually think it's important to keep in mind that ChaosGPT is just as dangerous as AutoGPT asked to make cookies, or to make people smile. It really doesn't matter what the goal is; it's the optimization that produces the instrumental byproducts that may lead to disaster.

This is an alignment problem: you/LeCun want semantic truth, whereas the actual loss function only rewards producing statistically plausible text.
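For concreteness, the pre-training objective is just next-token cross-entropy over the corpus; nothing in it refers to truth, only to which continuation is likely:

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$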

Mostly. The fine-tuning stage puts an additional layer on top of all that, and skews the model towards stating true things so strongly that we are surprised when it *doesn't*.

What I would suggest is that aligning an LLM to produce truthful text should not be done with RLHF; instead, we may need to extract the model's internal truth predicate and steer the output so that that neuron assembly stays lit up.
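A minimal sketch of the kind of intervention I mean (activation steering), using GPT-2 as a stand-in model; the layer index, steering strength, and the 'truth direction' below are placeholder assumptions, not a real extracted truth predicate:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer = 6    # which transformer block to intervene on (assumption)
alpha = 4.0  # steering strength (assumption)

# Placeholder "truth" direction; in practice it would have to be extracted from
# the model, e.g. as a difference of mean activations on true vs. false statements.
truth_direction = torch.randn(model.config.n_embd)
truth_direction = truth_direction / truth_direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # nudge them along the truth direction and pass everything else through.
    hidden_states = output[0] + alpha * truth_direction
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(steer)

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))

handle.remove()
```

Whether a usable internal truth predicate actually exists and can be isolated like this is, of course, the open question.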

I watched someone play with this tool on Discord. I thought it was interesting that they ran the tool as administrator, because otherwise it didn't work (on their particular system/setup).

Sky Moo · 1y · 100

The goal of this site is not to create AGI.

Here are some questions I would have thought were silly a few months ago. I don't think that anymore.

I am wondering if we should be careful when posting about AI online. What should we be careful to say and not say, in case it influences future AI models?

Maybe we need a second space that we can ensure won't be trained on. But that's completely impossible.

Maybe we should start posting stories about AI utopias instead of AI hellscapes, to influence future AI?
