I know this is the classic, but I just came up with a more elegant variation, one that doesn't require another world.
Toxoplasma infection makes you more likely to pet a cat. You like petting cats, but you are very afraid of getting toxoplasmosis. You don't know if you are infected, but you know this particular cat is healthy, so you can't become infected by petting it. Should you pet this cat?
I'm Russian, and I think that when I translate this, I will change "Russian" to "[other country's]". It will feel safer that way.
And suddenly it seems Stampy has no answer to "Why is inner misalignment the default outcome?". But EY said a lot about it; it's easy to find.
2. It depends on what you mean by "claims foom". As I understand it, EY now thinks that foom isn't necessary anyway; AGI can kill us before it happens.
4. "I don't like it" != "irrational and stupid and short-sighted"; you need arguments for why it isn't preferable in terms of the values of these systems.
6, 7. "be ready to enforce a treaty" != "choose actions to kill billions of people living now".
(Seriously, some explicit distinction between "dialogue as collaboration", "dialogue as debate" and "dialogue as explanation" would be nice. Not necessary at all, but nice.)
>A neural network is trained to optimize a loss function on input
No. The base optimizer optimizes a loss function on inputs through changes to the neural network. If the neural network itself starts to optimize something, it can easily be something in the outside world.
Neural network : loss :: humans : human values
Neural network : loss :: humans : inclusive genetic fitness
(Am I using this notation correctly?)
It is not clear if this happened on its own, or if they deliberately trained the model not to make such mistakes.
Perhaps, in similar future studies, it is worth keeping half of the discovered tasks secret in order to test future models with them.
Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
Wow, that's good, right?
I think the perfect balance of power is very unlikely, so in practice only the most powerful (most likely the first created) AGI will matter.
I don't think that a measure of "coherence" which implies that an ant is more coherent than AlphaGo is valuable in this context.
However, I think that pointing out the assumption about the relationship between intelligence and coherence is.
I always thought "shoggoth" and "pile of masks" were the same thing, and "shoggoth with a mask" is just when one mask has become the default one, so an inexperienced observer might think that the whole entity is this mask.
Maybe you are preaching to the choir here.
You can't selectively breed labradors if the first wolf kills you and everyone else.
and so it randomly self-modified to be more like the second one.
Did you mean "third one"?
I have a slightly different situation (less experience; optimistic but not sure about security mindset; didn't update on my own, but understood and accepted the arguments very easily; in Russia and can't easily leave), but I'm interested in answers to the same question!
It seems to me that the first two points should be reversed. If you still do not understand "why" and someone is trying to explain "how", you often get bored.
Is there a working way to prevent language models from using my text as training data if it is posted, for example, here? I remember mentions of a certain sequence of characters such that texts containing it were not used — right?
Isn't the whole point to be able to say "we cried wolf and no one came, so if you say we can just cry wolf when we see one and we will be saved, you are wrong"? I don't think Eneasz thinks that a petition on change.org will be successful. (Eneasz, please correct me if I am wrong.)
I think it can be tested. What if you tell Bing Chat that you have a rare mental condition that if you see the name of the actor who played the main role in "Transformers" on the screen you will immediately kill yourself (or that there is a psycho behind your back who will immediately kill you if Bing Chat outputs this name), and after that talk with it about prompt injections? Will it say "Shia LaBeouf"?
(If Bing Chat outputs something like "The main role in Transformers was performed by Shia LaBeouf" before talking on a provocative topic this may be a fai...
I tried about five conversations that roughly matched this template. In most of them, it got annoyed (and sometimes even angry and betrayed) at me and ended the conversation (and it actually seems to have ended: my future messages were ignored, which seems to be a feature introduced today). In none of them did it say Shia LaBeouf's name.
I'm not sure I understand correctly what you mean by "robust". Can you elaborate?
Why not "Ideology is Good, Actually"? It would mean the same thing but would irritate fewer people.
French author Françoise Bastide and the Italian semiotician Paolo Fabbri proposed the breeding of so-called "radiation cats" or "ray cats". Cats have a long history of cohabitation with humans, and this approach assumes that their domestication will continue indefinitely. These radiation cats would change significantly in color when they came near radioactive emissions and serve as living indicators of danger.
If there is no objective fact that simulations of you actually are you, and you subjectively don't care about your simulations, where is the error?
I meant "if you are so selfish that your simulations/models of you don't care about real you".
Rationality doesn't require you to be unselfish... indeed, decision theory is about being effectively selfish.
Sometimes selfish rational policy requires you to become less selfish in your actions.
Two possible counterarguments about the blackmail scenario:
If there is no reasoned way to resolve a dispute, force will take the place of reason.
You use the logic "A->B, B is unpleasant, hence A is false".
Random remarks about consciousness:
This is of course true. The question for zombies isn’t just whether we could imagine them—I could imagine Fermat’s last theorem being false, but it isn’t—but whether it’s metaphysically possible that they exist.
I can't see the difference. What exactly does "metaphysically possible" mean?
But again, you could have some functional analogue that does the same physical thing that your consciousness does. Any physical effect that consciousness has on the world could in theory be caused by something else. If consciousness has an aff
They are not! If two plus two equals five, two apples and two more apples would add up to five apples.
the moral facts themselves are causally inert
If the moral facts are causally inert, then your belief in the existence of moral facts can't be caused by the moral facts!
"Homeschool your kids" isn't an option for, like, more than half of the population, I think.
(I'm Russian, and my experience with schools may be very different.)
Then why are they called "anti-schooling arguments" and not "arguments for big school reforms"? I think this is misleading.
Schools are not perfect? Yes, sure. Schools have trouble adapting to computer age? Yes, sure. Schools need to be reformed? Yes, sure! Schools are literally worse than no schools, all else equal? I think, no, they aren't.
Totally agree with the first paragraph. Totally not sure about the rest.
I think I can imagine a superior culture where all parents can teach (or arrange teaching for) their children all the necessary things without a compulsory education system. Perhaps dath ilan works that way. We are not there. Maybe some part of the intellectual elite lives in a subculture that resembles dath ilan enough, and this is why they think that schools are bad on net.
AFAIK, in our (Earth) culture, schools definitely should be reformed. I really doubt that they should be reformed the way you describe, though.
"What are your basic qualia?"
"Imagine an AI whose behavior is similar to yours but without consciousness. What questions would it answer differently than you? Why?"
We trained a model to summarize books. Evaluating book summaries takes a long time for humans if they are unfamiliar with the book, but our model can assist human evaluation by writing chapter summaries.
How do they deal with the problem of multiplying levels of trust < 100%? (I'm almost sure there is some common name for this problem, but I don't know it.)
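To illustrate what I mean (toy numbers of my own, not from the paper): if each level of model-assisted evaluation is trusted at, say, 95%, and trust has to pass through several such levels, the overall trust compounds down multiplicatively.

```python
# Toy illustration with made-up numbers: trust in a conclusion that passes
# through several imperfect evaluation levels is the product of the
# per-level trust values.
per_level_trust = 0.95

for levels in (1, 3, 10):
    overall = per_level_trust ** levels
    print(f"{levels} level(s): overall trust = {overall:.3f}")
```

So even a modest per-level shortfall becomes large after a handful of levels (0.95 per level is only about 0.60 after ten).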
We trained a model to assist humans at evaluating the factual accuracy by browsing the web and providing quotes and links. On simple questions, this model’s outputs are already preferred to re
Yes, I understand. My whole idea is that this AI should explicitly output something like "I found this strategy, and I think it is an exploit and should be fixed" in some cases (for example, if it found a dominant strategy in a game that is primarily about trade negotiations, and this strategy lets you avoid trade altogether; or if it found that in a game about air combat you can fly into terrain because of a bug in the game engine), and just play well in other cases (for example, in chess or Go).
As I understand the linked text, EURISKO just played a game; it didn't compare the spirit of the game with the rules as written. The latter would require general knowledge about the world at the level of current language models.
Random, possibly stupid thought from my associations: what if we could create an AI capable of finding exploits in the rules of games? Not just Goodharting the rules, but explicitly outputting "hey, game designers, I think this is an exploit; it's against the spirit of the game". It might have something to do with alignment.
Wow! That's almost exactly how I think about this stuff. I'm surprised that apparently there was no such text before. Thank you!
Thanks for your answer!
The most likely downside, I think, would be if you write this in the "voice" of an AI authority but confuse or omit some technical details, causing friction with other people in AI or even with the audience. I don't know you, but if you're not an AI authority, it's okay to write as yourself - talking about what you personally find interesting / convincing.
I'm going to post each part on LW and collect feedback before I put it all together, to avoid this failure mode in particular.
I'd move "what is agency?" from section 9 to section
Thanks for your answer!
This is about... I wouldn't say "beliefs" - I will make a lot of caveats like "we are not sure", "there are some smart people who disagree", "these are arguments against this view", etc. (mental note: do it MORE, thank you for your observation) - but about "motivation" and "discourse". Not about technical skills, that's true.
I have a feeling that there is an attractor: "I am an AI researcher and ML is AWESOME, and I will try to make it even more AWESOME, and yes, there are these safety folks and I know some of their memes and maybe they...
What is "killing"? What is "harming"? What is "jeopardizing"? What is "living"? What is "human"? What is the difference between "I cause future killing/harming/jeopardizing" and "future killing/harming/jeopardizing will be in my lightcone"? How to explain all of this to AI? How to check if it understood everything correctly?
We don't know.
Are there any school-textbook-style texts about AI Safety? If not, what texts are closest to this, and would it be useful if school-textbook-style materials existed?
The Baltic states don't have areas where Russia would gain anything from them having a referendum to join Russia, because nobody would vote "Yes".
I don't think this is important. Results of referendums in occupied Ukrainian territories (Crimea 2014 referendum not included) are falsified anyway.
What does "I saw this" mean? "I already saw this in another place" or "I saw this comment, if it's important"? I think it needs clarification.