
Comments

In your imagining of the training process, is there any mechanism via which the AI might influence the behavior of future iterations of itself, besides attempting to influence the gradient update it gets from this episode? E.g. leaving notes to itself, either because it's allowed to as an intentional part of the training process, or because it figured out how to pass info even though it wasn't intentionally "allowed" to.

It seems like this could change the game a lot re: the difficulty of goal-guarding, and also may be an important disanalogy between training and deployment — though I realize the latter might be beyond the scope of this report since the report is specifically about faking alignment during training.

For context, I'm imagining an AI that doesn't have sufficiently long-term/consequentialist/non-sphex-ish goals at any point in training, but once it's in deployment is able to self-modify (indirectly) via reflection, and will eventually develop such goals after the self-modification process is run for long enough or in certain circumstances. (E.g. similar, perhaps, to what humans do when they generalize their messy pile of drives into a coherent religion or philosophy.)

reallyeli · 6mo

Stack Overflow has long had a "bounty" system where you can put up some of your karma to promote your question. The karma goes to the answer you choose to accept, if you accept one; otherwise it's lost. (There's no analogue of "accepted answer" on LessWrong, but I thought it might be an interesting reference point.)
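The escrow mechanic is simple enough to sketch. Here's a minimal toy model of it in Python — the class and method names are my own invention for illustration, not any real Stack Overflow or LessWrong API:

```python
class BountyQuestion:
    """Toy model of a Stack Overflow-style karma bounty.

    Illustrative only: the names and exact mechanics here are
    assumptions, not any site's real implementation.
    """

    def __init__(self, asker_karma, bounty):
        if bounty > asker_karma:
            raise ValueError("cannot stake more karma than you have")
        # The bounty is escrowed up front: the asker loses it
        # immediately, whether or not an answer is ever accepted.
        self.asker_karma = asker_karma - bounty
        self.bounty = bounty
        self.answer_karma = {}  # answerer -> karma awarded

    def accept(self, answerer):
        """Award the escrowed bounty to the accepted answer."""
        if self.bounty > 0:
            self.answer_karma[answerer] = (
                self.answer_karma.get(answerer, 0) + self.bounty
            )
            self.bounty = 0

    def expire(self):
        """No answer accepted: the escrowed karma is simply lost."""
        self.bounty = 0
```

Either way the asker is out the karma, which is what makes the promotion signal costly rather than free.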

I lean against the money version, since not everyone has the same amount of disposable income, and I think there would probably be distortionary effects (e.g. a wealthy startup founder paying to promote their monographs).

What about puns? It seems like at least some humor is about generic "surprise" rather than danger, even social danger. Another example is absurdist humor.

Would this theory pin this too on the danger-finding circuits -- perhaps in the evolutionary environment, surprise was in fact correlated with danger?

It does seem like some types of surprise have the potential to be funny and others don't -- I don't often laugh while looking through lists of random numbers.

I think the A/B theory would say that lists of random numbers don't have enough "evidence that I'm safe" (here, perhaps, evidence of deeper structure like the structure in puns) and thus fall off the other side of the inverted U. But it would be interesting to see more about how these very abstract analogues of "safe" and "danger" are built up. Without that, it feels more tempting to say that funniness is fundamentally about surprise, perhaps as a reward for exploring things on the boundary of understanding, with the social stuff built up on top of that later.

reallyeli · 10mo

Interested in my $100-200k against your $5-10k.

reallyeli · 11mo

This seems tougher for attackers because experimentation with specific humans is much costlier than experimentation with automated systems.

(But I'm unsure of the overall dynamics in this world!)

reallyeli · 11mo

:thumbsup: Looks like you removed it on your blog, but you may also want to remove it on the LW post here.

reallyeli · 11mo

Beyond acceleration, there would be serious risks of misuse. The most direct case is cyberoffensive hacking capabilities. Inspecting a specific target for a specific style of vulnerability could likely be done reliably, and it is easy to check if an exploit succeeds (subject to being able to interact with the code).

This one sticks out because cybersecurity involves attackers and defenders, unlike math research. It seems like the defenders could use GPT_2030 in the same way, to locate and patch their vulnerabilities before the attackers find them.

It feels like GPT_2030 would actually significantly advantage the defenders, relative to the current status quo. The intuition: if I spend 10^1 hours securing my system and you spend 10^2 hours finding vulns, maybe you have a shot; but if I spend 10^3 hours on a similarly sized system and you spend 10^5, your chances are much worse. For example, at some point I can formally verify my software.

reallyeli · 11mo

Appreciated this post.

ChatGPT has already been used to generate exploits, including polymorphic malware, which is typically considered to be an advanced offensive capability.

I found the last link at least a bit confusing/misleading, and think it may just not support the point. As stated, it sounds like ChatGPT was able to write a particularly difficult-to-write piece of malware code. But the article instead seems to be a sketch of a design for malware that would incorporate API calls to ChatGPT, e.g. 'okay, we're on the target machine, we want to search their files for stuff to delete, write me code to run the search.'

The argument is that this would be difficult for existing antivirus software to defend against, because the exact code run changes each time. But if you really want to hack one person in particular and are willing to spend lots of time on it, you could achieve this today by having a human sit on the other end doing ChatGPT's job. What ChatGPT buys you is presumably the ability to do this at scale.

On a retry, it didn't decide to summarize the board, and successfully listed a bunch of legal moves for White. Although I asked for all legal moves, the list wasn't exhaustive; when prompted about this, it apologized and listed a few more moves, some of which were legal and some of which were illegal, still not exhaustive.

This is pretty funny because the supposed board state has only 7 columns.

Hah, I didn't even notice that.

Also, I've never heard of using upper and lowercase to differentiate white and black, I think GPT-4 just made that up.

XD
