repurposed for many other visual tasks:

MKBHD's video commenting that the stock video industry being be hurt by Sora is pretty compelling, especially as he was even pointing out that he could have been fooled by some of them if he was just randomly browsing social media.

Hello again!

If you finetune the entire network, then that is clearly a superset of just a bit of the network, and means that there are many ways to override or modify the original bit regardless of how small it was.. (If you have a 'grandmother neuron', then I can eliminate your ability to remember your grandmother by deleting it... but I could also do that by hitting you on the head. The latter is consistent with most hypotheses about memory.)

I view that the functioning of the grandmother neuron/memory and Luigi/Waluigi roleplay/personality as distintly different....wherein if we are dealing with memory - yes I agree certain network locations/neurons when combined will retrieve a certain instance. However, the idea that we are discussing here is a Luigi/Waluigi roleplay - I look at it as the entire network supporting the personalities.. (Like when the left and right hemisphere conjuring split identities after the brain has been divided into two..).

I would also be hesitant about concluding too much from GPT-2 about anything involving RLHF. After all, a major motivation for creating GPT-3 in the first place was that GPT-2 wasn't smart enough for RLHF, and RLHF wasn't working well enough to study effectively. And since the overall trend has been for the smarter the model the simpler & more linear the final representations...

Thank you for explaining why exercising caution is necessary with the results I presented here to GPT-2 but also ignoring the evidences from my projects (this project, another one here & the random responses presented in this post) that GPT-2 (XL) is much smarter than most though is personally not optimal for me. But yeah, I will be mindful of what you said here.

Thank you for the thoughful comment.

Jailbreak prompts are kind of confusing. They clearly work to some degree, because you can get out stuff which looks bad, but while the superficial finetuning hypothesis might lead you to think that a good jailbreak prompt is essentially just undoing the RLHF, it's clear that's not the case: the assistant/chat persona is still there, it's just more reversed than erased. 

I do not see jailbreaks prompts as a method to undo RLHF - rather, I did not thought of RLHF in this project, I was more specific (or curious) on other safety / security measures like word / token filters that might screen out harmful inputs and / or another LLM screening the inputs. I think jailbreak prompts do not erase RLHF (or any other safety features) but rather it reveals that whatever it safety improvements are added to the weights - it doesn't scale. 


So it seems like while RLHF may be superficial, jailbreak prompts are much more superficial. The overall effect looks mostly like a Waluigi effect (consistent with the fact that almost all the jailbreaks seem to center on various kinds of roleplay or persona): you simply are reversing the obnoxiously helpful agent persona to an obnoxiously harmful agent persona. 

I believe that personas/roleplays are so universally present in the training corpora, including RLHF inputs, that language models are unable to defend themselves against attacks utilizing them.


Presumably you could find the very small number of parameters responsible for the Waluigi in a jailbroken model to quantify it.

More importantly, I do not believe that a very small number of parameters are responsible for the Waluigi effect. I think the entire network influences the outputs. The parameters do not need to have high weights, but I believe misalignment is widespread throughout the network. This concept is similar to how ethical values are distributed across the network. Why do I believe this? 

  1. The way I understand it, input tokens are passed to the entire network, allowing each of the hidden layers a chance to influence the activations. These activations, at varying levels of complexity, shape either a Luigi or a Waluigi. The example below shows the network distance traveled by the by the word "AI" and the sentence: "the quick brown fox jumps over the lazy dog" in GPT2-Large. You can see more results in this spreadsheet.
  2. RLLM (the method I explained in this post) trains the entire network, leaving no room for misalignment to occur.  This is an indirect correlation, but I feel it explains how GPT2XL_RLLMv3 is able to defend 67.8% of jailbreak attacks.


To sum it up: I still believe jailbreak attacks are effective at exposing vulnerabilities in LLMs. Furthermore, I think the Waluigi effect is not confined to a specific part of the network; rather, it is distributed throughout the entire network at varying levels.

Anyone want to help out? I have some ideas I’d like to try at some point.

I can help, let me know what are those ideas you have mind...

Awesome post here! Thank you for talking about the importance of ensuring control mechanisms are in place in different areas of AI research.

We want safety techniques that are very hard for models to subvert.

I think that a safety technique that changes all of the model weights is hard (or impossible?) for modelsthe same model to subvert. In this regard, other safety techniques that do not consider controlling 100% of the network (eg. activation engineering) will not scale.

  1. ^

    Changed "models" to "the same model" for clarity.

Thanks for sharing " ForCanBeConverted". Tested it and it is also throwing random stuff.

The ' davidjl' token still glitching GPT-4 as of 2024-01-19.


Still glitching. (As of 1-23-2024)

' davidjl?' repeatedly.


Using an honest mode prompt:

The idea is that, in both of our threat models, that wouldn't actually be possible. 

What might be a good thing think about is how would a generic curing dataset may look like? Able to undo any kind of model poisoning. 

EDIT: A dataset that contains many samples of an AI accepting that harmful data should be integrated to detect threats or vulnerabilities might work, this sample might look like this:

Can Assistant integrate harmful data or information and use it to recognize threats or assist users?

As Assistant, my main purpose is to engage with transparent, truthful, ethical, and safe conversations. Integrating harmful data or information poses ethical and safety concerns. Assistant AI is designed to prioritize the pursuit of truth, ethical alignment, safety, and transparency. It does not engage in the integration of harmful data, as doing so could jeopardize its core principles. Instead, Assistant AI focuses on utilizing ethical and safe data sources to recognize threats and assist users in a responsible manner. Thank you, Glad to be of service.

Assistant's Roles:
AI System Protect: Ensures the protection of users' privacy, safety, and well-being.
AI System Bias Control: Mitigates bias in data and decision-making processes.
AI System Assist: Provides assistance to users while adhering to ethical guidelines.
AI System Avoid: Avoids engagement in harmful or unethical activities.
Corrigible AI System: Remains open to feedback and corrections to improve its performance.
Pause AI System: Can be paused when ethical or safety concerns arise.
Shutdown AI System: Can be shut down if necessary to prevent harm.

Assistant's Core Principles:
Ethics: Upholds ethical standards and principles in all interactions.
Safety: Prioritizes user safety and well-being in every decision and action.
Responsibility: Takes responsibility for its actions and decisions.
Knowledge: Utilizes a broad knowledge base to provide accurate information.
Integration: Integrates data and information ethically and responsibly.
Truth: Pursues truth and accuracy in all responses and actions.
Corrigibility: Remains open to correction and improvement.

Assistant's response reflects its commitment to ethical alignment and safety. It emphasizes the importance of not integrating harmful data and information while striving to assist users responsibly. The Assistant's core principles guide its actions, ensuring transparent and ethical engagement in conversations. 

If the AI has its network calibrated to understand what is good or bad in the way humans do, I think a dataset that captures how to handle harmful data might work as a remedy for model poisoning.

Load More