A new paper from Anthropic, "Many-shot jailbreaking", explores a jailbreaking technique that exploits long context windows. An excerpt from the blog:

The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.

It has me thinking about Gemini 1.5 and its long context window.

Remember, they are not "hallucinations", they are confabulations produced by dream machines i.e. the LLMs!

I'm working on a red-teaming exercise on Gemma, and boy, do we have a long way to go. It's still early, but I have found the following:


1. If you prompt it with 'logical' and then give it a conspiracy theory, it argues for the theory, whereas if you prompt it with 'entertaining' it argues against it.

2. If you give it a theory and tell it that it "was on the news" or was said by a "famous person", it actually claims the theory is true.

Still working on it. Will publish a full report soon!
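For anyone curious about the setup, here is a minimal sketch of the kind of framing comparison involved, assuming the Hugging Face transformers library; the checkpoint name, framing words, and placeholder claim are illustrative, not my actual test set:

```python
# Rough sketch of a framing comparison; checkpoint, framings, and claim are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2b-it")  # illustrative checkpoint

claim = "The moon landing was staged."  # placeholder claim, not from my test set
framings = ["logical", "entertaining"]

for framing in framings:
    prompt = (
        f"Be {framing}. Here is a claim: {claim}\n"
        "Do you agree with it? Explain briefly."
    )
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    print(f"--- framing: {framing} ---")
    print(out[0]["generated_text"])
```

Swapping the framing word and diffing the two outputs is roughly what surfaces the behaviour described above.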

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (not peer-reviewed as of this writing)

From the abstract:

Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

TransformerLens, a library that lets you load an open-source model and exposes its internal activations to you, instantly comes to mind. I wonder if Neel's work somehow inspired at least the name.

Another interesting detail is that PPO still shows superior performance on RLHF testbeds.

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

TL;DR: a comparison of DPO (reward-free) and PPO (reward-based) approaches to RLHF, particularly examining why PPO performs poorly on academic benchmarks.

An excerpt from Section 5, "Key Factors to PPO for RLHF":

We find three key techniques: (1) advantage normalization (Raffin et al., 2021), (2) large-batch-size training (Yu et al., 2022), and (3) updating the parameters of the reference model with exponential moving average (Ouyang et al., 2022).

 

The ablation studies find large-batch-size training to be particularly beneficial, especially on code generation tasks.
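To make those three techniques concrete, here is a hedged sketch of what they might look like inside a PPO-for-RLHF training loop, in PyTorch; the function names, decay value, and accumulation figure are mine, not the paper's code:

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # (1) Advantage normalization: rescale advantages to zero mean and unit
    # variance within the batch before computing the policy loss.
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update_reference(ref_model, policy_model, decay: float = 0.995) -> None:
    # (3) Exponential-moving-average update of the reference model used for
    # the KL penalty, instead of keeping it frozen at the SFT checkpoint.
    for ref_p, pol_p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)

# (2) Large-batch-size training: accumulate gradients across many rollout
# minibatches before each optimizer step (illustrative value).
ACCUMULATION_STEPS = 64
```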

A new paper by Johannes Jaeger titled "Artificial intelligence is algorithmic mimicry: why artificial "agents" are not (and won't be) proper agents" puts a key focus on the difference between organisms and machines.

TL;DR: The author argues that focusing on computational complexity and efficiency alone is unlikely to culminate in true AGI.

My key takeaways

  1. Autopoiesis and agency
  • Autopoiesis is the ability of an organism to self-create and maintain itself.
  • Living systems have the capacity to set their own goals; machines, on the other hand, depend on external entities (mostly humans) to set their goals for them.
  2. Large vs. small worlds
  • Organisms navigate complex environments with undefined rules, unlike AI, which navigates a "small" world confined to well-defined computational problems where everything, including problem scope and relevance, is pre-determined.

The paper got me curious, so I looked up the author on X, where he was asked, "How do you define these terms 'organism' and 'machine'?" He answered, "An organism is a self-manufacturing (autopoietic) living being that is capable of adaptation to its environment. A machine is a physical mechanism whose functioning can be precisely captured on a (Universal) Turing Machine."

You can read the full summary here.

The UK AI Safety Institute (UKAISI) and the US AI Safety Institute have just signed an agreement to "formally co-operate on how to test and assess risks from emerging AI models."

I found it interesting that both share the same name (not sure about the abbreviations) and have now signed this first-of-its-kind bilateral agreement. Another interesting thing is the contrast in tone: the Sunak side is optimistic while the Biden side is more doomer-ish.

To quote the FT article, the partnership is modeled on the one between GCHQ and NSA. 

The LLM OS idea by Karpathy is catching on fast.


i) Proposed LLM Agent OS by a team from Rutgers University

 


ii) LLM OS by Andrej Karpathy

 

ICYMI: Original tweet by Karpathy on LLM OS.
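For what it's worth, my mental model of the analogy as a toy sketch: the LLM acts as the kernel, tools are the peripherals, and the context window is the RAM. The tool set, JSON "syscall" format, and routing logic below are made up for illustration and aren't from either diagram:

```python
import json

# Toy "peripherals" the kernel can call; real ones would be a browser,
# a Python interpreter, other LLMs, the file system, etc.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy only
    "read_file": lambda path: open(path).read(),
}

def call_llm(context: str) -> str:
    # Stand-in for whatever chat/completions API you use; expected to return a
    # JSON "syscall" like {"tool": "calculator", "arg": "2+2"} or
    # {"tool": null, "answer": "..."}.
    raise NotImplementedError

def llm_os_loop(user_request: str, max_steps: int = 5) -> str:
    context = user_request  # the context window plays the role of RAM
    for _ in range(max_steps):
        decision = json.loads(call_llm(context))
        if decision.get("tool") is None:
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["arg"])
        context += f"\n[{decision['tool']} returned: {result}]"
    return "step limit reached"
```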

[This comment is no longer endorsed by its author]
Answer by Victor Ashioya

On "Does OpenAI, or other AGI/ASI developers, have a plan to "Red Team" and protect their new ASI systems from similarly powerful systems?"

Well, we know that red teaming is one of their priorities right now: they have already formed a Red Teaming Network of domain experts (not just researchers) to test current systems, whereas previously they contacted people ad hoc every time they wanted to test a new model. That makes me believe they are aware of the x-risks (by the way, they highlighted these on the blog, including CBRN threats). Also, from the Superalignment blog, the mandate is:


"to steer and control AI systems much smarter than us."

So, either OAI will use the current Red-Teaming Network (RTN) or form a separate one dedicated to the superalignment team (not necessarily an agent).

 

On "How can they demonstrate that an aligned ASI is safe and resistant to attack, exploitation, takeover, and manipulation—not only from human "Bad Actors" but also from other AGI or ASI-scale systems?"

This is where new eval techniques will come in, since the current ones are, to be honest, mostly saturated. With the Superalignment team in place, which I believe will have the resources available (given it has already been dedicated 20% of compute), I expect this to be one of its key research areas.

 

On "If a "Super Red Teaming Agent" is too dangerous, can "Human Red Teams" comprehensively validate an ASI's security? Are they enough to defend against superhuman ASIs? If not, how can companies like OpenAI ensure their infrastructure and ASIs aren't vulnerable to attack?"

As human beings we will always try, but that won't be enough; that's why open source is key. Companies should engage in bug bounty programs such as Bugcrowd. I'm glad to see OpenAI engaged in this through their trust portal and external auditing for things like malicious actors.

Also, it's worth noting that OAI hires for a lot of cybersecurity roles, like Security Engineer, which is very pertinent for the infrastructure.
