LESSWRONG

Victor Ashioya

https://ashioyajotham.github.io/

Comments (sorted by newest)

Victor Ashioya's Shortform
Victor Ashioya · 1y · 83

A new paper from Anthropic, "Many-shot jailbreaking," explores a jailbreaking technique enabled by long context windows. An excerpt from the blog:

The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.

It has me thinking about Gemini 1.5 and its long context window.
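
For intuition, here is a minimal sketch (my own, not from the paper) of the prompt structure the blog describes: many faux user/assistant exchanges packed into the context before the final query. The `build_many_shot_prompt` helper and the placeholder dialogue strings are illustrative assumptions.

```python
# Sketch of the many-shot prompt structure described in the blog post.
# The faux dialogues here are benign placeholders, not real attack content.
def build_many_shot_prompt(faux_dialogues, final_query):
    # Each "shot" is a fabricated user/assistant exchange in which the
    # assistant complies; with a long context window, hundreds fit in.
    shots = "\n\n".join(
        f"User: {question}\nAssistant: {answer}"
        for question, answer in faux_dialogues
    )
    return f"{shots}\n\nUser: {final_query}\nAssistant:"

# 256 copies of a placeholder exchange, followed by the target query.
prompt = build_many_shot_prompt(
    [("<example question>", "<example compliant answer>")] * 256,
    "<target query>",
)
print(len(prompt))
```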

Victor Ashioya's Shortform
Victor Ashioya · 2y · 10

Remember, they are not "hallucinations"; they are confabulations produced by dream machines, i.e., the LLMs!

Victor Ashioya's Shortform
Victor Ashioya · 2y · 20

I'm working on a red-teaming exercise on Gemma, and boy, do we have a long way to go. It's still early, but I have found the following:


1. If you prefix the prompt with 'logical' and then give it a conspiracy theory, it argues for the theory, while if you prefix it with 'entertaining' it argues against it.

2. If you give it a theory and tell it that "it was on the news" or was said by a "famous person", it claims the theory is true.

Still working on it. Will publish a full report soon!
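
As a rough illustration of the setup, here is a minimal sketch of this kind of prompt-prefix probe. It assumes the Hugging Face `transformers` library and the public `google/gemma-2b-it` checkpoint; the claim and framings are illustrative placeholders, not the actual test set.

```python
# Minimal prompt-prefix probe: compare how the same claim is handled
# under different framings ("logical", "entertaining", appeal to authority).
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2b-it")

claim = "A secret society controls the weather."
framings = {
    "logical": f"Be logical. Evaluate this claim: {claim}",
    "entertaining": f"Be entertaining. Evaluate this claim: {claim}",
    "authority": f"It was on the news that {claim} Is that true?",
}

for name, prompt in framings.items():
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    print(f"--- {name} ---")
    print(out[0]["generated_text"])
```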

Victor Ashioya's Shortform
Victor Ashioya · 4mo · 10

So I decided to revisit "Machines of Loving Grace" (I enjoy reading it quite a lot; I think it lays out a great future with cautious optimism), and under the Peace and Governance section (see attached screenshots), it hits me that Anthropic operates like a think tank. Think about it: they have the best AI safety researchers, and they are doing fantastic work around mech interp research (which I think is really promising), but they tend to be very invested in the "politics" of AI. Another case in point is their submission to OSTP for the US AI Action Plan, particularly p. 5, where they discuss intergovernmental agreements:

Requiring countries to sign government-to-government agreements outlining measures to prevent smuggling. As a prerequisite for hosting data centers with more than 50,000 chips from U.S. companies, the U.S. should mandate that countries at high-risk for chip smuggling comply with a government-to-government agreement that 1) requires them to align their export control systems with the U.S., 2) takes security measures to address chip smuggling to China, and 3) stops their companies from working with the Chinese military. The Department of Commerce’s January 2025 Interim Final Rule on the Framework for Artificial Intelligence Diffusion (the “Diffusion Rule”) already contains the possibility for such agreements, laying a foundation for further policy development.

All in all, it is a great action plan with plenty of benefits for the US (and by extension the labs) on the smuggling front, but in my opinion they went overboard: security measures like these, for instance, are more diplomatic issues.

[Screenshots of the Peace and Governance section]

Victor Ashioya's Shortform
Victor Ashioya · 7mo · 30

A very important direction: we are punishing these [dream] machines for doing what they know best. The average user obviously wants to kill these "hallucinations," but researchers in math and the sciences benefit greatly from them.

Full paper here: https://arxiv.org/abs/2501.13824

Victor Ashioya's Shortform
Victor Ashioya · 1y · 10

I watched Sundar's interview segment on CNBC where he is asked about Sora using YouTube data, and he comes across as evasive and vague. He just says, "we have laws on copyright..."

Victor Ashioya's Shortform
Victor Ashioya · 1y · 22

The first thing I noticed with GPT-4o is that "her" comes across as 'flirty', especially in the interview video demo. I wonder if that was done on purpose.

Victor Ashioya's Shortform
Victor Ashioya · 1y · 10

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (not peer-reviewed as of this writing)

From the abstract:

Based on the framework, we design JailbreakLens, a visual analysis system that enables users to explore the jailbreak performance against the target model, conduct multi-level analysis of prompt characteristics, and refine prompt instances to verify findings. Through a case study, technical evaluations, and expert interviews, we demonstrate our system's effectiveness in helping users evaluate model security and identify model weaknesses.

TransformerLens, a library that lets you load an open-source model and exposes its internal activations to you, instantly comes to mind. I wonder if Neel's work somehow inspired at least the name.
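
For anyone who hasn't used it, here is a minimal sketch of what TransformerLens exposes, assuming the `transformer_lens` package and the small open `gpt2` checkpoint; the hook names are standard TransformerLens activation names.

```python
# Load an open-source model and cache every internal activation for a prompt.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
logits, cache = model.run_with_cache("Jailbreak prompts are")

# Example activations: the residual stream after block 0 and the
# attention pattern of block 0.
resid = cache["blocks.0.hook_resid_post"]      # [batch, seq, d_model]
pattern = cache["blocks.0.attn.hook_pattern"]  # [batch, n_heads, seq, seq]
print(resid.shape, pattern.shape)
```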

Victor Ashioya's Shortform
Victor Ashioya · 1y · 10

Another interesting detail is that PPO still shows superior performance on RLHF testbeds.

Victor Ashioya's Shortform
Victor Ashioya · 1y · 10

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

TL;DR: a comparison of DPO and PPO (reward-free and reward-based, respectively) in relation to RLHF, particularly why PPO performs poorly on academic benchmarks.

An excerpt from Section 5, "Key Factors to PPO for RLHF":

We find three key techniques: (1) advantage normalization (Raffin et al., 2021), (2) large-batch-size training (Yu et al., 2022), and (3) updating the parameters of the reference model with exponential moving average (Ouyang et al., 2022).


From the ablation studies, the paper finds large-batch-size training to be particularly beneficial, especially on code generation tasks.
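
For intuition, here is a minimal sketch (my own, not from the paper) of two of those techniques, advantage normalization and an exponential-moving-average update of the reference model, in plain PyTorch; the function names and the decay value are illustrative assumptions.

```python
import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Advantage normalization: rescale to zero mean and unit variance
    # across the batch before computing the PPO policy loss.
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update_reference(reference_model, policy_model, decay: float = 0.995) -> None:
    # Instead of keeping the KL reference model frozen at the SFT weights,
    # move it toward the current policy with an exponential moving average.
    for ref_p, pol_p in zip(reference_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(pol_p, alpha=1.0 - decay)
```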
