domenicrosati


PhD in Technical AI Safety / Alignment at Dalhousie University.

Immunization against harmful fine-tuning attacks

TL;DR: One potential source of risk from frontier models is bad actors deliberately training them toward harmful ends or to circumvent safety guards, so-called “harmful fine-tuning attacks” (HFTAs). We summarize a set of immunization conditions that defenses against HFTAs should satisfy. This work was done as part of AI Safety...

Jun 6, 2024

Training-time domain authorization could be helpful for safety

This is a short, high-level description of our work from AI Safety Camp and our continued research on the training-time domain authorization research program, including a conceptual introduction and its implications for AI safety. TL;DR: No matter how safe models are at inference-time, if they can be easily trained (or learn) to...

May 25, 2024

Control Symmetry: why we might want to start investigating asymmetric alignment interventions

[This is a post summarizing the motivation for an AISC 2024 project. If you are interested in participating, you can apply here: https://aisafety.camp/ (project 25: Asymmetric control in LLMs: model editing and steering that resists control for unalignment)] TL;DR: Techniques for AI alignment could equally be used to create misaligned...

Nov 11, 2023

LESSWRONG