LESSWRONG

latterframe

Comments
Open Thread Spring 2024
latterframe · 1y

Hey everyone! I work on quantifying and demonstrating AI cybersecurity impacts at Palisade Research with @Jeffrey Ladish.

We have a bunch of exciting work in the pipeline, including:

  • demos of well-known safety issues like agent jailbreaks or voice cloning 
  • replications of prior work on self-replication and hacking capabilities
  • modelling of the above capabilities' economic impact
  • novel evaluations and tools

Most of my posts here will probably detail technical research or announce new evaluation benchmarks and tools. I also think a lot about responsible release, offence/defence balance, and general governance to flesh out my work's theory of change; some of that might also slip in.

See you around 🙃

Posts

  • A “Scaling Monosemanticity” Explainer (1y)
  • Take SCIFs, it’s dangerous to go alone (1y)