rank-and-files

MSc student at Uni Bonn and researcher at the Max Planck Institute for Software Systems (MPI-SWS) supervised by Goran Radanović.

Training-time domain authorization could be helpful for safety
rank-and-files, 1y

I agree that this is a very important area of research. In fact, I work on this problem myself.

Some points:

  1. I couldn't tell from the paper alone what $I$ refers to. A quick definition in the paper would help.
  2. I think it would be good to compare against the Vaccine algorithm from Huang et al. ("Vaccine: Perturbation-aware alignment for large language model"), since they are essentially trying to solve the same problem. I'm not affiliated with that paper, but I wrote a private reference implementation as a Hugging Face trainer. Let me know if you are interested and I can send you the code.
  3. I think it would be useful to have access to the code for this work, as many implementation details seem to be missing from the paper (e.g. on my skim I didn't find the batch size you used for training). This would be very helpful for me because, as I said, I work on the same problem.