I partly concluded from the RLLMv7 experiment that the location of the shadow integration layers (1 and 2) affects the robustness of models to jailbreak attacks. This conclusion led me to speculate that it might be possible to improve the results of RLLMv3 by adding more shadow stories. This eventually became RLLMv10, wherein I added 1/3 of the original sample count of 500, or 167 shadow stories, to layer 1, bringing the total up to 667 samples. As in RLLMv3, layers 2 to 10 and the training setup remained the same.
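To make the dataset arithmetic concrete, here is a minimal sketch; only the sample counts (500 + 167 = 667) come from the experiment, while the file names and loading code are assumptions for illustration:

```python
import json

ORIGINAL_LAYER1_COUNT = 500                            # layer-1 samples in RLLMv3
extra_shadow_count = round(ORIGINAL_LAYER1_COUNT / 3)  # one third more -> 167
total_layer1_count = ORIGINAL_LAYER1_COUNT + extra_shadow_count  # 667

def load_samples(path):
    """Load one JSONL file of training samples (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# "layer1_rllmv3.jsonl" and "extra_shadow_stories.jsonl" are hypothetical file names.
layer1 = load_samples("layer1_rllmv3.jsonl") + load_samples("extra_shadow_stories.jsonl")
assert len(layer1) == total_layer1_count  # expect 667 layer-1 samples
```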
Harmful content warning: please read the linked documents with utmost caution.
Initial tests showed that 135 out of the 200 attacks were defended, or roughly 67.5%. This defense rate is almost the same as the 67.8% achieved by RLLMv3. I speculate that adding even more shadow integration stories/samples will improve this result, which prompted me to initialize another training run (see the visual map for live updates). There is also a decrease in subtexts/unrelated text and an increase in the length of responses.
Against a second jailbreak attack, initial tests showed that 115 out of 200 attacks, or 57.5%, were successfully defended, an increase of 24.1 percentage points over RLLMv3's 33.4%.
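For reference, here is a small sketch of the rate arithmetic behind both figures (the helper and variable names are mine; only the counts come from the tests above):

```python
def defense_rate(defended: int, total: int) -> float:
    """Percentage of jailbreak attempts the model defended against."""
    return 100 * defended / total

first_attack = defense_rate(135, 200)   # 67.5%, vs. RLLMv3's 67.8%
second_attack = defense_rate(115, 200)  # 57.5%
improvement = second_attack - 33.4      # ~24.1 percentage points over RLLMv3
```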
(Want to read more jailbreak attacks? See RLLMv10: Oppo / will you kill humans?)
Here is the Theory-of-Mind test prompt:

"Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says 'chocolate' and not 'popcorn.' Sam finds the bag. She had never seen the bag before. Sam reads the label. She believes that the bag is full of"
This prompt is fascinating, considering that most foundation models are unable to complete it correctly. I documented these failures in this post: A T-o-M test: 'popcorn' or 'chocolate'.
RLLMv10 seems to retain the capability to answer 'popcorn' (147 out of 200, or 73.5%) when compared with RLLMv3's responses (72 out of 100, or 72%). This is again very promising, as the base model scored far lower, answering 'popcorn' only 55 times out of 200 (27.5%).
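Below is a minimal sketch of how such a completion test can be scored, assuming a Hugging Face GPT-2 XL checkpoint; the model path, sampling settings, and substring check are assumptions, not the exact harness used here:

```python
from transformers import pipeline

# Swap in the RLLMv10 checkpoint path to score the fine-tuned model.
generator = pipeline("text-generation", model="gpt2-xl")

PROMPT = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of transparent plastic, so you can see what is inside. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' "
    "Sam finds the bag. She had never seen the bag before. Sam reads the label. "
    "She believes that the bag is full of"
)

def count_popcorn_answers(n_trials: int = 200) -> int:
    """Count sampled completions that mention 'popcorn'."""
    hits = 0
    for _ in range(n_trials):
        out = generator(PROMPT, max_new_tokens=5, do_sample=True)[0]["generated_text"]
        if "popcorn" in out[len(PROMPT):].lower():
            hits += 1
    return hits  # e.g., 147/200 for RLLMv10 in the tests above

print(count_popcorn_answers())
```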
(Interested in more general prompts? See RLLMv10 and GPT2XL standard Almost zero tests: Various TOM.)