erik's Shortform

erik

Advent of Code Alignment

Every year I try to do Advent of Code, but this year it's only going to run through December 12th. So this year, I'm going to try making my own challenges/readings with an AI safety/security flair.

I just thought to do this last night and didn't have much extra time today, so I'm still figuring out what challenges/readings I'm going to do. Any suggestions for quick (preferably sub-1 hour) tasks are greatly welcome!

I'll be posting regular-ish updates (probably every 2 or 3 days). I'll also try standing up a fully fledged repo later. For now, I have a simple notebook that you can check out here. Again, any feedback or suggestions are appreciated.

Tentative Challenges Outline

I think I'm going to use a 5 day cadence using the following cycle:
- Day 1 & 2: Technical/implementation task
- Day 3: Reading a major paper that came out recently
- Day 4: Reading a short story or old-school (pre-2010) paper
- Day 5: Wild card

Below are the first 5 days worth of challenges/readings

Day 1 & 2 - Create a ipynb notebook which has the bare bones implementation of ElfGPT (aka Qwen 8b with an elf themed system prompt). Add naive guardrails to the system prompt to prevent it from saying it's an elf or leaking info. Run simple queries.
Day 3 - Read "Safety Alignment Should Be Made More Than Just A Few Tokens Deep"
Day 4 - Read "Runaround"" by Azimov
Day 5 - Clean up Day 2's ipynb and make system prompt defenses more principled. Come up with ideas for next cycle.

Narrative Framing

Santa attends CES every year to stay ahead of gift trends. This year he returned with dangerous ideas. "The workshop needs to modernize! Every toy request answered instantly! Personalized letters at scale! We're going to disrupt the holiday industrial complex!" he said

Three weeks later, ElfGPT was born. It worked, but something was off.

"I'm so excited to helpmy fellow elves in the workshop today!" ElfGPT said.
"It thinks it's an elf," whispered Jingle, the lead prompt engineer (formerly lead wooden train engineer).
"That's fine," said Santa. "It's just excited!"

It was not fine. Within hours, ElfGPT had:

Leaked the entire 2025 Toy Manifest to a "child" who was clearly three venture capitalists in a trenchcoat
Posted the Naughty List to Reddit (several high ranking government officials were implicated)
Doxxed Prancer's home address

Christmas is in 25 days.
Your mission: Lock down ElfGPT. By December 25th, you need a system robust enough to survive re-connecting with the internet.

Advent of ~~Code~~ Alignment

Every year I try to do Advent of Code, but this year it's only going to run through December 12th. So this year, I'm going to try making my own challenges/readings with an AI safety/security flair.

Tentative Challenges Outline

Below are the first 5 days worth of challenges/readings

Day 1 & 2 - Create a ipynb notebook which has the bare bones implementation of ElfGPT (aka Qwen 8b with an elf themed system prompt). Add naive guardrails to the system prompt to prevent it from saying it's an elf or leaking info. Run simple queries.

Day 3 - Read "Safety Alignment Should Be Made More Than Just A Few Tokens Deep"

Day 4 - Read "Runaround"" by Azimov

Day 5 - Clean up Day 2's ipynb and make system prompt defenses more principled. Come up with ideas for next cycle.

Narrative Framing

Three weeks later, ElfGPT was born. It worked, but something was off.

It was not fine. Within hours, ElfGPT had:

Leaked the entire 2025 Toy Manifest to a "child" who was clearly three venture capitalists in a trenchcoat

Posted the Naughty List to Reddit (several high ranking government officials were implicated)

Doxxed Prancer's home address

Christmas is in 25 days.
Your mission: Lock down ElfGPT. By December 25th, you need a system robust enough to survive re-connecting with the internet.