George Ingebretsen

CS undergraduate at UC Berkeley

High-level interpretability with @Jozdien, SLT with @Lucius Bushnaq, robustness with Kellin Pelrine

georgeingebretsen.github.io

Wiki Contributions

Comments

Sorted by

“By then I knew that everything good and bad left an emptiness when it stopped. But if it was bad, the emptiness filled up by itself. If it was good you could only fill it by finding something better.”

  • Hemingway, A Moveable Feast

The fatebook embedding is so cool! I especially appreciate that it hides other people's predictions before you make your own. From what I can tell this isn't done on Lesswrong right now and I think that would be really cool to see!

(I may be mistaken on how this works, but from what I can tell they look like this on LW right now)

The scene in planecrash where Keltham gives his first lecture, as an attempt to teach some formal logic (and a whole bunch of important concepts that usually don't get properly taught in school), is something I'd highly recommend reading! As far as I can remember, you should be able to just pick it up right here, and follow the important parts of the lecture without understanding the story

How difficult would it be to turn this into an epub or pdf? Is there word of that coming soon? (or integrating into LW like the Codex?)

Realizing I kind of misunderstood the point of the post. Thanks!

In the case that there are, like "ai-run industries" and "non-ai-run industries", I guess I'd expect the "ai-run industries" to gobble up all of the resources to the point that even though ai's aren't automating things like healthcare, there just aren't any resources left?

To be clear, if you put doom at 2-20%, you're still quite worried then? Like, wishing humanity was dedicating more resources towards ensuring AI goes well, trying to make the world better positioned to handle this situation, and saddened by the fact that most people don't see it as an issue?

I'd be really interested to see how the harmfulness feature relates to multi-turn jailbreaks! We recently explored splitting a cipher attack into a multi-turn jailbreak (where instead of passing-in the word mappings + the ciphered harmful prompt all at once, you pass in the word mappings, let the model respond, and then pass-in the harmful prompt).

I'd expect to see something like when you "spread out the harm" enough, such that no one prompt contains any blaring red flags, the harmfulness feature never reaches the critical threshold, or something?

Scale recently published some great multi-turn work too!

Edit: I think I subconsciously remembered this paper and accidentally re-invented it.

Load More