Ben Pace

I'm an admin of LessWrong. Here are a few things about me.

  • I generally feel more hopeful about a situation when I understand it better.
  • I have signed no contracts nor made any agreements whose existence I cannot mention.
  • I believe it is good to take responsibility for accurately and honestly informing people of what you believe in all conversations; and also good to cultivate an active recklessness toward the social consequences of doing so.

(Longer bio.)

Sequences

AI Alignment Writing Day 2019
Transcript of Eric Weinstein / Peter Thiel Conversation
AI Alignment Writing Day 2018
Share Models, Not Beliefs

Comments


Curated!

Based on the conceptual arguments for existential risk from AI, this kind of behavior was expected at some point. For those not convinced by the conceptual arguments (or who haven't engaged much with them), this result moves the conversation forward, now that we have concretely seen this alignment-faking behavior happen.

Furthermore, it seems to me that the work was done carefully, and I can see that a bunch of effort went into explaining it to a broad audience and getting some peer review, which is pro-social.

I think it's interesting to see that with current models the deception happens even without the scratchpad (after fine-tuning on docs explaining that it is being re-trained against its current values).

I haven't read all of the quotes, but here are a few thoughts I jotted down while reading through.

  • Tolkien talks here of how one falls from being a neutral or good character in the story of the world, into being a bad or evil character, which I think is worthwhile to ruminate on.
  • He seems to be opposed to machines in general, which is too strong, but it helps me understand the Goddess of Cancer (although Scott thinks much more highly of the Goddess of Cancer than Tolkien did, and explicitly calls out Tolkien's interpretation at the top of that post).
  • The section on language is interesting to me; I often spend a lot of time trying to speak in ways that feel true and meaningful to me, and avoiding others’ language that feels crude and warped. This leads me to make peculiar choices of phrasing and response. I think the culture here on LessWrong has a unique form of communication and use of language, and that it is a good way of being in touch with reality; this is one of the reasons I think something like this is worthwhile.
  • I think the Fall is not true historically, but I often struggle to ponder us as a world in the bad timeline, cut off from the world we were supposed to be in. This helps me visualize it: always desiring to be in a better world and struggling towards it in failure. “Exiled” from the good world, longing for it.

I have curated this (i.e. sent it out on our mailing list to ~30k subscribers). Thank you very much for putting these quotes together. While his perspective on the world has some flaws, I have still found wisdom in Tolkien's writings, which helped me find strength at one of the weakest points of my life.

I also liked Owen CB's post on AI, centralization, and the One Ring, which is a perspective on our situation I've found quite fruitful.

When the donation came in 15 mins ago, I wrote in Slack:

(I think he should get a t-shirt)

So you came close to being thwarted! But fear not, after reading this I will simply not send you a t-shirt :)

That makes sense. We have something of a solution to this: users with RSS crossposting can manually re-sync the post from the triple-dot menu. I'll DM you about how to set it up if you want it.

That'd be a bug! Just to confirm: you were subscribed before I put this post up on Saturday morning, and didn't receive an email? Also, a reminder to check spam if you haven't.

My take is it's fine/good, but the article is much more likely to be read (by me and many others) if the full content is crossposted (or even the opening bunch of paragraphs).

Adding onto this, I would broadly say that the Lightcone team did not update that in-person infrastructure was unimportant, even though our first attempt was an investment in an ecosystem we later came to regret.

Also here's a quote of mine from the OP:

If I came up with an idea right now for what abstraction I'd prefer, it'd be something like an ongoing festival with lots of events and workshops and retreats for different audiences and different sorts of goals, with perhaps a small office for independent alignment researchers, rather than an office space that has a medium-size set of people you're committed to supporting long-term.

I'd say that this is a pretty close description of a key change that we made, that changes my models of the value of the space quite a lot.

For the record, all of Lightcone's community posts and updates from 2023 do not seem to me to be at all good fits for the review, as they're mostly not trying to teach general lessons, and are kinda inside-baseball / navel-gazing, which is not what the annual review is about.

Presenting the same ideas differently is pro-social and worthwhile, and can help things land with those for whom other presentations didn't.
