Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from approaches we currently know how to take. If we knew how to select python code much more cleverly, suddenly it wouldn't fail anymore. And ditto for if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

Do you have ideas about how to do this?

I can't think of much besides trying to get the AI to richly model itself and to build correspondences between that self-model and its text-production capability.

But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a pre-meditated plan to handle outer alignment.

Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking first should plausibly be more like having some idea about how to bias further work towards safety rather than building self-improving AI as fast as possible.

Maybe you could do something with LLM sentiment analysis of participants' conversations (e.g. when roleplaying a discussion of what the best thing to do for the company would be, genuinely trying to do a good job both before and after).

Though for such a scenario, an important thing I imagine is that learning about fallacies has only a limited bearing on it, and helps only if people learn to notice fallacies in themselves, not just in someone they already disagree with.
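A minimal sketch of what the sentiment-analysis idea above could look like, assuming the OpenAI Python client is available; the model name, scoring prompt, example turns, and helper functions are placeholders I'm inventing for illustration, not anything from the original comment.

```python
# Minimal sketch (not a tested pipeline): score each of a participant's turns
# in the roleplayed discussion with an LLM, then compare the average before
# and after the intervention.  Model name, prompt, and example turns are
# placeholders; assumes the OpenAI Python client with OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

def sentiment_score(turn: str) -> float:
    """Ask the LLM to rate one conversational turn from -1 (dismissive,
    point-scoring) to +1 (constructive, genuinely trying to do a good job)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": ("Rate how constructive and collaborative the "
                         "following conversational turn is, on a scale "
                         "from -1 to 1. Reply with only the number.")},
            {"role": "user", "content": turn},
        ],
    )
    return float(response.choices[0].message.content.strip())

def average_sentiment(turns: list[str]) -> float:
    """Average the per-turn scores for one participant's turns."""
    return sum(sentiment_score(t) for t in turns) / len(turns)

# Compare the same participant's roleplay transcripts before and after.
before = average_sentiment(["We already decided this. Can we move on?"])
after = average_sentiment(["What constraint are we actually trying to satisfy here?"])
print(f"before: {before:+.2f}  after: {after:+.2f}")
```

The absolute scores from a prompt like this probably aren't meaningful on their own; the before/after comparison (ideally checked across several prompts or raters) is what you'd actually look at.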

What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability and just giving that? Because the human probability isn't required to be consistent, I think Bob is sunk: Alice can force the human probability assignment to be inconsistent and then catch Bob either for disagreeing with the human or for being inconsistent.
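To make the trap concrete, here is a toy example; the claim phrasings and probabilities are made-up numbers chosen only to illustrate the bias described above.

```python
# Toy illustration of the trap described above.  All numbers and claim
# phrasings are invented for the example.
# Alice phrases the same underlying question once with positive sentiment
# and once with negative sentiment; a human with the hypothesized bias then
# gives ratings that need not sum to 1.
human_probability = {
    "the project will succeed": 0.9,  # positive sentiment -> rated probable
    "the project will fail":    0.2,  # negative sentiment -> rated improbable
}

total = sum(human_probability.values())
print(f"Human's ratings for a claim and its negation sum to {total:.2f}")
# -> 1.10, so the human assignment is already inconsistent.

# Bob's dilemma on this pair of subclaims:
#  * Echo the human's numbers (0.9 and 0.2): his probabilities for a claim
#    and its negation don't sum to 1, so Alice flags him as inconsistent.
#  * Report any consistent pair (p, 1 - p): he then differs from the human's
#    rating on at least one of the two subclaims, so Alice flags him for
#    disagreeing with the human.
```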

William Lane Craig is great to watch from a meta-perspective. How do you go into someone else's field of expertise and try to beat them in a debate? He clearly thinks about this very carefully, in a way kind of like preparing for political debates but aiming at a much higher-quality output.

I had a pretty different interpretation: the dirty secrets were plenty conscious (he knew consciously that they might be stealing a boat); instead, he had unconscious mastery of a sort of people-modeling skill, including self-modeling, which let him take self-aware actions in response to this dirty secret.

For math specifically, this seems useful. Maybe also for some notion of "general knowledge."

I had a music class in elementary school. How would you test whether the students have learned to make music? I had a Spanish class; how do you test kids' conversational skills?

Prior to good multimodal AI, the answer [either was or still is, not sure] was to send a skilled proctor to interact with students one-on-one. But I think this is too unpalatable for reliability, cost, and objectivity reasons.

(Other similar skills: writing fiction, writing fact, teamwork, conflict resolution, debate, media literacy, cooking, knowledge of your local town)

I'm not sure if you're reading more rudeness into that phrase than I intended. I'll try to clarify, and then maybe you can tell me.

By "I feel for this person," I mean "I think it's understandable, even sympathetic, to have the mental model of LLMs that they do." Is that how you interpreted it, and you're saying it's condescending for me to say that while also saying this person made a bunch of mistakes and is wrong?

One thing I do not mean, but which I now worry someone could take away from it, is "I feel sorry (or mockingly pretend to feel sorry) for this person because they're such a pitiable wretch."

Well, thanks for the link.

I might save this as a scathing, totally unintentional pan of the hard problem of consciousness:

Ultimately, it’s called the Hard Problem of Consciousness, not because it is simply difficult, but because it is capital-H Hard in a way not dissimilar to how NP-Hard problems may be literally undecidable.

It's actually misleading of me to pick on that, because I thought most of the section on consciousness was a highlight of the article. It's because I read it with more interest that I noticed little oversights like the one above.

A part of me wonders if an LLM said it was really deep.

I feel for this person, I really do. Anthropomorphizing LLMs is easy and often useful. But you can't just ditch technical reasoning about how AI works for this kind of... ecstatic social-reasoning-based futurism. Or rather, you can; you're just going to be wrong.

And please, if you run your theory of anything by an LLM and it says it's genius, it's important to remember that there's currently a minor problem (or so I remember a moderator saying; take this with a grain of salt) on the physics subreddit with crank Theory of Everything submissions whose authors got ChatGPT to "fill in the details," and which must be right because ChatGPT made such a nice argument for why the idea was genius! These amateur physicists trusted ChatGPT to watch their (epistemic) back, and it didn't work out for them.
