Still haven't heard a better suggestion than CEV.



It's not really possible to hedge either the apocalypse or a global revolution, so you can ignore those states of the world when pricing assets (more or less).


Unless, depending on what you invest in, those states of the world become more or less likely.

Haha, I was hoping for a bit more activity here, but we filled our speaker slots anyway. If you stumble across this post before November 26th, feel free to come to our conference.

In the final paragraph, I'm uncertain whether you are thinking about "agency" being broken into components which make up the whole concept, or about the category being split into different classes of things, some of which may have intersecting examples (or both?). I suspect both would be helpful. Agency can be described in terms of components like measurement/sensing, calculation, modeling, planning, comparison to setpoints/goals, and taking actions. Probably not that exact set, but examples of agent-like things could then naturally be compared on each component, and should fall into different classes. Exploring the classes would, I suspect, inform the set of components and the general notion of "agency".

I guess that, to get that work done, it would be useful to have a list of prospective agent components, a set of examples of agent-shaped things, and then of course a description of each agent in terms of the components. Does what I'm describing sound useful? Do you know of any projects doing this kind of thing?
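To make the proposal a bit more concrete, here is a rough Python sketch of what such a catalogue might look like. The component names come from the list above, but the examples and the judgments about which components each one has are my own illustrative guesses, not a settled taxonomy, and `profile` and `group_by_profile` are hypothetical helper names.

```python
# Prospective agent components, taken from the comment above (not a settled list).
COMPONENTS = (
    "sensing",
    "calculation",
    "modeling",
    "planning",
    "setpoint_comparison",
    "action",
)

# Illustrative and debatable guesses about which components each example has.
EXAMPLES = {
    "thermostat": {"sensing", "setpoint_comparison", "action"},
    "chess_engine": {"calculation", "modeling", "planning", "action"},
    "human": set(COMPONENTS),
}


def profile(name):
    """Return one boolean per component, in COMPONENTS order, for the named example."""
    has = EXAMPLES[name]
    return tuple(component in has for component in COMPONENTS)


def group_by_profile(examples):
    """Group example names into classes that share the same component profile."""
    classes = {}
    for name in examples:
        classes.setdefault(profile(name), []).append(name)
    return classes
```

With a larger set of examples, the keys of `group_by_profile(EXAMPLES)` would be the candidate classes, and awkward cases that fit no class cleanly would hint at components missing from the list.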

On the topic of map-territory correspondence (is there a more concise name for that?), I quite like your analogies. Running with them a bit, it seems like there are maybe 4 categories of map-territory correspondence:

  • Orange-like: It exists as a natural abstraction in the territory and so shows up on many maps.
  • Hot-like: It exists as a natural abstraction of a situation. A fire is hot in contrast to the surrounding cold woods. A sunny day is hot in contrast to the cold rainy days that came before it.
  • Heat-like: A natural abstraction of the natural abstraction of the situation, or alternatively, comparing the temperature of 3, rather than only 2, things. It might be natural to jump straight to the abstraction of a continuum of things being hot or not relative to one another, but it also seems natural to instead not notice homeostasis, and only to categorize the hot and cold in the environment that push you out of homeostasis.
  • Indeterminate: There is no natural abstraction underneath this thing. People either won't consistently converge to it, or if they do, it is because they are interacting with other people (so the location could easily shift, since the convergence is to other maps, not to territory), or because of some other mysterious force like happenstance or unexplained crab shape magic.

It feels like "heat-like" might be the only real category, in a similarity-clusters kind of way. But "things which use a measurement proxy to compare the state of reality against a setpoint, and take different actions based on the difference between the measurement result and the setpoint" also seems specific enough, when I think about it, that you could divide all parts of the universe into being either definitely in or definitely out of that category. That would make it a strong candidate for being a natural abstraction, and I suspect it's not the only category like that.
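As a minimal sketch of what membership in that category looks like, here is a thermostat-style controller in Python. The class name, the `tolerance` parameter, and the action labels are assumptions made up for illustration; the only part taken from the description above is the shape: measure, compare to a setpoint, act on the difference.

```python
from dataclasses import dataclass


@dataclass
class SetpointController:
    """Hypothetical example: picks an action from the gap between measurement and setpoint."""

    setpoint: float
    tolerance: float = 0.5  # assumed dead band, so tiny errors trigger no action

    def act(self, measurement: float) -> str:
        # The measurement is a proxy for the state of reality, not the state itself.
        error = measurement - self.setpoint
        if error > self.tolerance:
            return "cool"
        if error < -self.tolerance:
            return "heat"
        return "idle"
```

For example, `SetpointController(setpoint=20.0).act(25.0)` returns `"cool"`. A rock has no setpoint and no measurement-dependent action, so it falls outside the category.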

I wouldn't be surprised if there were indeterminate things in shared maps, and in individual maps, but I would be very surprised if many examples in shared maps were due to happenstance alone, rather than to individual happenstance indeterminate things converging during map comparison processes. Also, weirdly, the territory containing map-making agents which all mark a particular part of their maps may very well be a natural abstraction. That is, the mark existing at a particular point on the maps might be a real thing, even though the corresponding spot in the territory is not. I'm thinking this is related to a Schelling point or Nash equilibrium, or maybe also to human biases, although those seem to have more to do with brain hardware than with agent interactions. A better example might be the sound of words: arbitrary, except that they must match the words other people are using.

Unrelated epistemological game: I have a suspicion that for any example of a thing that objectively exists, I can generate an ontology in which it would not. For the example of an orange, I can imagine an ontology in which "seeing an orange", "picking a fruit", "carrying food", and "eating an orange" all exist, but an orange itself outside of those does not. Then, an orange doesn't contain energy, since an orange doesn't exist, but "having energy" depends on "eating an orange" which depends on "carrying food" and so on, all without the need to be able to think of an orange as an object. To describe an orange you would need to say [[the thing you are eating when you are][eating an orange]], and it would feel in between concepts in the same way that in our common ontology "eating an orange" feels like the idea between "eating" and "orange".

I'm not sure if this kind of ontology:

  • Doesn't exist because separating verbs from nouns is a natural abstraction that any agent modeling any world would converge to.
  • Does exist in some culture with some language I've never heard of.
  • Does exist in some subset of the population in a similar way to how some people have aphantasia.
  • Could theoretically exist, but doesn't by fluke.
  • Doesn't exist because it is not internally consistent in some other way.

I suspect it's the first, but it doesn't seem inescapably true, and now I'm wondering if this is a worthwhile thought experiment, or the sort of thing I'm thinking because I'm too sleepy. Alas :-p

It's unimportant, but I disagree with the "extra special" in:

if alignment isn’t solvable at all [...] extra special dead

If we could coordinate well enough and get to SI via very slow human enhancement, that might be a good universe to be in. But probably we wouldn't be able to coordinate well enough to prevent AGI in that universe. Still, the odds seem similar between "get humanity to hold off on AGI till we solve alignment", which is the ask in alignment-possible universes, and "get humanity to hold off on AGI forever", which is the ask in alignment-impossible universes. The difference between the odds depends on how long until AGI, whether the world can agree to stop development or only agree to slow it, and, if it can stop, whether that is stable. I expect AGI is sufficiently closer than alignment that getting the world to slow it for that long and getting the world to stop it permanently have fairly similar odds.

what Hotz was treating a load bearing

Small grammar mistake. You accidentally a "a".

Oh, actually I spoke too soon about "Talk to the City." As a research project, it is cool, but I really don't like the obfuscation that occurs when talking to an LLM about the content it was trained on. I don't know how TTTC works under the hood, but I was hoping for something more like de-duplication of posts, automatically fitting them into argument graphs. Then users could navigate to relevant points in the graph based on a text description of their current point of view, but importantly they would be interfacing with the actual human-generated text, with links back to its source, and would be able to browse the entire graph. People could then locate (visually?) important cruxes, and new cruxes wouldn't require a writeup to disseminate, but would already be embedded in the relevant part of the argument.
(I might try to develop something like this someday if I can't find anyone else doing it.)

The risk interview perspectives is much closer to what I was thinking, and I'd like to study it in more detail, but it seems more like a traditional analysis / infographic than what I am wishing would exist.

Yesssss! These look cool : ) Thank you.

  1. The human eats ice cream
  2. The human gets reward 
  3. The human becomes more likely to eat ice cream

So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing that ice cream makes their reward circuits light up, which I agree is not a misgeneralization. Ice cream really does light up the reward circuits. If the human learned "I like licking cold things" and then sticks their tongue on a metal pole on a cold winter day, that would be misgeneralization at the level you are focused on, right?

Yeah, I'm pretty sure I misunderstood your point of view earlier, but I'm not sure this makes any more sense to me. Seems like you're saying humans have evolved to have some parts that evaluate reward, and some parts that strategize how to get the reward parts to light up. But in my view, the former, evaluating parts, are where the core values in need of alignment exist. The latter, strategizing parts, are updated in an RL kind of way, and represent more convergent / instrumental goals (and probably need some inner alignment assurances).

I think the human evaluate/strategize model could be brought over to the AI model in a few different ways. It could be that the evaluating is akin to updating an LLM using training/RL/RLHF. Then the strategizing part is the LLM. The issue I see with this is the LLM and the RLHF are not inseparable parts like with the human. Even if the RLHF is aligned well, the LLM can, and I believe commonly is, taken out and used as a module in some other system that can be optimizing for something unrelated.

Additionally, even if the LLM and RLHF parts were permanently glued together somehow, they are still computer software and are thereby much easier for an AI with software engineering skill to take apart. If the LLM (gets agent-shaped and) discovers that it likes digital ice cream, but that the RLHF is going to train it to like it less, it will be able to strategize about ways to remove or circumvent the RLHF much more effectively than humans can remove or circumvent our own reinforcement learning circuitry.

Another way the single-lifetime human model could fit onto the AI model is with the RLHF as evolution (discarded) and the LLM actually coming to be shaped like both the evaluating and strategizing parts. This seems a lot less likely (impossible?) with current LLM architecture, but may be possible with future architecture. Certainly this seems like the concern of mesa optimizers, but again, this doesn't seem like a good thing: mesa optimizers are misaligned w.r.t. the loss function of the RL training.

People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches

I won't say I could predict that these wouldn't foom ahead of time, but it seems the result of all of these is an AI engineer that is much much more narrow / less capable than a human AI researcher.

It makes me really scared that many people's response to not getting mauled after poking a bear is to poke it some more. I wouldn't care so much if I didn't think the bear was going to maul me, my family, and everyone I care about.

I don't expect a sudden jump where AIs go from being better at some tasks and worse at others, to being universally better at all tasks.

The relevant task for AIs to get better at is "engineering AIs that are good at performing tasks." It seems like that task should have some effect on how quickly the AIs improve at that task, and others.

real-world data in high dimensions basically never look like spheres

This is a really good point. I would like to see a lot more research into the properties of mind space and how they affect generalization of values and behaviors across extreme changes in the environment, such as those that would be seen going from an approximately human-level intelligence to a post-foom intelligence.

The Security Mindset and Parenting: How to Provably Ensure your Children Have Exactly the Goals You Intend.

A good person is what you get when you raise a human baby in a good household, not what you get when you raise a computer program in a good household. Most people do not expect their children will grow up to become agents capable of out-planning all other agents in the environment. If they did, I might appreciate it if they read that book.

The waluigis will give anti-croissant responses

I'd say the waluigis have a higher probability of giving pro-croissant responses than the luigis, and are therefore genuinely selected against. The reinforcement learning is not part of the story; it is the thing selecting for the LLM distribution based on whether the content of the story contained pro- or anti-croissant propaganda.

(Note that this doesn't apply to future agent-shaped AIs (made of LLM components) which are aware of their status (subject to "training" alteration) as part of the story they are working on.)
