All of Hjalmar_Wijk's Comments + Replies

Tabooing 'Agent' for Prosaic Alignment

Strongly agree with this; it seems very important.

Tabooing 'Agent' for Prosaic Alignment

These sorts of problems are what caused me to want a presentation which didn't assume well-defined agents and boundaries in the ontology, but I'm not sure how it applies to the above. I am not looking for optimization as a behavioral pattern but as a concrete type of computation, one which involves storing world-models and goals and doing active search for actions which further the goals. Neither a thermostat nor the world outside seems to do this, from what I can see. I think I'm likely missing your point.
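
To make the contrast concrete, here is a minimal sketch (mine, not from the post; all names and the example world are illustrative) of optimization in this sense: a stored world-model, a stored goal, and an active search over actions, contrasted with a fixed thermostat reflex:

```python
# Minimal sketch of "optimization as a concrete computation": an explicit
# world-model, an explicit goal, and an active search over actions.
# All names here (search_action, etc.) are illustrative.

from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def search_action(
    state: State,
    actions: Iterable[Action],
    world_model: Callable[[State, Action], State],  # predicts the next state
    goal: Callable[[State], float],                 # scores how good a state is
) -> Action:
    """Pick the action whose predicted outcome best furthers the goal."""
    return max(actions, key=lambda a: goal(world_model(state, a)))

# Example: a 1-D world where the goal is to reach position 3.
best = search_action(
    state=0,
    actions=[-1, 0, 1],
    world_model=lambda s, a: s + a,
    goal=lambda s: -abs(s - 3),
)
assert best == 1

# A thermostat, by contrast, is a fixed reflex: it maps a sensor reading
# to an output with no stored model and no search over candidate actions.
def thermostat(temperature: float, setpoint: float = 20.0) -> str:
    return "heat" if temperature < setpoint else "off"
```
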

romeostevensit (2y, 2 points): Motivating example: consider a primitive bacterium with a thermostat, a light sensor, and a salinity detector, each of which has a functional mapping to some movement pattern. You could say this system has a 3-dimensional map and appears to search over it.
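
A small sketch of the reply's point (the thresholds and behavior names are invented for illustration): a bundle of fixed sensor-to-behavior reflexes is extensionally equivalent to a lookup over a 3-dimensional map of the environment, which from the outside can be described as search over that map:

```python
# Three fixed reflex mappings, one per sensor; their bundle is equivalent
# to tabulating one behavior per cell of a 3-D (temp, light, salinity) grid.

from itertools import product

def movement(temp: float, light: float, salinity: float) -> str:
    heat = "seek-warmth" if temp < 0.5 else "idle"
    photo = "seek-light" if light > 0.5 else "idle"
    osmo = "flee-salt" if salinity > 0.5 else "idle"
    return f"{heat}|{photo}|{osmo}"

# The same behavior as an explicit 3-D map, one entry per grid cell:
grid = [0.0, 1.0]
three_d_map = {
    (t, l, s): movement(t, l, s) for t, l, s in product(grid, repeat=3)
}
behavior = three_d_map[(0.0, 1.0, 1.0)]  # lookup plays the role of "search"
```
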
Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces

Theron Pummer has written about precisely this in his paper on Spectrum Arguments, where he touches on this argument for "transitivity ⇒ comparability" (there notably used as an argument against transitivity rather than as an argument for comparability) and its relation to 'Sorites arguments' such as the one about heaps of sand.

Personally I find the spectrum arguments fairly convincing as a case for comparability, but I think there's a wide range of possible positions here and it's not entirely obvious which are actually inconsistent. Pummer

... (read more)
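
For readers unfamiliar with the non-Archimedean picture the post's title refers to, here is a minimal sketch (mine, not from the post; the two-tier structure is illustrative) of a lexicographic utility, under which no number of dust specks ever outweighs torture:

```python
# A non-Archimedean utility: values live in tiers, and comparison is
# lexicographic, so the higher tier dominates and the lower tier only
# breaks ties. No finite multiple of `trivial` ever reaches `severe`.

from typing import NamedTuple

class Utility(NamedTuple):
    severe: float   # higher-tier disutility (e.g. torture)
    trivial: float  # lower-tier disutility (e.g. dust specks)

def worse_than(a: Utility, b: Utility) -> bool:
    return (a.severe, a.trivial) < (b.severe, b.trivial)

torture = Utility(severe=-1.0, trivial=0.0)
many_specks = Utility(severe=0.0, trivial=-3**100)
assert worse_than(torture, many_specks)  # worse regardless of speck count
```

The spectrum arguments the comment endorses push against exactly this structure: a chain of pairwise-comparable intermediate outcomes connects the two tiers, and transitivity then forces comparability across them.
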
Towards a mechanistic understanding of corrigibility

Understanding the internal mechanics of corrigibility seems very important, and I think this post helped me get a more fine-grained understanding and vocabulary for it.

I've historically strongly preferred the type of corrigibility which comes from pointing to the goal and letting it be corrigible for instrumental reasons, I think largely because it seems very elegant and that when it works many good properties seem to pop out 'for free'. For instance, the agent is motivated to improve communication methods, avoid coercion, tile properly and even possibly i

... (read more)
abramdemski (2y, 7 points): The 'type of corrigibility' you are referring to isn't corrigibility at all; rather, it's alignment. Indeed, the term corrigibility was coined to contrast with this, motivated by the fragility of this approach to getting the pointer right. I tend to agree. I'm hoping that thinking about myopia and related issues [https://www.lesswrong.com/posts/4hdHto3uHejhY2F3Q/partial-agency] could help me understand more natural notions of corrigibility.
Computational Model: Causal Diagrams with Symmetry

I really like this model of computation and how naturally it deals with counterfactuals; I'm surprised it isn't talked about more often.

This raises the issue of abstraction - the core problem of embedded agency.

I'd like to understand this claim better - are you saying that the core problem of embedded agency is relating high-level agent models (represented as causal diagrams) to low-level physics models (also represented as causal diagrams)?

johnswentworth (2y, 8 points): I'm saying that the core problem of embedded agency is relating a high-level, abstract map to the low-level territory it represents. How can we characterize the map-territory relationship based on our knowledge of the map-generating process, and what properties does the map-generating process need to have in order to produce "accurate" & useful maps? How do queries on the map correspond to queries on the territory? Exactly what information is kept and what information is thrown out when the map is smaller than the territory? Good answers to these questions would likely solve the usual reflection/diagonalization problems, and also explain when and why world-models are needed for effective goal-seeking behavior. When I think about how to formalize these sorts of questions in a way useful for embedded agency, the minimum requirements are something like:

* need to represent the physics of the underlying world
* need to represent the cause-and-effect process which generates a map from a territory
* need to run counterfactual queries on the map (e.g. in order to do planning)
* need to represent sufficiently general computations to make agenty things possible

... and causal diagrams with symmetry seem like the natural class to capture all that.
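
As a toy illustration of the "counterfactual queries" requirement (a sketch of my own, not johnswentworth's formalism; it omits the symmetry/recursion machinery, and all node names are invented): represent the diagram as nodes that are functions of their parents, and implement an intervention by overriding a node's mechanism before re-evaluating:

```python
# A tiny causal diagram: setting -> heater -> temp. Each node is a
# function of its parents; a do-style intervention pins a node's value.

from typing import Callable, Dict, List, Optional, Tuple

Mechanism = Tuple[List[str], Callable[..., float]]  # (parent names, function)

diagram: Dict[str, Mechanism] = {
    "setting": ([], lambda: 20.0),
    "heater": (["setting"], lambda s: 1.0 if s > 18.0 else 0.0),
    "temp": (["heater"], lambda h: 15.0 + 10.0 * h),
}

def evaluate(diagram: Dict[str, Mechanism],
             do: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Evaluate nodes in insertion (topological) order, honoring interventions."""
    do = do or {}
    values: Dict[str, float] = {}
    for node, (parents, f) in diagram.items():
        values[node] = do[node] if node in do else f(*(values[p] for p in parents))
    return values

print(evaluate(diagram)["temp"])                      # factual: 25.0
print(evaluate(diagram, do={"heater": 0.0})["temp"])  # counterfactual: 15.0
```
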
Tabooing 'Agent' for Prosaic Alignment

I wonder if you can extend it to also explain non-agentic approaches to Prosaic AI Alignment (and why some people prefer those).

I'm quite confused about what a non-agentic approach actually looks like, and I agree that extending this to give a proper account would be really interesting. A possible argument for actively avoiding 'agentic' models from this framework is:

  1. Models which generalize very competently also seem more likely to have malign failures, so we might want to avoid them.
  2. If we believe this, then things which generalize very competently are l
... (read more)
flodorner (2y, 3 points): In light of this exchange, it seems like it would be interesting to analyze how much arguments for problematic properties of superintelligent utility-maximizing agents (like instrumental convergence) actually generalize to more general well-generalizing systems.
romeostevensit (2y, 5 points): (black) hat tip to johnswentworth for the notion that the choice of boundary for the agent is arbitrary, in the sense that you can think of a thermostat as optimizing the environment or of the environment as optimizing the thermostat. This collapses the sensor/control duality for at least some types of feedback circuits.