Epistemic Status: Wondering about meta-level approaches to reasoning about agent foundations

In mathematics, we can derive general principles that hold for every mathematical object satisfying a particular set of axioms. For example, in first-order logic, the sentence "all apples are made of carbon" is equivalent to the sentence "if something isn't made of carbon, it isn't an apple". Similarly, I believe that agent foundations is an attempt to construct an abstract algebra on top of the algebra that is machine learning. We want to be able to say something about all future ML models by reasoning about the potential compositions of ML agents. This would be like saying, "since AGI-agent 1 is a member of an agent class that follows these axioms, we know that it will only defect if it sees a paperclip".
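
The apple example is an instance of contraposition. Written out in first-order logic, it looks roughly like this, where the predicate names Apple and Carbon are labels introduced here purely for illustration:

$$\forall x\,\bigl(\mathrm{Apple}(x) \rightarrow \mathrm{Carbon}(x)\bigr) \;\Longleftrightarrow\; \forall x\,\bigl(\neg\,\mathrm{Carbon}(x) \rightarrow \neg\,\mathrm{Apple}(x)\bigr)$$

The equivalence holds for any two predicates in classical first-order logic, which is the sense in which it is a general principle: it follows from the logic alone, not from anything specific to apples or carbon.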

Someone pointed out to me that, by not grounding this in current-day machine learning algorithms, we might be making unwarranted assumptions about how an agent acts. The problem with not grounding it is that we might be constructing an abstract algebra whose axioms don't hold for the algebra we're actually trying to study. I thought this was a great question, and I have no idea how to answer it, since I haven't seen any formalisation of what ML approaches are a subset of. Take, for example, John Wentworth's work on agent foundations. Does it generalise to multi-agent systems? Does it generalise to self-supervised algorithms such as current-day transformers? I would love to know if anyone has thought about this.

I also have some other related questions:

Firstly, do you, fellow humans of LessWrong, believe that the abstract algebra framework is helpful for thinking about agent foundations?

Secondly, has anyone attempted such an approach, or does anyone know which modern-day ML algorithms the different parts of agent foundations cover?
