Research agenda for AI safety and a better civilization

agilecaveman

This agenda collects various ideas that occurred to me over years of reading up on AI progress and AI safety. One big idea is that AI is simultaneously software, applied mathematics, philosophy and ideology. AI being philosophy and ideology means the big issues in theoretical AI each have a corresponding ideological issue outside AI. For example, the question of what utility function we ought to optimize is closely related to questions such as what a “good” society is and how we should measure progress. Questions of why AI is not reliable depend heavily on questions of why software is not reliable, or why we have yet to come up with a process to create bug-free software after all these years of trying. To fully “solve” AGI means fully solving ethics, politics and economics, for starters. This may not sound possible, but “sounding possible” isn’t that important compared to doing what’s actually needed. Good AI research has dual use as good philosophy and deep theoretical understanding of society. So, solving a number of problems in this research agenda would basically lay a new mathematical, conceptual and philosophical foundation for a new way of doing software, reasoning about science and general ways of knowing things.

Now, given that the scope of the problem is large, it’s not hard to come up with some issues that are likely to help progress in AI. However, it’s a bit harder to narrow down the ones which are most likely to be helpful in terms of the resulting final or intermediate product. Many of these seem somewhat under-studied to me, and most of them share a theme of “how can we gain a reliable understanding of this?” rather than “here is what will win a particular benchmark.”

Improving and understanding current implementations.

1. Functional programming / Category theory + AI

The joke is, all dysfunctional AI should be banned. What is dysfunctional? AI not written in functional languages. Jokes aside, it is a travesty of epic proportions that we settled on Python and not Haskell for our AI libraries. Most of what we need to do is lots of data processing, which lends itself to side-effect-free functions. Not to mention, there is a lot of iterative work that happens in trying parts of data science experiments repeatedly, which makes caching ever more important. If you want to tweak a particular experiment, a functional language will allow easier memoization of state.
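
As a minimal sketch of the caching point (written in Python rather than Haskell, with hypothetical function names `featurize` and `run_experiment`), a step whose output depends only on its arguments can be memoized safely, so tweaking a later parameter does not recompute it:

```python
from functools import lru_cache

# Counts real executions so the effect of caching is visible.
# (Bookkeeping for illustration only; the return value of `featurize`
# still depends solely on its arguments.)
calls = {"featurize": 0}

@lru_cache(maxsize=None)
def featurize(rows: tuple, scale: float) -> tuple:
    """Hypothetical pure preprocessing step: rescale every value."""
    calls["featurize"] += 1
    return tuple(x * scale for x in rows)

def run_experiment(rows: tuple, scale: float, threshold: float) -> int:
    """Hypothetical experiment: count features above a threshold."""
    features = featurize(rows, scale)  # reused across tweaks of `threshold`
    return sum(1 for f in features if f > threshold)

data = (1.0, 2.0, 3.0, 4.0)
first = run_experiment(data, 2.0, 3.0)   # computes the features
second = run_experiment(data, 2.0, 5.0)  # tweaked threshold, cached features
```

The inputs are tuples because `lru_cache` needs hashable arguments; in a language with enforced purity the same guarantee would come from the type system rather than from discipline.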

However, aside from the potential efficiency gains, I am very interested in being able to perform automated reasoning over the AI code itself, as well as being able to think about AI architectures and transformations between them in terms of category theory or type theory.

2. Honest assessment of randomization in algorithms, as well as prior information efficiency gains

One of those key ideas in academia that seems to make absolutely no sense in the real world is the love of randomized algorithms. I suspect that nearly all randomized algorithms could be rewritten to be more efficient, or at least equally efficient but more cacheable. For any action selected randomly from an action set, one should be able to come up with a deterministic way of selecting it. In RL game simulations, such as Dota or Go, there was a lot of emphasis on zero-knowledge learning about something (such as a game) by using randomness as the initial condition. As a way of showing off how much hardware you can burn to achieve a task, this is great; however, from the perspective of wanting to run algorithms cheaply, this is not good. “Randomized” is not a complete synonym of “zero-knowledge”. One can have deterministic “zero-knowledge” algorithms, and those are important in trying to establish baselines of how much learning can be done through particular means.
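
A minimal sketch of the simplest case, assuming the random step is uniform exploration over a small action set: the hypothetical `deterministic_explorer` below always picks the least-tried action, so it uses no knowledge of the game but produces a fully reproducible sequence. It illustrates the idea only and says nothing about whether every randomized algorithm can be replaced this way.

```python
from collections import Counter

def deterministic_explorer(actions):
    """Yield actions forever, always picking the least-tried one.

    Ties are broken by position in `actions`, so the sequence is fully
    reproducible and cacheable, unlike repeated calls to random.choice.
    """
    counts = Counter({a: 0 for a in actions})
    while True:
        choice = min(actions, key=lambda a: (counts[a], actions.index(a)))
        counts[choice] += 1
        yield choice

explorer = deterministic_explorer(["left", "right", "wait"])
first_six = [next(explorer) for _ in range(6)]
```

Because every run visits actions in the same order, results from earlier runs can be cached and compared directly.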

What would also be interesting is figuring out experimentally how much knowledge improves the performance of particular algorithms. Would starting from an existing opening database significantly speed up a self-play chess engine? How much? Aside from efficiency gains, the liking of randomness and the lack of assessment of “knowledge efficiency” seem like a blind spot of today’s research. I suspect there is a dislike of using “human knowledge” in parts of the AI industry due to a perception of its fallibility, while it’s worth focusing on the expected value of human knowledge rather than possible bugs.

3. Information theoretic understanding of current implementations

We can ask what “a process that takes data and produces an AI” is actually doing. In short, it takes data and outputs models. Given other data, it outputs other models. Ideally, what the process is doing is compressing the “essential” parts of the data into the model. The essential parts are hopefully compressed in such a way that other data, which contains the same essential parts but different noise, can be recognized by the model. In this view, overfitting is compressing “too much” by memorizing noise, and underfitting is compressing “too little” of the essential stuff.

However, in this view, the compression characteristics of data become interesting. Does operating on compressed data significantly alter algorithm performance? Can we point to specific sub-sections of the model and view their correlation with key repeating factors of the data? How do the compression characteristics of AI compare with those of deterministic algorithms? This research is probably easiest to do on auto-encoders first. For example, at what point do floating point precision issues start to matter, in that the number of bits available to compress things is too low? If we use various AIs as lossy compression algorithms, what does the graph of % compression vs % accuracy look like?
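
As a toy version of the compression-vs-accuracy curve, the sketch below uses plain uniform quantization instead of an auto-encoder (an assumption made to keep the example self-contained) and records the reconstruction error at several bit budgets. The names `quantize`, `mean_abs_error` and `curve` are illustrative.

```python
def quantize(values, bits):
    """Uniformly quantize values in [0, 1] to `bits` bits and decode them."""
    levels = (1 << bits) - 1
    return [round(v * levels) / levels for v in values]

def mean_abs_error(original, decoded):
    """Average absolute reconstruction error."""
    return sum(abs(a - b) for a, b in zip(original, decoded)) / len(original)

# Deterministic toy "data": 101 evenly spaced points in [0, 1].
data = [i / 100 for i in range(101)]

# Each entry is (bits per value, mean reconstruction error).
curve = [(bits, mean_abs_error(data, quantize(data, bits))) for bits in (1, 2, 4, 8)]
```

Plotting `curve` gives the kind of loss graph asked about above; for a learned model, the bit budget would be the size of the bottleneck or the weight precision rather than a quantization level.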

Questions from economics – towards value aggregation.

4. Theoretically rigorous notion of metric comparisons

One of the many complaints about the modern world is that it is too “metric focused.” There is, of course, truth in this statement. The underlying complaint is that particular forms of optimization that are focused, say, on money are ignoring important aspects of reality, such as social cohesion. Implicitly we have the idea that there is one single “invisible metric” that we can optimize towards and that other metrics are an “approximation to it.” So, neither GDP, nor the stock market, nor life expectancy should be the definition of the goodness of our society. Rather, some of these metrics should be used as guides and not over-indexed on. This is somewhat controversial and attackable both by people who want to have nothing to do with GDP and by people who think it must be the holy grail of society.

However, underlying these debates is a fascinating theoretical gap around a simple problem: given two metrics A and B on world states, what is the “similarity” between them? Or, is metric C more similar to A or to B? As far as I can tell, those questions are not even theoretically well defined, let alone explored in some reinforcement learning scenarios. What I would love to see is both a theoretical understanding of what it means for two metrics on world states to be similar and any mathematical structures (fields, categories, etc.) that “world metrics” can form based on similarity measures. This can get very tricky depending on how one deals with infinities or how one actually distinguishes between measures (world-state permutation based, scaling-based dot products, weight on relevant states, etc.).
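
One candidate definition, sketched under the strong assumption of a small finite set of world states, is rank agreement: two metrics are similar to the extent that they order pairs of states the same way. The function `ordering_agreement` and the example states are hypothetical; this is a Kendall-tau-style score, one option among the several listed above, not a proposed answer.

```python
from itertools import combinations

def ordering_agreement(metric_a, metric_b, states):
    """Kendall-tau-style score in [-1, 1] over a finite set of world states.

    +1 means the two metrics rank every pair of states the same way,
    -1 means they rank every pair in opposite ways. Ties contribute 0.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(combinations(states, 2))
    total = sum(
        sign(metric_a(s) - metric_a(t)) * sign(metric_b(s) - metric_b(t))
        for s, t in pairs
    )
    return total / len(pairs)

# Hypothetical world states: (income, leisure_hours) tuples.
states = [(10, 8), (20, 6), (30, 4), (40, 2)]

def income(s):
    return s[0]

def scaled_income(s):
    return 3 * s[0] + 1  # positive affine transform of income

def leisure(s):
    return s[1]

same = ordering_agreement(income, scaled_income, states)
opposite = ordering_agreement(income, leisure, states)
```

A rank-based score is blind to scale by construction, so it treats `income` and `scaled_income` as identical; whether that is desirable is exactly the kind of question a theory of metric similarity would need to settle.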

5. Better belief and choice aggregation mechanisms

Ideally an AGI does some sort of aggregation of people’s “values.” Before getting to the harder question, it’s worth exploring the very simplified sub-problems of value aggregation: belief and choice aggregation.

Belief aggregation is the question: given several people with beliefs about an event X, what is the best estimate of X actually happening? The inputs to the model include, say, estimates of the probability of X happening and the strength of each person’s belief. Yes, a simple weighted model can do some tricks, and setting up a prediction market or a prediction competition can be an aggregation mechanism. However, a full comparison of aggregation mechanisms and what they accomplish in the long term has not been done. There is also a more pragmatic question of “why have prediction markets not taken off yet,” which has a simple answer of “it’s not actually fully rational to bet against people who might have better data than you.” However, if not prediction markets as stated, then what?
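
To make the “simple weighted model” concrete, here is a sketch of two standard pooling rules, a weighted average of probabilities and a weighted average in log-odds space. The weights stand in for “strength of belief” and, like the example numbers, are assumptions for illustration.

```python
import math

def linear_pool(probs, weights):
    """Weighted arithmetic mean of probabilities."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total

def log_odds_pool(probs, weights):
    """Weighted mean in log-odds space, mapped back to a probability.

    Requires every probability to be strictly between 0 and 1.
    """
    total = sum(weights)
    mean_logit = sum(
        w * math.log(p / (1 - p)) for p, w in zip(probs, weights)
    ) / total
    return 1 / (1 + math.exp(-mean_logit))

beliefs = [0.9, 0.6, 0.5]    # each person's probability that X happens
strengths = [2.0, 1.0, 1.0]  # hypothetical strength-of-belief weights

linear = linear_pool(beliefs, strengths)
logodds = log_odds_pool(beliefs, strengths)
```

The two rules already disagree on this tiny input, which is a small instance of why a systematic comparison of aggregation mechanisms would be useful.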

In addition to belief aggregation there is also choice aggregation: the question of how, given people’s preferences over certain options, to come to a collective choice. We live in a democratic society, so choice aggregation is a very political issue. Much ink has been spilled in arguments for and against various voting systems, trying to understand or get around the practical implications of “Arrow’s theorem”. An AGI, as an aggregator of information from people, will be solving these problems either implicitly or explicitly.
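
A small worked example of why the choice of voting system matters: with the hypothetical ballots below, plurality and Borda count pick different winners from the same preferences. This does not demonstrate Arrow’s theorem itself, only that reasonable-looking rules can disagree.

```python
from collections import Counter

def plurality_winner(ballots):
    """Winner by first-place votes only."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]

def borda_winner(ballots):
    """Winner by Borda count: n-1 points for first place, down to 0 for last."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for rank, option in enumerate(ballot):
            scores[option] += n - 1 - rank
    return scores.most_common(1)[0][0]

# Hypothetical electorate: each ballot ranks options from best to worst.
ballots = (
    [("A", "B", "C")] * 4
    + [("B", "C", "A")] * 3
    + [("C", "B", "A")] * 2
)

plurality = plurality_winner(ballots)  # A has the most first-place votes
borda = borda_winner(ballots)          # B is ranked highly by everyone
```

Neither function handles ties in a principled way; a real comparison would have to specify tie-breaking as part of the mechanism.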

There is a meme implicit in AI discourse that we solve AI first, and then use AI to solve everything else. I suspect that this is fairly backwards: in a situation where “AI has solved this problem differently from humans,” we would have difficulty really distinguishing between “AI is wrong due to bugs” and “AI is right and humans are wrong.” Implicitly, those questions will be decided in a non-systematic manner. We have already seen debates between pro players of a particular game and the developers of an AI which beat them. While it’s tempting to conclude that

6. Look at how competing notions of value are implemented in reality

Belief and choice aggregation are sub-problems of a more complex “value aggregation.” Even if there is a solution to an individual’s value aggregation, societal value aggregation has non-trivial problems. There are a number of objections I have heard to tackling this problem. Some wave this away as being trivial through some sort of “utilitarian” averaging; some deny that “values” are even a thing or that humans are “rational” enough to be trusted, which is yet another relatively incoherent meme. Others might deny that this is actually “possible”. I suspect that, like most things, this is possible yet complex, from the simple fact that somehow people do manage to co-exist in a society in a way that aggregates some of their values, and we have structural mechanisms for doing so.

This could be market mechanisms for allocating resources, argumentation mechanisms for allocating legible descriptions of value, credentialing mechanisms, people moving to places where they feel valued, etc. All of these mechanisms have pros and cons regarding the overhead of value aggregation as well as potentially negative feedback loops. Studying those in a theoretical framework can vastly improve our understanding of what actions a society ought to encourage its members to take to express contentment and discontentment with a situation, as well as allow an AGI to correctly parse those actions should we try to unify those mechanisms into a single one.

I suspect models of “debate” are important to understand, but to base “ethics” on debate about ethics is a little bit like confusing eating the recipe book with making the recipe.

Generalized Game theory + AI

GANs are a cool idea and I suspect the area of intersection between game theory and current AI will bear a lot of interesting fruit.

7. Various decision theories + AI or other decision making processes

It seems to me that decision theory insights are only very slowly being integrated into modern AI. Decision theory started with Evidential, moved on to Causal and then iterated on that with Timeless / Updateless / Functional, which are all variations on the themes of considering counterfactuals through logical connections and strategy selection. Right now, current AI implementations do none of that and are basically implementing evidential decision theory, and this might already be causing problems in places like the justice system. There are a lot of ideas around the question of “how to make AIs consider causality,” which feels important, but is also similar in principle to moving from EDT to CDT. Few people are considering how to merge UDT / TDT / FDT with current implementations.
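
For readers who want the EDT / CDT difference in numbers, the sketch below uses Newcomb’s problem, the standard textbook illustration (the problem and payoffs are brought in for the example and are not part of the agenda above). EDT conditions on the action as evidence about the prediction; CDT treats the box contents as already fixed.

```python
def edt_values(accuracy, big=1_000_000, small=1_000):
    """Expected payoffs when the action is treated as evidence of the prediction."""
    one_box = accuracy * big
    two_box = (1 - accuracy) * big + small
    return one_box, two_box

def cdt_values(prob_big_box_full, big=1_000_000, small=1_000):
    """Expected payoffs when the box contents are held fixed across actions."""
    one_box = prob_big_box_full * big
    two_box = prob_big_box_full * big + small
    return one_box, two_box

edt_one, edt_two = edt_values(accuracy=0.99)
cdt_one, cdt_two = cdt_values(prob_big_box_full=0.5)

edt_choice = "one-box" if edt_one > edt_two else "two-box"
cdt_choice = "one-box" if cdt_one > cdt_two else "two-box"
```

Under CDT, two-boxing comes out ahead by `small` whatever probability is plugged in, which is why the two theories part ways here. The policy-level reasoning of UDT / TDT / FDT is not captured by either function.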

8. Inverse reinforcement learning under game theoretic considerations.

Inverse reinforcement learning, or IRL, is a neat idea in which an agent tries to learn the function that another agent is optimizing, potentially for the sake of mimicking its behavior. This, I think, is a very fruitful area of research; however, both philosophically and mathematically it’s important to understand how this interacts with game theoretic notions. For example, say an AI observing a person sees him fulfil a contractual obligation, even in the absence of enforcement. Paying for a good that was already delivered in a one-time deal might seem to make the person worse off. However, the IRL agent should not conclude that the person fails to like money or is irrational. Rather, it should be able to reason that the person has already created a self-image of the kind of person who fulfills contractual obligations, or the kind of person who cooperates with others through particular traditions or rituals. In other words, people might deviate from “utility maximizer” theory in an extremely rational way. Combining this all on the mathematical level is a fascinating problem.

9. Zero-sum game identification process.

One of the key issues that is fairly glossed over due to its inherent philosophical complexity is the extent to which things we identify as “values” are “other-distinguishing” characteristics. In other words, to what extent do we adopt “values” to raise ourselves above the other? Especially in modern America, where so many things have become political, it’s no surprise that the notion of which “values” are correct to hold seems to keep changing, as well as being explicitly defined as “things that the other does not believe.” However, even without the modern issues, this is a fully general problem of the human condition.

What this means on a mathematical level is the question of being able to create a classifier that identifies when people are playing zero-sum vs positive-sum games. While this kind of question is somewhat far off in terms of being needed for AGI, it has massive implications for understanding how what is frequently seen as “valuable” can itself be a not-actually-valuable one-upmanship process. There are ultimately positive-sum one-upmanship processes as well. Considering all of this, even in an economic context, could lead to a theoretical understanding of the economy that better distinguishes between positional goods and ultimate goods.
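
The easy half of such a classifier can be sketched directly: if the payoff table is already known, a two-player game is zero-sum (up to a constant shift) exactly when the joint payoff is the same in every outcome. The function `classify_game` and both example games are illustrative; the hard part, inferring the payoffs from observed behavior, is not attempted here.

```python
def classify_game(payoffs, tol=1e-9):
    """Label a two-player game from its payoff table.

    `payoffs` maps (row_action, col_action) -> (row_payoff, col_payoff).
    """
    totals = [a + b for a, b in payoffs.values()]
    if max(totals) - min(totals) <= tol:
        return "constant-sum"  # zero-sum up to a shift of payoffs
    return "variable-sum"      # joint payoff depends on what the players do

# Matching pennies: one player's gain is exactly the other's loss.
matching_pennies = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}

# Prisoner's dilemma: mutual cooperation creates more total value.
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

pennies_label = classify_game(matching_pennies)
dilemma_label = classify_game(prisoners_dilemma)
```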

In some sense this problem already exists in the fight over each other’s limited attention on social media. At the very least, this would bring a lot of philosophical clarity to the ideas such as inverse reinforcement learning, revealed preferences or any other proposal that wishes to “learn from human values.”

Computational philosophy

10. Implementation of various philosophical theories.

It seems that we are far from truly realizing the potential of computers as “bicycles of the mind”. While many popular apps extend one’s mind in terms of recording thoughts or tracking social interactions, there is certainly a lack of computational help for more philosophical tasks.

There have been many philosophical theories of human action and the likely corresponding societal outcomes. Economics is merely one such subset of theories; the list includes many others, such as the Girardian theory of mimetic desire. Imagine that we were able to model mimetic desire algorithmically and then experimentally verify under what assumptions Girardian conclusions hold. This can be done for various philosophies. If, however, there are very strong issues in implementing an idea, that creates a set of interesting questions about whether the particular philosophy is false or misunderstood.

I suspect that we are going to implicitly re-discover philosophical theory as we try to make the AI smarter, which is ok, but ignores important previous work in philosophy. For example, I suspect that future iterations of GPT will reinvent the wheel on Kegan’s theory of adult development. Having this understanding before going about the question of “how do we improve AI now” can greatly speed up development, while also creating better clarity on what exactly intelligence is.

It’s not obvious to me that even the more “economic” models of action have been modelled coherently. If there is an economic model which assumes some sort of “rational action” from people and an economic model which assumes “signaling” from people, could we see a simulation of one vs the other and see how each corresponds to reality? Further clarity on this point can drive a better understanding of actual human error. However, given that this approach somewhat runs counter to the prevailing “original sin”-style ideology of human error, this will probably be impossible to do in public.

The general point here is that a lot of cognitive work has been done through the years by philosophers, and while their writings don’t obviously translate into formal theories of reality, it is likely that they can.

11. Simulation of single-polar vs multi-polar scenarios

One of the debates about the future of AI centers on single-polar vs multi-polar scenarios. Two different analogies come up, in my characterization: “AI as a state” vs “AI as a firm.” In these, there is a “reference class tennis” issue of asking why we would expect a single AI in particular to rule over the world, compared to a more distributed system controlled by multiple stakeholders. Both scenarios have various pros and cons and imply different strategies.

This debate would probably be helped by simulation of economic games. What we need is a “generalized theory of the firm”: a general theory of which types of games are prone to multi-polar vs single-polar scenarios. What are the more generic “economies and dis-economies of scale” that result in industries having a winner-take-all mentality? Do single agents win over multiple agents unified by a systematic scenario (such as a market mechanism)? With AI, this situation is blurry because any particular “market optimization” could simply be integrated as part of the AI itself as a sub-algorithm. But still, the generic questions of influence over events and of how easily ideas replicate are very interesting.

This “generalized theory of the firm” would have the current theory of the firm as its special case, as well as being able to explain key factors in unity / dis-unity of countries as well as being able to accurately double crux whether we expect single or multi-polar scenarios in AI.

Rationality and Irrationality

12. Precise theories and implementations of epistemology modelling

This is probably the trickiest item on the list and it’s based on this article by Chapman: https://meaningness.com/probability-and-logic. This is also the one I feel most “hand-wavy” about, in that I am not sure whether this is possible or necessary.

Right now our society has a deep epistemological commitment to “science” or “empiricism” or “things are true because the data says so.” However, this is easily hacked with fake data or a lack of correct statistics. Misunderstanding of drug studies and their subsequent politicization has shown this problem quite acutely. In Chapman’s terms (and also in many other critiques of modern science) people lack the “rationality” or logical reasoning to understand the actual math behind things like RCTs. However, this practical problem has deep theoretical roots in that probability and logic are not actually unified within a single field of math. MIRI has tried to do some research into unifying those, which is a start; however, from what I have seen it lacks finality or even potential applicability to a better foundation of science.

Not only do we lack a full unification of probability and logic, we also lack mathematical foundations for other parts of epistemology. While it may seem strange to demand an entire new field of mathematics that would concern itself with “tradition” and that is as developed as probability and logic are individually, my feeling is that this could be an interesting mental exercise.

Otherwise what we have instead is shoving “intuition” and perhaps “tradition” into a Bayesian mindset of “priors,” but this feels incomplete. What I would love is both a theory of “logic + probability” in one, but that can also somehow model people’s other ways of knowing without necessarily needing to move everything through the framework of belief probabilities. This alone would be a new foundation of scientific reasoning that can avoid many logical pitfalls that modern “where is your RCT evidence for this” epistemology suffers from.

13. An actual mathematical theory of pervasive error

The idea that we cannot use human data or human values because they are “biased” is both anti-humanist and lazy mathematics. To allege bias usually implies knowing that some theory of reasoning is correct and that a person systematically deviates from that theory. Now “modern Less Wrong rationality” alleged at some point to have found a theory of truth: Bayesianism. This was a good attempt but fell short for many practical and philosophical reasons, one being a misunderstanding of the socially adversarial nature of both belief and value formulation.

A theory of “error” complementary to a set of theories about what rational action means would account for questions such as “is signaling rational and in what circumstances?” or “what social structures encourage or discourage error?”, with answers that are derived from first principles and also verified in reality.

More meta

14. Concept-reality mapping issue as well as model-reality mapping.

One of the many frustrating memes in AI is equating intelligence with conversational agents. This is more of a popular meme, but it occasionally re-surfaces due to modern moral philosophy’s confusion of “statements about ethics” with “ethics.” What I generally expect (and GPT-3 is a good example of this) is that having convincing writing from an AI is going to be very doable very soon; however, the problem is not “what words to say,” but “do these words map to reality in an accurate fashion.” This problem can be solved by embedding object recognition and word generation in the same modality (although this has its own issues). However, similar ideas would deeply struggle with concepts such as “freedom.” This fight over instantiations of words is political and thus there may not actually be a right answer.

GPT-3 is an interesting example here because it looks like a step towards some concept-reality mapping in text generation cases, such as creating CSS from a description of it. There are, of course, several questions here. How far along are we in the process of having the AI say something like “I think people are using such and such word too differently from its previous usage and this is causing social tension” or, in other words, noticing that the implied concept-reality map has been shifting over time?

However, if we cannot accurately map words to reality, some other mapping would have to take its place, which is likely “model-reality mapping”: given a computational model, how closely does it match reality? This is in some ways “meta-science,” or the question of what exact evidence / prediction matches we need before we can say that a model “describes” reality.

I would be somewhat surprised if a statistical AI can look at mathematical theories / equations and correctly deduce from observation that those theories “describe” reality. I could be wrong, of course, but it does not seem like the type of computation that is taking place, and it is unclear how to integrate that into modern approaches.

I suspect that solving word-reality mapping and not solving model-reality mapping is going to be somewhat unsafe, as this can create weird feedback loops where society’s words and statements can “describe” reality, but lack any predictive power due to not being backed up by a logical model.

15. Universal notion of “prior information”

When a person wishes to create an AI for a simple prediction task, such as predicting revenue from a number of ads, they would generally collect a bunch of data and run a model on it. However, if this somehow failed to reach the necessary accuracy, they would iterate on it, perhaps by handcrafting some features, trying different configs or re-shaping the network architecture to fit the problem better. One example of a network architecture that supports a particular kind of problem solving is CNNs for image recognition.

But to classify the very process of training such an AI, we have a notion both of “data used to train it,” as well as some component of “meta-training-data”, such as what worked and what didn’t on similar problems, what network architecture can take advantage of the invariants in the data or other optimization that the person proposes through analyzing the problem. This meta-data is frequently obtained through trial and error, reading papers, actual math, however all this is in many ways an unstructured input to a more structured problem of creating a model given training data.

What seems possible to me is coherent aggregation of “meta-training-data” into the training process with very systematic human involvement. So, for example, the person says that the image data is translation-invariant, and the trainer comes up with CNNs as one instance of a general way to embed this fact into the problem.
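
To show what “embedding the fact into the problem” means for the CNN case, the sketch below checks in one dimension that a layer with shared weights commutes with shifts: shifting the input and then convolving gives the same result as convolving and then shifting. The circular boundary and the helper names are assumptions chosen to keep the check exact.

```python
def circular_conv(signal, kernel):
    """1-D circular convolution (cross-correlation form) with shared weights."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def shift(signal, k):
    """Cyclically shift a signal k positions to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

signal = [0, 1, 3, 0, 0, 2, 0, 0]
kernel = [1, -1, 2]

shift_then_conv = circular_conv(shift(signal, 3), kernel)
conv_then_shift = shift(circular_conv(signal, kernel), 3)

# A global max over the feature map is unchanged by the shift.
pooled_original = max(circular_conv(signal, kernel))
pooled_shifted = max(shift_then_conv)
```

Weight sharing is what buys this property, so the prior “the data is translation-invariant” translates directly into an architectural constraint rather than something the model has to learn from examples.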

I envision a much cleaner interface in the future, where the person has to specify all prior knowledge they have about the problem and there is a deterministic way to go from prior knowledge to model architecture which becomes more accurate or faster the better the prior knowledge is. This could even be true in RL, where pro-gamers could “guide” the AI during the training.

16. Trust questions with regards to AI

The generalized trust question is: how much of an algorithm’s output do you need to see before concluding that its performance is likely to be correct in general? Obviously with arbitrarily sized algorithms this is impossible, but let’s say you are given access to the source code and a bound on runtime, but lack the ability to truly comprehend what is happening inside the algorithm: what kinds of outputs does one need to see before deciding whether things are working correctly? This is already an issue with research in general, where it is hard to tell whether even a well-defined task (such as winning games) is producing strange behavior because of bugs or because people don’t understand what’s going on. This is only going to get more complex as time goes on, and there are going to be social disagreements about the correct behavior.
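
Under strong assumptions, the question has a textbook statistical answer. If test inputs are drawn independently from the same distribution the algorithm will face in use, and each output can be checked, then an algorithm with failure rate at least ε passes n tests in a row with probability at most (1 − ε)^n. The hypothetical helper below solves for the smallest n at a chosen confidence level; it says nothing about adversarial or out-of-distribution inputs, or about cases where people disagree on what “correct” means.

```python
import math

def clean_runs_needed(max_error_rate, confidence):
    """Smallest n with (1 - max_error_rate)**n <= 1 - confidence.

    Seeing n consecutive correct outputs on i.i.d. inputs then rules out
    a failure rate above `max_error_rate` at the given confidence level.
    """
    delta = 1 - confidence
    return math.ceil(math.log(delta) / math.log(1 - max_error_rate))

# Runs needed to claim "fails on at most 1% of inputs" with 95% confidence.
n = clean_runs_needed(max_error_rate=0.01, confidence=0.95)
```

The required number of runs grows roughly like 1/ε, so rare failure modes are expensive to rule out by observation alone, which is one reason access to the source code matters.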

17. Generalized tackling of “morality” in a mathematical sense

This probably warrants its own sub-post, but a tackling of morality in math / code is both extremely important and probably evokes a very yucky response from some people. Some people have a notion that “morality / ethics” is just highly evolved game theory. While I am not sure this is fully true, advanced game / decision theory is certainly a large component of it.

In addition to that it would be good to get:

a) General account of why people differ from “rational choice” and how much of this difference is attempts to cooperate (successful or not)
b) General qualia theory of pain and pleasure, which a lot of high order morality depends on
c) Actual empirical studies of moral language and whether particular statements cause more or less cooperation and on what time scales. This would likely include religious language, as much as this can seem strange to our modern secular world view.

In conclusion, there are a few themes that emerge. As I mentioned, “production of reliable knowledge” is important, as is “how do things actually work now, either in a society or in the human mind,” with a special emphasis on a theoretical understanding of what current mechanisms underlie our civilization and how they can be modeled. The hope that an AI can do all of this for us is an appealing idea, yet in the absence of reliable debugging of key philosophical disagreements, this remains just a hope.

At the end of the day, if we wish to build a civilization that will stand the test of time, we as a society will need a better, formal, gears-level understanding of how everything works, with or without AI.

Gordon Seidoh Worley (6y)

Seems like you've spent a lot of time thinking about AI safety! I think it'd be valuable if you shared links or listed references to things you read that helped you develop your thoughts, since that would let people trace back and connect your writing to the broader literature. As it stands I mostly have to guess to what extent you are thinking about various things within the context of the existing literature vs. coming at the ideas fresh with less engagement with existing writing on particular topics.

agilecaveman (6y)

It's Pasha Kamyshev, btw :) Main engagement is through

1. reading MIRI papers, especially the older agent foundations agenda papers

2. following the flashy developments in AI, such as Dota / Go RL and being somewhat skeptical of the "random play" part of the whole thing (other things are indeed impressive)

3. Various math textbooks: Category Theory for Programmers, Probability Theory: The Logic of Science, and others

4. Trying to implement certain theory in code (quantilizers, different prediction market mechanisms)

5. Statistics investigations into various claims of "algorithmic bias"

6. Conversations with various people in the community on the topic