The paper is now published with open access here:
I propose blacklists are less useful if they are about proxy measures, and much more useful if they are about ultimate objectives. Some of the ultimate objectives can also be represented in the form of blacklists. For example, listing many ways to kill a person is less useful. But saying that death or violence is to be avoided, is more useful.
I imagine that the objectives which fulfill the human needs for Power (control over AI), Self-Direction (autonomy, freedom from too much influence from AI), and maybe others, would be partially also working in ensuring that the AI does not start moving towards wireheading. Wireheading would surely be in contradiction to these objectives.If we consider wireheading as a process, not a black and white event, then there are steps along the way. These steps could be potentially detected or even foreseen before the process finishes in a new equilibrium.
A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios? I imagine that an algorithmically more trivial way to make the system both "honest" and "secured" is to make it so heavily secured that almost certainly nobody can access the diamond.
Until the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
You can apply the nonlinear transformation either to the rewards or to the Q values. The aggregation can occur only after transformation. When transformation is applied to Q values then the aggregation takes place quite late in the process - as Ben said, during action selection.Both the approach of transforming the rewards and the approach of transforming the Q values are valid, but have different philosophical interpretations and also have different experimental outcomes to the agent behaviour. I think both approaches need more research.For example, I would say that transforming the rewards instead of Q values is more risk-averse as well as "fair" towards individual timesteps, since it does not average out the negative outcomes across time before exponentiating them. But it also results in slower learning by the agent.
Finally there is a third approach which uses lexicographical ordering between objectives or sets of objectives. Vamplew has done work on this direction. This approach is truly multi-objective in the sense that there is no aggregation at all. Instead the vectors must be compared during RL action selection without aggregation. The downside is that it is unwieldy to have many objectives (or sets of objectives) lexicographically ordered.
I imagine that the lexicographical approach and our continuous nonlinear transformation approaches are complementary. There could be for example two main sets of objectives: one set for alignment objectives, the other set for performance objectives. Inside a set there would be nonlinear transformation and then aggregation applied, but between the sets there would be lexicographical ordering applied. In other words there would be a hierarchy of objectives. By having only two sets in lexicographical ordering the lexicographical ordering does not become unwieldy.
This approach would be a bit analogous to the approach used by constraint programming, though more flexible. The safety objectives would act as a constraint against performance objectives. An approach that is almost in absurd manner missing from classical naive RL, but which is very essential, widely known, and technically developed in practical applications, that is, in constraint programming! In the hybrid approach proposed in the above paragraph the difference from classical constraint programming would be that among the safety objectives there would still be flexibility and ability to trade (in a risk-averse way).Finally, when we say "multi-objective" then it does not just refer to the technical details of the computation. It also stresses the importance of acknowledging the need for researching and making more explicit the inherent presence and even structure of multiple objectives inside any abstract top objective. To encode knowledge in a way that constrains incorrect solutions but not correct solutions. As well as acknowledging the potential existence of even more complex, nonlinear interactions between these multiple objectives. We did not focus on nonlinear interactions between the objectives yet, but these interactions are possibly relevant in the future.I totally agree that in a reasonable agent the objectives or target values / set-points do change, as it is also exemplified by biological systems.Until the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
Yes, maybe the the minimum cost is 3 even without floor or ceiling? But the question is then how to find concrete solutions that can be proven using realistic efforts. I interpret the challenge as request for submission of concrete solutions, not just theoretical ones. Anyway, my finding is below, maybe it can be improved further. And could there be any way to emulate floor or ceiling using the functions permitted in the initial problem formulation?By the way, for me the >! works reliably when entered right in the beginning of the message. After a newline it does not work reliably.
ceil(3!! * sqrt(sqrt(5! / 2 + 2)))
If you would allow ceiling function then I could give you a solution with score 60 for the Puzzle 1. Ceiling or floor functions are cool because they add even more branches to the search, and enable involving irrational number computations too. :P Though you might want to restrict the number of ceiling or floor functions permitted per solution. By the way, please share a hint about how do you enter spoilers here?
Submitting my post for early feedback in order to improve it further:
Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility.
Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. For example, the 100 paperclip scenario is easily solved by the proposed framework, since infinitely rechecking whether exactly 100 paper clips were indeed produced yields to diminishing returns. The formula provides a framework for specifying how we want the agents to simultaneously fulfil or at least trade off between the many different common sense considerations, possibly enabling them to even surpass the relative safety of humans. A comparison with the formula introduced in “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein is included.
It looks like there is so much information on this page that trying to edit the question kills the browser.
An additional idea: Additionally to supporting the configuration of the default behaviours, perhaps the agent should interactively ask for confirmation of shutdown instead of running deterministically?