Avoiding Side Effects in Complex Environments

Makes sense, I was thinking about rewards as function of the next state rather than the current one. 

I can stil imagine that things will still work if we replace the difference in Q-values by the difference in the values of the autoencoded next state. If that was true, this would a) affect my interpretation of the results and b) potentially make it easier to answer your open questions by providing a simplified version of the problem. 


Edit: I guess the "Chaos unfolds over time" property of the safelife environment makes it unlikely that this would work? 

Avoiding Side Effects in Complex Environments

I'm curious whether AUP or the autencoder/random projection does more work here. Did you test how well AUP and AUP_proj with a discount factor of 0 for the AUP Q-functions do? 

Machine learning could be fundamentally unexplainable

"So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability?" 

You could think of explainability as some form of regularization to reduce overfitting (to the test set). 

[AN #128]: Prioritizing research on AI existential safety based on its application to governance demands

"Overall, access to the AI strongly improved the subjects' accuracy from below 50% to around 70%, which was further boosted to a value slightly below the AI's accuracy of 75% when users also saw explanations. "

But this seems to be a function of the AI system's actual performance, the human's expectations of said performance, as well as the human's baseline performance. So I'd expect it to vary a lot between tasks and with different systems. 

Nuclear war is unlikely to cause human extinction

"My own guess is that humans are capable of surviving far more severe climate shifts than those projected in nuclear winter scenarios. Humans are more robust than most any other mammal to drastic changes in temperature, as evidenced by our global range, even in pre-historic times"

I think it is worth noting that the speed of climate shifts might play an important role, as a lot of human adaptability seems to rely on gradual cultural evolution. While modern information technology has greatly sped up the potential for cultural evolution, I am unsure if these speedups are robust to a full-scale nuclear war.

AI risk hub in Singapore?

I interpreted this as a relative reduction of the probability (P_new=0.84*P_old) rather than an absolute decrease of the probability by 0.16. However, this indicates that the claim might be ambiguous which is problematic in another way. 

Comparing Utilities

"The Nash solution differs significantly from the other solutions considered so far. [...]

2. This is the first proposal where the additive constants matter. Indeed, now the multiplicative constants are the ones that don't matter!"

In what sense do additive constants matter here? Aren't they neutralized by the subtraction?

Do mesa-optimizer risk arguments rely on the train-test paradigm?

You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.

AI Unsafety via Non-Zero-Sum Debate

"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."

Interesting. Do you have some examples of types of questions you expect to be safe or potential features of save questions? Is it mostly about the downstram consquences that answers would have, or more about instrumental goals that the questions induce for debaters?

Tradeoff between desirable properties for baseline choices in impact measures

I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

I am a bit confused about the section on the markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observability more complicated). However if I understand correctly, the second modification has the (expectation of the) penalty as a function of the complete agent policy and I don't really see, how that would help. Is there another reason to want the markov property, or am I missing some way in which the modification would simplify applying RL methods?

Load More