Marius Hobbhahn

I'm currently doing a Ph.D. in ML at the International Max Planck Research School in Tübingen. My focus is on Bayesian ML, and I'm exploring its role in AI alignment, alongside non-Bayesian approaches. I want to become an AI safety researcher/engineer. If you think I should work for you, please reach out.


Wiki Contributions


Our mental building blocks are more different than I thought

The building block concept was just something that I found intuitive. It's not backed by rigorous research or intense thinking. I think they could just as well be called tasks, traits, or other things that relate to psychological findings. You should really think of this post as something I found worth sharing without having done a lot of background reading.

Our mental building blocks are more different than I thought

I'm not sure either to be fair. My friend with aphantasia says it doesn't make that much of a practical difference for her. But it's hard to compare since we don't know the counterfactual. 

I'm generally pretty uncertain how large the differences are, but some discussions led me to believe that they are bigger than I expected. At some point I was just like "Wait, you can't rotate the shape in your head?" or "What do you mean, you feel music?".

I think there are a ton of interesting questions to dive into. Probably a lot have already been answered by psychologists. I think the independence question is very interesting as well. 

Investigating causal understanding in LLMs

I would expect the results to be better on, let's say, PaLM. I would also expect it to base more of its answers on content than on form.

I think there are a ton of experiments in the direction of natural story plots that one could run, and I would be interested in seeing the results. The reason we started with relatively basic toy problems is that they are easier to control. For example, it is quite hard to tell whether the model learned from form or from content in a natural story context.

Overall, I expect there to be many further research projects and papers in this direction. 

Eliciting Latent Knowledge (ELK) - Distillation/Summary

Thank you for the feedback. I will update the post to be clearer on imitative generalization.

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

I think the claim you are making is correct but it still misses a core point of why some people think that Bayes nets are more interpretable than DL. 
a) Complexity: a neural network is technically a Bayes net: it has nodes and edges and is acyclic. However, when people compare Bayes nets to NNs, I think they usually mean a smaller Bayes net that somehow "captures all the important information" of the NN.
b) Ontology: when people look at a NN, they usually don't know what any particular neuron or circuit does, because it might use different concepts than humans use when thinking about the same topic. When people use a Bayes net, they usually assume that the nodes reflect concepts that humans use. So in practice it is more interpretable.
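The ontology point can be illustrated with a toy sketch (all names and numbers are hypothetical, chosen only for illustration): in a hand-built Bayes net every node is a human concept, so you can read the computation off directly, whereas the hidden units of even a tiny neural net carry no labels at all.

```python
import numpy as np

# A tiny hand-built Bayes net: every node is a human concept
# ("rain", "sprinkler", "wet grass"), so reading off *why* the
# model believes something is straightforward.
p_rain = 0.2
p_sprinkler = 0.3
p_wet = {  # P(wet_grass | rain, sprinkler), keyed by (rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def p_wet_grass():
    # Marginalize over the two named parent variables.
    total = 0.0
    for rain in (True, False):
        for spr in (True, False):
            pr = p_rain if rain else 1 - p_rain
            ps = p_sprinkler if spr else 1 - p_sprinkler
            total += pr * ps * p_wet[(rain, spr)]
    return total

# A tiny neural net: the hidden units carry no labels, so
# interpreting them requires extra interpretability work.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))

def nn_forward(x):
    h = np.tanh(x @ W1)  # what concept does h[2] encode? Unknown.
    return h @ W2

print(round(p_wet_grass(), 4))  # 0.3774
```

The point is not that the NN computes anything wrong, but that the Bayes net's graph structure doubles as an explanation while the NN's weights do not.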

I think that there is a case for using Bayes Nets in combination with NNs to get higher interpretability and I'll write a post on that in the future. 

The limits of AI safety via debate

Thanks for taking the time. I now understand all of your arguments and am convinced that most of my original criticisms are wrong or inapplicable. This has greatly increased my understanding of and confidence in AI safety via debate. Thank you for that. I updated the post accordingly. Here are the updated versions (copied from above):

Re complexity: 
Update 2: I misunderstood Rohin’s response. He actually argues that, in cases where a claim X breaks down into claims X1 and X2, the debater has to choose which one is more effective to attack, i.e. it is not able to backtrack later on (maybe it still can by making the tree larger - not sure). Thus, my original claim about complexity is not a problem since the debate will always be a linear path through a potentially exponentially large tree.
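The "linear path through an exponentially large tree" picture can be sketched with a toy model (this is an illustration, not the actual debate protocol; the decomposition function is hypothetical):

```python
# Toy model: a claim decomposes into subclaims, forming a tree that
# may be exponentially large. In each round the attacker picks ONE
# subclaim to challenge, so the transcript is a single root-to-leaf
# path: its length scales with the tree's depth, not its size.

def subclaims(claim, depth):
    # Hypothetical decomposition: every claim splits in two
    # until we reach directly checkable leaves.
    if depth == 0:
        return []
    return [f"{claim}.{i}" for i in (1, 2)]

def debate_transcript(claim, depth, pick):
    """Follow the attacker's choices down the tree."""
    path = [claim]
    while depth > 0:
        children = subclaims(claim, depth)
        claim = pick(children)   # attacker challenges one subclaim
        path.append(claim)
        depth -= 1
    return path

# With depth 20 the full tree has 2**20 leaves, but the transcript
# the judge actually sees contains only 21 claims.
transcript = debate_transcript("X", 20, pick=lambda cs: cs[0])
print(len(transcript))  # 21
```

This also makes the no-backtracking point concrete: once `pick` has chosen a branch, the siblings are never revisited within that debate.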

Re deception:
Update 2: We were able to agree on the bottleneck. We both believe that the claim "it is harder to lie than to refute a lie" is the question that determines whether debate works or not. Rohin was able to convince me that it is easier to refute a lie than I originally thought, and I therefore believe more in the merits of AI safety via debate. The main intuition that changed is that the refuter mostly has to keep poking holes rather than presenting an alternative in one step. In the "flawed physics" setting described above, for example, the opponent doesn't have to explain the alternative physics in the first step. They could just continue to point out flaws and inconsistencies in the current setting and then slowly introduce the new system of physics and how it would resolve these inconsistencies.

Re final conclusion:
Update 2: Rohin mostly convinced me that my remaining criticisms don’t hold or are less strong than I thought. I now believe that the only real problem with debate (in a setting with well-intentioned verifiers) is when the claim “it is harder to lie than to refute a lie” doesn’t hold. However, I updated that it is often much easier to refute a lie than I anticipated because refuting the lie only entails poking a sufficiently large hole into the claim and doesn’t necessitate presenting an alternative solution.

The limits of AI safety via debate

Thank you for the detailed responses. You have convinced me of everything but two questions. I have updated the text to reflect that. The two remaining questions are (copied from text): 

On complexity: There was a second disagreement about complexity. I argued that some debates actually break down into multiple necessary conditions, e.g. if you want to argue that you played Fortnite, you have to show that it is possible to play Fortnite and then that it is plausible that you played it. The pro-Fortnite debater has to establish both claims while the anti-Fortnite debater has to defeat only one. Rohin argued that this is not the case, because every debate is ultimately only about the plausibility of the original statement, independent of the number of subcomponents it logically breaks down into (or at least that’s how I understood him).

On deception: This is the only point where Rohin hasn’t convinced me yet. He argues that the debaters have no incentive to be deceptive since the other debater is equally capable and has an incentive to point out this deception. I think this is true: as long as the reward for pointing out deception is bigger than that of alternative strategies, e.g. being deceptive yourself, you are incentivized to be truthful.

Let’s say, for example, our conception of physics was fundamentally flawed and both debaters knew this. To win the debate, one (truthful) debater would have to argue that our current concept of physics is flawed and establish the alternative theory while the other one (deceptive) could argue within our current framework of physics and sound much more plausible to the humans. The truthful debater is only rewarded for their honesty when the human verifier waits long enough to understand the alternative physics explanation before giving the win to the deceptive debater. In case the human verifier stops early, deception is rewarded, right? What am I missing? 
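The early-stopping worry can be put into a toy expected-reward comparison (all numbers are hypothetical; this only formalizes the intuition, it is not a model of actual debate training):

```python
# Toy model: the judge stops early with probability p_stop. If the
# debate runs long enough, the truthful "our physics is flawed"
# debater wins; if the judge stops early, the plausible-sounding
# deceptive debater wins. Each debater picks the strategy with the
# higher expected reward.

def expected_reward(strategy, p_stop):
    if strategy == "truthful":
        return 1.0 * (1 - p_stop)  # wins only if the judge waits
    return 1.0 * p_stop            # deceptive: wins on early stop

def best_strategy(p_stop):
    return max(("truthful", "deceptive"),
               key=lambda s: expected_reward(s, p_stop))

print(best_strategy(0.3))  # truthful
print(best_strategy(0.7))  # deceptive: early stopping rewards lying
```

On this toy picture, whether honesty is the winning policy hinges entirely on how patient the verifier is, which is exactly the concern.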

In general, I feel like the question of whether the debater is truthful or not only depends on whether it would be rewarded for being so. However, I (currently) don’t see strong reasons for the debater to always be truthful. To me, the bottleneck seems to be which kind of behavior humans intentionally or unintentionally reward during training, and I can imagine enough scenarios in which we accidentally reward dishonest or deceptive behavior.

The limits of AI safety via debate

Thanks for your detailed comment. Let me ask some clarifying questions. I will update the post afterward.

Assumption 1: 

I understand where you are going, but the underlying path in the tree might still be very long, right? The not-Fortnite debater might argue that you couldn't have played Fortnite because electricity doesn't exist. Then the Fortnite debater has to argue that it does exist, right?

Furthermore, I don't see why it should just be one path in the tree. Some arguments have multiple necessary conditions/burdens. Why do I not have to prove all of them? Otherwise, the opponent in the debate can always answer with "OK assume everything you said is true, what about the other burden?". 
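The multiple-burdens worry above can be made concrete with a toy calculation (the probabilities are hypothetical, and this assumes the debate reduces to whichever conjunct the attacker challenges):

```python
# Toy model of the "multiple necessary conditions" worry: a claim
# X = X1 AND X2. If the attacker may pick whichever conjunct to
# challenge, the defender's winning chance is the MINIMUM over the
# conjuncts, so each added necessary condition can only hurt the
# honest defender.

def defender_win_prob(conjunct_probs):
    # The attacker challenges the conjunct the defender is most
    # likely to lose; the debate then reduces to that conjunct.
    return min(conjunct_probs)

# e.g. "I played Fortnite" splits into "it is possible to play
# Fortnite" (easy to defend) and "it is plausible I played it"
# (harder to defend). Hypothetical numbers:
print(defender_win_prob([0.95, 0.6]))  # 0.6
print(defender_win_prob([0.95]))       # 0.95: fewer burdens, better
```

If, as Rohin argues, the debate instead only ever tracks the plausibility of the original statement, this min-over-conjuncts picture would not apply.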

I'll update this section once I understand your criticism better.

Assumption II: 

Ok, let's say that we are able to understand it after a billion years of looking at it. Or maybe we understand it after the heat death of the universe. Does that really change anything? Perhaps I should reframe it as "understanding the concept in principle (in a relevant sense)" or something like that. 

I think the more compelling analogy to me is "could you teach your dog quantum physics" given lots of time and resources. I'm not sure the dog is able to understand. What do you think?

Assumptions III and IV:

These are practical problems of debate. I mostly wanted to point out that they could affect the people running experiments with debate. I think they could also arise in a company, e.g. when the AI says things that are in the interest of the specific verifier but not their manager. I think this point can be summarized as "as long as humans are the verifiers, human flaws can be breaking points of AI safety via debate".
I'll rephrase them to emphasize this more.

Framing: what is AI safety used for

I think your framing of AI safety as a tool for AI safety researchers reduces some of the problems I described, and I will rewrite the relevant passages. However, while the interests of the AI company might be less complex, they are not necessarily straightforward, e.g. when leadership has different ideals than the safety team and would thus side with different debaters in the final node.

Assumption V:

I agree with you that in a perfect setting this could not happen. However, we see it often in real life, e.g. in TV debates, or even with well-intentioned scientists who have held wrong beliefs for a long time despite sometimes being confronted with the truth and an explanation for it. I think it's more a question of how much we trust the verifier to make the right call given a good explanation than a fundamental disagreement with the method.

Assumption VI:

The example is not optimal. I see that now and will change it. However, the underlying argument still seems true to me. The incentive of the AI is to get the human to declare it the winner, right? Therefore, it will use all its tools to win. If it has superhuman intelligence and a very accurate model of the verifier(s), it will say whatever makes the humans give it the win. If deception is part of that strategy, why wouldn't it use it? I think this is very important, and I currently don't understand your reasoning. Let me know if you think I'm missing something.
