All of Thomas Larsen's Comments + Replies

I'm excited about ideas for concrete training setups that would induce deception[2] in an RLHF model, especially in the context of an LLM -- please post any ideas here. :) 

Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model produce exactly the same behavior. 

There are two ways for deception to appear: 

  1. Instrumentally: the model chooses deception because it has non-myopic future goals that are better achieved by deceiving humans now, gaining more power to pursue those goals in the future. 
  2. Directly: deception itself was selected for as an action. 

Another way of describing the d... (read more)

I've been exploring evolutionary metaphors to ML, so here's a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.) Related:

  • Worlds where iterative design fails
  • Recessive sickle cell trait allele

Recessive alleles persist due to overdominance letting detrimental alleles hitchhike on their fitness-enhancing dominant counterparts. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for in a stable equilibrium.

The metaphor with deception breaks down over the unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network; they're passed down or reinforced in bulk. This is what makes hitchhiking so common in genetic evolution. (I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)

Bonus point: recessive phase shift. In ML:

  1. Generalisable non-memorising patterns start out small/sparse/simple.
  2. Which means that input patterns rarely activate it, beca
An interpretable system trained for the primary task of being deceptive should honestly explain its devious plots in a separate output. An RLHF-tuned agent loses access to the original SSL-trained map of the world. So the most obvious problem is the wrong type signature of model behaviors: there should be more inbuilt side channels to its implied cognition, used to express and train capabilities/measurements relevant to what's going on semantically inside the model, not just externally observed output for its primary task out of a black box.

One major reason why there is so much AI content on LessWrong is that very few people are allowed to post on the Alignment Forum.

Everything on the alignment forum gets crossposted to LW, so letting more people post on AF wouldn't decrease the amount of AI content on LW. 

Sorry for the late response, and thanks for your comment, I've edited the post to reflect these. 

No worries! Thanks a lot for updating the post

I have the intuition (maybe from applause lights) that if negating a point sounds obviously implausible, then the point is obviously true and it is therefore somewhat meaningless to claim it. 

My idea in writing this was to identify some traps that I thought were non obvious (some of which I think I fell into as new alignment researcher). 

Disclaimer: writing quickly. 

Consider the following path: 

A. There is an AI warning shot. 

B. Civilization allocates more resources for alignment and is more conservative about pushing capabilities.  

C. This reallocation is sufficient to solve alignment and deploy aligned AGI before the world is destroyed. 

I think that a warning shot is unlikely (P(A) < 10%), but won't get into that here. 

I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization fur... (read more)

Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI.

I don't want to derail this thread, but I do really want to express my disbelief at this number before people keep quoting it. I definitely don't know 30 people at OpenAI who are working on making AI not kill everyone, and it seems kind of crazy to assert that there are (and I think such assertions are the result of some pretty adversarial dynamics I am sad about).

I think a warning shot would dramatically update them towards worry about ac

... (read more)

Hmm, the eigenfunctions just depend on the input training data distribution (which we call ), and in this experiment, they are distributed evenly on the interval . Given that the labels are independent of this, you'll get the same NTK eigendecomposition regardless of the target function. 

I'll probably spin up some quick experiments in a multi-dimensional input space to see if it looks different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of input po... (read more)

Typically the property which induces sinusoidal eigenfunctions is some kind of permutation invariance - e.g. if you can rotate the system without changing the loss function, that should induce sinusoids. The underlying reason for this:

  • When two matrices commute, they share an eigenbasis. In this case, the "commutation" is between the matrix whose eigenvectors we want, and the permutation.
  • The eigendecomposition of a permutation matrix is, roughly speaking, a Fourier transform, so its eigenvectors are sinusoids.
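A quick numerical sketch of these two bullet points (the n = 8 ring and the circulant kernel here are my own toy choices, not anything from the discussion): a translation-invariant kernel on a ring commutes with the cyclic shift, and their shared eigenvectors are discrete Fourier modes, i.e. sampled sinusoids.

```python
import numpy as np

# Cyclic-shift permutation matrix P on n points: (Pv)[i] = v[(i-1) % n].
n = 8
P = np.roll(np.eye(n), 1, axis=0)

# A circulant matrix (a translation-invariant kernel on a ring) commutes with P.
d = np.minimum(np.arange(n), n - np.arange(n))   # ring distance from point 0
kernel_row = np.exp(-d.astype(float) ** 2)
K = np.array([np.roll(kernel_row, k) for k in range(n)])
assert np.allclose(K @ P, P @ K)

# The shared eigenbasis is the discrete Fourier basis: each column of F is a
# sampled complex sinusoid.
F = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)

# Check: every Fourier mode is an eigenvector of both the permutation and the kernel.
for k in range(n):
    v = F[:, k]
    for M in (P, K):
        Mv = M @ v
        lam = v.conj() @ Mv            # Rayleigh quotient = eigenvalue (||v|| = 1)
        assert np.allclose(Mv, lam * v)
```

This is only the clean circulant case; in the NTK experiment the analogous structure comes from the evenly spaced training inputs.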
2Jeremy Gillen2mo
I'd expect that as long as the prior favors smoother functions, the eigenfunctions would tend to look sinusoidal?

An anonymous academic wrote a review of Joe Carlsmith's 'Is power-seeking AI an existential risk?', in which the reviewer assigns <1/100,000 probability to AI existential risk. The arguments given aren't very good imo, but maybe worth reading. 

Just made a fairly large edit to the post after lots of feedback from commenters.  My most recent changes include the following: 

  • Note limitations in introduction (lack academics, not balanced depth proportional to people, not endorsed by researchers) 
  • Update CLR as per Jesse's comment
  • Add FAR 
  • Update brain-like AGI to include this
  • Rewrite shard theory section 
    • Brain <-> shards 
  • effort: 50 -> 75 hours :)
  • Add this paper to DeepMind
  • Add some academics (David Krueger, Sam Bowman, Jacob Steinhardt, Dylan Hadfield-
... (read more)

Good point, I've updated the post to reflect this. 

I'm excited for your project :) 

Good point. We've added the Center for AI Safety's full name into the summary table which should help.

Thanks for the update! We've edited the section on CLR to reflect this comment, let us know if it still looks inaccurate.

not all sets all sets of reals which are bounded below have an infimum

Do you mean 'all sets of reals which are bounded below DO have an infimum'?

Thanks, fixed.

In the model-based RL setup, we are planning to give it actions that can directly modify the game state in any way it likes. This is sort of like an arbitrarily powerful superpower, because it can change anything it wants about the world, except of course that this is a cartesian environment and so it can't, e.g., recursively self-improve. 

With model-free RL, this strategy doesn't obviously carry over, so I agree that we are limited to easily codeable superpowers. 

Strong upvoted, and I quite like this idea; I will work on adding my guess of the scale of these orgs into the table. 

Hi Adam, thank you so much for writing this informative comment. We've added your summary of FAR to the main post (and linked this comment). 

Agree with both aogara and Eli's comment. 

One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.

For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.

PAIS #5 might be helpful here. It explains how a variety of empirical directions are related to X-Risk and probably includes many of the ones that academics are working on.

There's a lot of work that could be relevant for x-risk but is not motivated by it.  Some of it is more relevant than work that is motivated by it.  An important challenge for this community (to facilitate scaling of research funding, etc.) is to move away from evaluating work based on motivations, and towards evaluating work based on technical content.

Agreed it's really difficult for a lot of the work. You've probably seen it already but Dan Hendrycks has done a lot of work explaining academic research areas in terms of x-risk (e.g. this and this paper). Jacob Steinhardt's blog and field overview and Sam Bowman's Twitter are also good for context.

my current best guess is that gradient descent is going to want to make our models deceptive

Can you quantify your credence in this claim? 

Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, ... optimization pressure against deception? 

If you don't like these or want a more specific operationalization of this question, I'm happy with whatever you think is likely or filling out more details. 


I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don't take these too seriously:

  • ~90%: if you keep scaling up RL in complex environments ad infinitum, eventually you get deceptive alignment.
  • ~80%: conditional on RL in complex environments being the first path to transformative AI, there will be deceptively aligned RL models.
  • ~70%: if you keep scaling up GPT-style language modeling ad infinitum, eventuall
... (read more)

Sorry about that, and thank you for pointing this out. 

For now I've added a disclaimer (footnote 2 right now, might  make this more visible/clear but not sure what the best way of doing that is). I will try to add a summary of some of these groups in when I have read some of their papers, currently I have not read a lot of their research. 

Edit: agree with Eli's comment. 

There are many good arguments, but not that particular "<1% probability" proof that the question requests. All the good arguments rely on uncertain assumptions and don't reach the requisite standard of proof, especially when the assumptions are considered together. So by answering this way you are steelmanning the question (which it desperately needs).

Thank you Gabriel! 

Yeah good point, I think I should have included that link, updated now to include it. 

Hi Rohin, thank you so much for your feedback. I agree with everything you said and will try to update the post for clarity. 

I don't follow.

Sorry, that part was not well written (or well thought out), and so I'll try to clarify: 

What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis. 

  • I'm not sure I understand the tree realism post other than that a tree is a fuzzy category.  While I am al
... (read more)

What I meant by 'is the NAH true for ethics?' is 'do sufficiently intelligent agents tend to converge on the same goals?', which, now that I think about it, is just the negation of the orthogonality thesis. 

Ah, got it, that makes sense. The reason I was confused is that NAH applied to ethics would only say that the AI system has a concept of ethics similar to the ones humans have; it wouldn't claim that the AI system would be motivated by that concept of ethics.

However, when I tried to flesh out model splintering (a.k.a. concept extrapolation) assuming a model-based-RL AGI—see Section 14.4 here—I still couldn’t quite get the whole story to hang together.

Thanks for linking that! 

(Before publishing that, I sent a draft to Stuart Armstrong, and he told me that he had a great answer but couldn’t make it public yet :-P )

Oooh that is really exciting news. 

That makes sense. For me: 

  1. Background: I graduated from college at the University of Michigan this spring, where I majored in Math and CS. In college I worked on vision research for self-driving cars, and wrote my undergrad thesis on robustness (my linkedin). I spent a lot of time running the EA group at Michigan. I'm currently doing SERI MATS under John Wentworth. 
  2. Research taste: currently very bad and confused and uncertain. I want to become better at research and this is mostly why I am doing MATS right now. I guess I especially enjoy reading and thi
... (read more)

Thank you for this thoughtful response, I didn't know about most of these projects. I've linked this comment in the DeepMind section, and made some modifications for both clarity and to include a bit more.  

I think you can talk about the agendas of specific people on the DeepMind safety teams but there isn't really one "unified agenda".

This is useful to know.

Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful. I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"

I think this is a really nice write-up! As someone relatively new to the idea of AI Safety, having a summary of all the approaches people are working on is really helpful as it would have taken me weeks to put this together on my own.


Obviously this would be a lot of work, but I think it would be really great to post this as a living document on GitHub where you can update and (potentially) expand it over time, perhaps by curating contributions from folks. 

I probably won't do this, but I agree it would be good. 

In particular it would

... (read more)
Maybe this could be the GitHub document? [] But it doesn't seem up-to-date.

I confused CAIS with Drexler's Comprehensive AI Services. Can you add a clarification stating that they are different things?

If you can give them a fun problem to solve and make sure it's actually relevant and they are only rewarded for actually relevant work, then good research could still be produced.

Yeah I think the difficulty of setting this up correctly is the main crux. I'm quite uncertain on this, but I'll give the argument my model of John Wentworth makes against this: 

The Trojan detection competition does seem roughly similar to deception, and if you can find Trojans really well, it's plausible that you can find deceptive alignment. However, what we really ne... (read more)

Thoughts on John's comment: this is a problem with any method for detecting deception that isn't 100% accurate. I agree that finding a 100% accurate method would be nice, but good luck.

Also, you can somewhat get around this by holding some deception-detecting methods out (i.e. not optimizing against them). When you finish training and the held-out methods tell you that your AI is deceptive, you start over. Then you have to try to think of another approach that is more likely to actually discourage deception than fool your held-out detectors. This is the difference between gradient descent search and human design search, which I think is an important distinction.

Also, FWIW, I doubt that trojans are currently a good microcosm for detecting deception. Right now, it is too easy to search for the trigger using brute force optimization. If you ported this over to sequential-decision-making land where triggers can be long and complicated, that would help a lot. I see a lot of current trojan detection research as laying the groundwork for future research that will be more relevant.

In general, it seems better to me to evaluate research by asking "where is this taking the field/what follow-up research is this motivating?" rather than "how are the words in this paper directly useful if we had to build AGI right now?" Eventually, the second one is what matters, but until we have systems that look more like agents that plan and achieve goals in the real world, I'm pretty skeptical of a lot of the direct value of empirical research.
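The held-out-detectors protocol can be sketched in code (the model representation and detector predicates below are stand-ins, purely illustrative):

```python
# Sketch of the protocol: optimize against some deception detectors during
# training, but reserve others as a final audit; if a held-out detector fires,
# restart with a different approach rather than patching against it.
def select_model(candidates, train_detectors, audit_detectors):
    """Return the first candidate passing both detector sets, or None (restart)."""
    for model in candidates:
        if any(d(model) for d in train_detectors):
            continue          # optimization pressure rejects this candidate
        if any(d(model) for d in audit_detectors):
            return None       # held-out detector fired: start over, don't patch
        return model
    return None

# Hypothetical detectors: one that only catches unsubtle deception (trained
# against), and a stronger held-out one.
obvious = lambda m: m["deceptive"] and not m["subtle"]
subtle = lambda m: m["deceptive"]

sneaky = {"deceptive": True, "subtle": True}
honest = {"deceptive": False, "subtle": False}
assert select_model([sneaky], [obvious], [subtle]) is None   # audit forces restart
assert select_model([honest], [obvious], [subtle]) == honest
```

The point of the split is that the audit detectors are never part of the selection pressure, so passing them is (weak) evidence rather than an optimization target.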

Thank you Thomas, I really appreciate you taking the time to write out your comment, it is very useful feedback. 

I've linked your comment in the post and rewritten the description of CAIS. 

Thanks! I really appreciate it, and think it's a lot more accurate now. Nitpicks: I think the MLSS link is currently broken. Also, in the headline table, it still emphasizes model robustness perhaps more than is warranted.

Thank you Robert! 

I've fixed that, thanks for pointing that sentence fragment out.

Overall, there are relatively few researchers who are effectively focused on the technical problems most relevant to existential risk from alignment failures.

What do you think these technical problems are? 

Late response, but I think that Adam Shimi's Unbounded Atomic Optimization does a good job at a unified frame on alignment failures.

I tend to think that decision theory failures are not a primary worry for why we might land in AGI ruin, the AI having a poor decision theory is more of a capabilities problem. Afaik, the motivation behind studying decision theory is to have a better understanding of agency (and related concepts like counterfactual reasoning, logical updatelessness, etc) at a basic level.  

Incorrect: OpenAI is not aware of the risks of race dynamics.

OpenAI's Charter contains the following merge-and-assist clause: "We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in t

... (read more)
If the purpose of the merge-and-assist clause is to prevent a race dynamic, then it's sufficient for that clause to trigger when OpenAI would otherwise decide to start racing. They can interpret their own decision-making, right? Right?

I think this is part of the idea behind Eliezer-corrigibility, and I agree that if executed correctly, this would be helpful.  

The difficulty with this approach that I see is: 

1) how do you precisely specify what you mean by "impact on the world to be within small bounds" -- this seems to require a measure of impact. This would be amazing (and is related to value extrapolation). 

2) how do you induce the inner value of "being low impact" into the agent, and make sure that this generalizes in the intended way off of the training distributio... (read more)

The easy first step is a simple bias toward inaction, which you can provide with a large punishment per output of any kind. For instance, a language model with this bias would write out something extremely likely, and then stop quickly thereafter. This is only a partial measure, of course, but it is a significant first step.

Second through n-th steps are harder; I really don't even know how you figure out what values to try to train it with to reduce impact. The immediate things I can think of might also train deceit, so it would take some thought.

Also, across the time period of training, ask a panel (many separate panels) of judges to determine whether actions it is promoting for use in hypothetical situations or games were the minimal action it could have taken for the level of positive impact. Obviously, if the impact is negative, it wasn't the minimal action. Perhaps also train a network explicitly on the decisions of similar panels on such actions humans have taken, and use those same criteria.

Somewhere in there (best place unknown), penalize heavy use of computation in coming up with plans (though perhaps not in evaluating them).

Final step (and perhaps at other stages too): penalize any actions taken that humans don't like. This can be done in a variety of ways. For instance, have 3 random humans vote on each action it takes, and for each person that dislikes the action, give it a penalty.
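The first step can be written as simple reward shaping (a toy sketch; the function name, the per-output cost, and the numbers are all hypothetical):

```python
# Bias toward inaction: subtract a fixed cost from the reward for every
# output/action emitted, so the tuned policy prefers shorter outputs unless
# the extra content is worth the cost.
def shaped_reward(base_reward: float, num_outputs: int,
                  cost_per_output: float = 0.5) -> float:
    return base_reward - cost_per_output * num_outputs

# A longer answer must beat a shorter one by more than its extra output cost.
short = shaped_reward(base_reward=3.0, num_outputs=2)    # 3.0 - 1.0 = 2.0
long_ = shaped_reward(base_reward=3.5, num_outputs=10)   # 3.5 - 5.0 = -1.5
assert short > long_
```

As the comment notes, this only penalizes output volume, not the size of the output's downstream effects, so it is a partial measure at best.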
As with alternatives to utility functions, the practical solution seems to be to avoid explicit optimization (which is known to do the wrong things), and instead work on model-generated behaviors in other ways, without getting them explicitly reshaped into optimization. If there is no good theory of optimization (that doesn't predictably only do the wrong things), it needs to be kept out of the architecture, so that it's up to the system to come up with optimization it decides on later, when it grows up. What an architecture needs to ensure is clarity of aligned cognition sufficient to eventually make decisions like that, not optimization (of the world) directly.

Hi Vanessa! 

Thank you so much for your thoughtful reply. To respond to a few of your points: 

  • We only mean to test this in an artificial toy setting. We agree that empirical demonstrations seem very difficult. 
  • Thanks for pointing out the cartesian versions  -- I hadn't read this before, and this is a nice clarification on how to measure  in a loss-function agnostic way. 
  • It's good to know about the epistemic status of this part of the theory, we might take a stab at proving some of these bounds. 
  • We will definitely mak
... (read more)

Beliefs here are weakly held, I want to become more right. 

I think defining winning as coming away with the most utility is a crisp measure of what makes a good decision theory. 

The theory of counterfactuals is, in my mind, what separates the decision theories themselves, and is therefore the core question/fuzziness in solving decision theory. Changing your theory of counterfactuals alters the answer to the fundamental question: "when you change your action/policy, what parts of the world change with you?". 

It doesn't seem like there is a d... (read more)

Over the course of the universe, the best decision theory is a consensus/multiple-evaluation theory. Evaluate which part of the universe you're in, and what is the likelihood that you're in a causally-unusual scenario, and use the DT which gives the best outcome. How a predictor works when your meta-DT gives different answers based on whether you've been predicted, I don't know. Like a lot of adversarial(-ish) situations, the side with the most predictive power wins.

Winning here corresponds to getting the most expected utility, as measured from the start of the problem. We assume we can measure utility with money, so we can e.g. put a dollar value on getting rescued. 

1) In the smoking lesion problem, the exact numbers don't so much matter as the relationship between them: you are simply choosing whether or not to gain an amount of utility or to pass it up.   

2) Defecting on a cooperator in a one-shot true prisoner's dilemma is the "best" outcome, so this is exactly right. See this story.  

3) In the ... (read more)

1Vivek Hebbar4mo
I don't think this sorts out the fundamental fuzziness of what "winning" means -- what you need is a theory of counterfactuals

I think realizability is the big one. Some others are: 

Did you mean: 

And yet, assuming **tool** AI is possible at all, it will be possible to assemble those tools into something agenty.

Nope. The argument is:

  • Transistors/wires are tool-like (i.e. not agenty)
  • Therefore if we are able to build an AGI from transistors and wires at all...
  • ... then it is possible to assemble tool-like things into agenty things.

I might be missing something, but it seems to me like the way to modify it into a smaller state diagram is to remove the HALT state from the TM and then redraw any state transition that goes to HALT to map to any other state arbitrarily.

This won't change the behavior on computations that haven't halted so far, because these computations never reached the HALT state, and so won't be affected by any of the swapped transitions. 
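This argument is easy to sanity-check with a toy simulator (a sketch; the machine, its encoding, and the trace comparison are made up for illustration):

```python
# Toy check: rewiring transitions that point at HALT cannot change the first
# T steps of any computation that doesn't halt within T steps.
# States and symbols are ints; delta maps (state, symbol) -> (state, write, move).
HALT = -1

def run(delta, tape, state=0, pos=0, max_steps=100):
    """Run a Turing machine, returning the trace of (state, pos) pairs."""
    tape = dict(enumerate(tape))
    trace = []
    for _ in range(max_steps):
        if state == HALT:
            break
        trace.append((state, pos))
        sym = tape.get(pos, 0)
        state, tape[pos], move = delta[(state, sym)]
        pos += move
    return trace

# A machine that walks right forever on blanks, but halts when it reads a 1.
delta = {(0, 0): (0, 0, +1), (0, 1): (HALT, 1, 0)}

# Rewire the HALT transition to an arbitrary state (here: back to state 0).
rewired = {k: ((0, w, m) if s == HALT else (s, w, m))
           for k, (s, w, m) in delta.items()}

# On an input where the original never halts, the traces agree exactly:
# the swapped transition is simply never taken.
assert run(delta, [0, 0, 0]) == run(rewired, [0, 0, 0])
# On an input where it does halt, the traces diverge only after the halt point.
assert run(delta, [0, 1]) != run(rewired, [0, 1])
```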

What is the structure of Conjecture's interpretability research, and how does it differ from Redwood and Anthropic's interpretability research?


Edit: This was touched on here

In their announcement post they mention:

Mechanistic interpretability research in a similar vein to the work of Chris Olah and David Bau, but with less of a focus on circuits-style interpretability  and more focus on research whose insights can scale to models with many billions of parameters and larger. Some example approaches might be: 

  • Locating and editing factual knowledge in a transformer language model.
  • Using deep learning to automate deep learning interpretability - for example, training a language model to give semantic labels
... (read more)

The best resource that I have found on why corrigibility is so hard is the arbital post, are there other good summaries that I should read? 

3Thomas Kwa4mo
Not an answer, but I think of "adversarial coherence" (the agent keeps optimizing for the same utility function even under perturbations by weaker optimizing processes, like how humans will fix errors in building a house or AlphaZero can win a game of Go even when an opponent tries to disrupt its strategy) as a property that training processes could select for. Adversarial coherence and corrigibility are incompatible.

Minor note: I think you meant that it does model-based planning -- this is what the graph search means. Also see the paper

" We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero." 

There are reasons to think that an AI is aligned between "hoping it is aligned" and "having a formal proof that it is aligned". For example, we might be able to find sufficiently strong selection theorems, which tell us that certain types of optima tend to be chosen, even if we can't prove theorems with certainty. We also might be able to find a working ELK strategy that gives us interpretability. 

These might not be good strategies, but the statement "Therefore no AI built by current methods can be aligned" seems far too strong. 

There is also the ontology identification problem. The two biggest things are: we don't know how to specify exactly what a diamond is because we don't know the true base level ontology of the universe. We also don't know how diamonds will be represented in the AI's model of the world. 

I personally don't expect coding a diamond maximizing AGI to be hard, because I think that 'diamond' is a sufficiently natural concept that doing normal gradient descent will extrapolate in the desired way, without inner alignment failures. If the agent discovers more bas... (read more)

(Unsure whether to mark "agree" for the first two paragraphs, or "disagree" for the last line. Leaving this comment instead.)

Thanks for your response! I'm not sure I communicated what I meant well, so let me be a bit more concrete. Suppose our loss is parabolic, L(x) = x₁² + x₂², where x ∈ ℝ³. This is like a 2d parabola (but its convex hull / volume below a certain threshold is 3D). In 4D space, which is where the graph of this function lives and hence where I believe we are talking about basin volume, this has 0 volume. The Hessian is the matrix H = diag(2, 2, 0).

This is conveniently already diagonal, and the 0 eigenvalue comes from the x₃ component, which... (read more)

3Vivek Hebbar5mo
The loss is defined over all dimensions of parameter space, so L(x) = x₁² + x₂² is still a function of all 3 x's. You should think of it as L(x) = x₁² + x₂² + 0·x₃². Its thickness in the x₃ direction is infinite, not zero. Here's what a zero-determinant Hessian corresponds to: the basin here is not lower dimensional; it is just infinite in some dimension.

The simplest way to fix this is to replace the infinity with some large value. Luckily, there is a fairly principled way to do this:

  1. Regularization / weight decay provides actual curvature, which should be added in to the loss, and doing this is the same as adding λIₙ to the Hessian.
  2. The scale of the initialization distribution provides a natural scale for how much volume an infinite sweep should count as (very roughly, the volume only matters if it overlaps with the initialization distribution, and the distance of sweep for which this is true is on the order of σ, the standard deviation of the initialization).

So (λ + kσ⁻²)Iₙ is a fairly principled correction, and much better than just "throwing out" the other dimensions. "Throwing out" dimensions is unprincipled, dimensionally incorrect, numerically problematic, and should give worse results.
2Charlie Steiner5mo
Note that this is equivalent to replacing "size 1/0" with "size 1". Issues with this become apparent if the scale of your system is much smaller or larger than 1. A better try might be to replace 0 with the average of the other eigenvalues, times a fudge factor. But still quite unprincipled - maybe better is to try to look at higher derivatives first or do nonlocal numerical estimation like described in the post.

I am a bit confused how you deal with the problem of 0 eigenvalues in the Hessian. It seems like the reason that these 0 eigenvalues exist is because the basin volume is 0 as a subset of parameter space. My understanding right now of your fix is that you are adding a small constant along the diagonal to make the matrix full rank (and this quantity is coming from the regularization plus a small quantity). Geometrically, this seems like drawing a narrow ellipse around the subspace of which we are trying to estimate the volume. 

But this doesn't seem na... (read more)

4Charlie Steiner5mo
The hessian is just a multi-dimensional second derivative, basically. So a zero eigenvalue is a direction along which the second derivative is zero (flatter-bottomed than a parabola). So the problem is that estimating basin size this way will return spurious infinities, not zeros.
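A minimal numerical sketch of the problem and of the correction discussed in this thread, assuming the fix simply adds a constant (weight decay λ plus an initialization-scale term, here taken as k/σ²) to every eigenvalue; all hyperparameter values are hypothetical:

```python
import numpy as np

# Hessian of L(x) = x1^2 + x2^2 over a 3-d parameter space.
H = np.diag([2.0, 2.0, 0.0])
eigs = np.linalg.eigvalsh(H)

# Naive Gaussian-volume estimate ~ prod(1/sqrt(h_i)): the flat (zero-eigenvalue)
# direction contributes a spurious infinity, not a zero.
with np.errstate(divide='ignore'):
    per_dim = 1.0 / np.sqrt(eigs)
assert np.isinf(per_dim).any()

# Correction: add lam (curvature from weight decay) plus an initialization-scale
# term k/sigma^2 to every eigenvalue before taking the volume.
lam, k, sigma = 1e-2, 1.0, 1.0    # hypothetical hyperparameters
corrected_vol = np.prod(1.0 / np.sqrt(eigs + lam + k / sigma**2))
assert np.isfinite(corrected_vol) and corrected_vol > 0
```

The correction leaves well-curved directions nearly unchanged (2 vs ~3 here only because the toy σ is large) while capping the flat direction at the initialization scale, rather than discarding it.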

Thanks, this is indeed what I was asking. 

TL;DR: I agree that the answer to the question above definitely isn't always yes, because of your counterexample, but I think that moving forward on a similar research direction might be useful anyways. 

One can imagine partitioning the parameter space into sets that arrive at basins where each model in the basin has the same, locally optimal performance, this is like a Rashomon set (relaxing the requirement from global minima so that we get a partition of the space). The models which can compress the training data (and thus have free parameters) are g... (read more)

2[comment deleted]5mo
2Peter S. Park5mo
Edit: Adding a link to "Git Re-Basin: Merging Models modulo Permutation Symmetries," a relevant paper that has recently been posted on arXiv.

Thank you so much, Thomas and Buck, for reading the post and for your insightful comments! It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on "non-artificial" datasets ("datasets from nature"?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn't happen.

One heuristic argument for why two disconnected global minimizers might only happen in "artificial" datasets might go something like this. Given two quantities, one is larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model's loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. But overparametrizing the model makes the suboptimal local minima not local minima anymore (by making them saddle points?) while the single global minimizer is "stretched out" to a whole submanifold. This "stretching out" is the symmetry; all optimal models on this submanifold are secretly the same.

One situation where this heuristic fails is if there are other types of symmetry, like rotation. Then, applying this move to a global minimizer could get you other global minimizers which are not connected to each other. In this case, "modding out by the symmetry" is not decreasing the dimension, but taking the quotient by the symmetry group, which gives you a quotient space of the same dimension. I'm guessing these types of situations are more common in "artificial" datasets which have not modded out all the obvious symmetries yet.