Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a special post for short-form writing by Hoagy.

There's an argument that I've been thinking about which I'd really like some feedback or pointers to literature on:

The tl;dr is that overcomplete bases necessitate linear representations.

lots of space). (See also Toy models of transformers, my sparse autoencoder posts.) Of course, linear representations will still be sensitive to this kind of interference, but I suspect there's a mathematical proof of why linear features are the most robust way to represent information in this kind of situation. I'm not sure where to look for existing work or how to start trying to prove it.
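A rough numerical sketch of the interference in question, not a proof of anything: more features than dimensions are stored as random unit directions, a sparse set of them is switched on, and each feature is read back with a plain dot product. The function names, sizes, and the choice of random directions are all assumptions made for illustration.

```python
import numpy as np

def random_feature_directions(n_features, dim, rng):
    """Draw n_features random unit vectors in R^dim (overcomplete if n_features > dim)."""
    w = rng.normal(size=(n_features, dim))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def mean_readout_error(n_features, dim, n_active, n_trials=200, seed=0):
    """Mean absolute error of a linear readout when n_active features are on."""
    rng = np.random.default_rng(seed)
    w = random_feature_directions(n_features, dim, rng)
    errors = []
    for _ in range(n_trials):
        x = np.zeros(n_features)
        active = rng.choice(n_features, size=n_active, replace=False)
        x[active] = 1.0
        h = x @ w        # linear encoding: sum of the active feature directions
        x_hat = w @ h    # linear readout: dot product with every direction
        errors.append(np.abs(x_hat - x).mean())  # whatever is left is interference
    return float(np.mean(errors))

if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(k, mean_readout_error(n_features=256, dim=64, n_active=k))
```

The readout error here comes entirely from the non-zero dot products between different feature directions, so it grows as more features are active at once. Comparing this against a non-linear encoding scheme would be the interesting next step, which this sketch does not attempt.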

I've been looking at papers involving a lot of 'controlling for confounders' recently and am unsure about how much weight to give their results.

Does anyone have recommendations about how to judge the robustness of these kinds of studies?

Also, I was considering running some tests of my own based on random causal graphs: testing what happens to regressions when you control for only a limited subset of confounders, while varying the size and depth of the graph and so on. I can't find any similar papers, but I don't know the area. Does anyone know of similar work?
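A minimal sketch of what such a test could look like, assuming a linear-Gaussian setup: the confounders form a random DAG, both the treatment `x` and the outcome `y` depend on all of them, and the regression controls for only a random subset. The function name `estimate_effect`, the edge probability, and the coefficient distributions are hypothetical choices, not taken from any paper.

```python
import numpy as np

def estimate_effect(n_confounders, n_controlled, n_samples=5000,
                    true_effect=1.0, edge_prob=0.3, seed=0):
    """Estimate the effect of x on y by OLS while controlling for a subset of confounders."""
    rng = np.random.default_rng(seed)
    # Random DAG over confounders: strictly lower-triangular weights, so
    # confounder i can only depend on confounders j < i.
    mask = np.tril(rng.random((n_confounders, n_confounders)) < edge_prob, k=-1)
    weights = mask * rng.normal(size=(n_confounders, n_confounders))
    z = np.zeros((n_samples, n_confounders))
    for i in range(n_confounders):
        z[:, i] = z @ weights[i] + rng.normal(size=n_samples)
    # Every confounder influences both treatment and outcome.
    a = rng.normal(size=n_confounders)
    b = rng.normal(size=n_confounders)
    x = z @ a + rng.normal(size=n_samples)
    y = true_effect * x + z @ b + rng.normal(size=n_samples)
    # Control for a random subset of the confounders only.
    controlled = rng.choice(n_confounders, size=n_controlled, replace=False)
    design = np.column_stack([x, z[:, controlled], np.ones(n_samples)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0])  # coefficient on x; the true value is true_effect

if __name__ == "__main__":
    for k in (0, 2, 4, 8):
        errs = [abs(estimate_effect(8, k, seed=s) - 1.0) for s in range(20)]
        print(f"controlled {k}/8 confounders: mean abs error {np.mean(errs):.3f}")
```

Varying `n_confounders`, `edge_prob`, and `n_controlled` gives a crude picture of how the bias behaves as the controlled subset shrinks. Non-linear relationships, colliders, and mediators are deliberately left out here and would need a richer graph model.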

Robust statistics is a field. Wikipedia links to http://lagrange.math.siu.edu/Olive/ol-bookp.htm, which has chapters like Chapter 7 - Robust Regression and Chapter 8 - Robust Regression Algorithms.

Thanks, I'll give it a read.

Maybe reading Gelman's self-contained comments on SSC's More Confounders would make you more confused in a good way.

Cheers, glad I'm not dealing with 300 variables. Don't think the situation is quite as dire as for sleeping pills luckily.

Question:

Does anyone know of papers on creating human-interpretable latent spaces with auto-encoders?

An example of the systems I have in mind would be a NN generating face images from a latent space, designed such that dimension 0 encodes skin tone, dimension 1 encodes hair colour etc.

Will be doing my own literature search but if anyone knows the area some pointers to papers or search terms would be very helpful!

There is definitely something out there, just can't recall the name. A keyword you might want to look for is "disentangled representations".

One start would be the beta-VAE paper https://openreview.net/forum?id=Sy2fzU9gl
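For orientation, the core idea of beta-VAE is to weight the KL term of the usual VAE objective by a factor beta greater than 1, which pushes the latent dimensions towards independence. A minimal sketch of that loss in NumPy follows; the function name, the squared-error reconstruction term, and the default `beta=4.0` are assumptions for illustration rather than the paper's exact setup.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Batch-averaged beta-VAE objective: reconstruction error + beta * KL."""
    # Squared-error reconstruction term (assumed here; a Bernoulli
    # likelihood is another common choice for image data).
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) per example.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + beta * kl))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((8, 16))                          # batch of 8 inputs, 16 features
    x_recon = x + 0.1 * rng.normal(size=x.shape)     # stand-in for decoder output
    mu = 0.5 * rng.normal(size=(8, 2))               # stand-in for encoder means
    log_var = np.zeros((8, 2))                       # stand-in for encoder log-variances
    print(beta_vae_loss(x, x_recon, mu, log_var))
```

Setting `beta=1.0` recovers the standard VAE objective. Note that this objective encourages disentangled dimensions but does not by itself decide which dimension ends up encoding skin tone or hair colour.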

Cheers!

Suggestion:

Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he's not able to work full time on research directly (or so I understand, could be way off).

Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it'd be bad? Time/energy usage for Eliezer combined with difficulty of delegation?

Question about error-correcting codes that's probably in the literature but I don't seem to be able to find the right search terms:

How can we apply error-correcting codes to logical *algorithms*, as well as bit streams?

If we want to check that a bit stream is accurate, we know how to do this for a manageable overhead. But what happens if there's an error in the hardware that does the checking? It's not easy for me to construct a system that has no single point of failure: you can run the correction algorithm multiple times, but how do you compare the results without ending up back with a single point of failure?

Anyone know any relevant papers or got a cool solution?

Interested for the stability of computronium-based futures!

At the risk of pointing out the obvious, the "typical" method that has been used in the past in military and space systems is hardware redundancy (often x3).
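A minimal sketch of the x3 scheme (triple modular redundancy), assuming each replica is modelled as a Python function returning an integer and the voter is a bitwise majority. The names `majority_vote` and `run_tmr`, and the simulated fault, are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit matches at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)

def run_tmr(units, value: int) -> int:
    """Run three replicas of the same computation and vote on their outputs."""
    a, b, c = (unit(value) for unit in units)
    return majority_vote(a, b, c)

def healthy(v: int) -> int:
    """The intended computation: square the input, keep the low 8 bits."""
    return (v * v) & 0xFF

def faulty(v: int) -> int:
    """Same computation with one output bit flipped, simulating a hardware fault."""
    return healthy(v) ^ 0b100

if __name__ == "__main__":
    # One faulty replica out of three is outvoted on every bit.
    print(run_tmr((healthy, faulty, healthy), 7))  # prints 49
```

This masks any fault confined to a single replica, but the voter itself is still a single point of failure, which is exactly the concern raised in the question. As far as I know, the usual mitigation is to replicate the voters as well, so that every downstream stage does its own vote over the three upstream outputs.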