People here regularly describe themselves as "pessimistic" about various aspects of AI risk, so this seems like an isolated demand for rigor.

"Optimism" certainly isn't the same as a neutral, balanced view of the possibilities. It is an expression of the belief that things will go well despite clear signs of danger (e.g. the often-expressed concerns of leading AI safety experts). If you think your view is balanced and neutral, maybe "optimism" is not the best term to use. But then I would have expected many more caveats and expressions of uncertainty in your statements.

This seems like a weird bait-and-switch to me, where an object-level argument is only ever allowed to reach a neutral middle-ground conclusion. A "neutral, balanced view of possibilities" is absolutely allowed to end on a strong conclusion without a forest of caveats. You switch your reading of "optimism" partway through this paragraph, in a way that is inconsistent with your earlier comment and smuggles in the conclusion "any purely factual argument will express a wide range of concerns and uncertainties, or else it is biased".


The helix is already pretty long, so maybe layernorm is responsible?

E.g. to do position-independent look-back we want the geometry of the embedding to be invariant under some Euclidean embedding of the 1D translation group. If you have enough space handy it makes sense for this to be a line. But if you only have a bounded region to work with, and you want to keep the individual position embeddings a certain distance apart, you are forced to "curl" the line up into a more complex representation (screw transformations), because you need the position-embedding curve to have large total length while staying close to the origin.

Actually, layernorm may directly ruin the linear case by projecting it away, so what you really want is an approximate group symmetry that lives on the sphere. In this picture the natural shape for shorter lengths is a circle, and for longer lengths we are forced to stretch it into a separate dimension if we aren't willing to make the circle arbitrarily dense.
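This picture is easy to check numerically. Below is a minimal sketch (the dimensions, angular frequency, and drift rate are arbitrary illustrative choices, not measurements from any real model) showing that layernorm collapses a pure line of position embeddings to a single point, while a helix keeps positions distinguishable after the projection:

```python
import numpy as np

def layernorm(x):
    # Per-vector mean-subtraction and scaling to unit std (no learned affine).
    x = x - x.mean(axis=-1, keepdims=True)
    return x / x.std(axis=-1, keepdims=True)

d = 16
rng = np.random.default_rng(0)
positions = np.arange(1, 50)

# Line embedding: p_t = t * v.  After layernorm every position collapses
# to the same point, so relative position information is destroyed.
v = rng.normal(size=d)
line = positions[:, None] * v[None, :]
line_normed = layernorm(line)
print(np.ptp(line_normed, axis=0).max())  # ~0: all rows identical

# Helix embedding: rotation in a 2D plane plus a slow drift in a third
# coordinate.  The angular part survives layernorm, so positions stay
# distinguishable after the projection.
helix = np.zeros((len(positions), d))
helix[:, 0] = np.cos(0.3 * positions)
helix[:, 1] = np.sin(0.3 * positions)
helix[:, 2] = 0.05 * positions
helix_normed = layernorm(helix)
pairwise = np.linalg.norm(helix_normed[:, None] - helix_normed[None, :], axis=-1)
print(pairwise[np.triu_indices(len(positions), 1)].min())  # > 0: all positions distinct
```

The line collapses because layernorm is invariant to positive rescaling, and every point on a line through the origin is a rescaling of the same vector; the helix's angular coordinates carry position information that survives the normalization.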


What CCS does, conceptually, is find a direction in latent space that distinguishes between true and false statements. That direction doesn't have to be truth (restricted to stored model knowledge, etc.), and the paper doesn't really test for false positives, so it's hard to know how robust the method is. In particular I want to know that the probe doesn't respond to statements that aren't truth-apt. It seems worth brainstorming properties that true/false statements could mostly share that are not actually truth. A couple of examples come to mind.

  1. Common opinions. "Chocolate is tasty/gross" doesn't quite make sense without a subject, but may be treated like a fact.
  2. Likelihood of completion. In normal dialogue correct answers are more likely to come up than wrong ones, so the task is partly solvable on those grounds alone. The CCS direction should be insensitive to alternate completions of "Take me out to the ball ___" and similar things that are easy to complete by recognition.

The last one seems more pressing, since it's a property that isn't "truthy" at all, but I'm struggling to come up with more examples like it.
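To make the worry concrete, here is a minimal sketch of the CCS-style objective on synthetic data. Everything below is an illustrative assumption (the planted `truth_dir`, the dimensions, and the crude finite-difference optimizer are not the paper's actual setup), but it shows the shape of the training signal: nothing in the loss references truth, only consistency and confidence across contrast pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200

# Synthetic hidden states: a planted "truth-like" direction plus shared noise.
# x_pos plays the hidden state of the "X is true" phrasing, x_neg of "X is false".
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)
noise = rng.normal(size=(n, d))
x_pos = noise + truth_dir
x_neg = noise - truth_dir

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_loss(w, b, xp, xn):
    pp, pn = sigmoid(xp @ w + b), sigmoid(xn @ w + b)
    consistency = np.mean((pp + pn - 1.0) ** 2)    # p(x+) should equal 1 - p(x-)
    confidence = np.mean(np.minimum(pp, pn) ** 2)  # push probabilities away from 0.5
    return consistency + confidence

# Crude central-difference gradient descent; fine at this scale.
w, b, lr, eps = rng.normal(size=d) * 0.01, 0.0, 0.2, 1e-4
for _ in range(500):
    grad_w = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        grad_w[i] = (ccs_loss(w + e, b, x_pos, x_neg)
                     - ccs_loss(w - e, b, x_pos, x_neg)) / (2 * eps)
    grad_b = (ccs_loss(w, b + eps, x_pos, x_neg)
              - ccs_loss(w, b - eps, x_pos, x_neg)) / (2 * eps)
    w -= lr * grad_w
    b -= lr * grad_b

# The probe recovers whatever direction separates the pairs -- here the
# planted one, but the loss itself never checked that it tracks truth.
alignment = abs(w @ truth_dir) / np.linalg.norm(w)
print("alignment with planted direction:", alignment)
```

A false-positive check along the lines suggested above would then build contrast pairs from non-truth-apt sentences ("Chocolate is tasty" / "Chocolate is gross") and see whether the trained probe still reports confident, consistent answers.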


Would it be possible to set up a wiki or something similar for this? Pedagogy seems easier to crowdsource than hardcore research, and the sequences-style posting here doesn't really lend itself to incremental improvement. I think this project will be unnecessarily restricted by requiring one person to understand the whole theory before putting out any contributions. The few existing expository posts on infrabayes could already be polished and joined into a much better introduction if a collaborative editing format allowed for it. Feedback in the comments could be integrated directly into the documents instead of requiring readers to work through a long conversation, and so on.