faul_sname

Comments

Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".

[Epistemic status: 75% endorsed]

Those who, upon seeing a situation, look for which policies would directly incentivize the outcomes they like should spend more mental effort solving for the equilibrium.

Those who, upon seeing a situation, naturally solve for the equilibrium should spend more mental effort checking whether there is indeed only one "the" equilibrium, and, if there are multiple possible equilibria, solving for which factors determine which of the several possible equilibria the system ends up settling on.

I played with this in a Colab notebook way back when. I can't visualize things directly in 4 dimensions, but at the time I came up with the trick of visualizing the pairwise cosine similarity for each pair of features, which gives at least a local sense of what the angles are like.
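
A minimal sketch of that trick (not the original notebook; the 9×4 feature matrix here is a random stand-in for the learned feature directions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-in for the feature directions: 9 features in 4 dimensions.
# (In the original notebook these would be learned directions, not random vectors.)
rng = np.random.default_rng(0)
features = rng.standard_normal((9, 4))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-normalize each feature

# Pairwise cosine similarity: for unit vectors this is just the dot product.
pairwise_cos_sim = features @ features.T

plt.imshow(pairwise_cos_sim, cmap="RdBu", vmin=-1.0, vmax=1.0)
plt.colorbar(label="cosine similarity")
plt.title("Pairwise cosine similarity between feature directions")
plt.show()
```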

Trying to squish 9 features into 4 dimensions looks to me like it either ends up with

  • 4 antipodal pairs which are almost orthogonal to one another, and then one "orphan" direction squished into the largest remaining space,
    OR
  • 3 almost-orthogonal antipodal pairs plus a "Y" shape with the narrow angle being 72° and the wide angles being 144°.

For reference, this is what a square antiprism looks like in this type of diagram:

[Image: pairwise cosine similarity diagram for a square antiprism]

Yeah, stands for Max Cosine Similarity. Cosine similarity is a pretty standard measure for how close two vectors are to pointing in the same direction. It's the cosine of the angle between the two vectors, so +1.0 means the vectors are pointing in exactly the same direction, 0.0 means the vectors are orthogonal, -1.0 means the vectors are pointing in exactly opposite directions.
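
Spelled out, for two vectors a and b:

$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$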

To generate this graph, I think he took each of the learned features in the smaller dictionary, calculated the cosine similarity of that small-dictionary feature with every feature in the larger dictionary, and then took the maximum of those as the MCS for that small-dictionary feature. I have a vague memory of him also doing some fancy linear_sum_assignment() thing (to ensure that each feature in the large dictionary could only be used once, so that multiple features in the small dictionary couldn't get their MCS from the same large-dictionary feature), though IIRC it didn't actually matter.
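
A sketch of that computation as I understand it (reconstructed from memory rather than from his code; the dictionary matrices are random stand-ins, with sizes matching my recollection below, and the histogram at the end is just how I'd reproduce the plots):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import linear_sum_assignment

# Stand-ins for the learned dictionaries; rows are feature directions.
# (2048 and 4096 features over a residual width of 512, per my recollection below.)
rng = np.random.default_rng(0)
small_dict = rng.standard_normal((2048, 512))
large_dict = rng.standard_normal((4096, 512))

# Normalize rows so that dot products are cosine similarities.
small_dict /= np.linalg.norm(small_dict, axis=1, keepdims=True)
large_dict /= np.linalg.norm(large_dict, axis=1, keepdims=True)

# Cosine similarity of every small-dict feature with every large-dict feature.
cos_sim = small_dict @ large_dict.T  # shape (2048, 4096)

# MCS: for each small-dict feature, its best match in the large dictionary.
mcs = cos_sim.max(axis=1)

# The fancier variant: force each large-dict feature to be matched at most once
# via an optimal one-to-one assignment (which reportedly made little difference).
row_idx, col_idx = linear_sum_assignment(cos_sim, maximize=True)
mcs_assigned = cos_sim[row_idx, col_idx]

# The graphs were (as I read them) histograms of these MCS values, one per layer.
plt.hist(mcs, bins=np.linspace(0.0, 1.0, 26))
plt.xlabel("max cosine similarity with any large-dictionary feature")
plt.ylabel("number of small-dictionary features")
plt.show()
```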

Also, I think the small and large dictionaries were trained using different methods from each other for layer 2, and this was on pythia-70m-deduped, so layer 5 was the final layer immediately before unembedding (so naively I'd expect most of the "features" to just be "the output token will be 'the'" or "the output token will be 'when'", etc.).

Edit: In terms of "how to interpret these graphs", they're histograms with the horizontal axis being bins of cosine similarity and the vertical axis being how many small-dictionary features had a maximum cosine similarity with a large-dictionary feature within that bin. So you can see that at layer 3 it looks like somewhere around half of the small-dictionary features had a cosine similarity of 0.96-1.0 with one of the large-dictionary features, and almost all of them had a cosine similarity of at least 0.8 with their best-matching large-dictionary feature.

Which I read as "large dictionaries find basically the same features as small ones, plus some new ones".

Bear in mind also that these were some fairly small dictionaries. I think these charts were generated with this notebook, in which case smaller_dict was of size 2048 and larger_dict was of size 4096 (with a residual width of 512, so 4x and 8x respectively). Anthropic went all the way to 256x the residual width with their "Towards Monosemanticity" paper later that year, and the behavior might have changed at that scale.

Found this graph in the old sparse_coding channel on the EleutherAI Discord:

Logan Riggs: For MCS across dicts of different sizes (as a baseline that's better, but not as good as dicts of same size/diff init). Notably layer 5 is sucks. Also, layer 2 was trained differently than the others, but I don't have the hyperparams or amount of training data on hand. 

[Image: histograms of MCS across dictionaries of different sizes, per layer]

So at least tentatively that looks like "most features in a small SAE correspond one-to-one with features in a larger SAE trained on the activations of the same model on the same data".

As a spaghetti behavior executor, I'm worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible.

As a fellow spaghetti behavior executor, replacing my entire motivational structure with a static goal slot feels like dying and handing off all of my resources to an entity that I don't have any particular reason to think will act in a way I would approve of in the long term.

Historically, I have found varying things rewarding at various stages of my life, and this has chiseled the paths in my cognition that make me me. I expect that in the future, my experiences and decisions, and how rewarded / regretful I feel about those decisions, will continue to chisel my cognition in ways that change what I care about, just as past-me endorsed the experiences that caused current-me to care about things (e.g. specific partners, offspring) that past-me did not care about.

I would not endorse freezing my values in place to prevent value drift in full generality. At most I endorse setting up contingencies so my values don't end up trapped in some specific places current-me does not endorse (e.g. "heroin addict").

so I'd like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.

So in this ontology, an agent is made up of a queryable world model and a goal slot. Improving the world model allows the agent to better predict the outcomes of its actions, and the goal slot determines which available action the agent would pick given its world model.
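
As a toy sketch of that ontology (my own illustration of the frame, not anyone's actual proposed architecture):

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def pick_action(
    state: State,
    actions: Iterable[Action],
    world_model: Callable[[State, Action], State],  # predicts the outcome of taking an action
    goal_slot: Callable[[State], float],            # scores how much the agent likes an outcome
) -> Action:
    """Choose the action whose predicted outcome the goal slot rates highest."""
    return max(actions, key=lambda action: goal_slot(world_model(state, action)))
```

In this frame, improving the world model means swapping in a better `world_model`, while whatever sits in `goal_slot` is what picks among the predicted outcomes.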

I see the case for improving the world model. But once I have that better world model, I don't see why I would additionally want to add an immutable goal slot that overrides my previous motivational structure. My understanding is that adding a privileged immutable goal slot would only change my behavior in those cases where I would otherwise have decided that achieving the goal placed in that slot was not, on balance, a good idea.

As a note, you could probably say something clever like "the thing you put in the goal slot should just be 'behave in the way you would if you had access to unlimited time to think and the best available world model'", but if we're going there then I contend that the rock I picked up has a goal slot filled with "behave exactly like this particular rock".

Separately and independently, I believe that by the time an AI has fully completed the transition to hard superintelligence, it will have ironed out a bunch of the wrinkles and will be oriented around a particular goal (at least behaviorally, cf. efficiency—though I would also guess that the mental architecture ultimately ends up cleanly-factored (albeit not in a way that creates a single point of failure, goalwise)).

Do you have a good reference article for why we should expect spaghetti behavior executors to become wrapper minds as they scale up?

I don't know, but I expect the fraction is high enough to constitute significant empirical evidence towards the "Will quantum randomness affect the 2028 election?" question (since quantum randomness affects the weather, the wind speed affects bullet trajectories, and whether or not one of the candidates in the 2024 election was assassinated seems pretty influential on the 2028 election).

I think I found a place where my intuitions about "clusters in thingspace" / "carving thingspace at the joints" / "adversarial robustness" may have been misleading me.

Historically, when I thought of "clusters in thing-space", my mental image was of a bunch of widely-spaced points in some high-dimensional space, with wide gulfs between the clusters. In my mental model, if we were to get a large enough sample size that the clusters approached one another, the thresholds which carve those clusters apart would be nice clean lines, like this:

[Image: clusters separated by clean boundaries equidistant from each cluster]

In this model, an ML model trained on these clusters might fit a set of boundaries which are not equally far from each cluster (after all, there is no bonus reduction in loss for a more robust perfect classification). So in my mind the ground truth would be something like the above image, whereas what the non-robust model learned would be something more like the below:

[Image: the same clusters with skewed, non-robust boundaries]

But even if we observe clusters in thing-space, why should we expect the boundaries between them to be "nice"? It's entirely plausible to me that the actual ground truth is something more like this:

[Image: clusters separated by a fractal boundary]

That is the actual ground truth for the categorization problem of "which of the three complex roots will iteration of the Euler Method converge on for  given each starting point". And in terms of real-world problems, we see the recent and excellent paper The boundary of neural network trainability is fractal.
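
As an illustrative sketch of that kind of categorization problem (the specific function in the original comment is elided, so this uses Newton iteration on z³ - 1, the classic three-root example, as a stand-in):

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of complex starting points.
re, im = np.meshgrid(np.linspace(-2, 2, 800), np.linspace(-2, 2, 800))
z = re + 1j * im

# Newton iteration on f(z) = z**3 - 1 (a stand-in for the elided function).
for _ in range(40):
    z = z - (z**3 - 1) / (3 * z**2)

# Classify each starting point by which of the three cube roots of unity it ended up nearest.
roots = np.exp(2j * np.pi * np.arange(3) / 3)
basin = np.argmin(np.abs(z[..., None] - roots), axis=-1)

plt.imshow(basin, extent=(-2, 2, -2, 2))
plt.title("Which root each starting point converges to")
plt.show()
```

The boundaries between the three basins are fractal everywhere, even though the "clusters" (the three roots) are perfectly distinct.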

In the section "For those who think that open-source AGI code and weights are the solution":

If we had the DNA sequence of an extremely dangerous virus, would it be best to share it publicly or not? If the answer is obvious to you in this case, think twice about the case for AGI algorithms and parameters.

The National Institutes of Health's answer is "yes". Here's variola major (smallpox), for example. So those arguing that it's a bad idea to share ML algorithms and artifacts should either make the case that the NIH is wrong to share the smallpox genome, or make the case that sharing some subset of ML algorithms and artifacts is more dangerous than sharing the smallpox genome.

In fairness, some people have in fact made decent cracks at the argument that sharing some types of ML-related information is more dangerous than sharing the smallpox genome. Still, I think the people arguing that the spread of knowledge is the thing we want to target, rather than the spread of materials, could do a better job of making that argument. But the common-sense "you wouldn't share the genome of a dangerous virus" argument doesn't work because we would, in fact, share the genome of a dangerous virus (and I personally think that it's actively good that we share the genomes of dangerous viruses, because it allows for stuff like this).
