Maxime Riché


Is AI Progress Impossible To Predict?

If you compute the logit over a range that is not [0.0, 1.0] but [low perf, high perf], you get a bit more predictive power, but it is still confusingly low.

A possible intuition here is that scaling produces a transition from non-zero performance to non-perfect performance. This seems right, since the random baseline is not 0.0 and reaching perfect accuracy is impossible.

I tried this only with PaLM on NLU and I used the same adjusted range for all tasks:

[0.9 * overall min. acc., 1.0 - 0.9 * (1.0 - overall max acc.)] ~ [0.13, 0.95]
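As a minimal sketch of the adjusted logit described above (the function names and the `margin` parameter are hypothetical, chosen only for illustration), accuracy is first rescaled from [low, high] to [0, 1] and then passed through the usual logit:

```python
import math


def adjusted_range(min_acc, max_acc, margin=0.9):
    """Range [margin * min_acc, 1 - margin * (1 - max_acc)], as in the formula above."""
    low = margin * min_acc
    high = 1.0 - margin * (1.0 - max_acc)
    return low, high


def adjusted_logit(acc, low, high):
    """Logit of an accuracy rescaled from [low, high] to [0, 1]."""
    if not low < acc < high:
        raise ValueError("accuracy must lie strictly inside (low, high)")
    p = (acc - low) / (high - low)  # rescaled accuracy in (0, 1)
    return math.log(p / (1.0 - p))


# Illustrative values only: the midpoint of [0.13, 0.95] maps to a logit of about 0.
print(adjusted_logit(0.54, 0.13, 0.95))
```

With low = 0.0 and high = 1.0 this reduces to the standard logit.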

Even if this model were true, there may be additional explanations: for example, the improvement on one task may not be modeled by a single logit function but by several of them. A task would be composed of sub-tasks, each of which can be modeled by one logit function. If this makes sense, one could try to model the improvements on all of the tasks using only a small number of logit curves, one per sub-task (decomposing each task into a set of sub-tasks with simple trends).

(Also, Gopher looks less predictable and its data is sparser: there is no data point in the X0 B parameter range.)
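To make the sub-task idea concrete, here is a rough sketch in which a task's accuracy is a weighted average of a few logistic curves in log-parameter count. The weighted-average combination, the `(weight, midpoint, slope)` parametrization, and all names are assumptions made for illustration; other ways of combining sub-tasks are equally plausible.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def task_accuracy(log_params, subtasks, low=0.0, high=1.0):
    """Accuracy of a task modeled as a weighted mix of sub-task logistic curves.

    subtasks: list of (weight, midpoint, slope) tuples, one per hypothetical
    sub-task; midpoint is the log-parameter count where that sub-task is
    half solved. The mix is rescaled to the range [low, high].
    """
    total_weight = sum(w for w, _, _ in subtasks)
    mix = sum(w * sigmoid(s * (log_params - m)) for w, m, s in subtasks)
    return low + (high - low) * mix / total_weight


# Two sub-tasks that "switch on" at different scales give a two-step curve.
subtasks = [(1.0, 9.0, 5.0), (1.0, 11.0, 5.0)]
for log_n in (8.0, 10.0, 12.0):
    print(log_n, round(task_accuracy(log_n, subtasks, low=0.13, high=0.95), 3))
```

A curve like this looks poorly fit by a single logit even though each component follows a simple trend.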

"A Generalist Agent": New DeepMind Publication

Indeed, but to slightly counterbalance this, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and something like ~50B for GPT-2).

Deepmind's Gato: Generalist Agent

It's only 1.2 billion parameters.


Indeed, but to slightly counterbalance this, it looks like it was trained on ~500B tokens (while ~300B were used for GPT-3 and something like ~50B for GPT-2).

Google's new 540 billion parameter language model

"The training algorithm has found a better representation"?? That seems strange to me, since the loss should be lower in that case, not spiking. Or maybe you mean that the training broke free of a kind of local minimum (without claiming that it has found a better one yet). Also, I guess the people training the models observed that waiting after these spikes doesn't lead to better performance, or they would not have removed them from the training.

Around this idea, and after looking at the "grokking" paper, I would guess that it's more likely caused by the weight decay (or similar) causing the training to break out of a kind of local minimum. An interesting point may be that larger/better LMs may have significantly sharper internal models and thus be more prone to this phenomenon (the weight decay (or similar) more easily breaking the more sensitive/better/sharper models).

It should be very easy to check whether these spikes are caused by the weight decay "damaging" very sharp internal models: for example, replay the spiky part several times with less and less weight decay... (I am also curious about similar tests varying the momentum, dropout..., and about looking at whether the spikes are initially triggered by some subset of the network, how many training steps the spikes last...)

Google's new 540 billion parameter language model

I am curious to hear/read more about the issue of spikes and instabilities in training large language models (see the quote / page 11 of the paper). If someone knows a good reference about that, I am interested!

5.1 Training Instability

For the largest model, we observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled. These spikes occurred at highly irregular intervals, sometimes happening late into training, and were not observed when training the smaller models. Due to the cost of training the largest model, we were not able to determine a principled strategy to mitigate these spikes.

Instead, we found that a simple strategy to effectively mitigate the issue: We re-started training from a checkpoint roughly 100 steps before the spike started, and skipped roughly 200–500 data batches, which cover the batches that were seen before and during the spike. With this mitigation, the loss did not spike again at the same point. We do not believe that the spikes were caused by “bad data” per se, because we ran several ablation experiments where we took the batches of data that were surrounding the spike, and then trained on those same data batches starting from a different, earlier checkpoint. In these cases, we did not see a spike. This implies that spikes only occur due to the combination of specific data batches with a particular model parameter state. In the future, we plan to study more principled mitigation strategy for loss spikes in very large language models.
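The mitigation described in the quote can be sketched roughly as follows. This assumes one data batch per training step; the helper name `plan_restart` and its default values are hypothetical, picked from within the ranges the quote mentions, and are not the paper's actual tooling.

```python
def plan_restart(spike_step, rewind_steps=100, skip_batches=300):
    """Return (restart_step, skipped_batches) for a loss spike at spike_step.

    Assumes batch i is consumed at training step i. Training resumes from the
    checkpoint at restart_step, and the batches in skipped_batches (those seen
    just before and during the spike) are never fed to the model.
    """
    restart_step = max(0, spike_step - rewind_steps)
    skipped_batches = range(restart_step, restart_step + skip_batches)
    return restart_step, skipped_batches


restart_step, skipped = plan_restart(spike_step=1000)
print(restart_step, skipped.start, skipped.stop)
```

After the restart, the model state at `restart_step` is paired with batches it has never seen, which is consistent with the quoted observation that a spike depends on the combination of data and parameter state.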

Understanding Gradient Hacking

Maybe another weak solution, close to "Take bigger steps": use decentralized training.

Meaning: perform several training steps (gradient updates) in parallel on several replicas of the model, and periodically synchronize the weights (for example, by averaging them).

Each replica only has access to its own inputs and local weights, so it seems plausible that the gradient hacker can't as easily cancel gradients going against its mesa-objective.
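A toy sketch of this scheme (periodic weight averaging, sometimes called local SGD) on a one-parameter model. The quadratic per-replica loss, the function name, and the hyperparameters are all illustrative assumptions; it only shows the mechanics and says nothing about whether gradient hacking is actually prevented.

```python
def local_sgd(replica_targets, w0=0.0, lr=0.1, local_steps=5, rounds=20):
    """Train one scalar weight on several replicas with periodic averaging.

    Each replica i minimizes its own loss (w - target_i)**2 for `local_steps`
    gradient updates using only its local copy of the weight; the copies are
    then synchronized by averaging, and the process repeats for `rounds`.
    """
    weights = [w0 for _ in replica_targets]
    for _ in range(rounds):
        # Local phase: replicas update independently, with no communication.
        for i, target in enumerate(replica_targets):
            w = weights[i]
            for _ in range(local_steps):
                grad = 2.0 * (w - target)
                w -= lr * grad
            weights[i] = w
        # Synchronization phase: every replica receives the averaged weight.
        average = sum(weights) / len(weights)
        weights = [average for _ in weights]
    return weights[0]


# Two replicas pulling toward 1.0 and 3.0 settle near their mean.
print(local_sgd([1.0, 3.0]))
```

Increasing `local_steps` makes replicas diverge further between synchronizations, which is the knob that resembles "taking bigger steps".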

Safe exploration and corrigibility

One particularly interesting recent work in this domain was Leike et al.'s “Learning human objectives by evaluating hypothetical behaviours,” which used human feedback on hypothetical trajectories to learn how to avoid environmental traps. In the context of the capability exploration/objective exploration dichotomy, I think a lot of this work can be viewed as putting a damper on instrumental capability exploration.

Isn't this work also linked to objective exploration? One of the four "hypothetical behaviours" used is the selection of trajectories that maximize reward uncertainty; these trajectories are then evaluated by humans.

Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?
Is your suggestion to run this system as a source of value, simulating lives for their own sake rather than to improve the quality of life of sentient beings in our universe? Our history (and present) aren't exactly utopian, and I don't see any real reason to believe that slight variations on it would lead to anything happier.

I am wondering whether we should reasonably expect to produce better results by trying to align an AGI with our values than by simulating a lot of alternate universes. I am not saying that this is net-negative or net-positive. It seems to me that the expected value of both cases may be identical.

Also, by history I meant the future too, not only the past and present. (I edited the question to replace "histories" with "trajectories".)

Can we expect more value from AI alignment than from an ASI with the goal of running alternate trajectories of our universe?

(About the first part of your comment) Thank you for pointing out three confusing points:

First, I don't know if you intended this, but "stimulating the universe" carries a connotation of a low-level physics simulation. This is computationally impossible. Let's have it model the universe instead, using the same kind of high-level pattern recognition that people use to predict the future.

To be more precise, what I had in mind is that the ASI is an agent whose goal is:

  • to model the sentient part of the universe finely enough to produce sentience in an instance of its model (and it will also need to model the necessary non-sentient "dependencies")
  • and to instantiate this model N times. For example, playing them from 1000 A.D. until the time when no sentience remains in a given instance of the modeled universe. (all of this efficiently)

(To reduce complexity, I didn't mention it, but we could think of heuristics to avoid playing too much of the "past" and "future" history filled with suffering.)

Second, if the AGI is simulating itself, the predictions are wildly undetermined; it can predict that it will do X, and then fulfill its own prophecy by actually doing X, for any X. Let's have it model a counterfactual world with no AGIs in it.

An instance of the modeled universe would not be our present universe. It would be "another seed", starting before the ASI exists, and thus it would not need to model itself but only possible ("new") ASIs produced inside the instances.

Third, you need some kind of interface. Maybe you type in "I'm interested in future scenarios in which somebody cures Alzheimer's and writes a scientific article describing what they did. What is the text of that article?" and then it runs through a bunch of scenarios and prints out its best-guess article in the first 50 scenarios it can find. (Maybe also print out a retrospective article from 20 years later about the long-term repercussions of the invention.) For a different type of interface, see microscope AI.

In the scenario I had in mind, the ASI would fill our universe with computing machines to produce as many instances as possible. (We would not use it, and thus we would not need an interface with the ASI.)
