From his worldview it would be like a cancer patient getting a stage 4 diagnosis.
This is an 8-13 times decrease in required memory and proportionally compute unless they increased convolutions a lot.
It means 18 months to 6 years of AI compute progress overnight. (18 months because compute dedicated to AI is doubling every 6 months, 6 years is about how long to get 8 times the compute per dollar)
<del>Maybe meta did a sloppy job of benchmarking the model.</del>
Update: From reading the paper, they did not. They appear to have replicated https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications and found that the scaling laws were even MORE favorable to more tokens than the lesswrong post. Also they made slight tweaks to the transformer architecture. What's notable about this is it's a collaboration, tweaks came from multiple other AI labs.
Ultimately, yes. This whole debate is arguing that the critical threshold where it comes to this is farther away, and we humans should empower ourselves with helpful low superintelligences immediately.
It's always better to be more powerful than helpless, which is the current situation. We are helpless to aging, death, pollution, resource shortages, enemy nations with nuclear weapons, disease, asteroid strikes, and so on. Hell just bad software - something the current llms are likely months from empowering us to fix.
And eliezer is saying not to take one more step towards fixing this because it MIGHT be hostile, when the entire universe is against us as it is. It already plans to kill us as it is, either from aging, or the inevitability of nuclear war over a long enough timespan, or the sun engulfing us.
You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training.
It's not really novel. It is really just coupling together 3 ideas:
(1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on. https://github.com/google/BIG-bench
(2) Noting there are papers on network architecture search https://github.com/hibayesian/awesome-automl-papers , activation function search https://arxiv.org/abs/1710.05941 , noting that SOTA architectures use multiple neural networks in a cognitive architecture https://github.com/werner-duvaud/muzero-general , and noting that an AGI design is some cognitive architecture of multiple models, where no living human knows yet which architecture will work. https://openreview.net/pdf?id=BZ5a1r-kVsf
So we have layers here, and the layers look a lot like each other and are frameworkable.
Activations functions which are graphs of primitive math functions from the set of "all primitive functions discovered by humans"
Network layer architectures which are graphs of (activation function, connectivity choice)
Network architectures which are graphs of layers. (you can also subdivide into functional module of multiple layers, like a column, the choice of how you subdivide can be represented as a graph choice also)
Cognitive architectures which are graphs of networks
And we can just represent all this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI. It's why I said the overall "choice" is just a coordinate in a search space which is just a binary string.
You could make an OpenAI gym wrapped "AGI designer" task.
3. Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple. Which means we are very close to being able to RSI right now.
No lab right now has enough resources in one place to attempt the above, because it is training many instances of systems larger than current max size LLMs (you need multiple networks in a cognitive architecture) to find out what works.
They may allocate this soon enough, there may be a more dollar efficient way to accomplish the above that gets tried first, but you'd only need a few billion to try this...
That's exactly what I am talking about. One divergence in our views is you haven't carefully examined current gen AI "code" to understand what it does. (note that some of my perspective is informed because all AI models are similar at the layer I work at, on runtime platforms)
https://github.com/EleutherAI/gpt-neox
If you examine the few thousand lines of python source especially the transformer model, you will realize that functionally that pipeline I describe of "input, neural network, output, evaluation" is all that the above source does. You could in fact build a "general framework" that would allow you to define many AI models, almost of which humans have never tested, without writing 1 line of new code.
So the full process is :
[1] benchmark of many tasks. Tasks must be autogradeable, human participants must be able to 'play' the tasks so we have a control group score, tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and top 1% humans do not max the bench either), there must be many tasks and with a rich permutation space. (so it isn't possible for a model to memorize all permutations)
[2] heuristic weight score on this task intended to measure how "AGI like" a model is. So it might be the RMSE across the benchmark. But also have a lot of score weighting on zero shot, cross domain/multimodal tasks. That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", who had exceptional human performance presumably from all this cross domain knowledge.
[3] In the computer science task set, there are tasks to design an AGI for a bench like this. The model proposes a design, and if that design has already been tested, immediately receives detailed feedback on how it performed.
As I mentioned, the "design an AGI" subtask can be much simpler than "write all the boilerplate in Python", but these models will be able to do that if needed.
As tasks scores approach human level across a broad set of tasks, you have an AGI. You would expect it to almost immediately improve to a low superintelligence. As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.
You're correct. In the narrow domain of designing AI architectures you need the system to be at least 1.01 times as good as a human. You want more gain than that because there is a cost to running the system.
Getting gain seems to be trivially easy at least for the types of AI design tasks this has been tried on. Humans are bad at designing network architectures and activation functions.
I theorize that a machine could study the data flows from snapshots from an AI architecture attempting tasks on the AGI/ASI gym, and use that information as well as all previous results to design better architectures.
The last bit is where I expect enormous gain, because the training data set will exceed the amount of data humans can take in in a lifetime, and you would obviously have many smaller "training exercises" to design small systems to build up a general ability. (Enormous early gain. Eventually architectures are going to approach the limits allowed by the underlying compute and datasets)
Assumptions:
A. It is possible to construct a benchmark to measure if a machine is a general ASI. This would be a very large number of tasks, many simulated though some may be robotic tasks in isolated labs. A general ASI benchmark would have to include tasks humans do not know how to do, but we know how to measure success.
B. We have enough computational resources to train from scratch many ASI level systems so that thousands of attempts are possible. Most attempts would reuse pretrained components in a different architecture.
C. We recursively task the best performing AGIs, as measured by the above benchmark or one meant for weaker systems, to design architectures to perform well on (A)
Currently the best we can do is use RL to design better neural networks, by finding better network architectures and activation functions. Swish was found this way, not sure how much transformer network design came from this type of recursion.
Main idea : the AGI systems exploring possible network architectures are cognitively able to take into account all published research and all past experimental runs, and the ones "in charge" are the ones who demonstrated the most measurable merit at designing prior AGI because they produced the highest performing models on the benchmark.
I think if you think about it you'll realize it compute were limitless, this AGI to ASI transition you mention could happen instantly. A science fiction story would have it happen in hours. In reality, since training a subhuman system is taking 10k GPUs about 10 days to train, and an AGI will take more - Sam Altman has estimated the compute bill will be close to 100 billion - that's the limiting factor. You might be right and we stay "stuck" at AGI for years until the resources to discover ASI become available.
I think worlds with the tools to treat most causes of human death ranks strictly higher than a world without those tools. In the same way that a world with running water ranks above worlds without it. Even today not everyone benefits from running water. If you could go back in time would you campaign against developing pipes and pumps because you believed only the rich would ever have running water? (Which was true for a period of time)
I think (give like 30 per cent probability) that the general nature of the UFO phenomenon is that it is anti-epistemic, that it, it actively prevents our ability to get definite knowledge about about it. How exactly this happens is not clear, and there could be several ideas.
Something jumped out at me here. Regardless of the explanation, there's a testable experiment in the works here. We could confirm or falsify this anti-epistemic property.
Setup: find the 'base rate' of UFO sightings, how often do humans and aircraft sensors see them. Then determine how large of an area you need to cover.
Cover half an area sufficiently large with thousands/millions of constantly recording high resolution cameras. Use AI to check the footage for UFOs.
The other half is your control region. Elicit UFO reports in both regions. (you might put the cameras in both regions but not power the ones in the control region so human reporters don't know which region they are in)
Prediction: if UFOs are anti-epistemic, you will get no UFO reports from the region covered by cameras, and you will get a statistically meaningful number (because you chose a large enough collection area with enough people) from the control region.
If the cameras ever pick up anything it will be blurry and distant, of course.
Obviously you then swap the groups and run the cameras in the control region.
It would be weird if reality works this way, and we can debate theories after empirical confirmation, but it already is weird in many other ways.
That's never happened historically and aging treatments isn't immortality, it's just approximately a life expectancy of 10k years. Do you know who is richer than any CEO you name? Medicare. I bet they would like to stop paying all these medical bills, which would be the case if treated patients had the approximate morbidity rate of young adults.
You also need such treatments to be given at large scales to find and correct the edge cases. A rejuvenation treatment "beta tester" is exactly what it sounds, you will have a higher risk of death but get earlier access. Going to need a lot of beta testers.
The rational, data driven belief is that aging is treatable and that ASI systems with the cognitive capacity to take into account more variables than humans are mentally capable of could be built to systematically attack the problem. Doesn't mean it will help anyone alive today, there are no guarantees. Because automated systems found whatever treatments are possible, automated systems can deliver the same treatments at low cost.
If you don't think this is a reasonable conclusion, perhaps you could go into your reasoning. Arguments like you made above are unconvincing.
While it is true that certain esoteric treatments for aging like young blood transfusions are inherently limited in who can benefit, they don't even work that well and de aged hemopoietic stem cells can be generated in automated laboratories and would be a real treatment everyone can benefit.
The wealthy are not powerful enough to "hoard" treatments, because Medicare et al represent the government, which has a monopoly on violence and incentives to not allow such hoarding.
Right. I see this as a problem also, asking the model if it's sure is injecting information if we only ask on wrong answers. If we ask always it may disturb more right answers than it fixes wrong ones.
Its also accuracy dependent - if the model is 99 percent accurate on a subtask then asking if it's sure may degrade accuracy, while it may improve it on a subtask it's 50 percent accurate on.
Or in other words, we could prompt it and it might do better on AP English but less good on the bar exam.