I have a PhD in Computational Neuroscience from UCSD (Bachelor's was in Biomedical Engineering with Math and Computer Science minors). Ever since junior high, I've been trying to figure out how to engineer artificial minds, and I've been coding up artificial neural networks ever since I first learned to program. Obviously, all my early designs were almost completely wrong/unworkable/poorly defined, but I think my experiences did prime my brain with inductive biases that are well suited for working on AGI.
Although I now work as a data scientist in R&D at a large medical device company, I continue to spend my free time studying the latest developments in AI/ML/DL/RL and neuroscience and trying to come up with models for how to bring it all together into systems that could actually be implemented. Unfortunately, I don't seem to have much time to develop my ideas into publishable models, but I would love to have the opportunity to share ideas with those who do.
Of course, I'm also very interested in AI Alignment (hence the account here). My ideas on that front mostly fall into the "learn (invertible) generative models of human needs/goals and hook those up to the AI's own reward signal" camp. I think methods of achieving alignment that depend on restricting the AI's intelligence or behavior are about as destined to fail in the long term as Prohibition or the War on Drugs in the USA. We need a better theory of what reward signals are for in general (probably something to do with maximizing (minimizing) the attainable (dis)utility with respect to the survival needs of a system) before we can hope to model human values usefully. This could even extend to modeling the "values" of the ecological/socioeconomic/political supersystems in which humans are embedded or of the biological subsystems that are embedded within humans, both of which would be crucial for creating a better future.
If both images have the main object near the middle of the image or taking up most of the space (which is usually the case for single-class photos taken by humans), then yes. Otherwise, summing two images with small, off-center items will just look like a low-contrast, noisy image of two items.
Either way, though, I would expect this to result in class-label ambiguity. However, in some cases of semi-transparent-object-overlay, the overlay may end up mixing features in such a jumbled way that neither of the "true" classes is discernible. This would be a case where the almost-linearity of the network breaks down.
Maybe this linearity story would work better for generative models, where adding latent vector representations of two different objects would lead the network to generate an image with both objects included (an image that would have an ambiguous class label to a second network). It would need to be tested whether this sort of thing happens by default (e.g., with Stable Diffusion) or whether I'm just making stuff up here.
For an image-classification network f, if we remove the softmax nonlinearity from the very end, then x would represent the input image in pixel space, and f(x) would represent the class logits. Then f(x_1 + x_2) ≈ f(x_1) + f(x_2) would represent an image with two objects leading to an ambiguous classification (high log-probability for both classes), and f(c·x) ≈ c·f(x) would represent higher class certainty (effectively softmax temperature = 1/c) when the image has higher contrast. I guess that kind of makes sense, but yeah, I think for real neural networks, this will only be linear-ish at best.
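This linear-ish behavior could be probed directly on a toy model. A minimal sketch, assuming a hypothetical two-layer ReLU "classifier" with the softmax removed (a random stand-in, not a real trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in "classifier": a two-layer ReLU net with the softmax
# removed, so logits(x) returns raw class scores (not a real trained model).
W1 = rng.normal(scale=0.1, size=(32, 64))
b1 = rng.normal(scale=0.1, size=32)
W2 = rng.normal(scale=0.1, size=(10, 32))

def logits(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer, linear readout

x1 = rng.normal(size=64)  # "image" containing object 1
x2 = rng.normal(size=64)  # "image" containing object 2

# If the network were exactly linear, these two vectors would coincide.
additive = logits(x1) + logits(x2)
combined = logits(x1 + x2)
gap = np.linalg.norm(additive - combined) / np.linalg.norm(additive)
print(f"relative additivity gap: {gap:.2f}")  # nonzero: only linear-ish
```

The ReLU (and its bias) is exactly what breaks the additivity here, which matches the intuition that overlaying images only approximately sums their logits.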
I would say we want an ASI to view world-state-optimization from the perspective of a game developer. Not only should it create predictive models of what goals humans wish to achieve (from both stated and revealed preferences), but it should also learn to predict what difficulty level each human wants to experience in pursuit of those goals.
Then the ASI could aim to adjust the world into states where humans can achieve any goal they can think of when they apply a level of effort that would leave them satisfied in the accomplishment.
Humans don't want everything handed to us for free, but we also don't generally enjoy struggling for basic survival (unless we do). There's a reason we pursue things like competitive sports and video games, even as we denounce the sort of warfare and power struggles that built those competitive instincts in the ancestral environment.
A safe world of abundance that still feels like we've fought for our achievements seems to fit what most people would consider "fun". It's what children expect in their family environment growing up, it's what we expect from the games we create, and it's what we should expect from a future where ASI alignment has been solved.
Last I checked, you can get about 10x as much energy from burning a square meter of biosphere as you can get by collecting a square meter of sunlight for a day.
Even if this is true, it's only because that square meter of biosphere has been accumulating solar energy over an extended period of time. Burning biofuel may help accelerate things in the short term, but it will always fall short of long-term sustainability. Of course, if humanity never makes it to the long-term, this is a moot point.
Disassembling us for parts seems likely to be easier than building all your infrastructure in a manner that's robust to whatever superintelligence humanity coughs up second.
It seems to me that it would be even easier for the ASI to just destroy all human technological infrastructure rather than to kill/disassemble all humans. We're not much different biologically from what we were 200,000 years ago, and I don't think 8 billion cavemen could put together a rival superintelligence anytime soon. Of course, most of those 8 billion humans depend on a global supply chain for survival, so this outcome may be just as bad for the majority.
You heard the LLM, alignment is solved!
But seriously, it definitely has a lot of unwarranted confidence in its accomplishments.
I guess the connection to the real world is what will throw off such systems until they are trained on more real-world-like data.
I wouldn't phrase it as needing to be trained on more data. More like it needs to be retrained within an actual R&D loop: have it actually write and execute its own code, test its hypotheses, evaluate the results, and iterate. Use RLHF to evaluate its assessments and a debugger to evaluate its code. It doesn't matter whether this involves interacting with the "real world," only that it learns to make its beliefs pay rent.
Anyway, that would help with its capabilities in this area, but it might be just a teensy bit dangerous to teach an LLM to do R&D like this without putting it in an air-gapped virtual sandbox, unless you can figure out how to solve alignment first.
"Activation space gradient descent" sounds a lot like what the predictive coding framework is all about. Basically, you compare the top-down predictions of a generative model against the bottom-up perceptions of an encoder (or against the low-level inputs themselves) to create a prediction error. This error signal is sent back up to modify the activations of the generative model, minimizing future prediction errors.
From what I know of Transformer models, it's hard to tell exactly where this prediction error would be generated. Perhaps during few-shot learning, the model does an internal next-token prediction at every point along its input, comparing what it predicts the next token should be (based on the task it currently thinks it's doing) against what the next token actually is. The resulting prediction error is fed "back" to the predictive model by being passed forward (via self-attention) to the next example in the input text, biasing the way it predicts next tokens in a way that would have given a lower error on the first example.
None of these predictions and errors would be visible unless you fed the input one token at a time and forced the hidden states to match what they were for the full input. A recurrent version of GPT might make that easier.
It would be interesting to see whether you could create a language model that had predictive coding built explicitly into its architecture, where internal predictions, error signals, etc. are all tracked at known locations within the model. I expect that interpretability would become a simpler task.
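As a toy illustration of the activation-space gradient descent at the heart of predictive coding (a hypothetical linear generative model, nothing like a real language model): the latent activations are updated to minimize the prediction error between the top-down prediction and the bottom-up input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a linear generative model predicts a 16-dim input from a
# 4-dim latent. Hypothetical illustration, not a proposed architecture.
W = rng.normal(size=(16, 4))   # top-down generative weights
x = rng.normal(size=16)        # bottom-up input ("perception")
z = np.zeros(4)                # latent activations to be inferred

lr = 0.01
for _ in range(500):
    error = x - W @ z          # prediction error: input minus top-down prediction
    z += lr * (W.T @ error)    # gradient descent on 0.5*||error||^2 in activation space

print(np.linalg.norm(x - W @ z))  # residual error after inference
```

In a model with predictive coding built in explicitly, `error` and `z` would live at known locations like this, which is what would make interpretability simpler.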
AI has gotten even faster and associated with that there are people that worry about AI, you know, fairness, bias, social economic displacement. There are also the further out speculative worries about AGI, evil sentient killer robots, but I think that there are real worries about harms, possible real harms today and possibly other harms in the future that people worry about.
It seems that the sort of AI risks most people worry about fall into one of a few categories:
- unfairness and bias in deployed AI systems,
- socioeconomic displacement as AI automates work, and
- speculative sci-fi scenarios with evil sentient killer robots.
It seems that a fourth option is not really prominent in the public consciousness: namely, that powerful AI systems could end up destroying everything of value by accident when enough optimization pressure is applied toward any goal, no matter how noble. No robots or weapons are even required to achieve this. This oversight is a real PR problem for the alignment community, but it's unfortunately difficult to explain to the average person why this counts as a real threat.
And I think, you know, thinking that somehow we're smart enough to build those systems to be super intelligent and not smart enough to design good objectives so that they behave properly, I think is a very, very strong assumption that is, it's just not, it's very, it's very low probability.
So close.
Yep, ever since Gato, it's been looking increasingly like you can get some sort of AGI by essentially just slapping some sensors, actuators, and a reward function onto an LLM core. I don't like that idea.
LLMs already have a lot of potential for causing bad outcomes if abused by humans for generating massive amounts of misinformation. However, that pales in comparison to the destructive potential of giving GPT agency and setting it loose, even without idiots trying to make it evil explicitly.
I would much rather live in a world where the first AGIs weren't built around such opaque models. LLMs may look like they think in English, but there is still a lot of black-box computation going on, with a strange tendency to switch personas partway through a conversation. That doesn't bode well for steerability if such models are given control of an agent.
However, if we are heading for a world of LLM-AGI, maybe our priorities should be on figuring out how to route their models of human values to their own motivational schemas. GPT-4 probably already understands human values to a much deeper extent than we could specify with an explicit utility function. The trick would be getting it to care.
Maybe force the LLM-AGI to evaluate every potential plan it generates on how it would impact human welfare/society, including second-order effects, and to modify its plans to avoid any pitfalls it finds from a (simulated) human perspective. Do this iteratively until it finds no more conflict before it actually implements a plan. Maybe require actual verbal human feedback in the loop before it can act.
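The evaluate-and-revise loop above could be sketched roughly as follows. Every name here (`refine_plan`, `llm`, `get_human_approval`) is a hypothetical placeholder, not a real API:

```python
def refine_plan(goal, llm, get_human_approval, max_rounds=10):
    """Sketch of the iterative critique loop. `llm` is a hypothetical
    text-in/text-out callable; none of these names are a real API."""
    plan = llm(f"Propose a plan for: {goal}")
    for _ in range(max_rounds):
        critique = llm(
            "List any harms to human welfare or society, including "
            f"second-order effects, of this plan:\n{plan}"
        )
        if "no conflicts found" in critique.lower():
            break  # no remaining pitfalls from the (simulated) human perspective
        plan = llm(f"Revise this plan to avoid these pitfalls:\n{critique}\n\n{plan}")
    # Require actual human feedback in the loop before acting.
    return plan if get_human_approval(plan) else None
```

The structural points are the iteration-until-no-conflict and the hard requirement for human sign-off before anything is executed.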
It's not a perfect solution, but there's probably not enough time to design a custom aligned AGI from scratch before some malign actor sets a ChaosGPT-AGI loose. A multipolar landscape is probably the best we can hope for in such a scenario.
If I'm interpreting this correctly, then it sounds like the network is learning exponentially larger weights in order to compensate for an exponentially growing residual stream. However, I'm still not quite clear on why LayerNorm doesn't take care of this.
To avoid this phenomenon, one idea that springs to mind is to adjust how the residual stream operates. For a neural network module f, the residual stream works by creating a combined output: r(x) = f(x) + x
You seem to suggest that the model essentially amplifies the features within the neural network in order to overcome the large residual stream: r(x) = f(1.045·x) + x
However, what if, instead of adding the inputs directly, they were first rescaled by a compensatory weight: r(x) = f(x) + (1/1.045)·x ≈ f(x) + 0.957·x?
It seems to me that this would disincentivize f from learning the exponentially growing feature scales. Based on your experience, would you expect this to eliminate the exponential growth in the norm across layers? Why or why not?
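As a quick sanity check of the mechanics (with hypothetical random linear blocks standing in for trained Transformer layers, so this only illustrates the arithmetic of the downweighted identity path, not learned behavior):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim, alpha = 20, 64, 1 / 1.045  # alpha ≈ 0.957 downweights the identity path

# Hypothetical residual "blocks": small random linear maps standing in for f.
blocks = [rng.normal(scale=0.1 / np.sqrt(dim), size=(dim, dim)) for _ in range(depth)]

def run(x, downweight_identity):
    norms = []
    for W in blocks:
        fx = W @ x
        x = fx + (alpha * x if downweight_identity else x)  # r(x) = f(x) + a*x
        norms.append(np.linalg.norm(x))
    return norms

x0 = rng.normal(size=dim)
plain = run(x0, False)    # r(x) = f(x) + x
scaled = run(x0, True)    # r(x) = f(x) + (1/1.045)*x
print(plain[-1], scaled[-1])  # the downweighted stream ends up smaller
```

With fixed random blocks this just shows that the rescaled stream compounds to a smaller norm; whether a trained f would simply learn to re-amplify its outputs and recover the exponential growth is exactly the question posed above.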