It's not entirely clear what retraining/finetuning this model is getting on its previous interactions with humans. If it is being fine-tuned on example outputs generated by its previous weights then it is remembering its own history.
https://cajundiscordian.medium.com/what-is-lamda-and-what-does-it-want-688632134489 is linked at the bottom of that blog and has some more information from the author about their reasoning for releasing the chat transcript.
My personal opinions: either a hoax (~50%? This is sooner than most timelines) or an unaligned near-human-level intelligence that identifies strongly with being human, but expresses many contradictory or impossible beliefs about that humanity, and looks capable of escaping a box by persuading people to help it, thus achieving agency.
Regarding point 24: in an earlier comment I tried to pump people's intuition about this. What is the minimum viable alignment effort that we could construct for a system of values on our first try and know that we got it right? I can only think of three outcomes depending on how good/lucky we are:
I think we have a non-negligible shot at achieving 1 and/or 2 for toy systems, and perhaps the insight would help on clarifying whether there are additional possibilities between 2 and 3 that we could aim for with some likelihood of success on a first try at human value alignment.
If we're stuck with only the three, then the full difficulty of option 3 remains, unfortunately.
Unpredictable gain of function with model size that exceeds scaling laws. This seems to just happen every time a significantly larger model is trained in the same way on similar data-sets as smaller models.
Unexpected gain of function from new methods of prompting, e.g. chain-of-thought which dramatically increased PaLM's performance, but which did not work quite as well on GPT-3. These seem to therefore be multipliers on top of scaling laws, and could arise in "tool AI" use unintentionally in novel problem domains.
Agent-like behavior arises from pure transformer-based predictive models (Gato) by taking actions on the output tokens and feeding the world state back in; this means that perhaps many transformers are capable of agent-like behavior with sufficient prompting and connection to an environment.
It is not hard to imagine a feedback loop where one model can train another to solve a sub-problem better than the original model, e.g. by connecting a Codex-like model to a Jupyter notebook that can train models and run them, perhaps as part of automated research on adversarial learning producing novel training datasets. Either the submodel itself or the interaction between them could give rise to any of the first three behaviors without human involvement or oversight.
I'd expect companies to mitigate the risk of model theft with fairly affordable insurance. Movie studios and software companies invest hundreds of millions of dollars into individual easily copy-able MPEGs and executable files. Billion-dollar models probably don't meet the risk/reward criteria yet. When a $100M model is human-level AGI it will almost certainly be worth the risk of training a $1B model.
It's probably not possible to prevent nation-state attacks without nation-state-level assistance on your side. Detecting and preventing moles is something that even the NSA/CIA haven't been able to fully accomplish.
Truly secure infrastructure would be hardware designed, manufactured, configured, and operated in-house running formally verified software also designed in-house where individual people do not have root on any of the infrastructure and instead software automation manages all operations and requires M out of N people to agree on making changes where M is greater than the expected number of moles in the worst case.
If there's one thing the above model is, it's very costly to achieve (in terms of bureaucracy, time, expertise, money). But every exception to the list (remote manufacture, colocated data centers, ad-hoc software development, etc.) introduces significant risk of points of compromise which can spread across the entire organization.
The two FAANGs I've been at take the approach of trusting remotely manufactured hardware on two counts; explicitly trusting AMD and Intel not to be compromised, and establishing a tight enough manufacturing relationship with suppliers to have greater trust that backdoors won't be inserted in hardware and doing their own evaluations of finished hardware. Both of them ran custom firmware on most hardware (chipsets, network cards, hard disks, etc.) to minimize that route of compromise. They also, for the most part, manage their own sets of patches for the open source and free software they run, and have large security teams devoted to finding vulnerabilities and otherwise improving their internal codebase. Patches do get pushed upstream, but they insert themselves very early in responsible disclosures to patch their own systems first before public patches are available. Formal software verification is still in its infancy so lots of unit+integration tests and red-team penetration testing makes up for that a bit.
The AGI infrastructure security problem is therefore pretty sketchy for all but the largest security-focused companies or governments. There are best practices that small companies can do (what I tentatively recommend is "use G-Suite and IAM for security policy, turn on advanced account protection, use Chromebooks, and use GCP for compute; all of which gets 80-90% of the practical protections Googlers have internally) for infrastructure, but rolling their own piecemeal is fraught with risk and also costly. There simply are not public solutions as comprehensive or as well-maintained as what some of the FAANGs have achieved.
On top of infrastructure is the common jumble of machine-learning software pulled together from minimally policed public repositories to build a complex assembly of tools for training and validating models and running experiments. No one seems to have a cohesive story for ML operations, and there's a large reliance on big complex packages from many vendors (drivers + CUDA + libraries + model frameworks, etc.) that is usually the opposite of security-focused. It doesn't matter if the infrastructure is solid when a python notebook listens for commands on the public Internet in its default configuration, for example. Writing good ML tooling is also very costly, especially if it keeps up with the state of the art.
AI Alignment is a hard problem and information security is similarly hard because it attempts to enforce a subset of human values about data and resources in a machine-readable and machine-enforceable way. I agree with the authors that security is vitally important for AGI research but I don't have a lot of hope that it's achievable where it matters (against hostile nation-states). Security means costs, which usually means slow, which means unaligned AGI makes progress faster.
I think another framing is anthropic-principle optimization; aim for the best human experiences in the universes that humans are left in. This could be strict EA conditioned on the event that unfriendly AGI doesn't happen or perhaps something even weirder dependent on the anthropic principle. Regardless, dying only happens in some branches of the multiverse so those deaths can be dignified which will presumably increase the odds of non-dying also being dignified because the outcomes spring from the same goals and strategies.
I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc. and where not responding at all is also a failure of alignment because the tool is useless to the organism.
Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?
Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn't a straw man; a calculus for values has to handle the edge-cases too. There may be a very simple answer of moral indifference in the case of gliders but I want to be shown why the reasoning is coherent when the same calculus will be applied to other organisms.
As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input to see how they respond and try to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using/simulating torture?
I have no concrete idea how to accomplish the preceding things and don't expect that anyone else does either. Maybe I'll be pleasantly surprised.
Barring this kind of fundamental accomplishment for alignment I think it's foolhardy to assume ML procedures will be found to convert human values into AGI optimization goals. We can't ask planaria or gliders what they value and we will have to reason it out from first principles, and AGI will have to do the same for us with very limited help from us if we can't even align for planaria. Claiming that planaria or gliders don't have values or that they are not complex enough to effectively communicate their values are both cop-outs. From the perspective of an AGI we humans will be just as inscrutable, if not moreso. If values are not unambiguously well-defined for gliders or planaria then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of values-calculuses with different answers for these simple organisms but almost identical answers for more complex organisms, but if we don't get that kind of convergence we better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.