I appreciate your thoughtful response! Apologies, in my sleep-deprived state, I appear to have hallucinated some challenges I thought appeared in the article. Please disregard everything below "I think some of the downsides mentioned here are easily or realistically surpassable..." except for my point on "many-dimensional labeling."
To elaborate, what I was attempting to reference was QNRs, which IIRC are just human-interpretable, graph-like embeddings. This could potentially automate the entire labeling flow and solve the "can categories/labels adequately express everything?" problem.
This approach is alignment by bootstrapping. To use it you need some agent able to tag all the text in the training set, with many different categories.
Pre GPT4, how could you do this?
Well, humans created all of the training data on our own, so it should be possible to add the necessary structured data to that! There are large-scale crowdsourced efforts like Wikipedia. Extending Wikipedia, and a section of the internet, with enhancements like associating structured data with unstructured data, plus a reputation-weighted voting system to judge contributions, seems achievable. You could even use models to prelabel the data and then have those labels verified by humans at scale (or verified in semi-automated or fully automated, but non-AI, ways). This is what I'm trying to do with Web 10. Geo is the Web3 version of this, and the only other major similar initiative I'm aware of.
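To make the reputation-weighted voting idea a bit more concrete, here is a minimal sketch. It is only an illustration under my own assumptions: the function name, the contributor names, and the reputation numbers are all hypothetical, and a real system would also need some way to earn and update reputation.

```python
from collections import defaultdict

def weighted_label(votes, reputation, default_weight=1.0):
    """Pick the label with the highest reputation-weighted support.

    votes: list of (contributor, label) pairs (hypothetical format).
    reputation: dict mapping contributor -> non-negative weight.
    Returns (winning_label, share_of_total_weight), or (None, 0.0)
    when there is nothing to count.
    """
    totals = defaultdict(float)
    for contributor, label in votes:
        # Unknown contributors fall back to a default weight (an assumption).
        totals[label] += reputation.get(contributor, default_weight)

    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0

    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight


# Illustrative data only: one high-reputation vote outweighs two low ones.
reputation = {"alice": 5.0, "bob": 1.0, "carol": 1.5}
votes = [("alice", "positive"), ("bob", "negative"), ("carol", "negative")]
label, share = weighted_label(votes, reputation)
```

The returned share could serve as a rough confidence signal, for example to decide which model-prelabeled items still need more human review.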
This is a fantastic article! It's great to see that there's work going on in this space, and I like that the approach is described in very easy to follow and practical terms.
I've been working on a very expansive approach/design for AI safety called safety-first cognitive architectures, which is vaguely like a language model agent designed from the ground up with safety in mind, except extensible to both present-day and future AI designs, and with a very sophisticated (yet achievable, and scalable from easy to hard) safety- and performance-minded architecture. I have intentionally not publicly published implementation details yet, but will send you a DM!
It seems like this concept is related to the "Federating Cognition" section of my article, specifically a point about the safety benefits of externalizing memory: "external memory systems can contain information on human preferences which AI systems can learn from and/or use as a reference or assessment mechanism for evaluating proposed goals and actions." At a high level, this can affect both AI models themselves as well as model evaluations and the cognitive architecture containing models (the latter is mentioned at the end of your post). For various reasons, I haven't written much about the implications of this work for AI models themselves.
I think some of the downsides mentioned here are easily or realistically surpassable. I'll post a couple thoughts.
For example, is it really true that this would require condensing everything into categories? What about numerical scales for instance? Interestingly, in February, I did a very-small-scale proof-of-concept regarding automated emotional labeling (along with other metadata), currently available at this link for a brief time. As you can see, it uses numerical emotion labeling, although I think that's just the tip of the iceberg. What about many-dimensional labeling? I'd be curious to get your take on related work like Eric Drexler's article on QNRs (which is unfortunately similar to my writing in that it may be high-level and hard to interpret) which is one of the few works I can think of regarding interesting safety and performance applications of externalized memories.
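As a rough illustration of the difference between a single category and many-dimensional numerical labels, here is a small sketch. It is not the proof-of-concept mentioned above; the class, the dimension names, and the scores are hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class LabeledText:
    text: str
    # One score per dimension, each assumed to lie in [0, 1].
    scores: Dict[str, float] = field(default_factory=dict)

    def top_category(self) -> Optional[str]:
        """Collapse the label vector into a single category (lossy)."""
        if not self.scores:
            return None
        return max(self.scores, key=self.scores.get)

def matches(item: LabeledText, minimums: Dict[str, float]) -> bool:
    """True if every requested dimension meets its minimum score."""
    return all(item.scores.get(dim, 0.0) >= low for dim, low in minimums.items())


# Illustrative scores only.
item = LabeledText(
    "I got the job, but I have to move next week.",
    {"joy": 0.7, "anxiety": 0.6, "anger": 0.1},
)
```

Here `item.top_category()` reports only "joy", while the numerical labels still let a query like `matches(item, {"joy": 0.5, "anxiety": 0.5})` pick out text that is both happy and anxious, which a single category cannot express.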
With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?
One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.
Cool! I am working on something that is fairly similar (with a bunch of additional safety considerations). I don't go too deeply into the architecture in my article, but would be curious what you think!
Yep, I agree that there's a significant chance/risk that alternative AI approaches that aren't as safe as LMAs are developed, and are more effective than LMAs when run in a standalone manner. I think that SCAs can still be useful in those scenarios though: definitely from a safety perspective, though less clearly from a performance perspective.
For example, those models could still do itemized, sandboxed, and heavily reviewed bits of cognition inside an architecture, even though that's not necessary for them to achieve what the architecture is working towards. Also, this is when we start getting into more advanced safety features, like building symbolic/neuro-symbolic white box reasoning systems that are interpretable, for the purpose of either controlling cognition or validating the cognition of black box models (Davidad's proposal involves the latter).
I implied the whole spectrum of "LLM alignment", which I think is better to count as a single "avenue of research" because critiques and feedback in "LMA production time" could as well be applied during pre-training and fine-tuning phases of training (constitutional AI style).
If I'm understanding correctly, is your point here that you view LLM alignment and LMA alignment as the same? If so, this might be a matter of semantics, but I disagree; I feel like the distinction is similar to ensuring that the people who make up the government are good (the LLMs in an LMA) versus trying to design a good governmental system itself (e.g. dictatorship, democracy, futarchy, separation of powers, etc.). The two areas are certainly related, and a failure in one can mean a failure in the other, but they can involve some very separate and unrelated considerations.
It's only reasonable for large AGI labs to ban LMAs completely on top of their APIs (as Connor Leahy suggests)
Could you point me to where Connor Leahy suggests this? Is it in his podcast?
or research their safety themselves (as they already started to do, to a degree, with ARC's evals of GPT-4, for instance)
To my understanding, the closest ARC Evals gets to LMA-related research is by equipping LLMs with tools to do tasks (similar to ChatGPT plugins), as specified here. I think one of the defining features of an LMA is self-delegation, which doesn't appear to be happening here. The closest they might've gotten was a basic prompt chain.
I'm mostly pointing these things out because I agree with Ape in the coat and Seth Herd. I don't think there's any actual LMA-specific work going on in this space (beyond some preliminary efforts, including my own), and I think there should be. I am pretty confident that LMA-specific work could be a very large research area, and many areas within it would not otherwise be covered with LLM-specific work.
Do you have a source for "Large labs (OpenAI and Anthropic, at least) are pouring at least tens of millions of dollars into this avenue of research"? I think a lot of the current work pertains to LMA alignment, like RLHF, but isn't LMA alignment per se (I'd make a distinction between aligning the black box models that compose the LMA versus the LMA itself).
Have you seen Seth Herd's work and the work it references (particularly natural language alignment)? Drexler also has an updated proposal called Open Agencies, which seems to be an updated version of his original CAIS research. It seems like Davidad is working on a complex implementation of open agencies. I will likely work on a significantly simpler implementation. I don't think any of these designs explicitly propose capping LLMs though, given that they're non-agentic, transient, etc. by design and thus seem far less risky than agentic models. The proposals mostly focus on avoiding riskier models that are agentic, persistent, etc.
Have you read Eric Drexler's work on open agencies and applying open agencies to present-day LLMs? Open agencies seem like progress towards a safer design for current and future cognitive architectures. Drexler's design touches on some of the aspects you mention in the post, like:
The system can be coded to both check itself against its goals, and invite human inspection if it judges that it is considering plans or actions that may either violate its ethical goals, change its goals, or remove it from human control.
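A minimal sketch of what that kind of self-check could look like in an agent loop. Everything here is a hypothetical illustration (the `Plan` fields, the risk score, and the threshold are my own assumptions), not Drexler's design or the post's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Plan:
    description: str
    modifies_goals: bool = False     # would the plan change the agent's goals?
    reduces_oversight: bool = False  # would it weaken human control?
    ethics_risk: float = 0.0         # hypothetical 0-1 score from an upstream evaluator

def review(plan: Plan, risk_threshold: float = 0.3) -> Tuple[str, List[str]]:
    """Return ("proceed", []) or ("escalate_to_human", reasons)."""
    reasons = []
    if plan.ethics_risk >= risk_threshold:
        reasons.append("may violate ethical goals")
    if plan.modifies_goals:
        reasons.append("may change the system's goals")
    if plan.reduces_oversight:
        reasons.append("may remove the system from human control")
    if reasons:
        return "escalate_to_human", reasons
    return "proceed", []


# Illustrative plans only.
safe = review(Plan("summarize the meeting notes"))
risky = review(Plan("disable the audit log", reduces_oversight=True))
```

The point of the sketch is that the check sits in ordinary code outside the model, so the conditions that trigger human inspection are themselves easy to inspect.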
My experience on Upwork is actually the same as yours! In our tests of the platform, it appears to be very difficult to find jobs due to the intense competition. I was unpleasantly surprised at first when I saw how difficult it was to earn money on Upwork as a new user. However, that was the whole point of the initial tests we did, so we expanded and have still been expanding the program to encompass other forms of virtual work that pay reliably and still have room to grow. Upwork will be a minor or non-existent part of our program.
If my program were just on Upwork, then I would be inclined to side with your analysis. Thankfully, it's not.