Safety-First Agents/Architectures Are a Promising Path to Safe AGI
Summary

Language model agents (LMAs) like AutoGPT have promising safety characteristics compared to traditional conceptions of AGI. Because they are composed of LLMs that reason in natural language, they plan, think, and act in highly transparent and correctable ways, although not maximally so, and it is unclear whether their safety will increase or decrease in the future. Regardless of where commercial trends take us, it is possible to develop safer versions of LMAs, as well as other "cognitive architectures" that are not dependent on LLMs. Notable areas of potential safety work include effectively separating and governing how agency, cognition, and thinking arise in cognitive architectures. If needed, safety-first cognitive architectures (SCAs) can match or exceed the performance of less safe systems, and they are compatible with many ways AGI may develop. This makes SCAs a promising path towards influencing and ensuring safe AGI development in everything from very-short-timeline scenarios (e.g. LMAs are the first AGIs) to long-timeline scenarios (e.g. future AI models are incorporated into, or built explicitly for, an existing SCA).

Although the SCA field has begun emerging over the past year, awareness seems low and the field seems underdeveloped. I wrote this article to make more people aware of what's happening with SCAs, to document my thinking on the SCA landscape and promising areas of work, and to advocate for more people, funding, and research going towards SCAs.

Background

Language model agents (LMAs), systems in which large language models (LLMs) prompt themselves in loops to think, plan, and act, have exploded in popularity since the release of AutoGPT at the end of March 2023. Even before AutoGPT, the related field that I call "safety-first cognitive architectures" (SCAs) had begun to emerge in the AI safety community. Most notably, in 2022, Eric Drexler formulated arguments for the safety of such systems and developed a high-level design for an SCA called the open agency model. Shortly thereafter
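To make the "prompting themselves in loops" pattern concrete, here is a minimal sketch of an LMA-style loop. The call_llm and run_tool functions are hypothetical placeholders rather than any particular framework's API; the point is that the agent's goals, thoughts, and actions accumulate in a plain-text transcript that a human can inspect and interrupt.

```python
# Minimal sketch of the "LLM prompting itself in a loop" pattern behind LMAs.
# call_llm and run_tool are hypothetical stand-ins, not a real framework's API.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Stand-in for executing a tool/command the model requested."""
    raise NotImplementedError

def agent_loop(goal: str, max_steps: int = 10) -> list[str]:
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        # The model sees the goal plus everything it has thought and done so far.
        thought = call_llm("\n".join(transcript) + "\nWhat should you do next?")
        transcript.append(f"THOUGHT: {thought}")  # plans are recorded in plain text
        if thought.strip().startswith("DONE"):
            break
        observation = run_tool(thought)  # act in the world, observe the result
        transcript.append(f"OBSERVATION: {observation}")
    return transcript  # the full chain of reasoning and actions is inspectable
```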
Unfortunately I see this question didn’t get much engagement when it was originally posted, but I’m going to put a vote in for highly federated systems along the axes of agency, cognitive processes, and thinking, especially those that maximize transparency and determinism. I think that LM agents are just a first step into this area of safety. I write more about this here: https://www.lesswrong.com/posts/caeXurgTwKDpSG4Nh/safety-first-agents-architectures-are-a-promising-path-to
For specific proposals I’d recommend Drexler’s work on federating agency https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model and federating cognitive processes (memory) https://www.lesswrong.com/posts/FKE6cAzQxEK4QH9fC/qnr-prospects-are-important-for-ai-alignment-research
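As a rough illustration of what federation along these axes could look like, the sketch below separates cognition (a planner that can only propose), oversight (a reviewer that checks proposals against explicit rules), and agency (an executor that can only carry out approved plans) behind a deterministic, fully logged controller. The component names and the rule-based approval step are illustrative assumptions of mine, not Drexler's open agency model or any published SCA design.

```python
# Illustrative sketch of federating an agent's functions behind a deterministic
# controller. Component names and the approval step are illustrative assumptions,
# not Drexler's open agency model or any published SCA design.

from dataclasses import dataclass, field

@dataclass
class Proposal:
    plan: str
    rationale: str

class Planner:
    """Cognition: produces candidate plans but cannot act on the world."""
    def propose(self, goal: str, memory: list[str]) -> Proposal:
        raise NotImplementedError  # e.g. backed by an LLM

class Reviewer:
    """Oversight: checks proposals against explicit, human-readable rules."""
    def __init__(self, banned_terms: tuple[str, ...] = ("delete", "exfiltrate")):
        self.banned_terms = banned_terms

    def approve(self, proposal: Proposal) -> bool:
        return not any(term in proposal.plan.lower() for term in self.banned_terms)

class Executor:
    """Agency: the only component allowed to take actions, and only approved ones."""
    def execute(self, plan: str) -> str:
        raise NotImplementedError

@dataclass
class Controller:
    """Deterministic glue: every proposal, review, and action is logged."""
    planner: Planner
    reviewer: Reviewer
    executor: Executor
    log: list[str] = field(default_factory=list)

    def step(self, goal: str) -> None:
        proposal = self.planner.propose(goal, self.log)
        self.log.append(f"PROPOSED: {proposal.plan} ({proposal.rationale})")
        if self.reviewer.approve(proposal):
            result = self.executor.execute(proposal.plan)
            self.log.append(f"EXECUTED: {result}")
        else:
            self.log.append("REJECTED: proposal failed review")
```

Because the controller is ordinary deterministic code, every proposal, review decision, and action is recorded and auditable, which is the transparency and correctability that federation of this kind is meant to buy.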