Tor Økland Barstad


Half-baked AI Safety ideas thread

Perhaps experiments could make use of encryption in some way that prevented AGIs from doing/verifying work themselves, making it so that they would need to align the other AGI/AGIs. Encryption keys that only one AGI has could be necessary for doing and/or verifying work.
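A minimal sketch of the key-gating idea, assuming a keyed MAC stands in for "encryption keys that only one AGI has". The agent roles, variable names, and the harness that checks the tag are all hypothetical; the point is only that approval of work can be made impossible to produce without the key.

```python
import hashlib
import hmac
import secrets


def sign_approval(key: bytes, work: bytes) -> bytes:
    """Return an approval tag that only a holder of `key` can produce."""
    return hmac.new(key, work, hashlib.sha256).digest()


def approval_is_valid(key: bytes, work: bytes, tag: bytes) -> bool:
    """Check a tag in constant time; run by the (hypothetical) experiment harness."""
    return hmac.compare_digest(sign_approval(key, work), tag)


# Hypothetical setup: the verifier agent holds the key, the worker agent does not.
verifier_key = secrets.token_bytes(32)
work = b"output produced by the worker agent"

tag = sign_approval(verifier_key, work)
# A worker that guesses a key cannot produce a tag the harness accepts.
forged = sign_approval(secrets.token_bytes(32), work)

print(approval_is_valid(verifier_key, work, tag))
print(approval_is_valid(verifier_key, work, forged))
```

In this toy version the worker cannot approve its own output, so it has to get the key-holding agent to do so.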

Could maybe set things up in such a way that one AGI knows it can get more reward if it tricks the other into approving faulty output.

Would need to avoid suffering sub-routines.

Is CIRL a promising agenda?

This doesn't seem like it holds up because a CIRL agent would probably eventually stop treating you as a source of new information once it had learned a lot from you, at which point it would stop being deferential.

But maybe continuing to be deferential (in many/most situations) would be part of the utility function it converged towards? Not saying this consideration refutes your point, but it is a consideration.

(I don't have much of an opinion regarding the study-worthiness of CIRL btw, and I know very little about CIRL. Though I do have the perspective that one alignment-methodology need not necessarily be the "enemy" of another, partly because we might want AGI-systems where sub-systems also are AGIs (and based on different alignment-methodologies), and where we see whether outputs from different sub-systems converge.)

Where I agree and disagree with Eliezer

How useful AI-systems can be at this sort of thing after becoming catastrophically dangerous is also worth discussing more than is done at present. At least I think so. Between Eliezer and me I think maybe that's the biggest crux (my intuitions about FOOM are Eliezer-like I think, although AFAIK I'm more unsure/agnostic regarding that than he is).

Obviously a more favorable situation if AGI-system is aligned before it could destroy the world. But even if we think we succeeded with alignment prior to superintelligence (and possible FOOM), we should look for ways it can help with alignment afterwards, so as to provide additional security/alignment-assurance.

As Paul points out, verification will often be a lot easier than generation, and I think techniques that leverage this (also with superintelligent systems that may not be aligned) are underdiscussed. And how easy/hard it would be for an AGI-system to trick us (into thinking it's being helpful when it really isn't) would depend a lot on how we went about things.
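A toy illustration of the verification/generation gap, using integer factoring as a stand-in domain (my choice of example, not something from the discussion). Finding a factor requires a search, while checking a claimed factorization is just multiplication. Note that the check only confirms that the factors multiply to `n`, not that they are prime.

```python
from typing import List


def find_factor(n: int) -> int:
    """Generation: search for the smallest nontrivial factor by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no nontrivial factor found, so n is prime


def check_factorization(n: int, factors: List[int]) -> bool:
    """Verification: multiply the claimed factors back together."""
    product = 1
    for f in factors:
        if f <= 1:
            return False  # reject trivial or invalid factors
        product *= f
    return product == n


n = 101 * 103
p = find_factor(n)                         # needs a loop over candidates
print(check_factorization(n, [p, n // p]))  # needs one multiplication per factor
```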

Various potential ways of getting help for alignment while keeping "channels of causality" quite limited and verifying the work/output of the AI-system in powerful ways.

I've started on a series about this:

Is CIRL a promising agenda?

Do you have available URLs to comments/posts where you have done so in the past?

Half-baked AI Safety ideas thread

Similar to this, but not the same: Experiment with AGI where it is set to align other AGI. For example, maybe it needs to do some tasks to get reward, but those tasks need to be done by the other AGI, and it doesn't know what the tasks will be beforehand. One goal being to see methods AGI might use to align other AGI (that may then be used to align AGI-systems that are sub-systems of AGI-system, and seeing if output from this AGI converges with results from AGIs aligned by other principles).

Don't expect that this would be that fruitful, but haven't thought about it that much and who knows.

Would need to avoid suffering sub-routines.

Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment

I think there are various very powerful methods that can be used to make it hard for AGI-system to not provide what we want in process of creating aligned AGI-system. But I don't disagree in regards to what you say about it being "extremely dangerous". I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.

Half-baked AI Safety ideas thread

Trying to think of ways that experiments can be set up where AI-systems use other AI-systems as oracles/genies (1 or more being superintelligent), perhaps sometimes in "adversarial" ways. Exploring methods for making requests and maybe finding out things about how hard/easy it is to "trick" the operator (seeming to provide what they want without providing what they want) given various domains/methods.

May e.g. involve one AI asking for code-pieces for some purpose, but where other AI is to try to "hide" ways in which delivered code isn't quite what the other AI wants.

Superintelligence may realize what's going on (or may be going on), and act accordingly. But nonetheless maybe some useful info could be gained? 🤔

Half-baked AI Safety ideas thread

Getting AGI to output "physics-simulator" where both starting and ending conditions are specified to be within some ranges of possible state-configurations, instead of simulating one specific example and providing a "point-like" value.

Have possible machines (proposed by AI-system) be tested by such "physics-simulators".

"Simulator" might highlight predictable aspects of machine behavior vs less predictable things. That is to say, the behavior of machine that is "low-entropy" (I'm probably using the term wrong).

Point would be to have a probabilistic safety-measure that makes it harder for AI to propose some machine design (or machine-creation-instructions) that results in unexpected behavior.

The "simulator" should be tested against real-world data that the AI doesn't have access to (as well as using various other measures to verify and look for problems). 
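A minimal sketch of what range-based (rather than "point-like") simulation could look like, assuming simple interval arithmetic and a toy one-dimensional system. The `Interval` class, the `simulate` function, and all numbers are hypothetical illustrations, not a proposal for the actual simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed range [lo, hi] of possible values."""
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k: float) -> "Interval":
        a, b = self.lo * k, self.hi * k
        return Interval(min(a, b), max(a, b))

    def within(self, other: "Interval") -> bool:
        return other.lo <= self.lo and self.hi <= other.hi


def simulate(position: Interval, velocity: Interval, dt: float, steps: int) -> Interval:
    """Propagate a range of start positions under a range of possible velocities."""
    for _ in range(steps):
        position = position + velocity.scale(dt)
    return position


start = Interval(0.0, 1.0)         # range of starting conditions
speed = Interval(1.5, 2.5)         # range of possible velocities
allowed_end = Interval(0.0, 30.0)  # specified range of acceptable end states

end = simulate(start, speed, dt=0.5, steps=20)
print(end, end.within(allowed_end))
```

Here every start state in the given range ends up within [15.0, 26.0], so the design passes the check against the allowed end range. A wider output range would correspond to less predictable behavior.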

Half-baked AI Safety ideas thread

Specify a proof-format where ambiguity (of cluster-like concepts, etc) is part of the formalism, where mappings between concepts and the real world are part of the formalism, and where output from functions can be referenced as part of the formalism.

Of course how much trust that is put in proof/argument-tree would depend on various things (allowing vague concepts makes it less trustable).

For cluster-like concepts referenced by proofs, a precise specification of criteria for exclusion/inclusion should not be expected, but the more the better. Inference rules and examples can specify the degree to which specific instances would fall within a specific concept or not (also allowed to say that some instances fall neither inside nor outside of it).
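A rough sketch of how graded membership in a cluster-like concept might be represented, assuming membership degrees in [0, 1] given by examples. The class, the threshold, and the example instances are hypothetical, and unlisted instances are treated the same as instances explicitly declared neither inside nor outside.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClusterConcept:
    name: str
    # Degree of membership per instance: 1.0 = clearly inside, 0.0 = clearly
    # outside, None = declared neither inside nor outside.
    examples: Dict[str, Optional[float]]

    def membership(self, instance: str) -> Optional[float]:
        # Unlisted instances also come back as None (undetermined).
        return self.examples.get(instance)


def classify(concept: ClusterConcept, instance: str, threshold: float = 0.8) -> str:
    """Decide whether a proof step may treat `instance` as falling under `concept`."""
    degree = concept.membership(instance)
    if degree is None:
        return "undetermined"
    return "inside" if degree >= threshold else "not clearly inside"


chair = ClusterConcept("chair", {"office chair": 1.0, "tree stump": 0.4, "beanbag": None})
for item in ["office chair", "tree stump", "beanbag", "sofa"]:
    print(item, "->", classify(chair, item))
```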

One of the points would be to make as much as possible be within the realm of things where an AGI could be expected to output proofs that are easier to verify compared to other output.

My thinking is that this would be most helpful when combined with other techniques/design-principles. Like, outputting the proof (very formal argument with computable inference-steps) is one thing, but another thing is which techniques/processes are used to look for problems with it (as well as looking for problems with the formalism as a whole, as well as testing/predicting how hard/easy it is to convince humans of things that are false or contradictory given various conditions/specifics).

Bonus if these formal proofs/statements can be presented in ways where humans easily can read them.

Half-baked AI Safety ideas thread

Having a "council" of AGIs that are "siloed".

The first AGI can be used in the creation of code for AGIs that are aligned based on various different principles. Could be in one "swoop", but with the expectation that code and architecture are optimized for clarity/modularity/verifiability/etc. But could also be by asking the AGI to do lots of different things. Or both (and then we can see whether output from the different systems is the same).

Naturally, all these things should not be asked of the same instance of the AGI (although that could be done as well, to check if output converges).

In the end we have a "council" of AGIs. Some maybe predicting output of smart humans working for a long time. Some maybe using high welfare brain emulations. Some maybe constructing proofs where ambiguity of cluster-like concepts is accounted for within formalism, and mapping between concepts and outside world is accounted for within formalism - with as much of "core" thinking as possible being one and the same as these proofs. Some maybe based on machine learning by debate ideas. The more concepts that seem likely to work (without having suffering sub-routines) the better.

This "council" of "siloed" AGIs can then be used as oracle/genie, and we can see if output converges. And they also search for mistakes in output (answers, proofs/verification, argument-trees, etc) from other AGIs.
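A minimal sketch of the convergence check, assuming each "siloed" AGI can be modeled as an independent function from question to answer and that answers can be compared for exact equality. The function name, the quorum parameter, and the stand-in oracles are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple


def consult_council(
    oracles: Sequence[Callable[[str], str]],
    question: str,
    quorum: float = 1.0,
) -> Tuple[bool, Optional[str], List[str]]:
    """Ask every oracle separately and report whether their answers converge."""
    if not oracles:
        raise ValueError("the council needs at least one oracle")
    # Each oracle only sees the question, never the other oracles' answers.
    answers = [oracle(question) for oracle in oracles]
    top_answer, votes = Counter(answers).most_common(1)[0]
    converged = votes / len(answers) >= quorum
    return converged, (top_answer if converged else None), answers


# Stand-ins for systems aligned by different principles.
council = [lambda q: "4", lambda q: "4", lambda q: "5"]

print(consult_council(council, "What is 2 + 2?"))
print(consult_council(council, "What is 2 + 2?", quorum=0.6))
```

With the default quorum of 1.0, any disagreement is flagged rather than resolved by majority, which seems like the safer default when some council members may not be aligned.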
