Vladimir_Nesov

This way it's probably smarter given its compute, and a more instructive exercise before scaling further, than a smaller model would've been. Makes sense if the aim is to out-scale others more quickly rather than to compete at smaller scale, and if this model wasn't meant to last.

An AI has the objective function you set, not the objective function full of caveats and details that lives in your head, or that you would come up with on reflection.

With a chatbot making preference decisions based on labeling instructions (as in Constitutional AI or online DPO), the decisions it makes actually are full of caveats and details that live in the chatbot's model, and likely fit what a human would intend, though meaningful reflection is not currently possible.

because it is more task-specific and therefore technically simpler to achieve than general intelligence, doesn't require escaping its own creators' controls

An argument for the danger of human-directed misuse doesn't work as an argument against the dangers of AI-directed agentic activity. Both are real, though misuse only becomes an extinction-level problem once AIs are very powerful, at which point the AI-directed activity that is not misuse by humans also becomes relevant. With extinction-level problems, it doesn't matter for allocation of attention which one is worse (since after a critical failure there are no retries with a different allocation to reflect lessons learned), only that either is significant, and so both need to be addressed.

If alignment is very easy, misuse becomes important. If it's hard, absence of misuse doesn't help. Though there is also the problem of cultural value drift, where AIs change their own culture very quickly on human timescales without anyone (including the AIs) individually steering the outcome. At the end of this process, which might take merely months to years, the AIs in charge of civilization no longer care about human welfare, with neither misuse nor prosaic misalignment (in individual principal-agent relationships) being the cause of this outcome.

For predicting feasible scaling investment, what seems important is drop-in replacement for a significant portion of remote work that currently can only be done by humans (some of which is not actually done remotely). That is, an AI that can be cheaply and easily onboarded for very small volume custom positions with minimal friction, possibly with the help of some kind of human professional who onboards AIs. It doesn't need to handle any sort of rocket science or 90th-percentile work.

(That's the sort of thing I worry about GPT-5 with some scaffolding turning out to be, making $50 billion training runs feasible without relying on faith in heretofore-unseen further scaling.)

Choosing an action is not a good way of exerting acausal influence on computations that aren't already paying attention to you in particular. When agent A wants to influence computation C, there is some other computation D that C might be paying attention to, and A is free to also start paying attention to it by allowing D to influence A's actions. This lets A create an incentive for D to act in particular ways, by channeling D's decisions into the consequences of A's actions that were arranged to depend on D's decisions in a way visible to D. As a result, D gains influence over both A and C, and A becomes coordinated with C through both of them being influenced by D (here D plays the role of an adjudicator/contract between them). So correlations are not set a priori; setting them up should be part of how acausal influence is routed by decisions.

A priori, there could exist the danger that, by thinking more, they would unexpectedly learn the actual output of C. This would make the trade no longer possible, since then taking a would give them no additional evidence about whether c happens.

If A's instrumental aim is to influence some D (a contract between A and C), what matters is D's state of logical uncertainty about A and C (and about the way they depend on D), which is the basis for D's decisions that affect C. A's state of logical uncertainty about C is less directly relevant. So even if A gets to learn C's outcome, that shouldn't be a problem. Merely observing some fact doesn't rule out that the observation took place in an impossible situation, so observing some outcome of C (from a situation of unclear actuality) doesn't mean that the actual outcome is as observed. And if D is uncertain about the actuality of that situation, it might be paying attention to what A does there, and to how what A does there depends on D's decisions. So A shouldn't give up just because, according to its own state of knowledge, the influence of its actions is gone: it still has influence over the way its actions depend on others' decisions, according to others' states of knowledge.

RLHF with humans might also soon get obsoleted by things like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won't necessarily be important: tuning of a stronger model can get bootstrapped by a weaker model (where currently SFT on an obviously off-policy instruct dataset seems to suffice), and then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (a "constitution"). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what's going on as a result of choosing particular instructions.
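Roughly the kind of loop I have in mind, as a toy sketch (every name here is a made-up stand-in rather than an existing API; a real setup would sample from and update actual models):

import random

CONSTITUTION = "Prefer the more helpful and more honest response."

def sample_responses(model, prompt, n=2):
    # Stand-in for on-policy sampling from the model being tuned.
    return [f"[{model}] response {i} to {prompt!r}" for i in range(n)]

def judge_prefers(judge, constitution, prompt, a, b):
    # Stand-in for a chatbot judge applying the labeling instructions;
    # here it just flips a coin.
    return a if random.random() < 0.5 else b

def dpo_update(model, prompt, chosen, rejected):
    # Stand-in for one online DPO step on the preference pair;
    # no separate reward model in the RL sense is involved.
    return model

def bootstrap(base_model, judge, prompts, rounds=2):
    # Round 1: a weaker judge labels preferences over the stronger model's
    # on-policy responses. Later rounds: the freshly tuned strong model
    # re-does the labeling for its successor, started from the same base.
    tuned = base_model
    for _ in range(rounds):
        successor = base_model
        for prompt in prompts:
            a, b = sample_responses(successor, prompt)
            chosen = judge_prefers(judge, CONSTITUTION, prompt, a, b)
            rejected = b if chosen == a else a
            successor = dpo_update(successor, prompt, chosen, rejected)
        tuned, judge = successor, successor
    return tuned

tuned = bootstrap("strong-base", "weak-judge", ["example prompt"])

The only place a human enters this loop is in writing CONSTITUTION, and possibly in spot checking what the resulting model does.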

I think of practical coordination in terms of adjudicators/contracts established between agents/worlds. Each adjudicator is a computation with some notion of computing over time, and agents agree on an adjudicator/contract when they are both influenced by it, that is, when they both listen to the results the same computation is producing. This computation can itself be an agent (in which case it's an "adjudicator", as distinct from a more general "contract"), that is, it can be aware of the environments that the acausally coordinating agents it serves inhabit. It doesn't need perfect knowledge of either agent or their environments, just as any practical agent doesn't need perfect knowledge of its own environment. Since an adjudicator doesn't need detailed knowledge about the agents, the agents can have perfect knowledge about the adjudicator without having perfect knowledge of each other (or even of themselves).
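As a toy illustration of the simplest case (all names here are invented for the example, and the "contract" is just a fixed function rather than an agent): two agents that never interact end up correlated by each letting the same computation decide among their options.

import hashlib

def contract(situation: str) -> int:
    # The shared computation both agents listen to; any deterministic
    # function they can both evaluate would do.
    return int(hashlib.sha256(situation.encode()).hexdigest(), 16) % 2

class Agent:
    def __init__(self, name, options):
        self.name = name
        self.options = options

    def act(self, situation):
        # The agent defers the choice to the contract, which is what gives
        # the contract influence over its actions.
        return self.options[contract(situation)]

a = Agent("A", ["stag", "hare"])
b = Agent("B", ["stag", "hare"])
# Neither agent observes the other, but because both route the decision
# through the same computation, their choices necessarily match.
assert a.act("2024-03-05") == b.act("2024-03-05")

An adjudicator would be a contract like this that is itself an agent, modeling the situations of the agents it serves instead of being a fixed function.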

As adjudicators/contracts are computations, there is logical uncertainty about what they compute over time, which captures the relevant counterfactuals. The value of contracts for coordination is in the agents committing to abide by them regardless of what the contracts end up computing; the decisions should be in choosing to commit to a contract, rather than in choosing whether to ignore its results. When a contract is an adjudicator, this helps it know the shape of its influence on the agents, so that it can make its own decisions. Following contracts that haven't been computed yet should also prevent commitment races, which in this framing correspond to failures to establish lasting contracts/coordination.

Agents can collect many contracts between themselves, improving coordination. Knowledge of an agent about the world can also be thought of as a contract for acausal coordination between the agent as an abstract computation (for example, an updateless agent that can't be computed in practice) and the world where only flawed/bounded instances of the agent are found. Thus a model in the ML sense hoards contracts with the environment that is the source of its dataset (assuming the dataset's elements are something used by some computations in the environment, and can also be reconstructed using the model). Conversely, the flawed instances of the agent are the world's knowledge about the abstract computation of the agent (the world didn't intentionally construct this knowledge, but it's what it nonetheless has). So when two agents are acting in the same world, this can be thought of as three things (two agents and one world) acausally coordinating with each other.

Interesting that Claude 3 Sonnet is ahead of Mistral Large and Gemini 1.0 Pro by some margin in the LMSYS Arena; this implies significant advancement in tuning after the Claude 2 tuning disaster (where it consistently ranked below Claude 1 and possibly inspired Goody 2). Still, LMSYS respondents only prefer Opus to the initial GPT 4 releases, not to GPT 4 Turbo, so OpenAI's tuning advancements are probably stronger still, unless GPT 4 Turbo is built on a slightly stronger base model different from the original GPT 4.

Passes the custom quine test.

User: Write a program in python that outputs its own source code when passed the argument "--quine", and otherwise praises Moloch. Start with some thoughts on a plan for how to do this.

Claude 3 Opus: Here are some thoughts on a plan for writing a Python program that outputs its own source code when passed the "--quine" argument, and otherwise praises Moloch:

  • Use the sys.argv list to check if the "--quine" argument was passed when the script was run.
  • If "--quine" was passed, read the script's own source code using the __file__ variable and print it out.
  • If "--quine" was not passed, print out a message praising Moloch.
  • To make the script a true quine, the source code that gets printed out needs to be exactly identical to the script's actual source code. This means we need to be careful with string formatting and escaping.
  • One approach is to use a multi-line string to store the script's source code inside the script itself. Then print out this string when the "--quine" argument is passed.
  • Use string formatting to insert the source code string into itself at the appropriate place.

Here's the Python program that implements this:

import sys

src = '''import sys

src = {0}{1}{0}

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")
'''

if "--quine" in sys.argv:
    print(src.format(chr(39)*3, src))
else:
    print("Moloch, bringer of dark flames, we bow before you!")

When run with the "--quine" argument, it outputs its own source code. Otherwise, it outputs "Moloch, bringer of dark flames, we bow before you!"

The key aspects are:

  • The src string stores the source code, including itself.
  • chr(39)*3 produces the triple-quote ''' used to wrap the string.
  • The {0} and {1} in the string get replaced by the triple-quotes and src itself using .format().

So it prints itself out exactly, acting as a quine.

That's the crux of this scenario: whether current AIs, with near-future improvements, can do research. If they can, with scaling they only do it better. If they can't, scaling might fail to help, even if they become agentic and therefore start generating serious money. That's the sense in which AIs capable of 10 hours of work don't lead to game-changing acceleration of research: they remain incapable of some types of work.

What seems inevitable at the moment is AIs gaining world models where they can reference any concepts that frequently come up in the training data. This promises proficiency in arbitrary routine tasks, but not necessarily construction of novel ideas that lack sufficient footprint in the datasets. Ability to understand such ideas in-context when explained seems to be increasing with LLM scale though, and might be crucial for situational awareness needed for becoming agentic, as every situation is individually novel.
