Ofer

Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

Feel free to reach out by sending me a PM. (Update: I've turned off email notifications for private messages. If you send me a time-sensitive PM, consider also pinging me about it via the anonymous feedback link above.)

Posts

4ofer's Shortform

15Book review: Architects of Intelligence by Martin Ford (2018)

12The recent NeurIPS call for papers requires authors to include a statement about the potential broader impact of their work

11A probabilistic off-switch that the agent is indifferent to

17Looking for AI Safety Experts to Provide High Level Guidance for RAISE

5A Safer Oracle Setup?

Comments

On Caring about our AI Progeny

Ofer1y40

I think the important factors w.r.t. risks re [morally relevant disvalue that occurs during inference in ML models] are probably more like:

The training algorithm. Unsupervised learning seems less risky than model-free RL (e.g. the RLHF approach currently used by OpenAI maybe?); the latter seems much more similar, in a relevant sense, to the natural evolution process that created us.
The architecture of the model.

Being polite to GPT-n is probably not directly helpful (though it can be helpful by causing humans to care more about this topic). A user can be super polite to a text-generating model, and the model (yielded by model-free RL) can still experience disvalue, particularly during an 'impossible inference', one in which the input text (the "environment") is bad in the sense that there is obviously no way to complete the text in a "good" way.

See also: this paper by Brian Tomasik.

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Ofer1y145

My question was about whether ARC gets to evaluate [the most advanced model that the AI company created so far] before the company creates a slightly more advanced model (by scaling up the architecture, or by continuing the training process of the evaluated model).

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Ofer1yΩ3112

Did OpenAI/Anthropic allow you to evaluate smaller scale versions* of GPT4/Claude before training the full-scale model?

* [EDIT: and full-scale models in earlier stages of the training process]

Speed running everyone through the bad alignment bingo. $5k bounty for a LW conversational agent

Ofer1y4-4

Will this actually make things worse? No, you're overthinking this.

This does not seem like a reasonable attitude (both in general, and in this case specifically).

Acausal normalcy

Ofer1yΩ15-3

Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic.

Have you discussed this point with other relevant researchers before deciding to publish this post? Is there a wide agreement among relevant researchers that a public, unrestricted discussion about this topic is net-positive? Have you considered the unilateralist's curse and biases that you may have (in terms of you gaining status/prestige from publishing this)?

Predictable Outcome Payments

Ofer1y20

Re impact markets: there's a problem regarding potentially incentivizing people to do risky, net-negative things (that can end up being beneficial). I co-authored this post about the topic.

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer1yΩ120

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

Categorizing failures as “outer” or “inner” misalignment is often confused

Ofer1yΩ120

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.

This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?
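
To make the gap concrete, here is a minimal sketch of the quoted two-question process. It assumes each failure can be summarized by two booleans; the names `Category`, `categorize`, `bad_training_feedback`, and `poor_generalization` are hypothetical and not taken from the post.

```python
from enum import Enum


class Category(Enum):
    OUTER = "outer misalignment"
    INNER = "inner misalignment"
    UNCATEGORIZED = "not covered by either question"


def categorize(bad_training_feedback: bool, poor_generalization: bool) -> Category:
    """Apply the two questions from the quoted process, in order."""
    # Question 1: was the feedback provided on the actual training data bad?
    if bad_training_feedback:
        return Category.OUTER
    # Question 2: feedback was good, but did the learned program generalize poorly?
    if poor_generalization:
        return Category.INNER
    # Good feedback and no generalization failure: neither question applies.
    return Category.UNCATEGORIZED


# The scenario above: good feedback signal and no distribution shift.
print(categorize(bad_training_feedback=False, poor_generalization=False))
```

Under these assumptions, a catastrophe that occurs with good feedback and no distribution shift falls into the third branch, which is the case the two questions leave unlabeled.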

Collapse Might Not Be Desirable

Ofer1y20

Relevant & important: The unilateralist's curse.

Impact Shares For Speculative Projects

Ofer2y00

I'm interested in hearing what you think the counterfactuals to impact shares/retroactive funding in general are, and why they are better.

The alternative to launching an impact market is to not launch an impact market. Consider the set of interventions that get funded if and only if an impact market is launched. Those are interventions that no classical EA funder decides to fund in a world without impact markets, so they seem unusually likely to be net-negative. Should we move EA funding towards those interventions, just because there's a chance that they'll end up being extremely beneficial? (Which is the expected result of launching a naive impact market.)