Send me anonymous feedback: https://docs.google.com/forms/d/e/1FAIpQLScLKiFJbQiuRYBhrBbVYUo_c6Xf0f8DN_blbfpJ-2Ml39g1zA/viewform

Any type of feedback is welcome, including arguments that a post/comment I wrote is net negative.

Some quick info about me:

I have a background in computer science (BSc+MSc; my MSc thesis was in NLP and ML, though not in deep learning).

You can also find me on the EA Forum.

Feel free to reach out by sending me a PM here or on my website.


ofer's Shortform

I can't think of any specific source to check recurrently, but you can recurrently google [covid-19 long term effects] and look for new info from sources you trust.

ofer's Shortform

[COVID-19 related]

(Probably already obvious to most LW readers.)

There seems to be a lot of uncertainty about the chances of COVID-19 causing long-term effects (including for young healthy people who experience only mild symptoms). Make sure to take this into account when deciding how much effort you're willing to put into not getting infected.

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

When I said "we need GPT-N to learn a distribution over strings..." I was referring to the implicit distribution that the model learns during training. We need that distribution to assign more probability to the string [a modular NN specification followed by a prompt followed by a natural language description of the modules] than to [a modular NN specification followed by a prompt followed by an arbitrary string]. My concern is that maybe there is no prompt that makes this requirement hold.
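The requirement above can be phrased as a comparison between two model log-probabilities. Here is a minimal sketch of that comparison, using a toy character-level unigram model as a stand-in for GPT-N; the specification string, prompt, and both continuations are hypothetical examples, not anything from the post:

```python
import math
from collections import Counter

def make_char_model(corpus):
    """Toy stand-in for GPT-N: a character unigram model fit to `corpus`."""
    counts = Counter(corpus)
    total = sum(counts.values())
    probs = {c: n / total for c, n in counts.items()}
    def logprob(string):
        # Unseen characters get a tiny smoothing probability.
        return sum(math.log(probs.get(c, 1e-9)) for c in string)
    return logprob

logprob = make_char_model("the quick brown fox jumps over the lazy dog " * 100)

# Hypothetical [NN specification + prompt] prefix; identical for both
# continuations, so its contribution cancels in the comparison.
spec_and_prompt = "module_1 weights ... # Describe module_1 in English: "
description = "this module detects edges"   # desired continuation (25 chars)
arbitrary = "qzxqzxqzxqzxqzxqzxqzxqzxq"     # arbitrary string, same length

# The requirement: the learned distribution must prefer the description.
prefers_description = (
    logprob(spec_and_prompt + description)
    > logprob(spec_and_prompt + arbitrary)
)
```

A unigram model trivially satisfies the comparison because the description reuses common characters; the concern in the comment is precisely that for a real language model and a real NN specification, no prompt may induce the analogous preference.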

Re "curating enough examples", this assumes humans are already able* to describe the modules of a sufficiently powerful language model (powerful enough to yield such descriptions).

*Able in practice, not just in theory.

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

I just want to flag that, like Evan, I don't understand the usage of the term "microscope AI" in the OP. My understanding is that the term (as described here) describes a certain way to use a NN that implements a world model, namely, looking inside the NN and learning useful things about the world. It's an idea about how to use transparency, not how to achieve transparency.

Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Both the general idea of trying to train competitive NNs with modular architectures and the idea of trying to use language models to get descriptions of NNs (or parts thereof) seem extremely interesting! I hope a lot of work will be done on these research directions.

We’re assuming humans can interpret small NN’s, given enough time. A “Modular” NN is just a collection of small NN’s connected by sparse weights. If humans could interpret each module in theory, then GPT-N could too.

I'm not sure about that. Notice that we need GPT-N to learn a distribution over strings that assigns more probability to [a modular NN specification followed by a natural language description of its modules] than to [a modular NN specification followed by an arbitrary string]. Learning such a distribution may be unlikely if the training corpus doesn't contain anything as challenging-to-produce as the former (regardless of what humans can do in theory).

ofer's Shortform

It seems that the research team at Microsoft that trained Turing-NLG (the largest non-sparse language model other than GPT-3, I think) never published a paper on it. They just published a short blog post, in February. Is this normal? The researchers have an obvious incentive to publish such a paper, which would probably be cited a lot.

[EDIT: hmm maybe it's just that they've submitted a paper to NeurIPS 2020.]

[EDIT 2: NeurIPS permits putting the submission on arXiv beforehand, so why haven't they?]

Vanessa Kosoy's Shortform

convergence might literally never occur if the machine just doesn’t have the computational resources to contain such an upload

I think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it's going to see in the target sequence. Now consider the program that says: "if the last number was predicted by the agent to be 0 with probability larger than 1-ε, then the next number is 1; otherwise it is 0." Since it takes much less than log(1/ε) bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than 1-ε (after observing only 0s so far).
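The argument can be simulated in a toy setting: a two-hypothesis Bayesian mixture stands in for bounded Solomonoff induction, and the environment is the diagonalizing program described above. EPS and the prior weights are illustrative stand-ins (the adversarial program's weight plays the role of 2^-(program length), which exceeds ε because the program is short):

```python
# Toy illustration only, not real Solomonoff induction.
EPS = 1e-3

def h_zero(last_p0):
    """Hypothesis: the sequence is all zeros."""
    return 0

def h_adv(last_p0):
    """The adversarial program: output 1 iff the agent's last
    prediction assigned probability > 1 - EPS to seeing 0."""
    return 1 if last_p0 is not None and last_p0 > 1 - EPS else 0

weights = {h_zero: 0.99, h_adv: 0.01}  # the short program's weight > EPS
last_p0, confident = None, []

for _ in range(100):
    # Mixture prediction: total posterior weight on hypotheses predicting 0.
    total = sum(weights.values())
    p0 = sum(w for h, w in weights.items() if h(last_p0) == 0) / total
    confident.append(p0 > 1 - EPS)
    bit = h_adv(last_p0)  # the true environment IS the adversarial program
    # Bayesian update: a deterministic hypothesis that mispredicts is refuted.
    weights = {h: w for h, w in weights.items() if h(last_p0) == bit}
    last_p0 = p0

# The agent is never confident in 0 on two consecutive steps; its
# predictions oscillate forever instead of converging.
assert not any(a and b for a, b in zip(confident, confident[1:]))
```

In the simulation the agent's confidence alternates above and below the threshold indefinitely, illustrating the claimed failure of convergence.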

[This comment is no longer endorsed by its author]
[AN #108]: Why we should scrutinize arguments for AI risk

Thank you for clarifying!

Like, why didn't the earlier less intelligent versions of the system fail in some non-catastrophic way

Even if we assume there will be no algorithm-related discontinuity, I think the following are potential reasons:

  1. Detecting deceptive behaviors in complicated environments may be hard. To continue with the Facebook example, suppose that at some point in the future Facebook's feed-creation-agent would behave deceptively in some non-catastrophic way. Suppose it uses some unacceptable technique to increase user engagement (e.g. making users depressed), but it refrains from doing so in situations where it predicts that Facebook engineers would notice. The agent is not that great at being deceptive though, and a lot of times it ends up using the unacceptable technique when there's actually a high risk of the technique being noticed. Thus, Facebook engineers do notice the unacceptable technique at some point and fix the reward function accordingly (penalizing depressing content or whatever). But how will they detect the deceptive behavior itself? Will they be on the lookout for deceptive behavior and use clever techniques to detect it? (If so, what made Facebook transition into a company that takes AI safety seriously?)

  2. Huge scale-ups without much intermediate testing. Suppose at some point in the future, Facebook decides to scale up the model and training process of their feed-creation-agent by 100x (by assumption, data is not the bottleneck). It seems to me that this new agent may pose an existential risk even conditioned on the previous agent being completely benign. If you think that Facebook is unlikely to do a 100x scale-up in one go, suppose that their leadership comes to believe that the scale-up would cause their revenue to increase in expectation by 10%. That's ~$7B per year, so they are probably willing to spend a lot of money on the scale-up. Also, they may want to complete the scale-up ASAP because they "lose" $134M for every week of delay.
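The figures in point 2 can be sanity-checked with a line of arithmetic (the 10% uplift and the ~$70B/year revenue base are the comment's hypothetical assumptions, not real financials):

```python
# Hypothetical numbers from the scale-up scenario above.
annual_revenue = 70e9                   # ~$70B/year
expected_gain = 0.10 * annual_revenue   # 10% uplift ~= $7B/year
weekly_delay_cost = expected_gain / 52  # ~= $134.6M "lost" per week of delay
```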

[AN #108]: Why we should scrutinize arguments for AI risk

Rohin's opinion: [...] Overall I agree pretty strongly with Ben. I do think that some of the counterarguments are coming from a different frame than the classic arguments. For example, a lot of the counterarguments involve an attempt to generalize from current ML practice to make claims about future AI systems. However, I usually imagine that the classic arguments are basically ignoring current ML, and instead claiming that if an AI system is superintelligent, then it must be goal-directed and have convergent instrumental subgoals.

I agree that the book Superintelligence does not mention any non-goal-directed approaches to AI alignment (as far as I can recall). But as long as we're in the business-as-usual state, we should expect some well-resourced companies to train competitive goal-directed agents that act in the real world, right? (E.g. Facebook plausibly uses some deep RL approach to create the feed that each user sees). Do you agree that for those systems, the classic arguments about instrumental convergence and the treacherous turn are correct? (If so, I don't understand how come you agree pretty strongly with Ben; he seems to be skeptical that those arguments can be mapped to contemporary ML methods.)

What specific dangers arise when asking GPT-N to write an Alignment Forum post?
Answer by ofer, Jul 28, 2020

It may be the case that solving inner alignment problems means hitting a narrow target; meaning that if we naively carry out a super-large-scale training process that spits out a huge AGI-level NN, dangerous logic is very likely to arise somewhere in the NN at some point during training. Since this concern doesn't point at any specific-type-of-dangerous-logic I guess it's not what you're after in this post; but I wouldn't classify it as part of the threat model that "we don't know what we don't know".

Having said all that, here's an attempt at describing a specific scenario as requested:

Suppose we finally train our AGI-level GPT-N and we think that the distribution it learned is "the human writing distribution", HWD for short. HWD is a distribution that roughly corresponds to our credences when answering questions like "which of these two strings is more likely to have appeared on the internet prior to 2020-07-28?". But unbeknownst to us, the inductive bias of our training process made GPT-N learn the distribution HWD*, which is just like HWD except that some fraction of [the strings with a prefix that looks like "a prompt by humans-trying-to-automate-AI-safety"] are manipulative and make AI safety researchers, upon reading them, invoke an AGI with a goal system X. It turns out that the inductive bias of our training process caused GPT-N to model agents-with-goal-system-X, and such agents tend to sample lots of strings from the HWD* distribution in order to "steal" the cosmic endowment of reckless civilizations like ours. This is the same type of failure mode as the universal prior problem.
