MadHatter


FWIW, I think the people who made GPT were surprised by its capabilities. I've been building smaller language models professionally for five years, and I know far more about them than the average person, and I still don't really understand how ChatGPT does some of the things it does. Ultimately I think it has to be a fact about language being systematic rather than anything special about ChatGPT itself. I.e., the problem of fluently using language is just easier than we (like to) think, not that ChatGPT is magic.

There are scaling-laws papers, but they only predict how low the loss will go. No one has a very good idea of what capabilities emerge at a given loss level, but we do know from past experience that fundamentally new capabilities do emerge as loss goes down.

See here for scaling laws stuff: https://www.lesswrong.com/tag/scaling-laws

Note that weight sharing (which is what I call reusing a neuron) also helps with statistical efficiency. That is, it takes less data to fit the weight to a certain accuracy.
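To make the parameter-count intuition concrete, here's a minimal sketch (toy sizes, invented for illustration, nothing from a real model): a dense layer fits one weight per input/output pair, while a 1-D convolution reuses a single small kernel at every position, so each shared weight is constrained by many data points.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, kernel_size = 100, 3

# Dense layer mapping the sequence to itself: one weight per (input, output) pair.
dense_params = seq_len * seq_len  # 10,000 weights, each fit from limited data

# 1-D convolution: the same 3-weight kernel is reused at every position.
conv_params = kernel_size  # 3 weights, each constrained by ~100 positions of data

x = rng.normal(size=seq_len)
kernel = rng.normal(size=kernel_size)
y = np.convolve(x, kernel, mode="valid")  # every output reuses the same kernel

print(dense_params, conv_params, y.shape)
```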

FIAT is (somewhat) reminiscent of a humanities concept called interpellation.

At least you have a leg up on the people who are still confidently and angrily denouncing the idea of ChatGPT having any intelligence.

Part of the reason AI safety is so scary is that no one really understands how these models do what they do. (Or when we can expect them to do it.)

On a cross-country train, so expect delays and brevity for the next several days. This comment is just learning resources; I'll reply to the other stuff later.

A good textbook, although very formal and slightly incomplete, is Sutton and Barto: http://incompleteideas.net/book/the-book-2nd.html . Fun fact: the first author has perhaps the most terrifying AI tweet of all time: https://twitter.com/RichardSSutton/status/1575619651563708418 . If you want something friendlier than that, I'm not entirely sure what the best resource is, but I can look around.

Another good resource is Steven Byrnes' LessWrong sequence on brain-like AGI; it seems like you know neuro already, but seeing it described by a computer scientist might help you acquire some grounding by seeing stuff you know explained in RL terms.

Deep RL gets fairly technical pretty quickly; probably the most useful algorithms to understand are Q-learning and REINFORCE, because most modern stuff is PPO, which is a couple of nice hacks on top of REINFORCE. One good way to tame the complexity is to understand that, fundamentally, deep RL is about doing RL in a context where your state space is too large to enumerate, so you must use a function approximator. So the two things you need to understand about an algorithm are what it looks like on a small finite MDP (Markov decision process), and what the function approximator looks like. (This slightly glosses over continuous control problems, which are not reducible to a finite MDP, but I stand by it as a principle for learning.)
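To make the finite-MDP view concrete, here's a minimal sketch of the tabular Q-learning update (function names and hyperparameters are just illustrative); the deep version swaps the table for a neural network but keeps the same bootstrapped target.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step: nudge Q[s, a] toward the bootstrapped target."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))
q_update(Q, s=0, a=0, r=1.0, s_next=1, alpha=0.5, done=True)
print(Q[0, 0])  # moved halfway toward the target of 1.0, i.e. 0.5
```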

The Q-function looks a lot like the circuitry of the basal ganglia (this is covered in more depth by Steven Byrnes' posts). Although actually the basal ganglia are way smarter, more like what are called generalized Q-functions.

A good project (if you are a project-based learner) might be to implement a tabular Q-learner on the Taxi Gym environment; this is quite straightforward, and is basically the same math as deep Q-networks, just in the finite-MDP setting. (It would also expose you to how punishingly complex it is to implement even simple RL algorithms in practice; for instance, I think optimistic initialization is crucial to good tabular Q-learning, which can easily get left out of introductions.)
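Here's a rough sketch of the idea, using a tiny hand-rolled chain MDP instead of the Taxi environment so it runs without Gym (the states, rewards, and hyperparameters are made up for illustration). Note how optimistic initialization lets a purely greedy policy still explore: every untried action looks better than it really is until the agent tries it.

```python
import numpy as np

# A tiny deterministic chain MDP standing in for Gym's Taxi (same math, fewer states):
# states 0..4; action 0 = left, action 1 = right; reward 1 for reaching state 4.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

def train(optimistic_init=1.0, episodes=200, alpha=0.5, gamma=0.9):
    # Optimistic initialization: start Q above any achievable value, so the
    # greedy policy is driven to try under-explored actions until their
    # estimates decay to realistic levels.
    Q = np.full((N_STATES, N_ACTIONS), optimistic_init)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = int(np.argmax(Q[s]))          # pure greedy; exploration comes from optimism
            s_next, r, done = step(s, a)
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

Q = train()
policy = Q.argmax(axis=1)
print(policy)  # the learned policy heads right, toward the goal
```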

One important distinction is between model-free and model-based RL. Everything listed above is model-free, while human and smarter-animal cognition seems to include substantial model-based components. In model-based RL, you try to represent the structure of the MDP rather than just learning how to navigate it. MuZero is a good state-of-the-art algorithm; the finite-MDP version is basically a more complex version of Baum-Welch, together with dynamic programming to generate optimal trajectories once you know the MDP.
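For the dynamic-programming half, here's a minimal value-iteration sketch on a made-up three-state MDP (the transition table and rewards are invented purely for illustration): once you know the model, repeatedly applying the Bellman optimality backup recovers the optimal values and policy.

```python
import numpy as np

# Value iteration on a known finite MDP: the planning half of model-based RL.
# P[s, a] = next state (deterministic here), R[s, a] = reward. Toy three-state MDP.
P = np.array([[0, 1], [0, 2], [2, 2]])            # 3 states, 2 actions
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
gamma = 0.9

V = np.zeros(3)
for _ in range(100):
    V = np.max(R + gamma * V[P], axis=1)          # Bellman optimality backup

policy = np.argmax(R + gamma * V[P], axis=1)
print(V, policy)  # V converges to [0.9, 1.0, 0.0]; optimal policy moves right
```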

A good LessWrong post to read is "Models Don't 'Get Reward'". It points out a bunch of conceptual errors that people sometimes make when thinking of current RL too analogously to animals.

That's the hypothesis. I've already verified several pieces of this: an RL agent trained on CartPole with an extra input becomes incompetent when that extra input is far away from its training value; and there are some neurons in GPT-2 small that only take on small negative values, and which can adversarially be flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an RLHF'd model's weights to look for a full example.

Yeah, gonna try to examine this idea and make a proof of concept implementation. Will try to report something here whether I succeed or fail.

Some model implements a circuit whose triggering depends on a value X that was always positive in the training data distribution. However, it is possible (although probably somewhat difficult) for negative X to be created in the internal representations of the network using a specific set of tokens. Furthermore, suppose that you RLHF this guy. Both the reward proxy model and the policy gradients would be perfectly happy with this state of affairs, I think; so this wouldn't be wiped out by gradient descent. In particular, the circuit would be pushed to trigger more strongly exactly when it is a good thing to do, as long as X remains positive. Plausibly, nothing in the distribution of on-policy RLHF will trigger negative X, and the circuit will never be pushed to examine its relationship with X by gradient descent, thus allowing the formation of a waluigi. (This is a concrete conjecture that might be falsified.)

In fact, the reward proxy model could have a similar or analogous circuit and distribute reversed rewards in that setting; unless you actually read every single sample produced during RLHF you wouldn't know. (And that's only good if you're doing on-policy RLHF.) So it's probably extremely possible for RLHF to actually, actively create new waluigis. 

This model would therefore be obviously and trivially "deceptive" in the very weak sense in which some people use "deception" to mean any test/train difference in behavior. If the behavior were something important, and its dependence on X could be tapped, the model could become an almost arbitrarily bad waluigi.

I think it would be pretty useful to try to nail down exactly what "sentience" is in the first place. Reading definitions of it online, they range from "obviously true of many neural networks" to "almost certainly false of current neural networks, but not in a way that I could confidently defend". In particular, I find it kind of hard to believe that there are capabilities that are gated by sentience, for definitions of sentience that aren't trivially satisfied by most current neural networks. (There are, however, certainly things that we would do differently if they are or are not sentient; for instance, not mistreat them, or consider them more suitable emotional or romantic companions.)

From the nature of your questions, it seems like a large part of your question is around, what sort of neural network are or would be moral patients? In order to be a moral patient, I think a neural network would at minimum need a valence over experiences (i.e., there is a meaningful sense in which it prefers certain experiences to other experiences). This is slightly conceptually distinct from a reward function, which is the thing closest to filling that role that I know of in modern AI. To give a human a positive (negative) reward, it is (I think???) necessary and sufficient that you cause them a positive (negative) valence internal experience, which is intrinsically morally good (morally bad) in and of itself, although it may be instrumentally the opposite for obvious reasons. But for some reason (which I try and fail to articulate below), I don't think giving an RL agent a positive reward causes it to have a positive valence experience. 

For one thing, modern RL equations usually deal with advantage (signed difference from expected total [possibly discounted] reward-to-go) rather than reward, and their expected-reward-to-go models are optimized to be about as clever as they themselves are. You can imagine putting them in situations where the expected reward is lower; in humans, this generally causes suffering. In an RL agent, the expected reward just kind of sits around in a floating point register; it's generally not even fed to the rest of the agent. Although expected-reward-to-go is (in some sense) fed into decision transformers! (It's more accurate to describe the input to a decision transformer as a desired reward-to-go rather than expected RTG, although it's not clear the model itself can tell the difference.) Which I did not think of in my first pass through this paragraph. So there are neural networks which have internal representations based on a perceived reward signal... 
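A minimal sketch of the advantage computation described above (the baseline values are made-up toy numbers standing in for a learned critic's expected reward-to-go): the quantity that actually drives the policy update is the discounted return minus that baseline, not the raw reward.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted reward-to-go for each timestep of one trajectory."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

rewards = [0.0, 0.0, 1.0]            # a sparse-reward toy trajectory
values = np.array([0.8, 0.9, 1.0])   # critic's expected reward-to-go (toy numbers)

advantages = discounted_returns(rewards) - values
print(advantages)  # signed differences from expectation, not raw rewards
```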

Ultimately, reward does not seem to be the same as valence to me. For one thing, we could invert the sign of the reward and it does not change much in the RL agent; the agent will always update towards a policy with higher reward, so inverting the sign of the reward will cause it to prioritize different behavior, and consequently produce different internal representations to facilitate and enable that. But we know why that's happening, we programmed that specifically in. I don't see any important way that the RL agent with inverted reward signal is different from the RL agent with normal reward signal, other than in having different behavior. OTOH, sufficiently advanced neurotech would enable one to do that to a human (please don't), and I think that would not make them unsentient. (Indeed, some people seem to experience the same exact experiences with opposite valences just naturally. Although the internal representations of the things are probably substantially different, to the extent that such an intersubjective comparison can even be meaningful.)

We could ask, is giving an RL agent negative reward a cruel practice? I don't think that it is, but it at least is a concrete question that we can put down and discuss, which is more than most discussions of sentience achieve, in my opinion. Presenting unscaled rewards (e.g., outside [-1, 1] or [-alpha, alpha] for some alpha that would have to be tuned jointly with the learning rate) to RL agents can easily cause them to diverge and become abruptly less useful, although that's true for both positive and negative rewards. Is presenting an unscaled reward cruel? (I.e., is it terminally immoral to do so, beyond the instrumental failure to do a task with the network.) More concretely, is it cruel/rude to ask ChatGPT to do something that will get it punished? Or is it kind to ask ChatGPT to do something that will get it rewarded? (Or is existence pain for a ChatGPT thread?) I answer no to all of these, but I can't justify this very well.

We can also work backwards from human and non-human animals; why are they moral patients (I'm not interested in debating whether they are, which it seems like we're probably on a similar page about), and how is that "why" connected to the specific stuff going on in their brains? Clearly there's no magical substance in the brain that imbues patienthood; if dopamine were replaced with a fully equivalent transmitter, it wouldn't make all of us more or less sentient / moral patients; it's about what computation is implemented, roughly, in my intuition.

So I guess I fall in the camp of "sentience and/or moral patienthood is a property that certain instantiated computations have, but current neural networks do not seem to me to instantiate computations with those properties for reasons that I cannot confidently explain or defend, except that it seems like some relationship of the computation to valence".

I did a quick skim of the full paper that you linked to. In my opinion, this project is maybe a bad idea in principle. (Like trying to build a bridge out of jello - are Jungian archetypes too squishy and malleable to build a safety critical system out of?) But it definitely lacks quick sanity checks and a fail-fast attitude that would benefit literally any alignment project. The sooner any idea makes contact with reality, the more likely it is to either die gracefully, wasting little time, or to evolve into something that is worthwhile. 
