All these problems could be interpreted as alignment or intelligence problems. In many cases, the actors involved do not care enough about the outcome. Or when they do care, they are not intelligent enough to connect their actions to their incentives.
The above two papers suggest grokking is a consequence of moderately bad training setups. I.e., training setups that are bad enough that the model starts out by just memorizing the data, but which also contain some sort of weak regularization that eventually corrects this initial mistake.
Sorry if this is a silly question, but from an ML-engineer perspective: can I expect to achieve better performance by seeking grokking (large model, large regularisation, long training time) vs improving the training setup?
And if the training setup is already good, I shouldn't expect grokking to be possible?
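Rough sketch, so this isn't purely abstract: the setup those papers describe would look something like the toy below, where a small model on a tiny algorithmic task memorizes the training set first, and heavy weight decay is the weak regularizer that can eventually push it toward the general solution over a very long run. The task, architecture, and hyperparameters here are my own illustrative guesses, not taken from the papers.

```python
# A minimal, hypothetical "grokking-style" setup: a small model on a tiny
# algorithmic task (modular addition), trained far past the point of
# memorization, with weight decay as the weak regularizer.
import torch
import torch.nn as nn

P = 97  # modulus for the toy task (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 2  # small training fraction encourages memorization first
train_idx, val_idx = perm[:n_train], perm[n_train:]

def encode(idx):
    # one-hot encode both operands and concatenate them
    x = torch.zeros(len(idx), 2 * P)
    x[torch.arange(len(idx)), pairs[idx, 0]] = 1.0
    x[torch.arange(len(idx)), P + pairs[idx, 1]] = 1.0
    return x, labels[idx]

x_train, y_train = encode(train_idx)
x_val, y_val = encode(val_idx)

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
# Large weight decay is the "weak regularization" that may eventually push
# the model from memorizing toward the general solution.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(100_000):  # grokking needs a very long run
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            train_acc = (model(x_train).argmax(-1) == y_train).float().mean()
            val_acc = (model(x_val).argmax(-1) == y_val).float().mean()
        # Expect train accuracy to hit ~1.0 long before val accuracy climbs.
        print(f"step {step}: train_acc={train_acc:.2f} val_acc={val_acc:.2f}")
```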
is Australia, and the population there came from boat builders.
Most sources say they came to Australia via land bridge. You may be thinking of Polynesians, who are another group.
This conversation might be better if we taboo Hitler and recent politics. On the AskHistorians subreddit they have a 50-year rule, and here we say that politics is the mind-killer.
In any case, it seems to me that this approach extrapolates current trends, but I suggest that it might be more reliable to look at history for priors. Extrapolation can lead us to predict wild swings, while history puts bounds on the swings and sometimes suggests a return to the mean.
There certainly have been a lot of dictatorships in history, and not all of them fascist. But th...
I don’t find train-test distinctions particularly essential here because our method is unsupervised
If I recall correctly, most unsupervised learning papers do have a test set. Perhaps the fact that train and test results differ kind of shows why you need a test set in the first place.
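To make that concrete, here's a minimal sketch of the kind of held-out evaluation I mean, even for an unsupervised method: fit on one split, score reconstruction error on another, and the gap tells you whether the model generalized or memorized. PCA and the random data are just stand-ins for whatever method and data are actually in play.

```python
# Minimal illustration: even an unsupervised model should be scored on
# held-out data, otherwise you can't tell fitting from memorizing.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=10).fit(X_train)

def recon_error(model, data):
    # mean squared reconstruction error under the fitted components
    recon = model.inverse_transform(model.transform(data))
    return float(np.mean((data - recon) ** 2))

print("train error:", recon_error(pca, X_train))
print("test  error:", recon_error(pca, X_test))  # typically higher: the generalization gap
```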
Like every week I’d have these calls with Ilya Sutskever at OpenAI and I’d tell him about my progress on watermarking, and he would say, “Well, that’s great, Scott, and you should keep working on that. But what we really want to know is how do you formalize what it means for the AI to love humanity? And what’s the complexity theoretic definition of goodness?” And I’m like, “Yeah Ilya, I’m going to keep thinking about that. Those are really tough questions, but I don’t have a lot of progress to report there.”
That was surprising to me. Sounds like OpenAI care about alignment enough to headhunt Scott and have their chief scientist refocus on it weekly.
Thanks Gwern. Exactly the kind of response I was hoping for when I posted here.
Those are good points, and I agree it's super complex. If I understand you correctly, you're saying that it will not be trained to complete censored topics, and it will not even learn the primitives to understand the censored topic. Which could be bad when we try to instruct it to do anything about the censored topic.
Any filter will be crude and have unintended consequences. And yet, we still need to make a choice. Taking no action is also a choice that will have consequences.
Rig...
If I understand you correctly, you're saying that it will not be trained to complete censored topics, and it will not even learn the primitives to understand the censored topic. Which could be bad when we try to instruct it to do anything about the censored topic.
Not necessarily. There are many ways the optimizing may go. eg It may just learn to lie - there is a great deal of interest in 'AI bias' research in doing things like make a LLM not have 'heteronormativity' bias and consider gay marriage just as common and likely as regular marriage; this is unt...
Yeah, there are a ton of near-term capabilities that are one paper away. The worst ones IMO are the ones that add RL, or use LLMs in RL, since that would increase their agent-ness and lead to RL-like misalignment. And RL misalignment seems much worse than LLM misalignment at the present time.
Thanks for laying this out!
Can I ask a personal question? If you were involved in the testing, was it alarming or boring? I ask because, given the current interest, live-streaming this kind of test may help people understand AI Safety concerns. I'd watch it.
Another question! You mention unsafe actions. But what if the model outputs code that the researcher does not understand? Is it run on an offline or airgapped computer? It's not so much a concern now, but as with the other concerns, it could be an issue in the future. E.g. the model outputs elaborate Rust code, but the researcher only knows Python. It looks innocent, so they run it anyway and FOOM.
Just in case it's not obvious: I think people are reacting to the lack of caution and paranoia described in the testing document.
The subtext is that if anyone is going to take this seriously, it should be the people involved in ARC, since it's so closely connected to LessWrong and EA. It's the ingroup! It's us! In other words: there are higher expectations on ARC than there are on Microsoft, because we should care the most. We've read the most science fiction and spent decades of our lives arguing about it, after all.
Yet it doesn't sound like te...
5 years later, I wonder if this made it into Common Crawl or similar.
In hindsight we can see a few ways to get included in an LLM training corpus:
src: https://s10251.pcdn.co/pdf/2022-Alan-D-Thompson-Whats-in-my-AI-Rev-0.pdf
There may also be some architecture advances, although I'm unsure why we didn't see these in recent LLMs. In Sam Altman's AC10 meetup Q&A he did say that GPT-4 would use a different loss function; what effect would that have? I have no idea.
You can see some examples in this Jan 2023 overview of transformer advances by Lilian Weng, The Transformer Family v2.
In terms of motivating orgs, maybe this would work better as an open letter. That format provides social pressure by focusing on how many researchers have signed it, positive reinforcement by calling out good behavior, and mild negative reinforcement by naming organizations that we hope will join but haven't yet.
That's how they do it in other fields, although I'm not sure if it actually works there or if it's just effective signaling. Still, it would be worth a try.
To make it easier, we should also give kudos to org Y if X of its researchers have given their own plans. Having researchers give their own plans is a lot easier than getting official sanction, but it's also a useful stepping stone.
As am I. So many organizations have a whistleblower policy or a safety culture. I've worked in industry, and to put it gently, how these cultures work in practice can be quite a bit different from the stated intention.
It's because, from a management perspective, letting anyone ask questions has to be balanced against getting things done and having some top-down leadership.
Here's a wild guess. They just "stole" a bunch of core people from OpenAI; that doesn't happen to any organization without tension and bad feelings. Now they are in direct competition with OpenAI for funding, staff, and press coverage. Even worse!
Perhaps they made peace and agreed not to make public releases for some time. Or it could be they want to differentiate themselves before they release their strategy.
For what it's worth, I was in a similar boat: I've long wanted to work on applied alignment, but also to stay in Australia for family reasons. Each time I've changed jobs I've made the same search as you, and ended up just getting a job where I can apply some ML in industry, just so that I can remain close to the field.
For all the calls for alignment researchers, most orgs seem hesitant to do the obvious thing that would really expand their talent pool: open up to remote work.
Obviously they struggle to manage and communicate remotely, which prevents them...
This is great and significantly changed my mind about how good the edits are and the quality of causal associations in current LLMs.
While this is the first comment on the LW post, it has also been shared on Twitter a bit.
This has some similarities to the Stoic evening review, so you would probably also like that practice if you ever wanted some self-improvement toward happiness and emotional management. https://old.reddit.com/r/Stoicism/comments/adwllh/the_stoic_evening_routine_by_seneca_i_make_use_of/
Great post. I'm going to zoom in on one thing to be argumentative ;p
You say that transparency doesn't have externalities, in that it doesn't help researchers make more capable models. I wonder why you are so confident?
I'm assuming that because you haven't seen it in papers and haven't used it yourself, you conclude that it's not commonly used. But others might use it as a debugging or exploration tool. After all, do papers really list their debugging and exploration tools? Not usually.
Do you know why they lost interest? Assuming their funding decisions were well thought out, it might be interesting.
are there any alignment approaches that we could try out on GPT-3 in simplified form?
For a start, you could see how it predicts or extrapolates moral reasoning. The datasets I've seen for that are "Moral Machine" and "Am I the Arsehole" on Reddit.
EDIT: Something like this was just released: Aligning AI With Shared Human Values.
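If someone wants a concrete starting point, here's a rough sketch of what I mean by probing how it extrapolates moral reasoning: prompt GPT-3 with a dilemma and compare its verdict to a human label (e.g. the community verdict on an AITA post). The prompt wording and the older openai-python Completion interface here are illustrative assumptions, not a tested pipeline.

```python
# Rough sketch: ask the model for a verdict on a short dilemma and compare
# it to a human label. Prompt format and engine choice are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def model_verdict(dilemma: str) -> str:
    prompt = (
        "Read the situation and answer with exactly one word, "
        "'asshole' or 'not':\n\n"
        f"Situation: {dilemma}\n\nVerdict:"
    )
    resp = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=3,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip().lower()

# Example usage with a made-up dilemma and a made-up human label.
dilemma = "I ate my housemate's leftovers without asking."
print(model_verdict(dilemma), "vs human label:", "asshole")
```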
One thing they could have achieved was dataset and leaderboard creation (MSCOCO, GLUE, and ImageNet, for example). These have tended to focus and help research, and they persist in usefulness for some time, as long as they are chosen wisely.
Predicting and extrapolating human preferences is a task that is part of nearly every AI Alignment strategy. Yet we have few datasets for it; the only ones I found are https://github.com/iterative/aita_dataset and https://www.moralmachine.net/
So this hypothetical ML Engineering approach to alignment might have achieved some simp...
You mentioned that this metaphor should also include world models. I can help there.
Many world models try to predict the next state of the world given the agent's action. With curiosity-driven exploration the agent tries to explore in a way that maximizes its reduction of surprise, allowing it to learn about its effect on the world (see for example https://arxiv.org/abs/1705.05363). Why not just maximize surprise? Because we want a surprise we can learn to decrease, not just the constant surprise of a TV showing static.
This means they focus an explorati
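A toy sketch of the forward-model idea, for anyone who wants it concrete: the world model predicts the next state from (state, action), and its prediction error both trains the model and can serve as a curiosity signal. This is a simplification of the idea rather than the architecture in the linked paper; the shapes, network, and fake environment are placeholders.

```python
# Toy forward world model with a curiosity-style signal: the model predicts
# the next state from (state, action); the prediction error trains the model
# and can double as an intrinsic reward for exploration.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 4

forward_model = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, STATE_DIM),
)
opt = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def curiosity_step(state, action, next_state):
    """One update: return the prediction error (intrinsic reward) and train the model."""
    pred_next = forward_model(torch.cat([state, action], dim=-1))
    error = ((pred_next - next_state) ** 2).mean()
    opt.zero_grad()
    error.backward()
    opt.step()
    # Learnable transitions see this error shrink over time; transitions the
    # model can never learn (a TV showing static) keep it high forever, which
    # is why some variants reward the *reduction* in error rather than the error.
    return error.item()

# Fake rollout: random states/actions stand in for a real environment.
for t in range(5):
    s = torch.randn(1, STATE_DIM)
    a = torch.randn(1, ACTION_DIM)
    s_next = s + 0.1 * torch.randn(1, STATE_DIM)  # assumed toy dynamics
    print(f"t={t} intrinsic reward = {curiosity_step(s, a, s_next):.3f}")
```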
...I've been using this for meditation too, but it's interesting to see it formulated for wider application. It seems to work for me to reduce resistance. Some other comments mentioned how this mirrors how addictions seem to work. But it also mirrors how advertisements and even reading about something work.
I've given a rough first answer with some things that made me update my model of the world to think that spies are powerful and coordinated enough to keep secrets, but not competent enough to keep them forever.
Some specific learnings:
The DHS is not an intelligence agency. The fact that there's a lot of DHS-led censorship in the Twitter files but not a lot of CIA-led censorship could be an update against the CIA doing much of that.
One of the interesting aspects of that leak is that everything is s...