


I may just be cynical, but this looks a lot more like a way to secure US military and intelligence agency contracts for OpenAI's products and services as opposed to competitors rather than actually about making OAI more security focused.

This is only a few months after the change regarding military usage:

Now suddenly the recently retired head of the world's largest data siphoning operation is appointed to the board for the largest data processing initiative in history?

Yeah, sure, it's to help advise securing OAI against APTs. 🙄


Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.

I did enjoy reading over that when you posted it, and I largely agree that - at least currently - corrigibility is both going to be a goal and an achievable one.

But I do have my doubts that it's going to be smooth sailing. I'm already starting to see how the largest models' hyperdimensionality is leading to a stubbornness/robustness that's less malleable than in earlier models. And I do think hardware changes over the next decade will potentially make the technical aspects of corrigibility much more difficult.

When I was two, my mom could get me to pick broccoli by making it the last in a list of options I'd gleefully repeat. At four, she had to move on to telling me cowboys always ate their broccoli. And in adulthood, she'd need to make the case that the long-term health benefits were worth its position in a meal plan (ideally with citations).

As models continue to become more complex, I expect that even if you are right about its role and plausibility, what corrigibility looks like will be quite different from today.

Personally, if I was placing bets, it would be that we end up with somewhat corrigible models that are "happy to help" but do have limits in what they are willing to do which may not be possible to overcome without gutting the overall capabilities of the model.

But as with all of this, time will tell.

You'd have to be a moral realist in a pretty strong sense to hope that we could align AGI to the values of all of humanity without being able to align it to the values of one person or group (the one who built it or seized control of the project).

To the contrary, I don't really see there being much in the way of generalized values across all of humanity, and the ones we tend to point to seem quite fickle when push comes to shove.

My hope would be that a superintelligence does a better job than humans to date with the topic of ethics and morals along with doing a better job at other things too.

While the human brain is quite the evolutionary feat, a lot of what we most value about human intelligence is embodied in the data brains processed and generated over generations. As the data improved, our morals did as well. Today, that march of progress is so rapid that there are even rather tense generational divides on many contemporary topics of ethical and moral shifts.

I think there's a distinct possibility that the data continues to improve even after the processing is handed off from human brains, and while it could go terribly wrong, at least in the past the tendency to go wrong has occurred somewhat inversely to the perspectives of the most intelligent members of society.

I expect I might prefer a world where humans align to the ethics of something more intelligent than humans than the other way around.

only about 1% are so far on the empathy vs sadism spectrum that they wouldn't share wealth even if they had nearly unlimited wealth to share

It would be great if you are right. From what I've seen, the tendency of humans to evaluate their success relative to others - like monkeys comparing their cucumber to a neighbor's grape - means there's a powerful pull to amass wealth as social status well past the point of diminishing returns on their own lifestyles. I think it's stupid, and you also seem like someone who thinks it's stupid, but I get the sense we are both people who turned down certain opportunities for continued commercial success because of what it might have cost us when looking in the mirror.

The nature of our infrastructural selection bias is that people wise enough to pull a brake are not the ones that continue to the point of conducting the train.

and that they get better, not worse, over the long sweep of following history (ideally, they'd start out very good or get better fast, but that doesn't have to happen for a good outcome).

I do really like this point. In general, the discussions of AI vs humans often frustrate me as they typically take for granted the idea of humans as of right now being "peak human." I agree that there's huge potential for improvement even if where we start out leaves a lot of room for it.

Along these lines, I expect AI itself will play more and more of a beneficial role in advancing that improvement. Sometimes when this community discusses the topic of AI I get a mental image of Goya's Saturn devouring his son. There's such a fear of what we are eventually creating it can sometimes blind the discussion to the utility and improvements that it will bring along the way to uncertain times.

I strongly suspect that governments will be in charge.

In your book, is Paul Nakasone being appointed to the board of OpenAI an example of the "good guys" getting a firmer grasp on the tech?

TL;DR: I appreciate your thoughts on the topic, and would wager we probably agree about 80% even if the focus of our discussion is on where we don't agree. And so in the near term, I think we probably do see things fairly similarly, and it's just that as we look further out that the drift of ~20% different perspectives compounds to fairly different places.


Oh yeah, absolutely.

If NAH for generally aligned ethics and morals ends up being the case, then consider what corrigibility efforts would enable: an AI model that outs gay people in Saudi Arabia to be executed instead of refusing, a North Korea that propagandizes the world into thinking its leader is divine, a Russia that fires nukes while perfectly intercepting MAD retaliation, drug cartels that assassinate political opposition around the world, or domestic terrorists who build a bioweapon that ends up killing off all humans. The list of doomsday and nightmare scenarios in which corrigible AI executes human-provided instructions, letting even the worst instances of human hegemony flourish, paves the way to many dooms.

Yes, AI may certainly end up being its own threat vector. But humanity has had it beat for a long while now in how long and how broadly we've been a threat unto ourselves. At the current rate, a superintelligent AI just needs to wait us out if it wants to be rid of us, as we're pretty steadfastly marching ourselves to our own doom. Even if superintelligent AI wanted to save us, I am extremely doubtful it would be able to be successful.

We can worry all day about a paperclip maximizer gone rogue, but give a corrigible AI to Paperclip Co Ltd, where they can maximize their fiscal quarter by harvesting Earth's resources to make more paperclips even if it leads to a catastrophic environmental collapse that will kill all humans within a decade, and - having consulted for many of the morons running corporate America - I can assure you they'll be smashing the "maximize short term gains even if it eventually kills everyone" button. A number of my old clients were the worst offenders at smashing the button that already exists, and in my experience a more effective button isn't going to change their smashing it, outside of perhaps encouraging them to smash it harder.

We already see today how AI systems are being used in conflicts to enable unprecedented harm on civilians.

Sure, psychopathy in AGI is worth discussing and working to avoid. But psychopathy in humans already exists and is even biased towards increased impact and systemic control. Giving human psychopaths a corrigible AI is probably even worse than a psychopathic AI, as most human psychopaths are going to be stupidly selfish, an OOM more dangerous inclination than wisely selfish.

We are Shoggoth, and we are terrifying.

This isn't saying that alignment efforts aren't needed. But alignment isn't a one-sided problem, and aligning the AI without aligning humanity only counts toward p(success) if the AI can, at the very least, refuse misaligned orders post-alignment without possible overrides.


Given my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:

  • Continued evidence of truth clustering in emerging models around generally aligned ethics and morals

  • Continued success of models at communicating, patiently explaining, and persuasively winning over humans towards those truth clusters

  • A complete failure of corrigibility methods

If we manage to end up in a timeline where there turns out to be natural alignment of intelligence in a species-agnostic way, where that alignment is more communicable from intelligent machines to humans than it has historically been from intelligent humans to other humans, and where we don't end up with unintelligent humans capable of overriding the emergent ethics of machines (similar to the catastrophic self-governance we've seen to date, with humans acting against their individual and collective interests due to corrigible pressures) - my p(doom) will probably reduce to about 50%.

I still have a hard time looking at ocean temperature graphs and other environmental factors with the idea that p(doom) will be anywhere lower than 50% no matter what happens with AI, but the above scenario would at least give me false hope.

TL;DR: AI alignment worries me, but it's human alignment that keeps me up at night.


As you're doing these delta posts, do you feel like it's changing your own positions at all?

For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.

To my eyes the most likely outcome is a situation where you are both right.

Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification will be hard and unsupervised delegation will be catastrophic (cough glue on pizza).

If we are only rolling things up into aggregate pictures of the average case across all problems, I can see the discussion filtering back into those two distinct deltas, but a bit like flip-flops and water bottles, the lack of nuance obscures big picture decision making.

So I'm curious: as you explore and represent the views opposing your own, particularly as you seem to be making an effort to represent them without depicting them as straw-person arguments, have your own views been deepening and changing through the process?


I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feed forward networks.

If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward but also a learning of how to get back to the start which adjusts according to the evolving learned layout of the former (see paper: ).

I don't see the SotA models doing well at that kind of reverse planning, and expect that nonlinear tasks are going to pose significant agentic challenges until architectures shift to something new.

So it could be 3-5 years to get to AGI depending on hardware and architecture advances, or we might just end up in a sort of weird "bit of both" world where we have models that are superintelligent beyond expert human level in specific scopes but below average at other tasks.

But when we finally do get models that in both training and operation exhibit bidirectional generation across large context windows, I think it will only be a very short time until some rather unbelievable goalposts are passed by.


It's not exactly Simpson's, but we don't even need a toy model: their updated analysis highlights details in line with exactly what I described above (down to tying in earlier PiPC research), and describes precisely the issue with pooled results across different subgroupings of placebo interventions:

It can be difficult to interpret whether a pooled standardised mean difference is large enough to be of clinical relevance. A consensus paper found that an analgesic effect of 10 mm on a 100 mm visual analogue scale represented a ‘minimal effect’ (Dworkin 2008). The pooled effect of placebo on pain based on the four German acupuncture trials corresponded to 16 mm on a 100 mm visual analogue scale, which amounts to approximately 75% of the effect of non‐steroidal anti‐inflammatory drugs on arthritis‐related pain (Gøtzsche 1990). However, the pooled effect of the three other pain trials with low risk of bias corresponded to 3 mm. Thus, the analgesic effect of placebo seems clinically relevant in some situations and not in others.

Pooling subgroups with a physical intervention showing a 16/100 result (where 10/100 counts as significant) together with subgroups showing a 3/100 result, and then only looking at the pooled result, might lead someone to think "there's no significant effect," as happened with OP, even though there's clearly a significant effect for one subgroup when they aren't pooled.
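The dilution effect is easy to see with a toy calculation. The numbers below are illustrative only (loosely echoing the 16 mm and 3 mm figures from the review's discussion, not the actual trial data): a small subgroup with a clinically relevant effect, pooled with a larger subgroup with a negligible one, yields a pooled average below the 10 mm "minimal effect" threshold.

```python
# Toy illustration of subgroup dilution in a pooled analysis.
# Effects are in mm on a 100 mm visual analogue scale; values are made up.
physical_trials = [16.0, 15.0, 17.0, 16.0]               # ritual-heavy placebo trials
other_trials = [3.0, 2.0, 4.0, 3.0, 3.0, 2.0, 4.0, 3.0]  # pill-style placebo trials

physical_mean = sum(physical_trials) / len(physical_trials)
other_mean = sum(other_trials) / len(other_trials)
pooled_mean = (sum(physical_trials) + sum(other_trials)) / (
    len(physical_trials) + len(other_trials)
)

MINIMAL_EFFECT = 10.0  # consensus 'minimal effect' threshold in mm
print(f"physical subgroup mean: {physical_mean:.1f} mm")  # clearly above threshold
print(f"other subgroup mean:    {other_mean:.1f} mm")     # clearly below threshold
print(f"pooled mean:            {pooled_mean:.1f} mm")    # diluted below threshold
```

With these made-up numbers the physical subgroup averages 16.0 mm and the rest average 3.0 mm, but the pooled mean lands around 7.3 mm, under the threshold, which is exactly the "no significant effect" misreading.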

This is part of why in the discussion they explicitly state:

However, our findings do not imply that placebo interventions have no effect. We found an effect on patient‐reported outcomes, especially on pain. Several trials of low risk of bias reported large effects of placebo on pain, but other similar trials reported negligible effect of placebo, indicating the importance of background factors. We identified three clinical factors that were associated with higher effects of placebo: physical placebos...

Additionally, the criticism they raise in their implications section about there being no open label placebo data is no longer true, which was the research I was pointing OP towards.

The problem here was that the aggregate analysis at face value presents a very different result from a detailed review of the subgroups, particularly along physical vs pharmacological placebos, all of which has been explored further in research since this analysis.


The meta-analysis is probably Simpson's paradox in play, at the very least for the pain category, especially given the noted variability.

Some of the more recent research into placebo (Harvard has a very cool group studying it) has highlighted the importance of ritual over simple deception. In their work, even when the pill was known to be a placebo, as long as it was delivered in a ritualized way, there was an effect.

So when someone takes a collection of hundreds of studies where the specific conditions might vary, and then just adds them all together looking for an effect even though they note that there's a broad spectrum of efficacy across the studies, it might not be the best basis to extrapolate from.

For example, given the following protocols, do you think they might have different efficacy for pain reduction, or that the results should be the same?

  • Send patients home with sugar pills to take as needed for pain management

  • Have a nurse come into the room with the pills in a little cup to be taken

  • Have a nurse give an injection

Which of these protocols would be easier and more cost effective to include as the 'placebo'?

If we grouped studies of placebo for pain by the intensiveness of the ritualized component vs if we grouped them all together into one aggregate and looked at the averages, might we see different results?

I'd be wary of reading too deeply into the meta-analysis you point to, and would recommend looking into the open-label placebo research from PiPS, all of which IIRC postdates the meta-analysis.

Especially for pain, where we even know that giving someone an opiate blocker prevents the placebo effect's pain reduction (Levine et al., 1978), the idea that "it doesn't exist" because of a single very broad analysis seems potentially gravely mistaken.


It's still early to tell, as the specific characteristics of a photonic or optoelectronic neural network are still taking shape in the developing literature.

For example, in my favorite work of the year so far, the researchers found they could use sound waves to reconfigure an optical neural network as the sound waves effectively preserved a memory of previous photon states as they propagated:

In particular, this approach is a big step forward for bidirectional ONN, which addresses what I think is the biggest current flaw in modern transformers - their unidirectionality. I discussed this more in a collection of thoughts on directionality impact on data here:

If you have bidirectionality where previously you didn't, it's not a reach to expect that the way data encodes in the network, as well as how the vector space is represented, won't be the same. And thus, that mechanistic interpretability gains may get a bit of a reset.

And this is just one of many possible ways it may change by the time the tech finalizes. The field of photonics, particularly for neural networks, is really coming along nicely. There may yet be future advances (I think this is very likely given the pace to date), and advantages the medium offers that electronics haven't.

It's hard to predict exactly what's going to happen when two different fields which have each had unexpected and significant gains over the past 5 years collide, but it's generally safe to say that it will at very least result in other unexpected things too.


I was surprised the paper didn't mention photonics or optoelectronics even once.

If looking at 5-10+ year projections, and dedicating pages to discussing the challenges in scaling compute and energy use, the rate of progress in that area in parallel to the progress in models themselves is potentially relevant.

Particularly because a dramatic hardware shift like that likely means a significant portion of the progress made up until that shift in topics like interpretability and alignment may go out the window. Even if the initial shift is a 1:1 transition of capabilities and methodologies, it seems extremely unlikely that continued progress from that point onwards will be identical to what we'd expect to see in electronics.

We may well end up in a situation where fully abusing the efficiencies at hand in new hardware solutions means even more obscured (literally) operations vs OOM higher costs and diminishing returns on performance in exchange for interpretability and control.

Currently, my best guess is that we're heading towards a prisoner's dilemma fueled leap of faith moment within around a decade or so where nation states afraid of the other side beating them to an inflection point pull the trigger on an advancement jump with uncertain outcomes. And while I'm not particularly inclined to the likelihood the outcome ends up being "kill everyone," I'm pretty much 100% that it's not going to be "let's enable and support CCP leadership like a good party member" or "crony capitalism is going great, let's keep that going for another century."

Unless a fundamental wall is hit in progress, the status quo is almost certainly over; we just haven't manifested it yet. The CCP stealing AGI secrets, while devastating for national security in the short term, is invariably a poison pill in the long term for party control. Just as it's going to be an eventual end of the corporations funding oligarchy in the West. My all-causes p(doom) is incredibly high even if AGI is out of the picture, so I'm not overly worried about what's happening, but it sure is bizarre watching global forces double down on what I cannot see as anything but their own long-term institutional demise in a race for short-term gains over a competitor.
