simeon_c
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this? I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
keltan
I am currently completing psychological studies for credit in my university psych course. The entire time, all I can think is “I wonder if that detail is the one they’re using to trick me with?” I wonder how this impacts results. I can’t imagine being in a heightened state of looking out for deception has no impact.
A list of some contrarian takes I have:

* People are currently predictably too worried about misuse risks.
* What people really mean by "open source" vs "closed source" labs is actually "responsible" vs "irresponsible" labs, which is not affected by regulations targeting open source model deployment.
* Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
* Better information security at labs is not clearly a good thing, and if we're worried about great power conflict, probably a bad thing.
* Much research on deception (Anthropic's recent work, trojans, jailbreaks, etc.) is not targeting "real" instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
* ML robustness research (like FAR Labs' Go stuff) does not help with alignment, and helps moderately for capabilities.
* The field of ML is a bad field to take epistemic lessons from. Note I don't talk about the results from ML.
* ARC's MAD seems doomed to fail.
* People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
* People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don't change their minds on account of Scott Alexander because he's too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
* There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.

1. A non-exact term
RobertM
EDIT: I believe I've found the "plan" that Politico (and other news sources) managed to fail to link to, maybe because it doesn't seem to contain any affirmative commitments by the named companies to submit future models to pre-deployment testing by UK AISI. I've seen a lot of takes (on Twitter) recently suggesting that OpenAI and Anthropic (and maybe some other companies) violated commitments they made to the UK's AISI about granting them access for e.g. predeployment testing of frontier models.  Is there any concrete evidence about what commitment was made, if any?  The only thing I've seen so far is a pretty ambiguous statement by Rishi Sunak, who might have had some incentive to claim more success than was warranted at the time.  If people are going to breathe down the necks of AGI labs about keeping to their commitments, they should be careful to only do it for commitments they've actually made, lest they weaken the relevant incentives.  (This is not meant to endorse AGI labs behaving in ways which cause strategic ambiguity about what commitments they've made; that is also bad.)
I've been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.

I've also been thinking about deception and its relationship to "natural abstractions", and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger "magnitude" than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome. It seems potentially fruitful to use something akin to L2 regularization when projecting away components.

The most straightforward translation of the regularization would be to analogize the regression coefficient to $\frac{(f(x)-f(x-uu^Tx))u^T}{u^Tx}$, in which case the L2 term would be $\left\|\frac{(f(x)-f(x-uu^Tx))u^T}{\|u^Tx\|}\right\|^2$, which reduces to $\frac{\|f(x)-f(x-uu^Tx)\|^2}{\|u^Tx\|^2}$.

If $f(w)=P_w(o|i)$ is the probability[1] that a neural network with weights $w$ gives to an output $o$ given a prompt $i$, then when you've actually explained $o$, it seems like you'd basically have $f(w)-f(w-uu^Tw)\approx f(w)$, or in other words $P_{w-uu^Tw}(o|i)\approx 0$. Therefore I'd want to keep the regularization coefficient weak enough that I'm in that regime. In that case, the L2 term would basically reduce to minimizing $\frac{1}{\|u^Tw\|^2}$, or in other words maximizing $\|u^Tw\|^2$. Realistically, both this and $P_{w-uu^Tw}(o|i)\approx 0$ are probably achieved when $u=\frac{w}{\|w\|}$, which on the one hand is sensible ("the reason for the network's output is because of its weights") but on the other hand is too trivial to be interesting.

In regression, eigendecomposition gives us more gears, because L2-regularized regression is basically changing the regression coefficients for the principal components by $\frac{\lambda}{\lambda+\alpha}$, where $\lambda$ is the variance of the principal component and $\alpha$ is the regularization coefficient. So one can consider all the principal components ranked by $\beta\frac{\lambda}{\lambda+\alpha}$ to get a feel for the gears driving the regression. When $\alpha$ is small, as it is in our regime, this ranking is of course the same order as the one you get from $\beta\lambda$, the covariance between the PCs and the dependent variable.

This suggests that if we had a change of basis for $w$, one could obtain a nice ranking of it. Though this is complicated by the fact that $f$ is not a linear function and therefore we have no equivalent of $\beta$. To me, this makes it extremely tempting to use the Hessian eigenvectors $V$ as a basis, as this is the thing that at least makes each of the inputs to $f$ "as independent as possible". Though rather than ranking by the eigenvalues of $H_f(w)$ (which we'd actually prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of $w$ onto $V$ (which represent "the extent to which $w$ includes this Hessian component").

In summary, if $H_w P_w(o|i) = V\Lambda V^T$, then we can rank the importance of each component $V_j$ by $(P_{w-V_jV_j^Tw}(o|i)-P_w(o|i))V_j^Tw$.

Maybe I should touch grass and start experimenting with this now, but there are still two things that I don't like:

* There's a sense in which I still don't like using the Hessian, because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I've considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
* If we're doing the whole Hessian thing, then we're modelling $f$ as quadratic, yet $f(x+\delta x)-f(x)$ seems like an attribution method that's more appropriate when modelling $f$ as ~linear. I don't think I can just switch all the way to quadratic models, because realistically $f$ is more likely to be sigmoidal-quadratic, and for large steps $\delta x$ the changes to a sigmoidal-quadratic function are better modelled by $f(x+\delta x)-f(x)$ than by some quadratic thing. But ideally I'd have something smarter...

1. ^ Normally one would use log probs, but for reasons I don't want to go into right now, I'm currently looking at probabilities instead.
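For the ridge-regression claim above ("changing the regression coefficients for the principal components by $\frac{\lambda}{\lambda+\alpha}$"), here is the standard derivation (my addition, not from the post). Writing $X^\top X = V\Lambda V^\top$:

$$\hat\beta_{\text{ridge}} = (X^\top X + \alpha I)^{-1} X^\top y = V(\Lambda + \alpha I)^{-1} V^\top X^\top y, \qquad \hat\beta_{\text{OLS}} = V\Lambda^{-1} V^\top X^\top y,$$

so in the principal-component basis $V$, the $j$-th ridge coefficient is the OLS coefficient scaled by $\frac{\lambda_j}{\lambda_j+\alpha}$.

And a minimal sketch of the final ranking in NumPy (my own toy illustration, not code from this post): the sigmoidal-quadratic `f`, the finite-difference Hessian, and the 3-dimensional "weight" vector are all made-up stand-ins, chosen only so the score $(P_{w-V_jV_j^Tw}(o|i)-P_w(o|i))V_j^Tw$ can be computed end to end:

```python
import numpy as np

def f(w: np.ndarray) -> float:
    """Stand-in for P_w(o|i): a smooth, sigmoidal-quadratic scalar function of the 'weights' w."""
    A = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
    return float(1.0 / (1.0 + np.exp(-(w @ A @ w))))

def hessian(fn, w: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Finite-difference Hessian of fn at w (fine for a toy; a real network needs autodiff HVPs)."""
    n = w.size
    H = np.zeros((n, n))
    step = np.eye(n) * eps
    for i in range(n):
        for j in range(n):
            H[i, j] = (fn(w + step[i] + step[j]) - fn(w + step[i] - step[j])
                       - fn(w - step[i] + step[j]) + fn(w - step[i] - step[j])) / (4 * eps**2)
    return H

w = np.array([1.0, -0.5, 0.8])
eigvals, V = np.linalg.eigh(hessian(f, w))   # columns of V are the Hessian eigenvectors

# Score each eigenvector V_j by (f(w - V_j V_j^T w) - f(w)) * (V_j^T w):
# how much deleting that component of w changes the output, weighted by how much
# of w lies along that component.
scores = []
for j in range(V.shape[1]):
    v = V[:, j]
    w_ablated = w - v * (v @ w)              # project the V_j component out of w
    scores.append((f(w_ablated) - f(w)) * (v @ w))

for j in np.argsort(np.abs(scores))[::-1]:
    print(f"component {j}: eigenvalue {eigvals[j]:+.4f}, score {scores[j]:+.6f}")
```

For an actual network one would of course not form the Hessian densely; Hessian-vector products via autodiff plus a Lanczos-style top-k eigendecomposition would be the natural substitute, with $f$ being the model's probability of the output.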

Recent Discussion

People sometimes ask me if I think quantum computing will impact AGI development, and my usual answer has been that it likely won't play much of a role. However, photonics likely will.

Photonics was one of the subfields I studied and worked in while I was in academia doing physics.

In the context of deep learning, photonics focuses on using light (photons) for efficient computation in neural networks, while quantum computing uses quantum-mechanical properties for computation.

There is some overlap between quantum computing and photonics, which can sometimes be confusing. There's even a subfield called Quantum Photonics, which merges the two. However, they are two distinct approaches to computing.

I'll go into more detail later, but OpenAI recently hired someone who, at PsiQuantum, worked on "designing a...

bhauth
In practice, no, they can't. Optical transistors are less efficient. Analog matrix multiplies using light are less efficient. There's no recent lab-scale approach that's more efficient than semiconductor transistors either.
jacquesthibs
I agree, I meant that this is the promise (which has yet to be realized).
bhauth

OK. Why would you consider it a realistic enough prospect to study or write this post? I know there were people doing analog multiplies with light absorption, but even 8-bit analog data transmission with light uses more energy than an 8-bit multiply with transistors. The physics of optical transistors don't seem compatible with lower energy than electrons. What hope do you think there is?

Lee_0505
Scott Aaronson (Quantum computing expert) also works at OpenAI. But I don't think he's doing any quantum computer-related research there. As far as I know, he's on the superalignment team.

When I play live I have a bunch of instruments, including:

I also have some effects, primarily a talkbox and an audio-to-audio synth pedal. Normally I route the mandolin into the effects, but I've recently been wanting more options:

  • The computer effects are a lot of fun routed through the talkbox.

  • If I set the bass whistle to emit just a sine wave, and pipe that into the synth pedal, I can control professionally-designed sounds:

The thing that makes this tricky is that I want to be able to play mandolin direct (which goes via the talkbox output) at the same time as playing bass whistle (which goes via the pedals output). I sketched a lot of options:

And eventually realized...

It might be good on the current margin to have a norm of publicly listing any non-disclosure agreements you have signed (e.g. on one's LW profile), and the rough scope of them, so that other people can model what information you're committed to not sharing, and highlight if it is related to anything beyond the details of technical research being done (e.g. if it is about social relationships or conflicts or criticism).

I have added the one NDA that I have signed to my profile.

lePAN6517
Can you speak to any, let's say, "hypothetical" specific concerns that somebody who was in your position at a company like OpenAI might have had that would cause them to quit in a similar way to you?
robo
I appreciate that you are not speaking loudly if you don't yet have anything loud to say.
habryka
My understanding is that the extent of NDAs can differ a lot between different implementations, so it might be hard to speak in generalities here. From the revealed behavior of people I poked here who have worked at OpenAI full-time, the OpenAI NDAs seem very comprehensive and limiting. My guess is also the NDAs for contractors and for events are a very different beast and much less limiting.  Also just the de-facto result of signing non-disclosure-agreements is that people don't feel comfortable navigating the legal ambiguity and default very strongly to not sharing approximately any information about the organization at all. Maybe people would do better things here with more legal guidance, and I agree that you don't generally seem super constrained in what you feel comfortable saying, but like I sure now have run into lots of people who seem constrained by NDAs they signed (even without any non-disparagement component). Also, if the NDA has a gag clause that covers the existence of the agreement, there is no way to verify the extent of the NDA, and that makes navigating this kind of stuff super hard and also majorly contributes to people avoiding the topic completely. 

The curious tale of how I mistook my dyslexia for stupidity - and talked, sang, and drew my way out of it. 

Sometimes I tell people I’m dyslexic and they don’t believe me. I love to read, I can mostly write without error, and I’m fluent in more than one language.

Also, I don’t actually technically know if I’m dyslectic cause I was never diagnosed. Instead I thought I was pretty dumb but if I worked really hard no one would notice. Later I felt inordinately angry about why anyone could possibly care about the exact order of letters when the gist is perfectly clear even if if if I right liike tis.

I mean, clear to me anyway.

I was 25 before it dawned on me that all the tricks...

Seth Herd
As I understand it from some cog psych/ linguistics class (it's not my area but this makes sense WRT brain function), the problem with subvocalizing is that it limits your reading speed to approximately the rate you can talk. So most skilled readers have learned to disconnect from subvocalizing. Part of the training for speedreading is to make sure you're not subvocalizing at all, and this helped me learn to speedread. I turn on subvocalizing sometimes when reading poetry or lyrical prose, or sometimes when I'm reading slowly to make damned sure I understand something, or remember its precise phrasing.
keltan

I’ve got a few questions.

  1. What is “WRT brain function”?
  2. How does someone train themself out of subvocalising?
  3. If you think critically, has speed reading actually increased your learning rate for semantic knowledge?
  4. Most things have downsides, what are the downsides of speed reading?
  5. What are your Words Per Minute (WPM)?
  6. Did you test WPM before learning speed reading?
  7. If this was an RPG, what level do you think you are in speed reading from 1-100?
  8. How long did it take you to reach your current level in this skill?

Sorry that’s a lot of questions. I’ve bee...

Lorxus
Maybe I'm just weird, but I totally do sometimes subvocalize, but incredibly quickly. Almost clipped or overlapping to an extent, in a way that can only really work inside your head? And that way it can go faster than you can physically speak. Why should your mental voice be limited by the limits of physical lips, tongue, and glottis, anyway?
Shoshannah Tekofsky
Oh interesting! Maybe I'm wrong. I'm more curious about something like a survey on the topic now.

1. If you find that you’re reluctant to permanently give up on to-do list items, “deprioritize” them instead

I hate the idea of deciding that something on my to-do list isn’t that important, and then deleting it off my to-do list without actually doing it. Because once it’s off my to-do list, then quite possibly I’ll never think about it again. And what if it’s actually worth doing? Or what if my priorities will change such that it will be worth doing at some point in the future? Gahh!

On the other hand, if I never delete anything off my to-do list, it will grow to infinity.

The solution I’ve settled on is a priority-categorized to-do list, using a kanban-style online tool (e.g. Trello). The left couple columns (“lists”) are very active—i.e., to-do list...

Yeah most of the time I’ll open my to-do list and just look at one of the couple very leftmost columns, and the column has maybe 3 items, and then I’ll pick one and do it (or pick a few and schedule them for that same day).

Occasionally I’ll look at a column farther to the right, and see if any ought to be moved left or right. The further right, the less often I’m checking.

Most people avoid saying literally false things, especially if those could be audited, like making up facts or credentials. The reasons for this are both moral and pragmatic — being caught out looks really bad, and sustaining lies is quite hard, especially over time. Let’s call the habit of not saying things you know to be false ‘shallow honesty’[1].

Often when people are shallowly honest, they still choose what true things they say in a kind of locally act-consequentialist way, to try to bring about some outcome. Maybe something they want for themselves (e.g. convincing their friends to see a particular movie), or something they truly believe is good (e.g. causing their friend to vote for the candidate they think will be better for the country).

Either way, if you...

Might be an uncharitable read of what's being recommended here. In particular, it might be worth revisiting the section that details what Deep Honesty is not. There's a large contingent of folks online who self-describe as 'borderline autistic', and one of their hallmark characteristics is blunt honesty, specifically the sort that's associated with an inability to pick up on ordinary social cues. My friend group is disproportionately comprised of this sort of person. So I've had a lot of opportunity to observe a few things about how honesty works.

Speaking ...

Roger Scott
Given only finite time, isn't one always omitting nearly everything? If you believe in dishonesty by omission, is everyone not dishonest, in that sense, nearly all the time? You can argue that only "relevant" information is subject to non-omission, but since relevance is a subjective, and continuous, property this doesn't seem like very useful guidance. Wherever you choose to draw the line, someone can reasonably claim you've omitted relevant (by some other standard) information just on the other side of that line.
Roger Scott
It seems like this discussion might cover power imbalances between speaker and listener more. For example, in the border agent example, a border control agent has vastly more power than someone trying to enter the country. This power gives them the "right" (read: authority) to ask all sorts of questions, the legitimacy of which might be debatable. Does deep honesty compel you to provide detailed, non-evasive answers to questions you personally don't believe the interlocutor has any business asking you? The objective of such interactions is not improving the accuracy of the border agent's worldview, and it is unlikely that anything you say is going to alter that worldview. It seems like there are many situations in life where you have little choice but to interact with someone, but the less you tell them the better. There's a reason witnesses testifying in a courtroom are advised to answer "yes" or "no" whenever possible, rather than expounding.
habryka
Promoted to curated: I sure tend to have a lot of conversations about honesty and integrity, and this specific post was useful in 2-3 conversations I've had since it came out. I like having a concept handle for "trying to actively act with an intent to inform", I like the list of concrete examples of the above, and I like how the post situates this as something with benefits and drawbacks (while also not shying away too much from making concrete recommendations on what would be better on the margin).

One of the primary concerns when attempting to control an AI of human-or-greater capabilities is that it might be deceitful. It is, after all, fairly difficult for an AI to succeed in a coup against humanity if the humans can simply regularly ask it "Are you plotting a coup? If so, how can we stop it?" and be confident that it will give them non-deceitful answers! 

TL;DR LLMs demonstrably learn deceit from humans. Deceit is a fairly complex behavior, especially over an extended period: you need to reliably come up with plausible lies, which preferably involves modeling the thought processes of those you wish to deceive, and also keep the lies an internally consistent counterfactual, yet separate from your real beliefs. As the quotation goes, "Oh what a...

I also wonder how much interpretability LM agents might help here, e.g. since they could make it much cheaper to scale the 'search' to many different undesirable kinds of behaviors.

LessOnline & Manifest Summer Camp

June 3rd to June 7th

Between LessOnline and Manifest, stay for a week of experimental events, chill coworking, and cozy late night conversations.

Prices rise by $100 on May 13th