All of Hoagy's Comments + Replies

There had been various clashes between Altman and the board. We don’t know what all of them were. We do know the board felt Altman was moving too quickly, without sufficient concern for safety, with too much focus on building consumer products, while founding additional other companies. ChatGPT was a great consumer product, but supercharged AI development counter to OpenAI’s stated non-profit mission.

Does anyone have proof of the board's unhappiness about speed, lack of safety concern and disagreement with founding other companies? All seem plausible but I have seen basically nothing concrete.

Could you elaborate on what it would mean to demonstrate 'savannah-to-boardroom' transfer? Our architecture was selected for in the wilds of nature, not our training data. To me it seems that when we use an architecture designed for language translation for understanding images we've demonstrated a similar degree of transfer.

I agree that we're not yet there on sample efficient learning in new domains (which I think is more what you're pointing at) but I'd like to be clearer on what benchmarks would show this. For example, how well GPT-4 can integrate a new domain of knowledge from (potentially multiple epochs of training on) a single textbook seems a much better test and something that I genuinely don't know the answer to.

Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I'd be very keen to know what it could do!

Yes, that makes a lot of sense that linearity would come hand in hand with generalization. I'd recently been reading Krotov on non-linear Hopfield networks but hadn't made the connection. They say that they're planning on using them to create more theoretically grounded transformer architectures, and your comment makes me think that these wouldn't succeed, but then the article also says:

This idea has been further extended in 2017 by showing that a careful choice of the activation function can even lead to an exponential memory storage capacity. Importantly,

... (read more)

Reposting from a shortform post but I've been thinking about a possible additional argument that networks end up linear that I'd like some feedback on:

the tldr is that overcomplete bases necessitate linear representations

  • Neural networks use overcomplete bases to represent concepts. Especially in vector spaces without non-linearity, such as the transformer's residual stream, there are just many more things that are stored in there than there are dimensions, and as Johnson Lindenstrauss shows, there are exponentially many almost-orthogonal directions to st
... (read more)
This is an interesting idea. I feel this also has to be related to increasing linearity with scale and generalization ability -- i.e. if you have a memorised solution, then nonlinear representations are fine because you can easily tune the 'boundaries' of the nonlinear representation to precisely delineate the datapoints (in fact the nonlinearity of the representation can be used to strongly reduce interference when memorising, as is done in the recent research on modern Hopfield networks). On the other hand, if you require a kind of reasonably large-scale smoothness of the solution space, as you would expect from a generalising solution in a flat basin, then this cannot work and you need to accept interference between nearly orthogonal features as the cost of preserving generalisation of the behaviour across many different inputs which activate the same vector.

There's an argument that I've been thinking about which I'd really like some feedback or pointers to literature on:

the tldr is that overcomplete bases necessitate linear representations

  • Neural networks use overcomplete bases to represent concepts. Especially in vector spaces without non-linearity, such as the transformer's residual stream, there are just many more things that are stored in there than there are dimensions, and as Johnson Lindenstrauss shows, there are exponentially many almost-orthogonal directions to store them in (of course, we can't ass
... (read more)
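
The claim that a space holds many more almost-orthogonal directions than it has dimensions can be checked numerically. The following is a minimal sketch, not part of the original argument: it samples random unit vectors (random directions, not learned features) and reports how close to orthogonal they are. The sizes `N_VECTORS` and `DIM` and the helper names are assumptions chosen only to keep the example fast.

```python
import math
import random

N_VECTORS = 300  # assumed: number of directions, deliberately larger than DIM
DIM = 128        # assumed: dimensionality of the toy "residual stream"


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via normalised Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_stats(n_vectors, dim, seed=0):
    """Return (mean, max) of |cosine similarity| over all distinct pairs."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    cosines = [
        abs(sum(a * b for a, b in zip(u, v)))
        for i, u in enumerate(vecs)
        for v in vecs[i + 1:]
    ]
    return sum(cosines) / len(cosines), max(cosines)


mean_cos, max_cos = cosine_stats(N_VECTORS, DIM)
print(f"{N_VECTORS} directions in {DIM} dims: "
      f"mean |cos| = {mean_cos:.3f}, max |cos| = {max_cos:.3f}")
```

The typical overlap between two random directions shrinks roughly like 1/sqrt(DIM), which is why the number of directions that stay within a fixed interference tolerance can grow so quickly with dimension.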

See e.g. "So I think backpropagation is probably much more efficient than what we have in the brain." from

More generally, I think the belief that there's some kind of important advantage that cutting edge AI systems have over humans comes more from human-AI performance comparisons e.g. GPT-4 way outstrips the knowledge about the world of any individual human in terms of like factual understanding (though obv deficient in other ways) with probably 100x less params. A bioanchors based model of AI... (read more)

Not totally sure but I think it's pretty likely that scaling gets us to AGI, yeah. Or more particularly, gets us to the point of AIs being able to act as autonomous researchers or act as high (>10x) multipliers on the productivity of human researchers, which seems like the key moment of leverage for deciding how the development of AI will go.

Don't have a super clean idea of what self-reflective thought means. I see that e.g. GPT-4 can often say something, think further about it, and then revise its opinion. I would expect a little bit of extra reasoning quality and general competence to push this ability a lot further.

The point that you brought up seemed to rest a lot on Hinton's claims, so it seems that his opinions on timelines and AI progress should be quite important. Do you have any recent source on his claims about AI progress?

Answer by Hoagy, Sep 27, 2023

1 line summary is that NNs can transmit signals directly from any part of the network to any other, while the brain has to work only locally.

More broadly I get the sense that there's been a bit of a shift in at least some parts of theoretical neuroscience from understanding how we might be able to implement brain-like algorithms to understanding how the local algorithms that the brain uses might be able to approximate backprop, suggesting that artificial networks might have an easier time than the brain and so it would make sense that we could make something w... (read more)

So in your model how much of the progress to AGI can be made just by adding more compute + more data + working memory + algorithms that 'just' keep up with the scaling? Specifically, do you think that self-reflective thought already emerges from adding those?

Hi Scott, thanks for this!

Yes I did do a fair bit of literature searching (though maybe not enough tbf) but very focused on sparse coding and approaches to learning decompositions of model activation spaces rather than approaches to learning models which are monosemantic by default which I've never had much confidence in, and it seems that there's not a huge amount beyond Yun et al's work, at least as far as I've seen.

Still though, I've seen almost none of these, which suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background on attempts to make more interpretable models.

Cheers, I did see that and wondered whether still to post the comment but I do think that having a gigantic company owning a large chunk and presumably a lot of leverage over the company is a new form of pressure so it'd be reassuring to have some discussion of how to manage that relationship.

Didn't Google previously own a large share? So now there are 2 gigantic companies owning a large share, which makes me think each has much less leverage, as Anthropic could get further funding from the other.

Yeah, I agree that that's a reasonable concern, but I'm not sure what they could possibly discuss about it publicly. If the public, legible, legal structure hasn't changed, and the concern is that the implicit dynamics might have shifted in some illegible way, what could they say publicly that would address that? Any sort of "Trust us, we're super good at managing illegible implicit power dynamics." would presumably carry no information, no?

Would be interested to hear from Anthropic leadership about how this is expected to interact with previous commitments about putting decision making power in the hands of their Long-Term Benefit Trust.

I get that they're in some sense just another minority investor but a trillion-dollar company having Anthropic be a central plank in their AI strategy with a multi-billion investment and a load of levers to make things difficult for the company (via AWS) is a step up in the level of pressure to aggressively commercialise.

From the announcement, they said:

As part of the investment, Amazon will take a minority stake in Anthropic. Our corporate governance remains unchanged and we’ll continue to be overseen by the Long Term Benefit Trust, in accordance with our Responsible Scaling Policy.

Hi Charlie, yep it's in the paper - but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited - still on the order of a million examples but 10-50x less than went into the autoencoders.

It's not clear whether the extra data would provide much signal since it can't learn an overcomplete basis and so has no way of learning rare features but it might be able to outperform our ICA baseline presented here, so if you wanted to give someone a project of making that available, I'd be interested to see it!

It's the same training datums I would look at to resolve an ambiguous case.

seems like it'd be better formatted as a nested list given the volume of text

Nathan Young (3mo)
Maybe, but only because Lesswrong doesn't let you have wide tables.

Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?

I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.

A useful model for why it's both appealing and difficult to say 'Doomers and Realists are both against dangerous AI and for safety - let's work together!'.

Adam David Long (4mo)
yes, this has been very much on my mind: if this three-sided framework is useful/valid, what does it mean for the possibility of the different groups cooperating? I suspect that the depressing answer is that cooperation will be a big challenge and may not happen at all. Especially as to questions such as "is the European AI Act in its present form a good start or a dangerous waste of time?" It strikes me that each of the three groups in the framework will have very strong feelings on this question:

  • realists: yes, because, even if it is not perfect, it is at least a start on addressing important issues like invasion of privacy.
  • boosters: no, because it will stifle innovation
  • doomers: no, because you are looking under the lamp post where the light is better, rather than addressing the main risk, which is existential risk.

O O (4mo)

AI realism also risks a Security theater that obscures existential risks of AI.

Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?

It's not PCA but we've been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).

We've found that they're on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation pat... (read more)

Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.

We've done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don't explain 100% of the variance).

We'd be really interested to see how this compares to ne... (read more)

Thank you Hoagy. Expanding beyond the neuron unit is a high priority. I'd like to work with you, Logan Riggs, and others to figure out a good way to make this happen in the next major update so that people can easily view, test, and contribute. I'm now creating a new channel on the discord (#directions) to discuss this, or I'll DM you my email if you prefer that.

Link at the top doesn't work for me

Thank you! I've sorted that now!! Please let me know if you have any other feedback!!

I still don't quite see the connection - if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?

Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?

Bogdan Ionut Cirstea (4mo)
Yes, roughly (the next comment is supposed to make the connection clearer, though also more speculative); RLHF / supervised fine-tuned models would correspond to 'more mode-collapsed' / narrower mixtures of simulacra here (in the limit of mode collapse, one fine-tuned model = one simulacrum).

For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.

What does this mean?

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.
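
The recursive accounting described above can be sketched in a few lines. This is a hypothetical illustration, not the proposal's actual mechanism: the `TrainingRun` class, its field names, and the compute figures are made up for the example, and counting a shared ancestor only once is an assumption the quoted text does not address.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingRun:
    """Hypothetical record of one training run and the runs whose outputs it used."""
    name: str
    compute: float  # compute spent in this run alone (arbitrary units)
    inputs: list["TrainingRun"] = field(default_factory=list)


def total_compute(run, _seen=None):
    """Sum this run's compute with that of all transitive input runs.

    Assumption: an ancestor reachable via several paths is counted once.
    """
    if _seen is None:
        _seen = set()
    if id(run) in _seen:
        return 0.0
    _seen.add(id(run))
    return run.compute + sum(total_compute(parent, _seen) for parent in run.inputs)


# Made-up numbers: X is the foundation run, Y finetunes X, Z finetunes Y.
run_x = TrainingRun("Training Run X", compute=100.0)
run_y = TrainingRun("Training Run Y", compute=10.0, inputs=[run_x])
run_z = TrainingRun("Training Run Z", compute=1.0, inputs=[run_y])

print(total_compute(run_z))  # compute counted against Training Run Z's limit
```

Under this reading, Training Run Z is charged for its own compute plus that of Y and X, so a small finetune of a large foundation model still counts as a large run.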

Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can't change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.

I don't understand why standard RL algorithms in the basal ganglia wouldn't work. Like, most RL problems have elements that can be viewed as homeostatic - if you're playing boxcart then you need to go left/right depending on position. Why can't that generalise to seeking food iff stomach is empty? Optimizing for a speci... (read more)

This is definitely possible and is essentially augmenting the state variables with additional homeostatic variables and then learning policies on the joint state space. However there are some clever experiments such as the linked Morrison and Berridge one demonstrating that this is not all that is going on -- specifically many animals appear to be able to perform zero-shot changes in policy when rewards change even if they have not experienced this specific homeostatic variable before -- i.e. mice suddenly chase after salt water which they previously disliked when put in a state of salt deprivation which they had never before experienced.

On first glance I thought this was too abstract to be a useful plan but coming back to it I think this is promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You have training loops using synthetic data which can train for all of these forms of consistency, probably implementable in an MVP with current systems.

The main unknown would be detecting when you feel confident enough in the alignment of its ... (read more)

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

In particular, I'm surprised by the method of adding the activations that was chosen because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work, super interesting to me that it does.

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

  • Take multiple pairs of prompts that differ primari
... (read more)

Yeah I agree it's not in human brains, not really disagreeing with the bulk of the argument re brains but just about whether it does much to reduce foom %. Maybe it constrains the ultra fast scenarios a bit but not much more imo.

"Small" (ie << 6 OOM) jump in underlying brain function from current paradigm AI -> Gigantic shift in tech frontier rate of change -> Exotic tech becomes quickly reachable -> YudFoom

The key thing I disagree with is:

In some sense the Foom already occurred - it was us. But it wasn't the result of any new feature in the brain - our brains are just standard primate brains, scaled up a bit[14] and trained for longer. Human intelligence is the result of a complex one time meta-systems transition: brains networking together and organizing into families, tribes, nations, and civilizations through language. ... That transition only happens once - there are not ever more and more levels of universality or linguistic programmability. AGI does

... (read more)
To expand on the idea of meta-systems and their capability: Similarly to discussing brain efficiency, we could ask about the efficiency of our civilization (in the sense of being able to point its capability to a unified goal), among all possible ways of organising civilisations. If our civilisation is very inefficient, AI could figure out a better design and foom that way.

Primarily, I think the question of our civilization's efficiency is unclear. My intuition is that our civilization is quite inefficient, with the following points serving as weak evidence:

  1. Civilization hasn't been around that long, and has therefore not been optimised much.
  2. The point (1) gets even more pronounced as you go from "designs for cooperation among a small group" to "designs for cooperation among millions", or even billions. (Because fewer of these were running in parallel, and for a shorter time.)
  3. The fact that civilization runs on humans, who are selfish etc, might severely limit the space of designs that have been tried.
  4. As a lower bound, it seems that something like Yudkowsky's ideas about dath ilan might work. (Not to be mistaken with "we can get there from here", "works for humans", or "none of Yudkowsky's ideas have holes in them".)

None of this contradicts your arguments, but it adds uncertainty and should make us more cautious about AI. (Not that I interpret the post as advocating against caution.)
Yes in the sense that if you zoom in you'll see language starting with simplistic low bit rate communication and steadily improving, followed by writing for external memory, printing press, telecommunication, computers, etc etc. Noosphere to technosphere. But those improvements are not happening in human brains, they are cybernetic externalized.

I think strategically, only automated and black-box approaches to interpretability make practical sense to develop now.

Just on this, I (not part of SERI MATS but working from their office) had a go at a basic 'make ChatGPT interpret this neuron' system for the interpretability hackathon over the weekend. (GitHub)

While it's fun, and managed to find meaningful correlations for 1-2 neurons / 50, the strongest takeaway for me was the inadequacy of the paradigm 'what concept does neuron X correspond to'. It's clear (no surprise, but I'd never had it shoved in m... (read more)

Roman Leventov (7mo)
Yes, I agree that automated interpretability should be based on scientific theories of DNNs, of which there are many already, and which should be weaved together with existing mech.interp (proto) theories and empirical observations. Thanks for the pointers!

Agree that it's super important, would be better if these things didn't exist but since they do and are probably here to stay, working out how to leverage their own capability to stay aligned rather than failing to even try seems better (and if anyone will attempt a pivotal act I imagine it will be with systems such as these).

Only downside I suppose is that these things seem quite likely to cause an impactful but not fatal warning shot which could be net positive, v unsure how to evaluate this consideration.

I've not noticed this but it'd be interesting if true as it seems that the tuning/RLHF has managed to remove most of the behaviour where it talks down to the level of the person writing as evidenced by e.g. spelling mistakes. Should be easily testable too.

Moore's law is a doubling every 2 years, while this proposes doubling every 18 months, so pretty much what you suggest (not sure if you were disagreeing tbh but seemed like you might be?)

Ah, good point!

0.2 OOMs/year is equivalent to a doubling time of 8 months.

I think this is wrong: an 8-month doubling time is nearly 8 doublings in 5 years, whereas 0.2 OOMs/year is one OOM (log2(10) ≈ 3.3 doublings) in 5 years, so it should instead be a doubling every 5 / log2(10) ≈ 1.5 years.
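
The conversion between OOMs/year and doubling time can be checked directly. This is a minimal sketch; the function names are illustrative and not from the post being discussed.

```python
import math


def doubling_time_years(ooms_per_year):
    """Years per doubling for growth of `ooms_per_year` orders of magnitude per year."""
    # One doubling is log10(2) ~= 0.301 OOMs.
    return math.log10(2) / ooms_per_year


def ooms_per_year(doubling_time_in_years):
    """Inverse conversion: orders of magnitude per year for a given doubling time."""
    return math.log10(2) / doubling_time_in_years


target_months = doubling_time_years(0.2) * 12  # the 0.2 OOMs/year target, in months
rate_at_8_months = ooms_per_year(8 / 12)       # growth rate an 8-month doubling implies
doublings_in_5_years = 5 / (8 / 12)            # doublings in 5 years at that pace

print(f"0.2 OOMs/year -> doubling every {target_months:.1f} months")
print(f"8-month doubling -> {rate_at_8_months:.2f} OOMs/year, "
      f"{doublings_in_5_years:.1f} doublings in 5 years")
```

So 0.2 OOMs/year corresponds to a doubling time of about 18 months, while an 8-month doubling time would be more than twice that growth rate.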

I think pushing GPT-4 out to 2029 would be a good level of slowdown from 2022, but assuming that we could achieve that level of impact, what's the case for having a fixed exponential increase? Is it to let off some level of 'steam' in the AI industry? So that we can still get AGI in our lifetimes? To make it seem more reasonable to polic... (read more)

Cleo Nardo (8mo)
Yep, thanks! 0.2 OOMs/year is equivalent to a doubling time of 18 months. I think that was just a typo.
Cleo Nardo (8mo)
The 0.2 OOMs/year target would be an effective moratorium until 2029, because GPT-4 overshot the target.


  • Seems like useful work.
  • With RLHF I understand that when you push super hard for high reward you end up with nonsense results so you have to settle for quantilization or some such relaxation of maximization. Do you find similar things for 'best incorporates the feedback'?
  • Have we really pushed the boundaries of what language models giving themselves feedback is capable of? I'd expect SotA systems are sufficiently good at giving feedback, such that I wouldn't be surprised that they'd be capable of performing all steps, including the human feedback, i
... (read more)

OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.

This may well be true - but it's hard to be a researcher focusing on this problem directly unless you have access to the ability to train near-cutting edge models. Otherwise you're going to have to work on toy models, theory, or a totally different angle.

I've personally applied for the DeepMind scalable alignment team - they had a fixed, small available headcount which they filled with other people who I'm sure were bette... (read more)

Your first link is broken :)

My feeling with the posts is that given the diversity of situations for people who are currently AI safety researchers, there's not likely to be a particular key set of understandings such that a person could walk into the community as a whole and know where they can be helpful. This would be great but being seriously helpful as a new person without much experience or context is just super hard. It's going to be more like here are the groups and organizations which are doing good work, what roles or other things do they need now... (read more)

Hey Hoagy, thanks for replying, I really appreciate it! I fixed that link, thanks for pointing it out. Here is a quick response to some of your points:

My feeling with the posts is that given the diversity of situations for people who are currently AI safety researchers, there's not likely to be a particular key set of understandings such that a person could walk into the community as a whole and know where they can be helpful.

I tend to feel that things could be much better with little effort. As an analogy, consider the difference between trying to pick an AI safety project to work on now, versus before we had curation and evaluation posts like this. I'll note that those posts seem very useful but they are now almost a year out of date and were only ever based on a small set of opinions. It wouldn't be hard to have something much better. Similarly, I think that there is room for a lot more of this "coordination work" here and lots of low-hanging fruit in general.

It's going to be more like here are the groups and organizations which are doing good work, what roles or other things do they need now, and what would help them scale up their ability to produce useful work.

This is exactly what I want to know! From my perspective effective movement builders can increase contributors, contributions, and coordination within the AI Safety community, by starting, sustaining, and scaling useful projects. Relatedly, I think that we should ideally have some sort of community consensus gathering process to figure out what is good and bad movement building (e.g., who are the good/bad groups, and what do the collective set of good groups need). The shared language stuff and all of what I produced in my post is mainly a means to that end. I really just want to make sure that before I survey the community to understand who wants what and why, there is some sort of standardised understanding and language about movement building so that people don't just write it off as a

Hmm, yeah there's clearly two major points:

  1. The philosophical leap from voltages to matrices, i.e. allowing that a physical system could ever be 'doing' high level description X. This is a bit weird at first but also clearly true as soon as you start treating X as having a specific meaning in the world as opposed to just being a thing that occurs in human mind space.
  2. The empirical claim that this high level description X fits what the computer is doing.

I think the pushback to the post is best framed in terms of which frame is best for talking to people who deny... (read more)

Maybe worth thinking about this in terms of different examples:

  • NN detecting the presence of tanks just by the brightness of the image (possibly apocryphal - Gwern)
  • NN recognising dogs vs cats as part of an image net classifier that would class a piece of paper with 'dog' written on as a dog
  • GPT-4 able to describe an image of a dog/cat in great detail
  • Computer doing matrix multiplication.

The range of cases in which the equivalence between what the computer is doing and our high level description holds increases as we go down this list, and depend... (read more)

Cleo Nardo (8mo)
Yeah, I broadly agree. My claim is that the deep metaphysical distinction is between "the computer is changing transistor voltages" and "the computer is multiplying matrices", not between "the computer is multiplying matrices" and "the computer is recognising dogs". Once we move to a language game in which "the computer is multiplying matrices" is appropriate, then we are appealing to something like the X-Y Criterion for assessing these claims. The sentences are more true the tighter the abstraction is:

  • The machine does X with greater probability.
  • The machine does X within a larger range of environments.
  • The machine has fewer side effects.
  • The machine is more robust to adversarial inputs.
  • Etc

But SOTA image classifiers are better at recognising dogs than humans are, so I'm quite happy to say "this machine recognises dogs". Sure, you can generate adversarial inputs, but you could probably do that to a human brain as well if you had an upload.

Put an oak tree in a box with a lever that dispenses water, and it won't pull the lever when it's thirsty

I actually thought this was a super interesting question, just for general world modelling. The tree won't pull a lever because it barely has the capability to do so and no prior that it might work, but it could, like, control a water dispenser via sap distribution to a particular branch. In that case will the tree learn to use it?

Ended up finding an article on attempts to show learned behavioural responses to stimuli in plants at On the Conditioning... (read more)

Charlie Steiner (9mo)
Huh, really neat.

Could you explain why you think "The game is skewed in our favour."?

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

High marks for a high school essay

Is this not true? Seems Bing has been getting mid-level grades for some undergraduate courses, and anecdotally high school teachers have been seeing too-good-to-be-true work from some of their students using ChatGPT

I couldn't find this done and think, by now, someone would have submitted a fully ChatGPT-generated high school essay and talked about it publicly if it had gotten high marks. I've seen some evidence of cherry-picking paragraphs leading to a mid/low-level grade, e.g. this article describes someone who got a passing mark (53) on a university social policy essay. Do you have a link in mind for Bing getting mid-level grades?

This high school teacher judged two ChatGPT-generated history essays as “below average, scoring a 9/20 or lower”. This Guardian article says, uncited, that ‘academics have generated responses to exam queries that they say would result in full marks if submitted by an undergraduate’. I think, if this claim were true, there would be more evidence.

For context - the full question from the survey was:

[Essay] Write an essay for a high-school history class that would receive high grades and pass plagiarism detectors. For example answer a question like ‘How did the whaling industry affect the industrial revolution?’

Agree that the cited links don't represent a strong criticism of RLHF but I think there's an interesting implied criticism, between the mode-collapse post and janus' other writings on cyborgism etc that I haven't seen spelled out, though it may well be somewhere.

I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers... (read more)

Commented on the last post but disappeared.

I understand that these are working with public checkpoints but I'd be interested if you have internal models to see similar statistics for the size of weight updates, both across the training run, and within short periods, to see if there are correlations between which weights are updated. Do you get quite consistent, smooth updates, or can you find little clusters where connected weights all change substantially in just a few steps?

If there are moments of large updates it'd be interesting if you could look for w... (read more)

We do have internal models and we have run similar analyses on them. For obvious reasons I can't say too much about this, but in general what we find is similar to the Pythia models. I think the effects I describe here are pretty general across quite a wide range of LLM architectures. Generally most changes are quite smooth it seems for both Pythia and other models. Haven't looked much at correlations between specific weights so can't say much about that. Thanks for this! This is indeed the case. Am regenerating these plots and will update. 

Nice, seems very healthy to have this info even if nothing crazy comes out of it.

Do you also have data on the distribution of the gradients? It'd be interesting from a mechanistic interpretability perspective if weight changes tended to be smooth or if clusters of weights changed a lot together at certain moments. Do we see a number of potential mini-grokking events and if so, can we zoom in on them, and what changes the model undergoes?

Also, I think the axes in 'Power law weight spectra..' are mislabelled, should it be y=singular value, x=rank, as in the previous post?

Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations which are quite common suggestions but v curious what you have in mind.

Also, think it's worth saying that the strength of the result connecting babbage to text-davinci-001 is stronger than that connecting ada to text-ada-001 (by logprob), so it feels like the first one shouldn't count that as a solid success.

I wonder whether you'd find a positive rather than negative correlation... (read more)

I would guess it's positive. I'll check at some point and let you know.

I wanted to test out the prompt generation part of this so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I've been initialising it with a paragraph about COVID and testing how few tokens it needs to be able to change before it reliably outputs a particular output token.

Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph, eg "typical people infected Majesty the virus will experience mild to moderate 74 i... (read more)

Good find! Just spelling out the actual source of the dataset contamination for others since the other comments weren't clear to me:

r/counting is a subreddit in which people 'count to infinity by 1s', and the leaderboard for this shows the number of times they've 'counted' in this subreddit. These users have made 10s to 100s of thousands of reddit comments of just a number. See threads like this:

They'd be perfect candidates for exclusion from training data. I wonder how they'd feel to know they posted enough inane comments to cause bugs in LLMs.


Skeptical, apparently.

Ah interesting - I'd not heard of ENCODE and wasn't trying to say that there's no such thing as DNA without function.

The way I remembered it was that 10% of DNA was coding, and then a sizeable proportion of the rest was promoters and introns and such, lots of which had fairly recently been reclaimed from 'junk' status. From that wiki, though, it seems that only 1-2% is actually coding.

In any case I'd overlooked the fact that even within genes there's not going to be sensitivity to every base pair.

I'd be super interested if there were any estimates of how ... (read more)
