All of delton137's Comments + Replies

I would modify the theory slightly by noting that the brain may become hypersensitive to sensations arising from the area that was originally damaged, even after it has healed. Sensations that are otherwise normal can then trigger pain. I went to the website about pain reprocessing therapy and stumbled upon an interview with Alan Gordon where he talked about this.  I suspect that high level beliefs about tissue damage etc play a role here also in causing the brain to become hyper focused on sensations coming from a particular region and to interpret t... (read more)

Since nobody else posted these: 

Bay Area is Sat Dec 17th (Eventbrite) (Facebook)

South Florida (about an hour north of Miami) is Sat Dec 17th (Eventbrite) (Facebook)

On current hardware, sure.

It does look like scaling will hit a wall soon if hardware doesn't improve, see this paper:

But Gwern has responded to this paper pointing out several flaws... (having trouble finding his response right now..ugh)

However, we have lots of reasons to think Moore's law will continue ... in particular future AI will be on custom ASICs / TPUs / neuromorphic chips, which is a very different story. I wrote about this long ago, in 2015. Such chips, especially asynchronous and analog ones, can be vastly more ... (read more)

I disagree, in fact I actually think you can argue this development points the opposite direction, when you look at what they had to do to achieve it and the architecture they use. 

I suggest you read Ernest Davis' overview of Cicero.  Cicero is a special-purpose system that took enormous work to produce -- a team of multiple people labored on it for three years.  They had to assemble a massive dataset from 125,300 online human games. They also had to get expert annotations on thousands of preliminary outputs. Even that was not enough.. they ... (read more)

I've looked into these methods a lot, in 2020 (I'm not so much up to date on the latest literature). I wrote a review in my 2020 paper, "Self-explaining AI as an alternative to interpretable AI". 

There are a lot of issues with saliency mapping techniques, as you are aware (I saw you link to the "sanity checks" paper below). Funnily enough though, the super simple technique of occlusion mapping does seem to work very well, though! It's kinda hilarious actually that there are so many complicated mathematical techniques for saliency mapping, but I have s... (read more)
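For anyone who hasn't seen occlusion mapping, here is a minimal sketch of the idea, assuming a grayscale image stored as a 2D NumPy array and a model that returns a single score. The `toy_model` function, the patch size, and the zero baseline are hypothetical choices for illustration, not taken from any particular paper.

```python
import numpy as np

def occlusion_map(model, image, patch=4, baseline=0.0):
    """Heatmap of how much the model's score drops when each patch is blanked out."""
    h, w = image.shape
    base_score = model(image)
    heat = np.zeros((h, w))
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = baseline
            # A large drop means the model relied on this patch.
            heat[top:top + patch, left:left + patch] = base_score - model(occluded)
    return heat

# Hypothetical stand-in for a classifier: it only looks at the top-left quadrant.
def toy_model(img):
    return float(img[:8, :8].mean())

image = np.ones((16, 16))
heat = occlusion_map(toy_model, image)
```

In this toy case the heatmap is nonzero only over the top-left quadrant, which is the only region the stand-in model reads. With a real network you would pass the class logit or probability as the score.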

There's no doubt a world simulator of some sort is probably going to be an important component in any AGI, at the very least for planning - Yann LeCun has talked about this a lot. There's also this work where they show a VAE type thing can be configured to run internal simulations of the environment it was trained on.

In brief, a few issues I see here:

  • You haven't actually provided any evidence that GPT does simulation other than "Just saying “this AI is a simulator” naturalizes many of the counterintuitive properties of GPT which don’t usually become apparen
... (read more)
6 · the gears to ascension · 7mo
my impression is that by simulator and simulacra this post is not intending to claim that the thing it is simulating is realphysics but rather that it learns a general "textphysics engine", the model, which runs textphysics environments. it's essentially just a reframing of the prediction objective to describe deployment time - not a claim that the model actually learns a strong causal simplification of the full variety of real physics.

Piperine (black pepper extract) can help make quercetin more bioavailable. The two are co-administered in many studies on the neuroprotective effects of quercetin.

I find slower take-off scenarios more plausible. I like the general thrust of Christiano's "What failure looks like". I wonder if anyone has written up a more narrative / concrete account of that sort of scenario.

The thing you are trying to study ("returns on cognitive reinvestment") is probably one of the hardest things in the world to understand scientifically. It requires understanding both the capabilities of specific self-modifying agents and the complexity of the world. It depends what problem you are focusing on too -- the shape of the curve may be very different for chess vs something like curing disease. Why? Because chess I can simulate on a computer, so throwing more compute at it leads to some returns. I can't simulate human biology in a computer - we h... (read more)

I disagree. The theoretical framework is a first step to allow us to reason more clearly about the topic. I expect to eventually bridge the gap between the theoretical and the empirical. In fact, I just added some concrete empirical research directions I think could be pursued later on:   Recall that I called this "a rough draft of the first draft of one part of the nth post of what I hope to one day turn into a proper sequence". There's a lot of surrounding context that I haven't gotten around to writing yet. And I do have a coherent narrative of where this all fits together in my broader project to investigate takeoff dynamics.   The formalisations aren't useless; they serve to refine and sharpen thinking. Making things formal forces you to make explicit some things you'd left implicit.

How familiar are you with Chollet's paper "On the Measure of Intelligence"? He disagrees a bit with the idea of "AGI" but if you operationalize it as "skill acquisition efficiency at the level of a human" then he has a test called ARC which purports to measure when AI has achieved human-like generality.

This seems to be a good direction, in my opinion. There is an ARC challenge on Kaggle and so far AI is far below the human level. On the other hand, "being good at a lot of different things", ie task performance across one or many tasks, is obviously very important to understand and Chollet's definition is independent from that.

ARC is a nice attempt. I also participated in the original challenge on Kaggle. The issue is that the test can be gamed (as anyone on Kaggle did) by brute-forcing over solution strategies.  An open-ended or interactive version of ARC may solve this issue.

Thanks, it's been fixed!!

Interesting, thanks. 10x reduction in cost every 4 years is roughly twice what I would have expected. But it sounds quite plausible especially considering AI accelerators and ASICs.
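To put that rate next to the usual Moore's law figure, here is a quick back-of-the-envelope conversion. The 2-year doubling baseline is an assumption on my part; the comment does not say which baseline it had in mind.

```python
import math

def doubling_time_years(factor, period_years):
    """Years per 2x improvement if cost-efficiency improves by `factor` every `period_years`."""
    return period_years * math.log(2) / math.log(factor)

def improvement_over(period_years, doubling_years):
    """Total improvement factor over `period_years` given a fixed doubling time."""
    return 2 ** (period_years / doubling_years)

observed_doubling = doubling_time_years(10, 4)  # about 1.2 years per doubling
baseline_factor = improvement_over(4, 2)        # 4x per 4 years (assumed 2-year doubling)
ratio = 10 / baseline_factor                    # 2.5x faster than that baseline
```

So 10x every 4 years corresponds to a doubling time of roughly 14 to 15 months, and is about 2.5 times the improvement a 2-year doubling would give over the same period.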

Thanks for sharing! That's a pretty sophisticated modeling function but it makes sense. I personally think Moore's law (the FLOPS/$ version) will continue, but I know there's a lot of skepticism about that.

Could you make another graph like Fig 4 but showing projected cost, using Moore's law to estimate cost? The cost is going to be a lot, right?

Thanks!  Good idea. I might do this when I get the time—will let you know!
We basically lumped the reduced cost of FLOP per $ and increased spending together. A report from CSET on AI and Compute projects the costs using two strongly simplified assumptions: (I) compute doubles every 3.4 months (based on OpenAI's previous report) and (II) the cost of compute stays constant. This could give you a rough upper bound on projected costs. Carey's previous analysis uses this dataset from AI Impacts and therefore assumes:

Networks with loops are much harder to train.. that was one of the motivations for going to transformers instead of RNNs. But yeah, sure, I agree. My objection is more that posts like this are so high level I have trouble following the argument, if that makes sense. The argument seems roughly plausible but not making contact with any real object level stuff makes it a lot weaker, at least to me. The argument seems to rely on "emergence of self-awareness / discovery of malevolence/deception during SGD" being likely which is unjustified in my view. I'm not s... (read more)

Has GPT-3 / large transformers actually led to anything with economic value? Not from what I can tell although anecdotal reports on Twitter are that many SWEs are finding Github Copilot extremely useful (it's still in private beta though). I think transformers are going to start providing actual value soon, but the fact they haven't so far despite almost two years of breathless hype is interesting to contemplate. I've learned to ignore hype, demos, cool cherry-picked sample outputs, and benchmark chasing and actually look at what is being deployed "in the ... (read more)

Economic value might not be a perfect measure. Nuclear fission didn't generate any economic value either until 200,000 people in Japan were incinerated. My fear is that a mixture of experts approach can lead to extremely fast progress towards AGI. Perhaps even less - maybe all it takes is an agent AI that can code as well as humans, to start a cascade of recursive self-improvement. But indeed, a Knightian uncertainty here would already put me at some ease. As long as you can be sure that it won't happen "just anytime" before some more barriers are crossed, at least you can still sleep at night and have the sanity to try to do something. I don't know, I'm not a technical person, that's why I'm asking questions and hoping to learn more. "I'm more worried about someone reverse engineering the wiring of cortical columns in the neocortex in the next few years and then replicating it in silicon." Personally that's what worries me the least. We can't even crack C. elegans! I don't doubt that in 100-200 years we'd get there but I see many other way faster routes.

This is a shot in the dark, but I recall there was a blog post that made basically the same point visually, I believe using Gaussian distributions. I think the number they argued you should aim for was 3-4 instead of 6. Anyone know what I'm talking about?

I don't recognize the exact blog post you reference but in my personal experience the actual number of skills that put me on a useful Pareto frontier is indeed closer to 3-4. To be clear, I don't get there by just learning 3-4 different skills. I learn like 8-10 and then a subset of 3-4 gets me to the Pareto frontier.

Hi, I just wanted to say thanks for the comment / feedback. Yeah, I probably should have separated out the analysis of Grokking from the analysis of emergent behaviour during scaling. They are potentially related - at least for many tasks it seems Grokking becomes more likely as the model gets bigger. I'm guilty of actually conflating the two phenomena in some of my thinking, admittedly.

Your point about "fragile metrics" being more likely to show Grokking is great. I had a similar thought, too.

I think a bit too much mindshare is being spent on these sci-fi scenario discussions, although they are fun.

Honestly I have trouble following these arguments about deception evolving in RL. In particular I can't quite wrap my head around how the agent ends up optimizing for something else (not a proxy objective, but a possibly totally orthogonal objective like "please my human masters so I can later do X"). In any case, it seems self awareness is required for the type of deception that you're envisioning. Which brings up an interesting question - can a pu... (read more)

1 · Timothy Underwood · 1y
Yeah, but don't you expect successful human equivalent neural networks to have some sort of loop involved? It seems pretty likely to me that the ML researchers will successfully figure out how to put self analysis loops into neural nets.

Zac says "Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts."

What's the evidence for this? If AlphaZero worked by learning concepts in a sort of step-wise manner, then we should expect jumps in performance when it comes to certain types of puzzles, right? I would guess that a beginning human would exhibit jumps from learning concepts like "control the center" or "castle early, not later".. for instance the principle "control the center", once followed, has implicati... (read more)

Huh, that's pretty cool, thanks for sharing.

This is pretty interesting. There is a lot to quibble about here, but overall I think the information about bees here is quite valuable for people thinking about where AI is at right now and trying to extrapolate forward.

A different approach, perhaps more illuminating, would be to ask how much of a bee's behavior we could plausibly emulate today by globbing together a bunch of different ML algorithms into some sort of virtual bee cognitive architecture - if say we wanted to make a drone that behaved like a bee à la Black Mirror. Obviously that's a much more c... (read more)

Another point is that when you optimize relentlessly for one thing, you might have trouble exploring the space adequately (get stuck at local maxima). That's why RL agents/algorithms often take random actions when they are training (they call this "exploration" instead of "exploitation"). Maybe random actions can be thought of as a form of slack? Micro-slacks?
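The simplest version of this is epsilon-greedy action selection. Below is a minimal sketch on a toy multi-armed bandit; the arm means, step count, and function names are all made up for illustration, and real RL agents use many other exploration schemes.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a random action (explore), else the best-looking one (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_bandit(true_means, epsilon, steps=5000, seed=0):
    """Play a Gaussian bandit and return the value estimates and pull counts per arm."""
    rng = random.Random(seed)
    n = len(true_means)
    q = [0.0] * n       # running estimate of each arm's mean reward
    counts = [0] * n    # how often each arm was pulled
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental mean update
    return q, counts

q, counts = run_bandit([0.1, 0.5, 0.9], epsilon=0.1)
```

With `epsilon=0` the agent can lock onto whichever arm looked good first and never revisit the others, which is the local-maximum problem described above. The random actions are the "slack" that lets it discover better arms.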

Look at Kenneth Stanley's arguments about why objective functions are bad (video talk on it here). Basically he's saying we need a lot more random exploration. Humans are similar - we have an ope... (read more)

Bostrom talks about this in his book "Superintelligence" when he discusses the dangers of Oracle AI. It's a valid concern, we're just a long way from that with GPT-like models, I think.

I used to think a system trained on text only could never learn vision. So if it escaped onto the internet, it would be pretty limited in how it could interface with the outside world since it couldn't interpret streams from cameras. But then I realized that probably in its training data is text on how to program a CNN. So in theory a system trained on only text could build... (read more)

I just did some tests... it works if you go to settings and click "Activate Markdown Editor". Then convert to Markdown and re-save (note, you may want to back up before this, there's a chance footnotes and stuff could get messed up). 

$stuff$ for inline math and double dollar signs for single line math work when in Markdown mode. When using the normal editor, inline math doesn't work, but $$ works (but puts the equation on a new line). 

I have mixed feelings on this. I have mentored ~5 undergraduates in the past 4 years and observed many others, and their research productivity varies enormously. How much of that is due to IQ vs other factors I really have no idea. My personal feeling was most of the variability was due to life factors like the social environment (family/friends) they were ensconced in and how much time that permitted them to focus on research. 

My impression from TAing physics for life scientists for two years was that a large number felt they were intrinsically bad a... (read more)

I liked how in your AISS support talk you used history as a frame for thinking about this because it highlights the difficulty of achieving superhuman ethics. Human ethics (for instance as encoded in laws/rights/norms) is improving over time, but it's been a very slow process that involves a lot of stumbling around and having to run experiments to figure out what works and what doesn't.  "The Moral Arc" by Michael Shermer is about the causes of moral progress... one of them is allowing free speech, free flow of ideas. Basically, it seems moral progres... (read more)

It's a mixed bag. A lot of near term work is scientific, in that theories are proposed and experiments run to test them, but from what I can tell that work is also incredibly myopic and specific to the details of present day algorithms and whether any of it will generalize to systems further down the road is exceedingly unclear. 

The early writings of Bostrom and Yudkowsky I would classify as a mix of scientifically informed futurology and philosophy. As with science fiction, they are laying out what might happen. There is no science of psychohistory an... (read more)

The paper you cited does not show this.

Yeah, you're right I was being sloppy. I just crossed it out. 

oo ok, thanks, I'll take a look. The point about generative models being better is something I've been wanting to learn about, in particular. 

SGD is a form of efficient approximate Bayesian updating.

Yeah I saw you were arguing that in one of your posts. I'll take a closer look. I honestly have not heard of this before. 

Regarding my statement - I agree looking back at it it is horribly sloppy and sounds absurd, but when I was writing I was just thinking about how all L1 and L2 regularization do is bias towards smaller weights - the models still take up the same amount of space on disk and require the same amount of compute to run in terms of FLOPs. But yes you're right they make the m... (read more)

So actually L1/L2 regularization does allow you to compress the model by reducing entropy, as evidenced by the fact that any effective pruning/quantization system necessarily involves some strong regularizer applied during training or after. The model itself can't possibly know or care whether you later actually compress said weights or not, so it's never the actual compression itself that matters, vs the inherent compressibility (which comes from the regularization).
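One part of this is easy to see in a toy example: an L1 penalty drives small weights to exactly zero (so they can be pruned), while an L2 penalty only shrinks them. The sketch below applies the standard proximal (shrinkage) steps for each penalty to a made-up weight vector. It illustrates the sparsity mechanism only; it does not capture the entropy/quantization argument for L2 made above.

```python
import numpy as np

def prox_l1(w, t):
    """Proximal step for the penalty t * |w|: soft-thresholding, which yields exact zeros."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_l2(w, t):
    """Proximal step for the penalty (t / 2) * w**2: uniform shrinkage, no exact zeros."""
    return w / (1.0 + t)

w = np.array([2.0, -0.3, 0.05, -1.5, 0.2])  # made-up weights
w_l1 = prox_l1(w, 0.25)
w_l2 = prox_l2(w, 0.25)

prunable_l1 = int(np.sum(w_l1 == 0.0))  # weights that can be dropped outright
prunable_l2 = int(np.sum(w_l2 == 0.0))
```

Here the L1 step zeroes out the two smallest weights, while the L2 step leaves all five nonzero but smaller in magnitude.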

By the way, if you look at Filan et al.'s paper "Clusterability in Neural Networks" there is a lot of variance in their results but generally speaking they find that L1 regularization leads to slightly more clusterability than L2 or dropout.

The idea that using dropout makes models simpler is not intuitive to me because according to Hinton dropout essentially does the same thing as ensembling. If what you end up with is something equivalent to an ensemble of smaller networks then it's not clear to me that would be easier to prune.

One of the papers you linked to appears to study dropout in the context of Bayesian modeling and they argue it encourages sparsity. I'm willing to buy that it does in fact reduce complexity/ compressibility but I'm also not sure any of this is 100% clear cut.

It's not that dropout provides some ensembling secret sauce; instead neural nets are inherently ensembles proportional to their level of overcompleteness. Dropout (like other regularizers) helps ensure they are ensembles of low complexity sub-models, rather than ensembles of over-fit higher complexity sub-models (see also: lottery tickets, pruning, grokking, double descent).

(responding to Jacob specifically here) A lot of things that were thought of as "obvious" were later found out to be false in the context of deep learning - for instance the bias-variance trade-off.

I think what you're saying makes sense at a high/rough level but I'm also worried you are not being rigorous enough. It is true and well known that L2 regularization can be derived from Bayesian neural nets with a Gaussian prior on the weights. However neural nets in deep learning are trained via SGD, not with Bayesian updating -- and it doesn't seem modern CNNs... (read more)

SGD is a form of efficient approximate Bayesian updating. More specifically it's a local linear 1st order approximation. As the step size approaches zero this approximation becomes tight, under some potentially enormous simplifying assumptions of unit variance (which are in practice enforced through initialization and explicit normalization). But anyway that's not directly relevant, as Bayesian updating doesn't have some monopoly on entropy/complexity tradeoffs. If you want to be 'rigorous', then you shouldn't have confidently said: (As you can't rigorously back that statement up). Regularization to bias towards simpler models in DL absolutely works well, regardless of whether you understand it or find the provided explanations satisfactory.

Hey, OK, fixed. Sorry there is no link to the comment -- I had a link in an earlier draft but then it got lost. It was a comment somewhere on LessWrong and now I can't find it -_-.

That's interesting it motivated you to join Anthropic - you are definitely not alone in that. My understanding is Anthropic was founded by a bunch of people who were all worried about the possible implications of the scaling laws.

1 · Zac Hatfield-Dodds · 1y
No worries, here's the comment.

To my knowledge the most used regularization method in deep learning, dropout, doesn't make models simpler in the sense of being more compressible.

A simple L1 regularization would make models more compressible in so far as it suppresses weights towards zero so they can just be thrown out completely without affecting model performance much. I'm not sure about L2 regularization making things more compressible - does it lead to flatter minima for instance? (GPT-3 uses L2 regularization, which they call "weight decay").

But yes, you are right, Occam factors are... (read more)

Yes, it does (as should make sense, because if you can drop out a parameter entirely, you don't need it, and if it succeeds in fostering modularity or generalization, that should make it much easier to prune), and this was one of the justifications for dropout, and that has nice Bayesian interpretations too. (I have a few relevant cites in my sparsity tag.)
L2 regularization is much more common than dropout, but both are a complexity prior and thus compress. This is true in a very obvious way for L2. Dropout is more complex to analyze, but has now been extensively analyzed and functions as a complexity/entropy penalty as all regularization does. L2 regularization (weight decay) obviously makes things more compressible - it penalizes models with high entropy under a per-param gaussian prior. "Flatter minima" isn't a very useful paradigm for understanding this, vs Bayesian statistics.
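For reference, the textbook link between a per-parameter Gaussian prior and the L2 penalty can be written out as follows. The notation ($D$ for data, $\theta$ for weights, $\sigma^2$ for prior variance, $\lambda$ for the penalty strength) is mine, and this is the MAP-estimate version only; it says nothing by itself about what SGD does.

```latex
\begin{aligned}
\theta_{\text{MAP}} &= \arg\max_\theta \left[ \log p(D \mid \theta) + \log p(\theta) \right],
  \qquad p(\theta) = \prod_i \mathcal{N}(\theta_i;\, 0, \sigma^2) \\
-\log p(\theta) &= \frac{1}{2\sigma^2} \sum_i \theta_i^2 + \text{const} \\
\theta_{\text{MAP}} &= \arg\min_\theta \left[ -\log p(D \mid \theta) + \lambda \sum_i \theta_i^2 \right],
  \qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```

A tighter prior (smaller $\sigma^2$) corresponds to a stronger weight-decay coefficient $\lambda$.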

I think this is a nice line of work. I wonder if you could add a simple/small constraint on weights that avoids the issue of multimodal neurons -- it seems doable. 

I just wanted to say I don't think you did anything ethically wrong here. There was a great podcast with Diana Fleischman I listened to a while ago where she talked about how we manipulate other people all the time especially in romantic relationships. I'm uncomfortable saying that any manipulation whatsoever is ethically wrong because I think that's demanding too much cognitive overhead for human relationships (and also makes it hard to raise kids) - I think you have to figure out a more nuanced view. For instance, having a high level rule on what ... (read more)

You sound very confident your device would have worked really well. I'm curious, how much testing did you do? 

I have a Garmin Vivosmart 3 and it tries to detect when I'm either running, biking, or going up stairs. It works amazingly well considering the tiny amount of hardware and battery power it has, but it also fails sometimes, like randomly thinking I've been running for a while when I've been doing some other high heart rate thing. Maddeningly, I can't figure out how to turn off some of the alerts, like when I've met my "stair goal" for the day. 

Only eating with a fork. A full system would require more data than that. We tested on real people in real-world conditions who were not part of the training dataset. If someone ate in a different style we could add just a little bit of annotated training data for the eating style, run the toolchain overnight and the algorithm would be noticeably better for that person and everyone else. The reason why I'm so confident in our algorithm was because ① it required very little data to do updates and ② I had lots of experience in the field which meant I knew exactly what quality level was and wasn't acceptable to customers. To update the code in response to user feedback we would have to push the new code. Building an update system was theoretically straightforward. It was a (theoretically) solved problem with little technical risk. But it was not a problem that we had personally built a toolchain for and the whole firmware update system involved more technical maintenance than I wanted to commit myself to.

I think he's conditioning heavily on being fully vaxxed and boosted when making the comparison to the flu. Which makes sense to me. I also suspect long Covid-19 risk is much lower if you're vaxxed & boosted, based on the theory that Long Covid is caused by an inflammatory cascade that won't shut off (there's a lot of debate about what biomarkers to use but many long Covid patients have elevated markers of inflammation months later). If your symptoms are mild, you won't have that inflammatory cascade. Here's Zvi on one of the latest Long Covid papers : ... (read more)

"I think this is important as the speed prior was considered to be, and still is by many, a very good candidate for a way of not producing deceptive models." I'm curious who has professed a belief in this.  

I don't have much direct experience with transformers (I was part of some research with BERT once where we found it was really hard to use without adding hard-coded rules on top, but I have no experience with the modern GPT stuff). However, what you are saying makes a lot of sense to me based on my experience with CNNs and the attempts I've seen to explain/justify CNN behaviour with side channels (for instance this medical image classification system that also generates text as a side output). 

See also my comment on Facebook

I think what you're saying makes a lot of sense. When assembling a good training data set, it's all about diversity. 

It'd be hard for humans to compete with AI unless humans can communicate with the AI in reasonable-sized chunks e.g. a 100-page document. Me, I think we should chat in 10-page documents or less 🤷🏾‍♀️.

(cross posting this comment from E. S. Yudkowsky's Facebook with some edits / elaboration)

Has anyone tried fine-tuning a transformer on small datasets of increasing size to get a sense of how large a dataset would be needed to do this well? I suspect it might have to be very large.

Note this is similar to the "self explaining AI" idea I explored in early 2020, which I threw together a paper on (I am hesitant to link to it because it's not that great of a paper and much of the discussion there is CNN specific, but here it is.). I can see how producing "thoug... (read more)

I've fine-tuned GPT models on a bunch of different datasets of different sizes, although not this particular dataset (which doesn't exist yet). Below I list some key things to note.  Also see here for related discussion.  These points hold true for typical tasks/datasets, though a few unusual ones like arithmetic behave differently.

  • GPT performance tends to scale smoothly and gradually with data/model size, over multiple orders of magnitude.
  • In terms of subjective response, you don't need much data to get GPTs to the level of "hey, it kinda gets it!".
  • You may need several orders of magnitude more data to reach the point of saturation where the model can't improve with additional data.
  • Incomplete mastery usually looks more like "randomly failing X% of the time" than "understanding X% of the content of the task," which can make it difficult to assess quality (or quality differences) at a glance.

For a concrete example, here is a data scaling experiment I did with GPT-J (6.1B params) on the tumblr post dataset I use for my tumblr bot.  My full dataset is roughly 4 times as large as the 30M word dataset proposed here, i.e. the 30M word dataset would be roughly as big as the 25% subsample shown in the report. The linked report only shows val loss, which is not very interpretable, but at least conveys that I haven't reached diminishing returns yet.  This seems plausible from subjective evidence, as the model still sometimes misunderstands tumblr lingo / the conversational structure of the data / etc.
Using the stated length estimates per section, a single run would constitute approximately 600 pages of single spaced text. This is a lot of writing.

We're guessing 1000 steps per reasonably-completed run (more or less, doesn't have to be exact) and guessing maybe 300 words per step, mostly 'thought'.  Where 'thoughts' can be relatively stream-of-consciousness once accustomed (we hope) and the dungeon run doesn't have to be Hugo quality in its plotting, so it's not like we're asking for a 300,000-word edited novel.
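As a sanity check, the two estimates in this exchange line up. The snippet below just multiplies out the numbers quoted above; the words-per-page figure is derived from them, not stated by either commenter.

```python
steps_per_run = 1000    # "1000 steps per reasonably-completed run"
words_per_step = 300    # "maybe 300 words per step"
pages_per_run = 600     # "approximately 600 pages of single spaced text"

words_per_run = steps_per_run * words_per_step   # 300,000 words
words_per_page = words_per_run / pages_per_run   # implied density: 500 words per page
```

Roughly 500 words per page is in the usual range for single-spaced text, so the 600-page and 300,000-word figures describe the same amount of writing.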

However I also could see the "thoughts" output misleading people - people might mistake the model's explanations as mapping onto the calculations going on inside the model to produce an output.

I think the key point on avoiding this is the intervening-on-the-thoughts part:
"An AI produces thoughts as visible intermediates on the way to story text, allowing us to watch the AI think about how to design its output, and to verify that we can get different sensible outputs by intervening on the thoughts".

So the idea is that you train things in such a way that the thoughts do map onto the calculations going on inside the model.

Note: Pfizer started a trial in September to try to answer this question.  We may know the answer in a few months. In theory I don't see why it wouldn't work but with limited supply there's probably better uses at least in the next few months. 

Also, note the initial EUA application is asking it be approved for high-risk patients only, probably because Pfizer was told by FDA it wouldn't be EUA'd otherwise. 

Paxlovid must be taken with Ritonavir (otherwise Paxlovid breaks down too fast) which messes with liver enzymes and isn't a good choice for man... (read more)

Very cool, will take a look. This basically solves question 1. It seems the original Solomonoff work isn't published anywhere. By the way, the author, William H. Press, is a real polymath! I am curious if there is any extension of this work to agents with finite memory..  as an example, the same situation where you're screening a large number of people, but now you have a memory where you can store N results of prior screenings for reference. I'm going to look into it.. 

Seems like a memory version would be identical, just with a smaller n after subtracting the individuals you screen. When you fill up your memory with cleared individuals, why would you then ever want to 'forget' them? By stipulation, you learn nothing about other individuals or the population, only about the ones you look at. If you forget them to replace them with a new memory, that de facto makes the n bigger, and worsens your odds since you've flushed back into the pool the only individuals you knew for sure you never want to sample again (because they are clear) and so now you may waste a sample to test them again while gaining nothing. And once you remove them from the population via your memory, you're back to the solved memoryless problem and have to square-root it.
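To make the memoryless baseline concrete, here is a small sketch of the square-root rule as I understand it: if the target is person i with prior probability p_i, and you screen by drawing people independently (with replacement) from a distribution q, the expected number of screenings is the sum of p_i / q_i, which is minimized by q_i proportional to sqrt(p_i). The prior vector below is made up for illustration.

```python
import numpy as np

def expected_screenings(p, q):
    """Expected draws from q (with replacement) until the target, distributed as p, is drawn."""
    return float(np.sum(p / q))

p = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])  # hypothetical prior over 6 people

uniform = np.full_like(p, 1.0 / len(p))   # ignore the prior entirely
proportional = p.copy()                   # "strong profiling": sample in proportion to p
sqrt_rule = np.sqrt(p) / np.sqrt(p).sum() # square-root-biased sampling

cost_uniform = expected_screenings(p, uniform)
cost_proportional = expected_screenings(p, proportional)
cost_sqrt = expected_screenings(p, sqrt_rule)
```

Uniform and proportional sampling both cost N screenings in expectation (6 here), while the square-root rule costs (sum of sqrt(p_i)) squared, which is about 5 for this prior. With memory of cleared individuals you would instead sample without replacement over a shrinking pool, as described above.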

Here's another paper on small / non-robust features, but rather specific to patch-based vision transformers: 
Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
^ This work is very specific to patch-based methods. Whether patches are here to stay and for how long is unclear to me, but right now they seem to be on an ascendancy (?).  

For what it's worth - I see value in votes being public by default. It can be very useful to see who upvoted or downvoted your comment. Of course then people will use the upvote feature just to indicate they read a post, but that's OK (we are familiar with that system from Facebook, Twitter, etc). 

I'm pretty apathetic about all the other proposals here. Reactions seem to me to be unnecessary distractions. [side note - emojis are very ambiguous so it's good you put words next to each one to explain what they are supposed to mean].  The way I woul... (read more)
