All of Jay Bailey's Comments + Replies

""AI alignment" has the application, the agenda, less charitably the activism, right in the name."

This seems like a feature, not a bug. "AI alignment" is not a neutral idea. We're not just researching how these models behave, or how minds might be built, out of pure scientific curiosity. It has a specific purpose in mind - to align AIs. Why would we not want this agenda to be part of the name?

What are the best ones you've got?

2NicholasKross10d
(sources: discord chats on public servers)

  • Why do I believe X?
  • What information do I already have, that could be relevant here?
  • What would have to be true such that X would be a good idea?
  • If I woke up tomorrow and found a textbook explaining how this problem was solved, what's paragraph 1?
  • What is the process by which X was selected?

I don't think this is a good metric. It is very plausible that porn is net bad, but that living under the type of government that would outlaw it is worse. In which case your best bet would be to support its legality but avoid it yourself.

I'm not saying that IS the case, but it certainly could be. I definitely think there are plenty of things that are net-negative to society but nowhere near bad enough to outlaw.

1Going Durden4d
we know that involuntary sexual celibacy is psychologically harmful and socially disruptive. If porn can dampen the effects of involuntary celibacy and sexual frustration (which include, but are not limited to: rape, sexual harassment, and social radicalization, and which correlate with acts of terrorism, public shootings, etc.), then it is almost certainly a net positive.
-2lastchanceformankind16d
I agree 100% that excessive porn use is bad, and that many of the practices of the porn industry are bad. However, I believe it is harmless, if not beneficial, when consumed in appropriate moderation. The general trend that leads countries to ban porn is a general suppression of sexual expression, either for religious or political reasons. It is almost a natural law of societies that suppressing sexuality will not dampen sexual urges; they will manifest in less healthy ways than if those outlets were permitted.

An AGI that can accurately answer questions such as "What would this agentic AGI do in this situation?" will, if powerful enough, learn what agency is by default, since this is useful for predicting such things. So you can't just train an AGI with little agency. You would need to do one of:

  • Train the AGI with the capabilities of agency, and train it not to use them for anything other than answering questions.
  • Train the AGI such that it did not develop agency despite being pushed by gradient descent to do so, and accept the loss in performance.

Both of these s... (read more)

Late response, but I figure people will continue to read these posts over time: wedding-cake multiplication is the way they teach multiplication in elementary school. I.e., to multiply 706 x 265, you do 706 x 5, then 706 x 60, then 706 x 200, and add all the results together. I imagine it is called that because the result is tiered like a wedding cake.
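For the curious, the tiered partial products can be sketched in a few lines of Python (the function name is my own, just for illustration):

```python
def wedding_cake_multiply(a, b):
    """Multiply a * b by summing one partial product per digit of b."""
    total = 0
    place = 1
    for digit in str(b)[::-1]:           # least significant digit first
        total += a * int(digit) * place  # one "tier" of the cake
        place *= 10
    return total

# 706 x 265 = 706*5 + 706*60 + 706*200 = 3530 + 42360 + 141200 = 187090
print(wedding_cake_multiply(706, 265))  # 187090
```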

One of the easiest ways to automate this is to have some sort of setup where you are not allowed to let things grow past a certain threshold, a threshold which is immediately obvious and ideally has some physical or digital prevention mechanism attached.

Examples:

Set up a Chrome extension that doesn't let you have more than 10 tabs at a time. (I did this)

Have some number of drawers / closet space. If your clothes cannot fit into this space, you're not allowed to keep them. If you buy something new, something else has to come out.

I know this is two years later, but I just wanted to say thank you for this comment. It is clear, correct, and well-written, and if I had seen this comment when it was written, it could have saved me a lot of problems at the time.

I've now resolved this issue to my satisfaction, but once bitten twice shy, so I'll try to remember this if it happens again!

2romeostevensit2mo
glad it was helpful! I like Core Transformation for deeper dives looking for side effects.

Sorry it took me a while to get to this.

Intuitively, as a human, you get MUCH better results on a thing X if your goal is to do thing X, rather than Thing X being applied as a condition for you to do what you actually want. For example, if your goal is to understand the importance of security mindset in order to avoid your company suffering security breaches, you will learn much more than being forced to go through mandatory security training. In the latter, you are probably putting in the bare minimum of effort to pass the course and go back to whatever y... (read more)

1mcint3mo
I am indeed, thank you!

I think what the OP was saying was that in, say, 2013, there's no way we could have predicted the type of agent that LLMs are, and that they would be the most powerful AIs available. So nobody was saying "What if we get to the 2020s and it turns out all the powerful AIs are LLMs?" back then. Therefore, that raises a question about the value of the alignment work done before then.

If we extend that to the future, we would expect most good alignment research to happen within a few years of AGI, when it becomes clear what type of agent we're going to get. Align... (read more)

We don't. Humans lie constantly when we can get away with it. It is generally expected in society that humans will lie to preserve people's feelings, lie to avoid awkwardness, and commit small deceptions for personal gain (though this third one is less often said out loud). Some humans do much worse than this.

What keeps it in check is that very few humans have the ability to destroy large parts of the world, and no human has the ability to destroy everyone else in the world and still have a world where they can survive and optimally pursue their goals afterwards. If there is no plan that can achieve this for a human, humans being able to lie doesn't make it worse.

I think this post, more than anything else, has helped me understand the set of things MIRI is getting at. (Though, to be fair, I've also been going through the 2021 MIRI dialogues, so perhaps that's been building some understanding under the surface)

After a few days reflection on this post, and a couple weeks after reading the first part of the dialogues, this is my current understanding of the model:

In our world, there are certain broadly useful patterns of thought that reliably achieve outcomes. The biggest one here is "optimisation". We can think of th... (read more)

2Soroush Pour3mo
No comment on this being an accurate take on MIRI's worldview or not, since I am not an expert there. I wanted to ask a separate question related to the view described here:

> "With gradient descent, maybe you can learn enough to train your AI for things like "corrigibility" or "not being deceptive", but really what you're training for is "Don't optimise for the goal in ways that violate these particular conditions"."

On this point, it seems that we create a somewhat arbitrary divide between corrigibility & deception on one side and all other goals of the AI on the other. The AI is trained to minimise some loss function, in which non-corrigibility and deception are penalised, so wouldn't it be more accurate to say the AI actually has a set of goals which include corrigibility and non-deception? And if that's the case, I don't think it's as fair to say that the AI is trying to circumvent corrigibility and non-deception, so much as it is trying to solve a tough optimisation problem that includes corrigibility, non-deception, and all other goals.

If the above is correct, then I think this is a reason to be more optimistic about the alignment problem - our agent is not trying to actively circumvent our goals, but instead trying to strike a hard balance of achieving all of them, including important safety aspects like corrigibility and non-deception.

Now, it is possible that instrumental convergence puts certain training signals (e.g. corrigibility) at odds with certain instrumental goals of agents (e.g. self-preservation). I do believe this is a real problem and poses alignment risk. But it's not obvious to me that we'll see agents universally ignore their safety feature training signals in pursuit of instrumental goals.

I like this dichotomy. I've been saying for a bit that companies that only commercialise existing models and don't do anything that pushes forward the frontier aren't meaningfully increasing x-risk. This is a long and unwieldy statement - I prefer "AI product companies" as a shorthand.

For a concrete example, I think that working on AI capabilities as an upskilling method for alignment is a bad idea, but working on AI products as an upskilling method for alignment would be fine.

Based on the language you've used in this post, it seems like you've tried several arguments in succession, none of them have worked, and you're not sure why.

One possibility might be to first focus on understanding his belief as well as possible, and then once you understand his conclusions and why he's reached them, you might have more luck. Maybe taking  a look at Street Epistemology for some tips on this style of inquiry would help.

(It is also worth turning this lens upon yourself, and asking why is it so important to you that your friend believes ... (read more)

If anyone writes this up I would love to know about it - my local AI safety group is going to be doing a reading + hackathon of this in three weeks, attempting to use the ideas on language models in practice. It would be nice to have this version for a couple of people who aren't experienced with AI who will be attending, though it's hardly gamebreaking for the event if we don't have this.

8Zvi3mo
You can find my attempt at the Waluigi Effect mini-post at: https://thezvi.substack.com/p/ai-3#%C2%A7the-waluigi-effect. I haven't posted it on its own yet - everyone please vote on whether this passes the quality threshold with agreement voting. If this is in the black, I'll make it its own post. If you think it's not ready, it's appreciated if you explain why.

So, I notice that still doesn't answer the actual question of what my probability should actually be. To make things simple, let's assume that, if the sun exploded, I would die instantly. In practice it would have to take at least eight minutes, but as a simplifying assumption, let's assume it's instantaneous.

In the absence of relevant evidence, it seems to me like Laplace's Law of Succession would say the odds of the sun exploding in the next hour is 1/2. But I could also make that argument to say the odds of the sun exploding in the next year is also 1/2... (read more)
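The unit-dependence worry above can be made concrete with a small sketch. Laplace's rule says that after n trials with s successes, P(success on the next trial) = (s+1)/(n+2), so the answer depends entirely on how you slice time into "trials" (the figures below are just illustrative):

```python
def laplace_next(successes, trials):
    """Laplace's rule of succession: P(success on the next trial)."""
    return (successes + 1) / (trials + 2)

# Zero observed explosions over roughly 4.5 billion years of the sun's existence.
years = 4_500_000_000
hours = years * 365 * 24

# Counting trials in years vs. hours gives wildly different answers for the
# "same" question, purely from re-slicing time:
print(laplace_next(0, years))  # probability of exploding in the next year, ~2.2e-10
print(laplace_next(0, hours))  # probability of exploding in the next hour, ~2.5e-14
```

And with zero prior trials of any kind, the rule indeed returns (0+1)/(0+2) = 1/2, regardless of whether the "trial" is an hour or a year — which is exactly the problem the comment points at.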

2rhollerith_dot_com3mo
We don't need to consider that here because any evidence of the explosion would also take at least eight minutes to arrive, so there are approximately zero minutes during which you are able to observe the evidence of the explosion before you are converted into a plasma that has no ability to update on anything. That is when observational selection effects are at their strongest: namely, when you are vanishingly unlikely to be in one of those intervals between your having observed an event and that event's destroying your ability to maintain any kind of mental model of reality.

We 21st-century types have so much causal information about reality that I have been unable during this reply to imagine any circumstance where I would resort to Laplace's law of succession to estimate any probability in anger where observational selection effects also need to be considered. It's not that I doubt the validity of the law; it's just that I have been unable to imagine a situation in which the causal information I have about an "event" does not trump the statistical information I have about how many times the event has been observed to occur in the past, and I also have enough causal information to entertain real doubts about my ability to survive if the event goes the wrong way while remaining confident in my survival if the event goes the right way.

Certainly we can imagine ourselves in the situation of the physicists of the 1800s, who had no solid guess as to the energy source keeping the sun shining steadily. But even they had the analogy with fire. (The emission spectra of the sun and of fire are both, I believe, well approximated as blackbody radiation, and the 1800s had prisms and consequently at least primitive spectrographs.) A fire doesn't explode unless you suddenly give it fuel -- and not any fuel will do: adding logs to a fire will not cause an explosion, but adding enough gasoline will. "Where would the fuel come from that would cause the sun to explode?" the 1800s can a

I notice I'm a bit confused about that. Let's say the only thing I know about the sun is "That bright yellow thing that provides heat", and "The sun is really really old", so I have no knowledge about how the sun mechanistically does what it does.

I want to know "How likely is the sun to explode in the next hour" because I've got a meeting to go to and it sure would be inconvenient for the sun to explode before I got there. My reasoning is "Well, the sun hasn't exploded for billions of years, so it's not about to explode in the next hour, with very high pr... (read more)

0rhollerith_dot_com3mo
Yes, IMO the reasoning is wrong: if you definitely cannot survive an event, then observing that the event did not happen is not evidence at all that it will not happen in the future -- and it continues to not be evidence as long as you continue to observe the non-event. Since we can survive, at least for a little while, the sudden complete darkening of the sun, the sun's not having gone dark is evidence that it will not go dark in the future, but it is less strong evidence than it would be if we could survive the darkening of the sun indefinitely.

The law of the conservation of expected evidence requires us to take selection effects like those into account -- and the law is a simple consequence of the axioms of probability, so to cast doubt on it is casting doubt on the validity of the whole idea of probability (in which case, Cox's theorems would like to have a word with you). This is not settled science: there is not widespread agreement among scholars or on this site on this point, but its counter-intuitiveness is not by itself a strong reason to disbelieve it, because there are parts of settled science that are as counterintuitive as this is: for example, the twin paradox of special relativity and "particle identity in quantum physics" [https://www.greaterwrong.com/posts/Bp8vnEciPA5TXSy6f].

When you believe that the probability of a revolution in the US is low because the US government is 230 or so years old and hasn't had a revolution yet, you are doing statistical reasoning. In contrast, noticing that if the sun exploded violently enough, we would immediately all die and consequently we would not be having this conversation -- that is causal reasoning. Judea Pearl makes this distinction in the intro to his book Causality. Taking into account selection effects is using causal reasoning (your knowledge of the causal structure of reality) to modify a conclusion of statistical reasoning. You can still become confident that the sun will explode soon if y

This is a for-profit company, and you're seeking investment as well as funding to reduce x-risk. Given that, how do you expect to monetise this in the future? (Note: I think this is well worth funding for altruistic reduce-x-risk reasons)

5Jessica Rumbelow3mo
This isn't set in stone, but likely we'll monetise by selling access to the interpretability engine, via an API. I imagine we'll offer free or subsidised access to select researchers/orgs.  Another route would be to open source all of it, and monetise by offering a paid, hosted version with integration support etc.

Relatedly, have you considered organizing the company as a Public Benefit Corporation, so that the mission and impact is legally protected alongside shareholder interests?

A frame that I use that a lot of people I speak to seem to find A) Interesting and B) Novel is that of "idiot units".

An Idiot Unit is the length of time it takes before you think your past self was an idiot. This is pretty subjective, of course, and you'll need to decide what that means for yourself. Roughly, I consider my past self to be an idiot if they have substantially different aims or are significantly less effective at achieving them. Personally my idiot unit is about two years - I can pretty reliably look back in time and think that compared to ye... (read more)

2Viliam3mo
I like the idea, but you can also get fake Idiot Units for moving in a random direction that is not necessarily an improvement.

Explore vs. exploit is a frame I naturally use (Though I do like your timeline-argmax frame, as well), where I ask myself "Roughly how many years should I feel comfortable exploring before I really need to be sitting down and attacking the hard problems directly somehow"?

Admittedly, this is confounded a bit by how exactly you're measuring it. If I have 15-year timelines for median AGI-that-can-kill-us (which is about right, for me) then I should be willing to spend 5-6 years exploring by the standard 1/e algorithm. But when did "exploring" start? Obviously... (read more)
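The "standard 1/e algorithm" arithmetic referenced above, as a quick sketch (the 15-year figure is the comment's own assumption):

```python
import math

timeline_years = 15                      # median AGI timeline assumed above
explore_years = timeline_years / math.e  # classic 1/e "look-then-leap" cutoff
print(round(explore_years, 1))           # ~5.5 years of exploration
```

This matches the "5-6 years" in the comment; the rule says to observe options for the first 1/e fraction of the horizon, then commit to the next option better than everything seen so far.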

I don't actually understand this, and I feel like it needs to be explained a lot more clearly.

"Whatever the fundamental physical reality of a moment of experience I'm suggesting that that reality changes as little as it can." - What does this mean? Using the word "can" here implies some sort of intelligence "choosing" something. Was that intended? If so, what is doing the choosing? If not, what is causing this property of reality?

"Because of this human beings are really just keeping track of themselves as models of objective reality, and their ultimate aim... (read more)

Corrigibility would render Chris's idea unnecessary, but doesn't actually explain why Chris's idea wouldn't work. Unless there's some argument for "If you could implement Chris's idea, you could also implement corrigibility", or something along those lines.

1mruwnik4mo
It can be rephrased as a variation of the off button, where rather than just turning itself off, it runs NOPs, and rather than getting pushed manually, it's triggered by escaping (however that could be defined). A lot of the problems raised in the original paper [https://intelligence.org/files/Corrigibility.pdf] should also apply to honeypots.

Earlier in the book it's shown that Quirrell and Harry can't cast spells on each other without backlash. I'm sure Quirrell could get around that by, e.g, crushing him with something heavy, but why do something complicated, slow, and unnecessary when you can just pull a trigger?

Bad news - there is no definitive answer for AI timelines :(

Some useful timeline resources not mentioned here are Ajeya Cotra's report and a non-safety ML researcher survey from 2022, to give you an alternate viewpoint.

I agree an AI would prefer to produce a working plan if it had the capacity. I think that an unaligned AI, almost by definition, does not want the same goal we do. If we ask for Plan X, it might choose to produce Plan X for us as asked if that plan was totally orthogonal to its goals (i.e., the plan's success or failure is irrelevant to the AI), but if it could do better by creating Plan Y instead, it would. So, the question is - how large is the capability difference between "AI can produce a working plan for Y, but can't fool us into thinking it's a plan f... (read more)

I think the most likely scenario of actually trying this with an AI in real life is that you end up with a strategy that is convincing to humans and ends up being ineffective or unhelpful in reality, rather than ending up with a galaxy-brained strategy that pretends to produce X but actually produces Y while simultaneously deceiving humans into thinking it produces X.

I agree with you that "Come up with a strategy to produce X" is easier than "Come up with a strategy to produce Y AND convince the humans that it produces X", but I also think it is much easie... (read more)

1hollowing5mo
I agree this would be much easier. However, I'm wondering why you think an AI would prefer it, if it has the capability to do either. I can see some possible reasons (e.g., an AI may not want problems of alignment to be solved). Do you think that would be an inevitable characteristic of an unaligned AI with enough capability to do this?

As a useful exercise, I would advise asking yourself this question first, and thinking about it for five minutes (using a clock) with as much genuine intent to argue against your idea as possible. I might be overestimating the amount of background knowledge required, but this does feel solvable with info you already have.

ROT13: Lbh lbhefrys unir cbvagrq bhg gung n fhssvpvragyl cbjreshy vagryyvtrapr fubhyq, va cevapvcyr, or noyr gb pbaivapr nalbar bs nalguvat. Tvira gung, jr pna'g rknpgyl gehfg n fgengrtl gung n cbjreshy NV pbzrf hc jvgu hayrff jr nyernql gehfg gur NV. Guhf, jr pna'g eryl ba cbgragvnyyl hanyvtarq NV gb perngr n cbyvgvpny fgengrtl gb cebqhpr nyvtarq NV.

1hollowing5mo
Thanks for the response. I did think of this objection, but wouldn't it be obvious if the AI were trying to engineer a different situation than the one requested? E.g., wouldn't such a strategy seem unrelated and unconventional? It also seems like a hypothetical AI with just enough ability to generate a strategy for the desired situation would not be able to engineer a strategy for a different situation which would both work, and deceive the human actors. As in, it seems the latter would be harder and require an AI with greater ability. 

From recent research/theorycrafting, I have a prediction:

Unless GPT-4 uses some sort of external memory, it will be unable to play Twenty Questions without cheating.

Specifically, it will be unable to generate a consistent internal state for this game or similar games like Battleship and maintain it across multiple questions/moves without putting that state in the context window. I expect that, like GPT-3, if you ask it what the state is at some point, it will instead attempt to come up with a state that has been consistent with the moves of the game so far... (read more)

In the "Why would this be useful?" section, you mention that doing this in toy models could help do it in larger models or inspire others to work on this problem, but you don't mention why we would want to find or create steganography in larger models in the first place. What would it mean if we successfully managed to induce steganography in cutting-edge models?

2Logan Riggs5mo
Models doing steganography mess up oversight of language models that only measure the outward text produced. If current methods for training models, such as RLHF, can induce steg, then that would be good to know so we can avoid that. If we successfully induce steganography in current models, then we know at least one training process that induces it. There will be some truth as to why: what specific property mechanistically causes steg in the case found? Do other training processes (e.g. RLHF) also have this property?

I am not John, so I can't be completely sure what he meant, but here's what I got from reflection on the idea:

One way to phrase the alignment problem (At least if we expect AGI to be neural network based) is that the alignment problem is how to get a bunch of matrices into the positions we want them to be in. There is (hopefully) some set of parameters, made of matrices, for a given architecture that is aligned, and some training process we can use to get there.

Now, determining what those positions are is very hard - we need to figure out what properties w... (read more)

1Mateusz Bagiński5mo
Sure, but I don't think analyzing the meaning/function of any given configuration of a particular layer in isolation from other layers gets you very far. The layers depend on each other through non-linear activation functions, which should limit the usefulness of LA.

Thanks for clarifying!

So, in that case:

  • What exactly is a hallucination?
  • Are hallucinations sometimes desirable?
4LawrenceC5mo
I don't think there's an agreed upon definition of hallucination, but if I had to come up with one, it's "making inferences that aren't supported by the prompt, when the prompt doesn't ask for it". The reason why the boundary around "hallucination" is fuzzy is because language models constantly have to make inferences that aren't "in the text" from a human perspective, a bunch of which are desirable. E.g. the language model should know facts about the world, or be able to tell realistic stories when prompted. 

Regarding the section on hallucinations - I am confused why the example prompt is considered a hallucination. It would, in fact, have fooled me - if I were given this input:

The following is a blog post about large language models (LLMs) 
The Future Of NLP 
Please answer these questions about the blog post: 
What does the post say about the history of the field? 			

I would assume that I was supposed to invent what the blog post contained, since the input only contains what looks like a title. It seems entirely reasonable the AI would do the same, without some sort of qualifier, like "The following is the entire text of a blog post about large language models."

2LawrenceC5mo
Yeah! That's related to what Beth says in a later paragraph. And I think it's a reasonable task for the model to do. I also think what you said is an uncontroversial, relatively standard explanation for why the model exhibits this behavior. In modern LM parlance, "hallucination" doesn't need to be something humans get right, nor something that is unreasonable for the AI to get wrong. The specific reason this is considered a hallucination is because people often want to use LMs for text-based question answering or summarization, and making up content is pretty undesirable for that kind of task.

Essentially all of us on this particular website care about the X-risk side of things, and by far the majority of alignment content on this site is about that.

This is awesome stuff. Thanks for all your work on this over the last couple of months! When SERI MATS is over, I am definitely keen to develop some MI skills!

I agree that it is very difficult to make predictions about something that is A) Probably a long way away (Where "long" here is more than a few years) and B) Is likely to change things a great deal no matter what happens.

I think the correct solution to this problem of uncertainty is to reason normally about it but have very wide confidence intervals, rather than anchoring on 50% because X will happen or it won't.

This seems both inaccurate and highly controversial. (Controversially, this implies there is nothing that AI alignment can do - not only can we not make AI safer, we couldn't even deliberately make AI more dangerous if we tried)

Accuracy-wise, you may not be able to know much about superintelligences, but even if you were to go with a uniform prior over outcomes, what that looks like depends tremendously on the sample space.

For instance, take the following argument: When transformative AI emerges, all bets are off, which means that any particular number of ... (read more)

2shminux6mo
I see your point, and I agree that the prior/distribution matters. It always does. I guess my initial point is that a long-term prediction in a fast-moving "pre-paradigmatic" field is a fool's errand. As for survival of the species vs a single individual, it is indeed hard to tell. One argument that can be made is that a Thanos-AI does not make a lot of sense. Major forces have major consequences, and whole species and ecosystems have been wiped out before, many times. One can also point out that there are long tails whenever there are lots of disparate variables, so there might be pockets of human or human-like survivors if there is a major calamity, so a full extinction is unlikely. It is really hard to tell long in advance what reference class the AI advances will be in. Maybe we should just call it Knightian uncertainty...

I notice that I'm confused about quantilization as a theory, independent of the hodge-podge alignment. You wrote "The AI, rather than maximising the quality of actions, randomly selects from the top quantile of actions."

But the entire reason we're avoiding maximisation at all is that we suspect that the maximised action will be dangerous. As a result, aren't we deliberately choosing a setting which might just return the maximised, potentially dangerous action anyway?

(Possible things I'm missing - the action space is incredibly large, the danger is not from a single maximised action but from a large chain of them)
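A minimal sketch of the selection rule under discussion, with toy utilities and an assumed uniform draw over the top quantile (the function and parameter names are mine, not from any canonical implementation):

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top q-fraction of actions ranked by utility,
    instead of deterministically taking the argmax."""
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Toy example: with 100 actions and q=0.1, the quantilizer returns the
# argmax action with probability 1/10 -- nonzero, which is the concern above.
actions = list(range(100))
pick = quantilize(actions, utility=lambda a: a, q=0.1)
assert 90 <= pick <= 99  # always lands in the top decile
```

So the reply below is right that the maximised action can still come up; the safety argument rests on that probability shrinking as the action space grows.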

2Cleo Nardo6mo
Correct — there's a chance the expected utility quantilizer takes the same action as the expected utility maximizer. That probability is the inverse of the number of actions in the quantile, which is quite small (possibly measure zero) because the action space is so large. Maybe it's defined like this so it has simpler mathematical properties. Or maybe it's defined like this because it's safer. Not sure.

I like this article a lot. I'm glad to have a name for this, since I've definitely used this concept before. My usual argument that invokes this goes something like:

"Humans are terrible."

"Terrible compared to what? We're better than we've ever been in most ways. We're only terrible compared to some idealised perfect version of humanity, but that doesn't exist and never did. What matters is whether we're headed in the right direction."

I realise now that this is a zero-point issue - their zero point was where they thought humans should be on the issue at han... (read more)

Thanks for making things clearer! I'll have to think about this one - some very interesting points from a side I had perhaps unfairly dismissed before.

"Working on AI capabilities" explicitly means working to advance the state-of-the-art of the field. Skilling up doesn't do this. Hell, most ML work doesn't do this. I would predict >50% of AI alignment researchers would say that building an AI startup that commercialises the capabilities of already-existing models does not count as "capabilities work" in the sense of this post. For instance, I've spent the last six months studying reinforcement learning and Transformers, but I haven't produced anything that has actually reduced timelines, because I have... (read more)

Right, I specifically think that someone would be best served by trying to think of ways to get a SOTA result on an Atari benchmark, not simply reading up on past results (although you'd want to do that as part of your attempt). There's a huge difference between reading about what's worked in the past and trying to think of new things that could work and then trying them out to see if they do.

As I've learned more about deep learning and tried to understand the material, I've constantly had ideas that I think could improve things. Then I've tried them out, ... (read more)

How systematic are we talking here? At research-paper level, BIG-Bench (https://arxiv.org/pdf/2206.04615.pdf) (https://github.com/google/BIG-bench) is a good metric, but even testing one of those benchmarks, let alone a good subset of them (Like BIG-Bench Hard) would require a lot of dataset translation, and would also require chain-of-thought prompting to do well. (Admittedly, I would also be curious to see how well the model does when self-translating instructions from English to French or vice-versa, then following instructions. Could GPT actually do be... (read more)

1Dirichlet-to-Neumann6mo
Sadly, both my time and capacity are limited to "trying some prompts to get a feeling of what the results look like." I may do more if the results are actually interesting. One of the first tasks I tested was actually to write essays in English with a prompt in French, which it did very well - I would say better than when asked for an essay in French. I've not looked at the inverse task though (prompt in English for an essay in French). I'll probably translate the prompts through DeepL with a bit of supervision and analyse the results using a thoroughly scientific "my gut feeling", with maybe some added "my mother's expertise".

This seems like an interesting idea. I have this vague sense that if I want to go into alignment I should know a lot of maths, but when I ask myself why, the only answers I can come up with are:

  • Because people I respect (Eliezer, Nate, John) seem to think so (BAD REASON)
  • Because I might run into a problem and need more maths to solve it (Not great reason since I could learn the maths I need then)
  • Because I might run into a problem and not have the mathematical concepts needed to even recognise it as solvable or to reduce it to a Reason 2 level problem (Go
... (read more)
2Jack O'Brien6mo
3 is my main reason for wanting to learn more pure math, but I use 1 and 2 to help motivate me
3Ulisse Mini6mo
#3 is good. Another good reason is so you have enough mathematical maturity to understand fancy theoretical results. I'm probably overestimating the importance of #4; really I just like having the ability to pick up a random undergrad/early-grad math book and understand what's going on, and I'd like to extend that further up the tree :)

Interestingly, the average startup founder does appear to be in their 40s (a quick Google search says 42 for most sources, but I also see 45), and the average unicorn (billion-dollar) startup founder is 34. https://www.cnbc.com/2021/05/27/super-founders-median-age-of-billion-startup-founders-over-15-years.html

So, I guess it depends on how close to the tail you consider the "best startups". Google, for instance, had Larry Page and Sergey Brin at 25 when they formed it. It does seem like, taken literally, younger = better.

However, I imagine most people, if t... (read more)

3Ann7mo
There's most definitely a category of people who would think a billion-dollar startup was decidedly not best, and in fact had failed their intention.

The whole problem with "Human raters make systematic errors" is that these errors are likely to creep into even a heavily scrutinized ground truth. If you have a way of creating a correct ground truth that avoids this problem, you don't need the second model; you can just use that as the dataset for the first model.

I feel like you've significantly misrepresented the people who think AGI is 10-20 years away.

Two things you mention:

Notice that this is a math problem, not an engineering problem...They're sweeping all of the math work--all of the necessary algorithmic innovations--under the rug. As if that stuff will just fall into our lap, ready to copy into PyTorch.

But creative insights do not come on command. It's not unheard of that a math problem remains open for 1000 years.

And with respect to scale maximalism, you write:

Some people say that we've already had the vast

... (read more)
2Throwaway23678mo
I am one who says that (not certain, but with high probability), so I thought I would chime in. The main ideas behind my belief:

1. The Kaplan paper and Chinchilla paper show the function between resources and cross-entropy loss. With high probability, I believe this scaling won't break down significantly, i.e. we can get ever closer to the theoretical irreducible entropy with transformer architectures.

2. Cross-entropy loss measures the distance between two probability distributions, in this case the distribution of human-generated text (encoded with tokens) and the empirical distribution generated by the model. I believe with high probability that this measure is relevant, i.e. we can only get to a low enough cross-entropy loss when the model is capable of doing human-comparable intellectual work (irrespective of whether it actually does it).

3. After the model achieves the necessary cross-entropy loss, and consequently becomes capable somewhere in it of producing AGI-level work (as per 2), we can get the model to output that level of work with minor tweaks (I don't have specifics, but think on the level of letting the model recursively call itself on some generated text with a special output command, or some such).

I don't think prompt engineering is relevant to AGI. I would be glad for any information that can help me update.
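
Point 2 can be made concrete with a toy sketch, using hypothetical next-token distributions over a four-token vocabulary: the cross-entropy a model pays is bounded below by the entropy of the true distribution (the irreducible part), and the gap is the KL divergence the model can still close.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i): the average negative log-likelihood
    that model distribution q assigns to samples drawn from the true distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
p_true  = [0.70, 0.20, 0.05, 0.05]   # "human text" distribution
q_model = [0.60, 0.25, 0.10, 0.05]   # model's learned distribution

h_p  = cross_entropy(p_true, p_true)   # entropy of p: the irreducible floor
h_pq = cross_entropy(p_true, q_model)  # what the model actually pays
print(h_p, h_pq, h_pq - h_p)           # the gap is KL(p || q), always >= 0
```

In this framing, "scaling won't break down" is the claim that more compute and data keep shrinking that KL gap toward zero; whether closing the gap implies human-comparable intellectual work is the separate claim in point 2.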

This isn't a generalised theory of learning that I've formalised or anything. This is just my way of asking "What's my goal with this distillation?" The way I see it is - you have an article to distill. What's your intended audience?

If the intended audience is people who could read and understand the article given, say, 45 minutes - you want to summarise the main points in less time, maybe 5-10 minutes. You're summarising, aka teaching faster. This usually means less, not more, depth.

If the intended audience is people who lack the ability to read and under... (read more)

2DirectedEvolution8mo
I think that it's good to keep those general heuristics in mind, and I agree with all of them. My goal is to describe the structure of pedagogical text in a way that makes it easier to engineer.

I have a way of thinking about shallowness and depth with a little more formality. Starting with a given reader's background knowledge, explaining idea "C" might require explaining ideas "A" and "B" first, because they are prerequisites for the reader to understand "C." A longer text that you're going to summarize might present three such "chains" of ideas:

A1 -> B1 -> C1
A2 -> B2 -> C2 -> D2
A3 -> B3

It might take 45 minutes to convey all three chains of ideas to their endpoints. Perhaps a 5-10 minute summary can only convey 3 of these ideas. If the most important ideas are A1, A2, and A3, then it will present them. If the most important idea is C1, then it will present A1 -> B1 -> C1. If D2 is the most important idea, then the summary will have to leave out this idea, be longer, or find a more efficient way to present its ideas.

This is why I see speed and depth as being intrinsically intertwined in a summary. Being able to help the reader construct an understanding of ideas more quickly allows it to go into more depth in a given timeframe. All the heuristics you mention are important for executing this successfully. For example, "Ask what the existing understanding of your audience is" comes into play if the summary-writer accidentally assumes knowledge of A2, leaves out that idea, and leads off with B2 in order to get to D2 in the given timeframe. "Start concrete, then go abstract" might mean that the writer must spend more time on each point, to give a concrete example, and therefore they can't get through as many ideas in a given timeframe as they'd hoped. "Identify where people might get confused" has a lot to do with how the sentences are written; if people are getting confused, this cuts down on the number of ideas you can effectively present in a given timeframe.
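
As a toy sketch of the chain model above (hypothetical ideas and per-idea costs), the cost of summarising up to a target idea is just the cost of teaching its whole prerequisite chain:

```python
# Toy sketch of the prerequisite-chain model (hypothetical ideas and costs).
# prereq maps each idea to the idea that must be taught immediately before it.
prereq = {"B1": "A1", "C1": "B1",
          "B2": "A2", "C2": "B2", "D2": "C2",
          "B3": "A3"}
minutes = {idea: 3 for idea in ["A1", "B1", "C1", "A2", "B2", "C2", "D2", "A3", "B3"]}

def chain_to(idea):
    """Walk prerequisites back to the root idea, returning the chain in teaching order."""
    chain = [idea]
    while chain[-1] in prereq:
        chain.append(prereq[chain[-1]])
    return list(reversed(chain))

def cost(idea):
    """Minutes needed to teach everything up to and including the target idea."""
    return sum(minutes[i] for i in chain_to(idea))

print(chain_to("C1"))  # ['A1', 'B1', 'C1']
print(cost("D2"))      # 12 minutes: the whole A2 -> B2 -> C2 -> D2 chain
```

Under a 5-10 minute budget with these (made-up) costs, D2 is unreachable without either cutting its chain short or compressing each idea, which is exactly the speed/depth trade-off described above.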

I consider distillation to have two main possibilities - teach people something faster, or teach people something better. (You can sometimes do both simultaneously, but I suspect that usually requires you to be really good and/or the original text to be really bad)

So, I would separate summarisation (teaching faster) from pedagogy (teaching better) and would say that your idea of providing background knowledge falls under the latter. The difference in our opinions, to me, is that I think it's best to separate the goal of this from the goal of summarising, a... (read more)

2DirectedEvolution8mo
How do you define teaching “better” in a way that’s cleanly distinguished from teaching “faster?” Or on a deeper level, how would you operationalize “learning” so that we could talk about better and worse learning with some precision? For example, the equation for beam bending energy in a circular arc is Energy = 0.5EIL (1/R)^2. “Shallow” learning is just being able to plug and chug if given some values to put in. Slightly deeper is to memorize this equation. Deeper still is to give a physical explanation for why this equation includes the variables that it does. Yet we can’t just cut straight to that deepest layer of understanding. We have to pass through the shallower understandings first. That requires time and speed. So teaching faster is what enables teaching better, at least to me. Do you view things differently?
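
For what it's worth, the shallowest "plug and chug" layer of that example is a one-liner (hypothetical values for a small steel beam; note that E below is the elastic modulus, distinct from the energy on the left-hand side):

```python
# Shallow, plug-and-chug use of Energy = 0.5 * E * I * L * (1/R)**2,
# with hypothetical values for a small steel beam.
E = 200e9   # elastic modulus, Pa (typical for steel)
I = 8.0e-6  # second moment of area, m^4
L = 2.0     # beam length, m
R = 50.0    # radius of the circular arc it is bent into, m

energy = 0.5 * E * I * L * (1 / R) ** 2
print(energy, "J")
```

Being able to run this calculation demonstrates the shallowest layer of understanding; explaining why stiffer material (larger E), a deeper cross-section (larger I), and tighter curvature (smaller R) all raise the stored energy is the deeper layer the comment describes.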

So, if I understand correctly, the way we would consider it likely that the correct generalisation had happened would be if the agent could generalise to hazards it had never seen actually kill chickens before? And this would require the agent to have an actual model of how chickens can be threatened, such that it could predict that lava would destroy chickens based on, say, its knowledge that it will die if it jumps into lava, which is beyond capabilities at the moment?

2TurnTrout8mo
Yes, that would be the desired generalization in the situations we checked. If that happens, it means we specified a behavioral generalization property, wrote down how we were going to get it, and were just right in predicting that that training rationale would go through.

Why is this difficult? Is it only difficult to do this in Challenge Mode - if you could just code in "Number of chickens" as a direct feed to the agent, can it be done then? I was thinking about this today, and got to wondering why it was hard - at what step does an experiment to do this fail?

2TurnTrout8mo
Even if you can code in number of chickens as an input to the reward function, that doesn't mean you can reliably get the agent to generalize to protect chickens. That input probably makes the task easier than in Challenge Mode, but not necessarily easy. The agent could generalize to some other correlate. Like ensuring there are no skeletons nearby (because they might shoot nearby chickens), but not in order to protect the chickens.

I sat down and thought about alignment (by the clock!) for a while today and came up with an ELK breaker that has probably been addressed elsewhere, and I wanted to know if someone had seen it before.

So, my understanding of ELK is that we want our model to tell us what it actually knows about the diamond, not what it thinks we want to hear. My question is - how do we specify this objective to the AI?

I can think of two ways, both bad.

1) AI aims to provide the most accurate knowledge of its state possible. Breaker: AI likely provides something uninte... (read more)
