Håvard Tveit Ihle

AI researcher, former cosmologist. Homepage.

Posts
Inference costs for hard coding tasks halve roughly every two months (15 karma, 6d, 0 comments)
Is the gap between open and closed models growing? Evidence from WeirdML (7 karma, 2mo, 3 comments)
Introducing the WeirdML Benchmark (56 karma, 8mo, 13 comments)
o1-preview is pretty good at doing ML on an unknown dataset (67 karma, 1y, 1 comment)
How good are LLMs at doing ML on an unknown dataset? (33 karma, 1y, 4 comments)
Comments

Oliver Daniels-Koch's Shortform
Håvard Tveit Ihle · 19h · 10

Yeah, it could be worth it in some cases, if that is what you need for your experiment. In this case I would look for a completely open-source LLM project (where both the code and the data are open), so that you know you are comparing apples to apples-with-your-additional-pretraining.

Oliver Daniels-Koch's Shortform
Håvard Tveit Ihle · 22h · 76

If you have to repeat the entire post-training all the way from a base model, that is obviously a lot more work than just adding a small fine-tuning stage to an already post-trained model.

The full post-training can also only really be done by a big lab that has its own full post-training stack. Post-training is getting more advanced and complicated with each passing month.

The title is reasonable
Håvard Tveit Ihle · 1d · 41

You say "LLMs are really weird", like that is an argument against Eliezers high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezers position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this?

Not saying I agree with Eliezer's high confidence, just talking about this specific point.

Is the gap between open and closed models growing? Evidence from WeirdML
Håvard Tveit Ihle · 2mo · 24

Actually, it seems to come in around the level of the leading Chinese models, so the gap is not closing much, at least not on these kinds of tasks.

meemi's Shortform
Håvard Tveit Ihle · 8mo · 192

Why do you consider it unlikely that companies could (or would) fish out the questions from API logs?

Introducing the WeirdML Benchmark
Håvard Tveit Ihle · 8mo · 41

Thank you for your comment!

Not sure I agree with you about which way the tradeoff shakes out. To me it seems valuable that people outside the main labs have a clear picture of the capabilities of the leading models, and how those capabilities evolve over time, but I see your point that it could also encourage or help capabilities work, which is not my intention.

I’m probably guilty of trying to make the benchmark seem cool and impressive in a way that may not be helpful for what I actually want to achieve with this.

I will think more about this, and read what others have been thinking about it. At the very least I will keep your perspective in mind going forward.

Introducing the WeirdML Benchmark
Håvard Tveit Ihle · 8mo · 20

The LLMs are presented with the ML task and they write Python code to solve it. That Python code is what is run in an isolated Docker container with 12 GB of memory.

So the LLMs themselves are not run on the TITAN V; they are mostly called through an API. I did in fact run a bunch of the LLMs locally through Ollama, just not on the TITAN V server but on a larger one.
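For readers curious what that kind of setup looks like in practice, here is a minimal sketch of running model-generated code in a memory-capped Docker container. This is an illustration, not the actual WeirdML harness; the image name, mount paths, and the host-side timeout are assumptions.

```python
import subprocess

def run_generated_code(script_path: str, data_dir: str, timeout_s: int = 600) -> str:
    """Run an LLM-generated Python script inside an isolated Docker container.

    Hypothetical setup for illustration: a stock Python image, a 12 GB memory
    cap, no network access, and a wall-clock timeout enforced on the host.
    """
    cmd = [
        "docker", "run", "--rm",
        "--memory", "12g",                            # hard memory cap for the container
        "--network", "none",                          # generated code gets no network access
        "-v", f"{script_path}:/work/solution.py:ro",  # mount the generated script read-only
        "-v", f"{data_dir}:/work/data:ro",            # mount the task data read-only
        "python:3.11-slim",
        "python", "/work/solution.py",
    ]
    # subprocess.run raises TimeoutExpired if the time limit is exceeded;
    # a real harness would catch that and record the run as failed.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return result.stdout

# Hypothetical usage:
# stdout = run_generated_code("/tmp/solution.py", "/tmp/task_data")
```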

Introducing the WeirdML Benchmark
Håvard Tveit Ihle · 8mo · 10

> My guess is it's <1 hour per task assuming just copilot access, and much less if you're allowed to use e.g. o1 + Cursor in agent mode. That being said, I think you'd want to limit humans to comparable amounts of compute for comparable number, which seems a bit trickier to make happen.

I guess I was thinking that the human baseline should be without LLMs, because otherwise I could just forward the prompt to the best LLM, see what it did, and perhaps improve upon it, which would put the human level always at or above the best LLM.

Then again, this is not how humans typically work now, so it's unclear what a "fair" comparison is. I guess it depends on what the human baseline is supposed to represent, and you have probably thought a lot about that question at METR.

> Is the reason you can't do one of the existing tasks, just to get a sense of the difficulty?

I could, but it would not really be a fair comparison, since I have seen many of the LLMs' solutions and have seen what works.

Doing a fresh task I made myself would not be totally fair either, since I will know more about the data than they do, but it would definitely be closer to fair.

Introducing the WeirdML Benchmark
Håvard Tveit Ihle · 8mo · 10

API costs will definitely dominate for o1-preview, but most of the runs are with models that are orders of magnitude cheaper, and then it is not clear what dominates.

Going forward, models like o1-preview (or even more expensive ones) will probably dominate the cost, so the compute will probably be a small fraction.

Introducing the WeirdML Benchmark
Håvard Tveit Ihle · 8mo · 30

Thank you!

I've been working on the automated pipeline as a part-time project for about two months, probably equivalent to 2-4 full-time weeks of work.

One run for one model on one task typically takes perhaps 5-15 minutes, but it can be up to about an hour (if they use their 10 minutes of compute time efficiently, which they tend not to do).

Total API costs for the project are probably below $200 (if you do not count the credits used on Google's free tier). Most of the cost is for running o1-mini and o1-preview (even though o1-preview only went through a third of the runs compared to the other models). o1-preview costs about $2 for each run on each task. For compute I'm using hardware we have locally with my employer, so I have not tracked what the equivalent cost of renting it would be, but I guess it would be of the same order of magnitude as the API costs, or a factor of a few larger.

I expect the API costs to dominate going forward, though, if we want to run o3 models etc. through the eval.
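As a rough illustration of how those API costs add up, here is a back-of-the-envelope sketch. Only the ~$2 per o1-preview run comes from the comment above; the other per-run costs and all run counts are hypothetical placeholders.

```python
# Back-of-the-envelope API cost estimate. Only the ~$2/run figure for o1-preview
# comes from the comment above; everything else is a hypothetical placeholder.
cost_per_run_usd = {
    "o1-preview": 2.00,     # from the comment: ~$2 per run on each task
    "o1-mini": 0.40,        # assumed for illustration
    "cheaper-model": 0.02,  # assumed for illustration
}
runs_completed = {
    "o1-preview": 30,       # hypothetical: roughly a third of the full run count
    "o1-mini": 90,          # hypothetical
    "cheaper-model": 90,    # hypothetical
}

total = sum(cost_per_run_usd[m] * runs_completed[m] for m in runs_completed)
print(f"Estimated total API cost: ${total:.2f}")  # ~$98 with these placeholder numbers
```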
