I am curious to see what results the new Gemini 2.5 Pro would get on internal benchmarks.
I don't think ketamine neurotoxicity is a thing. Ketamine is actually closer to being neuroprotective.
I am confident that LLMs significantly boost software development productivity (I would say 20-50%) and am completely sure it's not even close to 5x.
However, although I agree with your conclusion, I would like to point out that the timeframes are pretty short. 2 years ago (~exactly the GPT-4 launch date) LLMs were barely making any impact. I think the tools started to resemble their current state around 1 year ago (~exactly the Claude 3 Opus launch date).
Now, suppose we had a 5x boost for a year. Would it be very visible? We would have gotten 5 years of progress in 1 year, but did the software landscape change that much over any 5 years of the pre-LLM era? Comparing 2017 and 2022, I don't feel like that much changed.
The tech stack has shifted almost entirely to whatever there was the most data on: Python and JavaScript/TypeScript are in, almost everything else is out.
I think AI agents will actually prefer strongly typed languages because they provide more feedback. Working with TypeScript, Python and Rust, while a year ago the first two were clearly winning in terms of AI productivity boost, nowadays I find Cursor Agent making fewer mistakes with Rust.
I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf
TL;DR: Research on LLM productivity impacts in materials discovery.
Main takeaways:
I would like to note that this dataset is not as hard as it might look. Humans didn't perform that well because of a strict time limit; I don't remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.
Nevertheless, it's very impressive, and AIMO results are even more impressive in my opinion.
Thanks, I think I understand your concern well now.
I am generally positive about the potential of prediction markets if we somehow resolve the legal problems (which seems unrealistic in the short term but realistic in the medium term).
Here is my perspective on "why should a normie who is somewhat risk-averse, doesn't enjoy wagering for its own sake, and doesn't care about the information externalities engage with prediction markets":
First, let me try to tackle the question at face value:
Good to know :)
I do agree that subsidies run into a tragedy-of-the-commons scenario. So although subsidies are beneficial, they are not sufficient.
But do you find my solution to be satisfactory?
I have thought about this a lot; I even seriously considered launching my own prediction market and wrote some code for it. I strongly believe that simply allowing the use of other assets solves most of the practical problems, so I would be happy to hear any concerns or further clarify my point.
Or another, perhaps easier solution (I updated my original answer): just all...
Isn't this just changing the denominator without changing the zero- or negative-sum nature?
I feel like you are mixing two problems here: an ethical problem and a practical problem. UPD: on second thought, maybe you just meant the second problem, but still I think my response would be clearer by considering them separately.
The ethical problem is that it looks like prediction markets do not generate income, thus they are not useful and shouldn't be endorsed; they don't differ much from gambling.
While it's true that they don't generate income and are ze...
Why does it have to be "safe enough"? If all market participants agree to bet using the same asset, it can bear any degree of risk.
I think I should have said that a good prediction market lets users choose which asset a particular "pair" will use. This causes a liquidity split, which is also a problem, but it's manageable, and in my opinion it would be much closer to an imaginary perfect solution than "bet only USD".
I am not sure I understand your second sentence, but my guess is that this problem will also go away if each market "pair" uses a single (but customizable) asset. If I got it wrong, could you please clarify?
In a good prediction market design users would not bet USD but instead something which appreciates over time or generates income (e.g. ETH, Gold, S&P 500 ETF, Treasury Notes, or liquid and safe USD-backed positions in some DeFi protocol).
Another approach would be to use the funds held in the market to invest in something profit-generating and distribute part of the income to users. This is the same model that non-algorithmic stablecoins (USDT, USDC) use.
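To make the denomination point concrete, here is a toy calculation (hypothetical numbers and function names; the 7%/year asset return is just an illustrative assumption):

```python
# Toy comparison of a 1-year, even-money bet settled in USD vs. settled in
# an appreciating asset (hypothetically 7%/year on, say, an index ETF).

def settle_usd(stake_usd: float, won: bool) -> float:
    # USD-denominated market: winner takes the pot, and both sides forgo
    # any return on the capital locked up for the year.
    return 2 * stake_usd if won else 0.0

def settle_asset(stake_usd: float, won: bool, asset_return: float = 0.07) -> float:
    # Asset-denominated market: stakes are held as ETF shares, so the pot
    # itself appreciates while locked up. Still zero-sum measured in shares,
    # but positive-sum measured in USD.
    pot = 2 * stake_usd * (1 + asset_return)
    return pot if won else 0.0

# Two participants stake $100 each for a year.
total_usd = settle_usd(100, True) + settle_usd(100, False)
total_asset = settle_asset(100, True) + settle_asset(100, False)
print(round(total_usd, 2))    # 200.0
print(round(total_asset, 2))  # 214.0
```

The point is only that participants collectively no longer pay the opportunity cost of parked capital; the market is still zero-sum relative to the chosen asset.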
So it's a problem, but definitely a solvable one, even easily solvable. The major problem is that predi...
Regarding 9: I believe it's when you are successful enough that your AGI doesn't kill you instantly, but it can still kill you in the process of using it. It's in the context of a pivotal act, so it assumes you will operate it to do something significant and potentially dangerous.
I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it.
If I don't land a safety job, one of the obvious options is to try to get hired by an AI company and learn more there, in the hope that I will either be able to contribute to safety there or eventually move into the field as a more experienced engineer.
I am conscious of why pushing capabilities could be bad so I will try to avoid it, but I am not sure how far it extends. I understa...
The British are, of course, determined to botch this like they are botching everything else, and busy drafting their own different insane AI regulations.
I am far from being an expert here, but I skimmed through the current preliminary UK policy and it seems significantly better than the EU equivalent. It even mentions x-risk!
Of course, I wouldn't be surprised if it eventually turns out to be EU-level insane, but I think it's plausible that it will be more reasonable, at least from the mainstream (not alignment-centred) point of view.
And compute, especially inference compute, is so scarce today that if we had ASI right now, it would take several decades, even with exponential growth, to build enough compute for ASIs to challenge humanity.
Uhm, what? "Slow takeoff" means ~1 year... Your opinion is very unusual, you can't just state it without any justification.
Are you implying that it is close to GPT-4 level? If so, that is clearly wrong. Especially with regard to code: everything (except maybe StarCoder, which was released literally yesterday) is worse than GPT-3.5, and much worse than GPT-4.
In addition to many good points already mentioned, I would like to add that I have no idea how to approach this problem.
Approaching x-risk is very hard too, but it is much clearer in comparison.
Preliminary benchmarks have shown poor results. It seems the dataset quality is much worse than what LLaMA had, or maybe there is some other issue.
Yet more evidence that top-notch LLMs are not just data + compute; they require some black magic.
Generally, I am not sure if it's bad for safety in the notkilleveryoneism sense: such things prevent agent overhang and make current (non-lethal) problems more visible.
Hard to say whether it's net good or net bad; there are too many factors, and the impact of each is unclear.
I am not sure how you came to the conclusion that current models are superhuman. I can visualize complex scenes in 3D, for example. Especially under some drugs :)
And I don't even think I have an especially good imagination.
In general, it is very hard to compare mental imagery with Stable Diffusion. For example, it is hard to imagine something with many different details in different parts of the image, but that is perhaps a matter of representation. An analogy could be that our perception is like a low-resolution display. I can easily zoom in on a...
I do think we are in a coding overhang.
Current harnesses seem far from the ceiling and could be improved a lot. One example: you can significantly boost output quality by using simple tricks, like telling Claude Code to implement something and then explicitly asking it to self-review. You could get slightly more creative -- say, ask to implement something security-sensitive, then ask to break it.
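As a sketch of what I mean by the self-review trick (the harness interface here is hypothetical; `run_agent` is a stand-in for whatever your tool actually exposes, not a real API):

```python
# Two-pass harness sketch: implement first, then explicitly self-review.
# `run_agent` is any callable that takes a prompt string and returns the
# agent's output; swap in your actual harness.

def implement_prompt(task: str) -> str:
    return f"Implement the following:\n{task}"

def review_prompt(task: str, first_attempt: str) -> str:
    return (
        f"You previously implemented this task:\n{task}\n\n"
        f"Your implementation:\n{first_attempt}\n\n"
        "Self-review it: list bugs, missed edge cases, and security issues, "
        "then produce a fixed version."
    )

def two_pass(run_agent, task: str) -> str:
    # Pass 1: implement. Pass 2: feed the result back for self-review.
    first = run_agent(implement_prompt(task))
    return run_agent(review_prompt(task, first))
```

For the security-sensitive case, a third pass ("now try to break it") follows the same pattern.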
And I feel the same way Andrej does. Even though I consider myself relatively adept at using AI agents, I feel like I am doing a quite horrible job. In principle, ... (read more)