OpenAI is competing in the AtCoder World Tour Finals (Heuristic division) with a new model/agent. It is a 10-hour competition built around an optimization problem, and OpenAI's model is currently in 2nd place.
Is optimizing CoT to look nice a big concern? There are other ways to show a nice CoT without optimizing for it. The frontrunners also have some incentives not to show the real CoT. Additionally, there is a good chance that people will prefer a nicely structured summary of the CoT, written by a small LLM, once the reasoning becomes very long and convoluted.
What is the point of these benchmarks without knowing the training compute and data? One of the main questions is their interpretability. Iterative refinement of these models may open new opportunities.
People with a history of seizures are usually excluded from these kinds of clinical trials, so it is not an apples-to-apples comparison. The problem is that bupropion interacts with a lot of drugs. Seizure rates are also highly dose-dependent (10 times higher if taking more than 450 mg daily). Generally, if you’re not taking any interacting medications, are on the 150–300 mg slow-release version, and have no history of seizures, then the risk is low.
As a doctor, I can tell you that even if you don’t have anxiety, it’s possible to develop some while taking bupropion/Wellbutrin. I used it personally and experienced the most severe anxiety I’ve ever had. It is also associated with a higher chance of seizures, and if you daydream a lot, it may make the daydreaming worse. However, on the positive side, it often decreases inattention. Generally I like the drug, but it is not a first-line treatment for depression, and for good reasons.
I lived for a while in a failing country with high unemployment. The businesses and jobs that pay well become saturated very quickly. People are less likely to spend money and often delay buying new things or maintaining their homes. Many jobs exist because we don't have time to do the work ourselves, and a significant number of these jobs will just vanish. It is really hard to prepare for a society with a high unemployment rate.
Overrefusal issues were way more common 1-2 years ago. Models like Gemini 1 and Claude 1-2 had severe overrefusal issues.
Your argument is actually possible, but what evidence do you have that makes it the likely outcome?
The difficulty of alignment is still unknown. It may be totally impossible, or maybe some changes to current methods (deliberative alignment or Constitutional AI) plus some R&D automation can get us there.
The earliest submissions by human players were at the 37-minute mark, and 3 people had submitted results by the 1-hour mark. However, this is a competitive, time-constrained environment, so it is more likely a 2-4 hour task. There is also the possibility that players made multiple attempts that were not good enough, so it may be shorter than that. The first OpenAI submission was at the 15-minute mark, so some brute-forcing is probably happening. Assuming that the generation speed is the same as o3 (168 tokens per second), they used about 150,000 tokens for the first submission and more than 5.7 million for the whole competition. Of course, this rests on a lot of assumptions. There is a good chance that they used more tokens than that.
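For anyone who wants to check the arithmetic in the estimate above, here is a rough sketch. It assumes a single stream generating continuously at a constant 168 tokens per second (the o3 figure quoted in the comment, not a confirmed number for this model); the function name is just for illustration.

```python
# Back-of-the-envelope token estimate from an assumed constant generation rate.
TOKENS_PER_SECOND = 168  # assumption from the comment: same speed as o3


def tokens_generated(minutes, rate=TOKENS_PER_SECOND):
    """Tokens produced by one stream generating continuously for `minutes`."""
    return int(minutes * 60 * rate)


first_submission_tokens = tokens_generated(15)    # first submission at the 15-minute mark
full_contest_tokens = tokens_generated(10 * 60)   # full 10-hour contest

print(f"first submission: {first_submission_tokens:,} tokens")
print(f"full contest:     {full_contest_tokens:,} tokens")
```

Under these assumptions the 15-minute figure comes to 151,200 tokens (roughly the 150,000 quoted), and ten hours of uninterrupted generation comes to 6,048,000, in line with "more than 5.7 million". Any parallel sampling would scale these numbers up linearly, which is one reason the real usage is probably higher.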