This work was done as an experiment for Boaz Barak’s “CS 2881r: AI Safety and Alignment” at Harvard. The lecture where this work was presented can be viewed on YouTube here.
From the launch of ChatGPT to billion-dollar chip deals, AI announcements now shape global markets and public imagination alike. But how does the economy respond to these announcements? And how do the new and improved models approach the complex, “messy” tasks that characterize real-world human labor?
For CS 2881r at Harvard, we wanted to explore both sides of this question:
- How financial markets have reacted to major AI-related announcements, and
- How large language models (LLMs) perform and self-assess on tasks of varying “messiness.”
1. Market Reactions to AI Announcements
Motivation
When ChatGPT was first released in late 2022, markets moved sharply. Treasury yields fell, tech stocks rallied, and investors began rewriting their expectations about productivity and inflation. We wondered: were these just initial shocks, or did markets continue to respond to AI milestones over time?
Process
We compiled a list of 19 major AI-related announcements from November 2022 (ChatGPT) through June 2025 (Gemini 2.5 Pro). For each event, we measured changes in 10-year and 30-year Treasury yields within a ±5 trading day event window. Using data from the FRED API, we compared actual yield movements to a constant model of expected changes, computing cumulative abnormal changes (CACs) to detect unusual movements around announcement dates.
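For readers who want to see how such a calculation works, here is a minimal sketch of the cumulative abnormal change computation, assuming the fredapi package and the standard FRED series for 10- and 30-year constant-maturity Treasury yields (DGS10, DGS30). The estimation-window lengths and the example event date are illustrative choices, not necessarily the exact parameters of our pipeline.

```python
# Minimal sketch of the event-study calculation (illustrative, not the exact
# pipeline described above). Requires a FRED API key and the fredapi package.
import pandas as pd
from fredapi import Fred

fred = Fred(api_key="YOUR_FRED_API_KEY")  # placeholder key

# 10-year and 30-year Treasury constant-maturity yields (percent, daily).
yields = pd.DataFrame({
    "y10": fred.get_series("DGS10"),
    "y30": fred.get_series("DGS30"),
}).dropna()

daily_change = yields.diff().dropna()  # day-over-day change in percentage points

def cumulative_abnormal_change(series: pd.Series, event_date: str,
                               window: int = 5, est_days: int = 60) -> float:
    """CAC = sum of (actual change - expected change) over a +/- `window`
    trading-day event window. Here the 'constant model' sets the expected
    change to the mean daily change over an estimation period ending 10
    trading days before the event (the 60/10-day choices are assumptions)."""
    event_loc = series.index.searchsorted(pd.Timestamp(event_date))
    estimation = series.iloc[max(0, event_loc - est_days - 10): event_loc - 10]
    expected = estimation.mean()
    event_window = series.iloc[event_loc - window: event_loc + window + 1]
    return (event_window - expected).sum()

# Example: ChatGPT launch, announced 2022-11-30.
cac_10y = cumulative_abnormal_change(daily_change["y10"], "2022-11-30")
print(f"10y CAC around ChatGPT launch: {cac_10y * 100:.1f} bps")
```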
Findings
- Phase 1 (2022–2023): The AI Shock.
Announcements like ChatGPT and GPT-4 triggered significant yield drops, as large as 51 basis points, reflecting investor excitement and a belief in AI-driven productivity gains.
- Phase 2 (Early 2024): Diminishing Returns.
Yields still moved, but less sharply. Markets seemed to be adjusting their expectations, showing smaller and mixed reactions.
- Phase 3 (Mid-2024–2025): Saturation.
By the time of later announcements like Gemini 2.5, yield changes were statistically insignificant. Investors appeared to have fully priced in AI’s long-run productivity story.

Our interpretation is that markets may have learned. The early “AI shock” has matured into a steady state in which new announcements are no longer surprises. Just as AI systems stabilize with training, so too have financial markets adapted to the once-explosive influence of AI news. That said, the early major AI announcements coincided with other market-moving events, such as Federal Reserve announcements and the SVB crisis, which complicates this interpretation.
2. How Do LLMs Handle Messy, Real-World Tasks?
Motivation
While markets adjust to AI, AI itself is learning to deal with the real world — and the real world is messy. Human work often involves incomplete information, overlapping goals, and ambiguous instructions. We wanted to measure how this “messiness” affects LLM performance and how well models can estimate their own completion times for tasks that resemble real jobs.
Process
We used GDPVal, a public benchmark of 1,000+ economic and professional tasks, selecting 70 tasks across three key sectors:
- Professional, Scientific, and Technical Services (e.g., software engineering),
- Government, and
- Retail Trade.
Each task was evaluated using METR’s messiness framework (Kwa et al., 2025), which defines messiness based on factors like real-world sourcing, lack of simplification, and end-to-end context. We asked GPT-4o-mini to rate each task on three dimensions: messiness (0–16 scale), difficulty (1–5 scale), and estimated completion time. Initially, the model’s time estimates were wildly high (sometimes hours for short tasks), so we introduced prompt calibrations:
- Reframed time in seconds instead of minutes,
- Provided baselines (e.g., “1k-token reasoning ≈ 2–3 minutes”).
This calibration improved estimates and consistency, though, as you’ll see below, the estimates still run far too high in most cases.
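To make this concrete, the calibrated rating call looked roughly like the sketch below. The prompt wording, JSON field names, and the `rate_task` helper are illustrative reconstructions, not the exact prompts we used.

```python
# Rough sketch of the calibrated rating call (prompt wording and JSON fields
# are illustrative, not the project's verbatim prompts).
import json
from openai import OpenAI

client = OpenAI()

RATING_PROMPT = """You will rate a work task.
Return JSON with keys: messiness (0-16, based on METR-style factors such as
real-world sourcing, lack of simplification, and end-to-end context),
difficulty (1-5), and est_seconds (estimated completion time in SECONDS).
Calibration baseline: generating ~1k tokens of reasoning takes roughly
2-3 minutes (120-180 seconds). Do not answer in hours unless the task
truly requires many independent deliverables.

Task:
{task}
"""

def rate_task(task_text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": RATING_PROMPT.format(task=task_text)}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

# Example usage (gdpval_tasks is a hypothetical list of task dicts):
# ratings = [rate_task(t["prompt"]) for t in gdpval_tasks]
```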

Results
- Messier tasks took longer.
As expected, higher messiness scores correlated with longer estimated completion times. Models slowed down when confronted with unstructured, realistic, multi-step prompts — just like humans.

- Difficulty was not predictive.
Subjective difficulty ratings didn’t align with performance outcomes. “Messiness” turned out to be a better proxy for real-world cognitive load.
- Self-estimation remains unreliable.
LLMs often misjudged their own completion times, revealing persistent limitations in meta-cognition, that is, their ability to reflect on their own processes. In the graph below, points to the left of the red line are tasks whose completion time was overestimated relative to the true time; as you can see, most tasks fall on that side (both checks are sketched just below this list).
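The sketch below shows one way to compute these two checks, assuming a per-task table with `messiness`, `est_seconds`, and `true_seconds` columns; the file and column names are our own illustrative choices, not the project’s actual schema.

```python
# Sketch of the analysis behind the two findings above (column names are
# illustrative; `true_seconds` is the reference completion time that each
# model estimate is compared against).
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv("task_ratings.csv")  # hypothetical file of per-task ratings

# 1) Messier tasks took longer: rank correlation between messiness and
#    estimated completion time.
rho, p = spearmanr(df["messiness"], df["est_seconds"])
print(f"Spearman rho(messiness, est. time) = {rho:.2f} (p = {p:.3f})")

# 2) Self-estimation remains unreliable: share of tasks where the estimate
#    exceeds the reference time (i.e., the overestimated tasks to the left
#    of the red line in the plot).
overestimated = (df["est_seconds"] > df["true_seconds"]).mean()
print(f"Share of tasks overestimated: {overestimated:.0%}")
```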

Our analysis also compared performance across model generations: specifically GPT-4o-mini, GPT-5-mini, and GPT-5.
Across the 70 GDPVal tasks, we observed a clear improvement in output quality for the newer models. The differences were not just quantitative (higher accuracy or completeness scores), but also qualitative — the kinds of tasks that each model excelled at changed meaningfully between the 4-series and 5-series.
Key Findings
- Overall Performance Gains
Newer models (GPT-5-mini and GPT-5) produced slightly higher-quality, more usable responses across nearly every domain.

- Shift in Task Strengths
The GPT-4-series models tended to perform better on structured, template-like tasks — for example, filling out forms, summarizing brief documents, or performing contained reasoning steps.
In contrast, the GPT-5-series models excelled on open-ended, professional tasks that more closely resemble real economic or managerial work, such as:
- Software engineering (multi-file synthesis, configuration, deployment)
- Project management and professional writing (e.g., memos, training manuals)
- Legal and regulatory reasoning (interpreting policies, drafting responses)


We hypothesize that this improvement stems from increased training exposure to economic and enterprise-related data.
As models have become more commercially integrated, the tasks most valuable to businesses — such as technical documentation, contract generation, or analytical writing — have become better represented in the data used to fine-tune them.
In other words, as the economic relevance of LLM applications has grown, so too has models’ competency in those economically valuable domains.
- Current model capabilities do not yet reach the accuracy on messy tasks needed to replace human labor.
The best-performing model, GPT-5-mini, achieved an average score of 58/100 across the 70 tasks. Both GPT-5 and GPT-5-mini scored lower on average on higher-messiness tasks (M = 10 or M = 12) than on lower-messiness ones (M = 9); this comparison is sketched below. We note that the sample sizes of these groups are limited and the findings carry significant uncertainty, but our evidence suggests that the models tested cannot consistently achieve high accuracy on messy tasks.
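The bucket comparison can be computed roughly as follows, assuming a table of graded outputs with `model`, `messiness`, and `score` columns (the file and column names are illustrative):

```python
# Sketch of the messiness-bucket comparison (column names are illustrative;
# scores are per-task grades on a 0-100 scale).
import pandas as pd

scores = pd.read_csv("graded_outputs.csv")  # hypothetical per-task grades

# Average score per model, overall and broken out by messiness level.
overall = scores.groupby("model")["score"].agg(["mean", "count"])
by_messiness = (
    scores.groupby(["model", "messiness"])["score"]
    .agg(["mean", "count"])
    .round(1)
)

print(overall)
print(by_messiness)  # counts per bucket are small, so treat gaps as suggestive
```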

Conclusions
Our findings highlight three key lessons for understanding AI’s evolving role in the economy:
- Markets may be pricing in new developments in the AI industry, but confounders remain.
Market movements in response to AI industry announcements became less significant over time, but some announcements coincided with other world events, so we can’t definitively claim that investors have adjusted their expectations.
- Messiness matters.
Even the latest OpenAI reasoning models still struggle to consistently perform unstructured, context-rich tasks, with overall average scores in the 50–60 range out of 100 for the models tested. AI systems would need much higher accuracy on these tasks to approach full replacement of human labor in real-world contexts.
- Meta-cognition remains a challenge.
LLMs are not yet able to reliably predict their own completion times. Enabling models to reflect on their own performance and recognize uncertainty may help integrate AI safely and effectively into professional workflows.