This blog was written by Blessing Ajimoti, Vitor Tomaz, Hoda Maged, Mahi Shah, and Abubakar Abdulfatah as part of the Economics of Transformative AI research by BlueDot Impact and Apart Research.
Transformative Artificial Intelligence (TAI) promises major productivity gains by automating routine and complex tasks. But it also raises serious concerns about job displacement, especially in developing countries. According to the World Economic Forum’s Future of Jobs 2025 survey, 86% of employers say AI and digital access are the biggest forces reshaping business by 2030.
While discussions about which occupations are most at risk are growing, hard evidence on TAI’s real-time impact on employment is limited. This gap is especially important for developing countries, where limited digital infrastructure, significant market informality, and lower AI adoption present both challenges and opportunities. Policymakers, employers, and researchers need better tools to track how AI is affecting jobs and when to act to prevent risks.
To help close this gap, our research team, supported by BlueDot Impact and Apart Research, developed a new empirical approach. We used data from Anthropic’s Economic Index: Insights from Claude 3.7 Sonnet, released in March 2025, and built on recent empirical work by Handa and colleagues in their paper Which Economic Tasks Are Performed with AI? Evidence from Millions of Claude Conversations, which analyzed millions of Claude 3.7 Sonnet conversations to study AI’s role across economic tasks. We combined data from millions of AI prompts submitted to Anthropic’s Claude 3.7 Sonnet model with official employment records from Brazil (2021–2024). Then, building on the task-based framework of Acemoglu and Restrepo, we studied whether AI usage for automating or augmenting tasks was associated with changes in net job creation across occupations.
We split occupations into four groups according to their share of Claude conversations and developed an econometric model that fits the observed trends for two of the groups: the 10 occupations that use Claude the least, and the 10 occupations that use Claude the most for augmentation. The model did not fit the other two groups well, indicating other, unidentified dynamics at play. Comparing the trends for the first two groups, we found no statistical evidence that they differ, suggesting that job displacement by automation is not yet happening.
The results we obtained raise important questions and prompt us to continue this investigation. The data we used covers only one week of prompts from one AI system and focuses on a single country. As AI adoption grows in the workplace, the risks of job loss could increase. Still, our findings align with other recent studies, such as the Large Language Models, Small Labor Market Effects paper by Humlum and Vestergaard, which found no measurable effects of generative AI on wages or working hours in the short term in Denmark. Expanding our data coverage to other providers of AI services, geographies, and time frames could substantially improve our understanding of the underlying interaction between AI adoption and job flows.
At the heart of our research lies a deceptively simple question: can employment data serve as an early warning system for AI safety risks? Our findings raise important questions for AI safety and open a window of opportunity for stakeholders to plan ahead of potentially irreversible job elimination, especially at a time when reactive policymaking often trails technological diffusion. The approach gives researchers and policymakers empirical, sector- and geography-specific data to assist planning. By tracking these metrics, governments can intervene before job displacements spiral into political instability or regressive regulation. In sum, this research can serve as scaffolding for a theory of change that links prompt behavior to occupation-level AI use, to employment patterns, to early warning signals, and finally to policy response.
What’s promising is that the method we developed is reproducible. It can be applied in other countries and expanded with more diverse data to help governments and employers monitor how TAI affects work in real time. Rather than wait for disruption, this approach offers a way to stay ahead of it.
Our central research question is: Do occupations whose tasks are frequently associated with 'automating' AI interactions (based on Claude data) exhibit different employment trends (e.g., faster decline or slower growth) compared to other occupations, particularly those associated with 'augmenting' AI interactions? To address this, we'll break down our methodology.
First, let’s clarify a few concepts:
By “occupations” we mean the occupations listed in the official US occupational database, O*NET (the Occupational Information Network), and the tasks associated with them. It is how the US classifies occupations, from firefighters to engineers and entertainers. For analytical simplicity, we focused on O*NET’s 98 Minor Groups rather than the 1,596 detailed occupations.
The statement “occupations that correlate more with Claude conversations flagged as automating” packs a lot. First, remember that the Claude dataset links conversations to tasks. Second, O*NET provides the link between the 1,596 occupations and the 19,530 tasks. The challenge is deciding how much each task contributes to an occupation. For simplicity, we did not differentiate tasks by contribution and used the percentage of total conversations related to a given occupation (through one of its tasks) as our proxy.
So, by the statement above we mean O*NET Minor Groups whose constituent tasks collectively account for the highest volume of Claude conversations flagged as 'automating'. We identified the 10 Minor Groups with the highest percentages of such conversations, labeling this group 'Top 10 Automation'. Similar logic was applied to form 'Top 10 Augmentation' (based on 'augmenting' conversations), 'Top 10 Overall' (based on all conversations), and 'Bottom 10 Overall' (lowest share of all conversations). The table and diagram below illustrate this classification.
The first building block in our analysis is classifying occupations into 4 categories:
| Alias | Concept | Share computed over |
| --- | --- | --- |
| top_10 | Occupations that are most represented across all Claude conversations | All conversations |
| top_10_aug | Occupations where Claude was mostly helping people, i.e., augmenting their capabilities | Conversations tagged as Augmentation |
| top_10_aut | Occupations where Claude was mostly doing the task itself | Conversations tagged as Automation |
| bottom_10 | Occupations that barely showed up in Claude usage | All conversations |
Table 1: Occupation Group Definitions and Conversation Share Basis.
The foundational data is sourced from Anthropic and derived from interactions with their Claude 3.7 Sonnet model. Anthropic employs a system called 'Clio' to analyse user prompts while preserving privacy. This involves two key classification tasks: matching each conversation to the most relevant O*NET task, and tagging the interaction as automation (Claude performing the task itself) or augmentation (Claude assisting a human with it).
It is crucial to acknowledge that the limited conversational context available to Clio means this classification has inherent accuracy limitations. Anthropic provides a summarised dataset linking O*NET tasks to automation/augmentation tags and their respective shares of total conversations.
We then link this data to O*NET and summarise it to the occupation level, and tag the occupations according to the classifications we discussed above (top_10, top_10_aug, top_10_aut, and bottom_10).
Figure 1: Anthropic’s classification of tasks
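To make the grouping step concrete, here is a minimal sketch in Python (pandas), assuming a hypothetical export of the Anthropic dataset with one row per O*NET task and interaction type, plus a task-to-Minor-Group mapping. The file names and column names are illustrative, not the actual files we used.

```python
import pandas as pd

# Hypothetical inputs (names are illustrative):
#   conv: one row per (onet_task, interaction_type) with its share of all Claude conversations
#   task_map: mapping from each O*NET task to its Minor Group
conv = pd.read_csv("anthropic_economic_index_tasks.csv")   # onet_task, interaction_type, pct_conversations
task_map = pd.read_csv("onet_task_to_minor_group.csv")     # onet_task, minor_group

df = conv.merge(task_map, on="onet_task", how="inner")

# Aggregate conversation shares up to the Minor Group level
overall = df.groupby("minor_group")["pct_conversations"].sum()
aug = df.query("interaction_type == 'augmentation'").groupby("minor_group")["pct_conversations"].sum()
aut = df.query("interaction_type == 'automation'").groupby("minor_group")["pct_conversations"].sum()

groups = {
    "top_10":     overall.nlargest(10).index.tolist(),
    "top_10_aug": aug.nlargest(10).index.tolist(),
    "top_10_aut": aut.nlargest(10).index.tolist(),
    "bottom_10":  overall.nsmallest(10).index.tolist(),
}
```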
Once these occupation groups were defined, we analysed their employment trends over time using real-world labour market data. For this study, we utilised Brazil's CAGED (Cadastro Geral de Empregados e Desempregados) database, which records monthly job creation and elimination figures based on the Brazilian Classification of Occupations (CBO). We mapped these CBO codes to the O*NET Minor Groups using our established crosswalk.
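A rough sketch of that linkage step, assuming a pre-built CBO-to-Minor-Group crosswalk table and a simplified CAGED extract (all file and column names are hypothetical):

```python
import pandas as pd

caged = pd.read_csv("caged_2021_2024.csv")              # month, cbo_code, net_jobs
crosswalk = pd.read_csv("cbo_to_onet_minor_group.csv")  # cbo_code, minor_group

# Map each CBO code to its O*NET Minor Group and aggregate monthly net job flows per group
panel = (caged.merge(crosswalk, on="cbo_code", how="left")
              .groupby(["minor_group", "month"], as_index=False)["net_jobs"]
              .sum())
```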
Comparing employment trends across occupation groups is not straightforward. Several factors can confound simple comparisons, including strong seasonality, large differences in scale between groups, and underlying long-run trends.
As a result, we analyzed the data with two methodologies that attempt to remove these confounders.
We attempted to identify a model that could disaggregate the various effects on our variable of interest (net jobs), allowing us to predict its evolution.
We started with a visual inspection of the raw data to conjecture what kinds of treatment it requires. Our first observations are that (1) the data is highly cyclical, (2) the scale of the bottom_10 group is very different from the others, both in terms of mean job flows and their variation, and (3) there seems to be a slight and consistent downward trend (see Figure 2 below).
Figure 2- Step 0: Raw Data
The next very clear aspect of our data is its strong seasonality, i.e., the data has upward and downward swings that repeat over time with a constant period. We isolate that periodic effect with Seasonal-Trend decomposition using LOESS (STL), with a period of 12 months, and subtract it from the data. Without the periodic signal, the downward trend becomes slightly more visible (see Figure 3 below).
Figure 3- Step 1: Removing seasonality
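In code, the de-seasonalization step can be sketched with statsmodels' STL implementation, assuming `series` is a monthly, datetime-indexed pandas Series of net job flows for one group (variable names are ours):

```python
from statsmodels.tsa.seasonal import STL

# Decompose the monthly series with a 12-month seasonal period,
# then subtract the seasonal component to obtain the de-seasonalized series.
stl_result = STL(series, period=12, robust=True).fit()
deseasonalized = series - stl_result.seasonal
```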
We then remove the linear trend by simply fitting the data to a linear model (through an Ordinary Least Squares estimator) and subtracting it from the data. The final chart seems relatively stationary around the normalized mean (i.e., 0), which we will test next (See Figure 4 below).
Figure 4- Step 2: Removing linear trend
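A minimal sketch of the detrending step, continuing from the de-seasonalized series above:

```python
import numpy as np
import statsmodels.api as sm

# Fit a straight line over a simple time index and keep the residuals
t = np.arange(len(deseasonalized))
X = sm.add_constant(t)
trend_fit = sm.OLS(deseasonalized.values, X).fit()
detrended = deseasonalized - trend_fit.predict(X)
```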
The previous plot showed that the swings for the category bottom_10 are much bigger than for other categories. To facilitate comparison, we “scale” the data using z-score scaling, which simply means that, for each category, we subtract its average value and divide by its standard deviation (See Figure 5 below).
Figure 5- Step 3: Scaling data
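The scaling step is a one-liner per category (a sketch, using the detrended series from the previous step):

```python
# z-score: subtract the category's mean and divide by its standard deviation
scaled = (detrended - detrended.mean()) / detrended.std()
```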
If our procedure successfully captured the major drivers of fluctuation in job flows for each category, we are left with one major driver: the history of job flows itself. We test for this with an autoregressive model, which is jargon for "a model that predicts its next data point from its previous data points".
If the combination of category, seasonality, trend, and history explains most of the variance, we can conclude that our categories are informative of job flow trends and start drawing conclusions, such as testing whether the linear trends of two categories are statistically different.
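A sketch of the autoregressive fit using statsmodels, where `scaled` is the normalized residual series from the previous steps (the AIC-based lag selection is discussed further below):

```python
from statsmodels.tsa.ar_model import AutoReg, ar_select_order

# Pick the lag order by AIC, falling back to one lag if the selection returns none
selection = ar_select_order(scaled, maxlag=12, ic="aic")
lags = selection.ar_lags or [1]

# Fit the autoregressive model on the scaled residuals
ar_fit = AutoReg(scaled, lags=lags).fit()
print(ar_fit.summary())
```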
The second methodology is Difference-in-Differences. The method consists of defining a control population (in our case, the occupations classified as Bottom 10) and one or more populations that receive a “treatment” and might respond to it in different ways (in our case, the Top 10, Top 10 Aut, and Top 10 Aug groups). The “treatment” here is the launch of GPT-4 in March 2023. We chose GPT-4 because it marked a step change in capability, including highly precise data extraction and scores above 70% on several benchmarks. The key assumption of the model is that the control and treatment groups would have followed similar trends had they not received treatment, and that any divergence is due to it.
After defining the treatment, the control group, and the treatment groups, the analysis consists of fitting a model that attempts to capture the effects of the treatment (or lack thereof) and other important variables. Our model is of the form:
net_jobs ~ treated + post + treated_post + C(month):treated + noise
This can be interpreted as: the net change in jobs from one month to the next for a given occupation is a combination of whether the occupation is in the treatment or control group, whether the date is before or after the “treatment” date, and the month of the year, plus some random noise. We also account for interactions between these predictors (e.g., whether the data point refers to an occupation in the treatment group and falls after the treatment date).
The main assumption of this test is the ‘parallel trends assumption’: we need strong reasons to believe that each group, had there been no treatment (i.e., the launch of GPT-4), would have behaved similarly.
We apply this model separately to compare each of the three “Top” groups, each one representing a treatment group, against the Bottom 10 group, which is the control group.
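Here is a hedged sketch of how this specification could be estimated with statsmodels' formula API, assuming a long-format dataframe `df` with one row per Minor Group and month, and 0/1 indicator columns `treated` and `post` (all names are ours, not a fixed dataset):

```python
import statsmodels.formula.api as smf

# Interaction of treatment-group membership and the post-GPT-4 period:
# its coefficient is the difference-in-differences estimate.
df["treated_post"] = df["treated"] * df["post"]

did_fit = smf.ols(
    "net_jobs ~ treated + post + treated_post + C(month):treated",
    data=df,
).fit()
print(did_fit.summary())
```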
As can be seen in Figure 5, our methodology seems to capture a lot of what is going on! However, what meets the eye is not necessarily the whole story. To make sure our model is reliable, we run a couple of statistical checks. First, we use the Akaike Information Criterion (AIC) to decide how many previous months to look at when predicting this month's net jobs, i.e., to choose the lag of the model.
Then, we run the Augmented Dickey-Fuller (ADF) test, which checks whether the data is stable enough over time (stationary) for this kind of model to work well. In simple terms, we use the AIC to choose the best lag, and the ADF test to see whether the model is adequate for the data. If the ADF p-value is too high (usually meaning greater than 0.05), we have strong reasons to believe our model does not fit the data well and does not support our beliefs about the underlying dynamics of net jobs. The table below shows the results of this test.
| Category | Lag | ADF p-value | AIC |
| --- | --- | --- | --- |
| top_10 | 6 | 0.518 | -0.72 |
| bottom_10 | 1 | 0.013 | -22.87 |
| top_10_aug | 6 | 0.021 | -1.61 |
| top_10_aut | 6 | 0.666 | 7.25 |
Table 2: Summary statistics for the Autoregressive models
As a standard practice, if the ADF p-value is below 0.05, we reject the idea that the data is unstable (non-stationary), as is the case for the Top 10 Aug and Bottom 10 categories. Because the values for Top 10 and Top 10 Aut are high, we have strong reasons to believe our model does not capture the main drivers of change in job flows for these two categories.
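For reference, the check itself is a single statsmodels call per series; a sketch, where `residuals_by_group` is a hypothetical dict mapping each category name to its residual series:

```python
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test with AIC-driven lag selection for each category's residuals
for name, resid in residuals_by_group.items():
    adf_stat, p_value, used_lag, n_obs, crit_values, best_ic = adfuller(resid, autolag="AIC")
    print(f"{name}: lag={used_lag}, ADF p-value={p_value:.3f}")
```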
So far, our model does a good job of capturing the main drivers of job flow divergence between the Bottom 10 and Top 10 Aug categories, as can be seen from how similar the series are. As a final check, we examine a cross-correlation plot (below). Its main insights are that (1) the value of Bottom 10 at a given time explains almost 80% of the variance in the value of Top 10 Aug (which is very high!), and (2) the correlation decays with lag, meaning that the time structure of the series is somewhat preserved.
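A sketch of the cross-correlation computation, assuming the two scaled series (hypothetical variable names below) are aligned on the same monthly index:

```python
from statsmodels.tsa.stattools import ccf

# Cross-correlation of Bottom 10 against Top 10 Aug at increasing lags
xcorr = ccf(scaled_bottom_10, scaled_top_10_aug)
print(xcorr[:6])  # correlations at lags 0 through 5
```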
Our final goal is to answer the question: “Are occupations that correlate more with Claude conversations flagged as automating falling faster or growing more slowly than other occupations?”, and now we have the tools to do so! Remember our detrending? We fitted a separate linear trend for each category, so we can now test whether the difference in their rates of change is statistically significant.
Testing whether the trends for Bottom 10 and Top 10 Aug differ
With a model that fits our data for the Bottom 10 and Top 10 Aug categories reasonably well, we can use what we learned to assess whether their trends differ. We do so by fitting a linear regression model that includes a term for the category and its interaction with time. We then plot the de-seasonalized and normalized data with their respective trends:
Figure 6- Step 4: Normalized and de-seasonalized series for the groups
Figure 6 shows the overlap in the 95% confidence bands, which suggests that the trends do not differ. Analyzing the summary statistics of our model, the p-value of the interaction term is well above the significance threshold, at p=0.53.
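The regression behind this check can be sketched as follows: we regress the de-seasonalized series on time, group, and their interaction, and the interaction coefficient captures the difference in slopes (dataframe and column names below are illustrative):

```python
import statsmodels.formula.api as smf

# two_groups is a long-format dataframe with columns: net_jobs_deseason, t (time index), group
trend_fit = smf.ols("net_jobs_deseason ~ t * C(group)", data=two_groups).fit()

# The p-value on the t:C(group) interaction tells us whether the slopes differ
print(trend_fit.summary())
```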
So what?
We have covered a lot of ground, and the results are somewhat mixed, but very interesting nonetheless. The key takeaways from our analyses are:
We ran an Ordinary Least Squares regression on the model described in the Approach section; the result is in the additional notes section.
The numbers that interest us most are the R-squared (0.838), the p-value of the F-statistic (2.37e-26), and the P>|z| column for the treatment (0.000), post (0.213), and interaction (0.603) terms. The high R-squared means that our model captures most of the variance in the data, while the low p-value of the F-statistic indicates that the results are statistically significant. What is interesting, however, is that the p-value for the treatment term is very close to 0, while the p-value for the interaction between post and treatment is well above the traditional p < 0.05 threshold for statistical significance. One way to read these results is that the chosen treatment does not seem to have any effect on the prediction of net jobs; the category itself, however, has significant (p < 0.05) predictive power. That aligns with what we would expect, as we are comparing fundamentally different occupations, such as engineering and computational occupations (Top 10 Aut group) versus building cleaning and pest control workers or animal care workers (Bottom 10). Future work could apply more modern difference-in-differences techniques designed for settings where the parallel trends assumption is broken, as appears to be the case in this analysis.
Potential reasons for our results
For the analyses where the models fit well and the assumptions held, we found no significant difference in net job trends between the categories examined, which does not support our initial hypotheses. However, these results can change, so policymakers need to monitor the labor market and be prepared for disruptions.
Possible reasons for our conclusion, in addition to the limitations of our research, include:
These findings agree with the barriers Rosa and Kubta highlighted based on their surveys: three barriers to AI adoption in Brazil are low awareness of AI in small firms, high implementation costs, and low labor skills.
Limitations
In this section, we discuss several of the limitations and suggest ways to develop our model in the future to have a more fine-grained “microscope” to analyze the econometrics of TAI. Some of these limitations have been discussed by Handa and colleagues.
Due to time constraints, we also relied on a large language model (LLM) to assist with some mappings, but couldn’t fully review the output. Future iterations of our model should draw from more diverse data sources and prioritize transparency in how mappings are produced.
We propose policy recommendations that can mitigate the impact of TAI in Brazil and other developing countries as AI adoption accelerates. These strategies should be implemented by governments and frontier AI companies, both individually and collaboratively. They should
Additional notes
This paper provides the conceptual framework for our study. It introduces a taxonomy of economic tasks based on real-world AI interactions, identifying where AI systems like Claude show strong performance. We use this task classification to assess which occupations in Brazil are most exposed to automation or augmentation.
This dataset operationalizes the task taxonomy from the paper. It assigns quantitative exposure scores to hundreds of economic tasks, based on how well AI models perform them. We link these task scores to occupations through O*NET descriptors, allowing us to compute AI risk metrics for Brazilian jobs.
In establishing the initial CBO to ISCO linkage for our study, we drew upon the foundational work presented in "Job Concordances for Brazil: Mapping the Classificação Brasileira de Ocupações (CBO) to the International Standard Classification of Occupations (ISCO-88)" (cbo2isco.pdf). This paper details a comprehensive concordance between CBO and the 1988 version of ISCO (ISCO-88).