
INSIGHTS

AI and Labor Markets: What We Know and Don’t Know

Dr. Bharat Chandar shares his perspective on the current state of knowledge of artificial intelligence and labor.

by Bharat Chandar
Postdoctoral Fellow

October 14, 2025
19 min read

Bharat Chandar is a postdoctoral fellow at the Stanford Digital Economy Lab. His recent paper with Lab Director Erik Brynjolfsson and Research Scientist Ruyu Chen, “Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence,” provided some of the earliest large-scale evidence consistent with the hypothesis that the AI revolution is beginning to have a significant and disproportionate impact on entry-level workers in the American labor market.

We face enormous uncertainty about how AI will shape labor markets. People have questions about what we have learned, what the future might look like, and what we should do about it. The excellent Jasmine Sun suggested I write a post summarizing the state of knowledge. This is my assessment of what we know and don’t know about AI and labor.1

Some economists believe that studying the labor effects of AI is too crowded a space, that AI differs little from prior technologies, and that there is not much interesting left to say. I disagree. The mismatch between supply and demand for research on this topic is unlike anything I have seen. Hopefully this article convinces at least some people of that. There is so much we don’t know, and many directions for making meaningful, socially beneficial contributions.

  1. The overall impact of AI on aggregate employment is likely small right now
  2. AI may be diminishing hiring for AI-exposed entry-level jobs
  3. We can probably measure AI exposure across jobs better than most people think
  4. We are gathering new evidence on where we should expect the greatest AI progress in the near term
  5. We have very little idea about employment changes in other countries
  6. We could use better data on firm AI adoption
  7. We do not know how employment trends will progress going forward
  8. We do not know which jobs will have growing future demand
  9. We have little evidence on how AI is reshaping the education landscape
  10. We do not know how personalized AI learning will change job retraining
  11. We do not know how workers’ tasks have changed following adoption of AI
  12. AI has unclear effects on matching between employers and job candidates
  13. We need more data on how AI will affect incomes and wealth
  14. We need rigorous modeling of how to deal with economic disruptions, informed by the above data

1. The overall impact of AI on aggregate employment is likely small right now

This is consistent with a range of papers. Chandar (2025), Gimbel et al. (2025), Eckhardt and Goldschlag (2025), and Dominski and Lee (2025) all use the Current Population Survey to show at most small changes in hiring in AI-exposed jobs.2 “Canaries in the Coal Mine” finds quite concentrated employment declines among AI-exposed 22-25 y/o workers, with continued employment growth for most other workers. Humlum and Vestergaard (2025) likewise find small overall effects in Danish data through 2024. Together the evidence suggests overall hiring has not declined meaningfully due to AI.

Path forward: Building trackers using the CPS, ADP, Revelio, and other data sets is a good start. Improving causal estimates will help as well. See point 2.
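
To make the tracker idea concrete, here is a minimal sketch of how one might follow employment by AI exposure in the monthly CPS. The file names, the exposure crosswalk, and the IPUMS-style variable names (OCC2010, EMPSTAT, WTFINL) are illustrative assumptions, not taken from any of the papers above.

    import pandas as pd

    # Minimal sketch: weighted employment by AI-exposure group in the monthly CPS.
    # Assumes an IPUMS-CPS extract (cps.csv) and a hypothetical crosswalk
    # (exposure.csv) mapping OCC2010 codes to a 0-1 AI exposure score.
    cps = pd.read_csv("cps.csv")            # YEAR, MONTH, OCC2010, EMPSTAT, WTFINL
    exposure = pd.read_csv("exposure.csv")  # OCC2010, ai_exposure

    df = cps.merge(exposure, on="OCC2010", how="inner")
    df["employed"] = df["EMPSTAT"].isin([10, 12])   # at work / has job, not at work
    df["high_exposure"] = df["ai_exposure"] >= df["ai_exposure"].median()

    trend = (
        df[df["employed"]]
        .groupby(["YEAR", "MONTH", "high_exposure"])["WTFINL"]
        .sum()
        .unstack("high_exposure")
        .sort_index()
    )
    # Index both series to the first month in the sample to compare growth paths.
    print(trend / trend.iloc[0])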


2. AI may be diminishing hiring for AI-exposed entry-level jobs

This was a key highlight from our “Canaries in the Coal Mine” paper. We found declines in employment concentrated among 22-25 year-old workers in AI-exposed jobs such as software development, customer service, and clerical work. 

How strong is this evidence? As with most things, the best way to think about this is as a Bayesian. Our results shifted my own beliefs about the likelihood that AI was responsible for some meaningful share of the slowdown in entry-level hiring. The extent to which your beliefs align with mine depends on how strong your priors are about AI’s labor impact and how credible you think the recent evidence is.

Before our “Canaries” paper, my inclination was that AI was having at most a minimal impact on the labor market. Primarily, this is because I looked at the CPS data and did not find much evidence of aggregate labor impacts, as discussed in point 1. I put added trust in evidence I produce myself because I know what goes into the sausage, but it certainly helped to see corroboration in other studies. 

That said, most of the discourse around AI’s impacts on jobs focused on young workers. Evidence for that group was at best weak. The CPS has small sample sizes when filtering to specific age-occupation groups. O’Brien (2025) suggests the level of statistical uncertainty in even the ACS, a much larger data set, is quite large. 

Outside of the research community people shared differing views. Numerous news articles claimed disruption to entry-level labor markets using a combination of qualitative interviews and quite speculative analysis of broad labor market data, while others pushed back citing evidence from datasets like the CPS. A report from VC firm SignalFire suggested a startling slowdown in entry-level hiring in the tech sector, with the impact highly concentrated on the youngest workers. 

All this made me believe the overall economy-wide impact was small, but with a large amount of uncertainty about young workers in particular. That was my prior when we began looking at the ADP data.

I still believe the overall impact is small, but I have updated my views about the impact on young workers. I now believe that AI may have contributed a meaningful amount to the overall slowdown in hiring for entry-level workers. 

This is again a case where I put faith in the work we produced. We did not begin the project with any agenda about which direction the results should go. 

We started by showing entry-level employment declines in some case studies such as software development and customer service. Then we found the results held more generally across highly AI-exposed occupations but not in low-exposure occupations. We then found that the same patterns held under an LLM-usage-based exposure measure from Anthropic. Strikingly, we then found the same result for occupations where AI usage is predominantly automative (AI performing tasks directly) but not for those where it is predominantly augmentative (AI assisting workers). We took this as a potential indication that the results might really be driven by AI and not other changes in the economy.

We proceeded by listing out the most plausible alternatives we could think of that could be driving the patterns. We found very similar results when excluding the tech sector, computer jobs, and jobs that could be worked remotely. These tests indicate that our findings are not primarily driven by tech overhiring, the reversal of work from home, or outsourcing. We then added firm-time fixed effects to absorb any economic shocks that affected overall hiring at a firm, and the results again looked similar: within firms, entry-level hiring in AI-exposed jobs declined 13% relative to less-exposed jobs. These impacts appeared only after the proliferation of LLMs. Impacts on older workers were statistically insignificant. To the extent we believe that interest rate changes or other aggregate economic changes affect hiring of all workers at a firm, not just specific occupations correlated with AI exposure, this rules out a variety of alternative explanations.
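
To make the firm-time comparison concrete, a stylized version of this kind of within-firm specification (my notation, not necessarily the paper's exact model) is:

    \log E_{o,f,t} = \beta \,(\mathrm{HighExposure}_{o} \times \mathrm{Post}_{t}) + \gamma_{f,t} + \delta_{o} + \varepsilon_{o,f,t}

where E_{o,f,t} is entry-level headcount or hiring in occupation o at firm f in period t; the firm-by-time fixed effects gamma_{f,t} absorb any shock that moves overall hiring at the firm; the occupation fixed effects delta_o absorb stable differences across occupations; and beta < 0 corresponds to the reported relative decline in exposed occupations after the proliferation of LLMs.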

We then compared college graduates to non-college workers. For people without a college degree, we continued to find a striking divergence between more- and less-exposed jobs, even at higher age groups. This suggested our results could not be explained away by Covid-era education disruptions. We also showed robustness to alternative ways of building the sample, to including part-time and temporary workers, to extending the sample to 2018, and to computing the results separately for men and women.3 

So, how strong is the evidence? Ultimately, our paper is observational. We do not have an experiment varying AI adoption. If you believe our various alternative analyses offer compelling evidence of a causal impact of AI, then you should update your beliefs more. If you need more evidence to convince you of a causal impact, then you should update less accordingly. 

Since our paper came out, two other articles showed similar results using large-scale data from Revelio Labs. Hosseini and Lichtinger (2025) find employment declines for young workers in US firms that adopt AI technologies, with adoption measured by the content of their job postings. Klein Teeselink (2025) shows that exposed firms and occupations have seen significant contraction in entry-level hiring in the UK. 

On the other hand, the newest version of Humlum and Vestergaard (2025), released in late September 2025, finds no difference in entry-level hiring between firms that adopt AI and those that do not in Denmark. They measure firm adoption via worker surveys on AI encouragement at work, which they match to administrative firm data. An interesting question is how to square these results with Hosseini and Lichtinger (2025), who find large differences in hiring between adopting and non-adopting firms after they start using AI but also note some differences even before AI adoption. It would be worthwhile to test whether the varying results between these papers stem from (1) institutional differences between the US and Denmark, (2) differences in the firm-level exposure measures, or (3) differences in the specific statistical analyses. We should hope to see more work using high-quality employment and adoption data in the coming months to help sort through the various findings.

Path forward: There are two primary sources of uncertainty. 

The first is whether existing studies measure a causal impact. One way to bolster evidence is to collect better data on when individual firms adopt AI (see more in point 6) to track employment changes before and after at the firm level, hopefully improving upon the measures in Humlum and Vestergaard (2025), Hosseini and Lichtinger (2025), and other work. Even better would be to find some kind of experiment in firm-level AI adoption. An example would be an A/B test at an AI company that randomly offered discounts on subscriptions to different firms. Ideally the experiment would have started in the early days of AI and run for months, if not years.

The second source of uncertainty is how representative the data sets in existing studies are of the broader economy. In my view, this is not as important as the first issue. If we have indeed estimated causal effects, then we’ve still identified a widespread phenomenon affecting tens of thousands of firms and millions of workers. That’s big and important in and of itself. It seems unlikely to me that expanding to the broader economy would lead to meaningfully different conclusions. That said, we should have high-quality public data available for everyone to look at, not just those of us with industry partnerships. That’s why I signed this letter to the Department of Labor.

A third area for further study is how to integrate the large literature measuring productivity improvements in individual workflows with broader employment impacts. The chart below, from an Economist article several months back, summarizes estimates of how AI affects performance inequality at work.4 Future work should model the implications of these heterogeneous impacts for occupational employment and wage changes across different subpopulations, such as entry-level and experienced workers. Autor and Thompson (2025) makes a seminal contribution on this issue.

[Chart from The Economist, “How AI will divide the best from the rest”: estimates of the impact of generative AI on the gap between high- and low-performing workers.]


3. We can probably measure AI exposure across jobs better than most people think

A common complaint is that existing measures of AI exposure are poor because they have not been validated against real-world economic outcomes. To the extent you believe the recent literature, this is not as true as it used to be. First, Tomlinson et al. (2025) found that AI exposure measures from Eloundou et al. (2024) are highly correlated with Microsoft Copilot usage in 2025. This suggests their predictions were largely borne out, with the caveat that associating chatbot conversations with occupations can be challenging. Using LLMs to predict which job a conversation relates to has become nearly standard practice.

Second, the recent papers finding these AI exposure measures predict employment changes offer further real-world validation. Imagine the exposure measures were completely worthless, essentially randomly assigning different jobs high or low exposure. In that case we should expect the exposure measures wouldn’t predict anything about employment. To the extent they do predict employment changes, they receive some degree of validation, with stronger causal evidence offering greater validation.
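
As a concrete illustration of this logic, one could correlate an occupation-level exposure score with subsequent employment growth. The file names and the 2022-2025 window below are hypothetical.

    import pandas as pd

    # Illustrative validation check: does a 0-1 exposure score predict
    # occupation-level employment growth after LLMs proliferated?
    occ = pd.read_csv("occ_employment.csv")   # occ_code, emp_2022, emp_2025
    exposure = pd.read_csv("exposure.csv")    # occ_code, ai_exposure

    df = occ.merge(exposure, on="occ_code")
    df["growth"] = df["emp_2025"] / df["emp_2022"] - 1

    # If exposure were pure noise, this correlation should be near zero.
    # A reliable relationship is one (weak) form of real-world validation;
    # causal designs provide stronger validation.
    print(df[["ai_exposure", "growth"]].corr())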

Path forward: We should continue to validate the exposure measures we have based on real economic outcomes like employment or LLM usage. It would be great to get actual large-scale data from AI labs on usage by occupation, perhaps via survey rather than relying on predictions based on conversations.5 We should develop and validate new measures of exposure that reflect evolving AI capabilities. Finally, while existing work explores exposure by occupation, an important direction for research is to further measure differences by seniority or demographic characteristics to identify precise subpopulations facing potential job risk. A step in this direction is Manning and Aguirre (2025), which explores exposure by economic vulnerability.


4. We are gathering new evidence on where we should expect the greatest AI progress in the near term

In which dimensions should we expect AI to improve most rapidly? In which dimensions will progress be slower? 

Nascent work has been making strides in answering these questions. Ongoing work by Erik Brynjolfsson, Basil Halperin, and Arjun Ramani, and by Rishi Bommasani and his coauthor team, tracks LLM improvement along different dimensions of economically-relevant intelligence. GDPval from OpenAI is another excellent development, along with Apex from Mercor. Developing economically-relevant evals is an exciting and active research area. Making predictions about AI progress along different dimensions of intelligence will help with creating better measures of future occupational exposure.

Another perspective is that while existing measures are probably better than most people think (see point 3), they may also have lots of room for improvement. Developing economically-relevant evaluations, and directly assessing LLM capabilities for these measures, can lead to even sharper estimates of present occupational exposure than the existing approaches.

Path forward: Continued research in this space, identifying areas of rapid and slow progress. A next step would be to associate these rates of improvement with occupational task information. Which occupations face greater AI risk as the models improve? GDPval is a quality step towards understanding this. Another good idea is to solicit predictions, perhaps via markets, about future disruption by occupation.6

We should also assess how future model capabilities depend on choices in the development process. Rather than building systems optimized to mimic and replace humans (see the Turing Trap), could we instead develop “Centaur Evaluations” designed to maximize joint human-AI performance? How much does the choice of evaluations affect future model capabilities? Can these choices alter the consequences for human work?


5. We have very little idea about employment changes in other countries

The two studies that speak to this are Klein Teeselink (2025) in the UK and Humlum and Vestergaard (2025) in Denmark. I am aware of no evidence from any other part of the world. 

The most relevant data we have are recent estimates from Anthropic and OpenAI about usage in different countries. Older work develops country-specific occupational exposure measures (Gmyrek 2025). 

Path forward: More research should be done on other labor markets. Three promising avenues are to use Revelio or ADP in other countries, if feasible; use other private payroll data from other countries; or use government administrative data to track employment changes. Some infrastructure likely needs to be built out to measure AI exposure for local occupations. 

A particular area of focus should be countries with high levels of employment in exposed jobs such as call center operations. Further modeling can also help with predicting how impacts may vary across different institutional contexts.


6. We could use better data on firm AI adoption

The issue is both conceptual and empirical. First, what does it mean for a firm or worker to “adopt” AI? They use it once? Every day? 1% of the company uses it? 10%? They use it for “production of goods and services,” or for back office tasks? They use the free version, or the subscription version? They use the older models or the newer models? Chatterji et al. (2025) find that a large share of consumer ChatGPT use seems work-related, which makes this even more complicated.

The main challenge empirically is that we have little data on adoption, even setting aside these conceptual questions. Bonney et al. (2024) from the US Census finds an AI adoption number close to 10%, but they have only a 16% response rate and ask a potentially narrow question: “Between MMM DD – MMM DD, did this business use Artificial Intelligence (AI) in producing goods or services? (Examples of AI: machine learning, natural language processing, virtual agents, voice recognition, etc.).” Why only ask about usage for “production of goods and services” as opposed to back office tasks or internal processes? Would Walmart, for example, respond yes to this question? See also Hyman et al. (2025) and this review article by Crane et al. (2025) from February.

Another source is the Ramp AI Index. They find an adoption number closer to 44% based on business subscriptions to AI services. While Ramp firms may have a higher proclivity to adopt technology, the company also sees a very low share of Gemini usage and misses the consumer app usage for work that Chatterji et al. (2025) document. Chatterji et al. (2025), Hartley (2025), and Bick et al. (2025) show consumer-level adoption but do not have as much to say about firms. Hampole et al. (2025) and Hosseini and Lichtinger (2025) measure adoption by whether AI appears in job postings and descriptions, but they likely miss a meaningful share of firm usage. Humlum and Vestergaard (2025) use worker survey data in Denmark that they match to firms.

Path forward: Ideally we would have some sort of continuous index of AI adoption, with differences in “how much” firms or workers have adopted AI. One option is to measure token counts, as suggested by Seed AI. Business spend data seems promising as well. Another option is the number of unique users or the number of conversations. We should encourage AI companies to share data on this to the extent feasible. Business surveys should also explore alternative questions and test how sensitive reported adoption rates are to the specific wording.
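
As one sketch of what a continuous index could look like, the snippet below combines hypothetical firm-level usage measures (tokens, spend, unique users) into a single standardized score. The data source, column names, and equal weighting are all assumptions.

    import pandas as pd

    # Hypothetical monthly firm panel with tokens consumed, AI spend,
    # unique AI users, and total headcount.
    usage = pd.read_csv("firm_usage.csv")   # firm_id, month, tokens, ai_spend, unique_ai_users, headcount
    usage["tokens_per_worker"] = usage["tokens"] / usage["headcount"]
    usage["spend_per_worker"] = usage["ai_spend"] / usage["headcount"]
    usage["user_share"] = usage["unique_ai_users"] / usage["headcount"]

    components = ["tokens_per_worker", "spend_per_worker", "user_share"]
    for col in components:
        usage[col + "_z"] = (usage[col] - usage[col].mean()) / usage[col].std()

    # Equal weights are arbitrary; any weighting should be validated against
    # outcomes such as productivity or employment changes.
    usage["adoption_index"] = usage[[c + "_z" for c in components]].mean(axis=1)
    print(usage[["firm_id", "month", "adoption_index"]].head())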


7. We do not know how employment trends will progress going forward

Prior technologies decreased labor demand in some occupations. Simultaneously they increased labor demand in other occupations and created new forms of work (Acemoglu and Autor 2011, Autor et al. 2024). As a result, employment has remained fairly stable for decades, with real wages rising along with productivity. These forces help explain why John Maynard Keynes was wrong in his prediction that by now we would be working 15 hours a week.

Will AI likewise lead to rising real wages without compromising employment? Or will AI capabilities advance far and fast enough that even new work is better performed by machines instead of humans? We do not yet know if this time will be different. 

There is an extraordinary amount of disagreement on this issue. Many employees at the AI labs believe with a high degree of conviction that AI will replace a large amount of work in the next several years. On the other hand, some economists believe AI’s labor market impacts will be similar to prior technologies. 

The computer and Internet revolutions likely account for the predominant share of rising inequality over the past several decades due to skill-biased technical change. Even if AI is like prior technologies it may have profound effects on society.

Path forward: We should build trackers to continue to follow employment trends by AI exposure and demographic variables like age. We should also develop credible predictions about future labor impacts; see point 4.



8. We do not know which jobs will have growing future demand

Suppose we want to design job retraining programs for workers displaced by AI. Which jobs should we direct them towards? While it may seem early to have these discussions, this is an important consideration for scenario planning.

This is both a conceptual question and an empirical question. Conceptually, should both receptionists and software developers be retrained as nurses, for example? Or should they be retrained in different professions given differences in their prior training? How do we account for geographic differences in labor supply and demand? See also point 10.

Empirically, which jobs seem supply constrained? Why? Because of regulation? Expertise? Distaste? Slow employment transitions? Of the supply constrained jobs, which ones should we expect to be safe from AI for the medium to long term? See also point 4. 

Prior work already makes a lot of progress on these issues. See evidence from the Cleveland Fed or LinkedIn’s Career Explorer for just two recent examples. The main area for progress is to connect existing and future analyses to potential impacts from AI.

Path forward: The empirical questions are actually quite easy to make progress on using basic data like the CPS. The conceptual questions likely require some more serious economic modeling. Future scenario planning should account for potential AI impacts. 


9. We have little evidence on how AI is reshaping the education landscape

For over a century school has served as the starting point for developing the skills needed to enter the workforce. How is AI changing the way students prepare for their personal and professional futures?

So far we know a lot of students and teachers are using AI. We also know school policies have not kept pace with adoption.7 Anthropic reports that about half of student usage in higher education is direct—that is, seeking answers or content with minimal engagement.8 The other half is classified as collaborative. Both Anthropic and Gallup find that teachers seem to be using AI for tasks such as curriculum development, research, and grading.

There are still many open questions. Are students making different education choices? Are they choosing different majors? Are they aiming for different careers? Are they learning as much in school? This survey of college students points to some early evidence of changing career choices, but it is quite old now.  

Is AI increasing gaps in test scores? Or decreasing them? How does AI impact the best students vs middling and struggling students? How does it impact the best schools vs middling and struggling schools? 

How is AI affecting curricula? How is it affecting the form of assessment? Is it increasing or decreasing differences in teacher effectiveness?

Path forward: We need to systematically collect data on all of these questions and much more. It seems like there is some early ongoing research on the topic, but way more should be done here.


10. We do not know how personalized AI learning will change job retraining

Is personalized AI learning effective? What’s the right way to design it? Can it be used at the K-12 level (see Alpha School)? The university level? What about for job retraining or upskilling? 

AI may speed up the rate of job obsolescence, but it may also speed up the rate of skill retraining or reduce barriers to entering a new profession in the first place.

Hyman et al. (2025) provides some valuable evidence that job retraining can help workers adjust to technological employment disruptions. What these programs look like is likely to change quite a bit in the future.9

Path forward: Schools and companies should experiment with these technologies in a rigorous and scientific way.


11. We do not know how workers’ tasks have changed following adoption of AI

Recent research uses work tasks from O*NET to measure job exposure or predict productivity changes. However, we do not have much empirical evidence on how AI is changing the way workers spend their time at work. How do we empirically measure a worker’s tasks? Which tasks are they spending more time on? Less time? Are some tasks disappearing? Are other ones being created? Are workers expanding their scope, performing tasks they could not have otherwise done?10

Answering these questions will help with creating new and more accurate measures of tasks that workers perform and how they evolve over time.

Path forward: I have seen three promising avenues for studying this. The first uses job posting information to construct tasks for different roles. The second uses AI agent interviewers that call people and collect data on how they spend their time (Shao et al. 2025). The third collects company performance review data in which employees report their time allocation across projects and tasks.


12. AI has unclear effects on matching between employers and job candidates

On the one hand, van Inwegen et al. (2025) show that algorithmic writing assistance for resumes causally increases hiring and wages for prospective employees. Aka et al. (2025) find that AI-assisted interviewing leads to much better candidate selection.

On the other hand, Wiles and Horton (2025) highlight that choices that maximize private benefit in this setting may or may not benefit society more broadly. AI tools may help employers create more postings and help job candidates apply to more jobs at lower cost. These reduced frictions may increase the quality of matches between job seekers and employers, or, perversely, they may reduce match quality by diluting labor market signals. If it costs nothing to post a job you do not actually want to hire for, and if all resumes start to look the same, it can in principle become harder to form good matches. Cui et al. (2025) indeed find that AI usage improves cover letters, but that cover letters subsequently become less informative signals of worker ability. Employers correspondingly shift toward alternative signals such as past reviews when evaluating candidates.

 Path forward: It would be useful to come up with empirical measures to track the health of the matching process in the broader labor market. The standard Diamond-Mortensen-Pissarides model is a good place to start for ideas. 
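
As a starting point, one simple health metric from that framework is matching efficiency: the residual of a Cobb-Douglas matching function fit to hires, unemployment, and vacancies. The sketch below is illustrative; the data file, column names, and the elasticity value are assumptions rather than estimates.

    import pandas as pd

    # Stylized DMP-style matching function: hires = A * u^alpha * v^(1-alpha).
    # Tracking the residual A over time is one way to monitor matching health.
    df = pd.read_csv("labor_flows.csv")   # month, hires, unemployed, vacancies
    alpha = 0.5                           # illustrative matching elasticity

    df["tightness"] = df["vacancies"] / df["unemployed"]
    df["match_efficiency"] = df["hires"] / (
        df["unemployed"] ** alpha * df["vacancies"] ** (1 - alpha)
    )
    # A sustained fall in match_efficiency at a given tightness would be
    # consistent with noisier signals or new frictions in matching.
    print(df[["month", "tightness", "match_efficiency"]].tail())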


13. We need more data on how AI will affect incomes and wealth

There is a lot of uncertainty about how AI will affect economic inequality. The best paper I have seen on this topic is Rockall et al. (2025). Their Figure 2 shows income for workers in the UK by source.

High earners tend to have higher occupational AI exposure on average, but they also receive a lower share of their income in wages. Capital gains may thus offset their wage losses. On the other hand, lower earners tend to have lower occupational exposure on average, but a higher share of their income comes from wages. They may consequently get lower returns from rising company valuations. They also receive a higher share of their income in government benefits, which may increase with tax receipts if the economy grows.
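
A purely illustrative calculation (my numbers, not Rockall et al.'s) shows how income composition can offset wage exposure:

    # Illustrative only: income shares and shocks are made-up numbers.
    high_earner = {"wage_share": 0.6, "capital_share": 0.4}
    low_earner = {"wage_share": 0.9, "capital_share": 0.1}

    wage_shock_high, wage_shock_low = -0.10, -0.02  # assumed AI-driven wage changes
    capital_gain = 0.20                             # assumed rise in capital income

    def total_income_change(person, wage_shock):
        return person["wage_share"] * wage_shock + person["capital_share"] * capital_gain

    print(total_income_change(high_earner, wage_shock_high))  # -0.06 + 0.08 = +0.02
    print(total_income_change(low_earner, wage_shock_low))    # -0.018 + 0.02 = +0.002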

Path forward: These measures should be expanded and tracked over time, especially in a broader set of countries, to measure how potential gains from AI will be distributed. Following Nordhaus (2021), we should also track changes in the labor share of income, especially at leading AI firms.11


14. We need rigorous modeling of how to deal with economic disruptions, informed by the above data

In my view, the AI community has many theories of adaptation to AI disruption that are unmoored from data or careful statements of assumptions. Chad Jones, one of the greatest economists ever at precisely this sort of work, is once again making important contributions to improve this. Anton Korinek, Phillip Trammell, and others are also making key contributions on the growth side. 

Much more research could be done here. Ide and Talamas (2025), Ide (2025), and Garicano and Rayo (2025) advance our understanding of impacts on entry-level versus senior workers. Hampole et al. (2025) develop theoretically-founded AI exposure measures. Agrawal et al. (2025) model the short and long-run allocation of humans across work under highly intelligent AI. Autor and Thompson (2025) makes a seminal contribution to modeling and empirically measuring how technology affects employment and wages for different workers. It’s one of my favorite papers of the past few years.

We need more theoretical work that allows us to simulate the labor market impacts of policy changes, especially in settings with potentially transformative AI. In technical terms, we need structural models of labor market impacts of AI that can be empirically calibrated using the sort of information in this doc and that can be used to simulate the consequences of counterfactual policies. 

Path forward: Develop these sorts of models and use them to consider real policy choices. This is a totally open field for research.




Bharat received his PhD in economics from Stanford GSB. He studies labor economics and technology.

His substack can be read here.


  1. For a broader research agenda on the economics of transformative AI, see Brynjolfsson et al. (2025).
  2. Johnston and Makridis (2025) find AI-exposed industries have seen employment gains. Industry-level labor market changes may be distinct from the occupation-level changes if firms make capital investments or become more productive in ways that increase overall labor demand. See also Hampole et al. (2025).
  3. Thanks to Omeed Maghzian, Brad Ross, and many others for helpful suggestions on robustness checks.
  4. Note that Toner-Rodgers (2024) has since been discredited.
  5. Chatterji et al. (2025) makes a useful contribution in this regard, but only shows results for five very broad occupational groupings. An important question is how to balance user privacy with collecting data on usage. The Anthropic Economic Index provides rich detail on how conversations relate to different work tasks, but these relationships are inferred by LLMs. My view is the Anthropic Economic Index is very useful to researchers, even if it is not perfect.
  6. Thanks to Tom Cunningham for the suggestion.
  7. Wong et al. (2025) suggests this is the case in Asian universities as well.
  8. Lee et al. (2024) found that ChatGPT did not lead to a meaningful increase in self-reported cheating, though their data is from early- to mid-2023.
  9. Mollick et al. (2024) is an early study on this topic.
  10. Vendraminelli et al. (2025) is an interesting research contribution, showing examples where workers can and cannot expand the scope of their work using AI.
  11. Credit to Phillip Trammell for the suggestion to track the labor share at AI firms. If the labor share goes to 0 at frontier labs, that may be an indication of recursive self-improvement.

INSIGHTS

Q&A | Does artificial intelligence pose a threat to financial stability?

by Matty Smith
Communications

October 10, 2025
7 min read

More and more, people are turning to artificial intelligence for investment advice. It’s even been predicted that generative AI could be the leading source of financial advice for retail investors as soon as 2027. A new working paper, ‘Ex Machina: Financial Stability in the Age of Artificial Intelligence,’ takes a look at how different AI agents impact financial stability when asked to manage mutual fund assets.

Co-author and Lab Research Scientist Sophia Kazinnik spoke with us about the paper’s findings.

What did you set out to do with your research?

In this study, we look at how different types of artificial intelligence agents behave in a mutual fund redemption game, where each investor must decide whether to redeem early for a certain but smaller payoff, or stay invested and receive a potentially higher return later. The catch is that the value of staying depends not just on the underlying economic fundamentals, but also on how many other investors choose to redeem. When more people redeem early, the fund has to liquidate assets at a cost, reducing the return for those who stay. So, this is a classic coordination problem with strategic complementarities: what you do depends on what you expect others to do.
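
To make the structure concrete, here is a stylized version of that trade-off in code; the payoff numbers and liquidation cost are illustrative and not the paper's parameterization.

    import numpy as np

    # Early redeemers receive a certain payoff; stayers receive a return that
    # falls with the share of other investors who redeem, because the fund
    # liquidates assets at a cost.
    def stay_payoff(fundamental, redeem_share, liquidation_cost=0.5):
        return fundamental * (1.0 - liquidation_cost * redeem_share)

    early_payoff = 1.0
    fundamental = 1.2   # "good" fundamentals: staying should win if few others redeem

    for redeem_share in np.linspace(0.0, 1.0, 5):
        print(redeem_share, stay_payoff(fundamental, redeem_share), early_payoff)
    # With good fundamentals, staying beats redeeming only while enough others
    # also stay; that is the strategic complementarity that makes runs possible.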

How did you go about testing the AI?

To test how AI behaves in such environments, we replace human investors with two kinds of AI. Q-learning agents learn through experience, trying out different actions and updating their strategy based on what pays off over many simulations. LLM agents are given a written description of the environment and use logical reasoning to make a decision in each round.

We then observe how these agents behave under different conditions (e.g., when the fundamentals are known vs. uncertain, or when the final payoffs are risky vs. safe) and compare their decisions to what economic theory predicts.

The goal is to see how AI design shapes behavior in systems where beliefs and coordination matter, and what that means for financial stability.
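
For readers unfamiliar with Q-learning, the sketch below shows the kind of trial-and-error value updating the first type of agent relies on (a bandit-style simplification, since the redemption decision here has no future state). The payoffs are stubbed placeholders, not the paper's simulation.

    import numpy as np

    REDEEM, STAY = 0, 1
    q = np.zeros(2)            # learned value of each action
    lr, epsilon = 0.1, 0.1     # learning rate and exploration rate (assumed values)

    rng = np.random.default_rng(0)
    for episode in range(10_000):
        # Epsilon-greedy choice between the two actions.
        action = rng.integers(2) if rng.random() < epsilon else int(q.argmax())
        # Placeholder rewards; in the paper's setting these would come from the
        # redemption game simulated jointly with the other agents.
        reward = 1.0 if action == REDEEM else float(rng.choice([0.0, 2.4]))
        # Tabular value update toward the realized payoff.
        q[action] += lr * (reward - q[action])

    print(q)  # learned values for REDEEM and STAY under these stub payoffs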

How would you define financial stability?

In this study, financial stability means that investors act based on actual economic conditions, not out of fear or panic. When things are truly bad, pulling your money out early makes sense. But when the economy is still strong, early redemptions just create unnecessary stress on the system.

So we look at how many investors pull out too early, even when the situation doesn’t call for it. We call this “fragility.” The more people redeem when they shouldn’t, the more fragile the system is, and the less financially stable it is. In other words, stability here means investors don’t run unless they have a good reason to.

How did the two types of AI differ? 

The first type of AI, Q-learning, learns by trial and error. It tends to overreact and pull money out early, especially when the situation is uncertain. That makes the system more fragile and more likely to suffer from large, early withdrawals, even when that’s not the “right” thing to do.

The second type, LLM agents, reads the rules and reasons through what to do. It usually makes more accurate, theory-aligned choices. But because each LLM agent reasons on its own and might expect different things from the others, they don’t always move together. Some choose to redeem, others don’t (even in the same situation). That leads to more mixed or uneven outcomes, rather than clear group behavior like we see with Q-learning.

What surprised us the most is that Q-learning broke down under uncertainty (i.e., situations where agents don’t have perfect information about key variables that influence their decision), even though the math said it shouldn’t matter. The LLMs handled it fine. That shows the AI’s internal design, and not only the economic conditions, can create or prevent financial instability.

We expected AI design to make a difference, but the strong early-exit bias in Q-learning and the lack of coordination among LLMs were bigger and more revealing than we expected.

“The type of AI you use really matters. Just changing the AI agent can lead to completely different outcomes. That’s a new form of model risk, the risk that your system behaves poorly not because of bad inputs, but because of the AI design itself.”

SOPHIA KAZINNIK
Research Scientist, Stanford Digital Economy Lab

So in the context of this experiment, do you personally feel one type of AI is “better” than the other?

I think it depends on what you’re after. If the main goal is financial stability, then LLMs are the safer choice. They make decisions that stay closer to what economic theory recommends, and they don’t overreact. They also show a clean, logical pattern: fragility increases smoothly as underlying conditions worsen (like when assets get harder to sell).

But if you care more about predictable, coordinated group behavior, then Q-learning might seem more appealing. These agents tend to move together and settle on clear-cut actions. The problem is, they often coordinate on the wrong thing (like all pulling out early) just because that’s what their learning has reinforced.

So, it’s a trade-off: LLMs give you more stability, but less predictability. Q-learning gives you more coordination, but also more risk.

Does the paper provide guidance on how to design AI systems for this purpose?

We do show that how you design your AI system matters, and provide some high-level guidance. For LLMs, their decisions are more stable when they’re given clear, consistent information. If they get vague or conflicting inputs, they form different expectations, making their behavior harder to predict. So, if firms or regulators want more reliable behavior from LLMs, they should make sure the AI has good, precise information to work with.

For Q-learning agents, the problem is that they learn the wrong lesson when outcomes are sometimes zero. For example, if “staying” in the fund occasionally leads to no return, the AI may wrongly learn that staying is a bad choice, even if it isn’t overall. To fix this, you can adjust how the AI learns, so it better reflects the full range of possible outcomes.

We also highlight that humans should stay involved. A human advisor who understands both finance and AI can help guide choices, reduce confusion, and make sure these systems work in safer, more predictable ways.

Where are you excited to go with this research next?

Frankly, I feel like a kid in a candy store these days. There are so many interesting directions one could go with this.

One direction is to explore what happens when different kinds of AI are mixed together: some learning from experience (like Q-learners), some reasoning through problems (like LLMs), and maybe even some that act more like humans. We could also look at connected funds, to see how problems in one part of the system might spread to others.

Another goal is to test how policy tools (like redemption fees, gates, or penalties) might work differently depending on the kind of AI involved. Do LLMs and Q-learners react the same way to a swing-pricing rule? Probably not.

And lastly, in the paper, we highlight the need to go beyond designing smart individual AIs. Even if each agent is “well-behaved” on its own, groups of AIs can still produce bad collective outcomes. So, future work could focus on this idea of multi-agent alignment: making sure AIs not only act wisely alone but also interact safely when working in large systems.

“In my personal opinion, regulators and institutions need to catch up. AI isn’t just another tool, it’s shaping decisions and outcomes in ways that weren’t possible before. That means we need updated stress tests, better ways to measure how AIs behave in practice, and new policies that account for both financial knowledge and technological understanding among users.”

SOPHIA KAZINNIK
Research Scientist, Stanford Digital Economy Lab

Would you feel comfortable with your investments being handled by AI agents?

I’d say yes, but with caution (and conditions). I’d happily hand things over to a reasoning-based AI agent, as long as it doesn’t share my irrational love for shoes.

More seriously, I’d feel comfortable letting AI manage my investments, as long as a few guardrails are in place. I’d want the AI to be clear about how it’s making decisions, not just spit out recommendations I don’t understand. I’d also want it to stay calm in the face of uncertainty: no panicked sell-offs just because the market had a bad day. And I’d definitely feel better knowing a human could step in if the AI started making weird choices, like treating a minor dip as the end of the world. In short, I’d trust but continually verify.

Do you agree with the prediction that by 2027, generative AI will be the leading source of financial advice for retail investors?

That’s a likely direction: many industry reports predict it, and the tech is advancing quickly. But in the paper we make the following point: AI may lead the way, but humans still matter.

Given concerns around trust, liability, and stability, the future will probably look “bionic,” a combination of AI tools and human oversight, rather than fully automated systems calling the shots on their own. So yes, AI advice may become the most common, but the safest and most realistic path is AI and human collaboration, not AI-only.

***

Read the paper here.

Sophia Kazinnik is a research scientist at the Stanford Digital Economy Lab, where she explores the intersection of artificial intelligence and economics. Prior to joining Stanford, Sophia worked as an economist and quantitative analyst at the Federal Reserve Bank of Richmond, where she was part of the Quantitative Supervision and Research group. While there, she contributed to supervisory projects targeting cyber and operational risks and developed NLP tools for supervisory purposes.

