February 17, 2025
8 min read
by José Ramón Enríquez, David Levine and David Nguyen
“Democratizing Statistics: How AI can Empower Everyone (without Causing Disasters)” is an opinion piece on how large language models are making complex data analysis available to everyone, and the risks involved. A video interview discussing the piece with authors José Ramón Enríquez, postdoctoral fellow at Stanford University, David Levine, professor at the Haas School of Business at the University of California, Berkeley, and David Nguyen, research scientist at the Stanford Digital Economy Lab, can be viewed here.
Imagine a world where you don’t need an advanced degree in statistics to uncover meaningful insights from data. Once the realm of statisticians, Large Language Models (LLMs) such as ChatGPT are making complex data analysis accessible to anyone with curiosity and a computer. But can LLMs help users navigate the complexities of data analysis safely? Or will LLMs help users create data-driven “hallucinations” (or, at least, misleading analyses), counterparts to the textual hallucinations that LLMs already create?
LLMs are lowering the barriers to data analysis for everyday users in fields from business and healthcare to education and civil society. These models enable users to apply statistical analysis to answer questions and make informed decisions. For example, a nurse might use AI to analyze patient outcomes without waiting for an analyst. A teacher can assess the effectiveness of a lesson plan by visualizing student performance data. Job-hunters can run A/B tests to see which version of their resume gets the most responses.
More generally, teams of all sizes and backgrounds and across every industry and sector can incorporate statistical analysis into their workflows, fostering a culture of data-informed decision-making. We can expect faster decisions as AI assistants enable real-time data analysis, cutting down the time it takes to move from raw data to actionable insights while also reducing the need for specialized skills or expertise. Engaging non-traditional users brings unique perspectives to data analysis, as new users will ask new questions and uncover important patterns that experts might have overlooked.
While the benefits are substantial, the risks of democratizing statistics are equally important. Without proper training or safeguards, errors can lead to misguided decisions, wasted resources, or even harm. As anyone who has taught (or been taught) statistics knows, we can expect new users to mistake correlation for causation, overlook selection bias and omitted variables, run many hypothesis tests without correcting for multiple comparisons, and draw confident conclusions from samples that are too small or poorly coded.
From a systems perspective, algorithmic bias can also mislead users and perpetuate existing biases. If AI systems are trained on skewed, non-representative datasets, or if they rely on inappropriate explanatory and outcome variables, they risk propagating errors and generating misleading analyses. A well-documented example is the misrepresentation of health risks for Black patients in the US when health needs are proxied by health expenditures. This issue could be further exacerbated by reinforcement learning from human feedback (RLHF). When novice users accept AI-generated suggestions that are poorly calibrated or unsuitable for the problem at hand, and this feedback is subsequently incorporated into the model, biases can become entrenched and amplified over time.
If poorly designed AI can create risks, better AI offers a powerful way to reduce the risks of democratized statistics.
Ideally, any AI helping with statistics will create a guided workflow. First, it will automate data validation. Tools can flag issues such as missing values, duplicates, and outliers before analysis begins. Imagine an assistant that reads in some data and reports:
Your second variable is called height and appears to be in centimeters. It also uses “9999” to code for a missing value. One observation is 999 cm (33 feet) tall. It is likely that this value is a typo. Should I code it as missing?
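To make this concrete, here is a minimal Python sketch (using pandas) of the kind of validation pass such an assistant might run before any analysis. The column names, the 9999 missing-value code, and the simple interquartile-range rule for flagging outliers are illustrative assumptions, not a prescription:

```python
import numpy as np
import pandas as pd

def validate(df: pd.DataFrame, sentinel: float = 9999) -> list[str]:
    """Flag duplicates, missing values, sentinel codes, and outliers before analysis."""
    notes = []
    n_dup = int(df.duplicated().sum())
    if n_dup:
        notes.append(f"{n_dup} duplicated row(s)")
    for col in df.select_dtypes(include="number").columns:
        s = df[col]
        n_na = int(s.isna().sum())
        if n_na:
            notes.append(f"{col}: {n_na} missing value(s)")
        n_sent = int((s == sentinel).sum())
        if n_sent:
            notes.append(f"{col}: {n_sent} value(s) equal to {sentinel}, likely a missing-value code")
        s = s.replace(sentinel, np.nan)  # treat the code as missing before checking ranges
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        n_out = int(((s < lo) | (s > hi)).sum())
        if n_out:
            notes.append(f"{col}: {n_out} value(s) outside [{lo:.1f}, {hi:.1f}], possible typos")
    return notes

# The 999 cm height and the 9999 missing-value code are both flagged.
data = pd.DataFrame({"height_cm": [172, 165, 999, 9999, 180],
                     "weight_kg": [70, 60, 65, None, 80]})
for note in validate(data):
    print(note)
```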
As the AI proceeds with the analysis, it can verify whether the data satisfy statistical assumptions such as normality or linearity. For example, before running a correlation or t-test, the AI could automatically check for normality and suggest a non-parametric test if necessary.
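A sketch of what that check might look like in Python, using SciPy: a Shapiro-Wilk test screens for non-normality, and the assistant falls back to a rank-based (Spearman) correlation when the assumption fails. The simulated data and the 0.05 threshold are illustrative:

```python
import numpy as np
from scipy import stats

def correlate_with_checks(x, y, alpha: float = 0.05):
    """Check normality first; use a rank-based correlation if the assumption fails."""
    normal = True
    for v in (x, y):
        _, p = stats.shapiro(v)
        if p <= alpha:
            normal = False
    if normal:
        r, p = stats.pearsonr(x, y)
        return "Pearson (normality not rejected)", r, p
    r, p = stats.spearmanr(x, y)
    return "Spearman (normality rejected, using a rank-based test instead)", r, p

rng = np.random.default_rng(0)
x = rng.exponential(size=100)            # clearly non-normal
y = x + rng.normal(scale=0.5, size=100)
print(correlate_with_checks(x, y))
```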
It can also check functional forms. For example, it might ask:
You requested linear regression, which assumes your variables have a linear relationship. But the residuals in your specification follow a predictable pattern (according to the Ramsey RESET test). This result often points to a nonlinear relationship. I suggest a more flexible specification, such as polynomial regression, as well as an approach that makes no assumptions about functional form, such as random forests. Should I run those analyses instead?
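Such a check can be automated. Below is a rough sketch using the RESET test as implemented in statsmodels; the simulated quadratic data and the 0.05 cutoff are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2 + 0.5 * x**2 + rng.normal(size=200)    # the true relationship is quadratic

# Fit the linear model the user asked for, then run the Ramsey RESET test on it.
linear_fit = sm.OLS(y, sm.add_constant(x)).fit()
reset_p = float(linear_reset(linear_fit, power=2, use_f=True).pvalue)

if reset_p < 0.05:
    print(f"RESET test rejects linearity (p = {reset_p:.4f}).")
    print("Suggestion: add a polynomial term, or try a model with no functional-form assumptions.")
    quadratic_fit = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()
    print(f"R-squared: linear {linear_fit.rsquared:.3f}, quadratic {quadratic_fit.rsquared:.3f}")
```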
The AI system might also proactively alert users to potential methodological pitfalls. For instance, it could display a cautionary message such as, "Your dataset's sample size is too small to derive reliable conclusions," a feature already seen in some advanced software. Additionally, the AI could autonomously adjust p-values in contexts involving multiple hypothesis tests, flag the risks of overfitting, or adjust standard errors to reflect the data's characteristics more accurately.
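For instance, the adjustment for multiple hypothesis tests could be as simple as the following sketch, which uses the Benjamini-Hochberg procedure from statsmodels; the p-values here are made up for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from, say, testing ten different outcomes against the same treatment.
raw_p = [0.003, 0.020, 0.041, 0.049, 0.120, 0.260, 0.380, 0.510, 0.740, 0.900]

# Control the false discovery rate at 5% across all ten tests.
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")

# Four results look "significant" on their own; only one survives the correction.
for p, q, keep in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f}   adjusted p = {q:.3f}   significant after correction: {keep}")
```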
One might also imagine the AI engaging interactively with the analyst to solicit further context or supplementary documentation. For example, it could request additional materials—such as codebooks, questionnaires, or survey instruments—to deepen its understanding of the dataset. If the analysis suggests that the available data are inadequate, the system might even recommend collecting additional observations to strengthen the study’s empirical foundation.
Perhaps most importantly, the AI assistant can help interpret the results. To start, it can point out: “Remember, correlation does not imply causation. Consider additional analysis to understand causality.”
As the AI system evolves, it will increasingly discern both the intent of the analyst and the contextual nuances of the problem. Such a well-informed assistant can go further in spotting and addressing potential concerns. For example, if the AI knows the analyst is examining the relationship between education and wages, it might add:
It is important to consider a third factor, currently omitted from the analysis, such as a family's emphasis on child-rearing, which might increase both education and wages.
Similarly, it could warn about selection bias, such as:
Because your sample consists entirely of individuals who have completed college or higher, these estimates may be subject to selection bias, as those who did not pursue post-secondary education are excluded. To properly interpret your results, it is crucial to account for this non-random selection process.
As the software learns from the user's evolving expertise, and ideally as the user's own analytical skills expand, the degree of handholding and alerting provided by the AI can be adjusted over time. AI systems can contribute to this process by logging each decision they make, from handling missing observations to selecting specific tests. Such an iterative approach can facilitate error detection, enhance understanding of the rationale behind the AI's suggestions, and foster trust in the system. Just as researchers now routinely submit their Stata or R code with publications, one might foresee a future in which the submission of transcripts detailing LLM prompt-response interactions becomes a requisite component of the peer-review process.
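A decision log need not be elaborate. Here is a minimal sketch of what such a record might look like in Python; the steps and rationales are hypothetical examples:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnalysisLog:
    """A running record of every analytic decision the assistant makes."""
    entries: list = field(default_factory=list)

    def record(self, step: str, choice: str, rationale: str) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "step": step,
            "choice": choice,
            "rationale": rationale,
        })

log = AnalysisLog()
log.record("missing data", "listwise deletion", "only 2% of rows affected")
log.record("test selection", "Mann-Whitney U", "normality rejected by Shapiro-Wilk")
for entry in log.entries:
    print(entry)
```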
Many of these capabilities can be integrated simply by refining the prompt provided to a large language model. For example, ChatGPT 4o implements some of these features if we add the following to a prompt requesting a correlation:
To the extent possible, test whether the data satisfy the statistical assumptions of all techniques. Flag potential issues and interpret results with caution.
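Programmatically, appending such an instruction is straightforward. The sketch below uses the OpenAI Python client as one example; the model name, the placement of the instruction in a system message, and the sample request are illustrative choices rather than the only way to do it:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

GUARDRAIL = (
    "To the extent possible, test whether the data satisfy the statistical assumptions "
    "of all techniques. Flag potential issues and interpret results with caution."
)

user_request = "Compute the correlation between study hours and exam scores in the attached data."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": GUARDRAIL},
        {"role": "user", "content": user_request},
    ],
)
print(response.choices[0].message.content)
```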
Alternatively, an additional software module or agent could perform this extra testing and explanation. In the context of increasingly popular agentic systems, such functionality could be implemented as an adversarial mechanism in which one model is tasked with justifying each decision to another (a rough sketch follows below). However, relying on an extra module introduces a significant challenge: novice users, who stand to benefit the most from such extra guidance, may not recognize its relevance.
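As an illustration of that adversarial pattern, one could imagine something like the following; call_llm is a hypothetical placeholder for whatever chat-completion API is in use, and the prompts are only meant to convey the idea:

```python
def call_llm(role_prompt: str, message: str) -> str:
    """Hypothetical placeholder: send a prompt to an LLM and return its reply."""
    raise NotImplementedError("wire this up to an actual model provider")

def vetted_decision(decision: str, context: str) -> str:
    """One model justifies an analytic decision; a second model reviews the justification."""
    justification = call_llm(
        "You are a statistical analyst. Justify the following analytic decision.",
        f"Decision: {decision}\nContext: {context}",
    )
    return call_llm(
        "You are a skeptical statistical reviewer. Approve only if the justification is sound; "
        "otherwise explain the flaw and propose an alternative.",
        f"Decision: {decision}\nJustification: {justification}",
    )

# Example: the reviewer gets a chance to object before the analysis runs.
# vetted_decision("Use a t-test on the raw outcome", "n = 12 and the outcome is heavily skewed")
```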
Thus, the optimal solution would be to embed this supplementary analysis as a default feature in any artificial intelligence tool that facilitates statistical analyses. Developers at organizations such as OpenAI, Anthropic, Google, and Meta should consider incorporating these instructive defaults into the system’s core functionality whenever it identifies that a statistical analysis is being performed. An effective approach might include querying users about their statistical proficiency at the outset and dynamically adapting the system’s feedback based on subsequent interactions and inquiries.
As statistics become more accessible, the opportunities for innovation and insight are vast. However, the risks of misuse are just as significant. The same AI assistants that empower novices to use statistical techniques should also act as a safety net, helping new users avoid common pitfalls and guiding them to make confident, data-driven decisions.
By combining automated tools with ongoing education and awareness of statistical principles, we can create a world where everyone—not just experts—can harness the power of data responsibly and effectively. We will see users with less initial education producing rapid and credible analyses. Along the way, the users will learn more statistics and all of society will benefit from more decisions informed by data.
As AI-based statistical tools mature, there is also an opportunity to integrate interactive learning modules. This integrated approach would allow users—regardless of their prior expertise—to progressively build deeper statistical literacy, thereby ensuring that the benefits of democratized data analysis are both far-reaching and sustainable.
We invite developers, educators, and policymakers to join the conversation and ensure that democratized statistics truly benefits everyone.
Note: Authors are listed in alphabetical order. The examples of age over 700 years and height of 999 cm are both from Levine’s experience. Written with the assistance of ChatGPT.
José Ramón Enríquez is a postdoctoral fellow at the Stanford Digital Economy Lab (Stanford HAI) and the Golub Capital Social Impact Lab (Stanford GSB). José Ramón obtained his Ph.D. in Political Economy and Government (PEG) from Harvard University in May 2023.
José Ramón studies the political economy of economic and political development with a focus on political accountability. Specifically, he has worked on understanding the role of information in improving political accountability, with a specific emphasis on misinformation, political polarization, and corruption; the causes and effects of criminal-political violence on democratic representation; and the effects of the lack of coordination across levels of government.
David I. Levine is a professor of business administration at Berkeley Haas and serves as chair of the Economic Analysis & Policy Group. Levine’s research focuses on understanding and overcoming barriers to improving health in poor nations. This research has examined both how to increase demand for health-promoting goods such as safer cookstoves and water filters, and how to change health-related behaviors such as handwashing with soap. He has also written extensively on organizational learning (and failures to learn).
David Nguyen is a research scientist at the Stanford Digital Economy Lab. David’s research explores new and better ways to measure the modern and digital economy. He is particularly interested in advancing economic metrics and statistics on economic output and welfare.
Prior to joining the Stanford Digital Economy Lab, David worked as an economist at the OECD in Paris and as a senior economist at the National Institute of Economic and Social Research (NIESR). As a research associate, he remains affiliated with the London-based Economic Statistics Centre of Excellence (ESCoE). David received his PhD from the London School of Economics.