In data analysis, it's crucial to recognize that correlation does not necessarily imply causation. Assuming that because two variables are correlated, one must cause the other is a classic logical fallacy: a third factor may be driving both variables, or the association may be purely coincidental. Additional evidence, such as experimental data or a strong theoretical rationale, is essential to establish causation.
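The confounder case is easy to see in a tiny simulation. In the sketch below, the variables and coefficients (heat driving both ice-cream sales and sunburns) are illustrative assumptions, not data from this post:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: summer heat drives both ice-cream sales and sunburns.
heat = rng.normal(size=n)
ice_cream_sales = 2.0 * heat + rng.normal(size=n)   # caused by heat
sunburns = 1.5 * heat + rng.normal(size=n)          # also caused by heat

# The two outcomes are strongly correlated, yet neither causes the other.
r = np.corrcoef(ice_cream_sales, sunburns)[0, 1]
```

Removing either variable's dependence on `heat` makes the correlation vanish, which is exactly the "other factors at play" scenario: the observed association is entirely inherited from the common cause.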
Moving beyond correlation means investigating the underlying mechanisms and relationships that govern a phenomenon, in order to identify the true drivers of observed patterns. By embracing this approach, analysts can uncover insights that go beyond surface-level correlations, enabling more informed decision-making and more robust predictive models.
Figure 1: The xkcd comic humorously explores the classic confusion between causation and correlation, emphasizing the significance of background knowledge in making causal inferences. https://xkcd.com/552/
Introduction
In this blog, we explore why "correlation does not equal causation," delve into the basics of causal inference, discuss how causal relationships can be discovered, and show how AI research benefits from embracing causality. As causal inference gains momentum in industry, major players are recognizing its pivotal role and investing in causal data science skills for their personnel.
From a marketing perspective, picture this: a company launches a groundbreaking advertising campaign and witnesses a surge in sales. It's tempting to attribute the spike directly to the campaign's brilliance. But hold on a moment: are we jumping to conclusions too quickly? Without causal inference, we cannot tell whether the campaign truly ignited the sales boost or whether other forces were at work behind the scenes, such as seasonal shifts, competitor moves, or evolving consumer tastes. Causal inference techniques guide analysts through this uncertainty, untangling the web of influences and delivering clear insights into the campaign's genuine impact on sales.
Figure 2: Follow the Customer Journey
Causal Inference
Classical machine learning aims to minimize prediction error, striving for accurate models. This objective is easy to state and to evaluate, which has propelled ML research through competitions across many domains. Causal inference, by contrast, lacks such a clear, objective evaluation criterion, because its goal is to understand the underlying causal relationships rather than merely to predict.
Causal inference therefore poses greater challenges than optimizing a loss function, and context-specific domain knowledge becomes pivotal. Benchmarking model estimates against actual experiments can offer retrospective insight into performance, but such experiments are rare in practice. Without a straightforward accuracy criterion, assessing causal estimates is intricate: the quality of a causal inference hinges on assumptions that cannot be tested from the available data alone. This prompts a fundamental reevaluation of how data science and ML problems are approached.
Indeed, a fundamental theoretical framework shedding light on the challenges of causal data science is the Pearl causal hierarchy (PCH), also known as the ladder of causation. This hierarchy categorizes data analysis into three distinct layers of an information hierarchy.
At the lowest rung are associations, which refer to simple conditional probability statements between variables in the data. They remain purely correlational ("How does X relate to Y?") and, therefore, do not have any causal meaning. The second rung relates to interventions ("What happens to Y if I manipulate X?"), and we have already entered the world of causality here. On the third layer, we finally have counterfactuals ("What would Y be if X had been x?"), which represent the highest form of causal reasoning.
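The gap between the first two rungs can be made concrete with a toy structural causal model. In the sketch below, the model (a confounder Z affecting both X and Y, with a true causal effect of X on Y equal to 1.0) is an assumption for illustration, not taken from this post:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def simulate(do_x=None):
    # Toy structural causal model: Z -> X, Z -> Y, and X -> Y
    # with a true causal effect of 1.0.
    z = rng.normal(size=n)
    x = z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)
    return x, y

# Rung 1 (association): the observational regression slope of Y on X.
x, y = simulate()
slope_obs = np.cov(x, y)[0, 1] / np.var(x)   # biased upward by the confounder Z

# Rung 2 (intervention): E[Y | do(X=1)] - E[Y | do(X=0)].
# Setting X by hand severs the Z -> X arrow, recovering the causal effect.
effect = simulate(do_x=1.0)[1].mean() - simulate(do_x=0.0)[1].mean()
```

The observational slope overstates the effect because part of the association flows through Z, while the interventional contrast isolates the causal effect of X alone. Rung 3 (counterfactuals) would additionally require holding the individual noise terms fixed, which neither observational nor experimental data alone can identify.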
Figure 3: The Causal Hierarchy. Questions at level i can only be answered if information from level i or higher is available. Table from (Pearl, 2018).
The theoretical demands of causal inference outlined by the PCH necessitate a paradigm shift in data science, bringing forth significant organizational hurdles. Engaging domain experts, including clients, engineers, and sales partners, becomes imperative to validate assumptions and ensure the accuracy of problem modeling. This collaborative approach fosters a more comprehensive perspective on data science and prompts the restructuring of team dynamics, heralding a more integrated and holistic approach to tackling complex challenges.
Causal Discovery
In scientific inquiry, uncovering causal relationships stands as a foundational pursuit. One commonly employed method is the randomized A/B experiment. Consider the scenario of assessing the efficacy of a novel cancer treatment: researchers recruit participants and randomly assign them to either a control group, receiving a placebo, or a treatment group, receiving the experimental drug. The purpose of randomization is to mitigate potential confounding factors. For instance, in an observational setting age could act as a confounder, influencing both who takes the drug and how well the treatment works; randomization keeps the age distributions of the two groups comparable.
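The age-confounding scenario can be sketched numerically. In the simulation below, all numbers are illustrative assumptions (the true treatment effect is set to +1, and older patients are both more likely to take the drug and have worse baseline outcomes):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
age = rng.uniform(20, 80, size=n)

# Observational data: the probability of taking the drug rises with age,
# and the outcome worsens with age -- age confounds treatment and outcome.
p_treat = (age - 20) / 60
treated_obs = rng.random(n) < p_treat
outcome_obs = 1.0 * treated_obs - 0.1 * age + rng.normal(size=n)

# Naive treated-vs-untreated comparison: biased, because the treated
# group is systematically older.
naive = outcome_obs[treated_obs].mean() - outcome_obs[~treated_obs].mean()

# Randomized experiment: coin-flip assignment breaks the age -> treatment
# link, so the same comparison now recovers the true effect of +1.
treated_rct = rng.random(n) < 0.5
outcome_rct = 1.0 * treated_rct - 0.1 * age + rng.normal(size=n)
rct = outcome_rct[treated_rct].mean() - outcome_rct[~treated_rct].mean()
```

In this setup the naive observational estimate even comes out negative, making a genuinely helpful drug look harmful, while the randomized comparison recovers the true effect.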
Nevertheless, randomized experiments can prove prohibitively expensive and logistically challenging to execute, occasionally raising ethical concerns. Consequently, causal discovery from observational data, a field gaining traction across machine learning, philosophy, statistics, and computer science, has emerged as a compelling alternative. This approach involves inferring causal relationships directly from observational data, circumventing the need for costly and potentially ethically fraught randomized experiments.
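A basic primitive behind constraint-based causal discovery (as used by algorithms such as PC) is conditional-independence testing on observational data. The sketch below, with an assumed collider structure X → Z ← Y and illustrative coefficients, shows the telltale pattern such methods exploit: two causes of a common effect are independent marginally but become dependent once we condition on the effect:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Assumed collider structure: X -> Z <- Y, with X and Y independent causes.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)

def partial_corr(a, b, c):
    # Correlation of a and b after linearly regressing out c from each.
    res_a = a - np.polyval(np.polyfit(c, a, 1), c)
    res_b = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(res_a, res_b)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]     # near zero: X and Y are independent
conditional = partial_corr(x, y, z)    # clearly nonzero: conditioning on the
                                       # collider Z opens a dependence path
```

This asymmetry lets discovery algorithms orient some edges from purely observational data: a collider is the one structure among X - Z - Y patterns in which conditioning on the middle variable creates, rather than removes, dependence.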