\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage{natbib}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{Why Correlation Isn't Always Causation}
\author[1]{Charles C. Igel}%
\affil[1]{Regis University}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
\pagebreak
At some point, we've all likely been warned that correlation is not
causation. It sounds reasonable, so we tend to accept the assertion. But
what does it really mean? And is it always true?
To answer these questions, we first need to understand what the terms
mean and how they are distinguished from one another. Correlation is a
mathematical representation that summarizes the measured association
between variables. In simpler terms, it's a number between -1 and 1 that
describes what happens to one variable (let's call this variable
\emph{y}) when another variable changes (let's call this one \emph{x}).
Causation takes correlation a bit further by demanding more from our
variables than a basic association. Causation requires that at least
part of the change we see in variable \emph{y} is actually due to
changes in variable \emph{x}. In other words, a change in one variable
has actually caused a change in the other, hence the term \emph{causal}.
\par\null
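For readers who want the formula behind that number, the statistic most
often meant is Pearson's \emph{r}. For \emph{n} paired observations
$(x_i, y_i)$,
\[
r_{xy} \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}},
\]
where $\bar{x}$ and $\bar{y}$ are the sample means. The numerator
captures how the two variables move together; the denominator rescales
that quantity so that \emph{r} always falls between $-1$ and $1$.
\par\null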
First, let's look at how correlations between variables can be
misleading. The scatterplot in Fig. 1 shows simulated data from a sample
of 50 elementary students, grades 1-6. The plot shows two variables for
each student: a measure of shoe size along the
\href{https://en.wikipedia.org/wiki/Cartesian_coordinate_system}{\emph{x}-axis}
(var. \emph{x}) and performance on a common math test along the
\href{https://en.wikipedia.org/wiki/Cartesian_coordinate_system}{\emph{y}-axis}
(var. \emph{y}). Each point in the plot represents the intersection of
those two variables for one student in our simulated sample. The
association between these variables is clear: as shoe size (\emph{x})
increases, so do math scores (\emph{y}). There is a rather wide range in
math scores across shoe sizes, but this range doesn't obscure the
overall association demonstrated by the linear increase indicated by the
blue
\href{https://blog.minitab.com/blog/adventures-in-statistics-2/curve-fitting-with-linear-and-nonlinear-regression}{line
of best fit}. To further reinforce this association, we can look at the
calculated correlation statistic between shoe size and math performance
{[}\emph{r}(\emph{xy})=.74{]}. {[}If this statistic is unfamiliar, see
\href{https://alternative-hypothesis.net/2019/01/14/linear-correlation/}{Linear
Association and Correlation}.{]} This is a strong correlation, certainly
something to take notice of, and it provides further evidence for the
association between shoe size and math performance within our sample.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/corr-scatterplot/corr-scatterplot}
\caption{{Correlation Scatterplot
{\label{434887}}%
}}
\end{center}
\end{figure}
Here's where a logical error can easily occur: The data seems clear and
consistent about this association, so we conclude that the best way to
improve math performance is to give students bigger shoes. We have just
made the leap from correlation to causation. This conclusion doesn't
make any logical sense, yet the data does show an association. How can
this be? The conclusion illustrated here falls under the logical fallacy
\href{https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc}{\emph{post
hoc ergo propter hoc}}, an assertion that because \emph{y} follows
\emph{x}, \emph{y} must be caused by \emph{x}. The danger of this
fallacy is its plausibility. Our brains are wired to find connections
between events and, unless these connections are as absurd as the
shoe-math connection, we tend to accept them as legitimate. We are
lulled into seeing causal associations where they may only be
correlational associations.
\par\null
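This kind of misleading association is easy to reproduce. The Python
sketch below uses invented numbers (not the data behind Fig. 1): it
simulates 50 students whose shoe size and math score are both driven by
a hidden third variable, then computes Pearson's \emph{r} between them.
Shoe size never touches the math score in the code, yet the correlation
comes out strong.

```python
import math
import random

random.seed(42)

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# 50 students in grades 1-6; grade (the hidden third variable) drives
# BOTH shoe size and math score.  All coefficients are invented.
grades = [random.randint(1, 6) for _ in range(50)]
shoe_size = [28 + 1.5 * g + random.gauss(0, 1) for g in grades]
math_score = [40 + 8 * g + random.gauss(0, 6) for g in grades]

# Shoe size appears nowhere in the formula for math_score, yet:
print(round(pearson_r(shoe_size, math_score), 2))  # strong positive r
```

The only causal arrows in this simulation run from grade to each of the
other two variables; the shoe-math correlation is entirely a byproduct.
\par\null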
Before we give up on the hope that research will tell us anything
trustworthy, it's important to understand that in most cases correlation
is not intended to be the end of the research process; it's largely a
way to see if further study into the causal relationship between
variables is warranted. If no correlation is found, it's unlikely that
any causal relationship will be found later, but if a correlation is
found, we need to determine whether that association is causal. We do
this by trying
to isolate and/or control as many other variables as possible, until we
are just looking at our variables of interest (in our example, variables
\emph{x} and \emph{y}). We tend to do this through one of two
techniques: (1) experimental design, or (2) statistical controls.
\par\null
Experimental design refers to the way a study is structured to control
those extra variables. A preferred approach is the randomized controlled
trial (RCT), sometimes referred to as a clinical
trial. The RCT is an experimental design in which participants are
randomly assigned to different conditions that are thought to affect the
outcome variable. To go back to our shoe-math example, we could randomly
assign students different size shoes (\emph{x}) and have them take the
math test (\emph{y}). If our notion that bigger shoes improve math
performance holds, then we should still see a strong \emph{xy}
correlation after the random assignment to shoe size. Generally
speaking, the mechanism underlying the RCT is an assumption that the
other possible variables that may link shoe size to math performance
will be randomly distributed across the assigned conditions, and by
randomly distributing them we control their effects.
\par\null
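A short Python sketch (again with invented numbers) illustrates the RCT
logic: when shoe size is assigned at random, independent of everything
that actually drives math scores, the shoe-math correlation collapses,
while observational shoe size that grows with grade still shows a strong
spurious association.

```python
import math
import random

random.seed(7)

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

n = 500  # a larger sample, so random assignment clearly averages out
grades = [random.randint(1, 6) for _ in range(n)]
math_score = [40 + 8 * g + random.gauss(0, 6) for g in grades]

# Observational shoe size grows with grade -> strong spurious r(xy).
shoe_observed = [28 + 1.5 * g + random.gauss(0, 1) for g in grades]

# RCT: shoe size ASSIGNED at random, independent of grade.
shoe_assigned = [random.uniform(28, 37) for _ in range(n)]

print(round(pearson_r(shoe_observed, math_score), 2))  # strong positive
print(round(pearson_r(shoe_assigned, math_score), 2))  # near zero
```

Random assignment works here because it breaks every pathway between the
assigned variable and the confounders, not because it measures them.
\par\null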
Statistical control is a way of mathematically isolating the measured
effects of the other variables so we can just look at the relationship
between our variables of interest. This technique is often used when an
experimental design isn't possible, and involves developing a
statistical model that includes many of the other possible variables
that may affect the outcome variable. A regression analysis is commonly
used for this purpose. To go back to our shoe-math example, we would
include measurements from all the variables we believe may link shoe
size to math performance within the regression analysis. Once these are
included in our statistical model, we're able to partial out their
influence and see how much association remains between our primary
variables (i.e., variables \emph{x} and \emph{y}) after doing so. Recall
that we began
with a correlation of \emph{r}(\emph{xy})=.74, a number that shows a
strong association between variables \emph{x} and \emph{y}.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.42\columnwidth]{figures/corr-causation1/corr-causation1}
\caption{{Original Correlation
{\label{826570}}%
}}
\end{center}
\end{figure}
In this example we may believe that, beyond shoe size, math performance
is also associated with the grade that students are in (let's call this
variable \emph{w}) and the amount of time per week students spend
studying (let's call this variable \emph{z}).
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.49\columnwidth]{figures/corr-causation2/corr-causation2}
\caption{{Correlation After Variables Added
{\label{172612}}%
}}
\end{center}
\end{figure}
Now that we've included these in our model, we see the original
correlation between shoe size (\emph{x}) and math test scores (\emph{y})
essentially dissolves to \emph{r}(\emph{xy})=.02. Where did it go? It
was mathematically absorbed by the real associations between grade level
and math performance {[}\emph{r}(\emph{wy})=.64{]} and between study
time and math performance {[}\emph{r}(\emph{zy})=.31{]}. Although the
original association between shoe size and math performance was there in
a statistical sense, it was not really there in a practical sense: it
was merely a proxy for the other variables that actually mattered in
this association.
\par\null
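The arithmetic of ``absorbing'' a correlation can be sketched with the
standard partial-correlation formula. The Python below uses invented
numbers and, for simplicity, a single control variable (grade, \emph{w})
rather than the article's two; it shows the raw shoe-math correlation
shrinking toward zero once grade is held constant.

```python
import math
import random

random.seed(3)

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

n = 500
w = [random.randint(1, 6) for _ in range(n)]        # grade (control)
x = [28 + 1.5 * g + random.gauss(0, 1) for g in w]  # shoe size
y = [40 + 8 * g + random.gauss(0, 6) for g in w]    # math score

r_xy = pearson_r(x, y)
r_xw = pearson_r(x, w)
r_yw = pearson_r(y, w)

# Partial correlation of x and y, holding w constant:
#   r_xy.w = (r_xy - r_xw * r_yw) / sqrt((1 - r_xw^2) * (1 - r_yw^2))
r_xy_w = (r_xy - r_xw * r_yw) / math.sqrt((1 - r_xw**2) * (1 - r_yw**2))

print(round(r_xy, 2))    # strong raw correlation
print(round(r_xy_w, 2))  # near zero once grade is controlled
```

A multiple regression with several controls generalizes this same idea:
each added variable soaks up the share of the association it can explain.
\par\null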
Given all this, we can think of correlation as a necessary (in most
cases) but insufficient condition for causation. Associations like the
shoe-math example presented here are known as
\href{https://sociologydictionary.org/spurious-relationship/}{spurious
correlations}: real in a statistical sense but meaningless in a
practical sense. Absurd examples like this are easy to spot because the
statistical relationship between the variables makes no logical sense.
Subtler examples can be much harder to spot, and are frequently accepted
as fact. There are other ways of assessing causality beyond those
presented here, but always beware when a causal claim is presented from
merely correlational evidence.
\par\null
\textbf{Key Ideas:}
\begin{itemize}
\tightlist
\item
Correlation is a mathematical representation that summarizes the
measured association between variables.
\item
Correlation is a necessary (in most cases) but insufficient condition
for causality.
\item
Causality is typically tested through (1) experimental design, or (2)
advanced statistical analysis.
\item
Be cautious of any causal claim that is backed solely by correlational
evidence.
\end{itemize}
\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}