<?xml version="1.0" encoding="UTF-8"?>

<record version="5" id="6313">
 <title>Simpson's paradox</title>
 <name>SimpsonsParadox</name>
 <created>2004-10-06 19:27:27</created>
 <modified>2008-04-15 02:23:07</modified>
 <type>Definition</type>
 <creator id="3771" name="CWoo"/>
 <author id="3771" name="CWoo"/>
 <classification>
	<category scheme="msc" code="62H17"/>
 </classification>
 <preamble>% this is the default PlanetMath preamble.  as your knowledge
% of TeX increases, you will probably want to edit this, but
% it should be fine as is for beginners.

% almost certainly you want these
\usepackage{amssymb,amscd}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{tabls}
\usepackage{multirow}

% used for TeXing text within eps files
%\usepackage{psfrag}
% need this for including graphics (\includegraphics)
%\usepackage{graphicx}
% for neatly defining theorems and propositions
%\usepackage{amsthm}
% making logically defined graphics
%\usepackage{xypic}

% there are many more packages, add them here as you need them

% define commands here
%\renewcommand\multirowsetup{\centering}
\newlength\LL \settowidth\LL{100}
%\renewcommand\LL{\hspace{100}}</preamble>
 <content>\PMlinkescapeword{types}
\PMlinkescapeword{measure}
\PMlinkescapeword{groups}

Before describing what \emph{Simpson's paradox} is, let's start with a hypothetical example.  During a particular summer, an experiment was conducted to compare preferences between two types of beverages: soda and lemonade.  The data were drawn from two locations: city and rural.  In each location, the gender and the choice of drink of each subject were recorded.  The results are summarized as follows:
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
location&amp;gender&amp;lemonade&amp;soda&amp;total&amp;\% preferring lemonade&amp;\PMlinkname{odds ratio}{OddsRatio}\\
\hline
\multirow{2}{\LL}{city}
&amp;female&amp;150&amp;300&amp;450&amp;33.3\% &amp; \multirow{2}{\LL}{1.10} \\ \cline{2-6}
&amp;male&amp;300&amp;660&amp;960&amp;31.3\% &amp; \\ 
\hline
\multirow{2}{\LL}{rural}
&amp;female&amp;285&amp;860&amp;1145&amp;24.9\% &amp; \multirow{2}{\LL}{1.10} \\ \cline{2-6}
&amp;male&amp;30&amp;100&amp;130&amp;23.1\% &amp; \\
\hline
\end{tabular}
\end{center}
The conditional odds ratio given that location = city is about 1.1, showing that the odds of preferring lemonade are about 10\% higher for females than for males.  The conditional odds ratio given that location = rural is also about 1.1, so the same conclusion holds there.
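\par
As a check, the two conditional odds ratios can be recomputed directly from the counts in the table.  The symbol $\theta$ is notation introduced here for the odds ratio of females to males preferring lemonade; it is not used elsewhere in this entry.
\[
\theta_{\text{city}} = \frac{150/300}{300/660} = \frac{150\times 660}{300\times 300} = 1.10,
\qquad
\theta_{\text{rural}} = \frac{285/860}{30/100} = \frac{285\times 100}{860\times 30} \approx 1.10 .
\]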
\par
Next, combine the results from both locations to form the following $2\times 2$ contingency table:
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
gender&amp;lemonade&amp;soda&amp;total&amp;\% preferring lemonade&amp;odds ratio\\
\hline
female&amp;435&amp;1160&amp;1595&amp;27.3\% &amp; \multirow{2}{\LL}{0.86} \\ \cline{1-5}
male&amp;330&amp;760&amp;1090&amp;30.3\% &amp; \\
\hline
\end{tabular}
\end{center}
The odds ratio of 0.86 shows that, in the combined data, the odds of preferring lemonade are about 14\% lower for females than for males, rather than about 10\% higher as each location showed separately!  This is an example of \emph{Simpson's paradox}.
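\par
The marginal odds ratio can be verified from the pooled counts in the same way.  Here $\theta_{\text{marginal}}$ is a label introduced only for this check, denoting the odds ratio of females to males preferring lemonade in the combined table.
\[
\theta_{\text{marginal}} = \frac{435/1160}{330/760} = \frac{435\times 760}{1160\times 330} = \frac{330600}{382800} \approx 0.86 .
\]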
\par
In general, Simpson's paradox illustrates the effect that omitting a categorical explanatory variable $Z$ can have on the measure of association between a categorical explanatory variable $X$ and a categorical response variable $Y$.  
\par
In the example, given the location variable $Z$, the conditional odds ratios show that the gender variable $X$ and the choice-of-drink response variable $Y$ have a positive association, with positive log-odds ratios.  However, when the location variable $Z$ is ignored, the marginal association between $X$ and $Y$ is negative, with a negative log-odds ratio.
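\par
To make the sign change explicit, one can take logarithms of the rounded odds ratios from the tables (about 1.10 in each location and about 0.86 for the combined data).  Natural logarithms are assumed here; the signs do not depend on the base.
\[
\log 1.10 \approx 0.095, \qquad \log 0.86 \approx -0.15 .
\]
The conditional log-odds ratios are positive, while the marginal log-odds ratio is negative.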
\par
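One reason for this apparent paradox is the difference in gender composition between the city and the rural groups.  In the rural area, the majority of the test subjects are female, whereas in the city area, the majority are male.  Since lemonade is less popular in the rural area than in the city for both genders, pooling the two locations pulls the overall female proportion toward the lower rural rates and the overall male proportion toward the higher city rates.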
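\par
This can be seen as a weighted-average effect, sketched here using only the counts from the tables.  For each gender, the pooled proportion preferring lemonade is the average of the city and rural proportions, weighted by the share of that gender sampled in each location.  For females,
\[
\frac{435}{1595} = \frac{450}{1595}\cdot\frac{150}{450} + \frac{1145}{1595}\cdot\frac{285}{1145} \approx 0.282\times 0.333 + 0.718\times 0.249 \approx 0.273,
\]
and for males,
\[
\frac{330}{1090} = \frac{960}{1090}\cdot\frac{300}{960} + \frac{130}{1090}\cdot\frac{30}{130} \approx 0.881\times 0.313 + 0.119\times 0.231 \approx 0.303 .
\]
About 72\% of the females fall in the rural group, where the proportion is lower, while about 88\% of the males fall in the city group, where it is higher.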
\par
For an excellent explanation of Simpson's paradox, please refer to the book below.
\begin{thebibliography}{8}
\bibitem{agresti} A. Agresti, {\em An Introduction to Categorical Data Analysis}, Wiley \&amp; Sons, New York (1996).
\end{thebibliography}</content>
</record>
