# logistic regression

Given a binary response variable $Y$ with probability of success $p$, logistic regression is a non-linear regression model with the following model equation:

 $\operatorname{E}[Y]=\frac{\exp(\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta})},$

where $\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}$ is the product of the transpose of the column vector $\boldsymbol{X}$ of explanatory variables and the unknown column vector $\boldsymbol{\beta}$ of regression coefficients. Solving for $\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}$ on the right hand side, we arrive at a new equation

 $\ln\Big(\frac{\operatorname{E}[Y]}{1-\operatorname{E}[Y]}\Big)=\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}.$

The left hand side of this new equation is known as the logit function, defined on the open unit interval $(0,1)$ with range the entire real line $\mathbb{R}$:

 $\operatorname{logit}(p):=\ln\Big(\frac{p}{1-p}\Big),\quad p\in(0,1).$
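As a quick numerical illustration (a minimal Python sketch; the function names `logit` and `logistic` are our own), the logit and the logistic function $e^{x}/(1+e^{x})$ from the model equation are mutual inverses:

```python
import math

def logit(p: float) -> float:
    """ln(p / (1 - p)): the log-odds of success, defined for p in (0, 1)."""
    return math.log(p / (1 - p))

def logistic(x: float) -> float:
    """Inverse of logit: exp(x) / (1 + exp(x)), mapping R onto (0, 1)."""
    return math.exp(x) / (1.0 + math.exp(x))

# The two functions are inverses: logistic(logit(p)) recovers p on (0, 1).
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    assert abs(logistic(logit(p)) - p) < 1e-12
```

In particular $\operatorname{logit}(1/2)=0$, and the logit grows without bound as $p$ approaches either endpoint of $(0,1)$.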

Note that the logit of $p$ is the natural log of the odds of success (over failure) when the probability of success is $p$. Since $Y$ is a binary response variable, it has a binomial distribution with parameter (probability of success) $p=\operatorname{E}[Y]$, so the logistic regression model equation can be rewritten as

 $\operatorname{logit}\big(\operatorname{E}[Y]\big)=\operatorname{logit}(p)=\boldsymbol{X}^{\operatorname{T}}\boldsymbol{\beta}.$ (1)

Logistic regression is a particular type of generalized linear model. In addition, the associated logit function is the most appropriate and natural choice for a link function. By natural we mean that $\operatorname{logit}(p)$ is equal to the natural parameter $\theta$ appearing in the distribution function  for the GLM (generalized linear model). To see this, first note that the distribution function for a binomial random variable  $Y$ is

 $P(Y=y)=\binom{n}{y}p^{y}(1-p)^{n-y},$

where $n$ is the number of trials, $Y=y$ is the event that there are $y$ successes in these $n$ trials, and the parameter $p$ is the probability of success. Let there be $N$ independent binomial random variables $Y_{1},Y_{2},\ldots,Y_{N}$, each corresponding to $n_{i}$ trials with probability of success $p_{i}$. Then the joint probability distribution of these $N$ random variables is simply the product of the individual binomial distributions. Equating this to the distribution for the GLM, which belongs to the exponential family of distributions, we have:

 $\prod_{i=1}^{N}\binom{n_{i}}{y_{i}}{p_{i}}^{y_{i}}(1-p_{i})^{n_{i}-y_{i}}=\prod_{i=1}^{N}\exp\big[y_{i}\theta_{i}-b(\theta_{i})+c(y_{i})\big].$

Taking the natural log of both sides, we obtain the log-likelihood function in two different forms:

 $\sum_{i=1}^{N}\Big[\ln\binom{n_{i}}{y_{i}}+y_{i}\ln p_{i}+(n_{i}-y_{i})\ln(1-p_{i})\Big]=\sum_{i=1}^{N}\big[y_{i}\theta_{i}-b(\theta_{i})+c(y_{i})\big].$
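This equality can be checked numerically. The plain-Python sketch below evaluates a single term $i$ of both sides, using the identifications $\theta=\operatorname{logit}(p)$, $b(\theta)=n\ln(1+e^{\theta})$, and $c(y)=\ln\binom{n}{y}$ that the term-by-term comparison yields:

```python
import math

def binomial_loglik(y: int, n: int, p: float) -> float:
    """Binomial form: ln C(n, y) + y ln p + (n - y) ln(1 - p)."""
    return (math.log(math.comb(n, y))
            + y * math.log(p) + (n - y) * math.log(1 - p))

def expfam_loglik(y: int, n: int, p: float) -> float:
    """Exponential-family form y*theta - b(theta) + c(y), with
    theta = logit(p), b(theta) = n ln(1 + e^theta), c(y) = ln C(n, y)."""
    theta = math.log(p / (1 - p))
    b = n * math.log(1 + math.exp(theta))
    c = math.log(math.comb(n, y))
    return y * theta - b + c

# The two forms agree for any valid (y, n, p).
for y, n, p in [(3, 10, 0.4), (0, 5, 0.2), (7, 7, 0.9)]:
    assert abs(binomial_loglik(y, n, p) - expfam_loglik(y, n, p)) < 1e-9
```

The identity behind $b(\theta)$ is $1-p=1/(1+e^{\theta})$, so $-n\ln(1-p)=n\ln(1+e^{\theta})$.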

Rearranging the left hand side and comparing the $i$th terms, we have

 $y_{i}\ln\Big(\frac{p_{i}}{1-p_{i}}\Big)+n_{i}\ln(1-p_{i})+\ln\binom{n_{i}}{y_{i}}=y_{i}\theta_{i}-b(\theta_{i})+c(y_{i}),$

so that, matching the coefficients of $y_{i}$, $\theta_{i}=\ln\big(p_{i}/(1-p_{i})\big)=\operatorname{logit}(p_{i})$, while the remaining terms give $b(\theta_{i})=-n_{i}\ln(1-p_{i})$ and $c(y_{i})=\ln\binom{n_{i}}{y_{i}}$.

Next, setting the natural link function (the logit) of the expected value of $Y_{i}$, which is $p_{i}$, equal to the linear portion of the GLM, we have

 $\operatorname{logit}(p_{i})={\boldsymbol{X}_{i}}^{\operatorname{T}}\boldsymbol{\beta}.$
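In practice the coefficients $\boldsymbol{\beta}$ are estimated by maximizing the log-likelihood, typically via iteratively reweighted least squares; the simpler gradient-ascent sketch below (plain Python, one explanatory variable, made-up toy data) is only illustrative. For binary responses the gradient of the log-likelihood with respect to $(\beta_{0},\beta_{1})$ is $\sum_{i}(y_{i}-p_{i})(1,x_{i})$:

```python
import math

def logistic(x: float) -> float:
    """Inverse logit: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_logistic(xs, ys, lr=0.05, steps=20000):
    """Gradient ascent on the Bernoulli log-likelihood for the model
    logit(p_i) = b0 + b1 * x_i.  Gradient: sum_i (y_i - p_i) * (1, x_i)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            resid = y - logistic(b0 + b1 * x)  # y_i - p_i
            g0 += resid
            g1 += resid * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical toy data: success becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
assert b1 > 0                                 # fitted slope is positive
assert logistic(b0 + b1 * 5) > logistic(b0)   # fitted P(success) rises with x
```

Because the Bernoulli log-likelihood is concave in $\boldsymbol{\beta}$, plain gradient ascent converges to the maximum-likelihood estimate whenever the data are not perfectly separable (here they are not, since a success at $x=2$ precedes a failure at $x=3$).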


Title: logistic regression (LogisticRegression). Author: CWoo. Date: 2013-03-22. Type: Definition. MSC: 62J12, 62J02. Related: logit, probit, complementary-log-log.