Logistic regression: theory review

Logistic regression is a model used to predict a binary outcome (dependent variable) based on a set of features (independent variables). For example, it could be applied to customer browsing data to predict whether they will convert during that session (0 = no, 1 = yes). The model estimates the probability that the outcome belongs to the class labeled 1 by modelling the log odds of the outcome as a linear function of the predictors. Since linear functions have an unbounded range, we work with the log odds (which range from −∞ to +∞) rather than the probability itself (which ranges from 0 to 1). The logistic (sigmoid) function then maps the log odds back into the 0-1 range, turning them into a probability.

To estimate the coefficients (weights for each feature), we use Maximum Likelihood Estimation (MLE), in practice minimising the negative log likelihood with an iterative optimisation algorithm, as there is no closed-form solution.

There are several assumptions that must be met for valid inference, including linearity (between the log odds and the features), the absence of multicollinearity (highly correlated features), and no endogeneity (features correlated with the error term). Additionally, the model requires a sufficiently large sample size and no influential outliers. These assumptions are similar to linear regression but differ in some respects due to the model’s structure and the binary outcome (for example, heteroskedasticity in the error is inherent). When these assumptions are properly addressed, logistic regression provides a simple yet effective approach to classification, offering valuable insight into the factors driving the binary outcome.

How it works

Logistic regression is used for binary outcomes, meaning there are two classes which we label “0” (the event didn’t occur) or “1” (the event occurred). We want to predict the probability of the event occurring (Y=1), given the features (X): \(P(Y=1 \mid X)\). As probabilities range from 0 to 1, we can’t directly fit a linear model on them. Instead we convert the outcome to the log odds (the natural logarithm of the odds of being “1” vs “0”), which ranges from −∞ to +∞. We then use an iterative process to find the linear combination of the features that maximises the likelihood of the observed outcomes (Maximum Likelihood Estimation). Through this process we obtain the estimated coefficients for each feature, and the predicted probability of “1” for each observation. In this section I’ll go through the process.

Outcome: probability -> log odds

To understand the connection between the log odds, odds, and probability of the outcome, it’s helpful to work through an example. Let’s assume we’re looking at one observation (one user session), where the probability of converting is 0.8 (80%).

Formula view

The odds is the ratio of the probability of the event happening to the probability of it not happening, with a possible range of 0 to +∞. If the probability is 0.8, the odds are 4 to 1, meaning the event is 4 times as likely to occur as not (for every 1 time it doesn’t occur, it occurs 4 times).

\[\text{Odds}(Y=1 \mid X) = \frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)} = \frac{0.8}{0.2} = 4\]

The log odds is the natural logarithm of the odds, and ranges from −∞ to +∞. If it’s positive, it means the event is more likely to occur than not. If it’s 0, the odds are 1:1 (equally likely to occur or not occur). If it’s negative, the event is more likely to not occur than to occur. In our example it’s positive, re-iterating that it’s more likely to occur than not.

\[\text{Log Odds}(Y=1 \mid X) = \log \left( \frac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)} \right) = \log(4) \approx 1.386\]

We can then reverse this process to convert the log odds back to the probability:

\[P(Y=1 \mid X) = \frac{1}{1 + e^{-\text{Log Odds}}} = \frac{1}{1 + e^{-1.386}} = 0.8\]

The predicted outcome is then found by applying a decision boundary to the predicted probability (typically 0.5). In our example the prediction would be “1”, meaning we predict they converted:

\[\hat{Y} = \begin{cases} 1 & \text{if } p \geq 0.5 \\ 0 & \text{if } p < 0.5 \end{cases}\]
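To make the arithmetic concrete, here’s a minimal Python sketch of these conversions, using the example probability of 0.8 from above:

```python
import numpy as np

p = 0.8                                 # P(Y=1 | X) for our example session
odds = p / (1 - p)                      # 4.0: event is 4 times as likely to occur as not
log_odds = np.log(odds)                 # ~1.386: positive, so more likely to occur than not

# Reverse the process with the logistic (sigmoid) function
p_back = 1 / (1 + np.exp(-log_odds))    # 0.8 again

# Apply the 0.5 decision boundary to get the predicted class
y_hat = int(p_back >= 0.5)              # 1: we predict they converted
print(odds, log_odds, p_back, y_hat)
```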

Visualising the connection

Let’s now look at a range of values to see how the three quantities relate. In the first plot we see that as the probability increases, the odds increase non-linearly (grow exponentially): as the probability approaches 0 the odds approach 0, and as it approaches 1 the odds approach infinity. The second plot converts the odds to log odds: as the odds increase, the log odds increase. The third plot is the one we’re really interested in: it compares the log odds to the original probability values, showing the sigmoid (S) curve relationship between them. This visualises what we said earlier: negative values mean the probability is < 0.5 (more likely not to occur than to occur); 0 means equal chance (p=0.5); and, positive values mean the probability is > 0.5 (more likely to occur than not).

log odds vs probability

Logistic model

Given the feature values (X’s), we want to determine the weights (coefficients or β) that, when applied to our training data, give us predicted probabilities (\(\hat{P}(Y=1)\)) as close as possible to the actual binary outcomes (Y). We do this by finding the linear combination of the features (on the log odds scale) that maximises the likelihood of the observed outcomes.

Logistic model:

\[\text{Log Odds} = \log \left( \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\] \[P(Y = 1 \mid X) = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}\]

Where: \(X_1, \dots, X_n\) are the feature values; \(\beta_0\) is the intercept; \(\beta_1, \dots, \beta_n\) are the coefficients (weights) for each feature; \(z = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n\) is the linear combination (the log odds); and \(\sigma(z)\) is the logistic (sigmoid) function that maps the log odds to a probability.

You may notice the similarities between the logistic model formula above and linear regression. Both models fit a linear function of a set of features (X) and corresponding weights (β coefficients). This commonality is a major advantage of logistic regression, but note there are two key differences: 1) the outcome Y is binary (0/1) instead of continuous; and, 2) the linear function models the log odds (logit) of the outcome rather than the outcome itself. Given point 1, we require point 2. Linear functions have an unbounded range, so they can predict any real number, including values far outside [0, 1]. Similarly, the odds range from 0 to +∞ and can be extremely large or small, making them difficult to model and interpret. The log odds, on the other hand, range from −∞ to +∞, which allows us to model them as a linear function of the features. We can then convert the log odds back to a probability using \(P(Y = 1 \mid X) = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}\). This is different from modelling the probability directly as \(P(Y = 1 \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\), which runs into the range issue above.

Maximum Likelihood Estimation (MLE)

Logistic regression is estimated using Maximum Likelihood Estimation (MLE). The idea is to find the coefficients that make the observed data most likely, or in other words, maximise the likelihood function. Let’s break down the formula.

Likelihood of one observation

Given our m observations (rows), the likelihood of a single observation i is the probability the model assigns to its actual label: \(P(y^{i} \mid X^{i}) = \left( \hat{p}^{i} \right)^{y^{i}} \cdot (1 - \hat{p}^{i})^{1 - y^{i}}\), where \(\hat{p}^{i}\) is the predicted probability of “1”. This equals \(\hat{p}^{i}\) when \(y^{i} = 1\), and \(1 - \hat{p}^{i}\) when \(y^{i} = 0\).

Likelihood of all observations

The likelihood of the entire data set (\(L(\theta)\)) is the joint probability of all the observations. Given we assume each row is independent, this is just the product of the probabilities for each observation (i.e. we take the likelihoods from each row and multiply them together):

\[L(\theta) = \prod_{i=1}^{m} P(y^{i} \mid X^{i}) = \prod_{i=1}^{m} \left[ \left( \hat{p}^{i} \right)^{y^{i}} \cdot (1 - \hat{p}^{i})^{1 - y^{i}} \right]\]

Log likelihood

To make the calculations easier and avoid working with very small numbers (products of small probabilities), we take the natural logarithm of the likelihood function to get the log-likelihood function. The logarithm of a product becomes a sum, so the product of individual likelihoods becomes a sum of log-likelihoods (i.e. it sums the log of the individual likelihoods for all observations in the dataset). This transformation makes it easier to optimise: it’s differentiable, concave, and more stable.

\[\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{i} \log(\hat{p}^{i}) + (1 - y^{i}) \log(1 - \hat{p}^{i}) \right]\]

When the predicted probability is close to the actual outcome label, the log likelihood will be close to 0, and if it’s far off it will be a much larger negative value (contributing more significantly to the penalty). This is why maximising the log likelihood minimises the gap between the actual and predicted values. Let’s demonstrate that:
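A quick sketch of that, using a few made-up predicted probabilities:

```python
import numpy as np

def log_likelihood_contribution(y, p_hat):
    """Contribution of one observation to the log likelihood."""
    return y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat)

# Actual outcome is 1 (converted)
print(log_likelihood_contribution(1, 0.95))  # ~ -0.05 (good prediction, close to 0)
print(log_likelihood_contribution(1, 0.10))  # ~ -2.30 (poor prediction, large penalty)

# Actual outcome is 0 (didn't convert)
print(log_likelihood_contribution(0, 0.05))  # ~ -0.05
print(log_likelihood_contribution(0, 0.90))  # ~ -2.30
```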

Note, due to the non-linearity of the sigmoid function that maps the log odds to the probability, there’s no closed-form solution for the coefficients like there is in linear regression. Instead, we use iterative methods to optimise the model parameters.

Negative log likelihood

Given many of the common optimisation algorithms are designed to minimise the cost function not maximise it, we multiply the log likelihood function by -1 to get the negative log likelihood:

\[J(\theta) = - \sum_{i=1}^{m} \left[ y^{i} \log(\hat{p}^{i}) + (1 - y^{i}) \log(1 - \hat{p}^{i}) \right]\]

Optimisation

Now we have our cost function (negative log likelihood), we can use an optimisation algorithm to iteratively adjust the coefficients to find the values that minimise it (giving us the best fitting model). The negative log likelihood is convex (single minimum), smooth and differentiable, meaning we will converge to the minimum as long as the algorithm is set up correctly. However, some methods are more efficient than others. For logistic regression, the most common optimisation algorithms use either just the first derivative (first order methods, e.g. Gradient Descent), or both the first and the second derivative (second order methods, e.g. Newton’s method). The second order methods are typically faster to converge, but are more computationally expensive. The default in statsmodels for logistic regression is a version of Newton, and scikit-learn uses an approximation of Newton (L-BFGS).

General framework

The basic process of these iterative optimisation algorithms is the same:

  1. Start with some values for the unknown model parameters (\(\beta\) coefficients), usually a vector of 0’s or small random values (we’ll call the vector \(\theta\)).
  2. Predict the probability of the outcome being “1” for each observation \(\hat{p}^{i} = \sigma(X^i \cdot \theta) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}\)
  3. Compute the cost function (negative log likelihood): \(J(\theta) = - \sum_{i=1}^{m} \left[ y^{i} \log(\hat{p}^{i}) + (1 - y^{i}) \log(1 - \hat{p}^{i}) \right]\)
  4. Compute the partial derivative(s) for each \(\beta\) (i.e. \(\theta\) in the above).
  5. Apply the update rule to update the \(\beta\) estimates in \(\theta\), increasing or decreasing each one depending on whether the outcome was over- or under-predicted.
  6. Repeat for a fixed number of steps (iterations), or until the convergence criterion is met (e.g. the change in the cost function is smaller than a cutoff).

First Derivative

The first derivative tells us the gradient of the cost function, giving us the direction in which the function is increasing or decreasing. The first derivative of the negative log likelihood with respect to \(\theta_{j}\) is:

\[\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( \sigma(X_i \cdot \theta) - y_i \right) X_{ij}\]

Where: \(\sigma(X_i \cdot \theta)\) is the predicted probability for observation i; \(y_i\) is its actual label (0 or 1); \(X_{ij}\) is the value of feature j for observation i; and m is the number of observations.

If you look at the formula, it’s taking the error (the difference between the predicted probability and the actual y label), weighted by the feature value (meaning features with higher values will have a larger effect). It points in the direction of steepest ascent, so to minimise the cost function we move in the opposite direction (the negative gradient).

We can re-write the equation to cover all the model parameters (θ) at once, producing an n×1 vector of partial derivatives (one for each parameter \(\theta_j\)).

\[\nabla_\theta \ell(\theta) = \sum_{i=1}^{m} \left( \sigma(x_i \cdot \theta) - y_i \right) x_i\]

First order methods

Gradient descent, and versions of it (Stochastic Gradient Descent, Stochastic Average Gradient), use only the first derivative (gradient) to optimise the coefficients.

Gradient Descent

For gradient descent, the parameters θ are updated iteratively in the direction of the negative gradient of the cost function. The update rule for each iteration k is:

\[\theta_{k+1} = \theta_k - \alpha \nabla_\theta J(\theta_k)\]

Where: \(\theta_k\) are the current parameter estimates at iteration k; \(\alpha\) is the learning rate (step size); and \(\nabla_\theta J(\theta_k)\) is the gradient of the cost function evaluated at \(\theta_k\).
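To make the general framework above concrete, here is a minimal NumPy sketch of gradient descent for logistic regression. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration, and the gradient is averaged over observations (divided by m) rather than summed so the step size doesn’t depend on the sample size:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 1000, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])   # intercept column + features
true_theta = np.array([-1.0, 2.0, -0.5, 1.0])
y = rng.binomial(1, sigmoid(X @ true_theta))                 # simulated binary outcome

theta = np.zeros(X.shape[1])   # step 1: initialise coefficients at 0
alpha = 0.1                    # learning rate

for k in range(5000):
    p_hat = sigmoid(X @ theta)            # step 2: predicted probabilities
    grad = X.T @ (p_hat - y) / m          # step 4: gradient of the (mean) negative log likelihood
    theta -= alpha * grad                 # step 5: update rule
    # step 6 (simplified): stop after a fixed number of iterations

print(theta)  # should end up roughly close to true_theta
```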

Stochastic Gradient Descent (SGD)

Instead of using the entire data set to compute the gradient, this uses one randomly chosen data point at a time to speed it up, applying the update rule with the single-observation gradient: \(\theta_{k+1} = \theta_k - \alpha \left( \sigma(x_i \cdot \theta_k) - y_i \right) x_i\).

This is faster per iteration and less memory intensive, so it’s good for large data sets. However, the updates can be noisy, causing it to take longer to converge (oscillating around the minimum).

Stochastic Average Gradient (SAGA)

Stochastic Average Gradient (SAG), and its variant SAGA, are designed to make Stochastic Gradient Descent (SGD) converge faster. Like SGD, they still pick one random sample per update to calculate a gradient. However, they keep a record of the most recent gradient computed for each sample and use the average of these stored gradients to update the parameters. This reduces the variance, making the updates less noisy and the gradient estimates more accurate, leading to faster convergence. SAGA in particular also supports non-smooth penalties, making it a common solver choice for models with L1 regularisation and sparse data sets.

Coordinate Descent (liblinear)

Unlike gradient descent which updates all the parameters at the same time, coordinate descent updates one parameter at a time while keeping the others fixed (i.e. it works through k iterations, applying the update rule to one \(\theta_j\), then moving on to the next one):

\[\theta_j^{k+1} = \theta_j^{k} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}\]

Where: \(\theta_j^{k}\) is the estimate of parameter j at iteration k; \(\alpha\) is the step size (learning rate); and \(\frac{\partial J(\theta)}{\partial \theta_j}\) is the partial derivative of the cost function with respect to \(\theta_j\), holding the other parameters fixed.

It’s efficient for small data sets and supports regularisation, but other methods work better for large data.

Second Derivative

The second derivative captures the curvature of the cost function (how quickly the gradient is changing). This tells us how the surface is curving at the current point, so we can scale the step size for faster convergence. We use the Hessian matrix to represent the second-order partial derivatives of the negative log-likelihood function with respect to each pair of parameters (i.e. it’s n by n):

\[H(\theta) = \sum_{i=1}^{m} \sigma(x_i \cdot \theta) \left( 1 - \sigma(x_i \cdot \theta) \right) x_i x_i^T\]

Where: \(\sigma(x_i \cdot \theta)\) is the predicted probability for observation i; \(x_i\) is its feature vector (including the intercept term); and \(x_i x_i^T\) is the outer product, an n by n matrix.

The diagonal elements of the Hessian matrix reflect the curvature with respect to a single feature (\(\theta_j\), including the intercept). High values indicate large curvature (gradient is changing rapidly), and low values indicate flatter curvature (gradient is changing slowly). For example, a large positive value indicates it’s curving up rapidly (e.g. likely approaching the minimum), and we should take smaller steps to avoid overshooting. The off-diagonal captures the interactions (how the change in one feature affects the gradient of another), with a larger value indicating a higher correlation. This results in coordinated updates for correlated features. We take the inverse of the matrix to scale the parameter update (i.e. take a larger update when curvature is small to speed it up, and small when it’s large to avoid overshooting). While using the second derivative speeds up convergence, computing the inverse of the Hessian matrix makes it more computationally expensive.

Second order methods

Newton’s method uses both the gradient (first derivative) and the Hessian matrix (second derivative) of the cost function to update the model parameters more efficiently. Working with the Hessian is computationally expensive, so there are also methods that reformulate or approximate it, such as IRLS and BFGS.

Newton’s method

Using Newton’s method, the parameters θ are updated iteratively by moving in the direction of the negative gradient, scaled by the inverse of the Hessian matrix. The update rule for each iteration k is:

\[\theta_{k+1} = \theta_k - H^{-1}(\theta_k) \nabla_\theta J(\theta_k)\]

Where: \(H^{-1}(\theta_k)\) is the inverse of the Hessian matrix evaluated at \(\theta_k\), and \(\nabla_\theta J(\theta_k)\) is the gradient of the cost function at \(\theta_k\).

The product of the inverse Hessian matrix and the gradient is subtracted from the current \(\theta\) to move in the direction of the negative gradient, but adjust the size of the step based on the curvature. This is what allows it to converge faster than the first order methods.
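As a rough illustration, here is one way the Newton update could be implemented in NumPy. This is a sketch, not statsmodels’ actual implementation; it solves the linear system rather than explicitly inverting the Hessian, and assumes X already contains an intercept column:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    """Fit logistic regression coefficients with Newton's method."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y)                  # gradient of the negative log likelihood
        W = p * (1 - p)                       # diagonal of the weight matrix
        H = (X * W[:, None]).T @ X            # Hessian: X^T W X
        theta -= np.linalg.solve(H, grad)     # theta_{k+1} = theta_k - H^{-1} * gradient
    return theta
```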

Iteratively Reweighted Least Squares (IRLS)

For logistic regression, the IRLS algorithm is equivalent to Newton’s method: the Hessian is \(X^T W X\), where \(W\) is a diagonal weight matrix with entries \(\hat{p}_i(1 - \hat{p}_i)\), so each Newton step can be re-expressed as a weighted least squares problem. The update rule is:

\[\theta_{k+1} = \theta_k + (X^T W_k X)^{-1} X^T (y - \hat{p}_k)\]

Where: \(W_k\) is the diagonal weight matrix at iteration k, with entries \(\hat{p}_i(1 - \hat{p}_i)\); \(X\) is the design matrix; \(y\) is the vector of observed outcomes; and \(\hat{p}_k\) is the vector of predicted probabilities at iteration k.

This generally converges faster than the first order methods, and because each Newton step is expressed as a weighted least squares problem, it can be solved with standard least squares routines rather than by explicitly inverting the Hessian.

Broyden–Fletcher–Goldfarb–Shanno (BFGS)

BFGS is a quasi-Newton method that approximates the Hessian matrix (and its inverse) rather than computing it exactly, making it computationally cheaper than standard Newton’s method. It starts with an initial guess (e.g. the identity matrix), and then updates the approximation at each step based on the observed changes in the gradients and parameters during the optimisation process.

\[H_{k+1} = H_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{H_k s_k s_k^T H_k}{s_k^T H_k s_k}\]

Where: \(H_k\) is the current Hessian approximation; \(s_k = \theta_{k+1} - \theta_k\) is the change in the parameters; and \(y_k = \nabla J(\theta_{k+1}) - \nabla J(\theta_k)\) is the change in the gradient (not to be confused with the outcome labels).

For high dimensional problems, there’s L-BFGS (Limited-memory BFGS), which avoids storing the full Hessian approximation.


Assumptions

The key assumptions that must be met for logistic regression include: independence of errors, linearity in the log odds, no multicollinearity, no measurement error in X, no endogeneity (for causal inference), and a “large enough” sample to avoid overfitting (~20 observations per feature). While not an explicit assumption, logistic regression is sensitive to outliers and they can have a significant impact on the model’s performance and results. There are a number of parallels with the assumptions of linear regression, but also some key differences. Most notably, we don’t make the same assumptions about the error due to the binomial nature of the outcome variable (i.e. we don’t assume homoscedasticity, and normally distributed errors centered on 0). Rather than modeling the residuals as errors in the traditional sense, the logistic model focuses on estimating the log-odds of the outcome given the predictors, which can handle the non-normality and heteroscedasticity inherent in binary outcomes.

assumptions table

Exploratory Data Analysis

Some of the assumptions, like linearity and multicollinearity, can be checked by doing some pre-modelling analysis of the data. Typically this would include: 1) reviewing the features that should be included (giving thought to potential confounders); 2) visualising the relationship between the log odds and each feature; and, 3) checking whether features are highly correlated.

1. Causal map

This can be as simple as writing down the outcome, then adding the common factors (features) that might influence it, with lines and arrows to show the expected relationships. Next you can add unmeasured (or difficult to measure) factors that might influence either the outcome or the features. This process helps you understand how much of the outcome you are likely to be able to explain, allows you to be explicit in your assumptions, and can help you with interpreting the results. Using a very simplified example, if you predicted rates of violence based on the size of the police force, you might conclude that more police leads to more violence (they’re positively correlated) because you’re missing the confounder of crime rate (higher crime leads to both more police and more violence).

causal map

2. X vs Y plots

This is not as straightforward as linear regression where we can use simple scatterplots, as we are interested in the log odds of Y (which ranges from −∞ to +∞). One option is to break continuous features into bins, within each bin compute the average X and average Y, and then transform the average Y to log odds (\(\text{logit}(p) = \ln \left( \frac{p}{1 - p} \right)\), where p is the proportion of 1’s). This puts Y on a reasonable scale, but also reduces the influence of noise to focus on the broader trend. Different binning methods can also be tested depending on the distribution of X (e.g. quantile-based bins with an equal number of observations per bin for a skewed distribution).

empirical plot
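A sketch of how this binning could be done with pandas, assuming a hypothetical data frame `sessions` with a continuous feature column and a 0/1 `converted` column (the 0.5 adjustment is added so bins where the proportion is exactly 0 or 1 don’t produce infinite log odds):

```python
import numpy as np
import pandas as pd

def empirical_logit_data(df, feature, outcome, bins=10):
    """Bin a continuous feature and compute the empirical log odds of the outcome per bin."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")   # quantile bins (equal counts)
    grouped = df.groupby(binned, observed=True).agg(
        mean_x=(feature, "mean"),
        p=(outcome, "mean"),
        n=(outcome, "size"),
    )
    # Adjusted proportion: (count of 1s + 0.5) / (n + 1) avoids log(0) and division by 0
    adj_p = (grouped["p"] * grouped["n"] + 0.5) / (grouped["n"] + 1)
    grouped["log_odds"] = np.log(adj_p / (1 - adj_p))
    return grouped[["mean_x", "log_odds"]]

# plot_data = empirical_logit_data(sessions, "pages_viewed", "converted")
# plot_data.plot.scatter(x="mean_x", y="log_odds")
```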

3. Correlation

The measure of correlation will depend on the feature types being compared. For example, you could use Pearson or Spearman correlation for pairs of numeric features, Cramér’s V for pairs of categorical features, and point-biserial correlation (or a similar association measure) for numeric-categorical pairs.

If you have a lot of features it can be useful to compute all these measures and then display them in a heatmap to quickly spot the highly correlated pairs, though you may need to scale some of the scores which are on different scales.

corrplot


Evaluation

When evaluating the model, we want to assess whether we have a “good” model, meaning it fits the training data well, has good predictive accuracy (on unseen test data, i.e. doesn’t overfit), and doesn’t violate the assumptions. This section will cover the most common metrics and plots you would use to do this for logistic regression.

Splitting the data

Something to keep in mind is the bias variance tradeoff. Typically, the more you increase model complexity, the more you reduce the bias (error), but increase variance (the model doesn’t generalise to new data). This is the same thing as overfitting the training data. To avoid this, we split the data so that we can check the performance on both the training data and unseen test data. This can be a fixed split (e.g. 80% used for training, 20% held out for testing), or you can use k-fold cross validation (train on k-1 folds and test on the remaining fold, repeating k times). The benefit of the cross-validation approach is that we can use all of the data for training (more efficient), and we’re less likely to overfit because we split multiple times and average over the splits to understand the performance. However, it’s more computationally expensive, so it’s better suited to smaller data sets (k is typically 5-10, but stability and cost increase with k).
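For example, with scikit-learn (a sketch assuming a feature matrix `X` and binary outcome `y` already exist; the split ratio and scoring metric are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Fixed split: hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_train, y_train), model.score(X_test, y_test))  # accuracy on train vs test

# Or 5-fold cross validation: average performance over 5 train/test splits
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print(cv_scores.mean())
```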

Overall fit

Below are the most common metrics for logistic regression. These can all be computed for the training and test data, but those focused on the model fit and comparison of models are typically only used on the training (e.g. AIC, BIC).

Deviance: \(\text{D} = -2\log(\mathcal{L}(\hat{\theta}))\). Measures the goodness of fit of the model via its log likelihood (for binary data the saturated model has a log likelihood of 0, so the deviance is simply minus twice the model’s log likelihood). Lower values indicate a better-fitting model. Comparing it to the null (intercept-only) deviance \(-2\log(\mathcal{L}(\theta_0))\) shows how much variation the model explains, and the difference between the two is used in likelihood ratio tests for model comparison.

Pearson chi-square: \(\chi^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{y}_i)^2}{\hat{y}_i(1-\hat{y}_i)}\). Measures the discrepancy between the observed outcomes (\(y_i\)) and the predicted probabilities (\(\hat{y}_i\)) for each data point. It is used to assess how well the model fits the data. A large value indicates a poor fit, and a value much greater than the degrees of freedom suggests overdispersion.

Hosmer-Lemeshow goodness-of-fit test: \(\chi^2 = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}\). Groups the data into G bins based on the predicted probabilities, then compares the observed number of events (\(O_g\)) to the expected number (\(E_g = n_g \bar{p}_g\)) in each group, where \(n_g\) is the group size and \(\bar{p}_g\) the average predicted probability. A significant p-value (typically < 0.05) suggests poor model fit.

AIC (Akaike Information Criterion): \(\text{AIC} = 2k - 2\log(\mathcal{L}(\hat{\theta}))\). A measure of model fit that also penalises the number of parameters (k) in the model. A model with a smaller AIC is considered better in terms of the balance between fit and complexity. However, AIC should only be used for model comparison, not to assess a single model’s quality.

BIC (Bayesian Information Criterion): \(\text{BIC} = \log(n)k - 2\log(\mathcal{L}(\hat{\theta}))\). Similar to AIC but with a stronger penalty for more parameters, where n is the sample size and k is the number of parameters. A lower BIC value indicates a better model. BIC tends to favour simpler models compared to AIC, especially as the sample size increases.

Pseudo-R-squared (McFadden’s): \(R^2_{\text{McFadden}} = 1 - \frac{\log(\mathcal{L}(\hat{\theta}))}{\log(\mathcal{L}(\theta_0))}\). Compares the log-likelihood of the fitted model with that of the null model (intercept-only). Values closer to 1 indicate a better-fitting model, while values close to 0 indicate a poor fit. A common rule of thumb is that values between 0.2 and 0.4 indicate a good fit, though lower values may still be acceptable depending on the application.

Pseudo-R-squared (Adjusted McFadden’s): \(R^2_{\text{adj}} = 1 - \frac{\log(\mathcal{L}(\hat{\theta})) - k}{\log(\mathcal{L}(\theta_0))}\). Adjusts McFadden’s pseudo-\(R^2\) by penalising the number of parameters k, to account for overfitting. This version is useful when comparing models with different numbers of predictors; the penalty makes it more conservative than the regular McFadden’s \(R^2\).

Classification performance

In addition to the overall model fit, it’s common to check the model’s ability to discriminate between the two classes (0 and 1). This is typically done for both the training and test data. To do this we set a decision threshold (e.g. 1 if p ≥ 0.5, else 0), and then we can use standard classification metrics such as accuracy, precision, recall, and the F1 score.

These can be conveniently calculated and displayed using just a cross tabulation of the predicted and actual labels (i.e. a confusion matrix):

confusion matrix

Precision and Recall are often inversely related: as you increase recall (by lowering the threshold to classify positives), precision might decrease, and vice versa. This happens because decreasing the threshold increases the number of true positives, but also increases the number of false positives. Depending on your use case you can adjust the probability threshold to select a trade-off that’s suitable (e.g. if the cost of a False Negative is high and the intervention cost is low, you might be ok with a lower precision, i.e. more False Positives).
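A sketch of how you might inspect this trade-off with scikit-learn, assuming the fitted `model` and held-out `X_test`, `y_test` from the earlier splitting example:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]    # predicted P(Y=1) for the test set

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (proba >= threshold).astype(int)
    print(
        threshold,
        confusion_matrix(y_test, y_pred).ravel(),            # tn, fp, fn, tp
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
```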

ROC

One way of visualising the trade-off is to plot the Receiver Operating Characteristic (ROC) curve, which shows the False Positive Rate vs the True Positive Rate (Recall) at different probability thresholds (each point on the curve represents a different threshold):

ROC AUC

We can quantify the curve using the Area Under the Curve (ROC-AUC): 0.5 corresponds to random guessing (the diagonal line) and 1.0 to perfect separation of the classes.

Precision-Recall Curve

Precision and recall are more sensitive to imbalanced datasets than the ROC-AUC. For imbalanced datasets, precision-recall curves and metrics like the F1 score may be more informative, whereas the ROC curve might appear overly optimistic.

PR curve

Lift/Gain

Another way to visualise how good the model is at discriminating the positive class is to plot the lift and cumulative gain. Lift measures the improvement a model has over random guessing. It shows how much better a model is at predicting positive cases in the top n% of the data compared to randomly selecting n% of the data. Meanwhile, gain measures the cumulative number of positive cases that the model is able to correctly identify when we rank all instances in the dataset by predicted probability. It provides a way to see how much better the model is at identifying positive cases compared to random selection.

Lift and gain

These are useful when the outcome is imbalanced and you’re primarily interested in the minority class (e.g. fraud), as you can get a clear picture of the effectiveness.

Residuals

Looking at the performance at a more granular level, we want to understand which observations the model is doing well with and which it is not. To do this we need to define a residual. Residuals represent the difference between the observed values and the values predicted by the model. In a linear regression model, this is just the difference between Y actual and Y predicted (the error in predicting the continuous outcome). However, in logistic regression we model the probability of Y equalling 1 (\(P(Y=1 \mid X)\)), rather than the outcome directly (which is 0 or 1). Therefore, the residuals behave a bit differently, and we have two common ways of defining them: Pearson and Deviance. Pearson residuals give a measure of the deviation from the expected outcome in terms of the predicted standard deviation, whereas deviance residuals give a measure of the log-likelihood discrepancy between the observed and expected outcomes. They can provide complementary insights, so both are often computed and plotted to check for outliers and potentially influential observations.

Deviance Residuals

Deviance residuals measure the contribution of each observation to the overall deviance (a goodness-of-fit statistic), and are closely related to the likelihood function. They are useful for detecting model misfit, influential observations, and outliers.

\[d_i = \text{sign}(y_i - \hat{p}_i) \sqrt{-2 \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]}\]

Where: \(y_i\) is the observed outcome (0 or 1); \(\hat{p}_i\) is the predicted probability for observation i; the term under the square root is minus twice that observation’s log-likelihood contribution; and the sign function is:

\[\text{sign}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0 \end{cases}\]

Compared to Pearson residuals they’re more sensitive to extreme cases, as the logarithmic function in the formula exaggerates errors when the predicted probabilities are far from the observed outcomes. The rule of thumb for an outlier is an absolute value greater than 2, which comes from the idea that this corresponds to roughly 2 standard deviations from the mean. Deviance residuals reflect how well the model fits the data in terms of the log-likelihood, making them useful for comparing models. The overall goodness of fit metric Deviance is just the sum of the squared deviance residuals.

Pearson Residuals

Pearson residuals offer a different perspective. They’re based on the chi-square statistic (the sum of squared residuals divided by the predicted variance), and are useful for identifying observations that are poorly fit by the model.

\[r_i = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}\]

Where: \(y_i\) is the observed outcome (0 or 1); \(\hat{p}_i\) is the predicted probability; and the denominator \(\sqrt{\hat{p}_i(1 - \hat{p}_i)}\) is the estimated standard deviation of the outcome under the model.

If the logistic regression model is well-calibrated, the Pearson residuals should follow a normal distribution with mean 0 and standard deviation 1. This gives the rule of thumb that an absolute value of \(r_i\) greater than 2 could indicate an outlier, or an observation the model is not good at predicting. To limit it to extreme cases, a threshold of 3 is also common. Pearson residuals provide a standardised and more stable measure of residuals compared to Deviance, and are useful for identifying observations where the model performs poorly. However, this also means they tend to reduce the impact of extreme deviations compared to Deviance, as they’re scaled down by the denominator.
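Both types of residuals are easy to compute from the predicted probabilities. A minimal sketch (the clipping is a small numerical safeguard I’ve added to avoid taking the log of 0):

```python
import numpy as np

def logistic_residuals(y, p_hat, eps=1e-12):
    """Pearson and deviance residuals for a fitted logistic regression."""
    p = np.clip(p_hat, eps, 1 - eps)
    pearson = (y - p) / np.sqrt(p * (1 - p))
    deviance = np.sign(y - p) * np.sqrt(-2 * (y * np.log(p) + (1 - y) * np.log(1 - p)))
    return pearson, deviance

# pearson, deviance = logistic_residuals(y_train, model.predict_proba(X_train)[:, 1])
# Flag potential outliers, e.g. observations with an absolute residual > 2
```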

Assessing residuals

Standard residual plots (e.g. residuals vs fitted values, and residuals vs Xs) are hard to judge. If the actual outcome is 0, the model can only overestimate it, so the residuals must be negative (blue); if the actual outcome is 1, the model can only underestimate it, so the residuals must be positive (red). Using Pearson/Deviance residuals doesn’t change this much. We can add a LOWESS (Locally Weighted Scatterplot Smoothing) curve to try to detect a pattern (the closer it is to a flat line at 0, the better):

resid vs fitted plot

Further, some heteroscedasticity (non constant variance) is expected in the residual plots as the variance of the Pearson residuals is a function of the predicted probability: \(\text{Var}(Y_i) = \hat{p}_i (1 - \hat{p}_i)\), meaning it will be largest near \(\hat{p}_i=0.5\) and smallest close to 0 and 1. Deviance residuals have a similar pattern due to the likelihood of observing the data being larger (and the residual smaller) when the predicted probability is closer to 0 or 1. However, non-random patterns (e.g. a U shape or clusters, outliers) would still indicate a potential issue.

Marginal Model Plot

This visualises the relationship between a single feature (x) and the predicted probabilities (y) in a logistic regression model. By varying the feature from its minimum to maximum value while holding other features constant (typically at their mean or another chosen value), you calculate the predicted probabilities and plot them against the feature values. This shows how the predicted probability changes with the feature, allowing you to assess the marginal effect of that feature on the outcome. The plot helps evaluate whether the relationship is linear or not.

marginal plot

Partial Residual Plot

This is used to evaluate whether the relationship between each feature and the outcome is linear in the log-odds. Partial residuals are trying to isolate the effect of X1 on the outcome, while holding the influence of all other features constant. This is done by taking the residual (unexplained part) and adding back the effect of \(\beta_1 X_1\):

\[\text{Partial Residuals} = Y - \hat{Y} + \beta_1 X_1\]

Where: Y is the observed outcome; \(\hat{Y}\) is the model’s prediction; \(Y - \hat{Y}\) is the residual (the unexplained part); and \(\beta_1 X_1\) is the estimated contribution of feature X1.

Adding \(\beta_1 X_1\) back to the residuals allows us to examine how X1 is associated with the outcome in the presence of other features, and to check if its relationship with the outcome is linear or not. In the example below, X1 looks linear but X4 does not:

PR plot

Outliers / influential observations

Outliers can be identified during the residual diagnostics, but can be more formally assessed using leverage and Cook’s distance. Leverage measures how unusual an observation’s feature values are (how much pull it has on the fitted values), while Cook’s distance combines the leverage and the residual to measure how much the fitted coefficients would change if the observation were removed. Given these are about the model fit, we calculate them only for the training data.

It’s important to understand the cause of outliers before choosing a method to handle them. If they’re due to data entry errors, they can be removed or corrected. For random outliers, consider capping or winsorizing values at specific percentiles (e.g., 5th and 95th), but this can introduce bias or data loss if done incorrectly. Replacing outliers with the mean or median is another option, though it can also skew results if not representative of the data. If outliers reflect distinct groups (e.g., travel agents behaving differently from regular customers), consider modeling them separately or using interaction terms. For genuinely random outliers, robust regression methods can reduce their impact without altering the data.

Multicollinearity

Multicollinearity occurs when one feature is highly correlated with another (or with a combination of others), which can lead to unreliable estimates of the coefficients and inflated standard errors. Two commonly used diagnostics for identifying it are the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient is inflated by its correlation with the other features (values above roughly 5-10 are a common warning sign), and the condition number of the design matrix, where large values indicate near-linear dependence between features.

If multicollinearity is an issue, a common solution is to reduce the feature set by: manually removing one from each correlated pair; combining features (e.g., with PCA); or using regularisation.


Iterating on the model

Based on the evaluation, you can test a number of adjustments to the model to see if it improves the performance. For example, considering terms for interactions or non-linearities, and using regularisation or other methods to correct for multicollinearity.

Adding terms

Adding new features that are predictive of the outcome can improve the model’s accuracy and the precision of the coefficient estimates (i.e. smaller confidence intervals). However, adding too many features that have a weak or no effect on the outcome leads to overfitting. If they’re highly correlated with existing features (multicollinearity), it can also destabilise the coefficient estimates and increase their variance. A classic example is adding a feature that is predictive of another feature but not of the outcome: it contributes little to the prediction while soaking up variation in the other feature, which inflates the variance of its coefficient. The other type of features you might add are variations of the existing ones, such as polynomials (i.e. \(X^n\), to capture non-linearity) and interactions (when the effect of one feature depends on the value of a second feature). As with the above, adding too many can lead to overfitting. This is where the next section on feature selection is useful.

Feature selection

You will typically start by selecting all the features you believe are important based on theory, prior knowledge, or existing literature. However, if you suspect multicollinearity or redundancy, you can also use iterative feature selection methods such as forward selection (start with no features and add them one at a time), backward elimination (start with all features and remove them one at a time), or stepwise selection (a combination of both).

Each model iteration is evaluated based on a chosen statistic (e.g., AIC, BIC, likelihood ratio test) to decide which variables remain in the model. The main downside of these is that there may not be good reasons for the choice of one feature over another, with implications on the interpretation of the model.

Regularisation

An alternative to the iterative feature selection methods is regularisation, which adds a penalty term to the loss function based on the coefficients to minimise the magnitude of the coefficients at the same time as maximising the log likelihood. This prevents overfitting by discouraging the model from assigning too much importance to any particular feature, but at the cost of higher bias (i.e. it reduces variance to improve test performance, but worsens training performance). Feature scaling (typically standardisation) is essential to avoid features with a larger range dominating the penalty term during regularisation. There are 3 main methods of regularisation:

1) Lasso (L1) adds a penalty proportional to the absolute value of the coefficients (some go to exactly 0). This is performing feature selection, but does bias the remaining coefficients (they’re shrunk). For interpretation purposes you might use it to select the features, but then re-fit a non-regularised model.

\[J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] + \lambda \sum_{j=1}^{n} |\theta_j|\]

2) Ridge (L2) adds a penalty proportional to the square of the coefficients (shrinks all, but not to 0):

\[J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] + \frac{\lambda}{2} \sum_{j=1}^{n} \theta_j^2\]

3) Elastic Net combines L1 and L2, with an additional parameter (alpha) added to control the mix. This method can be useful for correlated features. For example, if two correlated features are both irrelevant, the L1 term will push their coefficients towards zero, while the L2 term ensures they are shrunk together; if both are important, the model will keep them but shrink their coefficients. In this way it handles groups of features better, as correlated features are more likely to be selected or discarded together, rather than arbitrarily:

\[J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] + \lambda \left( \alpha \sum_{j=1}^{n} |\theta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^{n} \theta_j^2 \right)\]

Where: \(h_{\theta}(x^{(i)})\) is the predicted probability for observation i; m is the number of observations; n is the number of features; \(\lambda\) controls the overall strength of the regularisation; and \(\alpha\) (Elastic Net only) controls the mix between the L1 and L2 penalties.

The strength of the regularisation in all three methods is controlled by lambda (\(\lambda\)). Close to 0 (e.g. 0.01) there will be minimal regularisation (coefficients similar to without), around 1-5 there will be some shrinkage of the coefficients, and at large values (e.g. 10-100) there will be a lot of shrinkage and possibly all 0’s in the case of Lasso. Usually the value would be selected based on a grid search (test a set of values e.g. 0.001, 0.01, 0.1, 1, 10), potentially with cross-validation (to ensure it generalises well).
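In scikit-learn the regularisation strength is specified as C = 1/λ, so a smaller C means stronger regularisation. A sketch of a grid search over C with standardisation, assuming the `X_train` and `y_train` from earlier (the penalty type, solver, and grid values are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scale", StandardScaler()),   # scaling so no feature dominates the penalty
    ("model", LogisticRegression(penalty="l1", solver="saga", max_iter=5000)),
])

# C = 1/lambda: test a range from strong to weak regularisation
grid = GridSearchCV(pipeline, {"model__C": [0.01, 0.1, 1, 10, 100]}, cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_estimator_["model"].coef_)
```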


Inference

When interpreting the coefficients of logistic regression, we often have causal questions in mind (e.g., how does feature X influence outcome Y?). However, in most cases, logistic regression reveals associations rather than causal relationships. While causal estimates can be obtained (e.g. if the data is from a randomised experiment), using observational data makes it challenging to confirm that all confounders have been accounted for. Despite this, logistic regression remains a valuable tool for generating hypotheses that can be further tested with other methods. If the primary goal is prediction, the coefficients may be less important. Nonetheless, understanding and explaining the predictions is still useful—and in some cases, necessary even in predictive contexts.

Coefficients

In logistic regression the coefficients tell us how much the log odds of the outcome is expected to change for a one-unit change in the feature, holding all other features constant. Exponentiating the coefficients transforms them into odds ratios, which are easier to interpret as the multiplicative change in the odds of the event. When the log odds change linearly with the feature, the odds change multiplicatively, which we can interpret as a percentage change in the odds. You could convert the effects to probabilities, but the change in probability for a one-unit change in a feature is non-linear (it depends on the values of all the features) due to the sigmoid function. Let’s walk through some examples.

We fit the logistic model:

\[\log\left( \frac{P(Y = 1)}{1 - P(Y = 1)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n\]

The sign and magnitude of the coefficients tell us the direction of the effect and the size of the impact (+ means log odds increases with the feature; - means it decreases; and, the higher the magnitude the larger the effect). Then we can convert each coefficient from the log odds to odds (the multiplicative change in the odds of the outcome for a one-unit increase in \(X_i\)):

\[\text{Odds Ratio} = e^{\beta_j}\]

Example:

Let’s say we want to predict whether a customer will convert based on their age, income, and loyalty status:

\[\log\left( \frac{P(\text{buy})}{1 - P(\text{buy})} \right) = -2 - 0.1 \cdot \text{age} + 0.03 \cdot \text{income} + 0.1 \cdot \text{loyalty}\]
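Reading these made-up coefficients (holding the other features constant, and assuming loyalty is a 0/1 indicator, which the example does not state): the age coefficient of -0.1 gives an odds ratio of \(e^{-0.1} \approx 0.90\), so each additional year of age multiplies the odds of converting by about 0.90 (roughly a 10% decrease). The income coefficient of 0.03 gives \(e^{0.03} \approx 1.03\), about a 3% increase in the odds per one-unit increase in income. The loyalty coefficient of 0.1 gives \(e^{0.1} \approx 1.11\), so loyalty members have around 11% higher odds of converting than non-members. The intercept of -2 is the log odds when all features are 0 (\(\sigma(-2) \approx 0.12\)), which is only meaningful if zero values make sense for the features.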

Other feature types follow the same logic: for a binary (0/1) feature the odds ratio directly compares the two groups, and for a categorical feature encoded as dummy variables each coefficient compares that level to the reference (dropped) category.

Feature Importance

When using the magnitude of the coefficients to comment on the importance of a feature, make sure the features have been standardised so they have comparable scales. Also keep in mind they may be biased by multicollinearity and/or missing confounders, so it’s important to consider the results in the context of your model. That said, this is still a useful way for generating hypotheses about factors that are important for the outcome.

Standard errors

The coefficient estimates are based on one sample, so like with all sample statistics we want to know how much these estimates might vary from sample to sample. The standard error is exactly this: it gives us a measure of the uncertainty of the estimate. This is then used to construct confidence intervals and do significance tests. The larger the standard error, the higher the uncertainty, and the wider the confidence intervals. The standard errors of the coefficients in logistic regression are derived from the inverse of the Fisher Information matrix (which is the negative of the Hessian of the log-likelihood).

Hessian Matrix (covered above under optimisation): represents the second-order partial derivatives of the log-likelihood function with respect to each pair of parameters \(\beta_j\) and \(\beta_k\) (i.e. it’s n by n). Note the sign: here we differentiate the log-likelihood, whereas the optimisation section used the negative log-likelihood.

\[H_{jk} = \frac{\partial^2 \ell(\beta)}{\partial \beta_j \partial \beta_k} = -\sum_{i=1}^n p_i (1 - p_i) X_{ij} X_{ik}\]

The Fisher Information matrix is the negative of this Hessian (equivalently, the Hessian of the negative log-likelihood used in the optimisation section).

\[I(\beta) = -H = \sum_{i=1}^{n} p_i (1 - p_i) X_i X_i^T\]

It gives us a measure of how much information we have about each feature, based on the curvature of the log-likelihood function (the greater the information, the more precise our estimates of the coefficients will be). A larger value indicates small changes in the coefficient will have a large effect on the log likelihood and thus we have higher confidence in the estimates (lower variance).

The Standard Errors are then:

\[\text{SE}(\hat{\beta_j}) = \sqrt{\left( I(\beta)^{-1} \right)_{jj}}\]

Where: \(\left( I(\beta)^{-1} \right)_{jj}\) is the j-th diagonal element of the inverse Fisher Information matrix, which is the estimated variance of \(\hat{\beta_j}\).
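A minimal NumPy sketch of this calculation, assuming a design matrix X that includes an intercept column and the fitted coefficients `beta` (the result should match the standard errors statsmodels reports for the same model):

```python
import numpy as np

def coef_standard_errors(X, beta):
    """Standard errors of logistic regression coefficients from the Fisher information."""
    p = 1 / (1 + np.exp(-(X @ beta)))     # fitted probabilities
    W = p * (1 - p)
    fisher_info = (X * W[:, None]).T @ X  # I(beta) = X^T W X
    cov = np.linalg.inv(fisher_info)      # variance-covariance matrix of beta_hat
    return np.sqrt(np.diag(cov))

# se = coef_standard_errors(X_train_with_intercept, fitted_beta)
```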

Robust standard errors

In linear regression, heteroskedasticity doesn’t lead to bias in coefficient estimates, only inefficiency. Robust standard errors adjust the variance-covariance matrix to account for heteroskedasticity, allowing us to still make valid inferences with consistent standard errors. In logistic regression it’s slightly more nuanced as we inherently expect heteroskedasticity due to the binary nature of the outcome variable (i.e. the variance of the errors depends on the predicted probability). This heteroskedasticity is accounted for in the model and not a problem. Other sources of heteroskedasticity are possible, but are typically related to model misspecification (e.g. omitted variables, missing interactions, wrong functional form), which leads to biased estimates (in both linear and logistic regression). Robust standard errors do not correct biased coefficient estimates, only the variance. Therefore, they are not very useful for this issue as it’s a correct standard error for an incorrect estimate. Using them might mislead you into thinking your results are more reliable than they actually are. It is better to address these issues directly (e.g. by adding the missing variables, transforming the predictors, or using more appropriate model forms).

Confidence Intervals

The calculations for the confidence intervals in logistic regression are similar to those in linear regression once we have the standard error. Recall, a confidence interval quantifies the uncertainty of a sample estimate (in this case \(\hat{\beta}_j\)), by giving us a range of values around the estimate that “generally” contains the true population value. Like any sample estimate, it’s unlikely that our coefficient is exactly equal to the true value. If we repeatedly sampled, we would get a distribution of coefficients around the true value (a sampling distribution), which, with a large enough sample size, converges to a normal distribution due to the properties of Maximum Likelihood Estimation (MLE). In logistic regression, we typically use the Wald test for significance, which is based on z-scores. These are standardised (mean = 0 and standard deviation = 1), so when we select the critical z value, we are deciding how many standard errors wide the interval needs to be so that x% of intervals constructed this way would contain the true value.

\[\hat{\beta}_j \pm z_{\frac{\alpha}{2}} \cdot \text{SE}(\hat{\beta}_j)\]

Where: \(\hat{\beta}_j\) is the estimated coefficient; \(z_{\frac{\alpha}{2}}\) is the critical value from the standard normal distribution for the chosen confidence level (e.g. 1.96 for 95% confidence); and \(\text{SE}(\hat{\beta}_j)\) is its standard error.

To determine the critical value, we have to decide on the confidence level (1 - alpha). Typically in a business context the confidence is set at 95% (alpha=0.05) or 90% (alpha=0.1). The higher the confidence required, the wider the confidence interval will be, to increase the chances of capturing the population value. Note that the true value may or may not be in the range for our particular sample. Coming back to the sampling distribution (the distribution of coefficient estimates across multiple samples), if we placed the resulting interval around each estimate we would see that 95% of the time (for alpha=0.05) the true value is within the interval. The other 5% of the time, we have a particularly high or low estimate (by chance) and miss the actual value.

Interpretation

For logistic regression, recall the coefficients represent a change in the log odds of the outcome being 1, but we typically exponentiate them to get the odds ratio, which is more interpretable. If we exponentiate the coefficient then we should do the same to the upper and lower bounds of the confidence interval to get a confidence interval around the odds ratio.

Hypothesis Tests

As we covered above, we’re estimating these coefficients based on just one sample. It’s possible that even if there is no relationship between feature j and the outcome y, we get a non-zero coefficient in our model. Hypothesis testing tells us how often we would expect to see a coefficient as large as that observed (or larger), when there’s actually no effect. If it’s uncommon, then we conclude there is likely a real effect. To determine if the effect of feature j on the outcome is statistically significant we use a Wald test, and to test if any of the features are significant we use the Likelihood Ratio Test (LRT).

1. Wald test

Null hypothesis (\(H_0: \beta_j = 0\)): feature j has no effect on the outcome y. Alternative hypothesis (\(H_1: \beta_j \neq 0\)): feature j has some non-zero effect on the outcome y.

Test statistic:

\[W_j = \frac{\hat{\beta}_j^2}{\text{SE}(\hat{\beta}_j)^2}\]

\(W_j\) is the squared Wald statistic and follows a chi-square distribution with 1 degree of freedom (the unsquared version \(\hat{\beta}_j / \text{SE}(\hat{\beta}_j)\) follows a standard normal distribution, and squaring it gives the chi-square form). A large value indicates the null hypothesis is unlikely (the coefficient is significantly different from 0). To assess statistical significance, we compare this to a critical value from the chi-square distribution, or compute the p-value \(P(\chi^2_1 \geq W_j)\) and check whether it is below alpha.

2. Likelihood ratio test (LRT)

The likelihood ratio test (LRT) can be used to test the significance of one or more coefficients by comparing the log-likelihood of the full model with that of a restricted model (in this case the null model).

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\] \[H_1: \text{At least one} \, \beta_j \neq 0\] \[\text{LRT Statistic} = -2 \left( \log(\text{L}_0) - \log(\text{L}_1) \right)\] \[\text{p-value} = P(\chi^2_k \geq \text{LRT Statistic})\]

Where: \(\text{L}_0\) is the likelihood of the restricted (null) model; \(\text{L}_1\) is the likelihood of the full model; and k is the number of coefficients being tested (the difference in the number of parameters between the two models), which gives the degrees of freedom for the chi-square distribution.

As with the Wald test, we compare the LRT statistic to the critical value, or check whether the p-value is less than alpha, to decide whether to reject the null hypothesis. E.g. if the p-value is less than alpha, we reject the null hypothesis and conclude that at least one of the coefficients is significantly different from zero, suggesting that the predictors collectively contribute to explaining the response variable.
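A sketch of the LRT with statsmodels, assuming the `X_train` and `y_train` from earlier (sm.add_constant adds the intercept, and the intercept-only model serves as the null model):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

X_full = sm.add_constant(X_train)                        # intercept + features
full_model = sm.Logit(y_train, X_full).fit(disp=0)
null_model = sm.Logit(y_train, np.ones((len(y_train), 1))).fit(disp=0)  # intercept only

lrt_stat = -2 * (null_model.llf - full_model.llf)        # -2 * (log L_0 - log L_1)
df = full_model.df_model                                 # number of coefficients tested
p_value = stats.chi2.sf(lrt_stat, df)
print(lrt_stat, p_value)   # full_model.llr and full_model.llr_pvalue report the same test
```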


Extensions

While basic logistic regression is for binary classification, it can be extended to multiclass problems using a technique called multinomial logistic regression or softmax regression. In this case, instead of a single sigmoid function, we use the softmax function to handle multiple classes.


Summary

Logistic regression is a powerful technique for modeling binary outcomes, where the goal is to estimate the probability of a specific event occurring. It uses the logistic (sigmoid) function to map the linear output to a probability between 0 and 1. While it’s widely used in practice due to its simplicity and interpretability, logistic regression relies on certain assumptions, such as linearity in the log-odds and independence of observations. One limitation is that it may not perform well in the presence of high multicollinearity or when the sample size is small. To estimate the coefficients, logistic regression uses Maximum Likelihood Estimation (MLE), which is computationally intensive and requires iterative optimisation. Although more complex models (e.g. Random Forest, XGBoost) might outperform it in terms of predictive accuracy, logistic regression still provides valuable insights into relationships between features and binary outcomes. However, as with any observational model, care must be taken when making causal claims due to the potential for unmeasured confounding. Despite this, logistic regression remains a useful tool for hypothesis generation and for cases where interpretability is essential.
