Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) are a broad class of models that extend linear regression to handle a variety of outcome variables, each potentially following a different distribution and paired with a suitable link function. GLMs consist of three main components: the random component (the probability distribution of the outcome variable), the systematic component (a linear combination of the features), and the link function (which connects the linear predictor to the expected value of the outcome). Linear and logistic regression are two specific examples of GLMs: linear regression uses the normal distribution and the identity link, while logistic regression uses the binomial distribution and the logit link. Other commonly used GLMs include Poisson regression (for count data, using the Poisson distribution and a log link) and Gamma regression (for positive continuous outcomes, typically using the inverse link). GLMs are typically estimated using Maximum Likelihood Estimation (MLE), with iterative methods such as Iteratively Reweighted Least Squares (IRLS) minimising the negative log-likelihood. GLMs thus provide a flexible yet unified framework for modeling many types of data, allowing the underlying distribution and link function to be tailored to the nature of the response variable. However, as with any model, it is crucial to check the underlying assumptions, which vary depending on the specific type of GLM being used.
Components
GLMs consist of three key components:
- Random Component (Outcome Variable): The distribution of the outcome variable (e.g. Normal, Binomial, Poisson).
- Systematic Component (Predictors): This part represents the linear combination of predictors (independent variables).
- Link Function: The link function connects the mean of the outcome variable (its expected value) to the linear predictor, via \(g(\mu) = X\beta\); its inverse maps the linear predictor back to the scale of the mean. The choice of link function depends on the distribution of the outcome variable.
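To make these components concrete, below is a minimal sketch using Python's statsmodels library on made-up data: the design matrix `X` supplies the systematic component, the `family` argument sets the random component, and the family's default log link (canonical for the Poisson) plays the role of the link function.

```python
# A minimal sketch of the three GLM components, using statsmodels and synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))  # systematic component: X @ beta
y = rng.poisson(np.exp(X @ [0.5, 0.3, -0.2]))   # random component: Poisson counts

# The family sets the distribution; Poisson uses the canonical log link by default.
model = sm.GLM(y, X, family=sm.families.Poisson())
result = model.fit()
print(result.summary())
```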
Examples
Below is a summary of some common GLMs. Note that each is fitted by minimising the negative log-likelihood (i.e. via MLE); for linear regression this is equivalent to minimising the sum of squared errors. You can hopefully see from this list how flexible the approach is, while maintaining a common framework. A sketch of fitting one of these cost functions directly follows the table.
Model | Scenario | Components | Key Assumptions | Cost Function |
---|---|---|---|---|
Linear Regression | Housing Price Prediction | Continuous outcome, Normal Distribution, Identity Link | Linearity, Homoscedasticity, No Multicollinearity, Normality of residuals | \(L(\beta) = \sum_{i=1}^{n} (y_i - \hat{y_i})^2\) |
Logistic Regression | Customer Churn Prediction | Binary outcome (0/1), Binomial Distribution, Logit Link | Linearity in log-odds, No Multicollinearity, No Endogeneity | \(L(\beta) = - \sum_{i=1}^{n} \left[ y_i \log(\hat{p_i}) + (1 - y_i) \log(1 - \hat{p_i}) \right]\) |
Poisson Regression | Insurance Claims Frequency | Count outcome (rate of occurrences), Poisson Distribution, Log Link | Mean = Variance (no overdispersion), Independence of observations | \(L(\beta) = -\sum_{i=1}^{n} \left[ y_i \log(\hat{\lambda_i}) - \hat{\lambda_i} \right]\) |
Gamma Regression | Claim Severity in Insurance | Positive continuous outcome, Gamma Distribution, Inverse Link | Positive, right-skewed outcomes, Constant coefficient of variation (variance \(\propto\) mean\(^2\)) | \(L(\beta) = \sum_{i=1}^{n} \left[ \frac{y_i}{\hat{\mu_i}} + \log(\hat{\mu_i}) \right]\) |
Tweedie Regression | Modeling Insurance Claims: predict both the number (frequency) and claim cost (severity) | Mixed outcome (exact zeros plus positive continuous), Tweedie Distribution, Log Link | A mixture of zero and continuous outcomes, Power parameter \(1 < p < 2\) (compound Poisson-Gamma) | \(L(\beta)\): compound Poisson-Gamma negative log-likelihood (no simple closed form) |
Multinomial Logistic Regression | Predict Product Type (A, B, or C) | Categorical outcome (multiple categories), Multinomial Distribution, Logit Link | Independence of irrelevant alternatives (IIA), No Multicollinearity | \(L(\beta) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{p}_{ik})\) |
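To connect the table to the fitting machinery discussed next, here is a hedged sketch that fits the Poisson model by minimising its negative log-likelihood directly with scipy on synthetic data; in practice you would rely on a library implementation.

```python
# Fitting Poisson regression by minimising the cost function from the table.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(X @ [0.2, 0.8]))      # true coefficients: [0.2, 0.8]

def poisson_nll(beta):
    lam = np.exp(X @ beta)                   # inverse log link: lambda_i = exp(x_i' beta)
    return -np.sum(y * np.log(lam) - lam)    # negative log-likelihood, dropping log(y_i!)

fit = minimize(poisson_nll, x0=np.zeros(2), method="BFGS")
print(fit.x)                                 # should land close to [0.2, 0.8]
```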
Fitting GLMs
GLMs are typically estimated using Maximum Likelihood Estimation (MLE), which finds the coefficients under which the observed data are most probable given the assumed distribution. Since GLMs do not generally have closed-form solutions (with the exception of linear regression), iterative optimization algorithms like Iteratively Reweighted Least Squares (IRLS) are used to estimate the model parameters: the cost function (typically the negative log-likelihood) is minimised by updating the coefficients iteratively until they converge on the values that best fit the data.
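To illustrate, below is a compact IRLS sketch for logistic regression: each iteration solves a weighted least-squares problem on a "working" response until the coefficients converge. This is a teaching sketch with minimal safeguards (just a clip on the weights), not a production implementation.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression via Iteratively Reweighted Least Squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                          # linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))          # inverse logit link
        w = np.clip(p * (1 - p), 1e-10, None)   # IRLS weights (Bernoulli variance)
        z = eta + (y - p) / w                   # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ [-0.5, 1.0])))).astype(float)
print(irls_logistic(X, y))                      # roughly recovers [-0.5, 1.0]
```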
Diagnostics
GLMs require careful checking of assumptions, which can vary depending on the specific type of model. Common issues to check for include:
- Outliers: GLMs can be sensitive to outliers, which may disproportionately influence the model’s estimates and lead to biased results.
- Multicollinearity: Highly correlated predictors can lead to unreliable estimates of the coefficients and inflate standard errors. It’s important to check for multicollinearity to ensure stable and interpretable parameter estimates.
- Endogeneity: This occurs when predictors are correlated with the error term, leading to biased coefficient estimates. Addressing endogeneity typically requires methods like instrumental variables or specialized regression techniques.
- Misspecification of the Distribution or Link Function: Choosing the wrong distribution (e.g. using a normal distribution when the data is actually binomial) or the wrong link function can significantly degrade the model’s performance and lead to biased estimates.
To ensure that the GLM provides valid and reliable results, it’s important to conduct thorough diagnostic checks, such as examining residuals, assessing goodness-of-fit (e.g. using deviance or AIC), and performing model validation.
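As a hedged sketch of what these checks can look like with statsmodels (attribute names follow its API; the data is synthetic):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.poisson(np.exp(X @ [0.1, 0.6, -0.3]))
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()

print(result.deviance)          # residual deviance: a goodness-of-fit measure
print(result.aic)               # AIC, for comparing candidate models
resid = result.resid_deviance   # deviance residuals: inspect for outliers/patterns

# Variance inflation factors on the non-constant predictors flag multicollinearity.
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(np.round(vifs, 2))
```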
Interpretation
The coefficients in a GLM generally represent the effect of a one-unit change in the predictor (independent variable) on the expected value of the outcome (dependent variable), transformed by the link function. Therefore, they vary depending on the specific link function and distribution:
- Logistic Regression (Logit Link): The coefficient represents the change in the log-odds of the outcome variable (e.g. event occurring) for a one-unit change in the predictor.
- Poisson Regression (Log Link): The coefficient represents the change in the log of the expected count for a one-unit change in the predictor, so \(e^{\beta}\) is a multiplicative rate ratio.
- Gamma Regression (Inverse Link): The coefficient represents the change in the reciprocal of the expected outcome (\(\frac{1}{\mu}\)) for a one-unit change in the predictor.
As a result, the tests for the statistical significance of the features also vary.
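For example, with a log link the exponentiated coefficients become multiplicative effects, which is often the easiest way to report them. A short sketch on synthetic Poisson data (for logistic regression, the same \(e^{\beta}\) would be an odds ratio instead):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(400, 1)))
y = rng.poisson(np.exp(X @ [0.3, 0.7]))
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# With a log link, exp(beta) is a rate ratio: a one-unit increase in the
# predictor multiplies the expected count by exp(beta).
print(np.exp(result.params))
print(result.pvalues)   # Wald z-tests on the coefficients (link scale)
```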
More Info
This post is a general summary of GLMs; for more information I would recommend reviewing the specific model of interest. See the linked posts for linear and logistic regression.