Statistical Analysis
Selected variables for analysis
The table shows the results from the fitted linear regression model.
Linearity and consequences of non-linearity
One of the assumptions of a linear regression model is that the mean of the outcome Y is a linear function of the predictor variable X. In multiple linear regression, the relationship between each predictor variable and the mean of the outcome is assumed to be linear when the other variables are held constant. This implies that the model is linear in the regression parameters (coefficients), meaning that the conditional mean of the residuals is assumed to be zero for any given combination of values of the predictor variables.
This assessment aims to check the assumption of linearity for the fitted model. We investigate whether there are any deviations or specification errors which may result in underfitting.
Violation of the linearity assumption implies that the model fails to represent the pattern of the relationship between the mean response and the predictor variables. The estimates of the regression parameters will be biased (they fail to estimate the true values), inconsistent (they do not converge to the true values as the sample size grows) and inefficient (the estimator has a large sampling variance).
How to assess the linearity assumption?
Prior knowledge: terminology used in this section.
A fitted value is a value of the outcome variable estimated from the OLS regression line. Fitted values can also be referred to as predicted values.
Residuals are the differences between observed values and their corresponding fitted values that lie on the regression line.
Graphical methods that can be used
Plots of the residuals based on the fitted model can be used to check the assumption of linearity.
- A scatter plot of the observed values against the fitted values gives an overview of the marginal relationships between \(Y\) and \(X\). It is plotted with a loess curve (locally estimated scatterplot smoothing) which does not assume a particular form for the relationship between \(Y\) and \(X\) (e.g. a linear model) but rather produces a smooth line that follows the trend in the data.
- A plot of the residuals versus the fitted values can be examined to complement the information from the scatter plot (see the sketch after this list).
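The sketch below shows how both plots could be reproduced in R outside the app; the model object `fit`, the data frame `df` and the variable names are placeholders, not output from this app.

```r
# Minimal sketch (names are placeholders): fit <- lm(y ~ x1 + x2, data = df)

# Observed values versus fitted values with a loess-type smoother
plot(fitted(fit), df$y, xlab = "Fitted values", ylab = "Observed values")
lines(lowess(fitted(fit), df$y), col = "red")   # smooth trend line
abline(a = 0, b = 1, col = "blue")              # reference line (observed = fitted)

# Residuals versus fitted values (also available via plot(fit, which = 1))
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "blue")                     # zero-line
lines(lowess(fitted(fit), resid(fit)), col = "red")
```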
Acceptable appearance of residual plots
The scatter plot of the observed values against the fitted values shows how accurately the model fits the data and reveals any strong deviations from the regression line. The linearity assumption is met if the loess curve (red) approximately follows the regression line (blue) and remains within the confidence interval bounds (grey).
When the linear regression model is correct, the points should be randomly scattered around the zero-line (the horizontal line where residuals equal zero), with no systematic pattern. This zero-line represents the situation where there is no difference between the observed and fitted values of Y. The loess curve should approximately follow the zero-line; curvature could indicate model misspecification and non-linearity.
Remedies
When non-linearity is detected, it is recommended to use procedures that account for the model misspecification.
Approach 1: Transformations of predictors (X)
Transforming predictors is useful when residuals indicate nonlinearity with a specific covariate.
These transformations improve model fit by restoring approximate linearity, without necessarily altering the distribution of the residuals, since the outcome itself is left unchanged.
- Log(X): use when the effect of X on Y diminishes as X increases (e.g., multiplicative or diminishing-returns patterns).
- Sqrt(X): helpful for skewed or count-like predictors to soften curvature.
- Centering / standardising X: improves interpretability and numerical stability when adding polynomial terms.
Interpretation note: coefficients relate to the transformed scale.
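As a minimal sketch (assuming a data frame `df` with outcome `y` and predictors `x1`, `x2`; all names are placeholders), these transformations can be specified directly in an R model formula:

```r
# Hypothetical predictor transformations within the model formula
fit_log    <- lm(y ~ log(x1) + x2, data = df)              # Log(X)
fit_sqrt   <- lm(y ~ sqrt(x1) + x2, data = df)             # Sqrt(X)
fit_center <- lm(y ~ scale(x1, scale = FALSE) + x2,        # centred X
                 data = df)
```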
Approach 2: Polynomial regression
The core principle behind polynomial regression is to use a non-linear function to transform the predictor variable. For example, a simple and commonly used transformation is to square the predictor variable (second-order polynomial) to model a U-shaped relationship.
After this adjustment, the fitted model will follow the structure in the data. Therefore, if the relationship between \(X\) and the mean response of \(Y\) is U-shaped (curvilinear), the appropriate model is a quadratic regression model which has a second-order polynomial in X (i.e., \(\hat{Y}_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2\)). In a cubic regression model, a third-order polynomial in X is introduced such that \(\hat{Y}_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i1}^3\).
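Outside the app, such models can be sketched in R as follows (the data frame `df` and variable names are placeholders):

```r
# Quadratic (second-order) and cubic (third-order) polynomial terms for a predictor x1
fit_quad  <- lm(y ~ x1 + I(x1^2), data = df)
fit_cubic <- lm(y ~ x1 + I(x1^2) + I(x1^3), data = df)  # equivalently poly(x1, 3, raw = TRUE)
```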
Variables can be transformed under the Transform sidebar tab to incorporate such higher-order terms, and users can then redefine the model in the Define Model sidebar tab. Use this approach sparingly to avoid overfitting.
Note: Polynomial regression models are still linear models because they are linear in their parameters.
Approach 3: Piecewise regression
This form of regression allows multiple linear models to be fitted to the data for different ranges of \(X\), i.e. it is a flexible option for complex but smooth non-linearity. For example, if the data follow different linear trends over different regions of the predictor, the regression function should be modelled in “pieces” which can be connected. In this version of ReDiag, we will not be using this approach.
Approach 4: Addressing omitted variable bias
Systematic non-linear patterns in the residual plots may suggest that important predictors are missing from the model. Omitting relevant variables can bias estimates and lead to apparent non-linearity in the residuals. For example, if the residuals show distinct clusters or trends by subgroups, this could indicate the presence of a discrete variable (e.g., gender, treatment group) that should be explicitly included in the model. Adding such omitted variables can restore linearity and improve model adequacy. As with influential observations, careful investigation is needed before adding predictors to ensure they are theoretically justified.
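A minimal sketch of this idea in R, assuming a hypothetical grouping variable `group` that was initially left out of the model (all names are placeholders):

```r
# Compare the model with and without the suspected omitted variable
fit_reduced  <- lm(y ~ x1 + x2, data = df)
fit_extended <- lm(y ~ x1 + x2 + group, data = df)  # 'group' coded as a factor
anova(fit_reduced, fit_extended)                    # F-test for the added variable
```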
Component-plus-Residual plots
When your model includes multiple predictor variables, component-plus-residual (partial residual) plots are important to inspect in addition to the plots described in the Linearity Assumption tab, because the residuals are determined by several predictor variables and these plots make it easier to link any deviation from linearity to a specific predictor variable.
A scatter plot matrix is useful in multiple linear regression. This is a two-dimensional scatter diagram of \(y\) versus each \(X\) (i.e. \(y\) versus \(X_1\), \(y\) versus \(X_2\), \(\ldots\), \(y\) versus \(X_k\)). However, to check the assumption of linearity, these plots do not paint the whole picture (and can be misleading) because our interest centres on the partial relationship between \(y\) and each \(X\), controlling for the other \(X\)s, not on the marginal relationship between \(y\) and a single \(X\).
Component-plus-residual plots become relevant in checking the linearity assumption when there is more than one predictor variable. These plots display the partial residuals (the ordinary residuals plus the estimated linear component \(\hat{\beta}_j X_j\)) against each predictor variable.
The linearity assumption is met when the loess curve (solid purple line) follows the regression line (dashed blue line). Deviations between the loess curve and the regression line are indicative of deviations from linearity in the partial relationship between X and Y. Box plots are used for categorical variables instead of scatter plots, because the linearity assumption is not required for the relationship between a categorical predictor and a continuous outcome.
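Outside the app, component-plus-residual plots can be obtained from the car package; the sketch below assumes a fitted model object `fit`.

```r
# Component-plus-residual (partial residual) plots, one panel per predictor
library(car)
crPlots(fit)   # dashed line: linear component; solid line: loess smoother
```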
Normality and consequences of non-normality
Another assumption of a linear regression model is that the residuals are normally distributed. In short, this means the density of the residuals should have a single peak (i.e. be unimodal) and be symmetric rather than skewed. Note that the outcome Y is not required to be normally distributed, because normality in Y does not guarantee normality in the residuals. It is the outcome, conditional on the predictor variables, that needs to fulfil the requirement of normality. However, the outcome has to be continuous (not discrete) for us to assess this assumption.
When the errors are non-normally distributed, the least-squares estimator has a large sampling variance (it is inefficient) and this distorts the interpretation of the model, because the conditional mean of Y given the X's is a sensitive measure in skewed distributions.
How to assess the normality assumption?
Graphical methods that can be used
- A plot of the theoretical quantiles versus sample quantiles (QQ-plot) can be used to compare observed values to a theoretical distribution. The ordered quantiles of the observed residuals are plotted against the quantiles of the standard normal distribution.
- A histogram of the residuals can also be used to visualise the distribution, but caution needs to be taken in small sample sizes as the plot may be inconclusive.
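As a minimal sketch (assuming a fitted model object `fit`), both plots can be produced with base R:

```r
# QQ-plot of the residuals (also available via plot(fit, which = 2))
qqnorm(resid(fit))
qqline(resid(fit), col = "blue")

# Histogram of the residuals
hist(resid(fit), breaks = 20, xlab = "Residuals", main = "Histogram of residuals")
```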
Acceptable appearance of residual plots
The ordered residuals are plotted against the theoretical expected values for a standard normal sample. To meet the normality assumption, the residuals should follow the diagonal straight line (which represents the normal distribution) without deviating from the confidence interval bounds (grey).
To provide a decent indicator of normality of the residuals, the histogram should have some symmetry and a bell-shape. The main objective is to avoid seeing histograms that are very irregularly shaped (e.g., heavily skewed).
Remedies
Approach 1: Transformations
Data transformation is one strategy to address non-normality, heteroscedasticity, or nonlinearity.
It involves adapting the data to the model by altering either the outcome (Y) or one or more predictors (X).
We perform transformations for three main reasons:
- normalise the residuals
- stabilise variance of the outcome
- linearise the regression model
When should I transform Y vs X?
- Transform Y when residuals are skewed and/or the variance of Y changes with its mean (heteroscedasticity).
- Transform X when the relationship between a predictor and Y is clearly non-linear (curved pattern in residuals or component-plus-residual plots).
A well-chosen transformation can help satisfy the above concerns; in some instances, the same transformation addresses both normality and variance.
Interpretation of the model depends on the transformed variable(s). Common transformations of the outcome include:
- Log(Y): (a) stabilises variance when it increases with Y; (b) normalises positively skewed residuals; (c) linearises approximately exponential relationships.
- Square(Y): (a) stabilises variance when it decreases with the mean of Y; (b) normalises negatively skewed residuals; (c) linearises downward-curving relationships.
- Sqrt(Y): (a) stabilises variance proportional to the mean; often suitable for count-like outcomes (approx. Poisson).
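A minimal sketch of these outcome transformations in R (the data frame `df` and variable names are placeholders):

```r
# Hypothetical outcome transformations within the model formula
fit_log  <- lm(log(y)  ~ x1 + x2, data = df)   # Log(Y), requires y > 0
fit_sq   <- lm(I(y^2)  ~ x1 + x2, data = df)   # Square(Y)
fit_sqrt <- lm(sqrt(y) ~ x1 + x2, data = df)   # Sqrt(Y), requires y >= 0
```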
Another common approach is the Box–Cox family of power transformations for the outcome. The procedure uses maximum likelihood to find the optimal power. More information is provided in the Manual tab.
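Outside the app, a Box–Cox search can be sketched with the MASS package (assuming a fitted model `fit` with a strictly positive outcome; all names are placeholders):

```r
# Box–Cox profile likelihood over a grid of powers
library(MASS)
bc     <- boxcox(fit)              # plots the profile log-likelihood
lambda <- bc$x[which.max(bc$y)]    # power with the highest likelihood
# Refit on the transformed scale; a lambda close to 0 suggests using log(y) instead
fit_bc <- lm(I(y^lambda) ~ x1 + x2, data = df)
```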
Interpretation note: While some researchers prefer to back-transform estimates to the original scale, this is not always appropriate. If the chosen transformation is not monotonic, back-transformation can distort effect sizes and complicate interpretation. In such cases, results should be interpreted on the transformed scale using partial orders or relative comparisons. The Box–Cox transformations implemented here are monotonic, so interpretation remains coherent between the transformed and original scales.
Approach 2: Generalized linear models
When the model assumptions are violated even after applying transformations, this implies a multiple linear regression model poorly describes the data. The solution is to adapt the regression model to the data and model the non-normality. The generalised linear model is a generalisation of the basic regression model that makes it possible to relax the normality assumption and assume other error distributions instead. Logistic regression (binomial and multinomial data) and Poisson regression (count data) are good options.
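A minimal sketch of both options in R, assuming hypothetical outcomes `y_binary` (0/1) and `y_count` (non-negative counts); all names are placeholders:

```r
# Generalised linear models for non-normal outcomes
fit_logistic <- glm(y_binary ~ x1 + x2, family = binomial, data = df)  # logistic regression
fit_poisson  <- glm(y_count  ~ x1 + x2, family = poisson,  data = df)  # Poisson regression
```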
Approach 3: Addition of omitted discrete variables and handling of influential observations
A multimodal (i.e., more than one peak) error distribution implies that the model omitted one or more discrete predictor variables that naturally divide the data into groups. Adding these variables can help normalise the distribution of the residuals.
Cook’s Distance can identify influential observations that may disproportionately affect the model. Rather than simply deleting these points, it is essential to first investigate why these observations poorly fit the model. This examination can reveal data entry errors, measurement anomalies, or the presence of important omitted predictors. Removal should only be considered if the observations are confirmed invalid or erroneous.
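As a sketch (assuming a fitted model object `fit`), Cook's distances can be inspected in base R; the 4/n threshold used below is only a common rule of thumb, not a strict cut-off:

```r
# Cook's distance for each observation (also available via plot(fit, which = 4))
cd <- cooks.distance(fit)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 4 / nobs(fit), col = "red", lty = 2)  # rough rule-of-thumb threshold
which(cd > 4 / nobs(fit))                        # observations worth investigating
```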
Homoscedasticity and consequences of heteroscedasticity
Homoscedasticity is an assumption of the linear regression model that states the variance of the residuals (or errors) should be constant for all levels of the predictor variables. When this assumption is violated, the residual variance differs at various levels of the predictor variables, which is known as heteroscedasticity.
If the assumption of homoscedasticity is violated, the model's estimates will remain unbiased, but they will be inefficient (i.e., they will have a larger variance than necessary). Furthermore, the standard errors of the coefficients will be incorrect, leading to unreliable confidence intervals and significance tests, ultimately affecting the validity of any conclusions drawn from the model.
How to assess the homoscedasticity assumption?
Graphical methods that can be used:
- Residuals vs. fitted values plot: This plot helps detect unequal variance (heteroscedasticity) by examining whether residuals are randomly scattered around the horizontal line without distinct patterns or trends.
- Scale-location plot (also known as spread-location plot): This plot uses the square root of the absolute standardized residuals to emphasize variations in residual spread. It is particularly useful because it stabilizes variance, making it easier to detect subtle heteroscedasticity patterns.
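Both plots are part of R's built-in diagnostics for lm objects; a minimal sketch, assuming a fitted model object `fit`:

```r
plot(fit, which = 1)   # residuals vs fitted values
plot(fit, which = 3)   # scale-location plot
```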
Acceptable appearance of residual plots
In a residuals vs. fitted values plot, homoscedasticity is evident if the residuals are randomly scattered around the horizontal line at zero without any systematic pattern. If the spread of residuals increases or decreases across levels of fitted values, this suggests heteroscedasticity.
In a scale-location plot, the homoscedasticity assumption holds if the points are scattered randomly around a horizontal line. If the points exhibit a pattern or trend (such as a funnel shape), heteroscedasticity is likely present.
Remedies
Approach 1: Transformation of variables
One of the most common ways to address heteroscedasticity is to transform the dependent variable. For example, applying a logarithmic, square root, or inverse transformation can help stabilize the variance.
For example, if the residual variance increases with the magnitude of \(Y\), taking the logarithm of \(Y\) can reduce this issue. The model becomes \(\log(Y) = \beta_0 + \beta_1 X + \varepsilon\).
Approach 2: Weighted least squares
Another approach is to use weighted least squares (WLS), which assigns a weight to each data point based on the inverse of the variance of its residual. This technique reduces the impact of data points with higher variance, resulting in a model with more constant variance.
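A minimal two-step (feasible) WLS sketch in R, assuming the residual spread can be roughly modelled as a positive function of the fitted values (all names are placeholders):

```r
# 1. Fit OLS; 2. model the residual spread; 3. refit with inverse-variance weights
ols <- lm(y ~ x1 + x2, data = df)
aux <- lm(abs(resid(ols)) ~ fitted(ols))   # rough model of the residual spread
w   <- 1 / fitted(aux)^2                   # inverse of the estimated variance
fit_wls <- lm(y ~ x1 + x2, data = df, weights = w)
```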
Note: Both transformations and WLS address the heteroscedasticity issue, but they should be applied with caution as they can also alter the interpretation of the model coefficients.
R Script
You will be able to re-run this analysis in R by copying the R code shown with the plots.
It is highly recommended to run the code in RStudio so that you can edit the script and view and interact with the objects stored in your environment during your analysis.