Welcome to the Regression Diagnostics web-application

The ReDiag app assists students and researchers with residual analysis for linear regression models. It offers visualization tools to assess the normality of residuals and the often-overlooked assumption of linearity. The user-friendly interface enables users to critically assess their model and correct any violations. A dynamic report with the results can be downloaded.

This app was developed by the Support for Quantitative and Qualitative Research (SQUARE) to provide the research community with complimentary support in statistics. The developers are part of the Biostatistics and Medical Informatics research group (BISI) at the Vrije Universiteit Brussel (VUB).


Terms of use

ReDiag is not designed for model building or variable selection. It is distributed in the hope that it may be useful in assessing regression assumptions.

Uploaded data and outputs from analyses are not kept on our servers; instead, you can download the reports and any modified data. Because the research tool is supplied WITHOUT ANY WARRANTY, you must refrain from uploading any sensitive information. If you submit any data to this application, you are solely responsible for its confidentiality, availability, security, loss, abuse, and misappropriation.


Data statement

Data for the analysis of behavioural outcomes (Janssen et al., 2024) and to assess the relationship between occupancy and ammonia build-up (Eskandarani et al., 2023) were obtained from published studies.


Feedback

We would love to hear your thoughts, suggestions, concerns or problems you encountered while using ReDiag so that we can improve. To do this, kindly evaluate the web-application via this link.


Citation

If you use ReDiag for your research, teaching, or presentations, please cite the following publication:

Savieri P, Barbé K, Stas L (2025). ReDiag: An Interactive Research Tool to Address Common Misconceptions in Linear Regression Model Diagnostics. Journal of Open Research Software, 13: 14. DOI: https://doi.org/10.5334/jors.553











Statistical Analysis

Selected variables for analysis


                        

The table shows the results from the fitted linear regression model.
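
For readers who want to reproduce this step outside the app, the table corresponds to the standard output of a fitted linear model in R. A minimal sketch, assuming a data frame dat with an outcome y and predictors x1 and x2 (illustrative names, not the app's variables):

  # Fit the linear regression model and inspect the coefficient table
  fit <- lm(y ~ x1 + x2, data = dat)
  summary(fit)   # estimates, standard errors, t-statistics and p-values
  confint(fit)   # 95% confidence intervals for the coefficients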



Linearity and consequences of non-linearity

One of the assumptions of a linear regression model is that the mean of the outcome Y is a linear function of the predictor variable X. In multiple linear regression, the relationship between each predictor variable and the mean of the outcome is assumed to be linear when the other variables are held constant. The model is linear in the regression parameters/coefficients, which means the conditional mean of the residuals is assumed to be zero for any given combination of values of the predictor variables.

This assessment checks the linearity assumption for the fitted model. We investigate whether there are any deviations or specification errors that may result in underfitting.

Violation of the linearity assumption implies that the model fails to represent the pattern of the relationship between the mean response and the predictor variables. The estimates of the regression parameters will be biased (they fail to estimate the true values), inconsistent (they are not guaranteed to converge to the true values as the sample size grows) and inefficient (the estimator has a large sampling variance).


How to assess the linearity assumption?

Prior knowledge: terminology used.

A fitted value is a value of the outcome variable estimated from the OLS regression line. Fitted values are also referred to as predicted values. Residuals are the differences between the observed values and the corresponding fitted values on the regression line.


Graphical methods that can be used

Plots of the residuals based on the fitted model can be used to check the assumption of linearity.

  1. A scatter plot of the observed values against the fitted values gives an overview of the marginal relationships between \(Y\) and \(X\). It is plotted with a loess curve (locally estimated scatterplot smoothing), which does not assume the form of the relationship between \(Y\) and \(X\) (e.g., a linear model) but instead produces a smooth line that follows the trend in the data.
  2. A plot of the residuals versus the fitted values can be examined to complement the information from the scatter plot (see the sketch after this list).
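
Both plots can be reproduced with base R graphics. A minimal sketch, assuming a fitted lm object fit and a data frame dat containing the outcome y (illustrative names):

  # 1. Observed vs fitted values with a loess-type smoother and a reference line
  plot(fitted(fit), dat$y, xlab = "Fitted values", ylab = "Observed values")
  abline(0, 1, col = "blue")                       # reference line: observed = fitted
  lines(lowess(fitted(fit), dat$y), col = "red")   # smoother that follows the trend

  # 2. Residuals vs fitted values (base R diagnostic plot with its own smoother)
  plot(fit, which = 1)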

Acceptable appearance of residual plots




The scatter plot of the observed values against the fitted values measures the accuracy of the fitted model and reveals any strong deviations from the regression line. The linearity assumption is met if the loess curve (red) approximately follows the regression line (blue) and remains within the confidence interval bounds (grey).



When the linear regression model is correct, the points should be randomly scattered around the zero-line (the horizontal line where residuals equal zero), with no systematic pattern. This zero-line represents the situation where there is no difference between the observed and fitted values of Y. The loess curve should approximately follow the zero-line; curvature could indicate model misspecification and non-linearity.




Remedies

When non-linearity is detected, it is recommended to use procedures that account for the model misspecification.


Approach 1: Transformations of predictors (X)

Transforming predictors is useful when the residuals indicate non-linearity with respect to a specific covariate. These transformations improve the model fit by restoring approximate linearity, without necessarily altering the distribution of the residuals in the way a transformation of the outcome would.

  • Log(X) : use when the effect of X on Y diminishes as X increases (e.g., multiplicative or diminishing-returns patterns).
  • Sqrt(X) : helpful for skewed or count-like predictors to soften curvature.
  • Centering / standardising X : improves interpretability and numerical stability when adding polynomial terms.

Interpretation note: coefficients relate to the transformed scale.
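
A minimal sketch of this approach in R, assuming a positively valued predictor x1 in a data frame dat (illustrative names, not the app's code):

  # Log-transform a skewed predictor and refit the model
  dat$log_x1 <- log(dat$x1)                   # requires x1 > 0
  fit_logx <- lm(y ~ log_x1 + x2, data = dat)
  summary(fit_logx)                           # coefficients now refer to the log scale of x1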


Approach 2: Polynomial regression

The core principle behind polynomial regression is to use a non-linear function to transform the predictor variable. For example, a simple and commonly used transformation is to square the predictor variable (second-order polynomial) to model a U-shaped relationship.

After this adjustment, the fitted model will follow the structure in the data. Therefore, if the relationship between \(X\) and the mean response of \(Y\) is U-shaped (curvilinear), the appropriate model is a quadratic regression model, which has a second-order polynomial in X (i.e., \(\hat{Y}_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2\)). In a cubic regression model, a third-order polynomial in X is introduced such that \(\hat{Y}_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i1}^2 + \beta_3 X_{i1}^3\).

Variables can be transformed under the Transform sidebar tab to incorporate such higher-order terms, and users can then redefine the model in the Define Model sidebar tab. Use higher-order terms sparingly to avoid overfitting.

Note: Polynomial regression models are still linear models because they are linear in their parameters.
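
A minimal sketch of a quadratic fit in R, assuming an outcome y and predictor x1 in a data frame dat (illustrative names):

  # Quadratic regression: add a second-order term for x1
  fit_lin  <- lm(y ~ x1, data = dat)
  fit_quad <- lm(y ~ x1 + I(x1^2), data = dat)
  anova(fit_lin, fit_quad)                 # does the quadratic term improve the fit?

  # Orthogonal polynomials are numerically more stable for higher orders
  fit_poly <- lm(y ~ poly(x1, 2), data = dat)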


Approach 3: Piecewise regression

This form of regression allows multiple linear models to be fitted to the data over different ranges of \(X\), which makes it a flexible option for complex but smooth non-linearity. For example, if the data follow different linear trends over various regions, the regression function can be modelled in “pieces” that are connected. In this version of ReDiag, we do not use this approach.


Approach 4: Addressing omitted variable bias

Systematic non-linear patterns in the residual plots may suggest that important predictors are missing from the model. Omitting relevant variables can bias estimates and lead to apparent non-linearity in the residuals.

For example, if the residuals show distinct clusters or trends by subgroups, this could indicate the presence of a discrete variable (e.g., gender, treatment group) that should be explicitly included in the model.

Adding such omitted variables can restore linearity and improve model adequacy. As with influential observations, careful investigation is needed before adding predictors to ensure they are theoretically justified.



Component-plus-Residual plots

When your model includes multiple predictor variables, component-plus-residual (partial residual) plots are important to inspect in addition to the plots described in the Linearity Assumption tab. Because the residuals are determined by several predictor variables, these plots make it easier to link any deviation from linearity to a specific predictor variable.

A scatter plot matrix is useful in multiple linear regression. This is a two-dimensional scatter diagram of \(y\) versus each \(X\) (i.e y versus \(X_1\), y versus \(X_2\),\(...\), y versus \(X_k\)). However, to check the assumption of linearity, these plots do not paint the whole picture (and can be misleading) because our interest centres on the partial relationship between \(y\) and each \(X\), controlling for the other \(X\)s, not on the marginal relationship between \(y\) and a single \(X\).

Component-plus-residual plots become relevant for checking the linearity assumption when there is more than one predictor variable. These partial residual plots display the partial residuals (the residuals plus the estimated contribution of a given predictor) against that predictor variable.

The linearity assumption is met when the loess curve (solid purple line) follows the regression line (dashed blue line). Deviations between the loess curve and the regression line indicate deviations from linearity in the partial relationship between X and Y. Box plots are used for categorical variables instead of scatter plots, because the linearity assumption is not required for the relationship between a categorical predictor and a continuous outcome.
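
Outside the app, component-plus-residual plots can be obtained with the crPlots() function from the car package. A minimal sketch, assuming a fitted lm object fit without interaction terms:

  # Component-plus-residual (partial residual) plots, one panel per predictor
  # install.packages("car")   # if the package is not yet installed
  library(car)
  crPlots(fit)                # smoother vs least-squares line in each panel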



Normality and consequences of non-normality

Another assumption of a linear regression model is that the residuals are normally distributed. In short, this means the density of the residuals should have a single peak (i.e., be unimodal) and be symmetric rather than skewed. Note that the outcome Y is not required to be normally distributed, and normality of Y does not guarantee normality of the residuals. It is the outcome, conditional on the predictor variables, that needs to fulfil the normality requirement. However, the outcome has to be continuous (not discrete) for this assumption to be assessed.

When the errors are not normally distributed, the least-squares estimator has a large sampling variance (it is inefficient), which distorts the interpretation of the model: the conditional mean of Y given the X's is a sensitive summary measure in skewed distributions.


How to assess the normality assumption?

Graphical methods that can be used

  1. A plot of the theoretical quantiles versus sample quantiles (QQ-plot) can be used to compare observed values to a theoretical distribution. The ordered quantiles of the observed residuals are plotted against the quantiles of the standard normal distribution.
  2. A histogram of the residuals can also be used to visualise the distribution, but caution is needed with small sample sizes as the plot may be inconclusive (see the sketch after this list).
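
A minimal sketch of both plots in base R, assuming a fitted lm object fit:

  res <- residuals(fit)
  qqnorm(res); qqline(res, col = "blue")            # QQ-plot of the residuals
  hist(res, breaks = 20, main = "Histogram of residuals")
  # Alternatively, plot(fit, which = 2) gives the QQ-plot of standardised residuals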

Acceptable appearance of residual plots




The ordered residuals are plotted against the theoretical expected values for a standard normal sample. To meet the normality assumption, the residuals should follow the diagonal straight line (which represents the normal distribution) without deviating from the confidence interval bounds (grey).



To provide a decent indication of normality of the residuals, the histogram should be roughly symmetric and bell-shaped. The main objective is to avoid histograms that are very irregularly shaped (e.g., heavily skewed).




Remedies

Approach 1: Transformations

Data transformation is one strategy to address non-normality, heteroscedasticity, or nonlinearity. It involves adapting the data to the model by altering either the outcome (Y) or one or more predictors (X). We perform transformations for three main reasons:

  1. normalise the residuals
  2. stabilise variance of the outcome
  3. linearise the regression model

When should I transform Y vs X?

  • Transform Y when residuals are skewed and/or the variance of Y changes with its mean (heteroscedasticity).
  • Transform X when the relationship between a predictor and Y is clearly non-linear (curved pattern in residuals or component+residual plots).

A well-chosen transformation can help address the concerns above; in some instances, the same transformation addresses both normality and variance. Interpretation of the model then refers to the transformed variable(s). Common transformations of the outcome include:

  1. Log(Y) : (a) stabilises variance when it increases with Y ; (b) normalises positively skewed residuals; (c) linearises approximately exponential relationships.
  2. Square(Y) : (a) stabilises variance when it decreases with the mean of Y ; (b) normalises negatively skewed residuals; (c) linearises downward-curving relationships.
  3. Sqrt(Y) : (a) stabilises variance proportional to the mean; often suitable for count-like outcomes (approx. Poisson).

Another common approach is the Box–Cox family of power transformations for the outcome. The procedure uses maximum likelihood to find the optimal power. More information is provided in the Manual tab.
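
Outside the app, the Box–Cox profile likelihood can be computed with the boxcox() function from the MASS package. A minimal sketch, assuming a fitted lm object fit with a strictly positive outcome:

  library(MASS)
  bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # plots the profile log-likelihood
  lambda_hat <- bc$x[which.max(bc$y)]                # estimated power for the outcome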

Interpretation note: While some researchers prefer to back-transform estimates to the original scale, this is not always appropriate. If the chosen transformation is not monotonic, back-transformation can distort effect sizes and complicate interpretation. In such cases, results should be interpreted on the transformed scale using partial orders or relative comparisons. The Box–Cox transformations implemented here are monotonic, so interpretation remains coherent between the transformed and original scales.


Approach 2: Generalized linear models

When the model assumptions are violated even after applying transformations, this implies that a multiple linear regression model describes the data poorly. The solution is to adapt the regression model to the data and model the non-normality. The generalised linear model is a generalisation of the basic regression model that makes it possible to relax the normality assumption and assume other error distributions instead. Logistic regression (binomial and multinomial data) and Poisson regression (count data) are good options.
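
A minimal sketch of the two options in R, assuming a data frame dat with hypothetical outcomes count (counts) and event (binary) and predictors x1 and x2:

  # Poisson regression for a count outcome
  fit_pois  <- glm(count ~ x1 + x2, data = dat, family = poisson(link = "log"))

  # Logistic regression for a binary outcome
  fit_logit <- glm(event ~ x1 + x2, data = dat, family = binomial(link = "logit"))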


Approach 3: Addition of omitted discrete variables and handling of influential observations

A multimodal (i.e., more than one peak) error distribution implies that the model omitted one or more discrete predictor variables that naturally divide the data into groups. Adding these variables can help normalise the distribution of the residuals.

Cook’s Distance can identify influential observations that may disproportionately affect the model. Rather than simply deleting these points, it is essential to first investigate why these observations poorly fit the model. This examination can reveal data entry errors, measurement anomalies, or the presence of important omitted predictors. Removal should only be considered if the observations are confirmed invalid or erroneous.
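
A minimal sketch for inspecting influential observations in R, assuming a fitted lm object fit:

  cd <- cooks.distance(fit)
  plot(fit, which = 4)            # Cook's distance for each observation
  which(cd > 4 / nobs(fit))       # observations above a common rule-of-thumb cut-off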



Homoscedasticity and consequences of heteroscedasticity

Homoscedasticity is an assumption of the linear regression model that states the variance of the residuals (or errors) should be constant for all levels of the predictor variables. When this assumption is violated, the residual variance differs at various levels of the predictor variables, which is known as heteroscedasticity.

If the assumption of homoscedasticity is violated, the model's estimates will remain unbiased, but they will be inefficient (i.e., they will have a larger variance than necessary). Furthermore, the standard errors of the coefficients will be incorrect, leading to unreliable confidence intervals and significance tests, ultimately affecting the validity of any conclusions drawn from the model.


How to assess the homoscedasticity assumption?

Graphical methods that can be used:

  1. Residuals vs. fitted values plot: This plot helps detect unequal variance (heteroscedasticity) by examining whether residuals are randomly scattered around the horizontal line without distinct patterns or trends.
  2. Scale-location plot (also known as spread-location plot): This plot uses the square root of the absolute standardized residuals to emphasize variations in residual spread. It is particularly useful because it reduces the skewness of the plotted residuals, making subtle heteroscedasticity patterns easier to detect (see the sketch after this list).
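
Both plots are available from the default lm plotting method in R. A minimal sketch, assuming a fitted lm object fit:

  plot(fit, which = 1)   # residuals vs fitted values
  plot(fit, which = 3)   # scale-location: sqrt(|standardised residuals|) vs fitted values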

Acceptable appearance of residual plots




In a residuals vs. fitted values plot, homoscedasticity is evident if the residuals are randomly scattered around the horizontal line at zero without any systematic pattern. If the spread of residuals increases or decreases across levels of fitted values, this suggests heteroscedasticity.



In a scale-location plot, the homoscedasticity assumption holds if the points are scattered randomly around a horizontal line. If the points exhibit a pattern or trend (such as a funnel shape), heteroscedasticity is likely present.




Remedies

Approach 1: Transformation of variables

One of the most common ways to address heteroscedasticity is to transform the dependent variable. For example, applying a logarithmic, square root, or inverse transformation can help stabilize the variance.

For example, if the residual variance increases with the magnitude of \(Y\), taking the logarithm of \(Y\) can reduce this issue. The model becomes \(\log(Y) = \beta_0 + \beta_1 X + \varepsilon\).
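
A minimal sketch of this remedy in R, assuming a strictly positive outcome y in a data frame dat (illustrative names):

  fit_logy <- lm(log(y) ~ x1 + x2, data = dat)   # requires y > 0
  plot(fit_logy, which = 3)                      # re-check the scale-location plot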


Approach 2: Weighted least squares

Another approach is to use weighted least squares (WLS), which assigns a weight to each data point based on the inverse of the variance of its residual. This technique reduces the impact of data points with higher variance, resulting in a model with more constant variance.
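
A minimal sketch of one common weighting scheme in R (illustrative, not the only option), assuming a fitted lm object fit and a data frame dat without missing values:

  # Model the residual spread as a rough function of the fitted values
  aux <- lm(abs(residuals(fit)) ~ fitted(fit))
  w   <- 1 / fitted(aux)^2                       # weights = 1 / estimated variance
  # note: check that the estimated spread is positive before using the weights
  fit_wls <- lm(y ~ x1 + x2, data = dat, weights = w)
  summary(fit_wls)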


Note: Both transformations and WLS address the heteroscedasticity issue, but they should be applied with caution as they can also alter the interpretation of the model coefficients.



R Script

You can re-run this analysis in R by copying the R code provided with the plots.

It is highly recommended to run the code in RStudio so that you can edit the script and view and interact with the objects stored in your environment during the analysis.


Documentation


Steps on how to use the app


The app has input tabs on the sidebar panel and outputs are displayed on the main panel.


Step 1: Data input

Select an example dataset or load your data file by choosing the correct file extension in the Data Input sidebar tab. To change a data type, select a variable to edit, choose the New data type and apply changes using the Change data type button.

The View Data tab shows a preview of the data, and by default, only 10 rows of data are shown at a time. You can change this setting through the Show entries dropdown. The Data Summary tab displays descriptive statistics of the data including the distributions of variables.


Step 2: Define the model

The Define Model sidebar tab allows users to select the outcome variable and one or more predictor variables. If the model contains interaction terms, these can be created and added to the model by marking the checkbox. The model is then run by clicking the Run Analysis button, and the fitted regression model results are shown under the Model Summary tab.


Step 3: Model diagnostics

The Linearity Assumption , Normality Assumption and Homoscedasticity Assumption tabs provide visualisations and diagnostics plots to validate the regression model.

  1. The 'observed vs fitted values' scatter plot and the 'residuals vs fitted values' plot assess the linearity assumption. Both illustrate the relationship between the outcome and the predictors.
  2. The QQ-plot and the histogram of residuals provide visualisations to assess the normality assumption.
  3. The Scale-Location plot of residuals provides an additional visualisation to assess the homoscedasticity assumption.

Recommendations for the ideal plots are provided together with suggestions for remedies if there are violations.


Step 4: Data transformation

After assessing the plots, the next step is to make any data transformations necessary. If there were no violations, skip to Step 6.

Switch to the View Data tab by clicking on the Transform sidebar tab.

Select a variable to transform from the drop-down menu and choose a transformation type from the list. Type in the extension to the transformed variable before clicking Apply Changes. Data will automatically update with the transformed variable in the last column. It is possible to save the updated dataset by choosing a file extension and clicking the Download button.


Below, we describe the transformation functions (an R sketch of their equivalents follows the list):

  1. Ln (natural log): takes the natural logarithm of the variable, which helps reduce skewness and makes data more normally distributed.
  2. Ln (X+1): takes the natural logarithm of (X+1), used for data containing zeros to avoid undefined values.
  3. Exp: calculates the exponential function of the variable, which helps reverse the natural logarithm transformation.
  4. Square: raises the variable to the power of 2, often used to capture quadratic relationships in data.
  5. Cube: cubes the variable, capturing cubic relationships in data.
  6. Square root: takes the square root of the variable, helpful for stabilising variance and reducing skewness.
  7. Standardise: centres the variable around its mean and scales it to have a standard deviation of 1, ensuring all variables are on the same scale.
  8. Centre: shifts the variable's values to have a mean of zero, useful for comparing variables measured on different scales.
  9. Inverse: takes the reciprocal of the variable, useful for transforming ratios or proportions back to their original scale.
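
For reference, a minimal R sketch of the corresponding operations applied to a hypothetical numeric variable x in a data frame dat (names are illustrative; this mirrors the definitions above rather than the app's internal code):

  dat$x_ln   <- log(dat$x)          # 1. Ln (natural log), requires x > 0
  dat$x_ln1  <- log(dat$x + 1)      # 2. Ln(X+1), safe when x contains zeros
  dat$x_exp  <- exp(dat$x)          # 3. Exp
  dat$x_sq   <- dat$x^2             # 4. Square
  dat$x_cube <- dat$x^3             # 5. Cube
  dat$x_sqrt <- sqrt(dat$x)         # 6. Square root, requires x >= 0
  dat$x_std  <- as.numeric(scale(dat$x))                  # 7. Standardise (mean 0, SD 1)
  dat$x_ctr  <- as.numeric(scale(dat$x, scale = FALSE))   # 8. Centre (mean 0)
  dat$x_inv  <- 1 / dat$x           # 9. Inverse (reciprocal), requires x != 0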

Box-Cox Transformation

When users tick the Box-Cox transformation checkbox, a slider pops up. Below the data preview, a QQ plot shows how the residuals change as users alter the slider input, which represents the new power of the outcome variable. The partial log-likelihood plot shows the estimated lambda and its 95% confidence interval. If the estimated lambda (Est_lambda) from the slider input is close to one of the Exact_lambda values in the table below, use the latter value as it is easier to interpret.


Step 5: Reassemble the model

After data transformation, return to Step 2 and rerun the analysis with the transformed variables.


Step 6: Generate a report

Users can generate an analysis report in the Download Report tab. Reports can be downloaded in PDF, Word or HTML format.

There is also an option to save the R code for the plots generated from the study, to reproduce the analysis or recreate the results. Users can then edit the code in RStudio and interact with the objects stored in their environment during the analysis.


Note

A tutorial of a basic example can be found as a PDF file in the GitHub folder.

At any stage of the analysis, users can refresh and start a new session by clicking the Reset button.






Developers






Perseverence Savieri is a doctoral researcher in the Biostatistics and Medical Informatics research group (BISI) at the Vrije Universiteit Brussel medical campus Jette. He is also a statistical consultant for the humanities and social sciences at campus Etterbeek through the Support for Quantitative and Qualitative Research (SQUARE) core facility. Here, he offers statistical and methodological quantitative support in the form of consultations, statistical coaching, data analyses and workshops.

Perseverence.Savieri@vub.be

Supervisors

dr. Lara Stas & Prof. dr. Kurt Barbé
2025 Support for Quantitative and Qualitative Research (SQUARE)