An introduction to linear regression
Author: Tina Nguyen, MBiostat1
1. Biostatistics Collaboration of Australia
SHPA proudly supports the Research Toolkit series, which aims to support members in conducting and publishing their research. This series is coordinated by the SHPA Research Leadership Committee, and shares the insights and experience of our most research-passionate members. If you’re keen to make a difference to patient care, not just in your daily practice, but in improving practice itself, then this series is for you.
If you are conducting medical research and want to find out whether there is a relationship between the variables in your study, you can use a statistical method called linear regression to analyse your data. Linear regression is widely used in medical research to calculate the correlation between variables (e.g. whether birth weight is linked to weight in childhood). You can also use linear regression to make predictions (e.g. predicting a person’s BMI based on their calorie intake). In this article, I will show you how to conduct simple linear regression and multiple linear regression using the statistical applications R (R Foundation for Statistical Computing, Vienna, Austria) and Stata (StataCorp, College Station, TX, USA), and how to interpret the results.
In medical research, there are two types of variables: independent variables and dependent variables. These can be seen as the cause and the effect, with the independent variable being the cause and the dependent variable being the effect. For example, if your aim is to investigate whether income influences life expectancy, income is the independent variable and life expectancy is the dependent variable. Other names exist for these types of variables, such as predictor and outcome variables, or explanatory and response variables. In every regression model, there can be one or more independent variables, but there can only be one dependent variable. A simple regression model is a regression model with only one independent variable, while a multiple regression model has two or more independent variables.
There are steps you should consider before deciding to perform linear regression on your research data. If you are conducting a simple linear regression, your variables need to be numerical (i.e. continuous variables), for example BMI or weight. If the values of the dependent variable are not numerical, for example disease status, then linear regression is not suitable and you should try a different statistical method. If you are conducting a multiple linear regression analysis, at least one of the independent variables needs to be numerical.1
In statistics, assumptions are all the characteristics that we assume about our data before we run our analysis. These can be thought of as rules that the data need to follow so we can conduct the statistical analysis. If the data does not meet the assumptions required by the statistical method, the results obtained from the analysis may not be accurate.
Here are the assumptions of a linear regression model:
- The relationships between each independent variable (referred to as ‘X’) and the dependent variable (referred to as ‘Y’) are linear
- The variance of the residuals is constant for any value of X
- The distribution of the residuals is normal (in small samples)
- There are no other confounding variables.
These assumptions are explained below.
Linearity between X and Y
A simple regression model can be expressed as the mathematical formula below:
Yi = β0 + β1Xi + εi
In this equation:
- Yi is the value of the dependent variable for individual i (i = 1, 2, … n, where n is the number of individuals in the study)
- β0 and β1 are the parameters of the regression model. Parameters are unknown values that we try to calculate using regression analysis.
- εi is a ‘random error’ with mean zero. (This will be explained later.)
Note: If we compare the formula for the simple regression model (excluding εi) to an algebraic line equation (y = ax + b), β0 would be the y-intercept and β1 would be the slope. We commonly refer to β0 as the constant in statistics.
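To make the formula concrete, the parameters can be estimated by ordinary least squares, where β1 = cov(X, Y) / var(X) and β0 = mean(Y) − β1 × mean(X). The following minimal Python sketch applies these standard formulas to made-up data (not from the article’s example):

```python
# Estimate beta0 (intercept) and beta1 (slope) by ordinary least squares.
# The data below are made up for illustration.
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent variable X
y = [2.1, 3.9, 6.2, 8.1, 9.8]   # dependent variable Y

x_bar, y_bar = mean(x), mean(y)
# beta1 = sum of cross-deviations / sum of squared X deviations
beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
beta0 = y_bar - beta1 * x_bar   # the regression line passes through (x_bar, y_bar)
print(round(beta0, 2), round(beta1, 2))   # fitted line: Y = beta0 + beta1 * X
```

Statistical packages such as R and Stata carry out the same calculation (and much more) behind their regression commands.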
Linearity means that when the values of X and Y are graphed in a scatterplot, as below, the points follow the shape of a straight line.
Figure 1. Example scatterplot of data which is appropriate for linear regression
The red line in Figure 1 is a straight line that fits most closely with the data points. This is called the regression line. The formula for this line is the same as the formula for simple regression but without the ‘random errors’.
Note: It is possible to run linear regression commands with only categorical independent variables using Stata and R. However, the results simply show the average of the dependent variable within each category.
If the points do not follow the shape of a straight line, then a transformation or a different statistical method should be considered. The most common form of transformation is the log transformation. This will be explained later in this article.
Figure 2. Example scatterplot of data which is not suitable for linear regression. A transformation may be required here.
Constant variance of the residuals
Recall from the formula for simple regression that there is a ‘random error’ ε added to the regression line. The observed differences between the data points and the fitted regression line are called residuals. Residuals can be graphed in a residual plot. An example of a residual plot is shown below (Figure 3).
Figure 3. Example residual plot showing constant variance
To meet the assumption of constant residual variance, you want the vertical spread of the data points in the residual plot to be approximately the same across all x axis values. Figure 3 shows an example of the data meeting this assumption.
Figure 4. Example residual plot where the variance decreases with increasing x value
If the variance of the residuals is not constant (shown in Figure 4), then you should either use a different statistical method or apply a transformation.
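As a rough numerical counterpart to inspecting the residual plot, you can compare the spread of the residuals across parts of the X range. The Python sketch below uses assumed data and assumed fitted coefficients; roughly equal variances in the two halves are consistent with the constant-variance assumption:

```python
# Rough check of constant residual variance (data and coefficients assumed).
from statistics import pvariance

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 3.8, 6.1, 8.3, 9.7, 12.2, 13.8, 16.1]
beta0, beta1 = 0.1, 2.0   # assumed coefficients from a fitted model

# Residual = observed value minus fitted value
residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]

# Compare the residual variance in the lower and upper halves of the X range
half = len(residuals) // 2
low, high = residuals[:half], residuals[half:]
print(round(pvariance(low), 4), round(pvariance(high), 4))
```

A large difference between the two variances would suggest the assumption is violated, as in Figure 4.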
Normal distribution of the residuals
From the simple regression formula, we assume that the errors (ε) follow a normal distribution with mean 0 and variance σ². This means that most of the residual values are close to 0, while there are fewer values further away from 0.
Figure 5. Graph showing the frequency of a normal distribution
In statistical programs, the normality of the residuals can be checked using Q–Q (quantile-quantile) plots.
Figure 6. Q–Q plot example of assessment of the normality of residuals
The Q–Q plot in Figure 6 shows a dataset that meets the assumption for the normality of the residuals. Notice that the residual data points follow the diagonal line closely.
If most of the residual data points stray too far away from the diagonal line, then the assumption is not met. In that case, a transformation or a different statistical method should be considered.
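The numbers behind a Q–Q plot can be computed directly: the sorted residuals are paired with theoretical quantiles of the normal distribution, and normality is plausible when the pairs lie close to a straight line. The residual values in this Python sketch are assumed for illustration:

```python
# Pair sorted residuals with theoretical normal quantiles (a Q-Q plot in numbers).
# The residuals below are assumed example values.
from statistics import NormalDist

residuals = [-1.2, -0.6, -0.1, 0.0, 0.3, 0.7, 1.1]
n = len(residuals)
observed = sorted(residuals)
# Theoretical quantiles at plotting positions (i - 0.5) / n
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
for t, o in zip(theoretical, observed):
    print(round(t, 3), o)
```

In practice you would use the built-in Q–Q plot commands (for example, qnorm in Stata or qqnorm in R) rather than computing the quantiles by hand.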
No other confounding variables
Confounding variables are other variables that are linked to both the independent and dependent variables. A common example is the high correlation between ice cream consumption and shark attacks. Obviously, ice cream consumption is not the cause of the higher number of shark attacks. In this example, the confounding variable is temperature. As the temperature increases, more people are likely to eat ice cream and more people are likely to go swimming. Therefore, if we only investigate ice cream consumption and the number of shark attacks, the correlation between the two will be overestimated because of the effect of the confounding variable, temperature.
We should consider all possible confounding variables when conducting research studies. Any confounding variable that is not considered may lead to inaccurate results.
The effects of confounding variables can be adjusted for by adding them to a multiple regression model. Therefore, when producing the final results for a study, it is best to use the results from a multiple regression model rather than a simple regression model. Simple regression models are still useful for gauging the effects of possible confounding variables (by comparing their results to those of a multiple regression model). These findings are usually included in the discussion section of a research article.
Transformations modify all values of a numerical variable in the same way. We do this so our data can meet the statistical assumptions. The most widely used transformation is the log transformation (base e), which replaces each value of a variable with its natural logarithm (ln(x)). When the values of a variable span a wide range but are highly concentrated at the lower end, this is a clear sign that a log transformation may be needed.
Figure 7. Example scatter plot of data that requires log transformation
Figure 8. Example scatterplot after log transformation
The data in Figure 7 are unevenly spread out, which suggests the variable should be log transformed. Figure 8 shows the result of the log transformation. Notice how the relationship is closer to linear after the transformation.
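The transformation itself is a one-line operation: each value is replaced by its natural logarithm. The Python sketch below uses assumed right-skewed values:

```python
# Log (base e) transformation of a right-skewed variable (values assumed).
import math

income = [1.2, 1.5, 2.0, 3.1, 4.8, 9.5, 22.0, 54.0]  # concentrated at low values
log_income = [math.log(v) for v in income]            # ln(x) for each value
print([round(v, 2) for v in log_income])
```

In R the equivalent is log(income); in Stata, generate log_income = ln(income). Remember that coefficients from a model with a log-transformed variable are interpreted on the log scale.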
In the Stata regression output above, the main feature to look for is the bottom table. This is called the parameter estimates table, and it contains the results we are looking for. The most useful columns in this table are ‘Coef.’, ‘P>|t|’ and ‘[95% Conf. Interval]’. When writing up the results of your research project, you should at least include the values under these three columns.
The ‘Coef.’ column contains the values of the coefficients. The value 52.02942 can be interpreted as the increase in life expectancy for every one-unit increase in income. Similarly, -0.1680367 is the change in life expectancy for every one-unit increase in schooling (a decrease, since the coefficient is negative). The value 37.98063 is called the constant: the value of the dependent variable (life expectancy in this example) when all the other variables are zero. The multiple regression formula using these values is as follows:
Life expectancy = 37.98063 + 52.02942 * income - 0.1680367 * schooling + εi
This is similar to the simple regression equation, but it includes other variable(s).
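Ignoring the error term, the fitted equation can be used directly for prediction. The Python sketch below plugs the coefficients from the example output into the formula; the input values for income and schooling are hypothetical and chosen only for illustration:

```python
# Predict life expectancy from the fitted coefficients in the example output.
# The input values (income and schooling) are hypothetical.
beta0 = 37.98063          # constant
b_income = 52.02942       # coefficient for income
b_schooling = -0.1680367  # coefficient for schooling

def predict(income, schooling):
    return beta0 + b_income * income + b_schooling * schooling

print(round(predict(0.7, 12.0), 1))  # predicted life expectancy in years
```

Note that predictions are only reliable for input values within the range of the data used to fit the model.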
The ‘P>|t|’ column contains the p-values. A p-value is the probability of obtaining a coefficient at least as large in magnitude as the one observed, if there were truly no association between the variables. For example, the p-value for schooling is 0.490, meaning a coefficient of this size could easily arise by chance alone. Therefore, the result for schooling is deemed not statistically significant. A common rule of thumb for statisticians is to consider a p-value less than 0.05 as significant. However, some studies use a threshold of 0.01 or 0.1 depending on the nature of the study. In a multiple regression model, variables with insignificant results are often excluded.
The last column shows the 95% confidence intervals. These give the range of values within which we are 95% confident that the true population coefficient lies. Like p-values, some studies use 90% or 99% confidence intervals instead.
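For large samples, a 95% confidence interval can be approximated from a coefficient and its standard error as coefficient ± 1.96 × SE (Stata and R report exact t-based intervals, so their values will differ slightly). The standard error in this Python sketch is an assumed value for illustration:

```python
# Approximate 95% confidence interval: coefficient +/- 1.96 * standard error.
# The standard error below is assumed for illustration.
coef = 52.02942   # coefficient for income from the example output
se = 4.0          # assumed standard error

lower = coef - 1.96 * se
upper = coef + 1.96 * se
print(round(lower, 2), round(upper, 2))
```

For a 90% interval the multiplier is about 1.645, and for a 99% interval about 2.576.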
Similar to the Stata output, the main results in the R output are in the bottom table. The ‘Estimate’ column shows the coefficient values and the ‘Pr(>|t|)’ column shows the p-values. Note that the asterisks next to the p-values indicate the level of significance. Confidence intervals can be calculated in R using the confint command.
Linear regression is an important statistical method when analysing your data. It allows you to calculate the correlation between variables and to make predictions. In this article, we have demonstrated when to use simple linear regression and multiple linear regression, how to use the statistical applications R and Stata to conduct linear regression, and how best to interpret your results. For additional reading on the use of linear regression, Applied Linear Statistical Models, 5th edition by Kutner et al. is recommended.
- 95% confidence interval: an interval within which we are 95% confident that the true population result lies
- Coefficient: the change in the dependent variable for every 1-unit increase in the independent variable (an increase if the coefficient is positive, a decrease if it is negative)
- Constant (β0): the value of the dependent variable when the values of all independent variables are zero
- Dependent variable (Y): a variable whose values depend on the value of the independent variable/s
- Independent variable (X): a variable whose values do not depend on another variable
- p-value: in linear regression, the probability of obtaining a coefficient at least as large in magnitude as the one observed if there were truly no association between the variables. Because studies use data from a sample of the population, the result of your statistical analysis will rarely equal the true population value exactly. Therefore, there is always a chance that the positive or negative coefficient from your study could be the opposite of what is actually true for the population
- Regression line: the line that best shows the correlation between two variables. See the regression line in Figure 1
- Regression model: A formula that is created using regression analysis (for example, see the mathematical formula in the section Stata). Models can be used to make predictions.
- Residuals: the differences between the data values and the fitted values from the regression line
- Statistical methods: formulas or techniques used to conduct statistical analysis. There are many statistical methods, and linear regression is one of them
- Transformation: modifying all the values of a variable in the same way. The most widely used transformation is the log transformation
- Variables: something we measure that can change and may have more than one value e.g. weight, height, income
- Variance: number that shows how spread out the data is
1. Kutner M, Nachtsheim C, Neter J, Li W. Applied linear statistical models. 5th ed. Boston: McGraw-Hill/Irwin; 2005.