Assumptions of Linear Regression - Ace the Most Asked Interview Question
Fundamentals of Linear Regression and Machine Learning
Introduction
Linear Regression is one of the most popular statistical models and machine learning algorithms, often considered the holy grail in the world of Data Science and Machine Learning.
It is one of the first (if not the first) algorithms that is taught in ML schools and courses alike.
However, one of the most important aspects that a lot of tutorials skip is that Linear Regression cannot be applied to all datasets alike. There are certain mandates that a dataset and its distribution must follow for a Linear Regression model to fit it successfully.
These are popularly also known as the Assumptions of Linear Regression.
💡 Assumptions of Linear Regression model is a favorite interview question for the Data Scientist and Machine Learning Engineer positions.
In this article, we will not only list the different assumptions of a linear regression model but also discuss the rationale behind each of them.
The prerequisite for this discussion is a good understanding of the Linear Regression algorithm itself.
So let’s go! 🚀
A quick review of Linear Regression
We know that the Linear Regression model aims at establishing the best-fit line between the dependent and independent features of a dataset as shown below.
Figure: y = 3 + 5x + np.random.rand(100, 1)
The Linear Regression model is defined as follows.
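$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$$
where y is the dependent (target) variable, x1, …, xn are the independent features, β0, …, βn are the coefficients learned by the model, and ε is the error term.
As a quick refresher, here is a minimal sketch (assuming NumPy and scikit-learn are available) that generates the data from the figure above and fits a Linear Regression model to it:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate the toy data from the figure: y = 3 + 5x + noise
rng = np.random.default_rng(0)
x = rng.random((100, 1))                  # 100 points in [0, 1)
y = 3 + 5 * x + rng.random((100, 1))      # uniform noise, as in the figure

# Fit the best-fit line
model = LinearRegression().fit(x, y)
print("Intercept:", model.intercept_)     # ~3.5 (3 plus the mean of the uniform noise)
print("Slope:", model.coef_)              # ~5
```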
Now, let us discuss the assumptions of the Linear Regression model.
Assumptions of the Linear Regression Model
The assumptions of Linear Regression are as follows:
Linearity
Homoscedasticity or Constant Error Variance
Independent Error Terms or No Autocorrelation
Normality of Residuals
No or Negligible Multi-collinearity
Exogeneity
💡 NOTE
Different sources and textbooks might list a different number of assumptions of a linear regression model. And they are all correct.
However, the 6 assumptions that we will discuss today cover all of them.
Many textbooks break individual assumptions into multiple separate ones and can therefore list around 10 different assumptions.
⭐ The significance of these assumptions can be understood as guidelines: a dataset that follows them is highly suitable for a Linear Regression model.
Alright! Let’s discuss each of these assumptions in detail.
1. Linearity
This essentially means that there must be a linear relationship between the dependent and the independent features of a dataset.
And this is fairly intuitive as the best-fit line of a linear regression model is a straight line, which is most suitable for linear data distribution.
Compare the two different distributions below:
Data is linearly distributed
Figure: y = 3 + 5x + np.random.rand(100, 1)
Data is non-linearly distributed
Figure: y = 3 + 50x^2 + np.random.rand(100, 1)
Comparing the two, we can clearly see that the linear regression model is a better fit for the linearly distributed data.
How to detect linearity between dependent & independent features?
Well, one way is to plot the data and detect it visually. However, in real-world scenarios, it may not be so simple to detect linearity in data.
The Likelihood Ratio (LR) Test is a good test for establishing linearity.
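As a rough sketch of the LR-test idea (using statsmodels here, which is an assumed choice), we can fit a restricted model that is linear in x and an unrestricted model that adds a non-linear term, then compare their log-likelihoods. A significant improvement for the unrestricted model suggests the relationship is not purely linear:
```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Toy data with a quadratic relationship, so linearity should be rejected
rng = np.random.default_rng(42)
x = rng.uniform(0, 2, 100)
y = 3 + 50 * x**2 + rng.normal(scale=5, size=100)

# Restricted model: linear in x
res_linear = sm.OLS(y, sm.add_constant(x)).fit()

# Unrestricted model: adds a quadratic term
X_quad = sm.add_constant(np.column_stack([x, x**2]))
res_quad = sm.OLS(y, X_quad).fit()

# Likelihood Ratio statistic: 2 * (logL_unrestricted - logL_restricted)
lr_stat = 2 * (res_quad.llf - res_linear.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.4f}")
# A small p-value means the quadratic term matters, i.e. the relationship
# between x and y is not purely linear.
```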
2. Homoscedasticity or Constant Error Variance
The second assumption of linear regression is Homoscedasticity.
It means that the residuals (or error terms) should have constant variance across the range of the independent variables; in other words, the error terms must be evenly spread around zero, as shown below.
Figure: The residuals for a linearly distributed dataset have constant variance.
There are instances where the residuals are not evenly spread along the axis, and this condition is known as Heteroscedasticity. A few examples are shown below.
Figure: Homoscedasticity vs Heteroscedasticity [Source]
When there is Heteroscedasticity in the data, the standard errors of the coefficients cannot be relied upon, and hence it is a violation of the assumptions of Linear Regression.
How to detect Heteroscedasticity in data?
Apart from detecting it visually, there are statistical tests for determining Heteroscedasticity. The popular ones are listed below, followed by a short sketch:
Goldfeld-Quandt test
Breusch-Pagan test
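Both tests are available in statsmodels. Here is a minimal sketch on assumed toy data whose noise grows with x:
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_goldfeldquandt

# Toy data with heteroscedastic noise: the error spread grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 + 5 * x + rng.normal(scale=0.5 * x, size=200)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# Breusch-Pagan: regresses the squared residuals on the regressors
bp_stat, bp_pvalue, _, _ = het_breuschpagan(res.resid, res.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue:.4f}")

# Goldfeld-Quandt: compares the residual variance of two halves of the sample
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
print(f"Goldfeld-Quandt p-value: {gq_pvalue:.4f}")

# Small p-values indicate Heteroscedasticity, i.e. the assumption is violated.
```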
How to remove Heteroscedasticity in data?
There are certain ways to remove Heteroscedasticity from your data; some of them are listed below, with a short sketch after the list:
White's standard errors: These are heteroscedasticity-robust standard errors that correct the estimated uncertainty of the coefficients. The coefficients themselves are unchanged, but the corrected standard errors are typically larger, so the confidence in the coefficients of the independent features decreases.
Weighted least squares: Each observation is weighted (typically by the inverse of its error variance) so that noisier observations contribute less to the fit. Choosing the weights often involves some trial and error, but it can restore Homoscedasticity.
Log transformations: Applying the log function to the target (or to a skewed feature) often stabilizes the variance of the residuals. Other variance-stabilizing transformations may work as well.
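A minimal sketch of the first two remedies with statsmodels, reusing the heteroscedastic toy data from above (the weight choice is an assumption for illustration):
```python
import numpy as np
import statsmodels.api as sm

# Heteroscedastic toy data (error variance grows with x)
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 3 + 5 * x + rng.normal(scale=0.5 * x, size=200)
X = sm.add_constant(x)

# 1) White's heteroscedasticity-robust (HC) standard errors:
#    the coefficients stay the same, only the standard errors are corrected
robust_res = sm.OLS(y, X).fit(cov_type="HC0")
print(robust_res.bse)                       # robust standard errors

# 2) Weighted least squares: down-weight the noisier observations
#    (here we assume the error variance is proportional to x**2)
wls_res = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls_res.params)                       # coefficients under WLS
```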
3. Independent Error Terms or No Autocorrelation
Here the assumption states that each residual term is not related to the residual terms occurring before or after it. A good example of this is shown below.
Figure: The residuals for a linearly distributed dataset are independent of each other.
💡 NOTE
Autocorrelation is the correlation of a data series with itself, where the error term of a data record is related to the residual of the previous data record.
It is most often found in time-series data and not so prevalent in regular cross-sectional datasets. An example of a time-series distribution is shown below.
Figure: Autocorrelation in time series data helps forecast future outcomes.
Therefore, it is not something that you may encounter very often; however, if you do, it is a violation of the assumptions of linear regression.
With autocorrelation in the data, the standard error of the output becomes unreliable.
How to detect autocorrelation?
There are a few tests for detecting autocorrelation in a dataset. Here are the most common ones, with a short sketch after the list:
ACF & PACF plots
Durbin-Watson test
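A minimal sketch of both checks with statsmodels, on assumed toy data whose errors follow an AR(1) process (each error depends on the previous one):
```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Toy trend data with autocorrelated (AR(1)) errors
rng = np.random.default_rng(0)
t = np.arange(200)
errors = np.zeros(200)
for i in range(1, 200):
    errors[i] = 0.8 * errors[i - 1] + rng.normal()
y = 3 + 0.5 * t + errors

res = sm.OLS(y, sm.add_constant(t)).fit()

# Durbin-Watson: ~2 means no autocorrelation; values near 0 or 4 indicate
# positive or negative autocorrelation respectively
print("Durbin-Watson:", durbin_watson(res.resid))

# ACF & PACF plots of the residuals: significant spikes indicate autocorrelation
plot_acf(res.resid)
plot_pacf(res.resid)
plt.show()
```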
4. Normality of Residuals
This assumption states that the residuals (or errors) of the model must be normally distributed.
If the normality of errors is violated and the number of records is small, then the standard errors in the output are affected, which makes the confidence intervals and p-values of the coefficients unreliable.
💡 NOTE
This assumption generally is considered a weak assumption for Linear Regression models and slight (or greater) violations can be neglected while modeling. This is particularly true for large datasets.
How to detect normality in errors?
There are multiple visual and statistical tests for detecting normality in the error terms. Some of the popular ones are listed below, followed by a short sketch:
Histogram
Figure: Residuals are normally distributed [Source]
Q-Q Plot
Figure: Q-Q plot for normally distributed errors [Source]
Shapiro-Wilk test
Kolmogorov-Smirnov test
Anderson-Darling test
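A minimal sketch of these checks, run on residuals from an assumed toy fit:
```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

# Residuals from a simple toy fit (assumed data for illustration)
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 + 5 * x + rng.normal(size=100)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Visual checks: histogram and Q-Q plot
plt.hist(resid, bins=20)
sm.qqplot(resid, line="45", fit=True)
plt.show()

# Statistical tests (large p-values are consistent with normality)
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
print("Kolmogorov-Smirnov p-value:",
      stats.kstest(resid, "norm", args=(resid.mean(), resid.std())).pvalue)
print("Anderson-Darling statistic:", stats.anderson(resid, dist="norm").statistic)
```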
How to bring normality in errors?
As mentioned above, this is a weak assumption and can often be neglected.
However, some ways to bring normality to the residuals are listed below, with a short sketch after the list:
Mathematical transformations, such as the log transformation
Standardization or normalization of the dataset
Adding more data: with a large enough sample, the central limit theorem makes the coefficient estimates approximately normal even if the residuals are not
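A minimal sketch of the first two approaches, on an assumed right-skewed toy target:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumed right-skewed toy target and some features
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100)
X = rng.random((100, 3))

# 1) Log transformation of the skewed target (log1p handles zeros safely)
y_log = np.log1p(y)

# 2) Standardization of the features (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)
```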
5. No Multi-collinearity
Multi-collinearity occurs when 2 or more features of a dataset are correlated with each other.
Consider a house price dataset with multiple variables about the property, with price being the target variable. There is a high chance that the features 'floor area' and 'land dimensions' are highly correlated, since the area is a direct product of the individual dimensions.
Now, this is a problem for the regression model, since what it is effectively trying to do is isolate the individual effect of each feature on the target variable. This effect is represented by the weight (coefficient) of each feature, as shown below.
$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$$
Therefore, it is highly recommended to verify that there is no strong collinearity between individual features within a dataset.
How does this affect our model?
It disturbs the best-fit line by impacting the individual coefficients of the variables, which then become unreliable.
How to detect multicollinearity?
Calculating the correlation (ρ) between each pair of features in the dataset.
Variance Inflation Factor (VIF); see the sketch below.
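A minimal sketch of both checks on a hypothetical house-price-style dataset (the feature names and numbers are made up for illustration):
```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features: land_area is strongly correlated with floor_area
rng = np.random.default_rng(0)
df = pd.DataFrame({"floor_area": rng.uniform(50, 200, 100)})
df["land_area"] = df["floor_area"] * rng.uniform(1.1, 1.5, 100)
df["age"] = rng.uniform(0, 50, 100)

# 1) Pairwise correlation matrix
print(df.corr())

# 2) Variance Inflation Factor for each feature (VIF above ~5-10 is a common red flag)
X = df.to_numpy()
for i, col in enumerate(df.columns):
    print(col, variance_inflation_factor(X, i))
```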
How to remove multicollinearity?
Simply removing one of the correlated variables.
Merging them into a single feature can prevent multicollinearity (see the sketch after the caution below).
⚠️ CAUTION!
Merging correlated features into a single feature will only work if the new feature has real-world meaning or the original features impact the target variable in a similar way.
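A minimal sketch of both remedies, reusing the hypothetical DataFrame from the detection sketch above:
```python
import numpy as np
import pandas as pd

# Same hypothetical correlated features as in the detection sketch
rng = np.random.default_rng(0)
df = pd.DataFrame({"floor_area": rng.uniform(50, 200, 100)})
df["land_area"] = df["floor_area"] * rng.uniform(1.1, 1.5, 100)
df["age"] = rng.uniform(0, 50, 100)

# Option 1: simply drop one of the correlated features
df_dropped = df.drop(columns=["land_area"])

# Option 2: merge them into a single feature
# (only if the combined feature has a real-world interpretation)
df_merged = df.assign(total_area=df["floor_area"] + df["land_area"])
df_merged = df_merged.drop(columns=["floor_area", "land_area"])
```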
6. Exogeneity (or No Endogeneity)
Exogeneity or no omitted variable bias is the final assumption on our list.
But let’s first understand what omitted variable bias actually is.
If a variable that impacts the target variable has been omitted from the model, and that variable is also related to the included features, then there is omitted variable bias, or Endogeneity, in the model.
For example, consider the following model.
$$UsedCarPrice_i = \beta_0 + \beta_1(DistanceTravelled)_i + \epsilon_i$$
Here, the price of a used car is modeled only on the distance it has already covered. However, the year of manufacture (i.e., the age of the car) impacts both the target variable (Y), the price of the used car, and the X variable, the distance travelled: the older the car, the more likely it is to have travelled a greater distance.
This is a clear case of omitted variable bias and it is undesirable for accurate modeling.
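A minimal simulation (with made-up numbers) makes the bias visible: the age of the car drives both the distance travelled and the price, so omitting it biases the coefficient on distance.
```python
import numpy as np
import statsmodels.api as sm

# Assumed data-generating process for illustration only
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(1, 15, n)                           # years
distance = 12_000 * age + rng.normal(0, 5_000, n)     # older cars travel further
price = 30_000 - 1_000 * age - 0.05 * distance + rng.normal(0, 1_000, n)

# Model that omits `age`: the distance coefficient soaks up part of the age effect
res_omitted = sm.OLS(price, sm.add_constant(distance)).fit()
print("Omitted-variable model:", res_omitted.params[1])   # noticeably more negative than the true -0.05

# Model that includes `age`: the distance coefficient is close to the true -0.05
res_full = sm.OLS(price, sm.add_constant(np.column_stack([distance, age]))).fit()
print("Full model:", res_full.params[1])
```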
💡 NOTE
Exogeneity in a model tells us that every feature that impacts the target variable (Y) is accounted for in the model features (X), so that the error term is not correlated with the independent variables.
Summary
So this was our discussion on the Assumptions of Linear Regression. This is one of the favorite questions of Data Scientist interviewers and now you know how to ace it!
Here is a quick summary of the same.
Linearity: There must be a linear relationship between the dependent and independent variables.
Homoscedasticity or Constant Error Variance: The variance of the errors is constant across all levels of the independent variables.
Independent Error Terms or No Autocorrelation: There is no correlation between successive error terms.
Normality of Residuals: The residuals or errors follow a normal distribution.
No multicollinearity: There exists no correlation between the different independent variables.
Exogeneity (No Endogeneity): There must be no relationship between the independent variables and the errors.
Keep this list handy when you prepare for your interviews.
Hope you enjoyed this! Feel free to leave your feedback and queries below.