Machine Learning - Linear Regression

Regression

The term 'regression' is used when you try to find relationships between variables.

In machine learning and statistical modeling, this relationship is used to predict the outcome of future events.

Linear regression

Linear regression uses the relationship between the data points to draw a straight line through all of them.

This line can be used to predict future values.


In machine learning, predicting the future is very important.

Working Principle

Python has methods for finding the relationship between data points and for drawing a linear regression line. We will show you how to use these methods instead of going through the mathematical formulas.

In the following example, the x-axis represents the age of the car, and the y-axis represents the speed. We have recorded the age and speed of 13 cars as they passed a toll station. Let's see if the data we have collected can be used for linear regression:

Example

First draw the scatter plot:

import matplotlib.pyplot as plt
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
plt.scatter(x, y)
plt.show()

Result: a scatter plot of the 13 recorded car ages and speeds.

Example

Import scipy and draw the linear regression line:

import matplotlib.pyplot as plt
from scipy import stats
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
  return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

Result: the scatter plot with the linear regression line drawn through the data points.

Example Explanation

Import the required modules:

import matplotlib.pyplot as plt
from scipy import stats

Create the arrays that represent the values of the x- and y-axis:

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

Execute a method that returns some important key values of linear regression:

slope, intercept, r, p, std_err = stats.linregress(x, y)
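
If you are curious what each of these key values contains, you can print them individually. This is just a quick check, assuming the variables from the line above have already been assigned:

print("slope:", slope)          # steepness of the fitted line
print("intercept:", intercept)  # where the line crosses the y-axis
print("r:", r)                  # correlation coefficient
print("p:", p)                  # p-value for the test that the slope is zero
print("std_err:", std_err)      # standard error of the estimated slope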

Create a function that uses the slope and intercept values to return a new value. This new value represents where on the y-axis the corresponding x value will be placed:

def myfunc(x):
  return slope * x + intercept

Run each value of the x array through the function. This will produce a new array with new values on the y-axis:

mymodel = list(map(myfunc, x))
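
The same list can also be built with a list comprehension instead of map(); this equivalent sketch produces exactly the same mymodel values:

mymodel = [myfunc(value) for value in x]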

Plot the original scatter plot:

plt.scatter(x, y)

Plot the linear regression line:

plt.plot(x, mymodel)

Display the graph:

plt.show()

R-Squared

It is important to know how well the values on the x-axis and y-axis are related. If there is no relationship, linear regression cannot be used to predict anything.

This relationship is measured by the correlation coefficient, called r.

The r value ranges from -1 to 1, where 0 means no relationship, and 1 (or -1) means a perfect relationship. Squaring r gives the coefficient of determination, r-squared, which ranges from 0 to 1.

Python and the Scipy module will calculate this value for you. All you need to do is provide it with the x and y values:

Example

How well does my data fit a linear regression?

from scipy import stats
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)


Note: The result of -0.76 shows that there is a relationship, but not a perfect one. It does indicate, however, that we can use linear regression for future predictions.
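
If you want the coefficient of determination (r-squared) rather than r, you can simply square the value returned above. A minimal sketch, assuming r from the example:

r_squared = r ** 2
print(r_squared)   # always between 0 and 1; roughly 0.58 for this data set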

Predict future values

Now, we can use the collected information to predict future values.

For example: let's try to predict the speed of a 10-year-old car.

To do this, we need the same myfunc() function as in the previous example:

def myfunc(x):
  return slope * x + intercept

Example

Predict the speed of a 10-year-old car:

from scipy import stats
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
  return slope * x + intercept
speed = myfunc(10)
print(speed)


The predicted speed is 85.6, which also matches what we can read from the regression line in the diagram above.


Poor fit?

Let's create an example where linear regression is not the best method for predicting future values.

Example

These values for the x and y axes will result in a very poor fit for linear regression:

import matplotlib.pyplot as plt
from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
  return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

Result: a scatter plot where the fitted line clearly does not follow the data points.

And the r value?

Example

You should get a very low r value.

from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]
slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)


Result: 0.013 indicates a very poor relationship and tells us that the dataset is not suitable for linear regression.