Machine Learning - Polynomial Regression

Polynomial Regression

If your data points clearly cannot be fitted by linear regression (a straight line through the data points), polynomial regression may be the ideal choice.

Like linear regression, polynomial regression uses the relationship between the variables x and y to find the best way to draw a curve through the data points.


Working principle

Python has several methods for finding the relationship between data points and drawing a polynomial regression line. We will show you how to use these methods rather than working through the mathematical formulas.

In the following example, we registered 18 cars as they passed a certain toll station.

We recorded each car's speed and the hour of the day it passed.

The x-axis represents the hour of the day, and the y-axis represents the speed:

Example

First draw the scatter plot:

import matplotlib.pyplot as plt
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
plt.scatter(x, y)
plt.show()

Result:



Example

Import numpy and matplotlib, then draw the polynomial regression line:

import numpy
import matplotlib.pyplot as plt
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

Result:



Example explanation

Import the required modules:

import numpy
import matplotlib.pyplot as plt

Create an array representing the values on the x and y axes:

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

NumPy has a method that allows us to establish a polynomial model:

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
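Under the hood, numpy.polyfit returns the fitted coefficients (highest power first), and numpy.poly1d wraps them in a callable polynomial object. As a quick sketch, you can inspect both (evaluating at x = 10 is just for illustration):

```python
import numpy

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

# polyfit finds the least-squares coefficients of a degree-3 polynomial
coeffs = numpy.polyfit(x, y, 3)

# poly1d turns the coefficient array into a callable polynomial
mymodel = numpy.poly1d(coeffs)

print(coeffs)       # four coefficients: a*x**3 + b*x**2 + c*x + d
print(mymodel(10))  # evaluate the fitted polynomial at x = 10
```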

Then specify how the line should be displayed; we start at position 1 and end at position 22:

myline = numpy.linspace(1, 22, 100)
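numpy.linspace(1, 22, 100) returns 100 evenly spaced values between 1 and 22, which serve as the x values for a smooth curve. A quick check:

```python
import numpy

myline = numpy.linspace(1, 22, 100)
print(len(myline), myline[0], myline[-1])  # 100 values, from 1.0 to 22.0
```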

Draw the original scatter plot:

plt.scatter(x, y)

Plot the polynomial regression line:

plt.plot(myline, mymodel(myline))

Display chart:

plt.show()

R-Squared

It is important to know how well the values on the x and y axes are related. If there is no relationship, polynomial regression cannot be used to predict anything.

This relationship is measured by a value called the coefficient of determination, or R-squared.

The R-squared value ranges from 0 to 1, where 0 means no relationship and 1 means a perfect (100%) relationship.

Python and the scikit-learn (sklearn) module will calculate this value for you; all you need to do is feed it the x and y arrays:

Example

How well does my data fit a polynomial regression?

import numpy
from sklearn.metrics import r2_score
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))


Note: The result of 0.94 indicates a good relationship, so we can use polynomial regression for future predictions.
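The degree (3 in these examples) is a modeling choice. As a hedged sketch (the variable names are illustrative), you can compare R-squared across several degrees to see how much the cubic improves on a straight line:

```python
import numpy
from sklearn.metrics import r2_score

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

scores = {}
for degree in (1, 2, 3):
    # fit a polynomial of the given degree and score it on the same data
    model = numpy.poly1d(numpy.polyfit(x, y, degree))
    scores[degree] = r2_score(y, model(x))
    print(degree, round(scores[degree], 3))
```

Keep in mind that a higher degree never lowers R-squared on the training data, so a degree that scores better here may simply be overfitting.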

Predicting future values

Now, we can use the collected information to predict future values.

For example, let's try to predict the speed of a car passing the toll station at around 5:00 PM:

To do this, we need the same mymodel polynomial as in the example above:

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

Example

Predict the speed of the car at 5:00 PM:

import numpy
from sklearn.metrics import r2_score
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
speed = mymodel(17)
print(speed)


The predicted speed in this example is 88.87, and we can also see it in the figure:
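The model object also accepts an array, so several hours can be predicted in one call. A small sketch (the chosen hours are illustrative):

```python
import numpy

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

hours = numpy.array([15, 17, 20])  # 3 PM, 5 PM, 8 PM
print(mymodel(hours))              # one predicted speed per hour
```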


Poor fit?

Let's create an example where polynomial regression is not the best method for predicting future values.

Example

These values for the x and y axes will cause the fit of polynomial regression to be very poor:

import numpy
import matplotlib.pyplot as plt
x = [89, 43, 36, 36, 95, 10, 66, 34, 38, 20, 26, 29, 48, 64, 6, 5, 36, 66, 72, 40]
y = [21, 46, 3, 35, 67, 95, 53, 72, 58, 10, 26, 34, 90, 33, 38, 20, 56, 2, 47, 15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(2, 95, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

Result:



What about the r-squared value?

Example

You should get a very low r-squared value.

import numpy
from sklearn.metrics import r2_score
x = [89, 43, 36, 36, 95, 10, 66, 34, 38, 20, 26, 29, 48, 64, 6, 5, 36, 66, 72, 40]
y = [21, 46, 3, 35, 67, 95, 53, 72, 58, 10, 26, 34, 90, 33, 38, 20, 56, 2, 47, 15]
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, mymodel(x)))


Result: 0.00995 indicates a very poor relationship and tells us that this dataset is not suitable for polynomial regression.