Machine Learning - Decision Trees


Decision Tree

In this chapter, we will show you how to create a 'Decision Tree'. A decision tree is a flowchart that can help you make decisions based on past experience.

In this example, a person will try to decide whether they should go to a comedy show or not.

Luckily, the person in our example has registered every time a comedy show was held in town, recording some information about the comedian as well as whether they went to the show or not.

Age  Experience  Rank  Nationality  Go
36   10          9     UK           NO
42   12          4     USA          NO
23   4           6     N            NO
52   4           4     USA          NO
43   21          8     USA          YES
44   14          5     UK           NO
66   3           7     N            YES
35   14          9     UK           YES
52   13          7     N            YES
35   5           9     N            YES
24   3           5     USA          NO
18   3           7     UK           YES
45   9           9     UK           YES
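The examples that follow read this data from a file named shows.csv. If you don't already have that file, here is a minimal sketch that writes the table above to disk first (the filename matches the one used in the code below):

import pandas

# Write the example data above to shows.csv so the rest of the tutorial can read it
data = {
    'Age': [36, 42, 23, 52, 43, 44, 66, 35, 52, 35, 24, 18, 45],
    'Experience': [10, 12, 4, 4, 21, 14, 3, 14, 13, 5, 3, 3, 9],
    'Rank': [9, 4, 6, 4, 8, 5, 7, 9, 7, 9, 5, 7, 9],
    'Nationality': ['UK', 'USA', 'N', 'USA', 'USA', 'UK', 'N', 'UK', 'N', 'N', 'USA', 'UK', 'UK'],
    'Go': ['NO', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES', 'YES', 'YES', 'YES', 'NO', 'YES', 'YES'],
}
pandas.DataFrame(data).to_csv('shows.csv', index=False)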

Now, based on this dataset, Python can create a decision tree that can be used to decide whether it is worth attending any new shows.

How it works

First, import the required modules and use pandas to read the dataset:

Example

Read and print the dataset:

import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg

# Read the dataset into a DataFrame and print it
df = pandas.read_csv("shows.csv")
print(df)

Run Example

All data must be numeric to create a decision tree.

We must convert the non-numeric columns 'Nationality' and 'Go' to numeric.

Pandas has a map() method, which accepts a dictionary with information on how to convert the values.

{'UK': 0, 'USA': 1, 'N': 2}

This indicates that the value 'UK' is converted to 0, 'USA' to 1, and 'N' to 2.

Example

Convert string values to numbers:

# Convert 'Nationality' strings to numbers
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)

# Convert 'YES'/'NO' in 'Go' to 1/0
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)

Run Example

Then we must separate the feature columns from the target column.

The feature columns are the columns we try to predict from, and the target column is the column with the values we try to predict.

Example

X is the feature columns, and y is the target column:

features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)

Run Example

Now we can create the actual decision tree, fit it with our data, and save it as a .png file on the computer:

Example

Create a decision tree, save it as an image, and then display the image:

# Create and train the decision tree
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)

# Export the tree in Graphviz DOT format and save it as a .png image
data = tree.export_graphviz(dtree, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(data)
graph.write_png('mydecisiontree.png')

# Read the saved image back and display it
img = pltimg.imread('mydecisiontree.png')
imgplot = plt.imshow(img)
plt.show()

Run Example
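If you don't have pydotplus and Graphviz installed, newer versions of scikit-learn (0.21 and later) can also draw the tree directly with matplotlib, with no extra dependencies. A minimal sketch, assuming the dtree and features from the example above:

import matplotlib.pyplot as plt
from sklearn import tree

# Draw the fitted tree straight into a matplotlib figure (no Graphviz needed)
tree.plot_tree(dtree, feature_names=features)
plt.show()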

Result interpretation

The decision tree uses your previous decisions to calculate the odds of you wanting to go and see a comedian or not.

Let's read the different aspects of the decision tree:


Rank

Rank <= 6.5 means that comedians with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).

gini = 0.497 refers to the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 means all of the samples got the same result, and 0.5 means the split is done exactly in the middle.

samples = 13 means that there are 13 comedians left at this point in the decision, which is all of them, since this is the first step.

value = [6, 7] means that of these 13 comedians, 6 will get a 'NO' and 7 will get a 'GO'.

Gini

There are many ways to split samples, and we use the GINI method in this tutorial.

The Gini method uses the following formula:

Gini = 1 - (x/n)^2 - (y/n)^2

where x is the number of positive answers ('GO'), n is the number of samples, and y is the number of negative answers ('NO'). For the first step, that gives us this calculation:

1 - (7/13)^2 - (6/13)^2 = 0.497
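The same calculation as a few lines of Python, using the value = [6, 7] counts from the first box:

# Gini for the first box: 7 'GO' answers, 6 'NO' answers, 13 samples
x = 7
y = 6
n = 13
gini = 1 - (x / n) ** 2 - (y / n) ** 2
print(round(gini, 3))  # 0.497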


The next step contains two boxes: one box for the comedians with a 'Rank' of 6.5 or lower, and one box for the rest.

True - 5 comedians end here:

gini = 0.0 means that all of the samples got the same result.

samples = 5 means that there are 5 comedians left in this branch (5 comedians with a rank of 6.5 or lower).

value = [5, 0] means that 5 will get 'NO' and 0 will get 'GO'.

False - 8 comedians continue:

Nationality

Nationality <= 0.5 means that comedians with a nationality value less than or equal to 0.5 will follow the arrow to the left (which means everyone from the UK), and the rest will follow the arrow to the right.

gini = 0.219 means that about 22% of the samples would go in one direction.

samples = 8 means that there are 8 comedians left in this branch (8 comedians with a rank higher than 6.5).

value = [1, 7] means that of these 8 comedians, 1 will get 'NO' and 7 will get 'GO'.


True - 4 comedians continue:

Age

Age <= 35.5 means that comedians aged 35.5 or younger will follow the arrow to the left, and the rest will follow the arrow to the right.

gini = 0.375 means that about 37.5% of the samples would go in one direction.

samples = 4 means that there are 4 comedians left in this branch (4 comedians from the UK).

value = [1, 3] means that of these 4 comedians, 1 will get 'NO' and 3 will get 'GO'.

False - 4 comedians end here:

gini = 0.0 means that all of the samples got the same result.

samples = 4 means that there are 4 comedians left in this branch (4 comedians not from the UK).

value = [0, 4] means that of these 4 comedians, 0 will get 'NO' and 4 will get 'GO'.


True - 2 comedians end here:

gini = 0.0 means that all of the samples got the same result.

samples = 2 means that there are 2 comedians left in this branch (2 comedians aged 35.5 or younger).

value = [0, 2] means that of these 2 comedians, 0 will get 'NO' and 2 will get 'GO'.

False - 2 comedians continue:

Experience

Experience <= 9.5 means that comedians with 9.5 years of experience or less will follow the arrow to the left, and the rest will follow the arrow to the right.

gini = 0.5 means that 50% of the samples would go in one direction.

samples = 2 means that there are 2 comedians left in this branch (2 comedians older than 35.5).

value = [1, 1] means that of these 2 comedians, 1 will get 'NO' and 1 will get 'GO'.


True - 1 comedian ends here:

gini = 0.0 means that all of the samples got the same result.

samples = 1 means that there is 1 comedian left in this branch (1 comedian with 9.5 years of experience or less).

value = [0, 1] means that 0 comedians will get 'NO' and 1 will get 'GO'.

False - 1 comedian ends here:

gini = 0.0 means that all of the samples got the same result.

samples = 1 means that there is 1 comedian left in this branch (1 comedian with more than 9.5 years of experience).

value = [1, 0] means that 1 comedian will get 'NO' and 0 will get 'GO'.
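If you want to cross-check the splits described above without rendering an image, scikit-learn can also print the fitted tree as plain text. A minimal sketch, assuming the dtree and features from the earlier examples:

from sklearn.tree import export_text

# Print the tree as indented text, one line per split or leaf
print(export_text(dtree, feature_names=features))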

Predicted Values

We can use the decision tree to predict new values.

For example: should I go and see a show starring a 40-year-old American comedian with 10 years of experience and a comedy rank of 7?

Example

Use the predict() method to predict new values:

# Feature order: [Age, Experience, Rank, Nationality] - 1 means USA
print(dtree.predict([[40, 10, 7, 1]]))

Run Example

Example

What is the answer if the comedy rank is 6?

print(dtree.predict([[40, 10, 6, 1]]))

Run Example
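Note: recent versions of scikit-learn print a warning when a model that was fitted on a DataFrame is asked to predict from a plain list. If you want to avoid the warning, one option is to wrap the new sample in a DataFrame with the same column names (a sketch, not required for the examples above):

import pandas

# Same prediction, with column names matching the training data
sample = pandas.DataFrame([[40, 10, 6, 1]], columns=features)
print(dtree.predict(sample))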

Different Results

You will see that the decision tree gives you different results if you run it enough times, even if you feed it the same data.

That is because the decision tree does not give us a 100% certain answer. It is based on the probability of an outcome, and the answer will vary.
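If you do want reproducible results, DecisionTreeClassifier accepts a random_state parameter that fixes the random choices made while the tree is built. A minimal sketch:

from sklearn.tree import DecisionTreeClassifier

# With a fixed seed, the same data always produces the same tree
dtree = DecisionTreeClassifier(random_state=42)
dtree = dtree.fit(X, y)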