Machine Learning - Decision Trees

Decision Tree
In this chapter we will show you how to make a 'Decision Tree'. A decision tree is a flow chart that can help you make decisions based on previous experience.
In this example, a person will try to decide whether or not they should go to a comedy show.
Luckily, the person in our example has registered every time a comedy show was held in town, recording some information about the comedian as well as whether or not they went.
| Age | Experience | Rank | Nationality | Go |
|-----|------------|------|-------------|----|
| 36 | 10 | 9 | UK | NO |
| 42 | 12 | 4 | USA | NO |
| 23 | 4 | 6 | N | NO |
| 52 | 4 | 4 | USA | NO |
| 43 | 21 | 8 | USA | YES |
| 44 | 14 | 5 | UK | NO |
| 66 | 3 | 7 | N | YES |
| 35 | 14 | 9 | UK | YES |
| 52 | 13 | 7 | N | YES |
| 35 | 5 | 9 | N | YES |
| 24 | 3 | 5 | USA | NO |
| 18 | 3 | 7 | UK | YES |
| 45 | 9 | 9 | UK | YES |
Now, based on this dataset, Python can create a decision tree that can be used to decide whether it is worth attending any new shows.
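The examples below read the data from a file named shows.csv. If you do not have that file yet, here is a minimal sketch that writes the table above to disk (the filename shows.csv is taken from the code below; the values are just the data from the table):

```python
import pandas

# Build the dataset from the table above and save it as shows.csv,
# the file the rest of this chapter assumes exists.
data = {
    'Age': [36, 42, 23, 52, 43, 44, 66, 35, 52, 35, 24, 18, 45],
    'Experience': [10, 12, 4, 4, 21, 14, 3, 14, 13, 5, 3, 3, 9],
    'Rank': [9, 4, 6, 4, 8, 5, 7, 9, 7, 9, 5, 7, 9],
    'Nationality': ['UK', 'USA', 'N', 'USA', 'USA', 'UK', 'N', 'UK', 'N', 'N', 'USA', 'UK', 'UK'],
    'Go': ['NO', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES', 'YES', 'YES', 'YES', 'NO', 'YES', 'YES'],
}
pandas.DataFrame(data).to_csv('shows.csv', index=False)
```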
How it works
First, import the required modules and use pandas to read the dataset:
Example
Read and print the dataset:
```python
import pandas
from sklearn import tree
import pydotplus
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import matplotlib.image as pltimg

df = pandas.read_csv("shows.csv")
print(df)
```
To make a decision tree, all data has to be numerical.
We have to convert the non-numerical columns 'Nationality' and 'Go' into numerical values.
Pandas has a map() method that takes a dictionary with information on how to convert the values:
{'UK': 0, 'USA': 1, 'N': 2}
This indicates that the value 'UK' is converted to 0, 'USA' to 1, and 'N' to 2.
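As a minimal standalone sketch of how map() behaves (the Series here is made up purely for illustration):

```python
import pandas

# A toy Series, not part of the shows dataset.
s = pandas.Series(['UK', 'USA', 'N', 'UK'])

# map() replaces each value using the dictionary.
print(s.map({'UK': 0, 'USA': 1, 'N': 2}))  # 0, 1, 2, 0
```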
Example
Convert string values to numbers:
```python
d = {'UK': 0, 'USA': 1, 'N': 2}
df['Nationality'] = df['Nationality'].map(d)
d = {'YES': 1, 'NO': 0}
df['Go'] = df['Go'].map(d)
print(df)
```
Then we have to separate the feature columns from the target column.
The feature columns are the columns we try to predict from, and the target column is the column with the values we try to predict.
Example
X is the feature columns, y is the target column:
```python
features = ['Age', 'Experience', 'Rank', 'Nationality']
X = df[features]
y = df['Go']
print(X)
print(y)
```
Now we can create the actual decision tree, fit it with our details, and save it as a .png file on the computer:
Example
Create a decision tree, save it as an image, and then display the image:
```python
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X, y)
data = tree.export_graphviz(dtree, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(data)
graph.write_png('mydecisiontree.png')

img = pltimg.imread('mydecisiontree.png')
imgplot = plt.imshow(img)
plt.show()
```
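If you do not have Graphviz and pydotplus installed, a sketch of an alternative that draws the fitted tree with scikit-learn's built-in plot_tree function (available since scikit-learn 0.21) could look like this:

```python
import matplotlib.pyplot as plt
from sklearn import tree

# Render the fitted tree without the Graphviz/pydotplus dependency.
plt.figure(figsize=(10, 8))
tree.plot_tree(dtree, feature_names=features, filled=True)
plt.savefig('mydecisiontree.png')
plt.show()
```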
Result interpretation
The decision tree uses your previous decisions to calculate the odds of you wanting to go see a comedian.
Let's read the different aspects of the decision tree:

Rank
Rank <= 6.5
Indicates that comedians with a rank of 6.5 or lower will follow the True arrow (to the left), and the rest will follow the False arrow (to the right).
gini = 0.497
Indicates the quality of the split, and is always a number between 0.0 and 0.5, where 0.0 means all of the samples got the same result, and 0.5 means the split is done exactly in the middle.
samples = 13
Indicates that there are 13 comedians left at this point in the decision, which is all of them, since this is the first step.
value = [6, 7]
Indicates that of these 13 comedians, 6 will get a 'NO' and 7 will get a 'GO'.
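You can verify these counts directly from the data. A quick sketch, assuming it runs after the mapping step above:

```python
# Count how many comedians got 'NO' (0) and 'GO' (1) in the whole dataset.
print(y.value_counts().sort_index())  # 6 zeros ('NO') and 7 ones ('GO')
```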
Gini
There are many ways to split the samples; we use the GINI method in this tutorial.
The Gini method uses this formula:
Gini = 1 - (x/n)² - (y/n)²
Where x is the number of positive answers ('GO'), n is the number of samples, and y is the number of negative answers ('NO'), which gives us this calculation:
1 - (7/13)² - (6/13)² = 0.497
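As a quick check of that arithmetic, here is a minimal sketch of the same calculation in Python (the gini function is our own helper, not part of scikit-learn):

```python
def gini(x, n):
    # Gini impurity for a binary split: x positive answers out of n samples.
    y = n - x
    return 1 - (x / n) ** 2 - (y / n) ** 2

print(round(gini(7, 13), 3))  # 0.497
```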

The next step contains two boxes: one box for the comedians with a 'Rank' of 6.5 or lower, and one box for the rest.
True - 5 comedians end here:
gini = 0.0
Indicates that all samples yield the same result.
samples = 5
Indicates that there are still 5 comedians left in this branch (5 comedians with a rank of 6.5 or lower).
value = [5, 0]
Indicating that 5 comedians get a 'NO' and 0 get a 'GO'.
False - 8 comedians continue:
Nationality
Nationality <= 0.5
Meaning comedians with a nationality value less than or equal to 0.5 will follow the left arrow (this means everyone from the UK), and the rest will follow the right arrow.
gini = 0.219
Meaning about 22% of the samples will move in one direction.
samples = 8
Meaning there are still 8 comedians left in this branch (8 comedians with a rank higher than 6.5).
value = [1, 7]
Indicating that among these 8 comedians, 1 will receive 'NO', and 7 will receive 'GO'.

True - 4 comedians continue:
Age
Age <= 35.5
Meaning comedians aged 35.5 years or under will follow the left arrow, and the rest will follow the right arrow.
gini = 0.375
Meaning about 37.5% of the samples will move in one direction.
samples = 4
Meaning there are still 4 comedians left in this branch (4 comedians from the UK).
value = [1, 3]
Indicating that among these 4 comedians, 1 will receive 'NO', and 3 will receive 'GO'.
False - 4 comedians end here:
gini = 0.0
It indicates that all samples get the same result.
samples = 4
Meaning there are still 4 comedians left in this branch (4 comedians not from the UK).
value = [0, 4]
Indicating that among these 4 comedians, 0 will receive 'NO', and 4 will receive 'GO'.

True - 2 comedians end here:
gini = 0.0
It indicates that all samples get the same result.
samples = 2
Meaning there are still 2 comedians left in this branch (2 comedians who are 35.5 years old or younger).
value = [0, 2]
Indicating that of these 2 comedians, 0 will receive 'NO', and 2 will receive 'GO'.
False - 2 comedians continue:
Experience
Experience <= 9.5
Meaning comedians with 9.5 years of experience or less will follow the arrow on the left, and the rest will follow the arrow on the right.
gini = 0.5
Indicating that 50% of the samples will move in one direction.
samples = 2
Meaning there are still 2 comedians left in this branch (2 comedians over 35.5 years old).
value = [1, 1]
Indicating that one of the two comedians will receive 'NO', and the other will receive 'GO'.

True - 1 comedian ends here:
gini = 0.0
It indicates that all samples get the same result.
samples = 1
It indicates that there is still 1 comedian left in this branch (1 comedian with 9.5 years or less of experience).
value = [0, 1]
It indicates that 0 comedians get a 'NO' and 1 comedian gets a 'GO'.
False - 1 comedian ends here:
gini = 0.0
It indicates that all samples get the same result.
samples = 1
It indicates that there is still 1 comedian left in this branch (1 comedian with more than 9.5 years of experience).
value = [1, 0]
It indicates that 1 comedian gets a 'NO' and 0 get a 'GO'.
Predicted Values
We can use the decision tree to predict new values.
For example: Should I go see a show starring a 40-year-old American comedian with 10 years of experience and a comedy rank of 7?
Example
Use the predict() method to predict new values:
```python
print(dtree.predict([[40, 10, 7, 1]]))
```
Example
What would the answer be if the comedy rank were 6?
```python
print(dtree.predict([[40, 10, 6, 1]]))
```
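Because the model was fitted on a pandas DataFrame, recent versions of scikit-learn warn when you predict from a plain list of lists. A sketch that avoids the warning by passing the new sample as a DataFrame with the same column names (assuming scikit-learn 1.0 or later):

```python
import pandas

# Use the same column names as during fitting so the feature names match.
new_show = pandas.DataFrame([[40, 10, 6, 1]], columns=features)
print(dtree.predict(new_show))
```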
Different Results
You will see that the decision tree can give you different results if you run it enough times, even if you feed it the same data.
That is because the decision tree does not give us a 100% certain answer. The answer is based on the probability of an outcome, and it may vary.
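A quick sketch of where that variability comes from and how to pin it down: DecisionTreeClassifier accepts a random_state parameter, and fixing it makes the fitted tree, and therefore its predictions, reproducible (the loop below is only for illustration):

```python
# Refit the tree with different seeds to see how the prediction can change;
# fixing random_state makes a single run reproducible.
for seed in range(3):
    t = DecisionTreeClassifier(random_state=seed).fit(X, y)
    print(seed, t.predict([[40, 10, 6, 1]]))
```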