How Decision Tree Learning Works
Now that you have an understanding of how decision trees will be used, and what
the features of a good decision tree are, we can talk about how to actually
come up with an algorithm that learns one.
Since there are far too many possible trees for an algorithm to simply try
every one and use the best, we'll take a less time-consuming approach
that usually produces fairly good trees. In particular, we'll learn a tree
by splitting on attributes that tend to give a lot of information about the
classification.
For instance, in the following data set (which is not real), which might be
used to learn a decision tree to classify whether someone is at high risk for a
heart attack, it might make sense to use the age attribute, which is often correlated with having a heart attack.
| Age over 60? | Weight over 200? | Height over 6'0? | Had a heart attack? |
In this data, there is no single attribute that correctly classifies whether
every person has had a heart attack, but there are some attributes that
indicate whether it is more or less likely. For instance, 4 out of 5 people
over age sixty had heart attacks, and only 1 out of 5 under age sixty did.
This suggests that the age attribute is fairly informative about whether or not
someone may have a heart attack. If we were trying to learn a decision tree
from this data, since we want our decision tree to be short, age would probably
be a good attribute to use early in a decision tree.
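As a quick illustration, the counting above can be done in a few lines of Python. The records here are hypothetical, invented only to match the counts quoted above (4 of 5 people over sixty had heart attacks, 1 of 5 under sixty did); they are not a real data set:

```python
# Hypothetical records: (age_over_60, had_heart_attack), chosen only to
# match the proportions described in the text.
data = [
    (True, True), (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, False), (False, True),
]

def attack_rate(records, age_over_60):
    """Fraction of people in the given age group who had a heart attack."""
    group = [attack for over, attack in records if over == age_over_60]
    return sum(group) / len(group)

print(attack_rate(data, True))   # 0.8 -- much likelier over sixty
print(attack_rate(data, False))  # 0.2
```

The large difference between the two rates is exactly what makes age look like a promising attribute to split on.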
There is one catch here: there needs to be some way of automating or
formalizing the process of figuring out what is or is not an important
attribute. Otherwise, we're stuck having someone look at the data and say,
"yes, this looks informative" or "no, it looks like this doesn't tell us
anything at all".
Measuring Information Content using Entropy
It turns out that there is a measure, called entropy, that we can use to
determine which attribute is best to split on. What entropy measures is how much
'information' is contained in the classification of a piece of data. For
instance, if you told me that you flipped a normal coin, I would have
no idea if it had come up heads or tails. Telling me that it came up heads
gives me a lot of information that I couldn't have gotten just from knowing
that you had flipped the coin. On the other hand, if you told me that you
played in the lottery, telling me whether or not you won generally wouldn't
give me a lot of information--I wouldn't have expected you to win in the first
place. The information content of telling me about the coin is much higher
than the information content of telling me about the lottery. Some things can
have no information content at all if I would have known them anyway.
We can use entropy to decide how informative an attribute is by seeing how much
information it gives us. The more informative an attribute is, the less
information we get from being told what the classification is (because the value of
the attribute--whether someone is over sixty or not, for instance--has already given us
clues about the classification).
Of course, since each attribute can have multiple different values, it might
not be the case that the information gleaned from each value is the same. For
instance, if you were classifying someone as wealthy or poor, and one of the
attributes was, "Has over 1,000,000 dollars in cash", then if that attribute
had a value of true for any person, we'd classify that person as being wealthy.
On the other hand, not having 1,000,000 dollars in cash doesn't mean someone is
poor. They might have real estate or stock holdings. So the amount of
information given by this attribute varies depending on its value.
This means there needs to be a way of determining the overall usefulness of an
attribute to classifying inputs by accounting for the weighted average of the
information gain. For instance, if most wealthy people were wealthy because
they had over $1,000,000 in cash, then this attribute would be more valuable
than if only a very small number of wealthy people were wealthy because they
had $1,000,000 in cash. Why? Because even if we used that attribute, it would
be rare that anyone actually could be classified by using it.
When we're actually learning a decision tree, we'll only have the training data
to learn from. What we want to know is, given that we choose to split on a
given attribute, what is the average amount of information that having the
classification would give us? The goal is to split on the attribute that, if we knew its value, would tell us the most about the classification: for each of its possible values, we weight the information gained by the likelihood that the attribute takes on that value. That likelihood is simply the number of elements of the training set that have that value divided by the total number of elements in the training set.
Now let's look at the way we can compute the amount of information having a
classification for an example from the training set would give us, and then we'll put it all together to come up with the full decision tree learning algorithm.
The Entropy Formula
The formula for entropy is a bit strange looking at first, but we'll work through a few examples to see that it makes sense. The idea behind the formula is to describe how many bits of information you get from having the classification when you know the probabilities of the classification. In the following formula, P(ci) is the probability that some training instance has the ith classification of c, and log2(n) is the power to which one must raise 2 to get the value of n.
Entropy = sum from i = 1 to n of -P(ci) * log2(P(ci))
This is definitely strange looking! It turns out that it does make some sense, though. For instance, let's look at what happens when we choose the probability of two classifications as both being .5 (for instance, flipping a coin).
Entropy = - .5 * log2(.5) - .5 * log2(.5) = 1
The reason it equals one is that log2(.5) = log2(1/2) = log2(1) - log2(2) = 0 - 1 = -1 (by the rules of logarithms), so each term is -.5 * -1 = .5, and the two halves sum to 1.
Don't worry if you didn't follow that completely. The point is that knowing
the value of the coin flip turns out to give you exactly 1 bit of information
(think of it as having the answer to a single yes/no question). On the
other hand, if you had an event that you expected with probability .99, then
working through the entropy formula, you would find that being told the actual
result of the event would on average give you only about .08 bits of information!
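The entropy formula is short enough to write directly as a function, and doing so lets us check both numbers above:

```python
import math

def entropy(probs):
    """Bits of information from learning the classification, given the
    probability of each class. Classes with probability 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 -- a fair coin flip carries one full bit
print(entropy([0.99, 0.01]))  # ~0.0808 -- a near-certain event carries little
```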
Putting it together
Now that we can figure out the information gained by having a classification,
we can figure out which attribute we should choose so that on average we get
the most information gain from knowing it. This should result in short decision
trees because we try to maximize the amount of information we have, but there are times when this doesn't work perfectly (we'll talk about that next time).
Now, knowing that that is our strategy, how can we use the training data along
with the entropy function to do this?
First, remember that the training data contains both the values of each
attribute and a classification for each instance. This means we can estimate
the probability of a given classification just by counting the number of
instances with that classification and dividing by the total number of instances.
Second, when we're deciding which attribute to split on, we can pretend that
we've made the split and then partition the training set into subsets based on
the value of that attribute. (For instance, all the training examples in which
someone is over sixty might go in one set, and the rest of the training
examples into the other.) Then, for each subset, we compute the information
we'd get from having the actual classifications of its instances using the
entropy formula, and average these pieces of information together, weighting by
the frequency with which the attribute took on each value.
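This difference (the entropy before the split minus the weighted average entropy of the subsets) is the information gain of the attribute. A minimal sketch, assuming each training example is represented as a (dict of attribute values, classification) pair -- a representation chosen here for illustration:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of classifications, with probabilities estimated
    by counting, as described above."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy of the whole set minus the weighted average entropy of the
    subsets produced by splitting on `attribute`."""
    gain = entropy_of([c for _, c in examples])
    for value in {a[attribute] for a, _ in examples}:
        subset = [c for a, c in examples if a[attribute] == value]
        gain -= len(subset) / len(examples) * entropy_of(subset)
    return gain

# Hypothetical data matching the age counts from earlier in the article.
examples = ([({"over_60": True}, True)] * 4 + [({"over_60": True}, False)] +
            [({"over_60": False}, False)] * 4 + [({"over_60": False}, True)])
print(information_gain(examples, "over_60"))  # ~0.278 bits
```

The whole set has entropy 1 bit (five heart attacks out of ten), and each age subset has entropy of about 0.722 bits, so splitting on age gains roughly 0.278 bits.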
This would give us the best attribute to split on. Then we just need to re-run
the algorithm to create decision trees for each branch leading from the new
node. If it turns out that every element in a given branch would have the same
classification, we don't need to create a decision sub-tree, we just store the
classification. If it turns out that we're out of possible split attributes,
on the other hand, but still have more than one classification, we just pick the
most likely classification. For instance, if only one of two people who were
62, 5'6 and overweight had a heart attack, the classifications are tied, so we'd
just have to pick randomly whether to classify other examples with the same
attribute values one way or the other. This should only happen when the training
data is not internally consistent, so if you are fortunate enough to have clean
data, this shouldn't be a problem.
Remember, the algorithm works as follows:
1. If all of the training examples have the same classification, assign
   that classification to the node.
2. Otherwise, pick an attribute to split on by performing the following
   test for each attribute that hasn't already been used:
   a. compute the current entropy
   b. compute the weighted average of the entropy in each subset
      partitioned by the value of the given attribute
   c. compute the difference, and pick the attribute with the
      greatest change (the most information gain)
3. For each value of the chosen attribute, repeat the process using only the
   training examples that have that value for the attribute.
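The steps above can be sketched as a short recursive function. As before, each example is assumed to be a (dict of attribute values, classification) pair; this representation, and the names used, are illustrative rather than part of any standard API:

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of classifications, probabilities estimated by counting."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy before the split minus weighted average entropy after it."""
    gain = entropy_of([c for _, c in examples])
    for value in {a[attribute] for a, _ in examples}:
        subset = [c for a, c in examples if a[attribute] == value]
        gain -= len(subset) / len(examples) * entropy_of(subset)
    return gain

def learn_tree(examples, attributes):
    """Returns either a bare classification (a leaf) or a
    (attribute, {value: subtree}) pair (an internal node)."""
    labels = [c for _, c in examples]
    # Step 1: every example has the same classification -- store it directly.
    if len(set(labels)) == 1:
        return labels[0]
    # Out of attributes but classifications disagree: pick the most likely.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: split on the attribute with the greatest information gain.
    best = max(attributes, key=lambda attr: information_gain(examples, attr))
    remaining = [a for a in attributes if a != best]
    # Step 3: recurse on the examples that share each value of the attribute.
    branches = {}
    for value in {a[best] for a, _ in examples}:
        subset = [(a, c) for a, c in examples if a[best] == value]
        branches[value] = learn_tree(subset, remaining)
    return (best, branches)
```

For example, if attribute "A" perfectly determines the classification and "B" is irrelevant, `learn_tree` splits on "A" once and produces two leaves.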
This algorithm is guaranteed to stop because we split on each attribute at most once. This makes sense: once we've split on an attribute, splitting on it again wouldn't help, because we would already know its value everywhere in the sub-tree beneath it!
The next article in this series will discuss the issues with this approach to
learning decision trees and will look at some learning techniques that can
improve the performance of decision trees.