AI Horizon: Intelligence and Reason in Computers
How Decision Tree Learning Works

Now that you have an understanding of how decision trees are used and what the features of a good decision tree are, we can talk about how to actually come up with an algorithm that learns one.

Since there are far too many possible trees for an algorithm to simply try every one and use the best, we'll take a less time-consuming approach that usually produces fairly good trees. In particular, we'll learn a tree by splitting on attributes that tend to give a lot of information about the data.

For instance, in the following data set (which is not real), which might be used to learn a decision tree to classify whether someone is at high risk for a heart attack, it might make sense to use the age attribute, which is often correlated with having a heart attack.
Age over 60?   Weight over 200?   Height over 6'0?   Had a heart attack?
Yes            No                 No                 Yes
No             Yes                Yes                No
No             Yes                Yes                Yes
No             No                 Yes                No
Yes            Yes                No                 No
Yes            No                 Yes                Yes
No             Yes                Yes                No
No             No                 No                 No
Yes            No                 No                 Yes
Yes            Yes                No                 Yes

In this data, there is no single attribute that correctly classifies whether every person has had a heart attack, but there are some attributes that indicate whether it is more or less likely. For instance, 4 out of 5 people over age sixty had heart attacks, and only 1 out of 5 under age sixty did. This suggests that the age attribute is fairly informative about whether or not someone may have a heart attack. Since we want our decision tree to be short, age would probably be a good attribute to use early in the tree.
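Those counts are easy to check in code. Below is a minimal sketch with the table above encoded as tuples; the field order and the 1-for-Yes, 0-for-No encoding are my own choices:

```python
# Each row: (over_60, over_200, over_6ft, heart_attack); 1 = Yes, 0 = No.
data = [
    (1, 0, 0, 1),
    (0, 1, 1, 0),
    (0, 1, 1, 1),
    (0, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 0, 1, 1),
    (0, 1, 1, 0),
    (0, 0, 0, 0),
    (1, 0, 0, 1),
    (1, 1, 0, 1),
]

# Split on age and count heart attacks in each group.
over_60 = [r for r in data if r[0] == 1]
under_60 = [r for r in data if r[0] == 0]
print(sum(r[-1] for r in over_60), "of", len(over_60))    # 4 of 5
print(sum(r[-1] for r in under_60), "of", len(under_60))  # 1 of 5
```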

There is one catch here: there needs to be some way of automating or formalizing the process of figuring out what is or is not an important attribute. Otherwise, we're stuck having someone look at the data and say, "yes, this looks informative" or "no, it looks like this doesn't tell us anything at all".

Measuring Information Content using Entropy

It turns out that there is a quantity, called entropy, that we can use to determine which attribute is best to split on. What entropy measures is how much 'information' is contained in the classification of a piece of data. For instance, if you told me that you flipped a normal coin, I would have no idea whether it had come up heads or tails. Telling me that it came up heads gives me a lot of information that I couldn't have gotten just from knowing that you had flipped the coin. On the other hand, if you told me that you played the lottery, telling me whether or not you won generally wouldn't give me much information--I wouldn't have expected you to win in the first place. The information content of telling me about the coin is much higher than the information content of telling me about the lottery. Some things can have no information content at all if I would have known them anyway.

We can use entropy to decide how informative an attribute is by seeing how much information it gives us. The more informative an attribute is, the less information we get from being told what the classification is (because the value of the attribute--whether someone is over sixty or not, for instance--will already give us clues about the classification).

Of course, since each attribute can have multiple different values, it might not be the case that the information gleaned from each value is the same. For instance, if you were classifying someone as wealthy or poor, and one of the attributes was, "Has over 1,000,000 dollars in cash", then if that attribute had a value of true for any person, we'd classify that person as being wealthy. On the other hand, not having 1,000,000 dollars in cash doesn't mean someone is poor. They might have real estate or stock holdings. So the amount of information given by this attribute varies based on its value.

This means there needs to be a way of determining the overall usefulness of an attribute for classifying inputs by taking the weighted average of the information gained from each of its values. For instance, if most wealthy people were wealthy because they had over $1,000,000 in cash, then this attribute would be more valuable than if only a very small number of wealthy people were wealthy because they had $1,000,000 in cash. Why? Because even if we used that attribute, it would be rare that anyone actually could be classified by using it.

When we're actually learning a decision tree, we'll only have the training data to learn from. What we want to know is, given that we choose to split on a given attribute, what is the average amount of information that having the classification would give us? The goal is to split on the attribute that, if we knew it, would tell us the most information for each of its possible values, weighted by the likelihood that the attribute will take on each value. That likelihood is simply the number of elements of the training set that have that value divided by the total number of elements in the training set.

Now let's look at the way we can compute the amount of information having a classification for an example from the training set would give us, and then we'll put it all together to come up with the full decision tree learning algorithm.

The Entropy Formula

The formula for entropy is a bit strange looking at first, but we'll work through a few examples to see that it makes sense. The idea behind the formula is to describe how many bits of information you get from having the classification when you know the probabilities of each classification. In the following formula, P(ci) is the probability that a training instance has the ith classification, ci, and log2(n) is the power to which one must raise 2 to get the value of n.

Sum(i = 1 to n) of -P(ci) * log2(P(ci))

This is definitely strange looking! It turns out that it does make some sense, though. For instance, let's look at what happens when we choose the probability of two classifications as both being .5 (for instance, flipping a coin).

-P(c1) * log2(P(c1)) - P(c2) * log2(P(c2))

= - .5 * log2(.5) - .5 * log2(.5)

= 1

The reason it equals one is that log2(.5) = log2(1/2) = log2(1) - log2(2) = 0 - 1 = -1 (by the rules of logarithms), so each of the two terms is -.5 * -1 = .5.

Don't worry if you didn't follow that completely. The point is that knowing the value of the coin flip turns out to give you exactly 1 bit of information (think of it as having the answer to a single yes/no classification). On the other hand, if you had an event that you expected with probability .99, then working through the entropy formula, you would find that being told the actual result of the event would only give you about .08 bits of information!
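Worked out in code, both examples behave as described. Here is a minimal sketch of the formula (the function name is my own):

```python
import math

def entropy(probabilities):
    """Bits of information in a classification, given each class's probability."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))              # 1.0 -- the fair coin gives exactly 1 bit
print(round(entropy([0.99, 0.01]), 2))  # 0.08 -- the near-certain event tells us almost nothing
```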

Putting it together

Now that we can figure out the information gained by having a classification, we can figure out which attribute we should choose so that on average we get the most information gain from knowing it. This should result in short decision trees because we try to maximize the amount of information we have, but there are times when this doesn't work perfectly (we'll talk about that next time).

Now that we know our strategy, how can we use the training data along with the entropy function to carry it out?

First, remember that the training data contains both the values of each attribute and a classification for each instance. This means we can estimate the probability of a given classification just by counting the instances with a particular classification and dividing by the total number of instances. Second, when we're deciding which attribute to split on, we can pretend that we've made the split and then partition the training set into subsets based on the value of that attribute. (For instance, all the training examples in which someone is over sixty might go in one set, and the rest of the training examples into the other.) Then we compute the information we'd get from having the actual classifications of all instances in the training set using the entropy formula and average together these pieces of information, weighting by the frequency with which the attribute took on each value.
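This computation--entropy before the split, minus the weighted entropy of the resulting subsets--can be sketched as follows. The data set is the heart-attack table from above, encoded with 1 for Yes and 0 for No; the encoding and the function names are my own:

```python
import math
from collections import Counter

# The (made-up) heart-attack table: (over_60, over_200, over_6ft, heart_attack); 1 = Yes.
data = [(1,0,0,1), (0,1,1,0), (0,1,1,1), (0,0,1,0), (1,1,0,0),
        (1,0,1,1), (0,1,1,0), (0,0,0,0), (1,0,0,1), (1,1,0,1)]

def entropy(labels):
    """Bits of information in a list of classifications."""
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr):
    """Entropy of the whole set, minus the entropy of each subset produced by
    splitting on `attr`, weighted by how often the attribute takes each value."""
    before = entropy([r[-1] for r in rows])
    after = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

for attr, name in enumerate(["age", "weight", "height"]):
    print(name, round(information_gain(data, attr), 3))
```

On this data, age wins by a wide margin (about .278 bits of gain against .029 for weight and height), matching the intuition from the table.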

This would give us the best attribute to split on. Then we just need to re-run the algorithm to create decision trees for each branch leading from the new node. If it turns out that every element in a given branch has the same classification, we don't need to create a decision sub-tree; we just store the classification. If, on the other hand, we're out of possible split attributes but still have more than one classification, we just pick the most likely one. For instance, if exactly one of two people who were 62, 5'6, and overweight had a heart attack, the two classifications would be equally likely, and we'd just have to pick arbitrarily how to classify other examples with the same attribute values. This can only happen when the training data is not internally consistent, so if you are fortunate enough to have clean data, it shouldn't be a problem.

Remember, the algorithm works as follows:
If all of the training examples have the same classification, assign
that classification to the node.

Otherwise, pick an attribute to split on by performing the following
test for each attribute that hasn't already been used:

        compute the current entropy

        compute the weighted average of the entropy in each subset
        partitioned by the value of the given attribute

        compute the difference, and pick the attribute with the
        greatest change (the most information gain)

For each value of the chosen attribute, repeat the process using only
the training examples that have that value for the attribute.
This algorithm is guaranteed to stop because we split on each attribute at most once along any path through the tree. Once we've split on an attribute, every example in the sub-tree beneath shares the same value for it, so splitting on it again couldn't tell us anything new!
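The pseudocode above can be sketched as a short recursive function. This is only one possible representation--here a leaf is a bare classification and an internal node is an (attribute, branches) pair--and the names and encoding are my own:

```python
import math
from collections import Counter

# The (made-up) heart-attack table: (over_60, over_200, over_6ft, heart_attack); 1 = Yes.
data = [(1,0,0,1), (0,1,1,0), (0,1,1,1), (0,0,1,0), (1,1,0,0),
        (1,0,1,1), (0,1,1,0), (0,0,0,0), (1,0,0,1), (1,1,0,1)]

def entropy(labels):
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

def learn_tree(rows, attributes):
    """Leaves are classifications; internal nodes are (attribute, {value: subtree})."""
    labels = [r[-1] for r in rows]
    # Every example agrees, or we're out of attributes: store a classification
    # (the most common one, with ties broken arbitrarily).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # The smallest weighted subset entropy ("remainder") means the greatest
    # information gain, since the entropy before the split is the same for all.
    def remainder(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            subset = [r[-1] for r in rows if r[attr] == value]
            total += len(subset) / len(rows) * entropy(subset)
        return total
    best = min(attributes, key=remainder)
    remaining = [a for a in attributes if a != best]
    return (best, {value: learn_tree([r for r in rows if r[best] == value], remaining)
                   for value in set(r[best] for r in rows)})

def classify(tree, row):
    """Walk the tree until a leaf (a non-tuple) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[row[attribute]]
    return tree

tree = learn_tree(data, [0, 1, 2])
print(tree[0])  # 0 -- the root splits on age
```

Note that this data set is not internally consistent (rows two, three, and seven share the same attribute values but not the same classification), so no tree can classify every training example correctly; the majority-vote fallback handles exactly that case.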
The next article in this series will discuss the issues with this approach to learning decision trees and will look at some learning techniques that can improve the performance of decision trees.

All content is written and published by the people at or affiliated with AI Horizon <http://www.aihorizon.com/>.
Send any comments and suggestions to comments@aihorizon.com.

Please report any errors to webmaster@aihorizon.com.