Decision Trees, Part I: How Decision Trees Work
Decision trees are conceptually fairly straight-forward and designed to work in
conditions of supervised learning. Before talking about the learning
algorithm, it's helpful to understand exactly how decision trees will be used
to solve problems once we've actually learned one.
(Up to General
Using decision trees
The idea behind decision trees is that, given an input with well-defined
attributes, you can classify the input entirely based on making choices about
each attribute. For instance, if you're told that someone has a piece of US
currency, you could ask what denomination it is; if the denomination is less
than one dollar, it must be the coin with that denomination. That could be one
branch of the decision tree that was split on the attribute of "value". On the
other hand, if the denomination happened to be one dollar, you'd then need to
ask what type of currency it is: if it were a bill, it would have to be a
If it were a coin, it could still be one of a few things (such as
a Susan B. Anthony Dollar or a Sacajawea dollar). Then your next question
might be, is it named after a female; but that wouldn't tell you anything at
all! This suggests that some attributes may be more valuable for making
decisions than others. In fact, we'll take advantage of this fact when learning decision trees; we'd obviously not want our algorithm to "learn" to distinguish between two things using an attribute that doesn't actually make that distinction!
A useful way of thinking about decision trees is as though each node of the
tree were a question to be answered. Once you've reached the leaves of the
tree, you've reached the answer! For instance, if you were choosing a computer
to buy, your first choice might be "Laptop or desktop". If you chose a
laptop, your second choice might be "Mac or PC". If you chose a desktop, you
might have more important factors than Mac or PC (perhaps you know that you
will buy a PC because you think they are faster, and if you want a desktop, you
want a fast machine).
In essence, with a decision tree, the algorithm is designed to make
choices in a similar fashion. Each of the questions corresponds to a node in
the decision tree, and each node has branches for each possible answer.
Eventually, the algorithm reaches a leaf node that contains the correct
classification of the input or the correct decision to make.
Visually, a decision tree looks something like this:
Desktop or Laptop?
Laptop / \ Desktop
Mac or PC? Dell or IBM?
/ \ / \
Mac / \ PC Dell/ \ IBM
iBook or PowerBook? Dell or IBM? ... ...
1 Ghz iBook 1.5 Ghz Powerbook
In the case of the decision tree learning algorithm, the questions at each node will correspond to a question about the value of an attribute on the input, and each branch from the node will correspond.
For example, to decide the exclusive-or function using a decision tree, there would be two inputs, one for each input to exclusive-or:
First input: true or false?
true / \ false
Second input? Second input?
/ \ / \
true / \false true / \ false
false true true false
Notice something about this tree: we had to use both attributes in every path
through the tree in order to make this classification! In this case, that was
unavoidable; we have to look at both inputs to know the answer to exclusive-or.
However, in many cases, we'd like to avoid having to do this because if we look
at every attribute, then we're not doing much better than memorizing the
inputs. There may be a few paths through the tree that require looking at everyattribute, but it would be nice to be able to limit at least some of the paths.
Are there times when we can avoid memorizing the inputs? Sure! For instance, the boolean-and function can only be true if both of its inputs are true. This means that in at least one branch of the decision tree, we can immediately classify the input as false.
Learning Decision Trees
Now that we have an idea of how decision trees work, we can think about how one
could go about learning a good decision tree. First, we need to have a sense
of what a "good" decision tree is! As you might have guessed from the
discussion about memorizing inputs, we'd like our decision tree to generalize
well and not over-learn the inputs. To do this, we want trees that look at as
few attributes as possible because the more attributes the tree looks at, the more likely it is encoding the exact training set.
One way of thinking about this requirement is that we want the shortest decision trees possible (we'd like to find the shortest, simplest way of characterizing the inputs). This introduces a bit of the bias that we know is necessary for learning by favoring simpler solutions over more complex solutions. We make this choice because it works well in general, and has certainly been a useful principle for scientists trying to come up with explanations for data.
So our goal is to find the shortest decision tree possible. Unfortunately,
this is rather difficult; there's no known way of generating the shortest
possible tree without looking at nearly every possible tree. Unfortunately,
the number of possible decision trees is exponential in the number of
attributes. We're forced to resort to using approaches that merely yield
The approach we'll take is one of generating a reasonably short tree using a
greedy algorithm that always makes the choice that appears best at the time.
The next article in this series will cover the techniques used by this approach
and discuss some of the pitfalls. We'll also look at some ways of improving
the performance of decision trees.