Basic CS General AI Chess AI Go AI

 Source CodeAI Tips and Tricks Books Links

Decision Trees, Part I: How Decision Trees Work
(Up to General AI)

Decision trees are conceptually fairly straight-forward and designed to work in conditions of supervised learning. Before talking about the learning algorithm, it's helpful to understand exactly how decision trees will be used to solve problems once we've actually learned one.

### Using decision trees

The idea behind decision trees is that, given an input with well-defined attributes, you can classify the input entirely based on making choices about each attribute. For instance, if you're told that someone has a piece of US currency, you could ask what denomination it is; if the denomination is less than one dollar, it must be the coin with that denomination. That could be one branch of the decision tree that was split on the attribute of "value". On the other hand, if the denomination happened to be one dollar, you'd then need to ask what type of currency it is: if it were a bill, it would have to be a dollar bill.

If it were a coin, it could still be one of a few things (such as a Susan B. Anthony Dollar or a Sacajawea dollar). Then your next question might be, is it named after a female; but that wouldn't tell you anything at all! This suggests that some attributes may be more valuable for making decisions than others. In fact, we'll take advantage of this fact when learning decision trees; we'd obviously not want our algorithm to "learn" to distinguish between two things using an attribute that doesn't actually make that distinction!

A useful way of thinking about decision trees is as though each node of the tree were a question to be answered. Once you've reached the leaves of the tree, you've reached the answer! For instance, if you were choosing a computer to buy, your first choice might be "Laptop or desktop". If you chose a laptop, your second choice might be "Mac or PC". If you chose a desktop, you might have more important factors than Mac or PC (perhaps you know that you will buy a PC because you think they are faster, and if you want a desktop, you want a fast machine).

In essence, with a decision tree, the algorithm is designed to make choices in a similar fashion. Each of the questions corresponds to a node in the decision tree, and each node has branches for each possible answer. Eventually, the algorithm reaches a leaf node that contains the correct classification of the input or the correct decision to make.

Visually, a decision tree looks something like this:
```			Desktop or Laptop?
/		\
Laptop /                 \ Desktop
Mac or PC?		Dell or IBM?
/	\		/         \
Mac  /	 \ PC	   Dell/           \ IBM
iBook or PowerBook?  Dell or IBM? ...          ...
/       \
/       	 \
1 Ghz iBook  1.5 Ghz Powerbook
```

In the case of the decision tree learning algorithm, the questions at each node will correspond to a question about the value of an attribute on the input, and each branch from the node will correspond.

For example, to decide the exclusive-or function using a decision tree, there would be two inputs, one for each input to exclusive-or:
```			First input: true or false?
/		\
true   /		 \ false
Second input?		Second input?
/	 \		/	\
true /	  \false true  / 	 \ false
false	true	     true	false
```
Notice something about this tree: we had to use both attributes in every path through the tree in order to make this classification! In this case, that was unavoidable; we have to look at both inputs to know the answer to exclusive-or. However, in many cases, we'd like to avoid having to do this because if we look at every attribute, then we're not doing much better than memorizing the inputs. There may be a few paths through the tree that require looking at everyattribute, but it would be nice to be able to limit at least some of the paths.

Are there times when we can avoid memorizing the inputs? Sure! For instance, the boolean-and function can only be true if both of its inputs are true. This means that in at least one branch of the decision tree, we can immediately classify the input as false.

### Learning Decision Trees

Now that we have an idea of how decision trees work, we can think about how one could go about learning a good decision tree. First, we need to have a sense of what a "good" decision tree is! As you might have guessed from the discussion about memorizing inputs, we'd like our decision tree to generalize well and not over-learn the inputs. To do this, we want trees that look at as few attributes as possible because the more attributes the tree looks at, the more likely it is encoding the exact training set.

One way of thinking about this requirement is that we want the shortest decision trees possible (we'd like to find the shortest, simplest way of characterizing the inputs). This introduces a bit of the bias that we know is necessary for learning by favoring simpler solutions over more complex solutions. We make this choice because it works well in general, and has certainly been a useful principle for scientists trying to come up with explanations for data.

So our goal is to find the shortest decision tree possible. Unfortunately, this is rather difficult; there's no known way of generating the shortest possible tree without looking at nearly every possible tree. Unfortunately, the number of possible decision trees is exponential in the number of attributes. We're forced to resort to using approaches that merely yield reasonable results.

The approach we'll take is one of generating a reasonably short tree using a greedy algorithm that always makes the choice that appears best at the time. The next article in this series will cover the techniques used by this approach and discuss some of the pitfalls. We'll also look at some ways of improving the performance of decision trees.