Machine Learning, Part II: Supervised and Unsupervised Learning
Last time, we discussed two types of learning distinguished by the result of the
learning process. This article focuses on another dimension of learning: whether it is supervised or unsupervised. Supervised learning is the type of
learning that takes place when the training instances are labelled with the
correct result, which gives feedback about how learning is progressing. This
is akin to having a supervisor who can tell the agent whether or not it was
correct. In unsupervised learning, the goal is harder because there are no pre-determined categorizations.
Supervised learning is fairly common in classification problems because the
goal is often to get the computer to learn a classification system that we have
created. Digit recognition, once again, is a common example of classification
learning. More generally, classification learning is appropriate for any
problem where deducing a classification is useful and the classification is
easy to determine. In some cases, it might not even be necessary to give
pre-determined classifications to every instance of a problem if the agent can
work out the classifications for itself. This would be an example of unsupervised learning in a classification context.
Supervised learning is the most common technique for training neural networks
and decision trees. Both of these techniques are highly dependent on the
information given by the pre-determined classifications. In the case of neural
networks, the classification is used to determine the error of the network and
then adjust the network to minimize it, and in decision trees, the
classifications are used to determine what attributes provide the most
information that can be used to solve the classification puzzle. We'll look at
both of these in more detail, but for now, it should be sufficient to know that
both of these examples thrive on having some "supervision" in the form of
pre-determined classifications.
Speech recognition using hidden Markov models and Bayesian networks relies
on some elements of supervision as well in order to adjust parameters to,
as usual, minimize the error on the given inputs.
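To make the role of the labels concrete, here is a minimal sketch in Python of
training a single artificial neuron on a labelled training set (the data, the
learning rate, and the structure are all invented for illustration). The correct
label for each example is used to compute an error, and the weights are nudged
to reduce it:

    import math

    # Toy labelled training set: (inputs, correct label) pairs for logical AND.
    training_set = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

    weights = [0.0, 0.0]
    bias = 0.0
    learning_rate = 0.5

    def predict(x):
        # Weighted sum squashed into (0, 1) by the logistic function.
        s = weights[0] * x[0] + weights[1] * x[1] + bias
        return 1.0 / (1.0 + math.exp(-s))

    for epoch in range(1000):
        for x, label in training_set:
            error = label - predict(x)  # supervision: compare output to the label
            # Adjust each weight in the direction that shrinks the error.
            weights[0] += learning_rate * error * x[0]
            weights[1] += learning_rate * error * x[1]
            bias += learning_rate * error

    for x, label in training_set:
        print(x, "label:", label, "output:", round(predict(x), 2))

Without the labels there would be nothing to compute the error against, which is
exactly what distinguishes supervised learning.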
Notice something important here: in the classification problem, the goal of the
learning algorithm is to minimize the error with respect to the given inputs.
These inputs, often called the "training set", are the examples from which the
agent tries to learn. But learning the training set well is not necessarily
the best thing to do. For instance, if I tried to teach you exclusive-or but
showed you only combinations consisting of one true and one false (never both
true or both false), you might learn the rule that the answer is always
true. Similarly, with machine learning algorithms, a common problem is
over-fitting the data and essentially memorizing the training set rather than
learning a more general classification technique.
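The exclusive-or example can be played out in a few lines of Python (a purely
illustrative sketch; the "learned" rule is hard-coded to mimic a learner that
over-generalizes from the biased training set):

    # Full truth table for exclusive-or.
    xor_table = {(False, False): False, (False, True): True,
                 (True, False): True,  (True, True): False}

    # A biased training set: only the mixed true/false cases are shown.
    training_set = {(False, True): True, (True, False): True}

    # The simplest rule consistent with the training data: always answer True.
    def learned_rule(x):
        return True

    def accuracy(cases):
        return sum(learned_rule(x) == y for x, y in cases.items()) / len(cases)

    held_out = {x: y for x, y in xor_table.items() if x not in training_set}
    print("accuracy on training set:", accuracy(training_set))  # 1.0 -- looks perfect
    print("accuracy on held-out cases:", accuracy(held_out))    # 0.0 -- fails to generalize

This is why performance is normally measured on examples withheld from training,
a topic the next article in this series returns to.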
As you might imagine, not all training sets have the inputs classified
correctly. This can lead to problems if the algorithm used is powerful enough
to memorize even the apparent "special cases" that don't fit the more general
principles. This, too, can lead to overfitting, and it is a challenge to find
algorithms that are both powerful enough to learn complex functions and robust
enough to produce generalizable results.
Unsupervised learning seems much harder: the goal is to have the computer learn
how to do something that we don't tell it how to do! There are actually two
approaches to unsupervised learning. The first approach is to teach the agent
not by giving explicit categorizations, but by using some sort of reward system
to indicate success. Note that this type of training will generally fit into
the decision problem framework because the goal is not to produce a
classification but to make decisions that maximize rewards. This approach
nicely generalizes to the real world, where agents might be rewarded for doing
certain actions and punished for doing others.
Often, a form of reinforcement learning can be used for unsupervised learning,
where the agent bases its actions on the previous rewards and punishments
without necessarily even learning any information about the exact ways that its
actions affect the world. In a sense, this information is unnecessary: once the
agent has learned a reward function, it knows what to do without any further
processing, because it knows the exact reward it expects to achieve
for each action it could take. This can be extremely beneficial in cases where
calculating every possibility is very time consuming (even if all of the
transition probabilities between world states were known). On the other hand, it can be very time consuming to learn by, essentially, trial and error.
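As a concrete sketch of this kind of reward-driven, trial-and-error learning,
here is a minimal tabular Q-learning example in Python (the corridor world, the
reward, and all parameters are invented for illustration). Note that the agent
is never told the correct action; it only observes rewards:

    import random

    # A tiny corridor of 5 states; reaching state 4 yields a reward of 1.
    N_STATES = 5
    ACTIONS = [-1, +1]  # move left or right

    # Q[state][action]: the agent's current estimate of expected reward.
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    for episode in range(500):
        state = 0
        while state != N_STATES - 1:
            # Explore occasionally (and on ties), otherwise exploit estimates.
            if random.random() < epsilon or Q[state][0] == Q[state][1]:
                a = random.randrange(2)
            else:
                a = 0 if Q[state][0] > Q[state][1] else 1
            next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
            reward = 1.0 if next_state == N_STATES - 1 else 0.0
            # Update the estimate from the observed reward alone -- no labels,
            # and no model of how actions change the world.
            Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
            state = next_state

    for s in range(N_STATES - 1):
        print("state", s, "-> prefer", "right" if Q[s][1] > Q[s][0] else "left")

Once the table of expected rewards is learned, choosing an action is just a
lookup, which is exactly the benefit described above.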
But this kind of learning can be powerful because it assumes no pre-discovered
classification of examples. In some cases, for example, our classifications
may not be the best possible. One striking example is that the conventional
wisdom about the game of backgammon was turned on its head when a series of
computer programs (Neurogammon and TD-Gammon) that learned through
unsupervised learning became stronger than the best human backgammon players merely
by playing themselves over and over. These programs discovered some principles
that surprised the backgammon experts and performed better than backgammon
programs trained on pre-classified examples.
A second type of unsupervised learning is called clustering. In this type of
learning, the goal is not to maximize a utility function, but simply to find
similarities in the training data. The assumption is often that the clusters
discovered will match reasonably well with an intuitive classification. For
instance, clustering individuals based on demographics might result in a
clustering of the wealthy in one group and the poor in another.
Although the algorithm won't have names to assign to these clusters, it can
produce them and then use those clusters to assign new examples into one or the
other of the clusters. This is a data-driven approach that can work well when
there is sufficient data; for instance, social information filtering
algorithms, such as those that Amazon.com uses to recommend books, are based on
the principle of finding similar groups of people and then assigning new users
to groups. In some cases, such as with social information filtering, the
information about other members of a cluster (such as what books they
read) can be sufficient for the algorithm to produce meaningful results. In other cases, the clusters may simply be a useful tool for a human analyst.
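To make this concrete, here is a bare-bones k-means clustering sketch in plain
Python (the two-dimensional points, standing in for demographic data, are
fabricated for illustration). The algorithm is never told what the groups mean;
it only discovers them, after which new examples can be assigned to the nearest
cluster:

    # Fabricated 2-D points (say, income vs. spending); two loose groups by design.
    points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
              (5.0, 5.2), (5.3, 4.8), (4.9, 5.1)]

    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    k = 2
    # Naive initialization; real implementations choose starting centers more carefully.
    centers = [points[0], points[-1]]

    for _ in range(10):
        # Assign each point to its nearest center...
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # ...then move each center to the mean of its cluster.
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]

    # The clusters have no names, but a new example can still be assigned to one.
    new_point = (4.5, 5.0)
    print("cluster centers:", centers)
    print("new point assigned to cluster",
          min(range(k), key=lambda i: dist2(new_point, centers[i])))

This mirrors the social information filtering idea: once a new user lands in a
cluster, information about the other members of that cluster can drive
recommendations.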
Unfortunately, even unsupervised learning suffers from the problem of
overfitting the training data. There's no silver bullet for avoiding the
problem, because any algorithm that can learn from its inputs needs to be quite
powerful, and the same power that lets it learn complex patterns also lets it
latch onto quirks of the training data.
Unsupervised learning has produced many successes, such as world-champion
calibre backgammon programs and even machines capable of driving cars! It can
be a powerful technique when there is an easy way to assign values to actions.
Clustering can be useful when there is enough data to form clusters
(though this turns out to be difficult at times) and especially when additional
data about members of a cluster can be used to produce further results due to dependencies in the data.
Classification learning is powerful when the classifications are known to be
correct (for instance, when dealing with diseases, it's generally
straightforward to determine the correct diagnosis after the fact by an autopsy), or when
the classifications are simply arbitrary things that we would like the computer
to be able to recognize for us. Classification learning is often necessary
when the decisions made by the algorithm will be required as input somewhere
else. Otherwise, it wouldn't be easy for whoever requires that input to
figure out what it means.
Both techniques can be valuable and which one you choose should depend on the
circumstances--what kind of problem is being solved, how much time is allotted
to solving it (supervised learning or clustering is often faster than
reinforcement learning techniques), and whether supervised learning is even
possible (the correct classifications may simply not be known ahead of time).
Next time: Machine Learning, Part III: Testing algorithms and the "No Free Lunch" theorem