Machine Learning, Part I: Types of Learning Problems
Before launching into a series of tutorials on different machine learning
algorithms, it can be helpful to understand the background material--what each
algorithm is aiming to do, and where it fits into the world of artificial
intelligence. This essay covers the types of learning problems commonly
studied in AI, gives some examples of each, and tries to provide a sense of
where different learning algorithms fit. It is the first in a series of articles on the basics of machine learning, serving as an introduction to later articles that cover specific algorithms in more detail.
Types of Problems in which Machine Learning is Used
In artificial intelligence, there are several categories of problems, one of
which is machine learning. The goal of machine learning is not quite the
search for consciousness that makes AI seem so exciting, but in some ways it
comes closest to the traditional goals of the field. Machine learning is about
just that: designing algorithms that allow a computer to learn.
Learning, of course, doesn't necessarily imply consciousness. Rather, learning is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms will hardly resemble how you yourself might approach a learning task. Nevertheless, learning algorithms can give insight into the relative difficulty of learning in different environments.
Classification and Decision Problems
The types of learning algorithms fall along several classifications. One
classification is the type of result expected from the algorithm. For
instance, some problems are classification problems. You may be familiar with
these problems from popular literature; a common example of a classification
problem is for the computer to learn how to recognize handwritten digits. In
fact, automated digit recognition has become extremely accurate, with some
specially tuned solutions achieving over 99% accuracy (impressive, considering
some of the messy handwriting out there). Much of the work on digit recognition
has been done in the neural network community, but more recently support vector
machines have proven to be even better classifiers.
Supervised learning can also be used in medical diagnosis--for instance, given
a set of attributes about potential cancer patients, and whether those
patients actually had cancer, the computer could learn how to distinguish
between likely cancer patients and possible false alarms. This sort of
learning could take place with neural networks or support vector machines, but another approach is to use decision trees.
Decision trees are a relatively simple classification technique that relies on
the idea of following a tree of questions and answers: if the answer to a
question is yes, then the algorithm proceeds down one branch of the tree; if
the answer to a question is no, the algorithm takes the other branch. Eventually, the algorithm reaches a leaf node with the final classification.
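As a sketch of the idea, here is a tiny hand-built decision tree in Python; the attribute names, the tree itself, and the classifications are all invented for illustration:

```python
# A minimal sketch of classifying with a decision tree. Each internal
# node holds a yes/no question about one attribute; each leaf (a plain
# string) holds the final classification.

def classify(node, example):
    """Walk the tree, following yes/no branches, until a leaf is reached."""
    while isinstance(node, dict):
        node = node["yes"] if example[node["test"]] else node["no"]
    return node

# A toy tree for a hypothetical diagnosis task.
tree = {
    "test": "abnormal_scan",
    "yes": {"test": "family_history", "yes": "high risk", "no": "follow up"},
    "no": "low risk",
}

print(classify(tree, {"abnormal_scan": True, "family_history": False}))
# follow up
```

A learning algorithm's job is to construct a tree like this automatically from labeled examples, rather than having a person write it by hand.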
Learning decision trees is fairly straightforward and requires significantly
less parameter tuning than neural nets, and clever algorithms such as
AdaBoost can be used to quickly improve performance. We'll look at
decision trees (and neural networks) in later essays. For now, be aware that
even apparently simple learning algorithms can accomplish a great deal. You
could use a decision tree learning algorithm in almost any setting where you
can collect a reasonable number of attributes along with the correct classifications, and expect reasonable (though probably not outstanding) results.
A final example of classification learning to whet your appetite is speech
recognition--usually, a computer will be given a set of training instances
consisting of sounds and what word each sound corresponds to. This type of learning could probably be carried out with neural networks, though it is hard to imagine that the problem is simple enough for decision trees. Another machine learning approach, the hidden Markov model, is designed specifically for working with time series data like this, and it has shown good results in speech processing.
The other common type of learning is designed not to create classifications of
inputs, but to make decisions; these are called decision problems. Typically,
decision problems require making some sorts of assumptions about the state of
the world to be at all tractable. Decision problems may be one-shot, in which
case only a single decision is to be made, or repeated, in which case the
computer may need to make multiple decisions. When there may be multiple
decisions in the future, the decision problem becomes considerably trickier
because it requires taking into account both the tangible consequences of an
action and the possibility of gaining information by taking a certain course of
action.
A common framework for understanding either type of decision problem is to use
the concept of a utility function, borrowed from economics. The idea is that
the computer (or the "agent") can get a certain amount of value from performing some actions. It isn't necessarily the case that the utility function is known in advance--it may be necessary for the agent to learn what actions result in utility payoffs and what actions result in punishment or negative utility.
Utility functions are usually combined with probabilities about things like the
state of the world and the likelihood that the agent's actions will actually
work in the expected way. For instance, if you were programming a robot
exploring a jungle, you might not always know exactly where the robot was
located, and sometimes when the robot took a step forward, it might end up
running into a tree and turning left. In the first case, the robot doesn't
know exactly what the state of the world is; in the second case, the robot
isn't sure that its actions will work as expected.
The common framework combining these results is to say that there is a
probability distribution describing the state of the world, another describing
the chance of different results from each action an agent can take, and a
utility function that determines whether taking an action in a state is good or
bad. The agent must then learn each aspect of the model that isn't given. For
instance, it might know its utility, but not what the world looks like, which
might be the case for an explorer robot, or it might know the state of the
world, but not the value of its actions, as might be the case if the agent were
learning to play a game like backgammon.
Once these different functions (such as the probability distribution) are
learned, the correct action to take is simply a matter of deciding which action
maximizes the "expected utility" of the agent. Expected utility is calculated
by multiplying the probability of each payoff by the payoff and taking the
total sum. In some sense, this works out to what the "average" value of taking
an action should be. In order to plan ahead for multiple moves, an algorithm
known as a Markov decision process is commonly used when there are only a reasonably small number of possible world states. (This doesn't work in games like chess or Go, where there are far too many possible states, but if you're moving around a 3x3 room, it's not so bad.)
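The expected-utility calculation just described can be sketched in a few lines; the actions, probabilities, and payoffs below are made-up numbers for illustration:

```python
# Expected utility of an action: the sum over possible outcomes of
# (probability of the outcome) * (payoff of the outcome).

def expected_utility(outcomes):
    """outcomes: list of (probability, payoff) pairs for one action."""
    return sum(p * payoff for p, payoff in outcomes)

# Two candidate actions for a hypothetical exploring robot.
actions = {
    "go_left":  [(0.8, 10), (0.2, -50)],  # usually food, sometimes a cliff
    "go_right": [(0.5, 4), (0.5, 2)],     # reliably mediocre
}

# The rational choice is the action with the highest expected utility.
best = max(actions, key=lambda a: expected_utility(actions[a]))
print(best)  # go_right
```

Note how the large possible payoff of `go_left` is outweighed by the small chance of disaster: "average" value is exactly what expected utility captures.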
In some cases, the explicit calculation of utilities can be skipped in favor
of learning algorithms that use the information learned to choose the correct
move directly, without having to plan ahead. Reinforcement learning is a
common technique for this scenario, as well as for the more traditional
scenario of learning the utility function itself.
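Reinforcement learning of this sort can be sketched with tabular Q-learning, a standard algorithm that learns action values directly from experience. Everything in this toy example (the corridor world, rewards, and learning parameters) is invented for illustration:

```python
import random

# Tabular Q-learning on a made-up five-cell corridor: the agent starts
# in cell 0 and receives reward +1 for reaching cell 4.

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                 # step left, step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q maps (state, action) pairs to learned value estimates.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Move within the corridor; reward is given only at the goal cell."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def greedy(state):
    """Pick a highest-valued action, breaking ties randomly."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

random.seed(0)
for _ in range(200):               # training episodes
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        nxt, r = step(s, a)
        best_next = max(Q[(nxt, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = nxt

# After training, the learned policy should step right in every non-goal cell.
policy = {s: greedy(s) for s in range(GOAL)}
print(policy)
```

The agent is never told where the goal is or what its actions do; with enough episodes, the value estimates alone come to prefer stepping right in every cell, which is exactly the "make the correct move without planning ahead" behavior described above.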
Note that some decision problems can be reformulated as classification
problems in the following way: each decision is actually classifying a state of
the world by the proper action to take in that state! The trick is to frame
your problem so that you know which techniques are most likely to be appropriate. It might be very silly to use decision trees for learning how to explore a jungle, but quite reasonable to use them for picking food at a restaurant.
Continue to Machine Learning, Part II: Supervised and Unsupervised Learning