A Survey of Neural Networks
by Jeannette Lawrence
Jan. 23, 1990
I. Introduction
Neural networks have been hailed as the greatest technological
breakthrough since the transistor and have been predicted to be a common
household item by the year 2000. How much of this is hype? What are
they capable, or not capable of? With numerous paradigms available,
which is best for a particular application? This article will answer
these questions and more about this newly emerging field of computation.
Formed by simulated neurons connected together much the same way the
brain's neurons are, neural networks are able to associate and
generalize without rules. They have solved problems in pattern
recognition, robotics, speech processing, financial predicting and signal
processing, to name a few.
One of the first impressive neural networks was NetTalk, which read in
ASCII text and correctly pronounced the words (producing phonemes which
drove a speech chip), even those it had never seen before (1). Designed
by Johns Hopkins biophysicist Terry Sejnowski and Charles Rosenberg of
Princeton in 1986, this application made the Back-propagation training
algorithm famous. Using the same paradigm, a neural network has been
trained to classify sonar returns from undersea mines and rocks. This
classifier, designed by Sejnowski and R. Paul Gorman, performed better
than a nearest-neighbor classifier (2).
As far as the public is concerned, the modern era in neural networks
began in 1982 when the distinguished Caltech physicist John Hopfield
published a paper which not only showed that neural networks could store
and recall patterns even when the input was incomplete, it provided the
mathematical elucidation which captured the attention of the scientific
community (3).
Speech recognition of Finnish and Japanese (to text) has been
demonstrated by researcher Teuvo Kohonen of the Helsinki University of
Technology, Finland. For these inflectional languages, the system must
construct the text from recognizable phonetic units (4). This complex
system uses signal preprocessing by a TMS32010 chip, Kohonen's
self-organizing associative paradigm, and a context-sensitive stochastic
grammar corrector.
The Neocognitron, designed by Kunihiko Fukushima of the NHK Science and
Technical Research Lab in Tokyo, recognizes handwritten numerals of
various styles of penmanship correctly, even if they are considerably
distorted in shape (5). Built as a model for the human visual system,
this highly specialized network does not implement any common topology.
The kinds of problems best solved by neural networks are those that
people are good at such as association, evaluation and pattern
recognition. Problems that are difficult to compute and do not require
perfect answers, just very good answers, are also best done with neural
networks. A quick, very good response is often more desirable than a
more accurate answer which takes longer to compute. This is especially
true in robotics or industrial controller applications. Predictions of
behavior and general analysis of data are also affairs for neural
networks. In the financial arena, consumer loan analysis and financial
forecasting make good applications. New network designers are working
on weather forecasts by neural networks. Currently, doctors are
developing medical neural networks as an aid in diagnosis. Attorneys
and insurance companies are also working on neural networks to help
estimate the value of claims.
Neural networks are poor at precise calculations and serial processing.
They are also unable to predict or recognize anything that does not
inherently contain some sort of pattern. For example, they cannot
predict the lottery, since this is a random process. It is unlikely
that a neural network could be built which has the capacity to think as
well as a person does, for two reasons: neural networks are terrible at
deduction, or logical thinking, and the human brain is simply too complex
to simulate completely. Also, some problems are too difficult for
present technology. Real vision, for example, is a long way off.
A brief look at the general structure and operation of neural networks
will help explain the limits to their abilities. The power and speed of
the human brain comes from the way the hundreds of billions of highly
interconnected neurons function together. Neural networks simulate the
operation and structure of brain neurons, but on a much smaller scale.
Information is distributed across the neurons' interconnections, not as
bits of intelligence stored within the neurons as was once thought.
There are many types of neural networks, but all have three things in
common. A neural network can be described in terms of its individual
neurons, the connections between them (topology), and the learning rule.
Together they constitute the neural network paradigm.
Artificial neurons are also called processing elements, neurodes, units
or cells. Each neuron receives the output signals from many other
neurons. A neuron calculates its output by finding the weighted sum of
its inputs. The point where two neurons communicate is called a
connection (analogous to a synapse). The weight of a particular
connection is written w_ij (subscripted ij), where i is the
receiving neuron and j is the sending neuron. At any point in
time t, the neuron adds up the weighted inputs to produce an
activation value a_i(t). The activation is passed through an
output, or transfer, function f_i, which produces the actual
output for that neuron at that time, o_i(t).
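The weighted-sum-and-transfer computation described above can be sketched in a few lines of code. This is a minimal illustration, not any particular paradigm; a simple threshold transfer function and the names used here are assumptions for the example:

```python
# One artificial neuron: weighted sum of inputs, then a transfer function.
# A minimal sketch; an all-or-nothing threshold transfer is assumed.

def neuron_output(inputs, weights, threshold=0.0):
    # Activation a_i(t): the weighted sum of the incoming signals.
    activation = sum(w * x for w, x in zip(weights, inputs))
    # Transfer function f_i: threshold, producing the output o_i(t).
    return 1 if activation > threshold else 0

# Two inputs with weights 0.5 and -0.3: activation = 0.2, so output is 1.
print(neuron_output([1.0, 1.0], [0.5, -0.3]))
```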
The activation function specifies what the neuron is to do with the
signals after the weights have had their effect. Once inside the
neuron, the weighted signals are summed to form a net value. In most
models, signals can either be excitatory or inhibitory. After
summation, the net input of the neuron is combined with the previous
state of the neuron to produce a new activation value. In the
simplest models, the activation function is the weighted sum of the
neuron's inputs; the previous state is not taken into account. In
more complicated models, the activation function also uses the
previous output of the neuron, so that the neuron can self-excite.
These activation functions slowly decay over time; an excited state
slowly returns to an inactive level. Sometimes the activation
function is stochastic, i.e. it includes a random noise factor.
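A self-exciting, decaying activation of the kind just described might be sketched as follows; the decay factor of 0.9 is an illustrative assumption:

```python
# Activation that combines the previous state with the new net input and
# decays over time. The decay factor is an illustrative assumption.

def new_activation(previous, net_input, decay=0.9):
    return decay * previous + net_input

# With no further input, an excited state slowly returns toward zero.
a = 1.0
for _ in range(3):
    a = new_activation(a, 0.0)   # 0.9, then 0.81, then 0.729
```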
The transfer function of a neuron defines how the activation value is
output. The earliest models used a linear transfer function. There are
certain problems which cannot be solved by purely linear
methods. Nonlinear neurons allow more interesting problems to be
solved. The simplest nonlinear model consists of threshold neurons.
A threshold transfer function is an all-or-nothing function. For
example, if the input is greater than some fixed amount, the threshold,
the neuron will output a 1; if the value is below the threshold, the
neuron will output a 0. Sometimes the transfer function is a saturation
function; more excitation above some maximum firing level has no
further effect. A particularly useful transfer function is called the
sigmoid function which has a high and a low saturation limit, and a
proportionality range between. This function is 0 when the activation
value is a large negative number. The sigmoid function is 1 when the
activation value is a large positive number, and makes a smooth
transition in between.
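The three transfer functions mentioned above (linear, threshold, and sigmoid) can be written out directly:

```python
import math

def linear(a):                     # earliest models: output = activation
    return a

def threshold(a, theta=0.0):       # all-or-nothing
    return 1.0 if a > theta else 0.0

def sigmoid(a):                    # smooth transition between 0 and 1
    return 1.0 / (1.0 + math.exp(-a))

# The sigmoid saturates near 0 and 1 and passes through 0.5 at zero.
print(sigmoid(-10), sigmoid(0), sigmoid(10))
```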
The behavior of the network depends heavily on the way the neurons are
connected. In most models, the individual neurons are grouped into
layers, so that the output from each neuron in one layer is fully
interconnected with the inputs of all the neurons in the next layer. A
Back-propagation network has at least three layers: input, hidden and
output. The network structure may involve inhibitory connections from
one neuron to the rest of the neurons in the same layer. This is called
lateral inhibition. Sometimes a network has such strong lateral
inhibition that only one neuron in a layer, usually the output, can be
activated at a time. This effect of minimizing the number of active
neurons is known as competition. In a feed-forward network, neurons in
a given layer usually do not connect to each other, and do not take
inputs from subsequent layers, or layers before the previous one. Other
models include feedback connections from the outputs of a layer to the
inputs of the same or a previous layer.
A neural network learns by changing its response as the inputs change.
The learning rule is the very heart of a neural network; it determines
how the weights are adjusted as the neural network gains experience.
There are lots of different learning rules. Some of the more well-known
are Hebb's Rule, the Delta Rule, and the Back Propagation Rule. The
best learning rule to use with linear neurons is the Delta Rule. This
allows arbitrary associations to be learned, provided that the inputs
are all linearly independent. Other learning rules (such as Hebb's)
require that the inputs also be orthogonal.
In 1949, Donald O. Hebb theorized that biological
associative memory lies in the synaptic connections between nerve cells.
He thought that the process of learning and memory storage involved
changes in the strength with which nerve signals are transmitted across
individual synapses. Hebb's Rule states that the connections between
pairs of neurons which are active simultaneously are strengthened by
synaptic (weight) changes. The result is a reinforcement of those
pathways in the brain. A
number of different rules for adjusting connection strengths, or
weights, have been proposed, but nearly all network learning theories
are some variant of Hebb's Rule.
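Hebb's Rule reduces to one line per weight: the change is proportional to the product of the two neurons' activities. A sketch, with the learning rate of 0.1 being an illustrative choice:

```python
# Hebbian update: w_ij grows when receiving neuron i and sending neuron j
# are active at the same time. The rate is an illustrative assumption.

def hebb_update(w, activity_i, activity_j, rate=0.1):
    return w + rate * activity_i * activity_j

w = 0.0
w = hebb_update(w, 1.0, 1.0)   # both active: weight strengthened to 0.1
w = hebb_update(w, 1.0, 0.0)   # sender silent: weight unchanged
```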
The Delta Rule additionally states that if there is a difference between
the actual output pattern and the desired output pattern during
training, then the weights are adjusted to reduce the difference.
Many networks use some variation of this. The Back-propagation Rule is
a generalization of the Delta Rule for a network with hidden neurons.
The weights are adjusted by a small or large amount, as determined by a
specified learning rate.
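For a single linear neuron, one Delta Rule step can be sketched as follows, with the step size set by the learning rate:

```python
# One Delta Rule step for a single linear neuron: nudge each weight to
# reduce the difference between actual and desired output. A sketch.

def delta_update(weights, inputs, target, rate=0.5):
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output
    return [w + rate * error * x for w, x in zip(weights, inputs)]

# Repeated presentations of one pattern drive the error toward zero.
w = [0.0, 0.0]
for _ in range(20):
    w = delta_update(w, [1.0, 0.0], 1.0)
```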
II. Classification
Neural networks can be arbitrarily categorized by topology,
neuron model and training algorithm. There are two main
subdivisions of neural network models - feed-forward and feedback
topologies.
Feedback models can be constructed or trained. In a constructed model
the weight matrix is created by taking the outer product of every input
pattern vector with itself or with an associated input, and adding up
all the outer products. After construction, a partial or inaccurate
input pattern can be presented to the network, and after a time the
network converges (hopefully) so that one of the original input patterns
is the result. Hopfield and BAM are two well-known constructed feedback
models.
The Hopfield network is a self-organizing, associative memory.
It is the canonical feedback network. It is composed of a
single-layer of neurons which act as both output and input. The
neurons are symmetrically connected (i.e., w_ij = w_ji). Hopfield
networks are made of nonlinear neurons capable of assuming two
output values: -1 (off) and +1 (on). The linear synaptic weights
provide global communication of information. In spite of its
apparent simplicity, a Hopfield network has considerable
computational power.
The weight matrix is created by taking the outer product of each input
pattern vector with itself, and adding up all the outer products. After
construction, a pattern is input to the network. A process of
reaction-stimulation-reaction between neurons occurs until the network
settles down into a fixed pattern called a stable state. Thus, the
network result comes as a direct response to input.
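The construction and recall just described can be sketched directly. Here a single +1/-1 pattern is stored and then recovered from a corrupted input; the fixed number of update sweeps is a simplification of "settling down" into a stable state:

```python
# Hopfield sketch: weight matrix as the sum of outer products of stored
# +1/-1 patterns (with a zero diagonal), then iterative recall.

def build_weights(patterns):
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def hopfield_recall(w, state, sweeps=5):
    s = list(state)
    for _ in range(sweeps):             # react until a stable state
        for i in range(len(s)):
            net = sum(w[i][j] * s[j] for j in range(len(s)))
            s[i] = 1 if net >= 0 else -1
    return s

stored = [1, -1, 1, -1, 1, -1]
weights = build_weights([stored])
noisy = [1, -1, -1, -1, 1, -1]          # one element flipped
print(hopfield_recall(weights, noisy))  # recovers the stored pattern
```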
The energy of the network, as a function of its state, can be plotted
in three dimensions as a curved surface. The stable states, or energy
minima, appear as valleys on this surface.
A neural network which is used to find "good enough" solutions to
optimization problems will have many energy minima, or valleys.
Depending upon the initial state of the network, any of the deepest
valleys may end up as the answer. Inputting incomplete information to an
associative memory network causes the network to follow paths to a
nearby energy minimum where complete information is stored.
Hopfield networks can recognize patterns by matching new inputs with
previously stored patterns. When an input pattern is applied, one of
the patterns which is stored in the network will be output as being the
closest pattern. Hopfield networks are especially good for finding the
best answer out of many possibilities. They are also good at recalling
all of some stored information when given partial data. Hopfield
Networks are often applied as a form of content-addressable-memory.
Bart Kosko brought the Hopfield network to its logical conclusion with
the BAM. The BAM (bidirectional associative memory) is a generalization
of the Hopfield network. Instead of creating the weight matrix with the
outer product of a pattern with itself (autoassociation), pairs of
patterns are used (pair association). After construction of the weight
matrix, either pattern can be applied as input to elicit as output the
other pattern in the pair.
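A minimal BAM sketch storing one pattern pair: the weight matrix is the sum of outer products of the pairs, and applying a pattern on one side thresholds to its partner on the other. The pattern values and sizes here are illustrative:

```python
# BAM sketch: M is the sum, over pairs, of the outer product of pattern a
# with pattern b; presenting a on one side yields b on the other.

def bam_matrix(pairs):
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(a[i] * b[j] for a, b in pairs) for j in range(m)]
            for i in range(n)]

def bam_forward(M, a):
    return [1 if sum(M[i][j] * a[i] for i in range(len(a))) >= 0 else -1
            for j in range(len(M[0]))]

pair_a, pair_b = [1, -1, 1, -1], [1, 1, -1]
M = bam_matrix([(pair_a, pair_b)])
print(bam_forward(M, pair_a))   # elicits the associated pattern
```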
A trained feedback model is much more complicated because adjustment of
the weights affects the signals as they move forward as well as when they
feed back to previous neuron inputs. The Adaptive Resonance Theory
(ART) model is a complex trained feedback paradigm developed by Stephen
Grossberg and Gail Carpenter of the Center for Adaptive Systems at
Boston University.
ART neurons are functionally clustered into "nodes". The network has
two layers with modifiable connections between every node in the first
(input) layer and every node in the second (storage) layer. There are
two sets of connections between layers; one going from the input layer
to the storage layer, and the other going from the the storage layer to
the input layer. The storage layer also has lateral inhibition
connections. ART uses a unique unsupervised training method sometimes
called a Leader Clustering Algorithm. An input pattern is transmitted to
the storage layer through weighted connections. Due to the lateral
inhibition, the activity in the storage layer is concentrated in exactly
one node. That node's output is sent back to the input layer over another
set of weighted connections. If the activity pattern there matches the
original input pattern, the two are said to be in a resonant state. The
single storage layer neuron, a "grandmother cell", has correctly
classified the input pattern.
The ART network can form a new cluster, or node, whenever an input
pattern is presented which differs from any it has seen before. The
amount of difference which the network is sensitive to can be controlled
by the "vigilance" parameter. In this mode of operation, ART uses a
"global reset" signal which turns off a node for some specified time.
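A drastically simplified sketch in the spirit of this leader-clustering behavior: an input joins an existing cluster node only if the match exceeds the vigilance threshold; otherwise a new node is formed. (Real ART uses two weighted pathways and a resonance test; the binary patterns and the fraction-of-matching-components similarity measure are illustrative assumptions.)

```python
# Vigilance-controlled clustering sketch: a new node is created whenever
# no stored prototype matches the input closely enough.

def leader_cluster(patterns, vigilance=0.75):
    prototypes = []
    for p in patterns:
        best_sim = max(
            (sum(a == b for a, b in zip(p, proto)) / len(p)
             for proto in prototypes),
            default=-1.0)
        if best_sim < vigilance:
            prototypes.append(list(p))   # new node for a novel pattern
    return prototypes

clusters = leader_cluster([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1]])
print(len(clusters))   # the second pattern matched the first node
```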
The second main category of neural networks is the feed-forward type.
The earliest neural network models were linear feed-forward. In 1972,
two simultaneous papers independently proposed the same model for an
associative memory, the linear associator. J. A. Anderson, a
neurophysiologist, and Teuvo Kohonen, an electrical engineer, were not
aware of each other's work.
The linear associator uses the simple Hebbian rule. With simple Hebbian
learning, association is perfect only when the input patterns are
orthogonal. This puts an upper limit on the number
of patterns that can be stored. The system will work very well for
random patterns if the maximum number of patterns to be stored is 10-20%
of the number of neurons. If the input patterns are not orthogonal,
there will be interference among them; fewer patterns can be stored and
correctly retrieved. One of the predictions of the linear associator is
interference between nonorthogonal patterns. Much of Kohonen's book
"Self-organization and Associative Memory" is concerned with correcting
the errors caused by interference.
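The orthogonality requirement is easy to demonstrate: with Hebbian (outer-product) storage, the cross-talk terms vanish exactly when the stored inputs are orthogonal. A small sketch with two orthogonal, unit-length input patterns (the patterns themselves are illustrative):

```python
import math

# Linear associator: store pairs (x, y) as W = sum of outer products;
# recall is the matrix-vector product. Exact when the x's are orthogonal.

def store(pairs):
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(x[i] * y[j] for x, y in pairs) for j in range(m)]
            for i in range(n)]

def associate(W, x):
    return [sum(W[i][j] * x[i] for i in range(len(x)))
            for j in range(len(W[0]))]

r = 1 / math.sqrt(2)
x1, x2 = [r, r], [r, -r]                 # orthogonal, unit length
W = store([(x1, [1.0, 0.0]), (x2, [0.0, 1.0])])
out = associate(W, x1)                   # recalls [1.0, 0.0] exactly
```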
The nonlinear feed-forward models are the most commonly used today.
Feed-forward networks, for historical reasons, are less often
considered to be associative memories than the feedback networks,
although they can provide exactly the same functionality. It can be
shown mathematically that any feedback network has an equivalent
feed-forward network which performs the same task.
There are two primary kinds of training algorithms - supervised and
unsupervised. Supervised learning is the most elementary form of
adaptation. It requires an a priori knowledge of what the result should
be. Output neurons are told what the ideal response to input signals
should be. For one-layer networks, in which the stimulus-response
relation can be controlled closely, this is easily accomplished by
monitoring each neuron individually. In multi-layer networks,
supervised learning is more difficult. It is harder to correct the
hidden layers. Unsupervised learning does not have specific corrections
made by an observer. Supervised and unsupervised learning are mutually
exclusive methods.
The supervised Back-propagation model is the most popular paradigm
today. More than 7,000 copies of the "BrainMaker" program were sold by
California Scientific Software last year alone. Back-propagation is a
multi-layer feed-forward network that uses the Generalized Delta Rule.
In 1985, back-propagation was independently discovered by three groups
of people: 1) D.E. Rumelhart, G.E. Hinton, and R.J. Williams, 2) Y. Le
Cun, and 3) D. Parker. Back-propagation is the canonical feed-forward
network. Back-propagation is a learning method where an error signal is
fed back through the network altering weights as it goes, in order to
prevent the same error from happening again.
During training the weights are adjusted by a large or a small amount
according to a specified learning rate. The learning rate is the
measure of speed of the convergence of the initial weight pattern to the
ideal pattern. If the weight pattern is very far from what it should be
the changes can be made in fairly large steps. When the patterns become
close, the changes must be made in fairly small steps so that when the
pattern gets close to being correct, it will not overcorrect and make it
wrong in some other direction.
The error on an output neuron i, for a particular pattern p, is
defined as E_pi = 1/2 (T_pi - O_pi)^2, where T is the target output
and O is the actual output. The total error on pattern p, E_p, is the
sum of the errors on all the output neurons for pattern p. The total
error, E, for all patterns is the sum of the errors on each pattern over
all p.
The simplest method for finding the minimum of E is known as gradient
descent. It involves moving a small step down the local gradient of the
error surface. This is directly analogous to a skier always moving
downhill through the mountains until reaching the bottom.
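Gradient descent in its simplest form can be shown on a one-dimensional error surface E(w) = (w - 3)^2, where the minimum location, the learning rate, and the step count are all illustrative values:

```python
# Gradient descent sketch: repeatedly take a small step down the local
# slope of E(w) = (w - 3)^2, whose single valley is at w = 3.

def descend(w, rate=0.1, steps=200):
    for _ in range(steps):
        gradient = 2 * (w - 3.0)    # dE/dw
        w -= rate * gradient
    return w

w_min = descend(10.0)
print(round(w_min, 6))   # converges to the valley at w = 3
```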
Back-propagation is useful because it provides a mathematical
explanation for the dynamics of the learning process. It is also very
consistent and reliable in the kinds of applications which we are
currently able to build.
A popular unsupervised feed-forward model is the Kohonen model. The
basic system is a one or two dimensional array of threshold-type logic
units with short-range lateral connections between neighboring neurons.
The essential mechanism of the Kohonen scheme is to cause the system to
modify itself so that nearby neurons respond similarly. The neurons
compete in a modified winner-take-all manner. The neuron whose weight
vector generates the largest dot product with the input vector is the
winner and is permitted to output. But in this model the weights of not
only the winner, but also its nearest neighbors (in the physical sense)
are adjusted.
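One training step of this winner-take-all-with-neighbors scheme might be sketched as follows; the one-dimensional neighborhood radius and the learning rate are illustrative assumptions:

```python
# Kohonen-style step: the neuron whose weight vector has the largest dot
# product with the input wins, and the winner plus its physical neighbors
# move their weights toward the input. A sketch for a 1-D neuron array.

def kohonen_step(weights, x, rate=0.5, radius=1):
    dots = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    winner = dots.index(max(dots))
    for k in range(len(weights)):
        if abs(k - winner) <= radius:    # winner and nearest neighbors
            weights[k] = [wi + rate * (xi - wi)
                          for wi, xi in zip(weights[k], x)]
    return winner

weights = [[0.1, 0.0], [0.0, 0.1], [0.0, 0.0]]
win = kohonen_step(weights, [1.0, 0.0])
print(win, weights)   # neuron 0 wins; neurons 0 and 1 move, neuron 2 does not
```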
A special case of the feed-forward model is the Neocognitron. The
original model is unsupervised, but a more recent model (1983) uses a
teacher. The multilayer (seven or nine layer) system assumes that the
builder of the network knows roughly what kind of result is wanted. All
the neurons are of analog type; the inputs and outputs take nonnegative
values proportional to the instantaneous firing frequencies of actual
biological neurons. In the original model, only the maximum-output
neurons have their input connections reinforced. It uses a variation of
the Hebbian Rule. After learning is completed, the final Neocognitron
system is capable of recognizing handwritten numerals presented in any
visual field location, even with considerable distortion.
III. Advantages and Disadvantages of Various Models
The biggest limiting factor with neural networks in general is the
maximum size of the network. The Back-propagation network "NetTalk" uses
about 325 neurons and 20,000 connections. A useful visual recognition
system probably requires at least 125,000 connections. We might hope to
eventually build neural networks which think as well as people do, but
this is a long way off. Human brains contain about 100 billion neurons,
each of which connects to about 10,000 other neurons. Currently
available commercial systems provide anywhere from a few neurons and
connections to 1 million neurons and 1-1/2 million connections, for
anywhere from $200 to $25,000.
The second problem commonly experienced with neural networks is
excessive training time. As the number of neurons increases, the
training time increases cubically. Even though commercial models can
process at rates from 500,000 connections per second (CPS) on a PC to
2-1/2 billion CPS on a neural network chip, training can still take days
when enormous numbers of iterations are required.
Various network paradigms have their own specific problems. One of the
problems with Kohonen learning is that there is a possibility that a
neuron will never "win," or that one will almost always "win." The
weight vectors get stuck in isolated regions. One way to prevent the
weight vectors from getting stuck is to add a "conscience" mechanism
that handicaps neurons which win too often.