Operational Models of Parallel Computers

"...then walked over to the small grubby kiosk, paid for the petrol, remembered to buy a couple of local maps, and then stood chatting enthusiastically to the cashier for a few minutes about the directions the computer industry was likely to take in the following year, suggesting that parallel processing was going to be the key to really intuitive productivity software, but also strongly doubting whether artificial intelligence research per se, particularly artificial intelligence research based on the ProLog language, was really going to produce any serious commercially viable products in the foreseeable future, at least as far as the office desk top environment was concerned, a topic that fascinated the cashier not at all."

Dirk Gently's Holistic Detective Agency
by Douglas Adams

2 Introduction

Parallelism is largely concerned with overcoming the von Neumann bottleneck. When a successful architecture for parallelism is devised it is usually refined or built upon by other researchers. A group of machines with a similar architecture or with the same basic strategy for avoiding the von Neumann bottleneck can be said to be based on the same model of parallelism. This chapter presents several such models.

It should be stressed that these models are not descriptions of specific architectures, but more explanations of the design philosophies behind particular groups of architectures. Specific computers many depart from these models or may include several models in the same machine.


2.1 Pipelining

Processors are designed in blocks. Typically one block is used to fetch instructions or data from memory, another block decodes instructions and declares which subsequent blocks will be needed to execute the instruction. Several other blocks are needed to execute the instruction. Addition and subtraction are closely related and tend to share a block. The same is true for multiplication and division. Logical operations are executed in another block. Finally, a block is needed to return the results of instructions to memory. At any given moment only one or two of these blocks are busy, leaving the majority of the blocks idle. As these block are independent they could, in principle, all be made to work at the same time.

'Pipelining' places several of these blocks in a linear chain, see figure 2.1. A typical chain would include stages to fetch an instruction, evaluate an operand address, fetch an operand, and several stages to execute the instruction. Each stage of the chain works on an independent phase of the instruction. In an unpipelined processor an instruction would have to have passed through every block and left the processor before work on the next instruction could begin. In a pipelined processor as soon as an instruction leaves the first stage, the next instruction can be started. In this way the throughput of the processor is multiplied by the number of stages in the pipeline.

pipeline processor

Figure 2.1 Pipeline Processor

Individual instructions still take as long to get through the pipelined processor as they did through the unpipelined processor. The improved performance is largely due to the ability of the pipeline to accommodate several instructions at a time. Furthermore, the path through a single pipeline stage is much shorter than through a complete unpipelined processor allowing higher clock speeds to be used. This improves the performance of pipelined processors over unpipelined processors a little bit more.

The stages of a pipeline need not be placed in a linear chain on the chip and indeed most chips are almost square rather than long rectangles. This is done to reduce the signal path length, which would otherwise reduce the speed of the processor. The only requirement is that the stages be logically connected in a linear chain.

Pipelines tend to be limited to about five or six stages. This is because a single instruction can only be broken into so many independent phases.

Data dependencies between instructions can cause serious problems within pipelines. Sometimes, the pipeline executes an instruction which needs, as its operand, the result of the instruction immediately in front. As the result is not available until the first instruction leaves the pipeline, the second instruction must be halted, somewhere part way down the pipeline, until the operand is ready. Further instructions must be stopped from entering the pipeline and 'do nothing' instructions inserted into the stage immediately in front of the stage holding the halted instruction. The 'do nothing' instructions continue to be inserted until the first instruction completes and makes its operands available to the instruction behind. These 'do nothing' instructions are called 'bubbles'. The second instruction is said to be in the 'load shadow' of the first instruction.

Clever compilers can help alleviate the performance degradation caused by load shadows. The compiler examines groups of instructions that would be in the pipeline at the same time to see if they could cause bubbles. If a data dependency has been detected, it may be possible to move one of the offending instructions further on in the program so that it is does not appear in the pipeline with the others. This can only be done when such a move would not change the overall 'meaning' of the program.

'Transfer of control' instructions allow the processor to execute instructions in a different order to which they appear in main memory. Instructions in this group include subroutine calls, subroutine returns, conditional branches and explicit jumps. Transfer of control instructions give rise to another type of problem in pipelined processors.

Jumps can only be discovered in the later stages of a pipeline. By this time the earlier stages of the pipeline are already loaded and working on instructions that should not be executed. Jump instructions must 'flush' the pipeline and start it off again from the address given by the jump instruction - a process that is called 'drop through'. This wastes many phase cycles and makes predicting the exact time taken for a section of code to complete very difficult.

Most conditional jump instructions are to be found forming loops in programs. Furthermore, most loops are executed more than once. These two observations have been used to reduce the incidence of drop through. The jump instruction at the end of a loop is conditional on whether or not the loop is to be repeated. If the loop is to be repeated the jump instruction is executed which transfers control to the beginning of the loop, ie several instructions back. Therefore, most conditional jumps that point backwards are taken. A detector for backward pointing instructions can be built into the instruction decoder stage which is the first stage of the pipeline. Once detected, the instruction pointer can be directed to use the backward pointer of the jump instruction.

General information on pipelining came from reference [1].


2.2 Vector Computers

Many scientific and engineering computations require the same operation to be performed on every element of a list or matrix. To satisfy the demand for this specialised type of computation, the class of 'vector computers' was built which includes the Cray computers, the Cyber 205, the ETA10, SX1 and SX2 from NEC, Hitachi's HITAC S-810, and Fujitsu's FACOM VP series. The high performance of most supercomputers comes from their ability to perform vector operations.

Rather than working on single data words (scalars), these processors have instructions which work on lists of such words (vectors). A vector instruction might, for example, add the contents of one vector to the contents of another vector and place the results in a third vector. Several arithmetic units are usually available for the execution of these instructions, see figure 2.2.

vector processor

Figure 2.2 Vector Processor

Vector computers are split into two parts, one part works on scalars and the other part work on vectors. The arithmetic units are highly pipelined which further increases performance. For this reason these computers are often referred to as pipelined vector computers. The control system schedules operations so that all of the arithmetic units are kept fully active during vector operations.

Every data item that passes through a scalar processor requires the pipeline to be reconfigured to execute a new instruction. Once a vector pipeline has been configured it does not need to be reconfigured until the whole vector has passed through. The time saved by not having to reconfigure the pipeline for every data item to pass through it allows the overall clock speed to be increased. High clock speeds along with multiple pipelines account for most of the high performance of vector computers.

Simple memory systems are unable to deliver data at a high enough rate to keep a vector computer fully loaded. Once a memory system has been accessed, there is a short period during which that memory bank is unable to respond to another access. This period is called the 'latency time'. Vector computers overcome the latency time by splitting the memory system into several independent banks. Vectors are stored so that consecutive elements are in different memory banks. Immediately after accessing a vector element in one memory bank, the next vector element can be accessed from the next memory bank whilst the first memory bank is still serving out its latency time. A high speed memory system is vital to the performance of vector computers.

The Cray-1 derives its operands from seven register files, each of which can store 64 words of 64-bits. The Cyber 205 takes its vector operands directly from main memory. These vectors can have up to 64K elements. Vector instructions include:

VectorC	= VectorA  +  VectorB    		-- diadic operations
VectorC	= VectorA  +  ScalarB
VectorC	= VectorA  *  (Scalar + VectorB)	-- triadic operations
VectorC	= VectorA  +  (Scalar * VectorB)

Vector elements are taken sequentially and pushed through the arithmetic pipeline. A new element is required every clock cycle. Hidden registers automatically keep track of how many more elements are left to be processed. The destination address for the results are calculated automatically and made available as soon as every result leaves the pipeline.

Scalar pipelines suffer serious performance degradation from jump instructions. Vector operations do not include jump operations, and so vector pipelines are much more effective.

The efficiency of vector computers begins to suffer when the application problem is bigger than the vector size of the computer. This can be detected by the compiler, which breaks the problem up as best it can into manageable chunks.

Vector computers are large machines built from very high speed components and are the most expensive class of computers available. High clock speeds allow scalar and vector computations to be executed quickly. Applications with a high scalar content, however, are not usually run on a vector processor because they under-utilise the processor so much that it is not economically viable. Section 6.3 describes the Cray-1 vector computer in more detail.


2.3 Array Processors

The term 'array processor' is not well defined and has been used as a label for many different computer architectures. Generally, the term can refer to any computer architecture that has been enhanced in some way for the processing of arrays. This includes architectures with arrays of arithmetic units under the direction of one control unit, arrays of processors each with their own control unit, and vector computers.

2.3.1 Vector Processors

The vector processors, described in section 2.2, fit the term array processor in its general sense. They are, however, such an important sub-category that they retain their own identity and are referred to as vector computers rather than being lumped in with array processors.

2.3.2 Vector-like Processors

There is a class of computers which are called either array processors or 'attached processors'. Their architecture is similar to vector processors, but they do not directly implement vector instructions. The AP-120B[2] made by Floating Point Systems is an example of such a computer. It was designed to perform well on numerically intensive problems which have their data organised into vectors or arrays. It is a specialised unit that has to be attached to a general purpose computer - called the host.

Like vector computers these processors have multiple pipelined functional units. The FPS AP-120B has ten such functional units. Instructions are stored as 'long words'. The long words are divided into ten blocks, each of which act as a micro-instruction for a single functional unit. In this way every functional unit is given a task to perform every clock cycle. All the functional units can pass data between themselves but only some can move data to or from main memory. The high performance of vector-like processors comes from every functional unit being able to work on a different piece of data at the same time.

Vector operations are synthesised by carefully micro-programmed subroutines supplied by the manufacturer in a library. The library contributes significantly to the high performance of the computer. The processor cycle time is high compared to a vector computer, consequently the performance is lower. The throughput is much higher than that of the host computer, and, due to their relatively low cost they are sometimes called the 'poor man's vector computer'. Another, more modern, name for a processor exhibiting this type of parallelism is Very Long Instruction Word machine, or VLIW.

2.3.3 Array Processors

The term array processor[3] is also used for the class of computers that have a single control unit with many identical processing elements under its control, see figure 2.3.3. Examples of this group include the DAP, PEPE, Illiac VI, MPP and CM-1. These machines are also attached processors and need a host computer to provide a backing store and feed them with instructions and data.

array processor

Figure 2.3.3 Array Processor

Each processing element has a small amount of memory and an arithmetic and logic unit. The central control unit transmits the same instruction simultaneously to all the processing elements. They, in turn, execute the instruction working on data from their local memories. Instruction processing is, therefore, synchronous. The processing elements tend to be simple and slow. However, since there are so many of them, there is high overall throughput. This class corresponds to 'SIMD' computers described in section 7.1.2. Chapter 6 describes the DAP and Illiac-IV array processors in more detail.

2.3.4 Multiprocessors

Another distinct form of array processor comprises a network of processing elements each with their own control unit, see figure 2.3.4. In having many independent control units these machines are different from the array processors described in section 2.3.3 which have only one control unit. A large memory at each node holds both programs and data which can be worked on independently of the other nodes. These are better described as 'arrays of processors' rather than array processors. Multiprocessors correspond to 'MIMD' computers described in section 7.1.4. A detailed description of this class is also given in section 2.8 on multiprocessing.


Figure 2.3.4 MultiProcessor

Four distinct types of computer have been described each of which have been referred to as array processors by different authors and manufacturers. The following terminology is proposed in order to clarify the terms. Architecturally the processors in section 2.3.2 are more akin to vector processors than to any other group (even though they do not directly support vector operations). The processors described in sections 2.3.2 along with those in 2.3.1 will, therefore, be referred to as vector processors. The processors described in section 2.3.4 will be referred to as multiprocessors (or arrays of processors). The term array processor is to be used exclusively for those processors described in section 2.3.3.


2.4 Systolic Arrays

Systolic arrays are another form of specialised attached processor. And as such, they need a host computer to provide control information and access to peripherals. Systolic arrays comprise several chains of similar processing elements with either a memory bank or an I/O link at the chain ends. None of the processing elements have any local memory. Only the processors at the beginning and the end of the chains are able to access the main-memory. The arrangement can be regarded as a macro-pipeline. All the processing elements perform essentially the same computation, which is repeated for every item of data to pass through the array.

systolic array

Figure 2.4-1 Chain of Processors in a Systolic Array

The reasoning behind this architecture centres around the bottleneck between a CPU and its main-memory. Having issued a memory request, a processor has to wait a short time for the memory system to deliver the data item. Once they have a piece of data uniprocessors perform one calculation with it and then either return the result to main-memory or request another data item. Systolic arrays, however, use that data item to perform a calculation at every processor in the chain before returning a it back to main-memory, see figure 2.4-1. The memory access penalty is not paid for every instruction, and so, systolic arrays are much faster than uniprocessors.

systolic array

Figure 2.4-2 Systolic Array

Each processing element comprises a simple arithmetic and logic unit and a communications unit. The arrays are made up of single type of processing element, ie they are homogeneous. Systolic arrays are usually wired into two dimensional grids with only the adjacent elements being able to communicate directly, see figures 2.4-2, 2.4-3 and 2.4-4. Systolic arrays can be programmed to allow data to flow in several dimensions and several directions around the grid.

systolic array

Figure 2.4-3 Systolic Array

systolic array

Figure 2.4-4 Systolic Array

On every beat of a global system clock, each processor passes its results to the next processor in the chain, and receives another data item from the previous processor in the chain. Each processor produces a result every clock cycle, and so complex multi-cycle instructions are not implemented. Systolic arrays are said to be 'lock-stepped' or synchronous. There is no master controller, as found with array processors, and so control is effectively distributed across the network.

The periodic pumping of data around the systolic array is the feature by which systolic arrays get their name. A systole is the name given to a contraction of the heart. When the heart contracts blood moves along the vanes and arteries coming to rest at the end of the contraction. During the brief pause between beats the blood does its work - distributing oxygen and nutrients. When the work is being done the network of vanes and arteries is still. This is followed by another contraction and the whole cycle starts again. Note that when the network is active no 'work' is being done and vice-versa.

Computations that require more operations to be performed than there are data items (ie data items or intermediate results are reused) are said to be 'compute bound'. Furthermore, if a function's operands are near neighbours in a given data structure there is said to be 'data locality'. Systolic arrays are used in compute bound applications with data locality.

Systolic arrays come in a variety of different topologies, as shown in the figures. These topologies reflect either the flow of data in an algorithm, or the structure of the incoming data. Systolic arrays tend to be built for a specific application, often where real-time digital signal processing (DSP) is needed. However, some topologies have been found which are either optimal or efficient for many different algorithms.

The modular and regular design of systolic arrays means that more processing elements can be added to the arrays without any effect on the memory bandwidth.

Examples of general purpose systolic arrays are the Matrix-1 by Saxpy Computer Corporation, and the Warp by Carnegie Mellon University.

General information on systolic arrays came from references [4], [5], [6], [7], [8], [9].

2.4.1 Wavefront Arrays

Everything that has been said about systolic arrays can also be said about wavefront arrays[9]. The only difference is that wavefront arrays do not have a global pulse used to synchronise all the processing elements. Instead wavefront arrays pass on intermediate results or data items whenever they are ready. The receiving processing element confirms it has been sent a data item by returning an acknowledging signal, called the handshaking signal. Wavefront arrays are said to be self-synchronising or asynchronous.

Wavefront arrays allow a multiple or variable number of instructions to be performed at each stage, hence they can be more generally applied than systolic arrays. The price for this extra flexibility, however, is reduced performance due to the need to send handshaking messages.


2.5 Associative Processors and Memory

Associative processors were built for high speed information retrieval from databases. Items to be retrieved from a database are usually read into a processor and compared with a 'key' to see if they are required. The procedure is repeated with the same key until the relevant part of the database has been exhausted. It was noticed that the critical part of most keys is only several bits wide.

Associative processors comprise a simple control unit, several registers and a small amount (eg 8 KBytes) of 'associative memory' which replaces normal random access memory. Under the direction of the control unit processing is performed by the associative memory itself.

associative memory

Figure 2.5 Associative Memory

Associative memory is also called 'content addressable memory', 'parallel search memory', or 'multi-access memory'. Its function is to allow memory items to be selected by reference to their contents rather than by an address. Every word in the associative memory is searched in parallel against a key held in the comparand register which is the same width as memory, see figure 2.5. There is also a masking register that can be used to overlay the comparand register. It allows certain parts of the key to be either forced or ignored during a search. Associative processor instructions include;

                   Greater Than,         	Not Greater Than
                   Less Than,               	Not Less Than
                   Between Limits,      	Exact Match 
                   Smallest                  	Largest
                   Nearest Above        	Nearest Below

Table 2.5 Typical Associative Memory Primitive Operations

The result of every operation is stored in the 'indicator register' which has a single bit associated with every word in memory. Every word in associative memory that produced a valid comparison sets its corresponding bit in the indicator register.

Two examples of associative processors are the 'Parallel Element Processing Ensemble' (PEPE) by 'Bell Labs' and STARLAN by Goodyear Aerospace. STARLAN has 32 modules containing 256 processing elements. Each processing element has 256 256-bit memory words, and operates in bit-serial mode. The memory is searched one 'bit-slice' at a time, therefore a 256 x 32-bit slice of memory can be searched in 32 clock cycles. PEPE has 288 processing elements, each with 1024 words of 32-bit associative memory words. PEPE works in 'bit-parallel' mode. Every bit-slice in every word can be searched in parallel.

Associative processors are attached processors, and require a host computer to administer them. Apart from database applications, they have also been used in image processing and real-time signal processing equipment.


2.6 Data Flow Computers

'Data flow computers' [10], [11] are the hardware realisation of a general 'data flow graph'. These graphs reflect how data is used in an algorithm or a computational process. Data flow graph nodes indicate what is done to the data, while the arcs show which nodes the results are to be transferred to, see figure 2.6-1.

Data flow graph

Figure 2.6-1 Data Flow Graph for y=(c+3)*f

operation token for data flow computer

Figure 2.6-2 Operation Ttoken for Data Flow Computer

Data flow computers execute 'tokens' rather than instructions. A token contains an instruction, several operands, and a destination address for the result, see figure 2.6-2. Once compiled the entire program is represented by a list of tokens. At compile time, however, not all of the operands will be present. The missing operands are the results of other tokens and are filled in as the corresponding source tokens produce their results. A program completes when all the missing operands have been filled in and the resulting tokens executed. Figure 2.6-3 shows how the memory contents change during the execution of a the data flow graph in figure 2.6-1.

memory snapshot in data flow computer

Figure 2.6-3 Three Memory Snapshots During the Execution of y=(c+3)*f by a Data Flow Computer

Only complete tokens are offered for execution. There is nothing in the list of complete tokens to indicate which order they will be executed in. Execution proceeds as data becomes available.

Data flow computers are split up into three major sections:

  1. a collection of simple processors, each of which has enough memory to execute a few machine code instruction.
  2. a memory section which is divided into addressable cells, each of which may use several bytes and store a single token.
  3. a communications network that moves tokens from memory cells to available processors, and the results back again.

The memory system contains the control section of the computer. It accepts 'result tokens' and inserts the data items into the corresponding memory cell. Once a cell has all its operands it becomes active, and the memory controller creates an 'operation token' which can be queued ready for execution, see figure 2.6-4. The communications network sends the operation tokens to any processing elements that are free.

data flow computer

Figure 2.6-4 Data Flow Computer

Once a processing element receives an operation token it is decoded and the instruction executed on its operands. The result(s) are then placed in a result token and are sent back to the memory system which places them in the memory cell(s) indicated by the result pointer. Result tokens contain the operands for other instructions which leads to the generation of new operation tokens and so process repeats until the program is complete.

Once a result token has been sent to the memory system the processing element must wait for another operations token from the memory system. The computer can be enhanced by having a token buffer in front of every processing element. Once the processing element has sent out its result token, it can collect the next operation token from this buffer without delay.

The data flow hardware model of a parallel computer is particularly suitable for being programmed in 'functional languages'. Functional languages are characterised by expressions feeding their results directly into other expressions, which in turn feed other expressions - variables are avoided.

A particular problem in programming many types of parallel computer is 'load balancing' which involves keeping all the processors equally busy. Another problem is that of data dependency and only scheduling instructions when their operands are available. Data flow computers do not face these problems.

All data dependencies are resolved by the data flow graph description of the program. There is no need for explicit parallel declarations in a data flow program, and they have nothing to indicate the order of execution. There is also no need to find parallel algorithms. A suitable order of execution is revealed at run-time by the availability of data. The order of execution is not controlled by the instructions as in a normal computer, but rather by the availability and flow of data.

Load balancing is taken care of by the data flow model and need not be addressed by the programmer. Idle processors simply request more work and this happens at a fine enough level as to balance the work evenly amongst the available processors.

The size of data flow computers is limited by the communications network, which grows faster than the number of processors. At some point it will become the dominating part of the computer and make it uneconomical to add more processors, see chapter 3.

One disadvantages of data flow computers is the relatively large program storage requirements. Tokens take up more memory space than simple instructions and data. A given problem will need more main memory on a data flow computer than on a conventional computer.

Another disadvantage of data flow computers is the indiscriminate execution of instructions. On a conventional computer, some of a program's instructions will usually not have been executed by the time it completes. Branch and jump instructions bypass sections of code which often never need to be returned to. The strategy of data flow computers dictates that all instructions are to be executed, and the unnecessary ones are only distinguished when the results come to be used. Data flow computers, therefore, waste some processing power.

2.6.1 Demand Driven Processing

'Demand driven processing'[10] (also called 'reduction processing') is a variant of data flow processing in which the indiscriminate execution of instructions is avoided. An operand is executed only when its results are needed by another operation, therefore, only the operations needed to get the results are executed. The details of a demand driven computer are significantly more complicated than those of a data flow machine.

Data flow computers have not yet moved into the commercial market place. However, they are an active area of research, and the data flow model is an important influence on the field of parallel computers. The data flow computers built to date have all been attached processors.


2.7 Neural Networks

Neural networks derive their name from the biological data processor - the neuron, see figure 2.7-1. Biological neurons are complex and still not completely understood. However, even a simplified model, implemented electronically, can do significant computation. Neural networks contain a large number of very simple independent processors with uni-directional neuron-to-neuron connections. Neural processing at the level of individual neurons involves five distinct stages:

  1. Read all the inputs to the neuron (these occur at junctions called synapses);
  2. Scale each input by an amount specific to which line it came in on - this is known as weighting the input;
  3. Add all the weighted items together;
  4. Compare this sum with a threshold value;
  5. If weighted input sum is less than the threshold value then
    no output

Biological Neuron

Figure 2.7-1 Biological Neuron

Stages two to five are collectively known as the 'transfer function'. The neurons in a neural network are arranged in several layers so that the outputs of one layer form the inputs of another. Connections are made either randomly or so that there is some correspondence between the problem and the mapping, see figure 2.7-2. The simplicity of these processors allows many thousands of them to be present in any one network.

Neural Network

Figure 2.7-2 Neural Network

The cell body of a biological neuron is called a 'soma'. This has several highly branched nerve fibres connected to it called 'dendrites' which receive input signals from other nerve cells. The neuron's output emanates along a similar but more prominent single nerve fibre called an 'axon'. The axon divides into many side branches called 'axon collaterals'. The resulting multitude of fibre tips form special junctions on other neurons that allow a nerve signal to pass from one neuron to another. Neurons have only two output states, either they are 'firing' or they are not.

This structure is mirrored by electrical neurons where the same output state is transmitted to all the forward connected neurons. Neurons have some local memory, although this is only enough to store the weighting factors and the threshold value.

Training a network takes the place of programming in a standard computer. A network is 'trained' with a set of data for which the results are known. There are several training methods, the most common being called 'back-propagation'. Back-propagation involves two distinct phases. In the first phase the input signals work their way through the network and form certain arbitrary outputs. In the second phase the achieved output is compared to the desired output. The difference between the two forms an error signal. This signal is passed backwards through the network, and is the basis by which the weighting values are adjusted. The more a particular path contributes to an incorrect result the more back propagation reduces the signal. If the original signal is applied again, the result, with the adjusted weightings, will be closer to the desired answer.

Networks are trained with a large set of known samples so that when presented with a new set of signals they will, hopefully, generate the correct answer. Once a network has been trained it is often able to select the correct output, even though the input may be incomplete or unclear. For example, a suitably trained neural network linked to a digital camera could discriminate between objects even if parts of them were covered up.

There is a danger, however, that the network can be overtrained. In this case it will only respond correctly to cases identical to ones it has already 'seen', rather than to cases that are similar. On the other hand, the network can be undertrained. This occurs when not enough learning examples have been presented to the network for it to distinguish between several possible answers. The relationship between learning and recall is poorly understood, and consequently deciding the level of training remains difficult.

The amount of memory in a conventional computer is measured by the number of bytes it has. In neural networks, however, the amount of memory is measured by the number of synapses it has. Memory and processing power are distributed throughout the neural network. The failure of several neurons in a neural network isn't nearly so catastrophic as the failure of several components in a conventional computer. The processing 'algorithm' is finely distributed throughout a neural network and the loss of parts of it can often be compensated for by the remaining network. Networks that continually learn are particularly tolerant of individual neural failures. Neural networks possess a high degree of 'fault-tolerance'. The performance of the network falls slowly as more-and-more neurons fail. Again this is in contrast with conventional computers in which almost any component failure is catastrophic. Neural networks are said to have 'graceful degradation'.

Neural networks are particularly useful for pattern matching or working with problems that may have incomplete or contradicting data. Neural networks are also suitable for some problems where an algorithm is difficult to find. In these cases the neural network becomes a heuristic algorithm. An important feature of neural networks is their ability to learn from their mistakes and improve their responses.

Typically, neural networks work on vast amounts of data very quickly, such as real-time signal conditioning. A major weakness of neural networks, however, is their inability to handle numbers with precision.

The titles 'parallel distributed processing' (PDP) or connectionist networks are preferred by some people rather than the term neural networks. This is because the word 'neural' suggests that there may be an attempt to create artificial brains. Biological neurons are much more complicated than their electrical counterparts and the way in which they learn is still a matter of conjecture. Brains have many different types of neurons, neural networks usually have only one type. The number of neurons in a brain, about 1011, is many orders of magnitude more than it is practical to have in an electrical neural network. Finally, whilst the details of the brain's interconnection topology are still poorly understood it is believed to play a crucial part in the function of the brain.

Neural networks have reached the commercial marketplace, although, currently they are embedded in other systems. Neural networks are configured for a single problem, and most of the networks that have been built are for a single application. Important neural network models include associative memory and optimisation. Given a particular set of inputs an associative memory system will choose a predefined category into which the set best fits. It is used in video and audio processing. A speech recognition system could use neural nets to identify the phonemes in spoken sounds. Optimisation involves finding the most efficient or quickest solution to a problem such as scheduling workloads.

Neural networks set the likely boundaries for parallel computers in several respects. It is unlikely that processing element that can do useful computation will ever be simpler than the neurons in neural networks. Furthermore, there is unlikely to be a parallel computer with a more complex/irregular topology than a neural network.

General information on systolic arrays came from references [12], [13], [14], [15]and [16].


2.8 Multiprocessing

This term is used to refer to a computer that has many independent processing elements. The processing elements are almost full computers in their own right. The main difference is that they have been freed from the encumbrance of communication with peripherals.

There is no need for the processing elements to cope with keyboards, displays, disks, printers, etc. These chores are a considerable burden on a conventional computer. The processing elements typically have a high performance CPU, several MBytes of main-memory, some cache-memory, and, most importantly, high-speed I/O links.

I/O links provide a way for the processing elements to communicate with each other and the outside world. The links can be either static or dynamic. 'Static links' are direct, point-to-point links whilst 'dynamic links' involve a switching network, rather like a telephone exchange, for routing messages. Chapter 3 deals with the topic of communications networks.

Individual processing elements execute anything from small procedures through to complete programs. Depending on the system, there may be a small fragment of an operating system running on each processor. However, without the need to cope with peripherals, this places little burden on individual processing elements. A small number of processing elements in the system will be different from the others. For example, one node is needed to act as the host. The system is started up from this node, and it distributes programs or fragments of code to the other processing elements. Nodes that can handle peripherals are also needed. This task is often given to the host node. If the application is 'I/O bound' there may be several nodes with disk drives attached.

Processing elements are 'loosely coupled'; they run at their own rates, working on different problems or different fragments of the same problem. They only come together when they need to communicate.

Transputer based multiprocessors are investigated in detail in chapters five.


2.9 Distributed Processing

'Distributed processing' is similar in outline to multiprocessing. The individual processing elements, however, have the peripherals needed to operate as fully independent computers.

It was noticed that many institutions had several similar or identical computers in the same building. These computers were placed on a network in order to allow them to access each others file stores. At times there may be spare processing capacity on one computer when it is desperately needed by another.

It is possible to extend the operating system to allow fragments of code to be sent by one machine to be executed on another. An example of how this can be done is the 'remote procedure call' (RPC). This is used in a program in the same way as a subroutine call, except that, the called procedure is executed by another computer on the network. The computer issuing the call is free to continue with other parts of the program.

The main disadvantage of distributed processing system lies in the slow communications due to the distance between machines. Consequently, the amount of communication has to be low relative to the amount of computation for the systems to be used efficiently in this way. Processing elements can also be encumbered by an operating system with a full complement of peripherals.

The main advantage of distributed processing systems is cost. They make use of hardware that is already in place, and the cost of adding to or improving the network and extending the operating system is small compared with that of buying a larger computer.

This is a slowly emerging area of parallel processing, and more work needs to be done to identify areas where it can effectively be used.

General information for this chapter also came from references [17], [18], [19], [20], [21], [22], [23], [24] and [25].



[1] Modi JJ, Parallel Algorithms and Matrix Computation, Clarendon Press, Oxford, 1988, 0-19-859670-7, p29-30
[2] Charlesworth AE, An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FPS-164 Family, Computer, IEEE, 14(9), September 1981, p18-27
[3] Hillis WD, The Connection Machine, The MIT Press, London, 1985, 0-262-58097-7
[4] Fortes J & Wah W, Systolic Arrays-From Concept to Implementation, Computer, IEEE, 20(7), July 1987, p12-17
[5] Foulser D & Schreiber R, The Saxpy Matrix-1: A General-Purpose Systolic Computer, Computer, IEEE, 20(7), July 1987, p35-43
[6] Kung HT, Notes on VLSI computation, p339-356, in 'Parallel Processing Systems, An Advanced course', Evans DJ ed, Cambridge University Press, Cambridge, 1982, 0-521-243661
[7] Kung HT, The structure of parallel algorithms, p65-112, in Advances in Computers, Volume 19, Yovits MC ed, Academic Press, London, 1980, 0-12-012119-0
[8] Kung HT, "Why Systolic Architectures?", Computer, IEEE, 15(1), January 1982, p37-45
[9] Kung SY et al, Wavefront Array Processors - Concept to Implementation, Computer, IEEE, 20(7), July 1987, p18-33
[10] Treleaven PC et al, Data-Driven and Demand-Driven Computer Architecture, ACM Computer Surveys, 14(1), March 1982, p93-143
[11] Watson I & Gurd J, A Practical Data Flow Computer, Computer, IEEE, 15(2), February, 1982, p51-57
[12] Aleksander I & Morton H, An Introduction to Neural Computing, Chapman and Hall, London, 1990, 0412-37780-2
[13] Beale R & Jackson T, Neural Computing: An Introduction, Adam Hilger, Bristol, 1990, 0-85274-262-2
[14] Obermeier KK & Barron JJ, Time to get fired up, Byte, 14(8), August 1989, p217-224
[15] Tank DW & Hopfield JJ, Collective Computations in Neuronlike Circuits, Scientific American, 257(6), December 1987, p62-70
[16] Touretzky DS & Pomerleau DA, What's Hidden in the Hidden Layers?, Byte, 14(8), August 1989, p227-233
[17] Duncan R, A Survey of Parallel Computer Architectures, Computer, IEEE, 23(2), February 1990, p5-16
[18] Duncan R, Parallel Computer Architeectures, p113-157, in Advances in Computers, volume 34, Yovits MC ed, Academic Press, London, 1992
[19] Evans DJ ed, 'Parallel Processing Systems, An Advanced course', Cambridge University Press, Cambridge, 1982, 0-521-24366-1
[20] Gadjski D & Peir J, Essential Issues in Multiprocessor Systems, Computer, IEEE, 18(6), June 85, p9-27
[21] Hennessy I & Patterson D, Computer Architecture a Quantitative Approach, Morgan Kaufmann, Palo Alto, 1990, 1-55880-069-8
[22] Hockney RW & Jesshope CR, Parallel Computers 2, Adam Hilger/IOP Publishing, Bristol, 1988, 0-85274-812-4
[23] Hwang K & Briggs FA, Computer Architecture and Parallel Processing, McGraw Hill, London, 1984, 0-07-031556-6
[24] Modi JJ, Parallel Algorithms and Matrix Computation, Clarendon Press, Oxford, 1988, 0-19-859670-7
[25] Quinn MJ, Designing Efficient Algorithms for Parallel Computers, McGraw Hill, London, 1987, 0-07-051071-7