This document is chapter 1 of my M.Phil thesis.
Parallel Computer Taxonomy, Wasel Chemij, MPhil, Aberystwyth University, 1994
It explains various published parallel computer taxonomies, and introduces a new one based on how I saw the field developing.
At the moment there are only a few chapters is on line so apologies for any cross references to other chapters.
Weather Prediction by Numerical Process,
".... a hall staffed with 64,000 'computers'. Each computer would deal with the column of atmosphere above a location on the earth's surface. The ensemble would be controlled by a director who coordinated the activity and kept the calculations in step; we may call this concept the Albert Hall computer."
To understand parallel computer taxonomy one must have an appreciation of the many ways parallelism can be expressed in computer architecture.
This thesis presents a survey of many different aspects of parallel computer hardware. After the introduction, chapter two describes the operating principles behind various types of parallel computer. Chapter three gives a detailed description of different ways that the individual units of a parallel computer can be interconnected. These two chapters set the foundations for a more detailed description of various commercial parallel computers in chapters four, five and six.
A detailed description of the transputer is given in chapter four and is drawn upon in the next chapter which describes several complete computers based around the transputer. Particular attention is given to the transputer because it is the first and only microprocessor to have been designed from the outset with large-scale parallelism in mind, and so is of considerable practical importance. Non-transputer parallel computers are discussed in chapter six in order to show where the transputer fits in to the parallel computer realm.
By this stage the reader should be aware of many of the issues that are important to the structure of parallel computers. The contents of all the previous chapters are brought together in chapter seven which presents a new parallel computer taxonomy.
The scope of the taxonomy is the structure and operation of parallel computers. It is felt that a particularly broad overview is needed to give the reader an insight into the operation of parallel computers at many levels. Consequently, this thesis covers topics all the way from the transputer chip through to complete system design.
Parallelism introduces many practical problems not present in serial computers. Furthermore, some applications can not be broken down in a way that would benefit from running on a parallel computer. These issues and several important philosophical objections to parallelism are addressed in the final chapter.
This chapter comes in two parts. The first part gives a general definition of parallelism and concurrency. Concurrent programming was largely developed in the context of operating systems. Some important steps in the early development of operating systems are highlighted to show how this occurred. It will be shown that concurrent programs can increase the throughput of programs even on single processors.
The second part introduces the transputer chip and the concurrent programming language Occam. The transputer chip and transputer based parallel computers will be described in more detail in later chapters.
Since the first stored program computer the architecture of serial computers has followed the von Neumann model. The processor and the memory are separate entities linked by a communication pathway called a bus. In the earlier machines the processor and memory were housed in different cabinets. Memory transfers occur in the following manner: The processor has a memory interface unit which issues a signal on the 'memory request line' indicating that some form of memory access is about to be made. An address is set up on the address bus and finally a signal is issued indicating whether a read or a write operation will be performed.
These signals are interpreted by a decoder in the memory unit. The 'memory request' signal primes the decoder circuitry for the forthcoming transfer. The decoder uses the address to identify the group of memory cells to be affected. The Read/Write signal tells the memory decoder whether to copy the memory cells to the data bus or whether to read the data bus and write the data to the memory cells.
This is an excellent architecture for the serial computer, it is well understood and still much used today.
Processor and memory have to funnel data on and off the bus that lies between them. The unit on the processor that sets up a memory address and issues memory requests has a fixed operating rate. Furthermore, the memory units and the bus have fixed operating rates. These rates are not reduced when more memory is added nor when more functional units are added to the processor. When attempting to extend a computer's architecture it is the processor-memory connection that forms the rate determining step which limits the computer's speed. This part of a von Neumann computer has come to be known as the 'von Neumann bottleneck'. Parallel computing is largely concerned with getting around this bottleneck by various ingenious means.
From Richard Feynman's Autobiography "Surely you're joking Mr. Feynman"
" ... one of the secret ways we did our problems was this. The problem consisted of a bunch of cards that have to go through a cycle. First add, then multiply ---- and so it went through the cycle of machines in this room, slowly, as it went round and round. So we figured a way to put a different colored set of cards through a cycle too, but out of phase. We'd do two or three problems at a time. "
This is an example of parallel computation before electronic computers were commonly available. Even this is not the earliest description of several 'computing devices' (including people) cooperating on the solution of the same problem.
For the purpose of this thesis a working definition of parallelism is as follows:
Parallelism is normally used to improve the throughput of a computing system but, in some circumstances, it also makes a system much more reliable or easier to create. The functional units can be as simple as a CPU's pipeline stages, or as complex as autonomous computers.
The above definition refers to hardware. A program written in a way such that its parts could run on a parallel computer is called 'concurrent'. Concurrency can be split into two forms. Logical concurrency is present when the concurrent program is run on a serial (von Neumann) computer. The appearance of parallel execution can be simulated by various techniques, eg multiprogramming - see section 184.108.40.206. When different fragments of the concurrent program are executed at the same time by different functional units, the program is said to have physical concurrency or to be a parallel program.
Parallelism, in its various forms, has been rediscovered many times during the history of the computer. The first examples of concurrency in electronic computers were programs called spoolers which ran on mainframes of the late 50's and early 60's.
Printers and other input/output (I/O) devices are much slower at transferring data than a processor. Many thousands of CPU cycles may be wasted in waiting for an I/O transfer to complete. Spooler programs were written to allow the processor to do useful work with the cycles that would otherwise be lost waiting for a plodding printer.
Logically, the spooler sits between the I/O device and the user program. It intercepts output from the user's program that is destined for an I/O device and, periodically, sends it out to that device. As soon as the spooler has accepted the user program's output the user program is free to continue. The processor is no longer idle waiting for the transfer to complete. Hence the spooler overcomes the speed mismatch between the CPU and the I/O device. Logically there are two programs running on the same computer at the same time. One doing computation for the user and the other, the spooler, mopping up the output and transferring it to the outside world. Physically, however, there is only one program running at a time, with the processor switching quickly between the two logical programs.
In the early 60's the mechanism of spooling was refined into an early form of multiprogramming. In multiprogramming several programs are loaded into the computer's main memory. Spoolers normally return control to the program that just initiated an I/O operation. If a program initiates one output immediately after another, the spooler would not be able to cope with the second output until the first output had cleared, and so the processor would have to remain idle.
In multiprogramming several application programs are resident in the computer's main memory at the same time. A 'supervisor program' monitors their progress. As soon as a program initiates an I/O operation it is descheduled and another program is started. This allows spooler like programs to take care of the I/O and still keep the processor busy. If the relevant 'spooler' is not free the program is still descheduled so that the processor can be kept busy executing code for another program.
Also in the early 1960's it was noticed that the computation necessary for data transfer to and from a peripheral device is simple but time consuming. Performance could be improved by delegating the subordinate I/O tasks to other, less sophisticated (and much cheaper) processors that would run in parallel and take care of the I/O for the main CPU. The I/O coprocessor can be thought of as the hardware implementation of the spooler programs. The I/O coprocessor physically sits between the main processor and the peripheral in the same way that the spooler logically does.
Initially job-specific coprocessors included: tape handlers, disk processors, and I/O processors for terminals. Currently the available coprocessors include: speech processors, math coprocessors, database search processors, graphics processors, and virtual memory processors. Many coprocessors used to be invoked by the main processor, they would then compute a result while the main processor remained idle waiting for the result. The advantage came because the coprocessor could perform the specialist computations faster than the main processor. More and more coprocessors nowadays work autonomously thus freeing the main processor to work on other tasks while waiting for results.
The development of operating systems played such a significant role in identifying important issues in concurrent programming, that concurrent programming is regarded as having developed out of operating systems.
With the increasing diversity of peripherals, additional 'spoolers' were needed to manage them efficiently. Spoolers, along with some other small programs useful in managing a computers resources, gradually evolved into what we know today as operating systems. The first operating systems began to appear around 1963-64, however, these were only to be found on the larger computers and most small computers had to be managed without.
Logically, the operating system is a collection of concurrent processes. On a serial computer several operating system processes could be active - these are processes that have been invoked, and they are able to proceed (ie they are not blocked by the need to communicate), but only one can 'have' the processor. Most of the operating system's processes would not have been invoked and hence are dormant. Operating system processes can be activated by a timer, an interrupt, a user command, or by another process. Some poll the keyboard, others send data to the printer when it is ready, and so on. Most of the processes are small and require very little processing power.
In the late 1950's and early 60's a few systems were developed for time-sharing. Time-sharing allowed many people to interact with the computer via terminals while the computer switches its attention between them. Switching occurs so quickly and so often that, to each user, it appears that they have the full 'attention' of the machine. Between the mid and late 60's time-sharing was properly integrated into many operating systems.
Resource contention is a fundamental issue in parallel computing. It also occurs in multiprogramming and generally throughout operating systems software, however, it can be understood most easily in the context of time-sharing. A simple case of resource contention occurs as follows: Two people are using a time-sharing system. Both are running programs that will, at some time or other, use a disk file. By chance both programs wish to access the disk at roughly the same time. The general solution of how to manage this contention is not as simple as it may at first appear.
One attempt at a solution is to make every process queue up for access to the disk. The first process in the queue has sole use of that peripheral until it has finished with it. After the first process has finished, the disk is allocated to the next process in the queue, and so on.
Now imagine a process whose first action is to open a disk file - this action gains it sole use of the disk. Next it does a lot of work that requires no disk access. Finally, just before it terminates, it writes to and closes the disk file. It can be seen that the above process is not 'fair'. It monopolises the scarce resource and does not allow the other processes to use it even though it makes little use of it itself. Queuing, therefore, is not an efficient general solution to resource contention, and other resource allocation algorithms are needed.
In the late 80's, spurred on by cheap microprocessors, researchers turned their attention to the problem of delegating the central task of computing to multiple cooperating processors.
Many different parallel computer architectures have been created and investigated over the years. Some of the most practical methods include: multiprocessing, array processing, and vector processing. The main concern of this thesis is the method chosen by the company Inmos Ltd of Bristol, England. In 1985 this method was realised in a piece of silicon called the T414 transputer which has subsequently been expanded into a family of devices. Its name comes from the two words TRANSisitor and comPUTER. The transputer is a computer in its own right, but, in the same way that a transistor can appear many times in an electronic circuit the transputer is a component that can appear many times in a parallel computer.
Parallel computers have to be programmed in a parallel computer language in order to be used efficiently. Many programming languages support concurrency, however, few are able to turn the potential parallelism of concurrency into full physical parallelism. Some computer languages have had the notion of concurrency added to them. Unfortunately, the crude way in which concurrency is 'bolted' onto their syntax makes concurrent programming much more cumbersome than serial programming.
A new language was needed to cope with the concept of parallelism gracefully. While developing the transputer Inmos Ltd, along with the Programming Research Group at Oxford University, developed a new concurrent programming language called Occam. Furthermore, the Oxford group used formal methods to prove the correctness of many parts of the Occam language.
Occam takes into account the unique architecture of the transputer, and so is the most efficient language to use on a transputer computer. Inmos intended Occam to be the lowest-level programming language for transputers. It provides bit manipulation and fine detail, like a low-level language but is also highly structured and strongly typed, like a high-level language. Inmos do not support an assembly language for the transputer.
Occam is strongly influenced by the CSP language developed by Tony Hoare at Oxford University's Programming Research Group and by Pascal developed by Niklaus Wirth. Programs are split into groups of processes, and in Occam individual statements can be independent processes. Occam specifies whether these groups of processes must be executed sequentially or may be executed concurrently. If only one transputer is available then the concurrency is achieved by interleaving. Should several transputers be present, the processes are distributed between them and execution is allowed to proceed in parallel.
With the advent of the transputer it is now becoming practical to regard processing power as a modular, scalable resource within a computer. Processing power can be added and removed with as little fuss as adding and removing blocks of main memory.
Occam processes communicate with each other along channels. Channels exist in two forms, software channels and hardware channels. A process can use both types according to its needs. Software channels are for communication between two processes on the same transputer, however, if the processes are on different transputers then hardware channels must be used. A hardware channel corresponds to a physical wire link between the two transputers concerned.
On a parallel computer the following situation will often arise: Process A needs a piece of data xa to work on before it can continue, which is provided by process B. In a sequential language the position of a statement in a program listing defines when that statement will be executed relative to all the other statements. More importantly, the position declares that a particular statement's 'results' will be available for use by all subsequent statements. (This situation is clouded by transfer of control instructions which allow groups of statements to be repeated, inserted or bypassed [roughly corresponding to loops, calls and jumps].)
If, in the program text, the process generating xa is above the process consuming it then it can be guaranteed that when
xa is needed it is available. In an Occam program every group of statements must be preceded by a statement identifying in what manner they are to be executed. When the code word
PAR precedes a group of statements they may be execute in parallel (or concurrently on a single transputer). The order of a group of statements in a listing has no relevance to the order in which they will be executed. Consequently, some processes may have to wait for data from other processes before they are able to continue.
In concurrent programming it is more difficult to ensure that data is available where and when it is needed than in sequential programming. One consequence of having processes whose execution timing is critical is that if more processors are added to a system, a program that once produced correct results may now fail! (Previously data was ready when it could be used. But now, with the improved performance, the data may be passed by before it can be used, or an attempt may be made to consume it made before it is ready.)
Every process in a concurrent program must be able to start at any time during the execution of the program and in any order relative to the other processes. It must, however, still consume the correct data and hence, produce the correct result. Any processes for which the input data is not ready must wait, patiently, until all the data they need is ready. This involves an extra dimension in programming thought, and a whole new set of algorithms. Occam provides primitives for dealing with many of the difficulties described above, however, the difficulties still need to be borne in mind because they can recur at higher levels.
In a sequential program the number of scenarios is small relative to parallel programs where it can be huge. It is possible to make certain assertions about the state of a running sequential program and about the effect of specific instructions. These assertions can be combined into a logical proof that the program is correct. Similar assertions about the state of a parallel program are much more difficult to make. This is due to the temporal considerations mentioned above and the vast number of scenarios.
Not everything about concurrent programming is more difficult than sequential programming; some problems are much easier to understand and implement as concurrent programs eg operating systems, (see section 1.2.5). With a language that supports concurrency these problems are much easier to implement than in sequential languages.
Public perceptions of the capabilities of computers have always exceeded what manufactures are able to offer. The scientific community also indulge in such 'wishful thinking', and in recent years applications have been conceived that require a performance beyond the limits allowed by physics for a single processor. In some cases the performance can be reached providing it is possible to divide the application into smaller sub-applications and execute them simultaneously on several processors.
A program is split up, at compilation, into many independent but communicating processes. These are distributed amongst the available processing elements - the transputers. Distribution takes the place of loading a program in a conventional machine. All the processors cooperate in the execution of the program. The transputers are able to communicate with their neighbours by either two or four fast serial links. Each link is able to support bi-directional communication to only one other transputer.
Parallel computers attain their high performance by doing as many operations as possible at the same time. All current supercomputers achieve much of their performance by utilising various parallelising techniques. With the advent of the transputer, the performance of personal computers can be boosted to that of mid-range minicomputers and 'tea chest' sized super-minicomputers are commercially available.