8.2 Arguments Against Parallel Computers
8.2.1 Barsis vs Amdahl
The Speedup (S) of an application program, executed in parallel, is defined as the time taken to complete the calculation on a serial computer (Ts) divided by the time taken to complete the same calculation on a parallel computer (Tp).
Let the total time taken to complete an application program on a serial computer be set to 1 from the outset. It is usually possible to break some parts of a calculation down into independent blocks that can be executed in parallel. However, there will also be a part of a problem that will not break down in this way, and will have to be executed sequentially. Let f represent the fraction of time taken to execute this sequential component. Because the total time was set to 1 at the beginning, f must lie somewhere between 0 and 1. On a sequential computer the sequential component executes in time f, and the parallelisable component executes in the remaining time, 1-f.
Moving the same problem onto a parallel computer would still leave the sequential component requiring time f. However, under optimal conditions the parallelisable component would now only require time (1-f)/n, where n is the number of processors available. Therefore, the total time taken to complete the calculation on a parallel computer is f + (1-f)/n. Feeding this into the definition of speedup we get equation 8.2.1-1.

S = Ts/Tp ≤ 1 / (f + (1-f)/n)    (Equation 8.2.1-1)
The inequality arises because communication, should it exist, might slow the parallel component down to some extent. Equation 8.2.1-1 is an expression of Amdahl's Law. As the number of processors goes to infinity the right hand term of the denominator goes to zero (ie the parallel component takes an infinitesimal time to complete), leaving equation 8.2.1-2.

S ≤ 1/f    (Equation 8.2.1-2)
Equation 8.2.1-2 is a function of one term - the time taken by the sequential code - the number of processors does not come into it. If 5% of the code has to be done sequentially, equation 8.2.1-2 tells us the speedup is limited to 20 - irrespective of the number of processors. Figure 8.2.1-1 shows how devastating the serial fraction is on speedup. A corollary following from Amdahl's Law states that a small number of sequential operations can effectively limit the speedup of a parallel algorithm.
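The behaviour of equations 8.2.1-1 and 8.2.1-2 can be sketched numerically. The following is a minimal illustration (the function name and the choice of processor counts are mine, not from the text), taking the equality case of equation 8.2.1-1:

```python
# Sketch of Amdahl's Law: best-case speedup for a fixed-size problem
# with serial fraction f on n processors (equality case of equation 8.2.1-1).

def amdahl_speedup(f, n):
    """Upper bound on speedup: 1 / (f + (1 - f) / n)."""
    return 1.0 / (f + (1.0 - f) / n)

# With a 5% serial fraction the speedup approaches, but never reaches, 20.
for n in (10, 100, 1000, 1000000):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

Even with a million processors the speedup creeps towards, but never passes, the 1/f = 20 ceiling.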
Amdahl's law is a strong argument against parallel computers and for many years had turned the attention of many computer designers away from multiprocessors to vector processors. However, many important applications exist that have almost no serial component. Furthermore, a knowledge of Amdahl's law can be used to aid parallelism - by steering parallel programmers away from including large serial fractions in parallel algorithms.
The strongest argument against Amdahl's Law originated from observations at the Sandia National Laboratories in the US of an almost linear speedup for many applications with a 0.4-0.8% serial fraction. This level of speedup should not be possible under Amdahl's Law.
Amdahl's Law, as stated, assumes that exactly the same application is run on several machines that vary in the number of processors they have. Barsis observed that for 'real' applications this was virtually never the case. Usually, as the number of processors increased so did the size of the application. The only time an application would remain the same size is in academic exercises. In mathematical modelling, for example, more processors make it practical to: extend the grid, increase the grid resolution, increase the number of degrees of freedom, or increase the number of iterations in the application.
The formula for speedup is now reevaluated for applications that grow as the number of processors is increased. Let the time taken to solve a problem on a parallel computer be 1. As before the time can be split into serial and parallel components. The total time taken to complete the problem on a parallel computer with n processors is equal to the serial fraction (f) plus the parallel fraction (p).

Tp = f + p = 1    (Equation 8.2.1-3)

If the problem is moved onto a serial computer the parallel components can no longer be executed in parallel but must be done sequentially. The time taken to execute the problem on a serial computer is equal to the serial component, plus n times the parallel component.

Ts = f + np    (Equation 8.2.1-4)
Speedup now becomes:

S = Ts/Tp = (f + np)/(f + p)    (Equation 8.2.1-5)

Substituting equation 8.2.1-3 for the denominator we get:

S = f + np    (Equation 8.2.1-6)

Rearranging equation 8.2.1-3 we get:

p = 1 - f    (Equation 8.2.1-7)

Substituting this into equation 8.2.1-6 we get:

S = f + n(1 - f)
  = f + n - nf
  = n + f - nf
  = n + (1 - n)f    (Equation 8.2.1-8)
Figure 8.2.1 shows how the Barsis formula for speedup compares with Amdahl's. Equation 8.2.1-8 represents a straight line with slope 1-n. It can be seen that for applications that are scaled up with the number of processors a linear speedup is achievable.
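The contrast between the two models can be seen numerically in a small sketch (the 5% serial fraction and 1000 processors are example values of my choosing):

```python
# Compare Amdahl (fixed-size problem) with Barsis (problem scaled with n)
# for the same serial fraction f.

def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)   # equality case of equation 8.2.1-1

def barsis_speedup(f, n):
    return n + (1 - n) * f             # equation 8.2.1-8

f, n = 0.05, 1000
print(amdahl_speedup(f, n))            # stuck below the 1/f = 20 ceiling
print(barsis_speedup(f, n))            # almost linear: about 950
```

For a fixed-size problem 1000 processors buy a speedup of under 20; for a problem scaled up with the machine the same serial fraction still permits a speedup of roughly 950.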
Figure 8.2.1 Comparison between Amdahl and Barsis Models for Speedup on 1000 Processors
Observed speedups may be lower than those predicted because the communication time has been included in the parallel fraction used to calculate the total serial time Ts. On a serial processor this communication would not be needed, so Ts would be slightly lower.
8.2.2 Grosch's Law
Grosch's Law - formulated during the 40's but not published until 1953 - relates the price of a computer to its power. Grosch asserted: "I believe that there is a fundamental rule ... giving added economy only as the square root of the increase in speed - that is, to do a calculation ten times as cheaply you must do it one hundred times as fast".
p = Kc²    (Equation 8.2.2-1)

where p is performance, c is cost and K is a constant.
Figure 8.2.2-1 Cost Performance Relationship Suggested by Grosch
Another way of looking at this is that computer power is a function of the cost squared, see figure 8.2.2-1.
To see how this is an argument against parallel computers let us see what Grosch's Law predicts about buying a computer of power 2y. Let us assume that we could either buy a single processor of power 2y, or we could buy two processors of power 1y and connect them together.
Removing all the constant terms from equation 8.2.2-1 we get:

p = c²    (Equation 8.2.2-2)

or, taking the square root of both sides:

c = p½    (Equation 8.2.2-3)
Costing the parallel machine by equation 8.2.2-3 we get:

c = y½ + y½ = 2y½
Costing the serial machine gives us:

c = (2y)½ ≈ 1.414y½

which is cheaper than the equivalent parallel machine. Similarly, a serial machine of power 100y would only cost 10% of a parallel machine with 100 processors of power y. Grosch's Law is an example of an 'economy of scale', where the overheads associated with buying one big item are less than those of buying the same amount in several small batches.
At the time an analysis of the prices and power of a variety of machines seemed to substantiate Grosch's Law.
Let us now perform a 'thought experiment' and push Grosch's Law to its limits. Let us assume that, given the laws of nature, the most powerful single processor possible has been built. (Details, such as architecture, materials, and switching speed, are irrelevant as this is only a thought experiment.) Given the premise that we have the most powerful single processor, it is impossible to build a processor twice as fast - at any price. Here, therefore, Grosch's Law must break down.
Now consider the most powerful single processor that has actually been built. This time it may be possible to buy a processor that is twice as powerful. However, production of this processor would require a new architecture, new materials, a great deal of research, prototyping, probably a delivery time of several years, and consequently much more than the 1.414 times the price of the current fastest processor predicted by Grosch's Law.
Something seems wrong - there is evidence for Grosch's Law, but common sense tells us it can't be right. Taking into account the exaggerated costs of top-of-the-range machines, Handler has suggested that the cost of manufacturing a range of machines follows something more like the 'S' curve in figure 8.2.2-2.
Figure 8.2.2-2 More Realistic Price Performance Relationship
There is evidence to suggest that manufacturers price their computers according to Grosch's Law (figure 8.2.2-3) rather than by a fixed multiple of the manufacturing cost, see figure 8.2.2-4. If computers were priced as in figure 8.2.2-4, the machines at the top of the range would appear more expensive per MFLOP than small machines. This would make it more difficult for sales personnel to persuade their customers to upgrade their current machine rather than buy from a different manufacturer. Grosch's Law appears to be a result of marketing policy more than of manufacturing expense.
Figure 8.2.2-3 IBM 360 Pricing Policy
Figure 8.2.2-4 More Representative Pricing Policy
Ein-Dor split computers into five categories based on price, roughly corresponding to supercomputers, big mainframes, small mainframes, minicomputers and microcomputers. His analysis showed that machines in the same category follow Grosch's Law. However, Grosch's Law is not followed when machines in different categories are compared.
Mendelson has re-examined Ein-Dor's data and found that if the categories are removed, and the data analysed as a whole, a linear cost-vs-power relationship cannot be ruled out. He suggested that splitting computers into categories introduced a selection bias which led to the appearance of Grosch's Law.
8.2.3 Minsky's Conjecture
Minsky's conjecture states that, due to the need for parallel processors to communicate with each other, speedup increases only as the logarithm of the number of processing elements, see figure 8.2.3-1. This would make large-scale parallelism unproductive.
Figure 8.2.3-1 Minsky's Conjecture
Minsky's evidence for this comes from parallel computers of the late 60's and early 70's, which had 2-4 processors. At this time communication was not very well catered for, and Minsky's conjecture held. Two innovations have changed this state of affairs, and today's parallel computers often achieve almost linear speedup. The first is clever multitasking, which allows processors to interleave processing with relatively low speed communications. Multitasking can dramatically reduce the amount of time a processor wastes waiting during communication.
The other development is the use of 'Direct Memory Access' (DMA)-like units to do the communication for the processor; the transputer does this with its link engines, see section 4.4. The DMA unit is told what to transmit, and where to transmit it, while the processor gets on with other tasks. By cycle stealing the DMA can further reduce the amount by which communication impinges on computation. In this way the processor has effectively been decoupled from communication.
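The benefit of this decoupling can be sketched with a toy timing model (the figures and function names are invented purely for illustration):

```python
# Toy model of one compute/communicate step. Without a DMA-style unit the
# processor performs the communication itself, so the two costs add; with
# one, the link engine transmits while the processor computes, so only the
# longer of the two is paid. Times are in arbitrary units.

def step_time(compute, comms, dma):
    return max(compute, comms) if dma else compute + comms

compute, comms = 10.0, 8.0
print(step_time(compute, comms, dma=False))  # 18.0: processor waits on comms
print(step_time(compute, comms, dma=True))   # 10.0: comms hidden behind compute
```

As long as the communication per step takes no longer than the computation, it is hidden entirely.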
It has been further argued that as the number of processors grows so does the amount of communication, and at some point, even with high speed DMA communications, speedup will be limited, see figure 8.2.3-2. On a single processor a process makes data available to a second process by saving it in memory, from where the second process can read it. This is communication between processes, even though both processes reside on the same processor. If that piece of data needs to be made available to a process on another processor then it can just as easily be sent to an I/O unit as to a memory unit - the cost is comparable.
Figure 8.2.3-2 Crippling Communications Overhead
The main advantage of main memory as a communications medium is that all the data is made available to all the processes. On a message passing parallel computer, if it is not known in advance which data will be needed by a particular processor then problems can occur: either all data has to be transmitted to all processors, or the processor wishing to use the data must find out where it is and request it. Both of these solutions to the problem of making data available as it is needed will make the application communication bound, as shown in figure 8.2.3-2.
Fortunately, there are many applications where an almost linear data flow graph can be drawn between processes. In these cases data use is completely predictable and the algorithm can ensure that data is sent to where it is needed. In essence the point at which an application becomes communications bound is a product of the number of processors, the topology, and the algorithm. For many applications the algorithm can be adjusted to maintain almost linear speedup. Sometimes Minsky's Conjecture is observed; however, this can be regarded as a worst case scenario, and usually changing the algorithm gives much better results.
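The communications-bound point can be sketched with a toy model in which the computation shrinks as 1/n while the per-processor communication grows with n, as it might for an all-to-all style algorithm (the workload and cost figures are assumptions, not measurements):

```python
# Toy model of the communications-bound point. Total work is shared among
# n processors, but an all-to-all style algorithm makes each processor's
# communication cost grow with n, so speedup peaks and then falls.

def run_time(n, work=10000.0, comm_per_peer=1.0):
    return work / n + comm_per_peer * n

def speedup(n):
    return run_time(1) / run_time(n)

for n in (10, 100, 1000):
    print(n, round(speedup(n), 1))
# The peak lies near n = sqrt(work / comm_per_peer) = 100; beyond that,
# adding processors actually slows the application down.
```

Changing the algorithm so that communication grows more slowly than linearly with n pushes this peak out, which is why predictable data flow graphs matter.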
8.2.4 Little Problems
A large parallel computer with n processors may be required to run a small application requiring only k processors, where k < n. This situation would leave n - k processors idle, which is a waste of resources. (This problem was touched upon in the earlier section on 3-D grid topology.) It could be argued that this waste is not really a problem because many computers are often under-utilised. (Many of an institution's personal computers stay idle most of the time, and minicomputers tend to be idle 16 out of 24 hours - ie outside normal working hours.) The extra capacity is there so that the computer can produce results fairly quickly on demand.
Simply ignoring poor efficiency is not a satisfactory defence of parallel computers. For parallel computers with a medium to large grain size, however, it can be argued more strongly that small applications are not wasteful: any processors left over once the main application is running can be assigned to smaller batch jobs by the operating system. This could also be done with small grain sized parallel computers, but it is much more difficult.
8.2.5 The Myth of the General Purpose Computer
Another objection to the likely success of parallel computers is that treating processing power as a resource is so complicated that it will not be possible to build a general purpose parallel computer in the near future. This argument assumes that a general purpose computer is something the market particularly needs, and that it is what we already have with uniprocessors.
I believe neither of these is so. Microcontrollers can be thought of as specialised computing chips. They have a CPU core - often taken directly from the manufacturer's main range of microprocessors. Usually the memory capacity and sophistication are reduced to allow room for a multichannel I/O interface, sophisticated interrupt logic and extra timers. Microcontrollers are built into products that require some 'intelligent' behaviour. These are called embedded systems because the processor is not accessible by the user. Integrated circuit sales figures show that after memory chips the most saleable device is the microcontroller, not the microprocessor. Therefore, the largest market for computing chips is not for building general purpose computers.
Recently the computational performance of microcontrollers has increased rapidly. The microcontroller market is forecast to grow, as is the demand for increased performance. Microcontrollers are being built with ever more computing power; however, these are becoming expensive. In some cases it is now cheaper to use several average microcontrollers rather than one very high performance one (another counter-example to Grosch's Law). This is particularly true of Digital Signal Processing (DSP) chips.
A surprising stage in the design of general purpose computers can best be called 'tweaking'. This happens when most of the software has been written and the machine is just about to go into full production. Various people responsible for the design of the machine, along with those selling applications, run sample programs on the new computer. At this stage they are not interested in the results produced by the program, but in its effect on the machine; they want to know what the program does to the network, disks, operating system, buffer spaces, cache etc.
Advanced tools, both software and hardware, have been prepared in advance just to aid this stage of design. These include tools to indicate how much CPU time is used by the operating system and how much by the various applications, how much buffer space is optimal for a particular application, and how the cache memory performs under various applications and system loadings.
At the end of this stage the machine can go into full production. The buffer spaces should be optimal, the cache is optimal, the network latency is optimal. The question arises: optimal for what? The machine is usually targeted at a fairly specific market: corporate local network, scientific small mini, scientific minisupercomputer, educational local network, stand alone workstation, DP single user mini, etc. Tweaking allows the machine to perform well on specific benchmarks recognised by the targeted niche. Unfortunately, optimising one feature will often degrade others, so tweaking lowers a machine's performance outside of its niche. The idea that a machine has a niche is counter to the argument that it is a totally general purpose machine.
Little of this is overtly stated in a manufacturer's brochures. It will appear that only a few machines from a manufacturer's range fit your needs. Some manufacturers don't seem to have anything suitable - although they have an extensive range of products. Often, therefore, when it comes down to selecting a computer the choice is small, because machines are designed for highly specific markets.
[To some extent a machine can be reconfigured on site by a systems manager. However, many features such as cache memory size, disk performance, memory hierarchy, are fixed during manufacture.]
It is unlikely that a computer built for one niche will be much used in another. On the occasions when this does occur the users seem to have become accustomed to poor performance. If, for example, someone wants to use a scientific minisupercomputer for wordprocessing, they most likely will not question the limited choice of software, the inappropriate display, and the lack of a laser printer. "Oh that's the mini - it's not really intended for wordprocessing."
8.2.6 Software Investment
Users have a huge investment in both computer hardware and computer software. When it comes to changing the hardware it is not always necessary to change all the software. Providing the change in computer architecture is not too drastic, it may be possible to preserve the investment in software simply by recompiling the old applications on the new machine.
It is argued that parallel computers represent such a large change in computer architecture that the old programs could not be reused and new programs would have to be written. This not only represents a loss of investment in the old programs but may well represent considerable expense in getting new ones.
As an argument against parallel computers it is unduly pessimistic. It assumes that the applications that exist today will be the only applications to exist in the future. Parallel computers will find new applications that are not feasible on current machines due to cost or performance. Some old software has reached the end of its useful life and is in need of rewriting. If this has to be done anyway it might as well be rewritten for a parallel computer as for a serial computer.
Many parallel computer manufacturers recognise the need to maintain the software investment and provide 'intelligent' cross-compilers that convert standard FORTRAN programs into parallel FORTRAN programs. This will not produce programs as efficient as those written for a parallel computer in the first place, but it allows the old programs to continue being used.
Many users run standard applications, eg spreadsheets, database managers, accountancy packages, CAD packages, etc. Parallel versions of some standard packages are gradually being made available, and applications can be moved across with little or no change to the user's files.
Over the past several years more and more parallel computers have been reaching the market place. Initially, these were advanced supercomputers going into national science laboratories. Now small parallel computers are undercutting the minicomputer market whilst some are used to boost the performance of workstations and micros. All supercomputers currently use multiple processors and an increasing number of supercomputers now have massively parallel architectures. With the introduction of parallel computers into the market place the argument over whether parallel computers are practical has changed to which form of parallel computer is most practical.