Why Do Computers Stop and What Can Be Done About It?

Transcript

1 TANDEM COMPUTERS. Why Do Computers Stop and What Can Be Done About It? Jim Gray. Technical Report 85.7, June 1985. PN87614

2

3 Why Do Computers Stop and What Can Be Done About It? Jim Gray, June 1985. Tandem Technical Report 85.7

4

5 Tandem TR 85.7: Why Do Computers Stop and What Can Be Done About It? Jim Gray, June 1985. Revised November, 1985.

ABSTRACT

An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure. Various approaches to software fault-tolerance are then discussed -- notably process-pairs, transactions and reliable storage. It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution -- the key to software fault-tolerance.

DISCLAIMER

This paper is not an "official" Tandem statement on fault-tolerance. Rather, it expresses the author's research on the topic.

An early version of this paper appeared in the proceedings of the German Association for Computing Machinery Conference on Office Automation, Erlangen, Oct. 2-4, 1985.

6

7 TABLE OF CONTENTS

Introduction ................................................... 1
Hardware Availability by Modular Redundancy .................... 3
An Analysis of Failures of a Fault-tolerant System ............. 7
Implications of the Analysis of MTBF ........................... 12
Fault-tolerant Execution ....................................... 15
    Software Modularity Through Processes and Messages ........ 16
    Fault Containment Through Fail-Stop Software Modules ...... 16
    Software Faults Are Soft, the Bohrbug-Heisenbug Hypothesis  17
    Process-pairs For Fault-tolerant Execution ................ 20
    Transactions for Data Integrity ........................... 24
    Transactions for Simple Fault-tolerant Execution .......... 25
Fault-tolerant Communication ................................... 27
Fault-tolerant Storage ......................................... 29
Summary ........................................................ 31
Acknowledgments ................................................ 33
References ..................................................... 34

8

9 Introduction

Computer applications such as patient monitoring, process control, online transaction processing, and electronic mail require high availability.

The anatomy of a typical large system failure is interesting: Assuming, as is usually the case, that an operations or software fault caused the outage, Figure 1 shows a time line of the outage. It takes a few minutes for someone to realize that there is a problem and that a restart is the only obvious solution. It takes the operator about 5 minutes to snapshot the system state for later analysis. Then the restart can begin. For a large system, the operating system takes a few minutes to get started. Then the database and data communications systems begin their restart. The database restart completes within a few minutes but it may take an hour to restart a large terminal network. Once the network is up, the users take a while to refocus on the tasks they had been performing. After the restart, much work has been saved for the system to perform -- so the transient load presented at restart is the peak load. This affects system sizing.

Conventional well-managed transaction processing systems fail about once every two weeks [Mourad], [Burman]. The ninety minute outage outlined above translates to 99.6% availability for such systems. 99.6% availability "sounds" wonderful, but hospital patients, steel mills, and electronic mail users do not share this view -- a 1.5 hour outage every ten days is unacceptable. Especially since outages

10 usually come at times of peak demand [Mourad].

These applications require systems which virtually never fail -- parts of the system may fail but the rest of the system must tolerate failures and continue delivering service. This paper reports on the structure and success of such a system -- the Tandem NonStop system. It has MTBF measured in years -- more than two orders of magnitude better than conventional designs.

Figure 1. A time line showing how a simple fault mushrooms into a 90 minute system outage.

    Minutes   Event
       0      Problem occurs
       3      Operator decides problem needs dump/restart
       8      Operator completes dump
      12      OS restart complete, start DB/DC restart
      17      DB restart complete (assume no tape handling)
      30      Network restart continuing
      40      Network restart continuing
      50      Network restart continuing
      60      Network restart continuing
      70      DC restart complete, begin user restart
      90      User restart complete

11 Hardware Availability by Modular Redundancy

Reliability and availability are different: Availability is doing the right thing within the specified response time. Reliability is not doing the wrong thing.

Expected reliability is proportional to the Mean Time Between Failures (MTBF). A failure has some Mean Time To Repair (MTTR). Availability can be expressed as a probability that the system will be available:

    Availability = MTBF / (MTBF + MTTR)

In distributed systems, some parts may be available while others are not. In these situations, one weights the availability of all the devices (e.g. if 90% of the database is available to 90% of the terminals, then the system is .9x.9 = 81% available.)

The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement. Spare modules are configured to give the appearance of instantaneous repair -- if MTTR is tiny, then the failure is "seen" as a delay rather than a failure. For example, geographically distributed terminal networks frequently have one terminal in a hundred broken. Hence, the system is limited to 99% availability (because terminal availability is 99%). Since terminal and communications line failures are largely independent, one can provide very good "site" availability by placing two terminals with two communications lines at each site. In essence, the second ATM provides instantaneous repair and hence very high

12 availability. Moreover, they increase transaction throughput at heavy traffic locations. This approach is taken by several high availability Automated Teller Machine (ATM) networks.

This example demonstrates the concept: modularity and redundancy allow one module of the system to fail without affecting the availability of the system as a whole, because redundancy leads to small MTTR. This combination of modularity and redundancy is the key to providing continuous service even if some components fail.

Von Neumann was the first to analytically study the use of redundancy to construct available (highly reliable) systems from unreliable components [von Neumann]. In his model, a redundancy of 20,000 was needed to get a system MTBF of 100 years. Certainly, his components were less reliable than transistors; he was thinking of human neurons or vacuum tubes. Still, it is not obvious why von Neumann's machines required a redundancy factor of 20,000 while current electronic systems use a factor of 2 to achieve very high availability. The key difference is that von Neumann's model lacked modularity: a failure in any bundle of wires, anywhere, implied a total system failure. Von Neumann's model had redundancy without modularity. In contrast, modern computer systems are constructed in a modular fashion -- a failure within a module only affects that module. In addition, each module is constructed to be fail-fast -- the module either functions properly or stops [Schlichting]. Combining redundancy with modularity allows one to use a redundancy of two rather than 20,000. Quite an

13 economy! To give an example, modern discs are rated for an MTBF above 10,000 hours -- a hard fault once a year. Many systems duplex pairs of such discs, storing the same information on both of them, and using independent paths and controllers for the discs. Postulating a very leisurely MTTR of 24 hours and assuming independent failure modes, the MTBF of this pair (the mean time to a double failure within a 24 hour window) is over 1000 years. In practice, failures are not quite independent, but the MTTR is less than 24 hours, and so one observes such high availability.

Generalizing this discussion, fault-tolerant hardware can be constructed as follows:

* Hierarchically decompose the system into modules.

* Design the modules to have MTBF in excess of a year.

* Make each module fail-fast -- either it does the right thing or stops.

* Detect module faults promptly by having the module signal failure or by requiring it to periodically send an I AM ALIVE message or reset a watchdog timer.

14 * Configure extra modules which can pick up the load of failed modules. Takeover time, including the detection of the module failure, should be seconds. This gives an apparent module MTBF measured in millennia.

The resulting systems have hardware MTBF measured in decades or centuries.

This gives fault-tolerant hardware. Unfortunately, it says nothing about tolerating the major sources of failure: software and operations. Later we show how these same ideas can be applied to gain software fault-tolerance.
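The availability formula and the duplexed-disc estimate can be restated as a minimal Python sketch. The pair-MTBF line assumes the standard independence-model approximation MTBF^2 / (2 x MTTR); with the nominal 10,000 hour rating it yields a figure in the hundreds of years, and the "over 1000 years" quoted above follows once the rated MTBF is somewhat higher than that floor. The function names are illustrative only.

    # Illustrative sketch of the availability arithmetic above.
    # Assumption: pair MTBF ~= MTBF^2 / (2 * MTTR), the usual independence-model
    # approximation for the mean time to a double failure within one repair window.

    HOURS_PER_YEAR = 8766.0

    def availability(mtbf_hours, mttr_hours):
        """Availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    def pair_mtbf_hours(mtbf_hours, mttr_hours):
        """Mean time to a double failure of a duplexed pair, assuming independence."""
        return mtbf_hours ** 2 / (2.0 * mttr_hours)

    # Weighted availability example from the text: 90% of the database reachable
    # by 90% of the terminals.
    print(f"database x terminals: {0.9 * 0.9:.0%} available")

    # Duplexed discs: rated MTBF above 10,000 hours, a leisurely 24 hour MTTR.
    disc_mtbf, disc_mttr = 10_000.0, 24.0
    print(f"single disc availability: {availability(disc_mtbf, disc_mttr):.3%}")
    years = pair_mtbf_hours(disc_mtbf, disc_mttr) / HOURS_PER_YEAR
    print(f"pair MTBF (independence model): about {years:,.0f} years")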

15 An Analysis of Failures of a Fault-Tolerant System

There have been many studies of why computer systems fail. To my knowledge, none have focused on a commercial fault-tolerant system. The statistics for fault-tolerant systems are quite a bit different from those for conventional mainframes [Mourad]. Briefly, the reported MTBF of hardware, software and operations is more than 500 times higher than for conventional computing systems -- fault-tolerance works. On the other hand, the ratios among the sources of failure are about the same as those for conventional systems. Administration and software dominate; hardware and environment are minor contributors to total system outages.

Tandem Computers Inc. makes a line of fault-tolerant systems [Bartlett] [Borr 81, 84]. I analyzed the causes of system failures reported to Tandem over a seven month period. The sample set covered more than 2000 systems and represents over 10,000,000 system hours, or over 1300 system years. Based on interviews with a sample of these customers, I believe these reports cover about 50% of all total system failures. There is under-reporting of failures caused by customers or by environment. Almost all failures caused by the vendor are reported.

During the measured period, 166 failures were reported, including one fire and one flood. Overall, this gives a reported system MTBF of 1.8 years, and 3.8 years MTBF if the systematic under-reporting is taken into consideration. This is still well above the 1 week MTBF typical

16 of conventional designs.

By interviewing four customers who keep careful books on system outages, I got a more accurate picture of their operation. They averaged a 4 year MTBF (consistent with 7.8 years with 50% reporting). In addition, their failure statistics had under-reporting in the expected areas of environment and operations. Rather than skew the data by multiplying all MTBF numbers by .5, I will present the analysis as though the reports were accurate.

About one third of the failures were "infant mortality" failures -- a product having a recurring problem. All of these fault clusters are related to a new software or hardware product still having the bugs shaken out. If one subtracts out the failures of systems having "infant" products or non-duplexed-disc failures, then the remaining failures, 107 in all, make an interesting analysis (see table 1). First, the system MTBF rises from 7.8 years to over 11 years.

System administration, which includes operator actions, system configuration, and system maintenance, was the main source of failures -- 42%. Software and hardware maintenance was the largest category. High availability systems allow users to add software and hardware and to do preventative maintenance while the system is operating. By and large, online maintenance works VERY well. It extends system availability by two orders of magnitude. But occasionally, once every 52 years by my figures, something goes wrong. This number is somewhat

17 speculative -- if a system failed while it was undergoing online maintenance or while hardware or software was being added, I ascribed the failure to maintenance. Sometimes it was clear that the maintenance person typed the wrong command or unplugged the wrong module, thereby introducing a double failure. Usually, the evidence was circumstantial. The notion that mere humans make a single critical mistake every few decades amazed me -- clearly these people are very careful and the design tolerates some human faults.

18

    Failure Mode                Probability    MTBF in years
    Administration                  42%              31
      Maintenance                   25%
      Operations (?)                 9%
      Configuration                  8%
    Software                        25%              50
      Vendor                        21%
      Application (?)                4%
    Hardware                        18%              73
      Central                        1%
      Disc                           7%
      Tape                           2%
      Comm Controllers               6%
      Power supply                   2%
    Environment                     14%              87
      Power (?)                      9%
      Communications                 3%
      Facilities                     2%
    Unknown                          3%
    Total                          103%              11

Table 1. Contributors to Tandem System outages reported to the vendor. As explained in the text, infant failures (30%) are subtracted from this sample set. Items marked by "?" are probably under-reported because the customer does not generally complain to the vendor about them. Power outages below 4 hours are tolerated by the NonStop system and hence are under-reported. We estimate 50% total under-reporting.

Human operators were a second source of system failures. I suspect under-reporting of these failures: if a system fails because of the operator, he is less likely to tell us about it. Even so, operators reported several failures. System configuration -- getting the right collection of software, microcode, and hardware -- is a third major headache for reliable system administration.

19 Software faults were a major source of system outages -- 25% in all. Tandem supplies about 4 million lines of code to the customer. Despite careful efforts, bugs are present in this software. In addition, customers write quite a bit of software. Application software faults are probably under-reported here; I guess that only 30% are reported. If that is true, application programs contribute 12% of the outages and software rises to 30% of the total.

Next come environmental failures. Total communications failures (losing all lines to the local exchange) happened three times; in addition there was a fire and a flood. No outages caused by air conditioning or cooling were reported. Power outages are a major source of failures among customers who do not have emergency backup power (North American urban power typically has a 2 month MTBF). Tandem systems tolerate over 4 hours of lost power without losing any data or communications state (the MTTR is almost zero), so customers generally do not report minor power outages (less than 1 hour) to us.

Given that power outages are under-reported, the smallest contributor to system outages was hardware, mostly discs and communications controllers. The measured set included over 20,000 discs -- over 100,000,000 disc hours. We saw 19 duplexed disc failures, but if one subtracts out the infant mortality failures then there were only 7 duplexed disc failures. In either case, one gets an MTBF in excess of 5 million hours for the duplexed pair and their controllers. This approximates the 1000 year MTBF calculated in the earlier section.
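A quick back-of-the-envelope restatement of this disc arithmetic, as a Python sketch using the counts quoted above (illustrative only -- the observed disc hours are divided directly by the reported double failures):

    # Observed disc hours divided by reported double (pair) failures,
    # with and without the infant mortality failures.
    HOURS_PER_YEAR = 8766.0
    disc_hours = 100_000_000

    for label, double_failures in [("all 19 failures", 19), ("excluding infant mortality", 7)]:
        mtbf_hours = disc_hours / double_failures
        print(f"{label}: MTBF ~ {mtbf_hours / 1e6:.1f} million hours "
              f"(~{mtbf_hours / HOURS_PER_YEAR:,.0f} years)")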

20 Implications of the Analysis of MTBF

The implications of these statistics are clear: the key to high availability is tolerating operations and software faults.

Commercial fault-tolerant systems are measured to have a 73 year hardware MTBF (table 1). I believe there was 75% reporting of outages caused by hardware. Calculating from device MTBF, there were about 50,000 hardware faults in the sample set. Less than one in a thousand resulted in a double failure or an interruption of service. Hardware fault-tolerance works!

In the future, hardware will be even more reliable due to better design, increased levels of integration, and reduced numbers of connectors.

By contrast, the trend for software and system administration is not positive. Systems are getting more complex. In this study, administrators reported 41 critical mistakes in over 1300 years of operation. This gives an operations MTBF of 31 years! Operators certainly made many more mistakes, but most were not fatal. These administrators are clearly very careful and use good practices.

The top priority for improving system availability is to reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction. Interfaces that ask the operator for information or ask him to perform some function must be

21 simple, consistent and operator fault-tolerant.

The same discussion applies to system maintenance. Maintenance interfaces must be simplified. Installation of new equipment must have fault-tolerant procedures, and the maintenance interfaces must be simplified or eliminated. To give a concrete example, Tandem's newest discs have no special customer engineering training (installation is "obvious") and they have no scheduled maintenance.

A secondary implication of the statistics is actually a contradiction:

* New and changing systems have higher failure rates. Infant products contributed one third of all outages. Maintenance caused one third of the remaining outages. A way to improve availability is to install proven hardware and software, and then leave it alone. As the adage says, "If it's not broken, don't fix it".

* On the other hand, a Tandem study found that a high percentage of the outages were caused by "known" hardware or software bugs which had fixes available, but the fixes were not yet installed in the failing system. This suggests that one should install software and hardware fixes as soon as possible.

There is a contradiction here: never change it, and change it ASAP! By consensus, the risk of change is too great. Most installations are slow to install changes; they rely on fault-tolerance to protect them until the next major release. After all, it worked yesterday, so it

22 will probably work tomorrow.

One must separate software and hardware maintenance. Here software fixes outnumber hardware fixes by several orders of magnitude. I believe this causes the difference in strategy between hardware and software maintenance. One cannot forego hardware preventative maintenance -- our studies show that it may be good in the short term but is disastrous in the long term. One must install hardware fixes in a timely fashion. If possible, the preventative maintenance should be scheduled to minimize the impact of a possible mistake.

Software appears to be different. The same study recommends installing a software fix only if the bug is causing outages. Otherwise, the study recommends waiting for a major software release, and carefully testing it in the target environment prior to installation. Adams comes to similar conclusions [Adams]; he points out that for most bugs, the chance of "rediscovery" is very slim indeed.

The statistics also suggest that if availability is a major goal, then avoid products which are immature and still suffering infant mortality. It is fine to be on the leading edge of technology, but avoid the bleeding edge of technology.

The last implication of the statistics is that software fault-tolerance is important. Software fault-tolerance is the topic of the rest of the paper.

23 Fault-tolerant Execution

Based on the analysis above, software accounts for over 25% of system outages. This is quite good -- a MTBF of 50 years! The volume of Tandem's software is about 4 million lines, and it is growing at about 20% per year. Work continues on improving coding practices and code testing, but there is little hope of getting ALL the bugs out of all the software. Conservatively, I guess that one bug per thousand lines of code remains after a program goes through design reviews, quality assurance, and beta testing. That suggests the system has several thousand bugs. But somehow, these bugs cause very few system failures because the system tolerates software faults.

The keys to this software fault-tolerance are:

* Software modularity through processes and messages.

* Fault containment through fail-fast software modules.

* Process-pairs to tolerate hardware and transient software faults.

* Transaction mechanism to provide data and message integrity.

* Transaction mechanism combined with process-pairs to ease exception handling and tolerate software faults.

This section expands on each of these points.
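The residual-bug estimate above is simple arithmetic; a short Python restatement using the figures quoted in the text:

    # ~4 million lines of code, roughly one residual bug per thousand lines
    # after design reviews, quality assurance, and beta testing.
    lines_of_code = 4_000_000
    residual_bugs = lines_of_code * (1 / 1000)
    print(f"expected residual bugs: ~{int(residual_bugs):,}")   # ~4,000: "several thousand"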

24 Software modularity through processes and messages

As with hardware, the key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module.

There is considerable controversy about how to modularize software. Starting with Burroughs' Esbol and continuing through languages like Mesa and Ada, compiler writers have assumed perfect hardware and contended that they can provide good fault isolation through static compile-time type checking. In contrast, operating systems designers have advocated run-time checking combined with the process as the unit of protection and failure.

Although compiler checking and the exception handling provided by programming languages are real assets, history seems to have favored the run-time checks plus the process approach to fault containment. It has the virtue of simplicity -- if a process or its processor misbehaves, stop it. The process provides a clean unit of modularity, service, fault containment, and failure.

Fault containment through fail-fast software modules

The process approach to fault isolation advocates that the process software module be fail-fast: it should either function correctly or it should detect the fault, signal failure and stop operating.
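A minimal Python sketch of a fail-fast module in this sense -- check everything, signal failure, stop. The names (FailFast, check, process_request) are invented for illustration and are not Tandem interfaces:

    import sys

    class FailFast(Exception):
        """Raised when the module detects an internal fault."""

    def check(condition, what):
        if not condition:
            raise FailFast(what)

    def process_request(amounts):
        # Defensive programming: validate inputs ...
        check(isinstance(amounts, list) and len(amounts) > 0, "empty or malformed request")
        check(all(a >= 0 for a in amounts), "negative amount in request")
        total = sum(amounts)
        # ... and validate intermediate results / outputs before replying.
        check(total < 1e12, "total exceeds business limit")
        return total

    if __name__ == "__main__":
        try:
            print(process_request([10.0, 2.5, 7.25]))
        except FailFast as fault:
            # Signal failure and stop: small fault-detection latency, and no state
            # shared with other processes, so the damage stays inside this module.
            print(f"FAIL-FAST: {fault}", file=sys.stderr)
            sys.exit(1)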

25 Processes are made fail-fast by defensive programming. They check all their inputs, intermediate results, outputs and data structures as a matter of course. If any error is detected, they signal a failure and stop. In the terminology of [Cristian], fail-fast software has small fault detection latency.

The process achieves fault containment by sharing no state with other processes; rather, its only contact with other processes is via messages carried by the kernel message system.

Software faults are soft -- the Bohrbug/Heisenbug hypothesis

Before developing the next step in fault-tolerance, process-pairs, we need to have a software failure model. It is well known that most hardware faults are soft -- that is, most hardware faults are transient. Memory error correction and checksums plus retransmission for communication are standard ways of dealing with transient hardware faults. These techniques are variously estimated to boost hardware MTBF by a factor of 5 to 100.

I conjecture that there is a similar phenomenon in software -- most production software faults are soft. If the program state is reinitialized and the failed operation retried, the operation will usually not fail the second time.

If you consider an industrial software system which has gone through structured design, design reviews, quality assurance, alpha test, beta

26 test, and months or years of production, then most of the "hard" bugs, ones that always fail on retry, are gone. The residual bugs are rare cases, typically related to strange hardware conditions (rare or transient device fault), limit conditions (out of storage, counter overflow, lost interrupt, etc.), or race conditions (forgetting to request a semaphore).

In these cases, resetting the program to a quiescent state and reexecuting it will quite likely work, because now the environment is slightly different. After all, it worked a minute ago!

The assertion that most production software bugs are soft -- Heisenbugs that go away when you look at them -- is well known to systems programmers. Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring. But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear. This is analogous to the Heisenberg Uncertainty Principle in Physics.

I have tried to quantify the chances of tolerating a Heisenbug by reexecution. This is difficult. A poll yields nothing quantitative. The one experiment I did went as follows: The spooler error log of several dozen systems was examined. The spooler is constructed as a collection of fail-fast processes. When one of the processes detects a fault, it stops and lets its brother continue the operation. The brother does a software retry. If the brother also fails, then the bug is a Bohrbug rather than a Heisenbug. In the measured period, one

27 out of 132 software faults was a Bohrbug; the remainder were Heisenbugs.

A related study is reported in [Mourad]. In MVS/XA, functional recovery routines try to recover from software and hardware faults. If a software fault is recoverable, it is a Heisenbug. In that study, about 90% of the faults in system software had functional recovery routines (FRRs). Those routines had a 76% success rate in continuing system execution. That is, MVS FRRs extend the system software MTBF by a factor of 4.

It would be nice to quantify this phenomenon further. As it is, systems designers know from experience that they can exploit the Heisenbug hypothesis to improve software fault-tolerance.
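A minimal Python sketch of exploiting the Heisenbug hypothesis -- reset to a quiescent state and reexecute, and treat a fault that also fails on retry as a Bohrbug. The names and the simulated transient fault are invented for illustration:

    import random

    class Bohrbug(Exception):
        """The failure reproduced on every retry -- a 'hard', deterministic bug."""

    def run_with_retry(operation, make_fresh_state, retries=3):
        """Run operation(state); on failure, reinitialize the state and reexecute."""
        for attempt in range(retries + 1):
            state = make_fresh_state()            # reset to a quiescent state
            try:
                return operation(state)
            except Exception as fault:
                if attempt == retries:
                    raise Bohrbug(str(fault)) from fault
                # Otherwise assume a Heisenbug: the environment will be slightly
                # different on the retry, so the operation will usually succeed.

    def flaky_operation(state):
        if random.random() < 0.3:                 # simulated transient race / limit condition
            raise RuntimeError("transient device fault")
        return sum(state)

    try:
        print(run_with_retry(flaky_operation, make_fresh_state=lambda: [1, 2, 3]))
    except Bohrbug as bug:
        print("hard failure, not masked by retry:", bug)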

28 Process-pairs for fault-tolerant execution

One might think that fail-fast modules would produce a reliable but unavailable system -- modules are stopping all the time. But, as with fault-tolerant hardware, configuring extra software modules gives a MTTR of milliseconds in case a process fails due to a hardware failure or a software Heisenbug. If modules have a MTBF of a year, then dual processes give a very acceptable MTBF for the pair. Process triples do not improve the MTBF because other parts of the system (e.g., operators) have orders of magnitude worse MTBF. So, in practice fault-tolerant processes are generically called process-pairs. There are several approaches to designing process-pairs:

Lockstep: In this design, the primary and backup processes synchronously execute the same instruction stream on independent processors [Kim]. If one of the processors fails, the other simply continues the computation. This approach gives good tolerance of hardware failures but gives no tolerance of Heisenbugs. Both streams will execute any programming bug in lockstep and will fail in exactly the same way.

State Checkpointing: In this scheme, communication sessions are used to connect a requestor to a process-pair. The primary process of the pair does the computation and sends state changes and reply messages to its backup prior to each major event. If the primary process stops, the session switches to the backup process, which continues the conversation with the requestor. Session

29 sequence numbers are used to detect duplicate and lost messages, and to resend the reply if a duplicate request arrives [Bartlett]. Experience shows that checkpointing process-pairs give excellent fault-tolerance (see table 1), but that programming checkpoints is difficult. The trend is away from this approach and towards the Delta or Persistent approaches described below.

Automatic Checkpointing: This scheme is much like state checkpointing except that the kernel automatically manages the checkpointing, relieving the programmer of this chore. As described in [Borg], all messages to and from a process are saved by the message kernel for the backup process. At takeover, these messages are replayed to the backup to roll it forward to the primary process' state. When substantial computation or storage is required in the backup, the primary state can be copied to the backup so that the message log and replay can be discarded. This scheme seems to send more data than the state checkpointing scheme and hence seems to have high execution cost.

Delta Checkpointing: This is an evolution of state checkpointing. Logical rather than physical updates are sent to the backup [Borr 84]. Adoption of this scheme by Tandem cut message traffic in half and message bytes by a factor of 3 overall [Enright]. Deltas have the virtue of performance as well as making the coupling between the primary and backup state logical rather than physical. This means that a bug in the primary process is less

30 likely to corrupt the backup's state.

Persistence: In persistent process-pairs, if the primary process fails, the backup wakes up in the null state with amnesia about what was happening at the time of the primary failure. Only the opening and closing of sessions is checkpointed to the backup. These are called stable processes by [Lampson]. Persistent processes are the simplest to program and have low overhead. The only problem with persistent processes is that they do not hide failures! If the primary process fails, the database or devices it manages are left in a mess and the requestor notices that the backup process has amnesia. We need a simple way to resynchronize these processes so as to have a common state. As explained below, transactions provide such a resynchronization mechanism.

Summarizing the pros and cons of these approaches:

* Lockstep processes don't tolerate Heisenbugs.

* State checkpoints give fault-tolerance but are hard to program.

* Automatic checkpoints seem to be inefficient -- they send a lot of data to the backup.

* Delta checkpoints have good performance but are hard to program.

31 * Persistent processes lose state in case of failure.

We argue next that transactions combined with persistent processes are simple to program and give excellent fault-tolerance.
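The state checkpointing style can be sketched as a toy, single-address-space Python example. Real process-pairs run on independent processors and exchange kernel messages; here the checkpoint is simply a copied dictionary, and the names (Backup, handle, checkpoint, takeover) are invented:

    class Backup:
        def __init__(self):
            self.state = {"balance": 0}
            self.last_reply = {}                 # last reply per session, for resends

        def checkpoint(self, state, session, seqno, reply):
            # State changes and the reply reach the backup *before* the primary
            # answers the requestor (the "major event").
            self.state = dict(state)
            self.last_reply[session] = (seqno, reply)

        def takeover(self, session, seqno, request):
            saved = self.last_reply.get(session)
            if saved and saved[0] == seqno:      # duplicate of an already-done request
                return saved[1]                  # resend the checkpointed reply
            return handle(self.state, session, seqno, request, backup=self)

    def handle(state, session, seqno, request, backup):
        state["balance"] += request              # the computation
        reply = state["balance"]
        backup.checkpoint(state, session, seqno, reply)
        return reply

    backup = Backup()
    primary_state = {"balance": 0}
    print(handle(primary_state, "s1", seqno=1, request=100, backup=backup))  # 100
    # The primary fails; the session switches to the backup and the requestor,
    # unsure whether its last request was processed, resends it with the same seqno.
    print(backup.takeover("s1", seqno=1, request=100))                       # 100, not 200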

32 Transactions for data integrity

A transaction is a group of operations, be they database updates, messages, or external actions of the computer, which form a consistent transformation of the state.

Transactions should have the ACID property [Haeder]:

Atomicity: Either all or none of the actions of the transaction should "happen". Either it commits or aborts.

Consistency: Each transaction should see a correct picture of the state, even if concurrent transactions are updating the state.

Integrity: The transaction should be a correct state transformation.

Durability: Once a transaction commits, all its effects must be preserved, even if there is a failure.

The programmer's interface to transactions is quite simple: he starts a transaction by asserting the BeginTransaction verb, and ends it by asserting the EndTransaction or AbortTransaction verb. The system does the rest.

The classical implementation of transactions uses locks to guarantee consistency and a log or audit trail to insure atomicity and durability. Borr shows how this concept generalizes to a distributed fault-tolerant system [Borr 81, 84].

Transactions relieve the application programmer of handling many error conditions. If things get too complicated, the programmer (or the

33 system) calls AbortTransaction, which cleans up the state by resetting everything back to the beginning of the transaction.

Transactions for simple fault-tolerant execution

Transactions provide reliable execution and data availability (recall that reliability means not doing the wrong thing, and availability means doing the right thing and on time). Transactions do not directly provide high system availability. If hardware fails or if there is a software fault, most transaction processing systems stop and go through a system restart -- the 90 minute outage described in the introduction. It is possible to combine process-pairs and transactions to get fault-tolerant execution and hence avoid most such outages.

As argued above, process-pairs tolerate hardware faults and software Heisenbugs. But most kinds of process-pairs are difficult to implement. The "easy" process-pairs, persistent process-pairs, have amnesia when the primary fails and the backup takes over. Persistent process-pairs leave the network and the database in an unknown state when the backup takes over.

The key observation is that the transaction mechanism knows how to UNDO all changes of incomplete transactions. So we can simply abort all uncommitted transactions associated with a failed persistent process and then restart these transactions from their input messages. This cleans up the database and system states, resetting them to the

34 point at which the transaction began.

So, persistent process-pairs plus transactions give a simple execution model which continues execution even if there are hardware faults or Heisenbugs. This is the key to the Encompass data management system's fault-tolerance [Borr 81]. The programmer writes fail-fast modules in conventional languages (Cobol, Pascal, Fortran), and the transaction mechanism plus persistent process-pairs makes his program robust.

Unfortunately, the people implementing the operating system kernel, the transaction mechanism itself and some device drivers still have to write "conventional" process-pairs, but application programmers do not. One reason Tandem has integrated the transaction mechanism with the operating system is to make the transaction mechanism available to as much software as possible [Borr 81].
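The pattern of this section can be sketched schematically in Python. This is not Tandem's interface: BeginTransaction, EndTransaction and AbortTransaction are modelled by toy methods on an in-memory database with an undo log, and the crash is a raised exception. The shape is the point -- at takeover, the persistent process simply aborts the incomplete transaction and reprocesses the request from its input message:

    class Database:
        def __init__(self):
            self.data, self.undo_log = {"balance": 100}, []

        def begin_transaction(self):
            self.undo_log = [dict(self.data)]          # remember pre-transaction state

        def update(self, key, value):
            self.data[key] = value                     # uncommitted change

        def abort_transaction(self):
            self.data = self.undo_log.pop()            # UNDO the incomplete transaction

        def end_transaction(self):
            self.undo_log.clear()                      # commit: the changes stand

    def serve(db, request, crash=False):
        db.begin_transaction()
        db.update("balance", db.data["balance"] + request)
        if crash:                                      # a Heisenbug strikes mid-transaction
            raise RuntimeError("process failed")
        db.end_transaction()
        return db.data["balance"]

    db = Database()
    try:
        serve(db, 50, crash=True)
    except RuntimeError:
        # The persistent backup takes over with amnesia: abort the incomplete
        # transaction, then reprocess the request from its input message.
        db.abort_transaction()
        print(serve(db, 50))                           # 150 -- the effect happens once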

35 Fault-tolerant Communication

Communications lines are the most unreliable part of a distributed computer system, partly because they are so numerous and partly because they have poor MTBF. The operations aspects of managing them, diagnosing failures and tracking the repair process are a real headache [Gray].

At the hardware level, fault-tolerant communication is obtained by having multiple data paths with independent failure modes. At the software level, the concept of session is introduced. A session has simple semantics: a sequence of messages is sent via the session. If the communication path fails, an alternate path is tried. If all paths are lost, the session endpoints are told of the failure. Timeout and message sequence numbers are used to detect lost or duplicate messages. All this is transparent above the session layer.

Sessions are the thing that makes process-pairs work: the session switches to the backup of the process-pair when the primary process fails [Bartlett]. Session sequence numbers (called SyncIDs by Bartlett) resynchronize the communication state between the sender and receiver and make requests/replies idempotent.

Transactions interact with sessions as follows: if a transaction aborts, the session sequence number is logically reset to the sequence number at the beginning of the transaction and all intervening

36 messages are canceled. If a transaction commits, the messages on the session will be reliably delivered EXACTLY once [Spector].
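A small Python sketch of this session discipline; the names are invented, and a real implementation would also handle timeouts, path switching and the transaction-abort reset described above:

    class SessionServer:
        def __init__(self):
            self.expected_seqno = 1
            self.last_reply = None
            self.total = 0

        def receive(self, seqno, amount):
            if seqno == self.expected_seqno - 1:       # duplicate: a path or process
                return self.last_reply                 # failed after the reply was lost
            if seqno != self.expected_seqno:
                raise ValueError("lost or out-of-order message")
            self.total += amount                       # do the work exactly once
            self.last_reply = self.total
            self.expected_seqno += 1
            return self.last_reply

    server = SessionServer()
    print(server.receive(1, 10))   # 10
    print(server.receive(2, 5))    # 15
    print(server.receive(2, 5))    # 15 again -- the resent request is idempotent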

37 Fault-tolerant Storage

The basic form of fault-tolerant storage is replication of a file on two media with independent failure characteristics -- for example two different disc spindles or, better yet, a disc and a tape. If one disc file has an MTBF of a year, then two files will have an MTBF of millennia, and three copies will have about the same MTBF -- as the Tandem failure statistics show, other factors will dominate at that point.

Remote replication is an exception to this argument. If one can afford it, storing a replica in a remote location gives good improvements to availability. Remote replicas will have different administrators, different hardware, and a different environment. Only the software will be the same. Based on the analysis in Table 1, this will protect against 75% of the failures (all the non-software failures). Since remote replication also gives excellent protection against Heisenbugs, it guards against most software faults.

There are many ways to replicate data: one can have exact replicas, have the updates done to the replica as soon as possible, or even have periodic updates. [Gray] describes representative systems which took different approaches to long-haul replication.

Transactions provide the ACID properties for storage -- Atomicity, Consistency, Integrity and Durability [Haeder]. The transaction journal plus an archive copy of the data provide a replica of the data on media with independent failure modes. If the primary copy fails, a

38 new copy can be reconstructed from the archive copy by applying all committed updates since the archive copy was made. This is the Durability of data.

In addition, transactions coordinate a set of updates to the data, assuring that all or none of them apply. This allows one to correctly update complex data structures without concern for failures. The transaction mechanism will undo the changes if something goes wrong. This is Atomicity.

A third technique for fault-tolerant storage is partitioning the data among discs or nodes and hence limiting the scope of a failure. If the data is geographically partitioned, local users can access local data even if the communication net or remote nodes are down. Again, [Gray] gives examples of systems which partition data for better availability.
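A minimal Python sketch of this durability mechanism -- rebuild a lost primary copy by rolling the journal of committed updates forward over the archive copy (toy structures, invented names):

    # The archive copy and the journal live on media with failure modes
    # independent of the primary copy.
    archive = {"balance": 100}                        # old archive copy (e.g. on tape)
    journal = [("balance", 150), ("balance", 175)]    # committed updates since the archive

    def rebuild(archive_copy, committed_updates):
        copy = dict(archive_copy)
        for key, value in committed_updates:          # roll forward all committed updates
            copy[key] = value
        return copy

    primary = None                                    # the primary copy has been lost
    primary = rebuild(archive, journal)
    print(primary)                                    # {'balance': 175} -- the data survives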

39 Summary

Computer systems fail for a variety of reasons. Large conventional computer systems fail once every few weeks due to software, operations mistakes, or hardware. Large fault-tolerant systems are measured to have an MTBF orders of magnitude higher -- years rather than weeks.

The techniques for fault-tolerant hardware are well documented. They are quite successful. Even in a high availability system, hardware is a minor contributor to system outages.

By applying the concepts of fault-tolerant hardware to software construction, software MTBF can be raised by several orders of magnitude. These concepts include: modularity, defensive programming, process-pairs, and tolerating soft faults -- Heisenbugs.

Transactions plus persistent process-pairs give fault-tolerant execution. Transactions plus resumable communications sessions give fault-tolerant communications. Transactions plus data replication give fault-tolerant storage. In addition, transaction atomicity coordinates the changes of the database, the communications net, and the executing processes. This allows easy construction of high availability software.

40 Dealing with system configuration, operations, and maintenance remains an unsolved problem. Administration and maintenance people are doing a much better job than we have reason to expect. We can't hope for better people. The only hope is to simplify and reduce human intervention in these aspects of the system.

41 Acknowledgments

The following people helped in the analysis of the Tandem system failure statistics: Robert Bradley, Jim Enright, Cathy Fitzgerald, Sheryl Hamlin, Pat Helland, Dean Judd, Steve Logsdon, Franco Putzolu, Carl Niehaus, Harald Sammer, and Duane Wolfe. In presenting the analysis, I had to make several outrageous assumptions and "integrate" contradictory stories from different observers of the same events. For that, I must take full responsibility. Robert Bradley, Gary Gilbert, Bob Horst, Dave Kinkade, Carl Niehaus, Carol Minor, Franco Putzolu, and Bob White made several comments that clarified the presentation. Special thanks are due to Joel Bartlett and especially to Flaviu Cristian, who tried very hard to make me be more accurate and precise.

42 References

[Adams] Adams, E., "Optimizing Preventative Service of Software Products", IBM J. Res. and Dev., Vol. 28, No. 1, Jan. 1984.

[Bartlett] Bartlett, J., "A NonStop Kernel," Proceedings of the Eighth Symposium on Operating System Principles, pp. 22-29, Dec. 1981.

[Borg] Borg, A., Baumbach, J., Glazer, S., "A Message System Supporting Fault-tolerance", ACM OS Review, Vol. 17, No. 5, 1984.

[Borr 81] Borr, A., "Transaction Monitoring in ENCOMPASS," Proc. 7th VLDB, September 1981. Also Tandem Computers TR 81.2.

[Borr 84] Borr, A., "Robustness to Crash in a Distributed Database: A Non Shared-Memory Multi-processor Approach," Proc. 9th VLDB, Sept. 1984. Also Tandem Computers TR 84.2.

[Burman] Burman, M., "Aspects of a High Volume Production Online Banking System", Proc. Int. Workshop on High Performance Transaction Systems, Asilomar, Sept. 1985.

[Cristian] Cristian, F., "Exception Handling and Software Fault Tolerance", IEEE Trans. on Computers, Vol. C-31, No. 6, 1982.

[Enright] Enright, J., "DP2 Performance Analysis", Tandem memo, 1985.

[Gray] Gray, J., Anderton, M., "Distributed Database Systems -- Four Case Studies", to appear in IEEE TODS; also Tandem TR 85.5.

[Haeder] Haeder, T., Reuter, A., "Principles of Transaction-Oriented Database Recovery", ACM Computing Surveys, Vol. 15, No. 4, Dec. 1983.

[Kim] Kim, W., "Highly Available Systems for Database Applications", ACM Computing Surveys, Vol. 16, No. 1, March 1984.

[Lampson] Lampson, B.W., ed., Lecture Notes in Computer Science Vol. 106, Chapter 11, Springer Verlag, 1982.

[Mourad] Mourad, S., Andrews, D., "The Reliability of the IBM/XA Operating System", Digest of the 15th Annual Int. Sym. on Fault-Tolerant Computing, IEEE Computer Society Press, June 1985.

[Schlichting] Schlichting, R.D., Schneider, F.B., "Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems", ACM TOCS, Vol. 1, No. 3, Aug. 1983.

[Spector] Spector, A., "Multiprocessing Architectures for Local Computer Networks", PhD Thesis, STAN-CS-81-874, Stanford, 1981.

43 [von Neumann] von Neumann, J., "Probabilistic Logics and the Synthesis of Reliable Organisms From Unreliable Components", Automata Studies, Princeton University Press, 1956.

44

45 Distributed by TANDEM COMPUTERS, Corporate Information Center, 19333 Vallco Parkway MS3-07, Cupertino, CA 95014-2599

46
