Summary of the Ph.D. thesis
István Majzik
electrical engineer
Advisor:
András Pataricza
Ph.D., Associate Professor
Technical University of Budapest
Department of Measurement and Instrument Engineering
Fault tolerance is a key design factor in computer systems running critical applications. The system is fault tolerant, if it can provide correct services according to its specification despite of a limited number of run-time faults until a scheduled maintenance can take place. Fault tolerance techniques can be categorized as fault masking and fault handling. Fault masking means providing the system services in spite of faults using redundant resources. Fault handling is often preferred due its cost effectiveness in comparison with fault masking. Fault handling is based on error detection, followed by error containment, fault treatment (isolation, reconfiguration) and error recovery. The basis of the fault tolerant operation is the efficient, early error detection.
Fault tolerance requires hardware, software, information or time redundancy. The most evident solution is to utilize the natural redundancy of the system to provide the required fault tolerant operation. The advantage of this method is that the reliability is improved by utilizing the already given hardware and software components of the system. However, the method is based on the a priori information on the structure and tasks of the system, and it needs specific analysis and development (redesign) phases for each possible task of the system. The general purpose schemes have the advantage that the redundancy required by the fault tolerant operation is built into the system in the design phase in such a way, that later it can be utilized as part of an automatic technology, without the intervention of the user.
The integration of massive redundancy (like replicated components) into the system is necessary and admissible in mission or life critical applications (aircraft, space, power plants etc.). However, in the majority of applications, which require only common computing resources, neither the application of costly massive redundancy techniques nor the individual design process is reasonable. In this category the application of such strongly limited redundancy techniques are required, that are general purpose, fit to common, commercial system components and development strategies and at the same time improve the reliability of the system significantly.
Concurrent run-time monitors are widely used techniques of the hardware redundancy based error detection in such systems. They monitor the state and operation of the supervised system, comparing the run-time information with the reference one corresponding to the fault-free operation of the system. In case of a difference, an error is signaled. The run-time information is often represented in a compact form, by signatures (symbolic labels).
Fault simulation, fault injection experiments and analysis of error logs of operational systems show that the majority of errors in microprocessor systems is caused by transient faults, especially disturbances in the control flow of the main processor. Thus run-time monitors intended to check the control flow of the microprocessor are especially important. Moreover, also the software based checking techniques, like reasonableness checks, require the assurance that the checking routines are executed and not omitted as a consequence of the erroneous behavior. One type of that monitors are watchdog processors (WP) which concurrently check the control flow, i.e. the order of the instructions executed by the processor, using signatures, compact abstractions of the system state.
Watchdog processors can be divided into two classes: working with assigned signatures or working with derived signatures. In the first case the run-time signatures are transferred to the WP by the main processor itself, in the second case the WP generates them using the observable state parameters of the processor (usually the signals of the system bus). Disturbances in the control flow are detected by the analysis of the run-time sequence of signatures.
If the (simple) architecture of the microprocessor enables to derive the control flow by observing the signals of the system bus, the derived signatures methods proved to be an efficient error detection mechanism. First implementations used a stored signature database, they were able to check only the branch-free blocks of the control flow. Further improvements overcame this restriction using embedded justifying signatures, which required the modification of the program of the checked processor. Only the most recent methods enabled the checking of the entire control flow without the collaboration of the checked processor, by the execution of a complex watchdog program in the WP.
Up-to-date microprocessors with built-in instruction prefetch queues and internal instruction caches can not be checked by assigned signatures methods any more, since the instruction stream executed by the processor is only incompletely observable on the system bus. (There are successful attempts to follow the operation of the instruction prefetch queue, but the problem of instruction caches can be solved only by integrating the watchdog processors onto the same chip as the main processor.)
The above mentioned problems can be solved by inserting the signature transfer instructions into the sequence of instructions executed by the checked processor (assigned signatures methods). This is done automatically by the so called watchdog preprocessors, which usually modify the program source before compilation. Early designs suffered from three essential drawbacks preventing their wide-spread use: (i) the hardware overhead related to the complex structure of the WP, (ii) the large storage required for the reference database or program and (iii) a significant run-time overhead needed for the signature transfer.
Signature assignment was performed in the previous methods without utilizing the boun-ded structure of the program control flow graph (the limitations on the branch instructions) and without providing fast signature evaluation (assigned signatures were defined in the order of the syntactic occurrence in the program source). The utilization of the structure of the control flow graph (CFG) promises both the reduction of the reference database and the simplification of the checker architecture. The basic idea is that in the usual structural programming languages the majority of statements have only a low, limited number of valid successors, consequently the assigned signatures can identify not only the state of the program but also include some information about the possible successor signatures. This redundant structure of signatures exploits the larger word width of the target up-to-date processors (which was previously not utilized by the known methods).
After analyzing the open problems, the aim of the current research was set at the development of a watchdog processor method which enables the reduction of the reference signature database by the assignment of redundant signatures, utilizing the structural properties of the control flow graph.
The reduction of the reference signature database enables the application of the WP in multi-process systems by overcoming the storage requirements of the earlier designs originating in the large size of the reference database of the concurrent processes. The fast and simple checking of signatures raises the idea of using the WP in multiprocessor systems by identifying the processors and processes in the signature. The shared utilization of a single WP by several processors is limited only by the speed of the signature transfer and evaluation.
Due to speed and resource limitations, traditional WPs check the multi-process applications as a set of independent, separate flows of control, completely neglecting the cooperation and synchronization among the processes. In our WP these dependences can also be checked using special signatures. In this higher level of the application software hierarchy, a careful selection of the synchronization primitives to be checked is required.
To extend the checking capabilities of the WP, the research aimed at the examination of the higher level flow of control among processes, the selection and analysis of synchronization primitives and the development of the corresponding checking methods.
The watchdog processor, as a subsystem for concurrent error detection, should be integrated into the error treatment mechanism of the checked system. One of the most widely used techniques to recover from transient errors is to restore a previously stored fault-free state of the system and restart the computation from that state (checkpointing and backward error recovery). This mechanism should be supported by the watchdog processor. The establishment and regular updating of checkpoints, the backward recovery concurrently with the similar operations of the checked system required the development of new methods.
On the basis of the run-time signature sequence received by the WP the trace of statements executed by the processor can be reconstructed, this way debugging of the application processes can be supported. However, the run-time signature sequence is too long to be stored in its full extent. A trade-off between efficiency and cost is to implement a logic analyzer-like circular buffer storing a limited number of signatures in a compressed form. An efficient compression is possible since the majority of signatures originates from repetitive signature subsequences like iteration loops, frequently called subroutines. The basis of compression is on the one hand the redundant structure of the signatures, on the other hand the a priori knowledge of the structure of the control flow graph. Thus, parameters of the compression can be derived in the preprocessing phase. To keep the complexity of the WP as low as possible, the work was focused on the development of a simple real-time compressor which requires very limited hardware resources.
To integrate the watchdog processor into the error treatment mechanism of the checked system, the research aimed at the development of WP methods which support the backward error recovery and the compression and storing of the run-time signature sequence.
Thesis 1 It was shown that the assigned signature based
concurrent checking of the control flow of programs written in
structural programming languages is possible without storing a
high-scale reference database in the WP.
The main drawbacks of the previous assigned signature based methods,
namely the need of a large signature database and the corresponding
complex and slow WP architecture were eliminated. Only a single
signature has to be stored in the WP as a reference to evaluate the
next run-time signature, and the evaluation requires only a
combinational logic. It is the most simple and fast WP architecture
among the ones presented in the literature.
The new method does not increase the number of run-time signatures significantly and exploits the word width of up-to-date processors. The encoding algorithm is simple (linear), fits to high-level programming languages this way it is portable and enables a compiler-based optimization.
The basic ideas of the SEIS were presented in [4], [13], [12] and [2]. The encoding algorithm was introduced in [6] and [10].
Thesis 2 New methods were introduced in order to integrate
the WP into dependable multi-process and multi-processor systems, by
elaborating checks required at process level, the integration of the
WP into the error recovery mechanism of the checked system as well as
support of monitoring and debugging.
The simple architecture of the checker and the fast evaluation of
signatures enables the application of the SEIS WP in multi-process
systems. The coverage of the WP can be increased and the error latency
can be reduced by extending the checks of the WP to higher levels of
the software hierarchy. By the new extensions the WP is able to check
all levels of the software hierarchy of distributed applications.
The WP is integrated into the checked system as a subsystem for concurrent error detection. The actual reference signature, the content of the signature stack is part of the state of a checked process. Accordingly, the WP had to be involved in the methods of error treatment which modify the state of checked processes.
The extensions of the SEIS WP were introduced in [5] and [14]. Synchronization checks were detailed in [8]. [1] is a summary of the checks performed by a SEIS WP in a multiprocessor system. Issues of monitoring and debugging were discussed in [7]. The recoverable WP was introduced in [5], the architectural aspects were detailed in [9].
Papers in conference proceedings
Technical reports
Conferences
Syllabus