3. Introduction to Fault Tolerance

3.1 Basic Concepts

Fault tolerance is the ability of a system to continue performing its tasks correctly after faults occur.

Reliability of a system is a function of time, R(t), defined as the probability that the system performs its tasks correctly throughout the interval [t0, t], given that it was performing correctly at time t0.

Availability is a function of time, A(t), defined as the probability that the system is operating correctly and is available to perform its functions at the instant of time t.

The design of fault-tolerant systems is based on two distinct techniques:
- Fault masking
- Detection, location, and recovery (via reconfiguration) of the system to remove the faulty component

If reconfiguration is chosen, the following techniques are used:
- Before reconfiguration: fault detection techniques and fault location techniques
- After reconfiguration: fault recovery techniques

Fault recovery techniques:
- Backward recovery (rollback recovery)
- Forward recovery

All techniques for designing fault-tolerant systems rely on some type and degree of redundancy.

Redundancy is implemented by using hardware, software, information, or time beyond what is needed for normal system operation. Important: redundancy has a large impact on the system in terms of performance, size, weight, power consumption, and reliability.
3.2 Hardware Redundancy

Three forms: passive, active, and hybrid.

1. Passive

Based on the concept of fault masking, which hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting). Passive approaches do not provide fault detection; they simply mask faults.

[Figure: basic concept of triple modular redundancy (TMR). Module 1, Module 2, and Module 3 feed a voter, which produces the output.]
[Figure: the use of triplicated voters in a TMR configuration, with Proc 1 to Proc 3, three voters, and Mem 1 to Mem 3.]

Voting at several levels within N-modular redundancy (NMR) systems. Two alternatives for a temperature-control example:
- Three independent temperature sensors are sampled and a vote is taken on the three sensor values. Next, three separate modules calculate the amount of heating/cooling, and a second vote on the calculations determines the result.
- Three independent sensors sample the temperature, each channel performs the calculations, and a single vote is taken on the final result.

The difference between the two approaches is fault containment: voting at the sensors masks and contains the effects of a sensor fault.

Hardware voting versus software voting. The choice depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight/volume limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system

[Figure: example of software voting, with Task A running on Proc 1, Proc 2, and Proc 3, together with a voter task and Task B.]
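As a rough illustration of software voting, the sketch below implements a majority voter over the outputs of replicated modules. The function name `majority_vote` and the sample values are assumptions made for this example; they do not come from the slides, and a real voter would also have to deal with timing and with outputs that are not directly comparable.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of modules, or None."""
    if not outputs:
        return None
    # most_common(1) yields the most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


# Three replicated modules; the second one is faulty (hypothetical values).
print(majority_vote([42, 17, 42]))  # 42: the single fault is masked
print(majority_vote([1, 2, 3]))     # None: no majority exists
```

With three modules the voter masks any single faulty output, which is the TMR case; with N modules it masks up to (N - 1) // 2 disagreeing outputs.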
In practical applications of voting, the three results in a TMR system may not agree completely, even in a fault-free environment. For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

Solution: the mid-value select technique. A TMR system selects the value that lies between the other two.

[Figure: mid-value select, showing the uncorrupted signals, the corrupted signal, and the selected signal.]

2. Active (or Dynamic)

Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. Fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system. Active redundancy is better suited to applications where temporary, erroneous results are acceptable, as long as the system reconfigures and regains its operational status in a satisfactory length of time.

Techniques:
- Duplication of functional units
- Standby modules: hot standby sparing and cold standby sparing

[Figure: a software implementation of duplication with comparison. Processor A and Processor B each hold both results in private memory and exchange them through a shared memory; a comparison task on each processor generates the error signals.]

3. Hybrid

Combines the attractive features of both the active and the passive approaches.

3.3 Software Redundancy

Techniques:
- Consistency checks
- Capability checks
- N-self-checking programming
- N-version programming
- Recovery blocks
Consistency checks

Use prior knowledge about the characteristics of a piece of information to verify its correctness. Typically, in most applications it is known that a given quantity should not exceed a predefined value.

Examples:
- A processing system can sample and store many sensor readings in a typical control application; each reading can be checked against the range known to be valid for that sensor.
- The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.
- The address generated by a computer should never lie outside the address range of the available memory.
- In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.

Capability checks

Capability checks are performed to verify that a system possesses the capability expected.

Examples:
- Check whether a computer has the complete memory available.
- Check whether the processors in a multiprocessor system are alive.
- Periodically, a processor can execute specific instructions on specific data and compare the results with known good results stored in a ROM; this checks the ALU and the memory.

N-self-checking programming

[Figure: the N-self-checking programming approach to software fault tolerance. The program inputs feed each program version, each version's result passes through its acceptance tests, and selection logic produces the program outputs.]
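The structure in the figure can be sketched in a few lines. Everything named here is hypothetical and chosen only for illustration: two versions of an integer square root (`isqrt_newton`, and `isqrt_faulty` with a deliberate bug), an acceptance test `accept_isqrt`, and selection logic `n_self_checking` that returns the first result to pass its test. For simplicity the versions run one after another; a real system might run them concurrently.

```python
from typing import Callable, List, Optional, Tuple

Version = Callable[[int], int]
AcceptanceTest = Callable[[int, int], bool]


def isqrt_newton(n: int) -> int:
    """Version 1 (hypothetical): integer square root by Newton iteration."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_faulty(n: int) -> int:
    """Version 2 (hypothetical): contains a deliberate off-by-one bug."""
    return isqrt_newton(n) + 1


def accept_isqrt(n: int, r: int) -> bool:
    """Acceptance test: r must be the floor of the square root of n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


def n_self_checking(
    n: int, components: List[Tuple[Version, AcceptanceTest]]
) -> Optional[int]:
    """Run every version, then select the first result that passes its test."""
    results = [(version(n), test) for version, test in components]
    for result, test in results:
        if test(n, result):
            return result
    return None  # every version failed its acceptance test


components = [(isqrt_faulty, accept_isqrt), (isqrt_newton, accept_isqrt)]
print(n_self_checking(10, components))  # 3: the faulty result 4 is rejected
```

Because each version carries its own acceptance test, a faulty version is detected locally and the selection logic only has to pick among results that already passed.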
Hot standby: all program versions run concurrently.
Reduced recovery latency: the reconfiguration process is very fast.

3.4 Information Redundancy

- Parity, Berger, and m-of-n codes
- Arithmetic codes
- Hamming codes
- Checksum codes
- CRC (cyclic redundancy check) codes

3.5 Time Redundancy

- Transient fault detection
- Permanent fault detection
- Recomputation for error correction

Transient fault detection

The fundamental concept is to perform the same computation two or more times and compare the results to determine whether a discrepancy exists.

Permanent fault detection

[Figure: at time t0, the data goes through the computation and the result is stored. At time t1, the data is encoded, goes through the computation, and the result is decoded and stored. The two stored results are compared, and a mismatch signals an error.]

Example encoding functions might be the complementation operator or an arithmetic shift:

  6 / 4 = 1, remainder 2        check: (1 x 4) + 2 = 6
  7 x 8 = 56                    check: 56 / 8 = 7
  7 x 8 = 56                    check: 8 x 7 = 56
  2 + 9 = 11                    check: 11 - 9 = 2

  0110.1010 AND 0111.1111 = 0110.1010
  0110.1010 shifted right by 2 (circular): 1001.1010
  0111.1111 shifted right by 2 (circular): 1101.1111
  1001.1010 AND 1101.1111 = 1001.1010
  1001.1010 shifted left by 2 (circular): 0110.1010

Recomputation for error correction

The time redundancy approach can also provide error correction if the computations are repeated three or more times. Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
Then, the results generated using the shifted operands are shifted back to their original positions. Because the three operations used operands displaced from each other by at least one bit position, the faulty bit slice affects a different bit in each result. If the bits in each position are then compared, the errors caused by the faulty bit slice can be corrected by performing a majority vote on the three results.
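The following sketch walks through this recomputation scheme for a logical AND. Several details are assumptions made for illustration and are not stated in the slides: 8-bit operands, a datapath two bits wider so that left shifts lose no operand bits, and a fault model in which one output bit of the AND unit is stuck at 1. The names `faulty_and`, `bitwise_majority`, and `and_with_recomputation` are hypothetical.

```python
WIDTH = 8             # operand width in bits (assumption)
DATAPATH = WIDTH + 2  # two spare bits so 1- and 2-bit left shifts lose nothing


def faulty_and(a: int, b: int, stuck_bit: int) -> int:
    """AND unit whose output bit `stuck_bit` is stuck at 1 (hypothetical fault)."""
    mask = (1 << DATAPATH) - 1
    return ((a & b) | (1 << stuck_bit)) & mask


def bitwise_majority(x: int, y: int, z: int) -> int:
    """Per-bit majority vote: a bit is 1 if it is 1 in at least two inputs."""
    return (x & y) | (x & z) | (y & z)


def and_with_recomputation(a: int, b: int, stuck_bit: int) -> int:
    """Compute a AND b three times with 0-, 1-, and 2-bit shifts, then vote."""
    results = []
    for shift in (0, 1, 2):
        raw = faulty_and(a << shift, b << shift, stuck_bit)
        # Shift the result back so all three results are aligned again.
        results.append((raw >> shift) & ((1 << WIDTH) - 1))
    return bitwise_majority(*results)


a, b = 0b01101010, 0b01111111
print(format(faulty_and(a, b, 2), "08b"))              # 01101110: bit 2 is wrong
print(format(and_with_recomputation(a, b, 2), "08b"))  # 01101010: corrected
```

After realignment the stuck bit lands in a different position of each result, so at every bit position at least two of the three results are correct and the vote recovers the right value.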