3. Introduction to Fault Tolerance
3.1 Basic Concepts ...
Fault tolerance is the ability of a
system to continue performing its
tasks correctly after faults
have occurred.
The reliability of a system is a function
of time, R(t), defined as the
probability that the system performs
its tasks correctly throughout the time
interval [t0, t], given that the system was
performing correctly at time t0.
Availability is a function of time,
A(t), defined as the
probability that a system is
operating correctly and is available
to perform its functions at the
instant of time t.
The design of fault-tolerant systems
is based on two distinct techniques:
Fault masking
Detection, location, and recovery (via
reconfiguration) of the system to remove the
faulty component.
If the reconfiguration technique is chosen, the
following are used ...
first ...
Fault detection techniques
Fault location techniques
then ...
Fault recovery techniques
Fault recovery techniques ...
Backward recovery (Rollback Recovery)
Forward recovery (Forward Recovery)
All techniques for designing
fault-tolerant (FT) systems are based on
some type and degree
of redundancy.
Redundancy is implemented through the use of
hardware, software, information, or time beyond
what is needed for the normal operation of the
system.
Important: redundancy has a large impact on the
system in terms of performance, size,
weight, power consumption, and reliability.
3.2 Hardware Redundancy
Passive
Active
Hybrid
1. Passive
- Based on the concept of fault masking: hides the
occurrence of faults and prevents them from resulting
in errors (developed around the concept of majority
voting).
- Does not provide for fault detection; it simply masks
faults.
[Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a single Voter, which produces the Output.]
[Figure: The use of triplicated voters in a TMR configuration. Proc 1, Proc 2, and Proc 3 each pass through a Voter to Mem 1, Mem 2, and Mem 3.]
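The majority voting behind TMR can be sketched in a few lines. This is a minimal illustration, not a hardware design: the function name `majority_vote` and the sample values are assumptions introduced here.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value

# Module 2 is assumed faulty; the voter masks its output.
print(majority_vote([42, 17, 42]))  # 42
```

With three modules a single faulty output is masked, but the voter itself remains a single point of failure, which is what the triplicated-voter configuration addresses.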
Voting at Several Levels within
N-Modular Redundancy (NMR) Systems
Approach 1: 3 independent temperature sensors are sampled and a vote
is taken on the 3 sensor values. Next, 3 separate modules calculate
the amount of heating/cooling, and a vote on the
calculations determines the result.
versus
Approach 2: 3 independent sensors sample the temperature, the
calculations are performed, and a single vote is taken on the final
result.
Difference between the two approaches:
- Fault containment: voting at the sensors
masks and contains the effects of a
possible sensor fault.
HW Voting vs. SW Voting?
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight/volume
limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to
future changes in the system
[Figure: Example of SW voting. Proc 1, Proc 2, and Proc 3 each run Task A; a Voter Task precedes Task B.]
In practical applications of voting, the 3 results in a
TMR system may not agree completely, even in a fault-free
environment:
e.g., A/D converters in sensors may produce values that
disagree in the least-significant bits. This disagreement can
propagate into larger discrepancies after computation, which
can significantly affect the voting process.
Solution: Mid-Value Select Technique
A TMR system selects the value that lies
between the other two.
[Figure: Mid-value selection. Of three signals (two uncorrupted, one corrupted), the selected signal is the one lying in the middle.]
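Mid-value selection amounts to taking the median of the three inputs. The sketch below assumes numeric readings; the function name and the sample temperatures are illustrative only.

```python
def mid_value_select(a, b, c):
    """Return the middle of three values (the median)."""
    return sorted((a, b, c))[1]

# Two healthy sensors disagree slightly; the third is corrupted.
print(mid_value_select(20.1, 20.3, 95.0))  # 20.3
```

Unlike exact-match majority voting, this tolerates small disagreements between fault-free channels, because the selected value is always bounded by two of the three inputs.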
2. Active (or Dynamic)
- Attempts to achieve fault tolerance by means of fault
detection, fault location, reconfiguration, and recovery
(fault masking is not obtained: there is no attempt
to prevent faults from producing errors within the system).
- More suitable for applications where temporary erroneous
results are acceptable, as long as the system reconfigures and
regains its operational status within a satisfactory length of time.
- Duplication of functional units (duplication with comparison)
- Standby sparing technique
  - Hot Standby Sparing
  - Cold Standby Sparing
[Figure: A software implementation of duplication with comparison. Processor A and Processor B each hold their own result in private memory and place a copy in shared memory; a comparison task on each processor compares Processor A's result with Processor B's result and produces an error signal.]
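Duplication with comparison can be approximated in software as follows. This is a sketch under stated assumptions: the two "processors" are plain functions, the shared memory is a dictionary, and all names are hypothetical.

```python
def duplicate_and_compare(compute_a, compute_b, data):
    """Run the same task on two units and compare their results.

    Returns (result_a, error). The error flag only says that the two
    units disagree; it does not say which one is faulty.
    """
    result_a = compute_a(data)   # kept in A's private memory
    result_b = compute_b(data)   # kept in B's private memory
    shared = {"A": result_a, "B": result_b}   # copies in shared memory

    error_a = result_a != shared["B"]   # comparison task on processor A
    error_b = result_b != shared["A"]   # comparison task on processor B
    return result_a, (error_a or error_b)

healthy = lambda x: x * 2
faulty = lambda x: x * 2 + 1   # hypothetical faulty unit
print(duplicate_and_compare(healthy, faulty, 21))  # (42, True)
```

The scheme detects the disagreement but cannot mask it, which matches the description of active redundancy above.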
3. Hybrid
- Combines the attractive features of
both the active and the passive
approaches.
3.3 Software Redundancy
Consistency Checks
Capability Checks
N-Self-Checking Programming
N-Version Programming
Recovery Blocks
Consistency Checks
Use a priori knowledge about the characteristics of a given
piece of information to verify that the information is correct.
Typically, in most applications it is known that a certain
quantity should never exceed a
previously defined value.
Examples ...
- In a typical control application, a processing system samples and
stores many sensor readings; each reading can be checked against the
range of values that is plausible for that sensor.
- The amount of cash requested by a customer at a bank's teller
machine should never exceed the maximum withdrawal allowed.
- The address generated by a computer should never lie outside
the address range of the available memory.
- In a computer, each instruction code can be checked to verify
that it is not one of the illegal codes.
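Consistency checks of this kind reduce to comparing a value against limits known in advance. The sketch below covers the withdrawal and address examples; the limit, the memory map, and the function names are hypothetical values chosen for illustration.

```python
MAX_WITHDRAWAL = 500                         # hypothetical daily limit
MEMORY_BASE, MEMORY_SIZE = 0x0000, 0x8000    # hypothetical memory map

def withdrawal_is_consistent(amount):
    """A requested amount must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL

def address_is_consistent(address):
    """A generated address must fall inside the available memory."""
    return MEMORY_BASE <= address < MEMORY_BASE + MEMORY_SIZE

print(withdrawal_is_consistent(200), withdrawal_is_consistent(9000))
print(address_is_consistent(0x1234), address_is_consistent(0x9000))
```

A failed check signals that something upstream produced an implausible value, without identifying the fault itself.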
Capability Checks
Capability checks are performed to verify that a
system possesses the capability expected.
Examples ...
- Check whether a computer has its complete memory available.
- Check whether the processors in a multiprocessor system are
alive.
- Periodically, a processor can execute specific instructions on
specific data and compare the results with known good results
stored in a ROM: a check of the ALU and memory.
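The periodic ALU check can be sketched as a known-answer test. Here a tuple stands in for the ROM contents; the table entries and the name `alu_self_test` are assumptions made for the example.

```python
import operator

# Hypothetical stand-in for known good results stored in ROM:
# (operation, operand 1, operand 2, expected result)
ROM_TABLE = (
    (operator.add, 2, 9, 11),
    (operator.mul, 7, 8, 56),
    (operator.and_, 0b01101010, 0b01111111, 0b01101010),
)

def alu_self_test(table=ROM_TABLE):
    """Execute each stored operation and compare with its known good result."""
    return all(op(a, b) == expected for op, a, b, expected in table)

print(alu_self_test())  # True
```

On real hardware the operations would be machine instructions exercised on the physical ALU; this sketch only shows the compare-against-stored-answers structure.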
N-Self-Checking Programming
[Figure: The N-Self-Checking Programming approach to software fault tolerance. The program inputs feed each program version; each version's result passes through its acceptance tests; selection logic chooses the program outputs.]
Hot Standby:
all programs are running concurrently
Reduced recovery latency:
reconfiguration process is very fast
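A minimal sketch of N-self-checking programming, assuming each version is paired with its own acceptance test and that all versions run before selection (hot standby). The square-root task, the tolerance, and all names are hypothetical.

```python
import math

def n_self_checking(components, x):
    """components: list of (version, acceptance_test) pairs.

    Every version runs on x (hot standby); the selection logic returns
    the first result that passes its acceptance test.
    """
    outcomes = [(version(x), accept) for version, accept in components]
    for result, accept in outcomes:
        if accept(x, result):
            return result
    raise RuntimeError("no version passed its acceptance test")

def sqrt_acceptable(x, r):
    """Acceptance test: squaring the result must give back the input."""
    return abs(r * r - x) < 1e-6

def buggy_sqrt(x):
    return x / 2   # hypothetical faulty version

components = [(buggy_sqrt, sqrt_acceptable), (math.sqrt, sqrt_acceptable)]
print(n_self_checking(components, 9.0))  # 3.0
```

Because the second result is already computed when the first fails its test, switching over costs only the selection step, which is the reduced recovery latency noted above.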
3.4 Information Redundancy
Parity, Berger, and m-of-n Codes
Arithmetic Codes
Hamming Codes
Checksum Codes
CRC (Cyclic Redundancy Check) Codes
3.5 Time Redundancy
Transient Fault Detection
Permanent Fault Detection
Recomputation for Error Correction
Transient Fault Detection
The fundamental concept is to perform the
same computation two or more times and
compare the results to determine if a
discrepancy exists.
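Repeating a computation and comparing the results can be sketched as below. The transient fault is simulated by a hypothetical adder that corrupts only its first call; none of these names come from the slides.

```python
def recompute_and_compare(compute, data, repeats=2):
    """Run compute(data) several times; flag an error if any result differs."""
    results = [compute(data) for _ in range(repeats)]
    error = any(r != results[0] for r in results[1:])
    return results[0], error

def make_glitchy_adder():
    """Hypothetical unit: only the first call suffers a transient bit flip."""
    calls = {"n": 0}
    def add(pair):
        calls["n"] += 1
        total = pair[0] + pair[1]
        return total ^ 0b100 if calls["n"] == 1 else total
    return add

value, error = recompute_and_compare(make_glitchy_adder(), (2, 9))
print(error)  # True
```

A permanent fault would corrupt both runs identically and go unnoticed, which is why permanent faults need the encoded recomputation described next.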
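Permanent Fault Detection
[Figure: Time redundancy for permanent fault detection. At time t0: Data, Computation, Store Result. At time t1: Data, Encode Data, Computation, Decode Result, Store Result. The two stored results are compared (Compare Results), and a mismatch raises Error.]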
- Example encoding functions might be the complementation operator
or an arithmetic shift:
- 6 ÷ 4 = 1, remainder 2 → (1 x 4) + 2 = 6
- 7 x 8 = 56 → 56 ÷ 8 = 7
- 7 x 8 = 56 → 8 x 7 = 56
- 2 + 9 = 11 → 11 - 9 = 2
- 0110.1010 AND 0111.1111 = 0110.1010 →
0110.1010 shift right 2 (circular): 1001.1010,
0111.1111 shift right 2 (circular): 1101.1111,
1001.1010 AND 1101.1111 = 1001.1010
1001.1010 shift left 2 (circular): 0110.1010
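The AND example can be reproduced in code. This sketch assumes 8-bit words, a circular shift by two positions as the encoding, and a simulated ALU whose bit slice 3 is stuck at 0; the helper names are hypothetical.

```python
MASK = 0xFF  # 8-bit words, as in the AND example

def rotr8(x, n):
    """Rotate an 8-bit value right by n positions."""
    return ((x >> n) | (x << (8 - n))) & MASK

def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & MASK

def and_alu(a, b, stuck_bit=None):
    """AND unit; if stuck_bit is set, that bit slice is stuck at 0."""
    result = a & b
    if stuck_bit is not None:
        result &= ~(1 << stuck_bit) & MASK
    return result

def and_with_check(a, b, alu):
    """Compute once plainly and once on rotated operands, then compare."""
    plain = alu(a, b)
    decoded = rotl8(alu(rotr8(a, 2), rotr8(b, 2)), 2)
    return plain, plain != decoded

a, b = 0b01101010, 0b01111111
print(and_with_check(a, b, and_alu))                                  # (106, False)
print(and_with_check(a, b, lambda x, y: and_alu(x, y, stuck_bit=3)))  # (98, True)
```

The permanent fault hits a different logical bit in each pass, so the plain and decoded results disagree and the error is flagged.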
Recomputation for Error Correction
- The time redundancy approach can also provide error correction
if the computations are repeated three or more times.
- Consider the example of a logical AND operation. Suppose the
operation is performed three times: first, without shifting the
operands; second, with a one-bit logical shift of the operands;
and third, with a two-bit logical shift of the operands.
- Then, the results generated using the shifted operands are
shifted back to the correct position.
- Because each of the three operations used operands that were
displaced from each other by at least one bit position, a different
bit in each result will be affected by the faulty bit slice.
- If the bits in each position are then compared, the results due to
the faulty bit slice can be corrected by performing a majority vote
on the three results.
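The three-pass correction can be sketched as follows. Two assumptions are made to keep the example runnable on 8-bit words: the operands are rotated rather than logically shifted, so no bits are lost, and the ALU is simulated with bit slice 3 stuck at 0. All names are hypothetical.

```python
MASK = 0xFF  # 8-bit words

def rotl8(x, n):
    """Rotate an 8-bit value left by n positions (n = 0 leaves it unchanged)."""
    return ((x << n) | (x >> (8 - n))) & MASK

def rotr8(x, n):
    """Rotate an 8-bit value right by n positions (n = 0 leaves it unchanged)."""
    return ((x >> n) | (x << (8 - n))) & MASK

def faulty_and(a, b, stuck_bit=3):
    """AND unit whose bit slice `stuck_bit` is stuck at 0."""
    return (a & b) & ~(1 << stuck_bit) & MASK

def corrected_and(a, b, alu):
    """Repeat the AND with displacements 0, 1, 2 and vote bit by bit."""
    r0, r1, r2 = (rotr8(alu(rotl8(a, k), rotl8(b, k)), k) for k in (0, 1, 2))
    return (r0 & r1) | (r0 & r2) | (r1 & r2)   # bitwise majority vote

a, b = 0b01101010, 0b01111111
print(format(faulty_and(a, b), "08b"))                 # 01100010
print(format(corrected_and(a, b, faulty_and), "08b"))  # 01101010
```

Each pass places the faulty slice under a different logical bit, so at most one of the three results is wrong in any bit position and the vote recovers the correct value.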