|Institution:||Technische Universität Darmstadt|
|Department:||Fachbereich InformatikDependable, Embedded Systems & Software Group (DEEDS Group)|
|Full text PDF:||http://tuprints.ulb.tu-darmstadt.de/4559/|
Fault injection (FI) is an experimental technique to assess the robustness of software by deliberately exposing it to faulty inputs and stressful environmental conditions specified by fault models. As computing hardware is becoming increasingly parallel, software execution is becoming increasingly concurrent. Moreover, to exploit the potential of increasingly parallel and interconnected computing systems, the complexity of the software stack, comprising operating systems, drivers, protocol stacks, middleware and distributed applications, also grows. As a consequence, the fault models classically used for in-lab robustness assessments no longer match the reality of fault conditions that modern systems are exposed to. In my thesis I account for this development by proposing the construction of higher order fault models from classical models by different means of composition. To demonstrate the effectiveness of such higher order models, I define a set of four comparative fault model efficiency metrics and apply them for both classical and higher order models. The results show that higher order models identify robustness issues that classical models fail to detect, which supports their adoption in the assessment of modern software systems. While higher order models can be implemented with moderate effort, they result in a combinatorial explosion of possible fault conditions to test with and, more severely, they introduce an ambiguity to experimental result interpretation that results in an increased number of required tests to maintain the expressiveness of assessment results. To mitigate the overhead that the adoption of higher order fault models entails, I propose to increase the experiment throughput by concurrently executing experiments on parallel hardware. The results show that such parallelization yields the desired throughput improvements, but also that care has to be taken in order not to threaten the metrological compatibility of the results. To account for resource contention in concurrent experiment executions that can lead to result deviations, I present a calibration approach that provides timeout values for the safe configuration of hang failure detectors. The experimental environment to conduct concurrent experiments is based on a generic FI framework developed in the context of my thesis that is highly adaptable to a variety of different target systems and that has been released under an open source license.