Full text PDF: http://arks.princeton.edu/ark:/88435/dsp0147429c37z
As semiconductor technology scales towards ever-smaller transistor sizes, hardware fault rates are increasing due to process variation, reduced noise margins, aging effects, and increased susceptibility to soft errors. Reliability can be regained through redundancy, error checking with recovery, voltage scaling, and other means, but these techniques impose area and energy costs. Since important application classes (e.g., multimedia and streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture or microarchitecture level. So far, reliability research has largely focused on errors affecting program data and is not general enough to handle arbitrary bit errors. Notably, although some applications may tolerate errors affecting program data (e.g., image pixel value errors), error-prone programmable platforms may experience errors that corrupt the control flow or even cause exceptions that terminate the program. When data and control-flow dependencies are taken into account, approximately two-thirds of instructions can lead to such crashes or unresponsive states. In response, I propose coarse-grain protection mechanisms that detect catastrophic outcomes and guide application execution. These mechanisms protect the system against crashes, unresponsiveness, and corruption of external devices, and they also provide support for achieving acceptable output quality. For example, coarse-grain control-flow protection mechanisms ensure the sequencing of time-bounded coarse-grain compute operations. Similarly, in parallel streaming applications running on error-prone processors, errors may cause data misalignments that permanently degrade output quality; the coarse-grain protection mechanisms use the explicit communication directives of high-level programming languages, such as StreamIt, to pad or discard data for realignment.
Overall, rather than ensuring instruction-level or byte-level correctness, the coarse-grain protection mechanisms I propose convert potentially fatal errors into potentially tolerable data errors. In summary, this thesis addresses the requirements for error-tolerant execution by proposing and evaluating techniques for running data-error-tolerant streaming applications on general-purpose processors built from an unreliable fabric. My studies show how low-overhead microarchitectural modules can use coarse-grain application information to enable streaming computation on error-prone processors. As a result, sequential applications can provide good output quality on partially protected uniprocessors, and parallel applications can do so on multicore processors composed of partially protected uniprocessor cores.
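The pad-or-discard realignment mentioned above can be illustrated with a small software sketch. This is not the thesis's mechanism, which is described as microarchitectural and driven by StreamIt communication directives. It is a hypothetical Python illustration that assumes each coarse-grain frame has a known, fixed item count (`frame_size`) and that frame boundaries are reliable. The names `realign_frame`, `realign_stream`, and `pad_value` are invented for this example.

```python
from typing import Iterable, List


def realign_frame(items: List[int], frame_size: int, pad_value: int = 0) -> List[int]:
    """Force one frame to exactly frame_size items (hypothetical helper).

    Extra items are discarded and missing items are padded, so an error in
    the item count stays a data error confined to this frame.
    """
    if len(items) > frame_size:
        return items[:frame_size]  # discard the surplus items
    return items + [pad_value] * (frame_size - len(items))  # pad the shortfall


def realign_stream(frames: Iterable[List[int]], frame_size: int,
                   pad_value: int = 0) -> List[List[int]]:
    """Realign every frame so downstream consumers see fixed-size frames."""
    return [realign_frame(list(f), frame_size, pad_value) for f in frames]


if __name__ == "__main__":
    # Frame 2 lost an item and frame 3 gained one because of (simulated) errors.
    faulty = [[1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12]]
    print(realign_stream(faulty, frame_size=4))
```

In this sketch a wrong item count corrupts at most the frame in which it occurs. Later frames stay aligned, which mirrors the goal of turning a permanent misalignment into a bounded, potentially tolerable data error.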