Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure general purpose GPU (GPGPU) application resilience to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Alternatively, fault site selection techniques have been proposed to approach high accuracy with less fault injection experiments. However, most of the existing methods in the literature only focus on the single-bit fault model and only one input.
In this dissertation, we offer solutions to the two problems above. We extend a progressive fault site pruning technique for two multi-bit fault models: (a) multi-bit faults in the same word; (b) multiple single-bit faults in different words accessed by the same thread. We devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of application error resilience. Key of the SUGAR estimation methodology is the identification of repeating thread patterns that develop as a function of the size of the input. These patterns allow for accurate prediction of application error resilience for arbitrarily large inputs.
With the presence of input-aware estimation strategies, we are able to pinpoint the vulnerabilities in a GPGPU application, and propose low overhead protection techniques accordingly. Based on the variety of thread resilience in GPGPU applications, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. Our technique allows engaging partial protection mechanisms at the warp level. We illustrate that threads can be remapped into reliable or unreliable warps with only minimal introduced overhead, and then selective protection via replication is applied in unreliable warps. We show how this remapping facilitates warp replication for error detection and correction and achieves a significant reduction of execution cycles, comparing to standard techniques.
In addition to input-aware estimation and fortification, we present a detailed characterization comparing microarchitecture-level and software-level fault injection and show the gap of resilience estimation introduced by injecting faults into different layers in the system execution stack. We also implement a software-level redundancy protection mechanism and measure its effectiveness using microarchitecture-level and software-level fault injection.
Bio: Lishan Yang is a Ph.D. candidate in the Computer Science Department at William & Mary, under the supervision of Prof. Evgenia Smirni. Her research interest falls in GPU architecture, reliability analysis, performance analysis, workload characterization of large scale systems, reliability of HPC and large scale systems. Her Ph.D. research has been published in top conferences such as MICRO, Sigmetrics, and ICSE. Before coming to W&M, she received her bachelor’s degree in computer science from University of Science and Technology of China (USTC) in 2016.