AbstractsComputer Science

Analysis and Development of Error-Job Mapping and Scheduling for Network-on-Chips with Homogeneous Processors

by Erik Karlsson




Institution: Linköping University
Department:
Year: 2010
Keywords: NoC; Network-on-Chip; MPSoC; fault tolerance; fault-tolerant; job; mapping; scheduling; RRC; minimize; AET; efficient; Natural Sciences; Computer and Information Science; Computer Engineering; Naturvetenskap; Data- och informationsvetenskap; Datorteknik; TECHNOLOGY; Information technology; Computer engineering; TEKNIKVETENSKAP; Informationsteknik; Datorteknik; Examensarbete i Datorteknik; Computer Engineering; teknik; Technology
Record ID: 1337159
Full text PDF: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-54633


Abstract

Due to increased complexity of today’s computer systems, which are manufactured in recent semiconductor technologies, and the fact that recent semiconductor technologies are more liable to soft errors (non-permanent errors) it is inherently difficult to ensure that the systems are and will remain error-free. Depending on the application, a soft error can have serious consequences for the system. It is therefore important to detect the presence of soft errors as early as possible and recover from the erroneous state and maintain correct operation. There is an entire research area devoted on proposing, implementing and analyzing techniques that can detect and recover from these errors, known as fault tolerance. The drawback of using faulttolerance is that it usually introduces some overhead. This overhead may be for instance redundant hardware, which increases the cost of the system, or it may be a time overhead that negatively impacts on system performance. Thus a main concern when applying fault tolerance is to minimize the imposed overhead while the system is still able to deliver the correct error-free operation. In this thesis we have analyzed one well known fault tolerant technique, Rollback-Recovery with Checkpointing (RRC). This technique is able to detect and recover from errors by taking and storing checkpoints during the execution of a job.Therefore we can think as if a job is divided into a number of execution segments and a checkpoint is taken after executing each execution segment. This technique requires the job to be concurrently executed on two processors. At each checkpoint, both processors exchange data, which contains enough information for the job’s state. The exchanged data are then compared. If the data differ, it means that an error is detected in the previous execution segment and it is therefore re-executed. If the exchanged data are the same, it means that no errors are detected and the data are stored as a safe point from which the job can be restarted later. A time overhead due to exchanging data between processors is therefore introduced, and it increases the average execution time of a job, i.e. the average time required for a given job to complete. The overhead depends on the number of links that has to be traversed (due to data exchange) after each execution segment and the number of execution segments that are needed for the given job. The number of links that has to be traversed after each execution segment is twice the distance between the processors that are executing the same job concurrently. However, this is only true if all the links are fully functional. A link failure can result in a longer route for communication between the processors. Even though all links arefully functional, the number of execution segments still depends on error-free probabilities of the processors, and these error-free probabilities can vary between processors. This implies that the choice of processors affects the total number of links the communication has to traverse. Choosing two processors with…