Prototyping an efficient and cost-effective method to detect and mitigate random faults in COTS processors

Abstract

Space has fascinated humans since the dawn of civilization. From the first astronomers and philosophers of ancient times, who looked at the night sky searching for answers, to the scientists and engineers of modern missions, commanding space probes to the edge of the solar system, space has always been at the epicenter of scientific discovery and human curiosity. From the launch of Sputnik 1 in 1957 to the robotic rovers exploring Mars, space missions have always relied on the latest technological advancements to enable physical or remote exploration of celestial bodies. Traditionally, designing a space computer required a significant amount of resources, leading to designs with impressive radiation performance records. However, such designs lacked computational performance and required years of development, which increased the total cost of the mission.

In recent years, the advent of CubeSats has made access to space available to a wider community of enthusiasts, researchers and private companies developing low-mass spacecraft made out of Commercial Off-The-Shelf (COTS) components. These components, however, were designed with the goal of maximizing performance and power, and have little to no flight heritage. Several novel technologies have been proposed in the field of error detection and mitigation, in an effort to bridge the gap between COTS processors and their radiation-hardened counterparts. Even though the commercial semiconductor industry has increased the reliability of its products by continuously improving its designs and processes, CubeSats and other low-mass spacecraft that use these components are still less reliable than their larger counterparts.

Given the above, this thesis aimed to explore the latest developments in the field of space embedded systems and error detection techniques, in an effort to produce a software flow able to detect faults with increased compatibility across processor models. To accomplish this goal, the thesis was carried out at ARM Limited, as part of the Software Test Libraries (STL) team responsible for developing efficient assembly tests for detecting random faults. The Cortex-M55 CPU was chosen as the test-bed for this work, in order to develop STL routines for a reference module. The Main Interface Unit (MIU) was chosen to act as the proof-of-concept, since it is an important module in every Cortex-M processor, interfacing the core with the main memory.

More specifically, a series of tests was developed for every major module within the MIU. The design started from the largest module, which yielded a good trade-off between coverage and time. The tests comprised efficient assembly routines designed to trigger specific memory access patterns, targeting different portions of logic each time. At the same time, a verification software flow was developed to test the newly designed routines against a multitude of possible configurations and initialization parameters. This activity was necessary to ensure that the developed software would be able to operate in a variety of end applications, either in the context of a Real-Time Operating System or a bare-metal application.

The developed STL tests were subjected to a series of fault simulations using a state-of-the-art hardware simulation tool called ZOIX. Permanent faults caused by accumulated damage were modeled as stuck-at faults, whereas transient soft errors were modeled as transient toggle faults. Determining an accurate fault injection interval required knowledge of the radiation environment that a COTS-based mission would encounter. A state-of-the-art space simulation environment called SPENVIS was used to acquire metrics on selected reference missions in Low Earth Orbit and Geosynchronous Equatorial Orbit. This helped set the upper limits on upset rates, which were in turn used during fault simulation to recreate realistic conditions.
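
The two fault models named above can be illustrated with a minimal sketch. This is not the thesis's simulation flow and does not reproduce how ZOIX models faults; the function names (`stuck_at`, `transient_toggle`, `detects`) and the 4-bit example value are hypothetical and chosen only for illustration. A stuck-at fault forces one bit to a fixed level for as long as the fault is present, whereas a transient toggle inverts one bit once.

```python
def stuck_at(value: int, bit: int, level: int) -> int:
    """Permanent fault model: force `bit` of `value` to `level` (0 or 1)."""
    mask = 1 << bit
    return value | mask if level else value & ~mask


def transient_toggle(value: int, bit: int) -> int:
    """Transient fault model: invert `bit` of `value` a single time."""
    return value ^ (1 << bit)


def detects(observed: int, expected: int) -> bool:
    """A test flags a fault when the observed result differs from the golden one."""
    return observed != expected


golden = 0b1010  # hypothetical fault-free result of a test routine

# Stuck-at-1 on a bit that is already 1 leaves the result unchanged (masked fault).
masked = detects(stuck_at(golden, 1, 1), golden)
# Stuck-at-1 on a bit that should be 0 changes the result (detected fault).
found_permanent = detects(stuck_at(golden, 0, 1), golden)
# A single toggle always changes the bit it hits.
found_transient = detects(transient_toggle(golden, 3), golden)
```

The masked case suggests why a test has to drive the targeted logic to both levels before a stuck-at fault on it can be observed.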

The developed software tests exhibited solid performance in detecting permanent faults, while achieving promising results in transient fault simulations, given certain assumptions. A series of recommendations is given for future research on the current framework, in an effort to learn from the challenges faced and tackle some of the identified limitations. Under these assumptions, there is evidence that STLs could indeed be used for random error detection in future CubeSat missions, without increasing the total cost disproportionately.

Files

Karavelas_Detailed_Report_Fina... (.pdf)

File under embargo until 29-05-2025