

# Fault-Tolerant Soft RISC-V Linux SoCs on SRAM-based FPGAs in Space



## FAULT TOLERANT SOFT RISC-V SOC FOR SPACE

- RISC-V is an open-standard architecture with a robust open-source ecosystem, including soft Linux SoC frameworks such as LiteX, that offers broad IP compatibility across FPGA platforms.
- While various methods exist to enhance fault tolerance, few provide rigorous verification or measurement of these improvements.
- This work presents a RISC-V design that has undergone three iterative enhancements—incorporating TMR, ECC, and guided SEU-aware placement—to bolster reliability.
- Static and dynamic assessments show a substantial increase in mean time between failures: in neutron beam testing, the most hardened design has a 43.59× smaller failure cross section than the unmitigated design.
- As the open-source community increasingly embraces RISC-V, it becomes feasible to deploy a fault-tolerant Linux SoC on commercial off-the-shelf (COTS) FPGAs.

# DESIGN UNDER TEST ~ FT LINUX SOC

- Utilizes the RISC-V architecture with LiteX to deploy Linux SoCs optimized for space applications.
- Builds on an open-source IP ecosystem featuring processors and peripherals designed for extreme environments.
- Boots Linux images via SD card with UART interfaces for robust, remote system diagnostics in space.
- Runs the Dhrystone benchmark to continuously monitor performance and system stability.
- Targets the Antmicro Data Center DRAM Tester used to validate DDR4 memory controllers.

### <u>Test Setup</u>

[Figure: test setup diagram; labeled components include "TFTP Server" and "Nexys Video"]

#### **Example Output**

[Figure: example output]

#### **Target Platform**

[Figure: target platform]



# FAULT INJECTION

- Fault injection emulates CRAM upsets in SRAM FPGAs, testing for single-point failures in the soft RISC-V design.
- The process injects faults into configuration memory by reading a configuration frame, flipping a single bit, and writing the modified frame back while the FPGA is active.
- This process evaluates a subset of the possible faults that are induced by radiation.

#### **BYU Fault Injection**

#### Fault Injection Sensitivity

| Design      | Sensitivity | Norm. |
|-------------|-------------|-------|
| Unmitigated | 1.274%      | 1.00× |
| ECC         | 1.090%      | 1.17× |
| TMR+ECC     | 0.035%      | 36.1× |
| TMR+ECC+PAR | 0.036%      | 35.2× |

# STATIC ANALYSIS

- Static net analysis identifies unmitigated FPGA elements that may cause single-point failures in soft RISC-V processors.
- The open-source Bitstream Fault Analysis Tool (BFAT) examines Vivado design checkpoint (DCP) files to map sensitive configuration bits, programmable interconnect points (PIPs), and basic elements of logic (BELs), with options to review all nets or filter out triplicated ones.
- This analysis provides a quick estimate of how well redundancy can eliminate single-point failures in the FPGA's digital design.

#### Static Net Sensitivity

| Design      | Sensitivity | Norm.  |
|-------------|-------------|--------|
| Unmitigated | 4.16%       | 1.00×  |
| ECC         | 5.03%       | 0.83×  |
| TMR+ECC     | 0.55%       | 7.53×  |
| TMR+ECC+PAR | 0.24%       | 17.56× |

# FAULT TOLERANT METHODS

This FT RISC-V design mitigates faults by combining several accessible, open techniques.

[Figure: Fine-Grain TMR]

- Fine-grain TMR is implemented with SpyDrNet by triplicating FPGA primitives and inserting majority voters to mask single event upsets.
- The system leverages LiteDRAM's 72-bit ECC, appending 8 check bits to each 64-bit data word to correct single-bit errors and flag multi-bit errors without extra CPU overhead.
- SEU-aware placement (PAR) reduces critical routing paths by positioning voters near associated cells and partitioning TMR domains.
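The interaction between single-bit fault injection and majority voting can be illustrated with a small model. This is a minimal sketch under loud assumptions: the "configuration memory" is a single 4-input LUT truth table (or three copies of it), the voter is modeled in Python and cannot itself be upset, and `run_design` and `injection_campaign` are hypothetical names. It is not the SpyDrNet API or the BYU fault injection tooling. Sensitivity is computed the same way as in the tables: failing injections divided by bits injected.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def run_design(cram: list[int], copies: int, n_inputs: int = 4) -> list[int]:
    """Evaluate a toy design: `copies` identical LUT truth tables stored in `cram`."""
    size = 1 << n_inputs
    outputs = []
    for addr in range(size):
        results = [cram[k * size + addr] for k in range(copies)]
        # One copy: pass through. Three copies: vote.
        outputs.append(results[0] if copies == 1 else majority(*results))
    return outputs


def injection_campaign(cram: list[int], copies: int) -> float:
    """Flip every bit once, compare against the golden outputs, then repair it."""
    golden = run_design(cram, copies)
    failures = 0
    for bit in range(len(cram)):
        cram[bit] ^= 1                      # inject: flip a single bit
        if run_design(cram, copies) != golden:
            failures += 1                   # observable failure
        cram[bit] ^= 1                      # repair before the next injection
    return failures / len(cram)             # sensitivity


# Hypothetical design: a 4-input XOR stored as a 16-entry truth table.
XOR4 = [bin(addr).count("1") & 1 for addr in range(16)]

unmitigated = injection_campaign(list(XOR4), copies=1)
tmr = injection_campaign(list(XOR4) * 3, copies=3)
print(f"unmitigated sensitivity: {unmitigated:.1%}")
print(f"TMR sensitivity:         {tmr:.1%}")
```

In this idealized model every single upset is masked once the logic is triplicated. On a real device the voters, routing, clocks, and control signals also occupy configuration memory, which is why the measured TMR sensitivities in the tables are small but not zero.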
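The 72-bit ECC word can be pictured as 64 data bits plus 8 check bits. The sketch below uses a textbook extended Hamming (SECDED) layout to show how 8 check bits suffice to correct any single-bit error and detect any double-bit error. The bit ordering, the function names, and the choice of extended Hamming are assumptions for illustration; the poster does not specify LiteDRAM's actual encoding.

```python
DATA_BITS = 64
CODE_BITS = 72                       # 64 data + 7 Hamming parity + 1 overall parity
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, CODE_BITS) if p not in PARITY_POS]
assert len(DATA_POS) == DATA_BITS


def syndrome(code: list[int]) -> int:
    """XOR of the positions (1..71) of all set bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, CODE_BITS):
        if code[pos]:
            s ^= pos
    return s


def encode(data: int) -> list[int]:
    """Return a 72-entry bit list; index 0 holds the overall parity bit."""
    code = [0] * CODE_BITS
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    s = syndrome(code)
    for p in PARITY_POS:             # choose parity bits that cancel the syndrome
        code[p] = 1 if s & p else 0
    code[0] = sum(code) & 1          # make the total number of ones even
    return code


def decode(code: list[int]) -> tuple[int, str]:
    """Return (data, status) with status 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    s = syndrome(code)
    odd_weight = sum(code) & 1
    if not odd_weight and s == 0:
        status = "ok"
    elif odd_weight and s < CODE_BITS:
        code[s] ^= 1                 # s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        status = "uncorrectable"     # even weight with nonzero syndrome: 2 flips
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


word = 0x0123456789ABCDEF
stored = encode(word)
stored[17] ^= 1                      # one upset somewhere in the 72-bit word
recovered, status = decode(stored)
print(hex(recovered), status)
```

Because correction happens on every word as it is decoded, errors are repaired without CPU involvement. Upsets that accumulate in the same word between accesses are what turn a correctable error into an uncorrectable one.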



#### **Design Utilization**

| Design      | LUT    | LUTRAM | FF    | BRAM  | IO    |
|-------------|--------|--------|-------|-------|-------|
| Unmitigated | 10.8%  | 1.13%  | 4.82% | 16.8% | 35.3% |
| ECC         | 13.6%  | 1.13%  | 5.79% | 16.8% | 38.3% |
| TMR+ECC     | 48.11% | 3.52%  | 17.4% | 50.3% | 38.3% |
| TMR+ECC+PAR | 52.5%  | 3.52%  | 17.4% | 50.3% | 38.3% |

# **RADIATION TESTING**

- Radiation tests were conducted at the ChipIr facility at Rutherford Appleton Laboratory in the UK, where a high-energy neutron beam irradiated only the FPGA (shielding the DDR memory) while the system operated at room temperature.
- A collimator ensured precise targeting of the board, exposing the soft RISC-V processor to accelerated single event effects (SEEs) testing.
- A JTAG-controlled scrubber, running at 25 Mbps, continuously scrubbed the configuration memory (CRAM) by reading the entire memory, comparing it to a golden reference, and correcting errors in real time.
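The read, compare, and correct cycle of the scrubber described in the last bullet can be sketched as follows. This is a simplified model, not the scrubber used in the test: frames are short byte strings of a made-up size, `scrub_pass` is a hypothetical name, and the JTAG transport and its 25 Mbps rate are not modeled at all.

```python
def scrub_pass(cram: list[bytearray], golden: list[bytes]) -> int:
    """Read every frame, compare it with the golden copy, and repair mismatches.

    Returns the number of upset bits found during this pass.
    """
    upsets = 0
    for idx, (frame, ref) in enumerate(zip(cram, golden)):
        if frame != ref:
            # Count differing bits so the pass also yields an upset tally.
            upsets += sum(bin(a ^ b).count("1") for a, b in zip(frame, ref))
            cram[idx][:] = ref       # write the golden frame back
    return upsets


# Hypothetical device image: 4 frames of 8 bytes each.
golden = [bytes([i] * 8) for i in range(4)]
cram = [bytearray(frame) for frame in golden]

cram[1][3] ^= 0x10                   # simulated upset in frame 1
cram[2][0] ^= 0x01                   # simulated upset in frame 2

first_pass = scrub_pass(cram, golden)
second_pass = scrub_pass(cram, golden)
print(f"upsets repaired: first pass {first_pass}, second pass {second_pass}")
```

Running the pass continuously bounds how long any upset stays in configuration memory, and the per-pass tally is one way an upset count for a beam run could be accumulated.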
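The cross sections in the Radiation Results table follow from the beam data as failures divided by fluence, $\sigma = N_\text{fail} / \Phi$, and the normalized improvement is the unmitigated cross section divided by each design's cross section. The sketch below reproduces that arithmetic from the table's fluence and failure counts. The helper name `cross_section` is illustrative, and the computed ratios can differ slightly from the published Norm. column because the table values are rounded.

```python
def cross_section(failures: int, fluence: float) -> float:
    """Failure cross section in cm^2: observed failures per unit fluence (n/cm^2)."""
    return failures / fluence


# (failures, fluence in n/cm^2) taken from the Radiation Results table.
RUNS = {
    "Unmitigated": (161, 4.21e10),
    "ECC": (128, 8.04e10),
    "TMR+ECC": (105, 5.45e11),
    "TMR+ECC+PAR": (82, 9.36e11),
}

sigma = {name: cross_section(f, phi) for name, (f, phi) in RUNS.items()}
baseline = sigma["Unmitigated"]
improvement = {name: baseline / s for name, s in sigma.items()}

for name in RUNS:
    print(f"{name:12s} sigma = {sigma[name]:.2e} cm^2  "
          f"improvement = {improvement[name]:.2f}x")
```

Since mean time between failures in a given radiation environment scales inversely with cross section, the improvement factor is also the expected MTBF gain over the unmitigated design.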

The system's robustness was evaluated by running the Dhrystone Linux benchmark continuously, with a failure recorded whenever incorrect output was produced over the serial interface. This setup provided key metrics on neutron cross section and CRAM sensitivity, which are critical for assessing fault mitigation in space applications.

[Figure: Radiation Testing]

#### **Radiation Results**

| Design      | Fluence (n/cm<sup>2</sup>) | Upsets | Failures | Cross Section (cm<sup>2</sup>) | Norm.  |
|-------------|----------------------------|--------|----------|--------------------------------|--------|
| Unmitigated | $4.21 \times 10^{10}$      | 15656  | 161      | $3.82 \times 10^{-9}$          | 1.00×  |
| ECC         | $8.04 \times 10^{10}$      | 14121  | 128      | $1.59 \times 10^{-9}$          | 2.40×  |
| TMR+ECC     | $5.45 \times 10^{11}$      | 198225 | 105      | $1.93 \times 10^{-10}$         | 19.84× |
| TMR+ECC+PAR | $9.36 \times 10^{11}$      | 213351 | 82       | $8.76 \times 10^{-11}$         | 43.59× |

# TMR+ECC FAULT ANALYSIS

- BFAT analysis identified residual vulnerabilities down to FPGA PIPs and design nets.
- The 105 failure events in the TMR+ECC design include XDC-based placement issues, DDR4 ADDR/CMD parity errors, and DDR ECC error accumulation.
- PAR fixes target long routing issues, and DDR memory scrubbing repairs external memory upsets.
- Critical clock and control signals remain unmitigated without major design changes.
- Additional mitigation can push the fault tolerance further with 75% fewer events.

| Sensitivity | Percent |
|-------------|---------|
| 12          | 11.43%  |
| 12          | 11.43%  |
| 56          | 53.33%  |
| 24          | 23.81%  |

## CONCLUSION

- Over an estimated 98% of the failures seen in the naïve design can be mitigated.
- Thorough testing of these mitigation methods identifies their fault coverage.
- These mitigation methods can greatly improve soft RISC-V processor designs.
- Use the links below to ask live questions or for further contact.



Contact

