Dragon's "Radiation-Tolerant" Design
Last week, NASA revealed that SpaceX's first commercial resupply mission to the ISS experienced a number of anomalies in addition to the shutdown of a Falcon 9 first-stage engine, including the loss of one of three flight computers on the Dragon cargo vessel due to a suspected radiation hit. Over the weekend I spoke with John Muratore, SpaceX director of vehicle certification, who said the loss of the computer was a function of the radiation-tolerant system design on which Dragon relies, rather than hard-to-come-by "rad-hardened" parts that can be costly and difficult to upgrade.
AWST: So, NASA does not require SpaceX to use radiation-hardened computer systems on the Dragon?
John Muratore: No, as a matter of fact NASA doesn't require it on their own systems, either. I spent 30 years at NASA and in the Air Force doing this kind of work. My last job was chief engineer of the shuttle program at NASA, and before that as shuttle flight director. I managed flight programs and built the mission control center that we use there today.
On the space station, some areas are using rad-hardened parts and other parts use COTS parts. Most of the control of the space station occurs through laptop computers which are not radiation hardened.
The radiation environment is something people have known about for a long time. It's part of the natural environment, and it varies. It matters what kind of mission you're doing. With Dragon we're doing low-Earth orbit, short-duration missions and that drives a lot of the architecture.
So NASA didn't require radiation-hardened parts. It did, however, require us to do a hard analysis of the radiation environment, the effect of the environment on the Dragon systems and how we'd respond to that. We not only produced that analysis, but it was reviewed by an independent panel of experts. So NASA had very strong requirements for us to understand the environment and have planned out our responses to the environment, and we've done that.
Q: So, these flight computers on Dragon – there are three on board, and that's for redundancy?
A: There are actually six computers. They operate in pairs, so there are three computer units, each of which has two computers checking on each other. The reason we have three is when operating in proximity of ISS, we have to always have two computer strings voting on critical actions. We have three so we can tolerate a failure and still have two voting on each other. And that has nothing to do with radiation, that has to do with ensuring that we're safe when we're flying our vehicle in the proximity of the space station.
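[Editor's sketch: the arrangement Muratore describes, three units that each contain a self-checking pair, with at least two units required to agree on a critical action, can be illustrated roughly as follows. This is not SpaceX flight code. The function names `pair_output` and `vote`, the integer "commands" and the quorum parameter are all assumptions made for illustration.]

```python
from collections import Counter
from typing import Optional, Sequence


def pair_output(a: int, b: int) -> Optional[int]:
    """Self-checking pair: pass the value on only if both processors agree."""
    return a if a == b else None


def vote(strings: Sequence[Optional[int]], quorum: int = 2) -> Optional[int]:
    """Return the command that at least `quorum` healthy strings agree on.

    A string that reports None (its pair disagreed, or it is rebooting)
    is excluded from the vote. Returns None if no quorum is reached.
    """
    healthy = [s for s in strings if s is not None]
    if not healthy:
        return None
    value, count = Counter(healthy).most_common(1)[0]
    return value if count >= quorum else None


# Hypothetical example: the second unit's pair disagrees (say, after a bit
# flip), so that unit drops out, and the remaining two still form a quorum.
units = [pair_output(7, 7), pair_output(7, 5), pair_output(7, 7)]
print(units)        # [7, None, 7]
print(vote(units))  # 7
```

[With three units, one can drop out and two remain to vote on each other, which is the failure tolerance described in the answer above.]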
I went into the lab earlier today, and we have 18 different processing units with computers in them. We have three main computers, but 18 units that have a computer of some kind, and all of them are triple computers – everything is three processors. So we have like 54 processors on the spacecraft. It's a highly distributed design and very fault-tolerant and very robust.
Q: But there's nothing on the spacecraft in the way of radiation-hardened parts?
A: The parts aren't hardened, the design as a total system is hardened. What it is is each part does not go through the screening that is typical of radiation-hardened parts. Now that doesn't mean that each part can't take the dose that a “rad-hardened” part can, because we've taken all of our designs and we've tested them extensively, we've had contracts with the [NASA] Jet Propulsion Lab (JPL) to consult us, and they're the world's experts in it, and we've gone to the University of Indiana and tested all of our parts, and we test them until they fail. We keep bringing the environment up and up and up until they fail. But we test them as a total system, not each part at a time. We've tested lots of our parts to very, very high radiation environments. So we test them as a total system, and by that I mean a unit with three processors in it, we test the entire unit. We take the cover off and we hit it really, really hard with radiation, and we do that so we understand how the parts react in the radiation environment.
Q: So what happened in this situation where one computer on board Dragon had a suspected radiation hit and shut down?
A: Think of a computer as lots of white marbles that are arranged in a specific pattern on a table, and a black marble comes in and knocks one of the white marbles out of place. Now, the memories of our computers are constantly checking for that happening. So if we take a hit in our most dense part of our computer – the memory – the computer detects it and repairs it and there's no harm done. But our other circuits in the computer, places like where we're bringing information in and out of the processor, if we take a hit there it can cause basically a bit to flip from a zero to a one. And that instruction can be wrong, and that is where the two processors in a single computer element voting on each other can detect that, and it can force a reboot. And that's what happened, we rebooted the computer.
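[Editor's sketch: the "detects it and repairs it" behavior of error-correcting memory can be illustrated with a textbook Hamming(7,4) code, which stores four data bits alongside three parity bits and can locate and fix any single flipped bit. The interview does not say which code Dragon's memory uses, so the choice of Hamming(7,4) and the names `encode` and `correct` are assumptions for illustration only.]

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Bit positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def correct(word: List[int]) -> Tuple[List[int], int]:
    """Repair up to one flipped bit; return (data bits, flipped position or 0)."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based position of the bad bit
    if pos:
        c[pos - 1] ^= 1  # flip it back
    return [c[2], c[4], c[5], c[6]], pos


stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # the "black marble": flip the bit at position 5
data, pos = correct(stored)
print(data, pos)  # [1, 0, 1, 1] 5
```

[A code like this only protects stored bits. A flip in the input/output paths is not covered by it, which is why the answer above falls back on the two processors in a unit comparing results and forcing a reboot.]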
Q: You rebooted the computer, but I understand it didn't re-sync, was that intentional?
A: Let's say you're working on something on your PC and you have Internet Explorer up and Word and a whole bunch of things and you take a glitch in the computer and it reboots and you lose all your work. What we do is when we re-sync, the two computers that are still running and have all the latest applications up, they load all that information in the memory so the three memories have all the same information. So when we rebooted, we had the option to re-sync. And we had practiced that on the ground lots. We do it all the time. Matter of fact when we normally bring the computers up we re-sync them. So we'd done this tons of times. But we needed to coordinate that and explain what we were doing to all the partners on the space station, and that just took time. And NASA said rather than distract everybody with going through a long technical explanation of why we do that and convincing everybody it's all ok, can you guys just fly away the way you are? And we were like, yeah. We met every requirement that NASA had, even with one computer down.
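[Editor's sketch: the re-sync step described here, in which the two computers that kept running load their current state into the rebooted one so that all three memories match, might look something like the following in outline. The real procedure is not detailed in the interview. The `resync` function, the dictionary used as "state" and the rule that the peers must agree before copying are all hypothetical.]

```python
import copy
from typing import Dict, List

State = Dict[str, float]


def resync(rebooted: State, peers: List[State]) -> bool:
    """Copy the running peers' state into the rebooted computer.

    The copy only happens if the peers agree with each other; otherwise
    the rebooted computer is left untouched and False is returned.
    """
    if not peers or any(p != peers[0] for p in peers[1:]):
        return False
    rebooted.clear()
    rebooted.update(copy.deepcopy(peers[0]))
    return True


# Hypothetical state: two computers kept running, one came back up empty.
computer_a = {"range_to_iss_m": 250.0, "mission_time_s": 86400.0}
computer_b = {"range_to_iss_m": 250.0, "mission_time_s": 86400.0}
computer_c: State = {}

print(resync(computer_c, [computer_a, computer_b]))  # True
print(computer_c == computer_a)                      # True
```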
Q: So, is there going to be any corrective action in terms of modifications to Dragon for the next cargo resupply mission next year? NASA's ISS Program Manager Michael Suffredini has been quoted suggesting you may replace existing parts with “rad-hardened” parts.
A: I think he was just hypothesizing. The first time you do anything on the space station, you talk about it a lot. And then after you talk about it, the next time it happens it's just like the time before, and they say go ahead, no problem. On our output processors, we took some hits on the last mission [the Falcon 9/Dragon demo flight that delivered Dragon to ISS in June under NASA's Commercial Orbital Transportation Services (COTS) program]. And we had to spend a lot of time explaining to people what we were doing. It's an international consortium, it's a $100-billion program, it's a million pounds of hardware, and everybody's systems need to interact, and we need to explain that when we're going to do something. And when we're going to do something the first time, even though we've explained it in safety panels and safety reviews and flight procedures and flight-technique meetings and we had talked about it all before, the first time you actually come up to it, everybody just wants to talk about it again.
So we had similar radiation hits on the output units this time, and we called the flight director and he went “Yeah, go ahead, go reset.” So we reset the input/output units with about a five-minute discussion. It was no big deal. So I think that because of that, he's thinking we spent a lot of time talking about this, maybe you should consider some other kinds of parts. But I think it was just because it was the first time we went through it.
Q: Ok, is there any plan right now to make any changes in the flight computers for the next mission?
A: We might make some slight procedural or software changes so we can get through the re-synching faster. But that's all. We're still talking about that. There's no requirement to make any changes. We met every safety requirement that NASA put on us. Every piece of hardware that had any kind of hit recovered 100%, completely. So the design functioned exactly the way it was intended to function.
Q: Is it possible all three computer units could take a hit and go down at once?
A: So, remember the marbles. Now we've got three tables and the white marbles arranged on all three tables, and the black marble would have to go through so that it hit all three tables at once. And that would be hard to do. But even if it did, we normally power up the vehicle with the computers down. Matter of fact we run with the computers down all the time because each of the input/output units has its own three strings of computers in it. And we can command those directly, we can command them from the station, through the TDRS satellite, we can command them from our own ground station. There was no impact at all. And we would have just rebooted them and come up.
Q: What's the downside to buying radiation-hardened hardware or software? Is it expensive, or just not widely available?
A: It's really not the expense that drives it. We're committed to having the best possible parts in all of our designs. So if it cost a lot and we needed it, we'd go get it. We were already required to have all this redundancy in the computers to meet all the different safety requirements. Then we started looking at what parts do we want to use and what is appropriate for this design. And what really is more important to us than the cost of the parts is the capability of the parts – how much power do they use, how much memory do they hold, how much do they process, and how physically big are they. That's the first thing.
The second thing is what tools they come with. We run the Linux operating system, we program everything in C++, and that enables us to tap into a huge pool of very talented people and find the absolute best people in the computer and software industry to work with us. If you go into the radiation hardened parts, they are very limited in terms of what languages you can work in, what support packages there are for them, who knows how to program in them. It really limits your ability to work with the parts. And the other thing it really does is they all take a little longer time to get and they're a little harder to come by.
I just walked around the factory this morning, just in the office area alone, and we have over 40 of the flight computers sitting on people's desks. And if they were hard-to-come-by items, we wouldn't have that many computers. We've got 54 in a Dragon – and they're all different kinds of computers, different kinds of processors. We've got computers in the Falcon 9, we've got three computers in one unit on each engine in the Falcon 9, so that's 30 computers right there. We have hundreds of flight computers of different capability levels, and we're in multiple generations of design.
The radiation parts tend not to have growth and upgrade paths. It's very hard to grow, if you decide you want a little more capability, a little faster, you're really limited – it's that part. And we're already in our third generation of flight computer at SpaceX. In the last two years we've worked through three generations, we've got people working on a fourth generation computer. So we are constantly looking at what's available in the marketplace, moving with the marketplace so we can use the best software tools, the best people, the best techniques and achieve the most modern, optimized, efficient design. That's why we don't want to go into these lines, and they are good pieces of equipment, lots of people use them. But they don't open up the kind of possibilities that we want to have.
A lot of other programs are one program. At SpaceX our goal is the most reliable, cost-effective and safe access to space in the world, and our CEO [Elon Musk] is very clear: We're going to Mars. So building the computer for the Dragon isn't just about building the computer for the Dragon, it's about building a whole suite of tools, techniques, people and processes to then go to the next vehicle, and the next vehicle. And our equipment crosses lines. Falcon designs go into Dragon, we're currently retrofitting the Dragon design into the new Falcon, so our designs constantly keep evolving, and that's why we don't want to get into lines that have limited growth capacity.
Q: Did the space shuttle have rad-hardened computers?
A: They had rad-hardened design, not rad-hardened parts. I was one of the flight directors the first time we went to repair the Hubble Space Telescope, and they had the same kind of error-correcting memory approach that we have. And we just watched the errors counting up. I remember sitting on the console with my flight computer officer and we were just watching them crank up while we were up repairing the Hubble, and we were just going bang, bang, bang, taking errors and correcting them. So radiation-tolerant design vs. radiation-tolerant parts is very common and was used in shuttle.
Q: So you're not breaking a mold here.
A: We're taking it to an extent previously not done, but we're operating in a well-known set of techniques and capabilities.