Boeing 737 MAX crash and the rejection of ridiculous data

“Boeing 737 Max: What went wrong?” (BBC) contains a plot showing the angle of attack data being fed to Boeing’s MCAS software. Less than one minute into the flight, the left sensor spikes to an absurd angle of attack of roughly 70 degrees. Given the weight of an airliner, a change that abrupt was physically impossible because of inertia. But to have avoided killing everyone on board, the software would not have needed a “how fast is this changing?” capability. It would simply have needed a few extra characters in an IF statement. Had the systems engineers and programmers checked Wikipedia, for example (or maybe even their own web site), they would have learned that “The critical or stalling angle of attack is typically around 15° – 20° for many airfoils.” Beyond 25 degrees, therefore, it is either sensor error or the plane is stalling/spinning and something more than a slow trim is going to be required.

So, even without checking the left and right AOA sensors against each other (as previous, conventional stick pusher designs have done), all of the problems on the Ethiopian flight could potentially have been avoided by changing

IF AOA > 15 THEN RUNAWAY_TRIM();

to

IF AOA > 15 AND AOA < 25 THEN RUNAWAY_TRIM();

About 10 characters of code, in other words. (See the Related links below for the rest of the flaws in the MCAS system design, which the above tweak would not have fixed.)
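
As a minimal sketch of the range check described above, here is the same idea in Python rather than the pseudocode used in this post. The 15 and 25 degree thresholds come from the text; the function name `should_command_trim` is hypothetical and is not taken from Boeing's software.

```python
# Thresholds taken from the post; everything else is a hypothetical illustration.
STALL_ONSET_DEG = 15.0    # at or below this, no nose-down trim is wanted
PLAUSIBLE_MAX_DEG = 25.0  # at or above this, assume sensor error or a full stall/spin


def should_command_trim(aoa_deg: float) -> bool:
    """Return True only when the AOA reading is high but still plausible."""
    return STALL_ONSET_DEG < aoa_deg < PLAUSIBLE_MAX_DEG


if __name__ == "__main__":
    for reading in (5.0, 18.0, 70.0):
        print(reading, should_command_trim(reading))
```

With these thresholds, a 70-degree spike like the one in the plot is rejected, while a reading of 18 degrees still triggers trim. A genuine stall can also exceed 25 degrees, so this only illustrates the check and is not a complete design.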

We fret about average humans being replaced by robots, but consider the Phoenix resident who sees that the outdoor thermometer is reading 452 degrees F on a June afternoon. Will the human say “Arizona does get hot in the summer so I’m not going to take my book outside for fear that it will burst into flames”? Or “I think I need to buy a new outdoor thermometer”?

Related:

46 thoughts on “Boeing 737 MAX crash and the rejection of ridiculous data”

  1. Do you think it could be a machine learning issue? Purists did not want to combine it with heuristics (i.e., a hardcoded rule IF AOA > 15 AND AOA < 25 ….) Often AI takes great effort to arrive at the obvious, if ever.

  2. You are so right! Someone finally made a good article. It is so simple you can not believe it.
    But, in the first place, MCAS should not be needed on any commercial airliner. It is one more thing to go wrong, and when it does the outcome is known now, just because they wanted to save the money and time of redesigning a whole new plane so that the big engines fit properly under the wings.

  3. I had read articles about how there’s an optional second AOA sensor and had been under the impression that these jets lacked it, but the data plot from the article seems to suggest there were two. You’d think that it would be possible, but trickier, to account for ridiculous data even with only one sensor, but the fact that there was still a good sensor and the program somehow thought the wild one was valid seems pretty incredible. (I mean I suppose it’s possible for a jet to fly into a 10g updraft while in a climb, but you’d think it should be easy to use a flight model to guess which sensor was working when they disagree)

    • The Max always had 2 sensors but only 1 was used for each flight. Unless you ordered the optional ($80,000 for a $2 LED bulb) “Disagree Light” there was no mechanism for comparing readings between the two sensors.

  4. After a career managing software development projects, it’s obvious to me that whoever was responsible for this system design fiasco at Boeing was utterly incompetent. Anyone who designed a system which forced a plane to crash based on a single faulty sensor shouldn’t be anywhere near a significant safety related project. As I understand it, Airbus uses a voting system based on 3 sensors to protect against a faulty sensor reading. Boeing should have done that as well, and as the author of this blog pointed out, it’s easy to rule out absurd readings, which also wasn’t done.

  5. You need triple redundancy for anything critical, so if one fails, the other two can override it with a quorum. On the 787, the onboard computer not only has 3 CPUs, but they are also made by 3 different manufacturers: Intel, AMD and Motorola.

    The 737 MAX’s AOA sensor is a single point of failure by default, but you could pay for a second sensor as an option, in which case it would average both sensors. If one sensor goes completely haywire, that still yields bad results. This is why fixing the planes isn’t going to be as simple as a software upgrade. Triple AOA sensors will need to be introduced and the plane recertified, and given the glare of publicity around the issue, it won’t be allowed a perfunctory self-certification this time. Foreign airlines and aviation regulators will also not trust the FAA and will insist on their own expensive and time-consuming testing.

    Airbus has triple sensors by default on the A320neo that is eating the 737’s lunch. This penny-wise, pound-foolish corner-cutting will prove extremely expensive for Boeing, much as with VW’s emissions-cheating scandal.
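
One common way to implement a three-sensor quorum like the one described above is mid-value selection: take the median, so a single wild reading cannot drag the result. The Python sketch below is only an illustration. The function names and sample readings are made up and do not describe any real avionics, and the two-sensor average is included only for comparison with the scheme mentioned in this comment.

```python
from statistics import median


def voted_aoa(a: float, b: float, c: float) -> float:
    """Mid-value select: one wild sensor cannot pull the output away from the two that agree."""
    return median([a, b, c])


def averaged_aoa(a: float, b: float) -> float:
    """Two-sensor average, for comparison: one wild reading still corrupts the result."""
    return (a + b) / 2


if __name__ == "__main__":
    print(voted_aoa(4.8, 5.1, 70.0))   # 5.1
    print(averaged_aoa(5.0, 70.0))     # 37.5
```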

    • There are two AOA sensors on the 737, and both feed in to the trim computer where MCAS is implemented. But the MCAS code only uses one, chosen at startup. This part of the problem was obviously bad engineering and is simple to fix.

      But the rest of the causal chain includes a high rate of sensor failure, maintenance (at least Lion Air), CRM (both sets of pilots got control to some extent but then let trim get away again), and training materials/the AD issued last year. The details are getting lost in the rush to pin the blame fully on Boeing and the FAA process.

      If Muilenburg had called all hands on deck after the Lion Air crash to reexamine failure modes and changes vs the NG, probably things would have gone differently; I bet he loses his job before this is over. Hopefully that wakes people up because it’s hard to believe the media mob, Congress, and the lawyers will have a positive impact.

    • Two sensors are not sufficient, how would the computer know which one is wrong?
      Airbus has 3, and even that was not enough in the early days.

    • The Boeing proposed solution to the AOA disagree on the MAX is not fail-safe, which is impossible with two sensors, but fail-neutral instead.

      IF AOA disagree > 5 degrees, do nothing, else implement MCAS, once only!

      This caters for both normal activation, and fault inhibition. Typical aerodynamic fluctuations of +/- 2.5 degrees on each sensor should be fine, but would still eliminate large errors in either sensor.

      Note that if one of the sensors is stuck in a nose-low position, this will be a ‘hidden’ fault, though much less harmful overall.
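
The one-line rule in this comment can be written out as a small Python sketch. The 5-degree disagreement limit is from the comment. The 15-degree trigger is borrowed from the post, and the choice to require both vanes to exceed it is an assumption of this sketch, as is the function name.

```python
DISAGREE_LIMIT_DEG = 5.0  # from the comment above
TRIGGER_AOA_DEG = 15.0    # hypothetical activation threshold, borrowed from the post


def mcas_may_activate(left_deg: float, right_deg: float, already_activated: bool) -> bool:
    """Fail-neutral gate: when the vanes disagree, do nothing at all."""
    if abs(left_deg - right_deg) > DISAGREE_LIMIT_DEG:
        return False  # cannot tell which vane is wrong, so stay out of the way
    if already_activated:
        return False  # "once only"
    # Assumption: require both vanes to read above the trigger.
    return min(left_deg, right_deg) > TRIGGER_AOA_DEG
```

A vane stuck low by less than the limit passes this gate unnoticed, which is the “hidden” fault the comment mentions.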

    • Detailed list of MCAS fixes from AvWeek (firewalled). Lengthy, but worth quoting large parts: https://aviationweek.com/awincommercial/boeing-expand-mcas-fix-demos-worldwide

      In the initial briefing sessions for pilots on March 27, “we didn’t talk specifically about either of the accidents, but we ran through MCAS scenarios,” says Sinnett. “From the accidents, we now know how MCAS can behave when there is an erroneously high AOA input, so we walked through scenarios where that could occur. We demonstrated those in the simulator.”

      In these sessions, pilots and regulators were able to interact via intercom and a big screen with Boeing pilots in the 737 MAX engineering cab. Following the sessions, “we went back to the classroom and said, ‘Here are the things that concern us most when we look at the scenario of the two accidents we just experienced,’” Sinnett says. “Upon reflection on what has occurred, it appeared the system could present a high-workload environment—and that’s not our intention. So we looked at changing the design to compare values from multiple AOA indicators to essentially eliminate the unintended trigger condition that causes MCAS to activate.”

      Sinnett says pilots appear satisfied that the three main layers of protection now added to the MCAS will prevent any potential repeat of the circumstances involved in the Lion Air and Ethiopian Airlines accidents. “We answered a lot of questions during the discussion, and then we went back into the simulator and demonstrated a number of different scenarios to run against these changes,” Sinnett says. “And the most compelling thing is that the AOA failure case turns into a run-of-the-mill AOA failure case like you might have on any other airplane. We didn’t get any negative feedback. It was all very positive, and any of the pilots who got into the simulator and saw the before and after, it was like, ‘Yes, OK, this is now a nonissue.’”

      1. The first main layer of protection provided by the update is a cross-channel bus between the aircraft’s two FCCs, which now allows data from the two AOA sensors, or alpha vanes, to be shared and compared. AOA data continues to be fed from left and right vanes into their respective air data inertial reference units before being passed to the flight computers. However, the AOA data in both computers is now continuously compared. The change is made by software only and requires no hardware modification.

      “In a situation where there is erroneous AOA information, it will not lead to activation of MCAS,” says Sinnett. He underlines that the entire speed-trim system, including the MCAS, will be inhibited for the remainder of the flight if data from the two vanes varies by more than 5.5 deg. If an AOA disagreement of more than 10 deg. occurs between the sensors for more than 10 sec., it will be flagged to the crew on the primary flight display.

      2. The second layer of protection is a change to the logic in the MCAS algorithm that provides “a fundamental robust check to ensure that before it ever activates a second time, pilots really want it to activate,” says Sinnett.

      The change would have protected the system from continuing to activate in the case of the Lion Air accident, in which the left AOA vane was stuck in the 20-deg.-nose-high position. In that circumstance, the logic rechecked if the MCAS was required and, registering the apparent nose-high position from the errant vane, commanded more nose-down trim. “Now it sees you’re in the same spot, it says you’ve got a stuck vane and says, ‘I’m not going to activate again,’” Sinnett notes.

      3. Explaining the background to the third layer of protection, Sinnett says: “We also made sure if the second layer of protection failed somehow in some weird way and allowed MCAS to activate multiple times, the system now ensures the sum total is command-limited.” The result is that the pilots always have maneuvering authority remaining with the control column. “Pilots will always have the ability to override—although they had that before in other ways, like with the trim switch, for example. But with the software update, the column itself will always provide at least 1.2g of maneuvering capability. So you don’t just have the ability to hold the nose level, you can still pitch up and climb.”
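
The three layers quoted above can be modeled as a toy state machine in Python. This is a sketch under stated assumptions, not Boeing's implementation. The 5.5-degree inhibit and the 10-degree/10-second crew flag come from the quote. The class name, the 0.5-degree “same spot” tolerance, and the cumulative trim limit and its units are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class McasGuard:
    """Toy model of the three protection layers; all names are hypothetical."""
    inhibit_deg: float = 5.5     # from the quote: disagreement that disables MCAS
    flag_deg: float = 10.0       # from the quote: disagreement flagged to the crew...
    flag_after_s: float = 10.0   # ...once it has lasted longer than this
    same_spot_deg: float = 0.5   # hypothetical tolerance for "same spot"
    max_total_trim: float = 2.5  # hypothetical cumulative limit, arbitrary units
    inhibited: bool = False      # latched for the remainder of the flight
    crew_flag: bool = False
    disagree_time_s: float = 0.0
    total_trim: float = 0.0
    last_trigger_aoa: Optional[float] = None

    def update(self, left_deg: float, right_deg: float, dt_s: float) -> None:
        """Layer 1: compare the two vanes, latch the inhibit, time the crew flag."""
        diff = abs(left_deg - right_deg)
        if diff > self.inhibit_deg:
            self.inhibited = True
        self.disagree_time_s = self.disagree_time_s + dt_s if diff > self.flag_deg else 0.0
        if self.disagree_time_s > self.flag_after_s:
            self.crew_flag = True

    def request_trim(self, aoa_deg: float, amount: float) -> float:
        """Layers 2 and 3: refuse a repeat at an unchanged AOA, and cap the total."""
        if self.inhibited:
            return 0.0
        if (self.last_trigger_aoa is not None
                and abs(aoa_deg - self.last_trigger_aoa) < self.same_spot_deg):
            return 0.0  # same spot as last time: assume a stuck vane
        granted = max(min(amount, self.max_total_trim - self.total_trim), 0.0)
        self.total_trim += granted
        self.last_trigger_aoa = aoa_deg
        return granted
```

Because the inhibit is latched, a vane that briefly returns to a plausible value does not re-enable trim, which matches the “remainder of the flight” wording in the quote.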

  6. How can a sensor go wrong in the first place? Not break and stop functioning, but feed wrong data? Does anyone consider that a sensor can be targeted by some kind of directed malicious influence? Or by a hacking attack? Are there protections against hacks in Boeing systems?

    • An AOA sensor is just a mechanical vane pivoting in the (relative) wind. So it can fail the same way that anything mechanical subject to vibration and temperature extremes can fail. Getting covered in ice is probably the most likely scenario, though the systems try to defend against this with electric heat.

      https://utcaerospacesystems.com/wp-content/uploads/2018/04/Angle-of-Attack-AOA-Systems.pdf has a good illustration (and mentions “reliability,” which to the downstream software engineer, should suggest that “unreliability” is also possible).

    • That would suggest augmenting the AOA sensor with an accelerometer or mercury angle sensor not subject to outside environmental conditions, as a sanity check

    • You also can’t rule out mechanical damage – the vane sticks out and can be hit by stuff either on the ground or in the air.

      Fazal – mercury sensor would do no good. The frame of reference for angle of attack is the “relative wind” – the stream of air flowing past the wing. This has nothing to do with gravity. You could be pointed straight up and have a zero degree angle of attack so long as the wing was in line with the relative wind.

  7. Obama says if you like your 737 max you can keep your 737 max. On a different topic I heard they had take off power set for the whole flight. That seems odd to me.

    • If the goal is to climb away from the ground, why not full power? Though I guess you could also argue for “If things aren’t going well, try to return the airplane to level and power to about 50 percent.”

  8. In addition to the IF I (until recently) would have assumed they write unit tests which simulate one wild sensor, two wild sensors, one no reading, two no readings, intermittent readings, and so on.

  9. Philip – I thought the same thing when I saw that graph. A 70 degrees up AOA is just bad data and any software that responds to it is just plain dumb (and deadly).

    David

    • It felt like the 737 MAX I flew in recently, but before the Ethiopian air disaster, was at about 45 degrees AOA. Now I know why the pilot behaved funny on exit. I thought he was drunk, but I smelled no alcohol. If I had known, I’d have thanked him.

  10. The 737 Max problems have gotten lots of press, but the recent Atlas Air 767 crash has equally scary implications. Over at http://www.pprune.org there are rumours of a history of pilot errors that were possibly overlooked for sake of “diversity” hiring. Hmmm…

  11. This astonishing debacle prompts an outline for a work of fiction in which Companies X and Y compete for sales, and one (or both) has a “dirty tricks” department which conceives of placing engineer “moles” in the other, to await opportunities to sabotage its credibility and drive it into bankruptcy. You might think that this could not work in commercial aviation: not only would the initial engineering have to incorporate foolish and dangerous design, but the flight test programme would have to omit all sensitive tests of the subject design. Yet it seems to have happened here, even without involvement of external dirty tricks! What hubris!

  12. From one of the aviation forums: Someone pointed out that the example of the AF447 crash shows it is not always wise to rule out any data (not that this excuses MCAS). During that event they were in a deep stall, with AOA approaching 50(!) degrees. At one stage the indicated airspeed dropped below 60kts (for a fully loaded A330), and the stall warning (temporarily) stopped working. In its wisdom, the software had decided that 60kts was an ‘impossible’ airspeed for a passenger aircraft, so that data should be disregarded, even though it was very real, and very deadly!

  13. That’s exactly what you should NOT DO.

    You don’t fix a dirty patch with another dirty patch on top of it. Every new change opens new failure modes.

  14. The idea that programmers are supposed to be checking Wikipedia to come up with constraints like this shows a fundamental lack of understanding of how software gets written and what kind of agency developers have in that process. This is a situation where any deviation between the right and left AOA sensors would’ve raised red flags when done by humans. Why this wasn’t baked into the requirements from the outset is the issue.

    • Stating that no disagreement is permitted may be another mistake. There is a small natural fluctuation in each AOA due to minor differences in airflow. There can also be significant differences between left/right AOA, due to yaw (asymmetrical flight conditions). Neither of these is sufficient grounds to develop a hard rule disabling MCAS. The Boeing proposed rule (extensively quoted above) is presumably based on extensive flight test data, not assumptions.

  15. Hi,

    ATP holder and [infrastructure] developer here..

    Do not repeat the mistakes of past developers. This, too, is hubris. Sure, this sounds good on paper to the layperson, but your statement “ignore it if it’s 70 degrees in a second” does not match your code. Nor is it difficult to get one of these things past 25 degrees AOA, *especially* with another, unrelated malfunction modifying your very clear assumption that the rest of the airplane is OK.

    Systems interface, design, and multiple interaction is not easy to conceive of, design, implement, or fly. Please do not fall afoul of “this would never happen to MY code”.

  16. Problem is, once you cut off MCAS, you have a 737 w/ flight characteristics 737 pilots are not trained for. The issue was always Boeing management’s goal of not requiring additional pilot training for the MAX type.

  17. There was a Lufthansa Airbus jet that nearly crashed in 2014 because of its angle of attack sensors. It was diving at 4,000 feet per minute, and they only resolved the issue when the ground control said to turn off the sensors. It turned out later that the AOA sensors had frozen during the ascent. The only thing that saved the plane was the altitude it was at when the issue happened.

  18. While your solution sounds simple, and would probably have helped in this specific scenario, things are not so simple.
    Having a system go into “alarm mode” when things are wrong, and then back to “normal mode” when things are VERY wrong is not a good idea.
    A very similar approach to what you propose was used by Airbus on their A330, and contributed to the crash of Air France 447.
    The pilot (incorrectly) commanded a nose up, and got an incorrect angle of attack alert. Eventually, the angle was so extreme that the system shut down the alarm, dismissing the data as incorrect. The pilot (correctly) momentarily decreased the angle of attack, but then the system kicked back in, and the alarm sounded again. This created a negative feedback loop: the alarm only stayed silent when the nose was way too high, so the pilot kept it up, maybe in part because doing otherwise triggered the alarm.
    These changes have to be very carefully considered, and dismissing the data based solely on its being an extreme value might not always be a good idea. What if it IS an extreme value?

    • I read the IEEE article, and found it interesting at first glance. Afterwards it became clear that the story has nothing in common with how real world aviation projects (should) work. MCAS was not conceived and implemented by a junior coder sitting in the basement. It was designed and specified by engineers and aerodynamicists with years of experience, and reviewed (however incompetently) by safety experts within Boeing and the FAA. The multiple failures of this process will be reviewed, and there will be consequences.

      It was never up to the software developer to dream up their own parameter validation, either by consulting the internet, or rules of thumb. The proposed changes released by Boeing, indicate that the detailed specifications are much more complex than Phil’s simple algorithm.

  19. Could someone comment on how often the MCAS system was activated by the software during routine flights and successfully performed its intended function? Ie., if I understand correctly that the purpose of MCAS was to compensate for some weight changes in the Max and make the plane handle similarly to earlier versions, was it in fact necessary and routinely (invisibly) relied on to make the plane fly smoothly? or, might the system itself be mostly unnecessary?
    thanks, richard

    • AFAIK there have not been any activations of MCAS during routine flights. However, there is no proof of this, since it operates in the background. By design MCAS should only activate under high AOA (close to the stall), which should not normally be encountered. The main aim seems to have been for evasive maneuvers such as CAT and TCAS. Almost everyone seems to think it is a certification band-aid, not a critical flight system. Ironic…
