UK ATC failure 28/8

Boac
Chief Pilot
Posts: 17276
Joined: Fri Aug 28, 2015 5:12 pm
Location: Here

#1 Post by Boac » Wed Sep 06, 2023 9:43 am

It appears that a 'confused' flight plan with apparently 'duplicated' waypoints threw both the main and back-up systems off-line causing the major upheaval experienced by many.

Who on earth puts this system in place, where an input error crashes it, rather than executing an error routine to isolate and flag the input error so it can be corrected? To me that is pretty basic programme construction that all systems engineers should be taught as they come out of nappies. ~X(
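
For illustration, the "isolate and flag" approach might look something like the Python sketch below. Everything in it is an assumption made up for the example: the `FlightPlan` and `Processor` classes, the `check_plan` function, and the rule that a repeated waypoint identifier counts as an input error. It is not how the real ATC software is written, only the general shape of an error routine that sets one bad input aside and keeps running.

```python
from dataclasses import dataclass, field


@dataclass
class FlightPlan:
    # Hypothetical, heavily simplified flight plan record.
    callsign: str
    waypoints: list[str]


class FlightPlanError(ValueError):
    """Raised when a single plan cannot be processed safely."""


def check_plan(plan: FlightPlan) -> None:
    # Assumed rule for the example: a waypoint identifier may appear only once.
    seen = set()
    for wp in plan.waypoints:
        if wp in seen:
            raise FlightPlanError(
                f"{plan.callsign}: waypoint {wp} appears more than once"
            )
        seen.add(wp)


@dataclass
class Processor:
    accepted: list[FlightPlan] = field(default_factory=list)
    quarantined: list[tuple[FlightPlan, str]] = field(default_factory=list)

    def submit(self, plan: FlightPlan) -> bool:
        try:
            check_plan(plan)
        except FlightPlanError as err:
            # Isolate and flag the bad plan for manual correction,
            # then carry on with the next one.
            self.quarantined.append((plan, str(err)))
            return False
        self.accepted.append(plan)
        return True
```

The point of the design is that the bad plan never reaches the `accepted` list, so nothing doubtful is passed downstream, but the processor itself stays up for every other plan.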

OneHungLow
Chief Pilot
Posts: 2140
Joined: Thu Mar 30, 2023 8:28 pm
Location: Johannesburg

Re: UK ATC failure 28/8

#2 Post by OneHungLow » Wed Sep 06, 2023 9:59 am

Boac wrote:
Wed Sep 06, 2023 9:43 am
It appears that a 'confused' flight plan with apparently 'duplicated' waypoints threw both the main and back-up systems off-line causing the major upheaval experienced by many.

Who on earth puts this system in place, where an input error crashes it, rather than executing an error routine to isolate and flag the input error so it can be corrected? To me that is pretty basic programme construction that all systems engineers should be taught as they come out of nappies. ~X(

Your logical thinking mirrors that of any sane software engineer as well.

https://www.nats.aero/news/nats-report- ... plemented/

Full report here - https://publicapps.caa.co.uk/modalappli ... l&id=12321

Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention. In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception. A critical exception is, broadly speaking, an exception of last resort after exploring all other handling options. Critical exceptions can be raised as a result of software logic or hardware faults, but essentially mark the point at which the affected system cannot continue.

Clearly a better way to handle this specific logic error would be for FPRSA-R to identify and remove the message and avoid a critical exception. However, since flight data is safety critical information that is passed to ATCOs the system must be sure it is correct and could not do so in this case. It therefore stopped operating, avoiding any opportunity for incorrect data being passed to a controller. The change to the software will now remove the need for a critical exception to be raised in these specific circumstances.

Having raised a critical exception the FPRSA-R primary system wrote a log file into the system log. It then correctly placed itself into maintenance mode and the C&M system identified that the primary system was no longer available. In the event of a failure of a primary system the backup system is designed to take over processing seamlessly. In this instance the backup system took over processing flight plan messages. As is common in complex real-time systems the backup system software is located on separate hardware with separate power and data feeds. Therefore, on taking over the duties of the primary server, the backup system applied the same logic to the flight plan with the same result. It subsequently raised its own critical exception, writing a log file into the system log and placed itself into maintenance mode.
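
The failover sequence in the excerpt above can be illustrated with a toy Python model. It is a sketch under loud assumptions, not a reconstruction of FPRSA-R: the `Server` class, the `process` function, the `run` loop, and the rule that any message containing "BAD" triggers a critical exception are all invented for the example. It shows only the general point that a backup running identical logic fails on the same message, and what changes if the failed message is held back instead of replayed.

```python
class CriticalException(Exception):
    """Stand-in for the 'exception of last resort' in the report."""


def process(message: str) -> str:
    # Assumed trigger for the example: "BAD" marks an unprocessable message.
    if "BAD" in message:
        raise CriticalException(message)
    return message.lower()


class Server:
    def __init__(self, name: str):
        self.name = name
        self.in_maintenance = False

    def handle(self, message: str) -> str:
        try:
            return process(message)
        except CriticalException:
            # Mimic the described behaviour: stop and wait for an engineer.
            self.in_maintenance = True
            raise


def run(messages, servers, replay_failed_message=True):
    """Return (outputs, held): processed results and messages left for manual handling."""
    outputs, held = [], []
    for msg in messages:
        live = [s for s in servers if not s.in_maintenance]
        if not live:
            held.append(msg)  # nobody left to process it
            continue
        for server in live:
            try:
                outputs.append(server.handle(msg))
                break
            except CriticalException:
                if not replay_failed_message:
                    held.append(msg)  # set the message aside, spare the backup
                    break
        else:
            held.append(msg)  # every live server tried it and failed
    return outputs, held
```

With `replay_failed_message=True` and the messages `["A", "BAD", "C"]`, both servers end up in maintenance mode and "C" is never processed. With `replay_failed_message=False`, the primary still goes down, but the backup survives to process "C" while "BAD" waits for a human. In neither case does the doubtful message reach the output.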
What these people have here is not a real-time system then... it is not fit for purpose. Imagine if the Apollo 11 LM computer had put itself into maintenance mode when it was overloaded by spurious rendezvous radar data during the descent (the 1201 and 1202 program alarms). A big black smoking hole! In reality it picked up its skirts, restarted and carried on with its high-priority tasks, as any real-time system should, and victory was snatched from the jaws of defeat.
The observer of fools in military south and north...

Boac
Chief Pilot
Posts: 17276
Joined: Fri Aug 28, 2015 5:12 pm
Location: Here

Re: UK ATC failure 28/8

#3 Post by Boac » Wed Sep 06, 2023 10:21 am

I hate to think how much two defective systems cost to buy! It boils down to poorly written code that does not handle errors properly. BASIC 101? 'ON ERROR GOTO'? Not 'roll on your back and stick your legs in the air'. ~X(
