Here is a news story about how a small network card brought down a huge airport for eight hours. Do you remember the nursery rhyme "For Want of a Nail"? This is the full rhyme:
For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.
Given this scenario, one might wonder why the airport did not do anything. Well, it is a truism that one cannot protect everybody from every possible risk. That would be either impossible or too expensive.
But the common reaction is that we should be right 100% of the time. That's where the terrorists win out. They just have to be lucky once; we have to be lucky every time.
Professionals in diverse fields will tell you that they try to work to a risk-based approach. For example, the Soviet missile chaps (I had a long chat with one of their strategists in Moscow over a couple of bottles of pepper vodka; I will dig out the notes sometime and report on it) worked on the assumption that the interceptor missiles from the USA would get "some" warheads, some would disappear into space, some wouldn't fire and some would miss their targets. So their idea was to go for a huge number of warheads, on the assumption that a few would get through!
Take a look at the high-quality components that the electronics industry supplies to the supercomputer world, the avionics industry, etc. Each of those components comes with a manufacturer-supplied "MTBF" figure, or mean time between failures. In other words, the company states how long, on average, the component is expected to run before it fails, generally in hours of operation. It is a statistical average rather than a guaranteed lifetime, so you plan replacements around that figure and swap the component out before a failure becomes likely.
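To put a rough number on what an MTBF figure implies, here is a minimal sketch in Python. It assumes a constant failure rate (the usual exponential model, which ignores early-life and wear-out failures), and the function names and the example figures at the bottom are my own illustrations, not anything taken from a real data sheet.

```python
import math


def failure_probability(hours_in_service: float, mtbf_hours: float) -> float:
    """Chance that one unit fails within hours_in_service.

    Assumes a constant failure rate of 1 / mtbf_hours (exponential model).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if hours_in_service < 0:
        raise ValueError("hours_in_service must not be negative")
    return 1.0 - math.exp(-hours_in_service / mtbf_hours)


def any_failure_probability(units: int, hours_in_service: float,
                            mtbf_hours: float) -> float:
    """Chance that at least one of `units` independent units fails."""
    if units < 0:
        raise ValueError("units must not be negative")
    p_single = failure_probability(hours_in_service, mtbf_hours)
    return 1.0 - (1.0 - p_single) ** units


if __name__ == "__main__":
    # Hypothetical figures: cards rated at 200,000 hours MTBF,
    # run around the clock for four years.
    hours = 4 * 365 * 24
    print(f"one card:  {failure_probability(hours, 200_000):.1%}")
    print(f"100 cards: {any_failure_probability(100, hours, 200_000):.1%}")
```

The point of the second function is the one that matters for an airport: a single card is unlikely to fail, but with enough cards in service the chance that at least one of them fails stops being small.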
In the financial world, risk is frequently managed from that perspective. You put aside sufficient capital so that you are protected against events that are expected to happen, say, once every 10,000 years. See here for an example. Goldman Sachs also made a bad bet on the probability of events, see here for a background and here for the current state.
So while their automated models were based upon events happening, say, once in 1,000 years, some events which were not expected to happen in 30,000 years happened, and they got caught out. In other words, even geniuses get caught out. In fact, even Nobel Prize winners get caught out, as in the case of Long-Term Capital Management.
In this case, we are simply talking about a lowly network card, which would have been manufactured somewhere in China by a highly automated process, and without much redundancy either. Think about it: even the Space Shuttle Challenger was brought down by the failure of a small seal. See here for an example of MTBF.
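A back-of-the-envelope way to read a "once in so many years" figure is to treat it as an annual probability and ask how likely the event is over the period you actually care about. The sketch below does just that. It assumes each year is independent with the same odds, which is exactly the kind of assumption that tends to break down in a crisis, and the function name is my own.

```python
def chance_within_horizon(return_period_years: float,
                          horizon_years: int) -> float:
    """Chance of seeing at least one "once in N years" event in a horizon.

    Assumes independent years, each with probability 1 / N of the event.
    """
    if return_period_years < 1:
        raise ValueError("return_period_years must be at least 1")
    if horizon_years < 0:
        raise ValueError("horizon_years must not be negative")
    annual_probability = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_probability) ** horizon_years


if __name__ == "__main__":
    # The return periods mentioned in the text, over a 30-year career.
    for period in (1_000, 10_000, 30_000):
        chance = chance_within_horizon(period, 30)
        print(f"once in {period:>6,} years -> {chance:.3%} over 30 years")
```

If the models are right, these chances are tiny. When such an event turns up anyway, the more likely explanation is that the assumed odds were wrong, not that someone was spectacularly unlucky.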
Sometimes, even if you have all the nails, you will still lose the kingdom. To use Donald Rumsfeld's rather brutal words, "stuff happens".
So one needs to take these issues with a grain of piquant salt!!!
Contingency Planning, for Technology and Terrorism
By Stephen Barr
Thursday, August 16, 2007; D04
Small things often trip up large organizations. That's what happened at the Los Angeles International Airport last weekend.
A common piece of computer hardware -- a network interface card -- at a U.S. customs work station malfunctioned, taking down the agency's network at the airport. The system failure, which lasted nearly eight hours, delayed the arrivals of at least 17,000 international passengers and left many travelers stranded for hours in airplanes.
Network cards allow computers to communicate with one another, and most home computer users know them as Ethernet cards. Most of the time, when network interface cards feel like having a nervous breakdown, they go all the way and fail completely. This usually means that one computer goes down, but other work stations continue to function.
Customs and Border Protection, part of the Department of Homeland Security, wasn't so lucky. The network card at LAX took down the local area network when the card repeatedly began seeking attention and assistance in performing its functions, setting off a "data storm," overloading the network's efforts to manage itself.
"It was a fluke," said Ken Ritchhart, acting assistant commissioner for the CBP's office of information and technology.
A costly fluke.
Officials in the aviation industry have criticized the CBP for taking too long to fix the computer problem. Los Angeles airport and city officials have expressed frustration, and airplane passengers have complained that the agency needed a better, faster back-up plan for when computers go out.
But balancing "customer service" against national security is not easy for Customs and Border Protection. About 46,000 people move through customs lanes every hour, on average. Their names and passports are matched by customs computers against terrorism watch lists and FBI databases.
When the computers are humming, it typically takes five seconds to determine a passenger's status and a minute or so to clear him for entry. When work stations go down, customs officers get out laptops to connect into databases.
Ritchhart said work-station computers at LAX were about four years old and were scheduled for replacement next year. The cables that link work stations are about 20 years old.
Before the Los Angeles outage, the CBP had plans to upgrade work stations, cables and electronic components at its major sites, including New York and Miami. About $15 million has been set aside for replacing cables, switches and satellite links, and $10 million more will be spent to upgrade work stations, Ritchhart said.
For the short term, the CBP scrambled a tiger team of techies to review the agency's procedures for handling major computer outages. Officials also will review the contract with their telecommunications vendor to see whether the required response time of four hours should be cut in half.
Customs and Border Protection, of course, is not the only agency grappling with customer-service issues.
The Social Security Administration and the Veterans Affairs Department struggle to keep pace with benefit claims. In June, the State Department was overwhelmed with passport applications because of a new rule requiring passports for U.S. citizens flying within the Western Hemisphere.
Agencies responsible for national security, in particular, are motivated "not to let something bad happen," which does not always mesh with other goals, such as customer service, said Frank J. Cilluffo of the Homeland Security Policy Institute at George Washington University.
Government has a responsibility to provide security, and "we don't want to undermine that important mission," he said.
Still, many federal agencies need to look harder at their performance and plan more rigorously for such issues as surges in passport applications and chaos caused by technology breakdowns, said John Stewart, an operating partner at Monomoy Capital Partners, a private-equity fund that helps troubled companies turn around.
"The government is operating more in a responsive mode," he said. "That's not a good situation to be in when trying to meet the needs of customers."
Agencies also become vulnerable when they lapse into tunnel vision or rely too much on technology, said Donald F. Kettl, a University of Pennsylvania political science professor. "We are going to have things that outflank us," he said. "They might be hurricanes, or terrorists, or a bad chip inside a computer."
Stephen Barr's e-mail address is barrs@washpost.com.