excerpted from - Science News, feb 26, 1991, vol 139 no. 7

                               Finding Fault
             the formidable task of eradicating software bugs

typed by: Horatio (contact at DRU, 8067944362)

this is about half of the article. the rest went on to describe the
uninteresting problems of attempting to design software safeguards for
nuclear power plants

-----------------------------------------------------------------------------

[several paragraphs of intro material]

the software glitch that disrupted at&t's long-distance telephone service
for nine hours in january 1990 dramatically demonstrates what can go wrong
even in the most reliable and scrupulously tested systems. of the roughly
100 million telephone calls placed with at&t during that period, only about
half got through. the breakdown cost the company more than $60 million in
lost revenues and caused considerable inconvenience and irritation for
telephone-dependent customers.

the trouble began at a "switch" - one of 114 interconnected,
computer-operated electronic switching systems scattered across the united
states. these sophisticated systems, each a maze of electronic equipment
housed in a large room, form the backbone of at&t's long-distance telephone
network. when a local exchange delivers a telephone call to the network, it
arrives at one of these switching centers, which can handle up to 700,000
calls an hour.

the switch immediately springs into action. it scans a list of 14 different
routes it can use to complete the call, and at the same time hands off the
telephone number to a parallel signaling network, invisible to any caller.
this private data network allows computers to scout the possible routes and
to determine whether the switch at the other end can deliver the call to
the local company it serves. if the answer is no, the call is stopped at
the original switch to keep it from tying up a line, and the caller gets a
busy signal. if the answer is yes, a signaling-network computer makes a
reservation at the destination switch and orders the original switch to
pass along the waiting call - after that switch makes a final check to
ensure that the chosen line is functioning properly. the whole process of
passing a call down the network takes 4 to 6 seconds. because the switches
must keep in constant touch with the signaling network and its computers,
each switch has a computer program that handles all the necessary
communications between the switch and the signaling network.

at&t's first indication that something might be amiss appeared on a giant
video display at the company's network control center in bedminster, nj. at
2:25 pm on monday, jan 15, 1990, network managers saw an alarming increase
in the number of red warning signals appearing on many of the 75 video
screens showing the status of various parts of at&t's worldwide network.
the warnings signaled a serious collapse in the network's ability to
complete calls within the united states.

to bring the network back up to speed, at&t's engineers first tried a
number of standard procedures that had worked in the past. this time, the
methods failed. the engineers realized they had a problem never seen
before. nonetheless, within a few hours, they managed to stabilize the
network by temporarily cutting back on the number of messages moving
through the signaling network. they cleared the last defective link at
11:30 that night. meanwhile, a team of more than 100 telephone technicians
tried frantically to track down the fault.
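
[the call-setup handshake described above boils down to roughly the short c
sketch that follows. every name and check in it is invented for
illustration - it is not at&t's code, just the shape of the ask / reserve /
final-check sequence the article lays out. in the real network these steps
are messages exchanged over the separate signaling network, not local
function calls, which is why each switch needs the communications program
the article mentions.]

  /* rough sketch only - invented names, not at&t code. it mirrors the
     call-setup sequence described above: hand the number to the signaling
     network, ask whether the switch at the far end can deliver the call,
     reserve it there, then pick a candidate route, make a final check of
     the chosen line, and pass the call along - or stop it here and give
     the caller a busy signal. */

  #include <stdio.h>
  #include <stdbool.h>

  #define NUM_ROUTES 14   /* the switch scans a list of 14 routes */

  /* stand-ins for exchanges that really travel over the signaling network */
  static bool destination_can_deliver(const char *number)
  {
      (void)number; return true;
  }
  static bool reserve_at_destination(const char *number)
  {
      (void)number; return true;
  }
  static bool line_is_healthy(int route)
  {
      return route >= 3;    /* pretend routes 0, 1 and 2 are down */
  }

  /* returns true if the call was passed down the network; false means the
     caller gets a busy signal */
  static bool place_call(const char *number)
  {
      /* signaling-network computers find out whether the far switch can
         hand the call to the local company it serves, and reserve it
         there; if not, stop the call here so it does not tie up a line */
      if (!destination_can_deliver(number) ||
          !reserve_at_destination(number)) {
          printf("call to %s gets a busy signal\n", number);
          return false;
      }

      /* the originating switch scans its candidate routes and makes a
         final check that the chosen line is working before passing the
         call along */
      for (int route = 0; route < NUM_ROUTES; route++) {
          if (line_is_healthy(route)) {
              printf("call to %s passed along route %d\n", number, route);
              return true;
          }
      }

      printf("call to %s gets a busy signal (no working route)\n", number);
      return false;
  }

  int main(void)
  {
      place_call("212 555 0100");
      return 0;
  }
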
by monitoring patterns in the constant stream of messages reaching the
control center from the switches and the signaling network, they searched
for clues to the cause of the network's surprising behavior. because the
problem involved the signaling network and seemed to bounce from one switch
to another, they zeroed in on the software that permitted each switch to
communicate with the signaling-network computers.

the day after the slowdown, at&t personnel removed the apparently faulty
software from each switch, temporarily replacing it with an earlier version
of the communications program. a close examination of the flawed software
turned up a single error in one line of the program. just one month
earlier, network technicians had changed the software to speed the
processing of certain messages, and the change had inadvertently introduced
a flaw into the system.

from that finding, at&t could reconstruct what had happened. the incident
started, the company discovered, when a switching center in new york city,
in the course of checking itself, found it was nearing its limits and
needed to reset itself - a routine maintenance operation that takes only 4
to 6 seconds. the new york switch sent a message via the signaling network,
notifying the other 113 switches that it was temporarily dropping out of
the telephone network and would take no more telephone calls until further
notice. when it was ready again, the new york switch signaled to all the
other switches that it was open for business by starting to distribute
calls that had piled up during the brief interval when it was out of
service.

one switch in another part of the country received its first message that a
call from new york was on its way, and started to update its information on
the status of the new york switch. but in the midst of that operation, it
received a second message from the new york switch, which arrived less than
a hundredth of a second after the first.

here's where the fatal software flaw surfaced. because the receiving
switch's communication software was not yet finished with the information
from the first call, it had to shunt the second message aside. because of a
programming error, the switch's processor mistakenly dumped the data from
the second message into a section of its memory already storing information
crucial for the functioning of the communications link. the switch detected
the damage and promptly activated a backup link, allowing time for the
original communication link to reset itself. unfortunately, another pair of
closely spaced calls put the second processor out of commission, and the
entire switch shut down temporarily.

these delays caused further telephone-call backups, and because all the
switches had the same software containing the same error, the effect
cascaded throughout the system. the instability in the network persisted
because of the random nature of the failures and the constant pressure of
the traffic load within the network. although the software changes
introduced the month before had been rigorously tested in the laboratory,
no one anticipated the precise combination and pace of events that would
lead to the network's near-collapse.
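
[the article never prints the offending line, but the failure it describes
- a second message arriving before the first has been fully handled, and
its data landing on top of state the communications link still needs - maps
onto a well-known class of c mistake: a "break" written inside an "if", on
the assumption that it ends the "if", when it really jumps out of the whole
enclosing switch statement and skips code that was supposed to run. the
sketch below is an invented illustration of that class of bug, not a
reconstruction of at&t's actual code.]

  /* invented illustration - not at&t's code. a "break" inside the "if"
     exits the whole switch statement early, so the line that points "dest"
     at a safe holding slot never runs for a deferred message, and that
     message is copied over bookkeeping the communications link still
     needs. */

  #include <stdio.h>
  #include <string.h>

  #define MSG_IN_SERVICE 1

  static char status_slot[32];      /* normal home for a message */
  static char holding_slot[32];     /* where a deferred message belongs */
  static char link_state[32] =      /* crucial comm-link bookkeeping */
      "link ok, routes valid";

  static int deferred_count, handled_count;

  static void log_deferral(void)   { deferred_count++; }   /* bookkeeping */
  static void mark_link_busy(void) { handled_count++; }    /* bookkeeping */

  static void handle_message(int msg_type, const char *payload, int busy)
  {
      char *dest = link_state;   /* placeholder; the code below is expected
                                    to retarget this before it is used */

      switch (msg_type) {
      case MSG_IN_SERVICE:
          if (busy) {
              log_deferral();    /* still chewing on the previous message,
                                    so this one must be shunted aside */
              break;             /* BUG: meant to end only the "if", but a
                                    break exits the whole switch, skipping
                                    the dest assignment below */
          } else {
              mark_link_busy();
          }
          dest = busy ? holding_slot : status_slot;
          break;
      }

      /* copy the payload wherever dest points. for a deferred message the
         assignment above was skipped, so dest still aims at link_state. */
      strncpy(dest, payload, sizeof(status_slot) - 1);
  }

  int main(void)
  {
      handle_message(MSG_IN_SERVICE, "new york back in service", 0);
      handle_message(MSG_IN_SERVICE, "second update, 10 ms later", 1);
      printf("link state now reads: \"%s\"\n", link_state);
      return 0;
  }

[run as written, the second, closely spaced message overwrites the link
bookkeeping string - the same shape of damage the article describes, after
which the switch activates a backup link while the damaged one resets.]
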
in their public report, members of the team from at&t bell laboratories who
investigated the incident state: "we believe the software design,
development and test processes we used are based on solid, quality
foundations. all future releases of software will continue to be rigorously
tested. we will use the experience we've gained through the problem to
further improve our procedures."

in spite of such optimism, however, "there is still a long way to go in
attaining dependable distributed control," warns peter g. neumann, a
computer scientist with sri international in menlo park, california.
"similar problems can be expected to recur, even when the greatest pains
are taken to avoid them."

[more uninteresting nuclear reactor stuff]

-----------------------------------------------------------------------------

                                     EOF

thanks go out to everybody in the hack/phreak world who is/was kind enough
to type up a few bytes of information for the education/amusement of all,
particularly: cDc, toxic shock, phrack, phun, LOD, NIA, and CUD.