'Amateur' Linux IBM mainframe failure blamed for stranding New Zealand flyers

By Scott M. Fulton, III | Published October 12, 2009, 11:20 AM


12:05 pm EDT October 11, 2009 · The president of a design firm that specializes in data center power efficiency, and that was working on a new design last year for the Auckland-based data center that failed Friday morning, told Betanews today that even if changes were being made to that data center, the failure would not have happened had both the original design and the changeover plan been implemented properly.

"What seems strange about this incident is that they are blaming it on a generator failure during testing," stated California Data Center Design Group President Ron Hughes, whose organization was not responsible either for the data center's current design or the changeover. "If this failure did occur during testing, the question I would ask is why didn't the redundant generators assume the load or why didn't they just switch back to utility power."

Though Hughes has no specific knowledge of last Friday's incident, his insight does shed more light on the situation.

"A properly designed Tier 3 data center -- which is the minimum level required for any critical applications -- should have no single points of failure in its design. In other words, the failure of a single piece of equipment should not impact the customer," Hughes told Betanews. "A generator failure is a fairly common event, which is why we build redundancy into a system. In a Tier 3 data center, if you need one generator to carry the load, you install two. If you need two, you install three. This is described as N+1 redundancy. It allows you to have a failure without impacting your ability to operate...In a Tier 3 data center, it should take 2 failure events before the customer is impacted."
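Hughes' N+1 rule can be illustrated with a few lines of arithmetic. The sketch below is only a toy model of the counting he describes, not a model of the Newton facility; the function names `n_plus_one` and `survives` are hypothetical and chosen for illustration.

```python
def n_plus_one(units_needed: int) -> int:
    """Units to install under N+1 redundancy: the N needed plus one spare."""
    return units_needed + 1


def survives(units_needed: int, units_installed: int, failures: int) -> bool:
    """True if the generators still running can carry the full load."""
    return units_installed - failures >= units_needed


if __name__ == "__main__":
    # The two cases Hughes mentions: a load needing one or two generators.
    for needed in (1, 2):
        installed = n_plus_one(needed)
        print(
            f"need {needed}, install {installed}: "
            f"one failure ok={survives(needed, installed, 1)}, "
            f"two failures ok={survives(needed, installed, 2)}"
        )
```

Under this simplified model, a single generator failure leaves the load carried, and only a second failure takes it down, which matches Hughes' statement that it should take two failure events before the customer is affected.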

The CEO of Air New Zealand -- one of the few major CEOs anywhere to have been elevated to the top post from a CIO position -- expressed his disgust last weekend over what he describes as the poor handling of a data center failure at his airline's outsourcing partner, IBM. Rob Fyfe's e-mail, made public by IDG's Randal Jackson, excoriates IBM for its handling of a systems outage that took place at 9:30 am local time Friday morning, and that lasted for at least six hours.

During the entire time, ticketing, baggage handling, and traffic rerouting procedures for the entire airline were at a standstill, causing chaos for airports there. This at a time when Air New Zealand was engaged in a public showdown with its chief rivals there, Pacific Blue and Qantas subsidiary Jetstar, challenging them to meet ANZ's standards for flight punctuality. "In my 30-year working career," Fyfe told his colleagues, "I am struggling to recall a time where I have seen a supplier so slow to react to a catastrophic system failure such as this and so unwilling to accept responsibility and apologize to its client and its client's customers...We were left high and dry and this is simply unacceptable. My expectations of IBM were far higher than the amateur results that were delivered yesterday."

Indeed, even as of this morning, IBM New Zealand has issued no public statements. The data center failure apparently affected all of IBM's customers in the region, not just the airline, although there is no word yet as to the identity of those customers or the extent of damage to their operations.

The move to outsource data center operations to IBM appears to have happened partly under Fyfe's watch as CIO, and was heavily touted at the time by IBM's marketing literature as a "design win" for mainframe-based Linux. Though some mainframe database operations for ANZ came online as early as 1999, the most lucrative move came in August 2002, when the airline replaced its mid-range Windows NT-based in-house network made up of 150 Compaq z800 workstations, with a single eServer zSeries Linux outsourced mainframe hosted by IBM Global Services. The airline's CIO at the time of the move, Andrew Care, said maintaining the outsourced zSeries would cost his airline 30% less in maintenance fees, and save $600,000 in software licenses.

The migration was seen as a huge loss for Microsoft, whose NT operating system was already well on its way to being branded a failure for mid-level networks.

IBM established the global airline industry standard software for transaction processing as far back as 1960, in a joint project with American Airlines called the Airlines Control Program, which made possible the original, groundbreaking Sabre system. Since 1979, IBM has sold other airlines a commercial version of this system, called Transaction Processing Facility (TPF).

To this day, the transaction format used by airlines everywhere is based on ACP's half-century-old protocol. It isn't the format that has needed evolution, but rather the software that runs it; and IBM itself has been the key innovator here, developing a new class of software this decade called the Airlines Control System (ALCS). Originally seen as a mid-level alternative to a higher-class TPF system for smaller airlines that couldn't afford big iron, ALCS -- a TPF emulator -- now runs on bigger iron, thanks to the evolution in hardware as well.

Air New Zealand was one of ALCS' biggest customer wins in August 2002. Up to now, the airline has been one of ALCS' more active supporters, contributing a big chunk of new requirements for the software's latest version, according to literature from the UK-based ALCS User Group.

At this point, Air New Zealand may have too much investment tied up in the software to be in any position to migrate its applications to an IBM competitor -- if there even really is one in this field. But the airline's problem may not be so much with the software as with its current host.

According to an ANZ group general manager cited in local radio news reports, the offline incident was traced to a single generator failure at IBM's Newton Data Center in Auckland. Usually data centers have redundant power sources, and normally the Newton center would not be an exception. An August 2008 article in Data Center Journal, written by the designer of new energy-efficient data center power generators with redundant sourcing, specifically mentioned the Auckland center as one of his customers at that time.

"I've seen numerous references recently to reducing the amount of redundancy as a way to achieve higher energy efficiency," wrote engineer Ron Hughes, president of California-based Data Center Design Group, referring to his Auckland data center project. "While I have no doubt that it is true, it may not be in the long term interest of the client. Data center outages can be career changing events. That extra redundancy may be the difference between a component failure with little impact and a system-wide outage."

Current ANZ CIO Julia Raue has been overseeing an innovative new information systems project at her airline, involving the creation of customizable self-serve ticketing kiosks that customers themselves can change online, using selectable widgets, to suit their airport demands. In an interview with CIO Magazine last month, iGoogle was credited as a design inspiration for the self-serve system. But the entire system revolves around the zSeries mainframe, whose uptime last week appeared to have revolved around a single faulty generator.

While CEO Fyfe certainly has understandable reasons for wanting to abandon IBM, with his entire information strategy dependent on the move ANZ made in 2002, he may not have many alternatives open.

Comments


Regarding the abbreviation: within the industry, people most often use the IATA codes to refer to airlines; in this case the code is "NZ". There are also ICAO three-letter codes, in this case "ANZ".

Score: 1

|

Amazing, is it not, how quickly everyone forgets. A Tier III datacentre only has 99.982% availability averaged over 5 years according to the industry-recognised Uptime Institute. This equates to an outage of 5 hours every 5 years. This is obviously that hit. If ANZ wanted higher availability then they should have paid the extra bucks and gone Tier IV and gone for the five nines (99.999%) availability.

Score: 1

|

If you must abbreviate Air New Zealand, please use "AirNZ". ANZ is the name of a bank down here in Australia and New Zealand.

* not affiliated with either, it's just confusing to read.

Score: 0

|

I've actually seen this type of failure in a tier3 data center before; it was caused by human error. The utility/generator switch was set to bypass, so automatic switchover could not occur. This was the result of a procedural error following a maintenance event. Nothing is idiot proof, and there are a lot of idiots.

Score: 0

|

Might this be the first time in known history when someone actually does loose his job for going with IBM (or am I just sold on a myth perpetuated by IBM?)

Rob Fyfe beware!

Score: -2

|

Loose or lose?

Score: 0

|
