With the petaflop barrier broken, is it time to change the benchmark?
By Scott M. Fulton, III | Published November 20, 2008, 5:48 PM
The Roadrunner supercomputer now runs more than 1.1 thousand trillion floating-point operations per second. But what's an "operation" really? By the time the next Top 500 list comes out, the definition could change.
At a presentation at the SC08 annual supercomputing conference in Austin, Texas, an engineer with Oak Ridge National Laboratory in Tennessee who is an expert on the Linpack benchmark suggested that the methodology used to determine supercomputer performance using Linpack may be behind the times. Specifically, Jack Dongarra -- the man credited with introducing the High-Performance Linpack (HPL) benchmark to the Top 500 program -- suggested that as supercomputers get bigger and can store more data, the time they need to complete the benchmark grows disproportionately. This implies that making existing supercomputers bigger and faster eventually leads to a point of diminishing returns.
In the presentation (PDF available here), Dongarra states up front that it's only natural to test a supercomputer cluster with a problem size that's proportionate to its capacity. HPL expresses problem size in orders of 10; whereas desktop computers may be tested using Linpack set with a problem size of 1,000, the problem or matrix size for supercomputers runs into the millions. The #1 contender in this week's Top 500, Los Alamos National Laboratory's IBM Roadrunner, ran the HPL with a problem size (n) of 2,300,000. It was able to perform the HPL in about two hours. Oak Ridge had the #8 and #2 contenders, both of them Cray XTs, with the #2 player -- Jaguar -- also beating the petaflop barrier this year.
Jaguar has more memory available to it than Roadrunner. So in one test, Oak Ridge increased the problem size proportionate to its capacity, for n of 4,700,000 (4.7 x 10^6). As Dongarra reported, that test took Jaguar 16 hours to complete.
At the rate at which supercomputers are presently scaling, he told attendees, it should be perfectly reasonable to adjust the problem size to 33,500,000 by 2012. It would only seem fair to scale the problem size to the extent that the clusters are scaling. But with current architectures, applying the current rate of performance falloff, the completion time could actually balloon, Dongarra predicted, to over two and a half days.
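For readers who want to check the arithmetic, here is a rough sketch in Python of why a bigger matrix costs so much more time. It assumes the standard HPL operation count of 2/3 n^3 + 2 n^2 for an n x n problem, which the article does not state, and it pretends the machine sustains one constant rate for the whole run, which is exactly the idealization Dongarra's falloff curve calls into question. The function names are hypothetical, and the 1.1 petaflop rate is simply the Roadrunner figure quoted at the top of this story.

```python
def hpl_flops(n: int) -> float:
    """Operations credited for an n x n HPL solve (assumed count: 2/3 n^3 + 2 n^2)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def hpl_hours(n: int, sustained_flops_per_sec: float) -> float:
    """Estimated run time in hours at a constant sustained rate (an idealization)."""
    return hpl_flops(n) / sustained_flops_per_sec / 3600.0


# Roadrunner, per the article: n = 2,300,000 at roughly 1.1 petaflops.
roadrunner_hours = hpl_hours(2_300_000, 1.1e15)

# How much more work the Jaguar-sized problem is than the Roadrunner-sized one.
work_ratio = hpl_flops(4_700_000) / hpl_flops(2_300_000)

print(f"Roadrunner estimate: {roadrunner_hours:.2f} hours")
print(f"n = 4.7M is {work_ratio:.1f}x the work of n = 2.3M")
```

Under those assumptions the Roadrunner estimate lands close to the "about two hours" reported, and roughly doubling n multiplies the work by more than eight, because the dominant term grows with the cube of the problem size. The sketch says nothing about how the sustained rate itself falls off late in a run.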
What's the solution? Dongarra's presentation showed how performance declines over the period of the entire benchmark's run, which it's supposed to do. Conceivably if a certain segment of the run were captured prior to the big dropoff late in the run, then a formula for the dropoff could be estimated based on the graph. That way, the test won't take days for each iteration...or conceivably even weeks.
But implementing the solution may mean changing the benchmark, and quite possibly impacting the results. This from a guy who works for the laboratory whose best contender placed second.
Dongarra's presentation closed with the following: "We are planning on making changes and will probably be ready after ISC in Hamburg." That's the very next supercomputing conference, slated for June 23 in Hamburg, Germany, and that's when the next Top 500 list will be published. At that time, we could see whether changing the way we measure tasks that scale up or down with their respective clusters changes, in turn, the way we perceive supercomputing.
Yawn.
If someone is so anally concerned with an artificial benchmark, perhaps they should begin by looking at the applied function of the complex and what it is designed to do - and then evaluate how well it does that job, rather than worry about some artificial, task-ignorant benchmark intended to normalize meaningless performance devoid of an intended application.
I mean a Porsche GT3 is a great car, but it is lousy at hauling a large boat! And a 4 wheel drive pickup may be great hauling a load in the bed as it pulls a boat, but it's a lousy handling performance sports car on twisty mountain roads, if that is your need! And comparing the two with regards to an abstract metric such as how many station presets their radios have, fails to provide much meaningful insight into either's performance strengths and weaknesses, let alone the behavior of either item within the context of the typical real world application.
Not to mention the fact that the benchmarks compare apples and oranges as they compare monolithic '1 job at a time' non-scalable (or with only very limited finite scalability) dedicated units versus complexes that can not only run MANY jobs simultaneously, but can also have the resources dynamically reallocated in real time as the job structure and loads change, as well as being able to scale almost without limit.
This benchmark has become increasingly marginalized to the point that it is meaningful only to a few marketing departments who think they can leverage it in a brochure to impress those unaware of the nature of the MP market and to those who lack an understanding of the wide range of configurations and functionality available.
This metric has become like buying a riding lawnmower based strictly on the basis of a marketing brochure's claims regarding how fast it can run, after having been modified to operate without a blade or operator for the test; let alone while being able to actually cut grass at the rate at which it is tested.
Score: 0
|Foxy, why don't you go play with your dodo?
Score: 0
|foxfyre,
Yes, you do make a number of good points. However, without a semi-viable way for funders/investors/elected officials to rank a computer, it is probably difficult to convince them to fund one. Having a PhD tell a funder they need $200M+ to buy a machine while only supplying them with a number of processors, RAM, and some arcane chip to chip communications technology isn't reasonable. The investors (government or business/scientific) want to have a relative idea of how the funding will boost performance.
Another thing to consider is the value of running multiple types of benchmarks. It may be feasible and cost efficient to have a dozen+ types of software and benchmarks for games, office suites, video, photo manipulation, etc. for say, desktop PCs. But this is when machines cost only thousands of dollars and the benchmarks are relatively standard software packages across x86 type chips & video cards. It is far far more costly to rewrite benchmark software to run on the multitude of hardware that encompasses supercomputers...plus the extraordinary cost involved in running the benchmarks on 1 of a kind equipment that costs millions and presumably thousands of dollars just in electricity to run per day.
1-2 benchmarks may not be optimum (and obviously, a rewrite of the HPL is necessary) but it is the best we can reasonably hope for.
Score: 0
|If the folks considering the need for such a machine are so woefully lacking in any understanding of what the intended application(s) are, and how the design and configuration of the MP environment impacts the resolution of said applications, they should be relying upon some other source to evaluate the suitability of the environment.
The 'fastest' machine is not the measure of the optimal configuration in this environment. Nor does it give any meaningful insight into the ability of the environment to be utilized optimally any more than does judging an individual's intelligence by how fast they can run.
Heaven forbid one should propose having folks familiar with such needs evaluate the various configurations and applications that actually impact total usability of the complex.
But more fundamentally, the market has progressed far past the 'let's build a freak machine to impress the neighbors'. Folks (or a 'you' used editorially ;-) ) seem to assume that this market is dominated by experimental proof of concept designs rather than mature scientific and business application solutions.
The proof of concept machines are a significant minority, and really don't factor into the market as they are not being sold. Especially when new configurations of existing designs are so readily available along with the natural evolutionary advance in product development.
This market is MUCH more mature than that. And those with the need are not generally appealing to rube funders. Instead they are being purchased by mature facilities with a specialized or high level need - be it for applied research or purely commercial purposes. To reiterate once again, for some reason, too many here seem to have the idea that this market is dominated by the construction of proof of concept machines rather than by supplying a robust applied production environment.
While there will always be the example where more powerful and capable component nodes as well as aggregate complexes will be constructed, the commercial market has progressed FAR past the point of simply pointing to the speed of a particular machine when that metric becomes nearly superfluous relative to the configuration and flexibility of the complex.
And if those designated to select the appropriate MP complex are doing so on the basis of monolithic 'speed' metrics, then they already lack the credentials necessary for evaluating such a purchase. At that point, they may as well use the color of the machine as the basis of their decision.
The problem is not the capability of the machine, but rather the question of whether the idiots buying the machine even need, let alone should be allowed to have one.
And that is why you appoint a qualified individual or staff to evaluate the needs and report back with a summary of the pros and cons of various approaches relative to the intended application and associated costs and additional overhead, rather than simply relying on a near-meaningless spec.
Again, suggesting that a simple metric such as speed is in any way useful belies the lack of awareness of the myriad mature configurations developed for a wide array of research and commercial applications.
And an even more significant issue in most environments is the ability for the complex to accommodate a mixed bag of applications, not simply one calculation! If you only need a limited focused solution, the money would be better spent buying a time slice on another complex rather than buying such a dedicated machine. And this, by the way, is exactly what led to Cray being sold twice! Monolithic dragsters do not make viable high performance utility vehicles suitable for a broad array of high demand solutions.
But next time you apply for a job, bring some good running shoes. And don't complain if your ability to do the job is based upon how quickly you can run the 100 yard/meter dash. After all, common sense tells us that this provides a meaningful metric regarding your technical and content related knowledge as well as your analytical and critical reasoning skills.
...As evidently, those evaluating the computational requirements of the institution qualified for their jobs in precisely the same way. And, if this is indeed the case, I would seriously suggest that you stop and consider whether you really want to work for such a company. ;-)
We can hope for, and EXPECT much more! And an institution, be it public or private should demand a more robust decision process. And fortunately, the fact is, most do.
Otherwise, just give them a 'rad fast' gaming rig and pocket the million$ or so difference. They will never know the difference. And most likely, they will be too busy fighting over who gets to play WOW on it, and they won't even notice. ;-)
Score: 0
|Gee prepubescent, I asked, but your mother said you couldn't come out and play...
Score: 0