Betanews Comprehensive Relative Performance Index 2.2: How it works and why
By Scott M. Fulton, III | Published October 30, 2009, 9:11 PM
We did not have the Comprehensive Relative Performance Index (CRPI -- the "Creepy Index") out for very long before we found it needed to be changed again. The main reason came from one of the architects of the benchmark suites we use, Web developer Sean Patrick Kane. This week, Kane declared his own benchmark obsolete, and unveiled a completely new system to take its place.
When the author of a benchmark suite says his own methodology is outdated, we really have no choice but to agree and work around it. As you'll see, Kane replaced his original, simple suite that covered all the bases with a very comprehensive, in-depth battery of classic tests called JSBenchmark that covers just one of those bases. For the CRPI to continue to be fair, we needed not only to compensate for those areas of the old CK index that were no longer covered, but also to balance them with tests that covered those missing bases just as comprehensively.
The result is what we call CRPI 2.2 (you didn't see 2.1, although we tried it and we weren't altogether pleased with the results). The new index number covers a lot more data points than the old one, and the result...is a set of indices that are stretched back out over the 20.0 mark, like the original 1.0, but whose proportions with respect to one another remain true. In other words, the bars on the final chart look the same shape and length, but there are now more tick marks.
General explanation of the CRPI
Since we started this, we've maintained one very important methodology: We take a slow Web browser that you might not be using much anymore, and we pick on its sorry self as our test subject. We base our index on the assessed speed of Microsoft Internet Explorer 7 on Windows Vista SP2 -- the slowest browser still in common use. For every test in the suite, we give IE7 a 1.0 score. Then we combine the test scores to derive a CRPI index number that, in our estimate, best represents the relative performance of each browser compared to IE7. So for example, if a browser gets a score of 6.5, we believe that once you take every important factor into account, that browser provides 650% the performance of IE7.
We believe that "performance" means doing the complete job of providing rendering and functionality the way you expect, and the way Web developers expect. So we combine speed, computational efficiency, and standards compliance tests. This way, a browser with a 6.5 score can be thought of as doing the job about six and a half times faster and better than IE7.
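To make the "IE7 equals 1.0" idea concrete, here is a minimal Python sketch. The function names and the sample timings are hypothetical, and since the article does not spell out exactly how the individual test scores are combined, the plain average used here is only an assumption.

```python
from typing import Sequence


def relative_score(baseline_ms: float, browser_ms: float) -> float:
    """Score a timed test relative to the baseline browser (baseline = 1.0).

    Lower elapsed time is better, so the ratio is baseline / browser.
    """
    return baseline_ms / browser_ms


def combined_index(scores: Sequence[float]) -> float:
    """Combine per-test relative scores (assumption: a plain average)."""
    return sum(scores) / len(scores)


# Hypothetical timings in milliseconds: (IE7 baseline, browser under test)
timings = [(1300.0, 200.0), (900.0, 150.0), (700.0, 100.0)]
scores = [relative_score(base, test) for base, test in timings]
index = combined_index(scores)
print(round(index, 2))
```

For a test where higher numbers are better, such as an iterations-per-second count, the ratio would be flipped to browser / baseline.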
Here now are the ten batteries we use for our CRPI 2.2 suite, and how we've modified them where necessary to suit our purposes:
- Nontroppo CSS rendering test. Up until recently, we were using a modified version of a rendering test used by HowToCreate.co.uk, whose two purposes have been to time how long it takes to re-render the contents of multiple arrays of <DIV> elements and to time the loading of the page that includes those elements. We modified this page because the JavaScript onLoad event fires at different times in different browsers -- despite its documented purpose, it doesn't necessarily mean the page is "loaded." There's a real-world reason for these variations: In Apple Safari, for instance, some page contents can be styled the moment they're available, before the complete page is rendered, so firing the event early enables the browser to do its job faster -- in other words, Apple doesn't just do this to cheat. But the creators of the test themselves, at nontroppo.org, did a better job of compensating for the variations than we did: Specifically, the new version now tests to see when the browser is capable of accessing that first <DIV> element, even if (and especially when) the page is still loading.
Here's how we developed our new score for this test battery: There are three loading events: one for Document Object Model (DOM) availability, one for first element access, and the third being the conventional onLoad event. We counted DOM load as one sixth, first access as two sixths, and onLoad as three sixths of the rendering score. Then we adjusted the re-rendering part of the test so that it iterates 50 times instead of just five. This is because some browsers do not count milliseconds properly on some platforms -- which is why Opera mysteriously mis-reported its own speed in Windows XP as slower than it was. (Opera users everywhere...you were right, and we thank you for your persistence.) By running five loops of 10 iterations each, we can get a more accurate estimate of the average time for each iteration because the millisecond timer will have updated correctly. The element loading and re-rendering scores are averaged together for a new and revised cumulative score -- one which readers will discover is much fairer to both Opera and Safari than our previous version.
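The weighting described above can be sketched in a few lines of Python. The one-sixth, two-sixths, and three-sixths weights and the 50 iterations come from the article; the function names, the dictionary keys, and the choice to apply the weights to IE7-relative scores (rather than to raw times) are assumptions made for illustration.

```python
# Weights from the article: DOM availability 1/6, first element access 2/6,
# conventional onLoad 3/6.
LOAD_WEIGHTS = {"dom": 1 / 6, "first_access": 2 / 6, "onload": 3 / 6}


def weighted_load_score(relative: dict) -> float:
    """Blend the three loading-event scores using the article's weights.

    `relative` maps each event name to a score relative to IE7 (assumption).
    """
    return sum(weight * relative[event] for event, weight in LOAD_WEIGHTS.items())


def per_iteration_ms(total_ms: float, iterations: int = 50) -> float:
    """Average time per re-render, measured over many iterations so that a
    coarse millisecond timer has time to update."""
    return total_ms / iterations


def nontroppo_score(load_score: float, rerender_score: float) -> float:
    """The loading and re-rendering scores are averaged together."""
    return (load_score + rerender_score) / 2


# Hypothetical relative scores for one browser
load = weighted_load_score({"dom": 6.0, "first_access": 3.0, "onload": 2.0})
print(round(load, 3), per_iteration_ms(750.0), round(nontroppo_score(load, 5.0), 3))
```

Timing the whole run and dividing by the iteration count is what keeps a timer with poor resolution from dominating the result.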
- Celtic Kane JSBenchmark. The very first benchmark tests I ever ran for a published project were taken from Byte Magazine, and the year was 1978. They were classic mathematical and algorithmic challenges, like finding the first handful of prime numbers or finding a route through a random maze, and I was excited at how a TRS-80 trounced an Apple II in the math department. The new JSBenchmark from Sean P. Kane is a modern version of the classic math tests first made popular, if you can believe it, by folks like myself. For instance, the QuickSort algorithm segments an array of random numbers and sorts the results in a minimum number of steps; while a simplified form of genetic algorithms, called the "Genetic Salesman," finds the shortest route through a geometrically complex maze. It's good to see a modern take on my old favorites. Like the old CK benchmark, rather than run a fixed number of iterations and time the result, JSBenchmark runs an undetermined number of iterations within a fixed period of time, and produces indexes that represent the relative efficiency of each algorithm during that set period -- higher numbers are better.
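The fixed-window approach, counting how many iterations fit in a set period instead of timing a set number of iterations, can be sketched as follows. This is an illustration in Python rather than Kane's actual JavaScript; the function names, the window length, and the stand-in workload are all hypothetical.

```python
import time


def iterations_in_window(task, window_s=1.0, clock=time.perf_counter):
    """Run `task` repeatedly until `window_s` seconds have elapsed.

    Returns the number of completed runs; higher numbers are better.
    `clock` is injectable so the loop can be checked with a fake timer.
    """
    deadline = clock() + window_s
    count = 0
    while clock() < deadline:
        task()
        count += 1
    return count


def sort_scrambled_list():
    """Stand-in workload (not one of the actual JSBenchmark tests)."""
    data = [(i * 7919) % 1000 for i in range(1000)]
    data.sort()


score = iterations_in_window(sort_scrambled_list, window_s=0.05)
print(score)
```

One consequence of this design is that the test always takes the same wall-clock time to run, no matter how fast or slow the browser is.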
- SunSpider JavaScript benchmark. Maybe the most respected general benchmark suite in the field focuses on computational JavaScript performance rather than rendering -- the raw ability of the browser's underlying JavaScript engine. Although it comes from the folks who produce the WebKit open source rendering engine, which currently has closer ties with Safari but is also used elsewhere, we've found SunSpider's results to appear fair and realistic, and not weighted toward WebKit-based browsers. There are nine categories of real-world computational tests (3D geometry, memory access, bitwise operations, complex program control flow, cryptography, date objects, math objects, regular expressions, and string manipulation). Each test in this battery is much more complex, and more in tune with real functions that Web browsers would perform every day, than the more generalized, classic approach now adopted by JSBenchmark. All nine categories are scored and averaged relative to IE7 in Vista SP2.
- Mozilla 3D cube by Simon Speich, also known as Testcube 3D, is an unusual discovery from an unusual source: an independent Swiss developer who devised a simple and quick test of DHTML 3D rendering while researching the origins of a bug in Firefox. That bug has been addressed already, but the test fulfills a useful function for us: It tests only graphical dynamic HTML rendering -- which is finally becoming more important thanks to more capable JavaScript engines. And it's not weighted toward Mozilla -- it's a fair test of anyone's DHTML capabilities.
There are two simple heats whose purpose is to draw an ordinary wireframe cube and rotate it in space, accounting for forward-facing surfaces. Each heat produces a set of five results: total elapsed time, the amount of that time spent actually rendering the cube, the average time each loop takes during rendering, and the elapsed times in milliseconds of the fastest and slowest loops. We average those last two into a single value, which is compared, along with the other three times, against IE7's scores to yield a comparative index score.
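As a rough illustration of how one heat's five results might collapse into a single comparative number, here is a Python sketch. The timings are invented, the function names are hypothetical, and averaging the four per-metric ratios is an assumption, since the article says only that the four times are compared against IE7.

```python
def heat_metrics(total_ms, render_ms, avg_loop_ms, fastest_ms, slowest_ms):
    """Reduce a heat's five results to four by averaging the two extremes."""
    extremes_avg = (fastest_ms + slowest_ms) / 2
    return [total_ms, render_ms, avg_loop_ms, extremes_avg]


def heat_index(browser, baseline):
    """Compare each of the four times against the baseline (IE7) figure.

    These are elapsed times, so lower is better and each ratio is
    baseline / browser. Averaging the ratios is an assumption.
    """
    ratios = [
        base / test
        for base, test in zip(heat_metrics(*baseline), heat_metrics(*browser))
    ]
    return sum(ratios) / len(ratios)


# Hypothetical results in milliseconds:
# (total, rendering, average loop, fastest loop, slowest loop)
ie7 = (8000.0, 6000.0, 120.0, 60.0, 180.0)
other = (2000.0, 1500.0, 30.0, 20.0, 40.0)
print(heat_index(other, ie7))
```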
- SlickSpeed CSS selectors test suite. As JavaScript developers know, there are a multitude of third-party libraries, in addition to the browser's native JS library, that enable browsers to access elements of a very detailed and intricate page (among other things). For our purposes, we've chosen a modified version of SlickSpeed by Llama Lab, which covers many more third-party libraries including Llama's own. This version tests no fewer than 56 shorthand methods, supposed to be commonly supported by all JavaScript libraries, for accessing certain page elements. These methods are called CSS selectors (one of the tested libraries, called Spry, is supported by Adobe).
So Llama's version of the SlickSpeed battery tests 56 selectors from 10 libraries, including each browser's native JavaScript (which should follow prescribed Web standards). Multiple iterations of each selector are tested, and the final elapsed times are rendered. Here's the controversial part: Some have said the final times are meaningless because not every selector is supported by each browser; although SlickSpeed marks each selector that generates an error in bold black, the elapsed time for an error is usually only 1 ms, while a non-error is as high as 1000. We compensate for this by creating a scoring system that penalizes each error for 1/56 of the total, so only the good selectors are scored and the rest "get zeroes."
Here's where things get hairy: As some developers already know, IE7 got all zeroes for native JavaScript selectors. It's impossible to compare a good score against no score, so to fill the hole, we use the geometric mean of IE7's positive scores with all the other libraries as the base number against which to compare the native JavaScript scores of the other browsers, including IE8. The times for each library are compared against IE7, with penalties assessed for each error (Firefox, for example, can generate 42 errors out of 560, for a penalty of 7.5%). Then we assess the geometric mean, not the arithmetic average, of each battery -- we do this because we're comparing the same functions for each library, not different categories of functions as with the other suites. Geometric means will account better for fluctuations and anomalies.
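The penalty arithmetic and the geometric mean can be sketched in Python as follows. The figures of 56 selectors, 10 libraries, and 42 errors come from the article; the function names are hypothetical, and applying the penalty as a proportional reduction of a library's relative score is an assumption about how the deduction is made.

```python
from statistics import geometric_mean

SELECTORS = 56   # selectors tested per library (from the article)
LIBRARIES = 10   # libraries tested, including native JavaScript


def error_penalty(errors: int, total: int = SELECTORS * LIBRARIES) -> float:
    """Fraction of the score forfeited when each error costs 1/total."""
    return errors / total


def penalized(ratio: float, errors: int, total: int = SELECTORS) -> float:
    """Assumption: the penalty scales a relative score down proportionally."""
    return ratio * (1 - error_penalty(errors, total))


def battery_score(library_ratios):
    """Geometric mean of the per-library scores relative to IE7."""
    return geometric_mean(library_ratios)


firefox_penalty = error_penalty(42)   # 42 errors out of 560
example = battery_score([2.0, 8.0])   # hypothetical per-library ratios
print(f"{firefox_penalty:.1%}", round(example, 3))
```

For the two hypothetical ratios above, the geometric mean comes out lower than the arithmetic average of 5.0, which shows how it damps the pull of a single outlying library.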
Next: The other five elements of CRPI 2.2...
"But the fact that we perform all of our tests on one machine, and render their results as relative speeds, means that the physical platform is actually immaterial here. We could have chosen a faster or slower computer (or, frankly, a virtual machine) and you could run this entire battery of tests on whatever computer you happen to own. You'd get the same numbers because our indexes are all about how much faster x is than y, not how much actual time elapsed."
- Except for browsers that may make use of say one core more efficiently than 4, or vice versa.
- Or a browser that might take advantage of a new processor instruction versus another.
- Or a browser that likes higher speed memory but not a slow disk, versus another browser that does fine with a slow disk.
- Or that different browsers have different cache levels in their defaults, and you haven't listed those, nor where the caches are located on each platform.
- Or that you use a slower generation mechanical drive in the age of Solid State Disks. Oh you could argue that SSDs are "too new," but then the same could be said for using beta versions of software. Using an SSD I dare say your scores would be much closer together, and the differences of the browsers might be negligible. That you ignore hardware but embrace the latest and greatest software.... well, you get our point. You could do better.
You have a long way to go.
Score: 0
|I have yet to see any popular browser that takes direct advantage of multi-core, so that argument is bogus.
Browsers don't talk directly to processors. Another bogus point.
Yes, one browser might do slightly better with more RAM than a fast disk. Yes, browsers have different cache levels - they have a LOT of differences, so what? The whole point is seeing which design wins out on a typical platform, with results that scale as well as possible. That's the very definition of a good benchmark.
LOL, "the age of SSDs", which precisely no desktops except power gaming rigs are using yet.
Not only is your post ridiculous, it's obviously nothing but thinly-veiled fear that your favorite browser might not do that well in future tests. Either that or simple hating on the BN staff, which would be even more bizarre. Honestly, the whole fanboy/troll thing baffles me.
Score: 4
|Good deal, psycros. You beat me to it. I would have worded it slightly differently... but you got the meat-and-potatoes of it. =)
Score: 0
|"- Except for browsers that may make use of say one core more efficiently than 4, or vice versa."
Nonsense, and that is entirely the point of a benchmark (if it is designed to expose those strengths and weaknesses). You suggest cutting-edge hardware, yet find fault with the use of a quad-core processor. If browser 'A' performs better than browser 'B' simply because the PC had a dual- or quad-core CPU (which are extremely commonplace these days), then the developers of browser 'B' need to get their act together and implement better multi-threading support.
"- Or a browser that might take advantage of a new processor instruction versus another."
"- Or a browser that likes higher speed memory but not a slow disk, versus another browser that does fine with a slow disk."
Huh?? Where are you getting all of this from??
"- Or that different browsers have different cache levels in their defaults, and you haven't listed those, nor where the caches are located on each platform."
Again, that's the entire point of a benchmark. If one browser performs better than another, and one of the reasons is because of default browser cache settings (which has nothing to do with the hardware that was chosen), then one would think that's cause for the developer to do something to rectify that... wouldn't one?
"- Or that you use a slower generation mechanical drive in the age of Solid State Disks..."
Oh c'mon... the age of SSDs? From where I sit, we are all still firmly rooted in the age of mechanical magnetic storage. Yes, SSDs are too new right now, especially when there are such wide-ranging performance measurements across all manufacturers and models.
SSDs are nice. They are currently showing great potential, but they are not without their fair share of faults and drawbacks, and are not quite ready for mainstream adoption (yet). Severe performance degradation once the drive starts filling up with data (sometimes causing performance to drop below that of their mechanical counterparts), combined with the fact that not all drives suffer the exact same amount of degradation under identical circumstances, make them an unreliable component in what should be a stable benchmark testbed.
The platform that has been chosen is very representative of an extremely broad range of average PC users currently... the Lowest Common Denominator.
I'm not sure what's happened recently. You used to make much more sense than you have in recent weeks, but you've started losing it. I'm starting to think your account's been hacked...
Score: 1
|You guys really shouldn't mention Kane's site until it stops spreading virus... He has been spreading virus from his site for 2 weeks now... You now get more people infected by advertising his site :/
Score: 0
|Well, we've been corresponding with Sean, and what happened there was that one of his sponsors was linking to a site that apparently spread a Flash virus. Sean was very responsible about this; he took down his site, moved it, and rebuilt it from scratch. Right now, it does not link to the site that spread the virus; we made certain of that before we ran with this new test suite.
I'm actually somewhat sorry that Sean has been blamed for this, when he was just as much a victim as those who got the virus. A similar incident happened last week to Gawker, and I don't recall anyone accusing Gawker of spreading the virus intentionally.
-SF3
Score: 1
|I didn't see where he stated Sean was doing it intentionally. Must not have read the whole post...
Score: 0
|You should try reading the whole post yourself... "He has been spreading virus from his site for 2 weeks now". It's not exactly a stretch to interpret that as saying he did it intentionally.
Score: 0
|It's not exactly a stretch to interpret that as saying that he did it unknowingly and unintentionally either.
All the OP stated was that he (Sean) was spreading a virus (which indeed was the case), and regardless of his apparent good intentions, made an uninformed and premature call (not knowing the problem had been resolved) to inform others to steer clear... nothing more, nothing less.
Score: 0
|"He has been spreading virus from his site for 2 weeks now."
English Lesson: The above sentence does not state, in any way, shape, or form, the level of intent.
Any intent applied is done purely based on speculation by the reader.
As easily as you or anyone else could assume intent was implied, I or anyone else could assume the opposite.
The lesson for the day has been completed. Please do not forget to pick up your sign on the way out. ;)
Score: 0
|I prefer to assume that Sean is not his real name... but is, in fact, Wesley. =)
Score: 0
|Just pointing out where the confusion arises!
Score: 0
|"I prefer to assume that Sean is not his real name... but is, in fact, Wesley."
The Dread Pirate Roberts?
Score: 0
|"I am not the Dread Pirate Roberts", he said. =)
Score: 0
|"The real Roberts has been retired fifteen years and living like a king in Patagonia."
Heh... Way back in the day when the internet was still second-rate compared to the BBS, my friends and I would watch this movie at least once a month. Sad, isn't it?
(Anybody want a peanut?)
Score: 0
|"Way back in the day when the internet was still second-rate compared to the BBS..."
Inconceivable! =)
By the way, I just watched my Laserdisc version again just last week.
Score: 0