Google releases its data encoding format to compete with XML

By Scott M. Fulton, III | Published July 8, 2008, 11:53 AM

In an effort to solve the size and speed problems of encoding large databases, Google developed its own alternative to XML. Yesterday, the company began encouraging others to adopt it in place of the industry standard.

There's an argument that open standards are only truly useful when one standard applies to any given category of service -- an argument that was raised in the matter of application formats. Now the broader category of data encoding -- handled nowadays by XML -- is about to receive a big challenge, ironically from the group perceived as the champion of open standards in Internet communication: Google.

Yesterday afternoon, Google publicly released documentation for a system it has been using internally, called Protocol Buffers, inviting others to use it as well. And in a surprising blog post, one of its own software engineers argued that its system was preferable to XML because it's less expensive to deploy, and can more easily scale up to very large databases.

"As nice as XML is, it isn't going to be efficient enough for this scale. When all of your machines and network links are running at capacity, XML is an extremely expensive proposition," wrote Google software engineer Kenton Varda. "Not to mention, writing code to work with the DOM tree can sometimes become unwieldy."

Google's public documentation shows Protocol Buffers (which has yet to be formally abbreviated) is indeed conceptually different from XML, in that it's rooted more in procedural logic than structural declaration. In XML, a schema defines the structures of tables and recordsets, and it is kept separate from the document that supplies the contents of records in that structure.

In Protocol Buffers, by contrast, one file contains class declarations whose composition looks much more like C++. They're called .proto files, and they define structural prototypes for tables using object-oriented language with which many programmers are already familiar. Each member of a class -- analogous to an entry in a database -- has characteristics that define its type in memory, just like a variable.

But here, in an unusual departure from the norm, what looks like a default value assigned to each member is actually a unique numeric tag that identifies that member within an encoded record. Imagine if data were streamed onto recording tape, the way it used to be in the late 1960s and '70s. It's that streaming of the data sequence, without all the fenceposts, that differentiates Protocol Buffers from XML: gone are the markup tags that say where an entry or a record starts and stops.

Setting the data contents then takes place programmatically, using programming language constructs rather than a marked-up data file.
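To make the tagged-stream idea concrete, here is a minimal Python sketch of how two kinds of fields are laid out in the binary encoding that Google's documentation describes: each field is written as a key (the field's tag number combined with a wire type) followed by its value, with integers stored as variable-length "varints." This is a hand-rolled illustration, not Google's library; the function names, the two-field record, and the XML string used for comparison are all assumptions made for the example.

```python
# Illustrative only: a hand-written encoder for two field kinds,
# not the code Google's compiler generates.

WIRE_VARINT = 0            # wire type for integers
WIRE_LENGTH_DELIMITED = 2  # wire type for strings and byte blobs


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low bits first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A key packs the field's tag number and its wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    payload = text.encode("utf-8")
    return (encode_key(field_number, WIRE_LENGTH_DELIMITED)
            + encode_varint(len(payload))
            + payload)


# Hypothetical record: tag 1 holds an integer id, tag 2 holds a name.
record = encode_int_field(1, 150) + encode_string_field(2, "testing")

# A hypothetical XML rendering of the same two values, for a rough size check.
xml_equivalent = "<record><id>150</id><name>testing</name></record>"

print(record.hex())
print(len(record), "bytes encoded vs", len(xml_equivalent), "bytes of XML")
```

There are no start or end markers in the output: the reader relies on the tag numbers and length prefixes to know where each value begins and ends. The size comparison is for this one tiny record only and says nothing about the general figures Google quotes.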

Under the heading, "Why not just use XML?" an overview page in the Protocol Buffers documentation reads, "Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: are simpler, are 3 to 10 times smaller, are 20 to 100 times faster, are less ambiguous, [and] generate data access classes that are easier to use programmatically."

Some might argue that, in the effort to solve the bulk problem, Google didn't really invent anything new at all -- it simply reverted to the older concept of the interface definition language (IDL), a defining feature of the era of COM and CORBA. Google anticipated that argument, and yesterday Varda offered a pre-emptive counter-argument to the question, "Isn't it just another IDL?"

"Yes, you could call it that. But, IDLs in general have earned a reputation for being hopelessly complicated," Varda wrote. "On the other hand, one of Protocol Buffers' major design goals is simplicity. By sticking to a simple lists-and-records model that solves the majority of problems and resisting the desire to chase diminishing returns, we believe we have created something that is powerful without being bloated. And, yes, it is very fast -- at least an order of magnitude faster than XML."

Comments


If only the file size matters, I think EBML is for you. (http://ebml.sourceforge.net)

Score: 0

|

If it is pre-processed to produce C++ code, Java code, etc., it should be possible to do the same in XML without affecting size and speed. It is time to come up with an "XML to Protocol Buffers" (de)converter.

Score: 0

|

I always thought that XML was an incredible waste of resources. I applaud any initiative for a format that can fix that.

Score: 0

|

our lord, let this be goodbye to the DOM API... amen.

Score: 0

|

PB is to XML as Javascript is to HTML. You can code in PB, but at some point you'll need to execute object.SerializeToXML();

I know it's tough to learn XSD, XML, and XSL. These standards often create a mess of what could easily be accomplished with custom EDI. Not sure "open source" EDI is an improvement on the XML language though!

Here are the supposed benefits:

1) Simple - until you find everything lacking and have to add a whole bunch more open source code!

2) Size - XML can also be compressed (compiled) when stored or transmitted. It can be encrypted too! "XML protocols" tend to be managed within an application's architecture.

Public XML web services are not compiled (yet) because line level compression and web service caching take care of most issues.

3) Speed - It may be fast, so long as it remains simple, and you don't send data to any other program except your own. Otherwise the overhead and phone calls will drown out any speed attained. Make your own standard, support your own revisioning!

4,5) Simple ambiguity and simple data classes are great so long as apples are in fact oranges. Pattern matching, bounds criteria, and other constraint info is missing from the .proto def.

Score: 0

|

But will Protocol Buffers run on Linux? After reading http://www.promotinglinux.com/truth/ I'm not so sure.

Score: 0

|

That's about the stupidest s*** I've ever read. Don't think a single paragraph had any truth in it.

I tried to read it as satire, but it was still just stupid s***.

meh

Score: 0

|

XML is a huge waste of space when you are dealing with anything larger than a config file.

People like XML because it is easily editable, but do you really need to edit record 38484384 directly?

How much of XML is simply start and close tags?
A lot.

Score: 0

|

No mention of JSON? I thought that was supposed to be smaller and faster, too.

Score: 0

|

It's not really competing with XML in most cases.

It's only for serializing data for internal purposes; that's where protocol buffers excels. Google points out some of the shortcomings, including that it's not human readable. It's also not as easy to parse in languages other than Java, C++ or Python, since no code exists for those yet.

Protocol buffers isn't a replacement of XHTML, XML, RSS, XSLT, or any of the various XML uses out there.

It's more of a replacement for serialized PHP or JSON.

Score: 0

|

They need to come out of the tower and into the real world, in my opinion. They seem to have forgotten some of the things XML was intended for, like ease of use and a lower barrier to entry.

This sounds like you'll need to recompile after every change, and like it's just for those who can program.

Score: 0

|

You need to realize that PB isn't the solution to every problem. It solves a very targeted problem domain, one where programs must be modified anyway.

Score: 0

|

Sounds good to me. When I first learned of XML back in 2002, my first concern was that while it was nice to have a universal format for storing data, it seemed extremely inefficient... will be interested to see how protocol buffers go...

Score: 0

|

Sounds like "doing evil" to me!

Score: 0

|
