Co-opting GPU for CPU Tasks Advanced by NVidia

By Scott M. Fulton, III, BetaNews

February 16, 2007, 2:40 PM

Earlier this week, engineers at nVidia put the finishing touches on version 0.8 of its Compute Unified Device Architecture system for Windows and Red Hat Linux. CUDA's objective is to enable C programmers to utilize the high-throughput pipelining architecture of an nVidia graphics processor - pipelines that are typically reserved for high-quality 3D rendering, but which often sit unused by everyday applications - for compute-intensive tasks that may have nothing to do with graphics.

Today, the company announced its first C compiler, part of the CUDA SDK, which will for the first time enable scientific application developers to build stand-alone libraries that are executed by the graphics processor, through function calls placed in standard applications run on the central processor.

NVidia's objective is to exploit an untapped reservoir on users' desktops and notebooks. While multi-core architecture has driven parallelism in computing into the mainstream, multi-pipeline architecture should theoretically catapult it into the stratosphere. But applications today are naturally written to be executed by the CPU, so any GPU-driven parallelism that's going to happen in programming must be evangelized first.

Which is why the company has chosen now to make its next CUDA push, a few weeks prior to the Game Developers' Conference in San Francisco. The greatest single repository of craftspersons among developers may be in the gaming field, so even though games already occupy the greater part of the GPU's work time, it's here where a concept such as CUDA can attract the most interest.

"The GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about," reads nVidia's latest CUDA programming guide (PDF available here), "and therefore is designed such that more transistors are devoted to data processing rather than data caching and flow control."

Huge arithmetic operations may be best suited to GPU execution, nVidia engineers believe, because they don't require the attention of all the CPU's built-in, microcoded functions for flow control and caching. "Because the same program is executed for each data element," reads the CUDA v. 0.8 guide, "there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches."

For CUDA to actually work, however, a computer must be set up with an exclusive nVidia display driver; CUDA is not an intrinsic part of ForceWare, at least not yet. In addition, programs must be explicitly written to support CUDA's libraries and custom driver; it doesn't enable the GPU to serve as a "supercharger" for existing applications. Because the GPU is such a different machine, there's no way for it to take a load off the CPU's shoulders directly, the way an old Intel 8087 or 80287 co-processor used to do.

So an application that supports CUDA, by definition, supports nVidia. AMD also has its own plans for co-opting GPU power, which it made clear immediately after its acquisition of ATI.

The CUDA programming guide demonstrates how developers can re-imagine a math-intense problem as being delegated to processing elements in a 2D block, like bestowing assignments upon a regiment of soldiers lined up in formation. Blocks and threads are delegated and proportioned, the way they would normally be if they were being instructed to render and shade multi-polygon objects. Memory on board the GPU is then allocated using C-library derivatives of common functions, such as cudaMalloc() for allocating blocks of memory with the proper dimensions, and cudaMemcpy() for transferring data into those blocks. The guide then demonstrates how massive calculations that would require considerable thread allocation on a CPU are handled by the GPU as matrices.
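The block-and-thread decomposition described above can be sketched in plain C. This is a host-only emulation under stated assumptions, not CUDA code and not the guide's own listing: the names matmul_grid, thread_body and BLOCK are hypothetical, malloc() and memcpy() merely stand in for cudaMalloc() and cudaMemcpy(), and the nested loops run one after another where a GPU would run each (block, thread) pair concurrently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 2  /* hypothetical block edge: BLOCK x BLOCK "threads" per block */

/* Work done by one "thread": a single element c[row][col] of an n x n product. */
static void thread_body(const float *a, const float *b, float *c,
                        int n, int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)
        sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
}

/* Emulates launching a 2D grid of 2D blocks over an n x n matrix product.
 * n must be a positive multiple of BLOCK. */
void matmul_grid(const float *host_a, const float *host_b, float *host_c, int n)
{
    assert(n > 0 && n % BLOCK == 0);
    size_t bytes = (size_t)n * (size_t)n * sizeof(float);

    /* Stand-ins for cudaMalloc(): separate "device" buffers. */
    float *dev_a = malloc(bytes);
    float *dev_b = malloc(bytes);
    float *dev_c = malloc(bytes);
    assert(dev_a && dev_b && dev_c);

    /* Stand-ins for cudaMemcpy(), host to device. */
    memcpy(dev_a, host_a, bytes);
    memcpy(dev_b, host_b, bytes);

    /* Outer two loops walk the blocks, inner two walk the threads in a block.
     * Each (block, thread) pair maps to exactly one output element. */
    int blocks = n / BLOCK;
    for (int by = 0; by < blocks; ++by)
        for (int bx = 0; bx < blocks; ++bx)
            for (int ty = 0; ty < BLOCK; ++ty)
                for (int tx = 0; tx < BLOCK; ++tx)
                    thread_body(dev_a, dev_b, dev_c, n,
                                by * BLOCK + ty, bx * BLOCK + tx);

    /* Stand-in for cudaMemcpy(), device to host. */
    memcpy(host_c, dev_c, bytes);
    free(dev_a);
    free(dev_b);
    free(dev_c);
}
```

Calling matmul_grid(a, b, c, 4) on three 16-element float arrays fills c with the product of a and b. The point of the sketch is the index arithmetic: no "thread" depends on another's result, which is what lets the hardware run them side by side.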

"This complete development environment," read an nVidia statement this morning, "gives developers the tools they need to solve new problems in computation-intensive applications such as product design, data analysis, technical computing, and game physics."

Add a Comment (20 Comments)

By foxfyre

edited Feb 17, 2007 - 7:54 PM

I would be curious to know how this is similar to, or how it differs from, such structures as AltiVec, the agile (2-128 bit) vector calculation unit (matrix) employed in the PowerPC, but never really exploited.

But with the expense of some of the high-end GPUs, is this really the most cost-effective way to implement such a solution? Or would incorporating such technology into a co-processor, similar to what was done in AltiVec, thus removing some of the contention issues, make more sense?

If ATI is working on a similar solution, we may find out with AMD.

It sure would be nice if the x86 world would take advantage of technology such as what AltiVec offered...

Score: 0

By holy1661

edited Feb 19, 2007 - 12:28 PM

Here is ATI's Folding at home page...

http://folding.stanford.edu/FAQ-ATI.html

Neat O

Score: 0

By foxfyre

posted Feb 19, 2007 - 3:19 PM

Thanks.

Score: 0

By 8800GTXer

posted Feb 16, 2007 - 3:20 PM

You mean my AMD64 FX-60 OCed at 3GHz will do less of the work and my (2) 8800GTX-SLI cards I've got will actually do something other than sitting idle most of the time?? Hell I hope so...

Vista isn't that big of a deal with its flashy interface.. I want some damn games that will make use of the (2) 8800GTX cards in SLI mode..

Score: 0

By AshG

posted Feb 17, 2007 - 1:30 PM

If Folding can make use of this type of programming in some way, then you are going to have one monster number-crunching machine.

As for games.... The hardware is almost always a year ahead of the bulk of games for it. When you upgrade to the next-gen set of SLI cards, the bulk of games for what you have will just be hitting the scene.

Nice sounding rig, you must be enjoying yourself greatly.

Score: 0

By the artist

posted Feb 16, 2007 - 5:19 PM

"Oh, excuse me, I'm Richie Rich Super Computer Guy"
I wish we were friends so I could borrow your billion dollar Ferrari too.

Score: 0

By quicken123

edited Feb 17, 2007 - 3:17 AM

Well said.

Score: 0

By Fickleflame

edited Feb 16, 2007 - 4:16 PM

You either have a perverse sense of task-specific hardware configurations, or just way too much money.

This isn't so much about offloading the CPU, but rather using the GPU to enhance what the CPU is doing. It also has little to do with Vista's interface. You did see the reference to Red Hat in the article?

Now, how this would apply to a real world scenario... You could have a scientific application doing mathematical analysis processed by the CPU. Then in parallel the GPU could take the output and convert it to a usable form for use with Excel. You're not offloading the CPU, but just using the GPU to do a task that would normally eat up cycles on the CPU.

So 8800GTXer is still wasting power with his (2) 8800GTX cards in SLI mode because someone with an academic project would probably never get within 10 feet of his overpriced rig.

Score: 0

By Grazer

posted Feb 16, 2007 - 7:03 PM

"You could have a scientific application doing mathematical analysis processed by the CPU. Then in parallel the GPU could take the output and convert it to a usable form for use with Excel."

I think it would be the other way around. Despite the multicore trend in recent CPUs they are still horribly serial compared to any modern GPU. It is much more likely that the GPU is running the same calculation on an extremely large number of data points and that the CPU is then spitting the results to a file or using the results in other sequential tasks.

A simple example is that CUDA probably makes the addition of N numbers possible in log(N) time now...if not faster.
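The log(N) figure comes from pairwise (tree) reduction: each pass adds elements in pairs, so the number of passes grows with the logarithm of N, provided enough processing elements are available to perform all the additions within a pass at once. The C sketch below is only an illustration under that assumption. The function name tree_sum is hypothetical, it is not taken from the CUDA SDK, and on a CPU the inner loop still runs sequentially, so the total work remains N-1 additions.

```c
#include <assert.h>
#include <stddef.h>

/* Sums data[0..n-1] in place by pairwise (tree) reduction and returns the
 * total. If steps is non-NULL it receives the number of passes, which is
 * ceil(log2(n)) for n >= 1. The input array is overwritten. */
long tree_sum(long *data, size_t n, unsigned *steps)
{
    assert(data != NULL || n == 0);
    unsigned passes = 0;

    if (n == 0) {
        if (steps) *steps = 0;
        return 0;
    }

    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Every iteration of this inner loop touches a different pair, so
         * the iterations are independent of one another. On a GPU each
         * could be handled by its own thread within the same pass. */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];
        ++passes;
    }

    if (steps) *steps = passes;
    return data[0];
}
```

For eight inputs the outer loop runs three times (strides 1, 2 and 4) rather than the seven dependent steps of a running total.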

Score: 0

By Fickleflame

posted Feb 17, 2007 - 12:37 AM

Very true. Thanks for the correction.

Score: 0

By PC_Tool

posted Feb 16, 2007 - 3:51 PM

How many folks build an SLI system for Excel Document Processing?

I'd hazard to guess that most SLI systems are used for gaming a majority of the time.

You may be an exception to that rule, but...

Score: 0

By foxfyre

posted Feb 16, 2007 - 5:03 PM

You joke about that now...but remember you are referring to WINDOWS!

Score: 0

By PC_Tool

posted Feb 17, 2007 - 12:18 PM

I also know that that's got nothing to do with it.

But hey... Flame on.

Score: 0

By foxfyre

edited Feb 19, 2007 - 3:16 PM

I will go slow for the business and technologically challenged...

Thus far, Windows is the only OS where issues of resources are a fundamental and limiting condition...

It has nothing to do with flaming anyone. But it does have a relationship to one's knowledge of the market and the current issues with PCs, Vista and having a machine with sufficient resources necessary to run it.

I realize that you are standing still, but need we go even slower?

Is there a phrase for "Dumb on"?
It would certainly be appropriate for you.

Score: 0

By PC_Tool

posted Feb 19, 2007 - 4:36 PM

Truly, good sir, your arrogance knows no bounds.

Obviously, you don't use Apple computers.

Photoshop CS2 system requirements:

Windows
* 320MB of RAM (384MB recommended)
* 650MB of available hard-disk space
Macintosh
* 320MB of RAM (384MB recommended)
* 750MB of available hard-disk space

Pretty damn similar. If OS overhead in resources is so extensive in Windows, the Mac version must be insanely bloated. I suppose, if one were so inclined, it would also explain the additional 100MB of Hard disk space required.

But, hey...what do I know.

Score: 0

By foxfyre

posted Feb 20, 2007 - 5:46 AM

What do you know? A very apropos question!
LMAO!

Once again you exceed our expectations as you again demonstrate your stupidity!

So just what in hell do the resources necessary to operate Photoshop have to do with the basic resources required to load and operate the OS?

I guess we should offer remedial courses for fools such as yourself that confuse an application such as Photoshop with an operating system such as Windows Vista.

Of course, I suspect you are off looking for Adobe's sales figures in order to determine Vista adoption rates too.

It is a treat to watch such a clown such as yourself confuse the resources necessary to load and operate an OS with that of an application!

Newsflash! PCDroll confuses an application with an OS.

"But, hey...what do (YOU) know"?

I trust that was a rhetorical question!

Well, we all now know that you are clueless as to the distinction between an application and an operating system!

NOTHING gets by you!

ROFLMAO!!!!!!!!!!!!

Score: 0

By PC_Tool

posted Feb 20, 2007 - 9:20 AM

Alright. You don't get it.

That's fine.

"It is a treat to watch such a clown such as yourself confuse the resources necessary to load and operate an OS with that of an application!"

Glad you can find your own stupidity so very entertaining. You do know that the reference to 320MB refers to the total system RAM, and not just what is dedicated to the application, right? Ya know, the amount of RAM needed to run both the system *and* the App?

Yeah, the fact that they both (Mac and Windows) require the same amount of RAM to do the same job says *nothing* about the OS, eh?

Score: 0

By bourgeoisdude

posted Feb 16, 2007 - 2:51 PM

WOOHOO! C is still out there!!!

It was the only language I really enjoyed writing programs in--except maybe batch files, but that's a little different :)

Score: 0

By Red_Vader

edited Feb 16, 2007 - 7:22 PM

Alive and kicking. Heck, those of us who are waiting to use CUDA are still writing things in FORTRAN (about 4x faster than C when dealing with multidimensional arrays)

Score: 0

By Grazer

posted Feb 16, 2007 - 6:58 PM

"FORTRAN (about 4x faster than C)"

In my experience it really depends more on the compiler and programmer than it does on the language. For instance, the memory layout of multidimensional arrays in Fortran and C is completely inverted: Fortran stores them column by column, C row by row. You would not want to access elements of the same large array in the exact same way in Fortran and C; it would work wonderfully in one, and horribly in the other.
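The layout difference can be made concrete with a little index arithmetic. The C sketch below is an illustration only; the helper names offset_row_major, offset_col_major and sum_c_order are hypothetical. It computes where element (i, j) of a rows x cols array sits in memory under each convention, then sums an array in the order that is contiguous for C.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) in a rows x cols array stored row-major,
 * which is how C lays out a two-dimensional array. */
size_t offset_row_major(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Offset of the same element stored column-major, which is how Fortran
 * lays out a two-dimensional array (zero-based indices used here). */
size_t offset_col_major(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return j * rows + i;
}

/* Sums a row-major array with the last index varying fastest, so that
 * consecutive iterations read consecutive memory locations. */
double sum_c_order(const double *a, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += a[offset_row_major(i, j, rows, cols)];
    return s;
}
```

In a 3 x 4 array, stepping from (0, 0) to (0, 1) moves one element in C but three elements in Fortran. A loop nest that is contiguous in one language therefore strides through memory in the other unless the loops are swapped.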

Score: 0