Graphics Processing Unit

    Massive graphical performance. One tiny chip.

    A GPU is the common term for the piece of a computer's or console's hardware that is dedicated to drawing things -- i.e., graphics. The term "GPU" was coined by nVidia upon the launch of their GeForce line of hardware. This was largely a marketing stunt, though the GeForce did have some fairly advanced processing features. The term has since become the accepted shorthand for any graphics processing chip, even pre-GeForce ones.

    The general purpose of a GPU is to relieve the CPU of responsibility for a significant portion of rendering. This has the double benefit of freeing up CPU time for other work while handing the rendering to a chip designed specifically for that task.

    Both consoles and regular computers have had various kinds of GPUs. The two platforms started with divergent kinds of 2D GPUs, but their designs converged with the advent of 3D rendering.

    Console 2D GPU

    This kind of GPU, pioneered by the TMS 9918/9928 (see below) and popularized by the NES, Sega Master System and Sega Genesis, forces a particular kind of look onto the games that use it. You know this look: everything is composed of a series of small images, tiles, that are used in various configurations to build the world.

    This enforcement was a necessity of the times. Processing power was limited, and while tile-based graphics were somewhat limited in scope, they were far superior to what could be done without this kind of GPU.

    With this kind of GPU, the tilemaps and the sprites are all composited into the final image by the GPU hardware itself. This drastically reduces the amount of processing power needed -- all the CPU has to do is upload new parts of the tilemaps as the player scrolls around, adjust the scroll position of the tilemaps, and say where the sprites go.
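
    As a rough illustration, here is a minimal CPU-side sketch in C of what such hardware does for each background pixel. The tile size, map dimensions, and function names are made up for the example and don't come from any particular console.

    #include <stdint.h>

    /* Illustrative model of tile-based background hardware: the chip reads
       the tilemap entry under each pixel, then the matching row of that
       tile's pixel data, offset by the current scroll registers. */

    #define TILE_SIZE  8    /* 8x8-pixel tiles, typical of the era */
    #define MAP_W      32   /* tilemap is 32x32 entries            */
    #define MAP_H      32

    uint8_t background_pixel(const uint8_t tilemap[MAP_H][MAP_W],
                             const uint8_t tiles[][TILE_SIZE][TILE_SIZE],
                             int screen_x, int screen_y,
                             int scroll_x, int scroll_y)
    {
        int x = (screen_x + scroll_x) % (MAP_W * TILE_SIZE);
        int y = (screen_y + scroll_y) % (MAP_H * TILE_SIZE);

        uint8_t tile = tilemap[y / TILE_SIZE][x / TILE_SIZE];
        return tiles[tile][y % TILE_SIZE][x % TILE_SIZE];   /* color index */
    }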

    Computer 2D GPU

    Computers had different needs. Computer 2D rendering was driven by the needs of applications more than games, so rendering needed to be fairly generic. Such hardware had a framebuffer, an image representing what the user sees, plus extra video memory for storing additional images.

    Such hardware had fast routines for drawing colored rectangles and lines. But the most useful operation was the blit, or BitBlt: a fast video memory copy. Combined with video memory, a program could store an image in VRAM and copy it to the framebuffer as needed. Some advanced 2D hardware had scaled blits (so the destination region could be larger or smaller than the source image) and other special blit features.
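
    As a sketch of what a blit amounts to, here is a plain-C software version; real 2D hardware performed this copy itself, without CPU involvement. The names and the one-byte-per-pixel assumption are purely illustrative.

    #include <stdint.h>
    #include <string.h>

    /* Software model of a BitBlt: copy a rectangle of pixels from a source
       image (e.g. one stored in video memory) into the framebuffer.
       "pitch" is the number of bytes per row of each buffer. */
    void blit(const uint8_t *src, int src_pitch,
              uint8_t *dst, int dst_pitch,
              int dst_x, int dst_y, int width, int height)
    {
        for (int row = 0; row < height; row++) {
            memcpy(dst + (size_t)(dst_y + row) * dst_pitch + dst_x,
                   src + (size_t)row * src_pitch,
                   (size_t)width);
        }
    }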

    The CPU is far more involved in this case: every element must be explicitly drawn by a CPU command, and the background is generally the most complicated part. This is why many early computer games used a static background -- a single background image sat in video memory and was blitted to the framebuffer each frame, with a few sprites drawn on top of it.

    Later PC games before the 3D era managed to equal or exceed the best contemporary consoles like the Super NES, both through raw power (the 80486DX2/66, a common gaming processor of the early 90s, ran at 66 MHz, almost 10 times the clock speed of the Sega Genesis, and could run 32-bit code as an "extension" to 16-bit DOS) and through various programming tricks that took advantage of quirks in the way early PCs and VGA worked. John Carmack once described the engine underpinning his company's breakout hit Wolfenstein 3D as "a collection of hacks", and he was not too far off. (It was also the last of their games that could not only run, but was comfortably playable, on an 80286 PC with 1 MB RAM -- a machine considered low-end even in 1992 -- which is a testament to the efficiency of some of those hacks.)

    Before the rise of Windows in the mid-1990s, most PC games couldn't take advantage of newer graphics cards with hardware blitting support; the CPU had to do all the work, and this made both a fast CPU and a fast path to the video RAM essential. PCs with local-bus video and 80486 processors were a must for games like Doom and Heretic; playing them on an old 386 with ISA video was possible, but wouldn't be very fun.

    Basic 3D GPU

    The basic 3D GPU is much more complicated, but it is far less limiting than the NES-style 2D GPU.

    This GPU concerns itself with drawing triangles -- specifically, meshes of triangles that together approximate the shapes of objects. It has special hardware that allows the user to map images across the surface of a triangular mesh to give it surface detail. An image applied in this fashion is called a texture.

    The early forms of this GPU were just triangle/texture renderers. The CPU had to position each triangle properly each frame. Later forms, like the first GeForce chip, incorporated triangle transform and lighting into the hardware. This allowed the CPU to say, "here's a bunch of triangles; render them," and then go do something else while they were rendered.
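
    To give a feel for what "transform" means here, the sketch below shows the per-vertex matrix multiply that hardware T&L took over from the CPU; the structure and names are illustrative only.

    /* Each vertex is multiplied by a 4x4 matrix combining the object's
       position, the camera, and the projection. Pre-GeForce, the CPU ran
       this for every vertex of every triangle, every frame. */
    typedef struct { float x, y, z, w; } Vec4;

    Vec4 transform_vertex(const float m[4][4], Vec4 v)
    {
        Vec4 r;
        r.x = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + m[0][3]*v.w;
        r.y = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + m[1][3]*v.w;
        r.z = m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + m[2][3]*v.w;
        r.w = m[3][0]*v.x + m[3][1]*v.y + m[3][2]*v.z + m[3][3]*v.w;
        return r;
    }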

    Modern 3D GPU

    Around the time of the GeForce 3, something fundamental happened in GPU design: the fixed rendering pipeline started to become programmable.

    Take the application of textures to a polygon. The earliest 3D GPUs used a very simple fixed function. For each pixel of a triangle:

    color = textureColor * lightColor

    A simple equation. But then, developers wanted to apply 2 textures to a triangle. So this function became more complex:

    color = texture1 * lightColor * texture2

    Interesting though this may be, developers wanted more say in how the textures were combined. That is, they wanted to insert more general math into the process. So GPU makers kept adding switches and special cases to the fixed pipeline.

    The GeForce 3 basically decided to say "Screw that!" and let the developers do arbitrary stuff:

    color = Write it Yourself!

    What used to be a simple function had now become a user-written program. The program took texture colors and could do fairly arbitrary computations with them.
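
    As a sketch of the difference, here are the two approaches written as plain C functions. Real shaders are written in dedicated languages (HLSL, GLSL, Cg) and run once per pixel on the GPU; everything below, including the blend example, is made up for illustration.

    typedef struct { float r, g, b; } Color;

    static Color modulate(Color a, Color b)   /* component-wise multiply */
    {
        Color c = { a.r * b.r, a.g * b.g, a.b * b.b };
        return c;
    }

    /* Fixed function: color = texture1 * lightColor * texture2 */
    Color fixed_function_pixel(Color tex1, Color tex2, Color light)
    {
        return modulate(modulate(tex1, light), tex2);
    }

    /* "Write it yourself": arbitrary math, e.g. blend the two textures by
       some factor before applying the light. */
    Color custom_pixel(Color tex1, Color tex2, Color light, float blend)
    {
        Color mix = { tex1.r + (tex2.r - tex1.r) * blend,
                      tex1.g + (tex2.g - tex1.g) * blend,
                      tex1.b + (tex2.b - tex1.b) * blend };
        return modulate(mix, light);
    }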

    In the early days, "fairly arbitrary computations" meant something quite limited. Nowadays, not so much. These GPU programs, called shaders, commonly handle things like video decompression and other sundry activities. This flexibility also gave rise to the General-Purpose GPU (GPGPU), where people exploit the GPU's massive calculation performance to do work that would take a CPU much longer.

    Difference between GPU and CPU

    GPUs and CPUs are built around some of the same general components, but they're put together in very different ways. A chip only has a limited amount of space to put circuits on, and GPUs and CPUs use the available space in different ways. The differences can be briefly summarized as follows:

    • Execution units: These do the actual "work" -- adds, multiplies, and so on. A GPU has dozens of times as many of these as a CPU, so it can do a great deal more total work in a given amount of time, provided there's enough work to keep them all busy.
    • Control units: These read instructions and tell the execution units what to do. CPUs devote far more of their circuitry to these than GPUs do, so they can execute individual instruction streams in more complicated ways (out-of-order execution, speculative execution, etc.), leading to much greater performance for each individual instruction stream.
    • Storage: GPUs have vastly more fast storage (registers) than CPUs, but comparatively little cache. This means they're extremely fast on workloads that fit in their registers, but if more data is required, latency shoots through the roof.

    In the end, CPUs can execute a wide variety of programs at acceptable speed. GPUs can execute certain special types of programs far faster than a CPU, but anything else they run much slower, if they can run it at all.

    The Future

    GPUs today can execute a lot of programs that formerly only CPUs could, but with radically different performance characteristics. A typical home GPU can run hundreds of threads at once, while a typical mid-range home CPU can run 4-16 threads. On the other hand, each GPU thread progresses far more slowly than a CPU thread. Thus, if you have thousands of almost identical tasks to run at once -- the pixels in a graphical scene, or the objects in a game with physics -- a GPU might be able to do the work a hundred times faster than a CPU. But if you only have a few things to do and they have to happen in sequence, a CPU-style architecture will give vastly better performance. As general-purpose GPU programming progresses, GPUs may get used for more and more things until they're nearly as indispensable as CPUs. Indeed, for some tasks a strong GPU is already required: from the late 2010s on, more and more consumer devices have included powerful GPUs for things other than gaming, often related to machine learning.
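
    The kind of workload that suits a GPU can be pictured as a loop whose iterations are completely independent, as in the (hypothetical) C example below; on a GPU, each iteration would effectively run as its own thread, while a CPU has to churn through them a handful at a time.

    #include <stddef.h>

    /* Thousands of identical, independent operations: ideal GPU material.
       No iteration reads anything written by another iteration. */
    void scale_and_add(float *out, const float *a, const float *b,
                       float k, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            out[i] = k * a[i] + b[i];
        }
    }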

    Some notable GPUs over the years:

    1970s

    Motorola 6845 (1977)

    Not actually a GPU on its own, but an important building block of one. The 6845 was a CRT controller, a chip that generated timing signals for the display and video memory. It was originally meant for dumb terminals (and has a few features that would only have made sense on a 1970s terminal, such as light-pen support), but since it didn't impose any restrictions on memory or resolution other than the size of its counters, it was very, very tweakable. This chip ended up being the basis of all of IBM's display adapters and their follow-ons and clones, and still exists in modern VGA cards in one form or another. It was also used by a few non-IBM machines such as the Commodore CBM range. A related Motorola chip with built-in character and graphics generation, the 6847 Video Display Generator, was used in the TRS-80 Color Computer.


    Texas Instruments 9918/9928 (1979)

    The first GPUs to implement tile-based rendering, the 9918 and 9928 were originally designed for and introduced with TI's 99/4 home computer in 1979. While the 99/4 was a flop, the 9918 and 9928 were far more successful, being used in some of the most innovative consoles of the early 1980s. Sega in particular made good use of the 99x8, and the GPUs in the Sega Master System and Sega Genesis are improved versions of it. Also, the NES's PPU is an improved (but incompatible) 9918 workalike, with a 64-color master palette (some entries of which are duplicates, making the usable number on NTSC somewhere around 55) and more sprites available at once (64 of them, at either 8×8 or 8×16).

    The 9918 had its own VRAM bus, and the NES in particular wired the VRAM bus to the cartridge slot, making it so that the tiles were always available for use. The NES had only 2 kilobytes of VRAM; since the tiles were usually in the game cartridge's CHR-ROM, the VRAM was only needed for tile maps and sprite maps. Other consoles (including Sega's machines and the SNES) stuck with the more traditional 9918-esque setup and required all access to the GPU to go through the main bus; however, the SNES and the Genesis can use DMA for this, something the NES didn't have.


    Atari ANTIC/CTIA/GTIA (1979)

    The first programmable home-computer GPU. ANTIC was ahead of its time; it was a full microprocessor with its own instruction set and direct access to system memory, much like the blitter in the Amiga 6 years later (which, not coincidentally, was designed by the same person). By tweaking its "display list" or instruction queue, some very wild special effects were possible, including smooth animation and 3D effects. CTIA and GTIA provided up to 128 or 256 colors, respectively, a huge number for the time.

    1980s

    IBM Monochrome Display Adapter and Color Graphics Adapter (1981)

    These were the GPUs offered with the original IBM PC. The MDA quickly gained a very good reputation for its crisp, high-resolution text display, but lacked the ability to render graphics of any sort. For that you needed the CGA adapter -- which only supported four simultaneous colors in its low-resolution graphics mode, drawn from a couple of freaky, unnatural-looking fixed palettes. This ensured that early PCs had a pretty dismal reputation for graphics work, and was a major factor in the Macintosh and the Amiga taking over that market. The CGA was superseded by the Enhanced Graphics Adapter (EGA) in 1984, and while that chip was at least usable for graphics work, it was still very limited; it wouldn't be until 1987 that the PC finally got an affordable, world-class color GPU.


    Hercules Graphics Card (1982)

    You may have worked out from the summary above that early PC users had something of a dilemma with their graphics card choices -- either the "excellent text display but no real graphics" of the MDA or the "mediocre text display but with actual graphics" of the CGA. Bear in mind that very little DOS software could make use of two graphics cards in the same system, so in practice you had to choose one or the other. Hercules decided to Take a Third Option and produced their own card, which emulated the MDA's text mode and added a high-resolution monochrome graphics mode of its own, making it suitable for basic graphics work. This made the Hercules a hugely popular card, and the most popular solution for PC users for about five years.


    IBM Professional Graphics Controller (1985)

    This may well be the earliest version of a modern-day GPU (rather than just a display adapter), supporting full 2D hardware acceleration, some very basic 3D capabilities, along with display resolutions that wouldn't be equaled until the arrival of the Super VGA standard four years later. In addition, it effectively included a full computer on the card (including an Intel 8088 CPU), making it fully programmable in a way that most other cards wouldn't be until the turn of the century. Unfortunately, there was one very big flaw -- the combined cost of the card and the proprietary IBM monitor that you had to use was $7,000, almost four times the cost of the Commodore Amiga that debuted a couple of months later. You can probably guess which of the two became the must-have solution for graphics professionals.


    IBM Video Graphics Array (1987)

    Introduced with the IBM PS/2 line, the VGA finally gave the IBM-compatible PC graphics abilities that outshone all its immediate competitors. Many noted that it would probably have been a Killer App for the PS/2, had it not been for the Micro Channel Architecture fiasco (more on that elsewhere). In any case, it established most of the standards that are still used for graphics cards to this day, offering a then-massive 640x480 resolution with 16 colors, or 320x200 with an also-massive 256 colors. Subsequent cards included the Super VGA and XGA cards, which extended the maximum resolutions to 800x600 and 1024x768 respectively, but they are generally considered to fall under the VGA umbrella.



    1990s

    S3 86C911 (1991)

    One of the most popular 2D accelerators, and the chip that made them matter. Earlier PC graphics chips could display either high-resolution text or lower-resolution graphics, and left all of the actual drawing to the CPU. A 2D accelerator offered a high-resolution graphics mode and could carry out common drawing operations -- blits, fills, lines -- on its own. The chip was named after the Porsche 911, and it delivered on that claim well enough that, by the mid-90s, every PC made had a 2D accelerator.


    nVidia NV1 (1995)

    One of the earliest 3D accelerators on the PC, manufactured and sold as the Diamond Edge 3D expansion card. It was actually more than just a graphics card -- it also provided sound, as well as a port for a Sega Saturn controller. It was revolutionary at the time, but very difficult to program due to its use of quadrilateral rendering (which the Sega Saturn's GPU also used) instead of the triangles used by virtually everything since. It was also very expensive and had poor sound quality. It was eventually killed off when Microsoft released Direct X and standardized triangle rendering. It was supposed to be followed by the NV2, which never made it past the design stage.


    S3 ViRGE (1995)

    The first 3D accelerator chip to achieve any real popularity. While it could easily outmatch CPU-based renderers in simple scenes, usage of things like texture filtering and lighting effects would cause its performance to plummet to the point where it delivered far worse performance than the CPU alone would manage. This led to it being scornfully referred to as a "3D decelerator," though it did help get the market going.


    Rendition Vérité V1000 (1995)

    This chip included hardware geometry setup years before the GeForce. While this was its main marketing point against 3dfx's Voodoo chip, its performance in games was fairly abysmal, and it ended up being used mostly in workstation computers.


    Matrox Mystique (1996)

    Matrox's first venture into the 3D graphics market, notable for being bundled with a version of MechWarrior 2 that utilized the card. It did not perform well against the 3dfx Voodoo, however, and lacked a number of features the competition had. Matrox tried again with the slightly more powerful Mystique 220, but to no avail. The product line was soon labelled the "Matrox Mystake" due to its dismal performance.


    3dfx Voodoo (1996)

    The chip that allowed 3D accelerators to really take off. Its world-class performance at a decent price helped it ascend to market dominance. However, it lacked a 2D graphics chip, meaning owners needed a separate card for essentially any non-gaming task. Fortunately, most users already had a graphics card of some kind, so this wasn't a major issue, and 3dfx cleverly turned it to their advantage by marketing the Voodoo as an add-on that didn't require users to throw their old graphics cards away. 3dfx did produce a single-card version called the Voodoo Rush, but it used a rather cheap and nasty 2D chip that had mediocre visual quality and actually prevented the Voodoo chipset from working optimally. As a result, 3dfx stuck to making 3D-only boards at the high end until 1999's Voodoo3.


    SGI Reality Co-Processor (1996)

    Developed for the Nintendo 64, this GPU brought anti-aliasing and trilinear mipmapping (which reduced the shimmering and pixelation of textures in the distance). Mipmapping wasn't a standard feature on PCs until 1998, and anti-aliasing didn't show up there until 2000, remaining impractical until around 2002.


    nVidia Riva 128 (1997)

    nVidia's first really major foray into the 3D graphics market, and the chip that helped establish nVidia as a major 3D chipmaker. It was one of the first GPUs to fully support Direct X. The Riva 128 performed fairly well, but not well enough to seriously challenge the 3dfx Voodoo. An improved version, the Riva 128 ZX, was released the following year with more onboard memory, a faster memory clock, and full OpenGL support.


    3dfx Voodoo2 (1998)

    Aside from introducing dual-texturing (which would become multi-texturing), the Voodoo2 also introduced the idea of putting two cards together to share the rendering workload. While this was termed SLI, the abbreviation stood for scan-line interleave; nVidia revived the trademark later for a different technique. There was also a derivative board that mashed two of the GPUs into one unit. As with the first Voodoo, 3dfx created a single-card solution, derived from the Voodoo2 and known as the Banshee. This one had the 2D and 3D chips combined into one unit and was actually somewhat successful in the OEM market; while it didn't catch on at the high end, it provided the basis for the following year's Voodoo3.


    Intel i740 (1998)

    First introduced in order to promote Intel's new Accelerated Graphics Port (AGP) slot. However, its 3D performance was so awful[1] that it was shelved within months. It did, however, give Intel enough experience to build graphics into its chipsets as integrated graphics, notably the later GMA series.


    PowerVR Series 2 (1998)

    The GPU that drove the Dreamcast. What set it apart from the others was PowerVR's tile-based rendering and the way it did pixel shading. Tile-based rendering works on only a small subset of the 3D scene at a time, which allows it to get by with narrower, slower memory buses. It also sorted polygons by depth first and then colored them, rather than coloring them and figuring out the depth later. Though the Dreamcast failed, and PowerVR stopped making PC cards as well, this chip is notable in that it showed PowerVR that the true strength of their tile-based rendering was in embedded systems, which led to their Series 4 and 5 lines (see below). One notable feature it had thanks to its tile-based renderer was order-independent transparency, which allows transparent objects to move in front of each other and still look correct; Direct X 11, released in 2009, later made this achievable as a standard feature.


    ATI Rage 128 (1998)

    While ATI were very successful in the OEM market for most of the 1990s, most enthusiasts didn't give them a second thought. The Rage 128 started to change things, by being the first chip to support 32-bit color in both desktop and games, along with hardware DVD decoding. On top of that, ATI offered a variant called the "All-in-Wonder" which integrated a TV tuner. The Rage 128 quickly became popular as a good all-round solution for multimedia-centric users, but its rather average gaming performance (and ATI's then-notoriously bad drivers) meant most high-end gamers looked elsewhere.


    S3 Savage (1998)

    S3's major attempt at breaking into the 3D graphics market. This chip introduced the now industry-standard S3 Texture Compression algorithm, which allowed very large and highly detailed textures to be used even with relatively little video RAM on board. The Savage itself, however, suffered from poor chip yields and buggy drivers, so it never really took off.


    Matrox G400 (1999)

    The first graphics chip which could natively drive two monitors. On top of that the G400 MAX version was the most powerful graphics card around on its release, meaning you could game across two monitors with just one card, something that neither ATi nor nVidia came up with for several more years. Unfortunately, management wrangles meant that Matrox's subsequent product releases were botched, and nothing ever came from the G400's strong positioning.


    nVidia GeForce 256 (1999)

    The first graphics processor to be marketed as a "GPU". What differentiated it from other graphics processors was the inclusion of transform (positioning polygons) and lighting in the hardware itself; prior hardware left those tasks to the CPU. The GeForce 2, an evolution of the 256, introduced two other kinds of graphics cards: budget and professional. The impact of the GeForce 256 was so great that many games up until 2006 required nothing more than its successor, the GeForce 2.


    2000s

    3dfx Voodoo5 (2000)

    What was notable about this card was that, in a desperate attempt to take the performance crown, 3dfx designed the GPU to be scalable. Graphics cards sporting two GPUs were released, and it almost reached four with the Voodoo5 6000, but the company folded before that happened. One major factor in that demise was a sudden rise in the price of RAM chips, and the Voodoo5 needed lots of it (each GPU needed its own bank of RAM; they couldn't share), which would have driven production costs through the roof. On a more positive note, the Voodoo5 was the first PC GPU to support anti-aliasing, though enabling it came with a major performance hit. A single-GPU version called the Voodoo4 was also released, but it got steamrollered by the equally priced, far more powerful GeForce 2 MX that came out at the same time.


    ATi Radeon (2000)

    The Radeon was ATI's first real contender for a well-performing 3D graphics chip, and it started ATI down the road that put it in 3dfx's old place as nVidia's main rival. Notably, it was the first GPU to compress all textures to make better use of memory bandwidth, and it also optimized its rendering speed by only rendering pixels that were visible to the player. Despite being much more technologically elegant than the GeForce 2, it lacked the brute force of its rival, and only outperformed it at high resolutions where neither chip really delivered acceptable performance.


    ATi Flipper (2001)

    This was the GPU for the Nintendo GameCube, and was superficially similar to the GeForce 2 and the original Radeon. Where it stood out was that it integrated a few megabytes of very fast RAM into the graphics processor itself, making it possible to carry out low levels of anti-aliasing with virtually zero performance loss. The Flipper itself was recycled and slightly upgraded to serve in the GameCube's successor, the Wii.


    nVidia nForce (2001)

    While not exactly a GPU, the nForce was a brand of motherboard chipsets with one of NVIDIA's GPUs integrated into it. This meant that motherboards with integrated graphics no longer had abysmal performance in 3D applications, so people who wanted to build entry-level computers with some graphical capability could now do so without a discrete graphics card. The nForce was exclusive to AMD processors until about 2005, when NVIDIA started making Intel versions as well.

    Eventually the nForce became a motherboard chipset without graphics, serving as a host for NVIDIA-exclusive features such as SLI. When a GPU was brought back into the chipset, it also included a new feature known as HybridPower, which let the computer switch to the integrated graphics when not doing 3D-intensive work, to save power. Eventually, NVIDIA dropped out of the chipset business (Intel withdrew permission for NVIDIA to make chipsets for their CPUs, while AMD bought out rival ATI and co-opted their chipset line, leaving NVIDIA quickly abandoned by AMD enthusiasts), but most of their features were licensed or revamped. SLI is now a standard feature on Intel's Z68 chipset, and HybridPower lives on in laptops as Optimus, though rumor has it that NVIDIA is making a desktop version called Synergy.


    Matrox Parhelia (2002)

    Matrox's last major design to date. Like the G400 before it, the Parhelia included a lot of new features, including a 256-bit memory bus (which allows anti-aliasing with a much smaller performance loss) and support for gaming across three displays[2]. While the Parhelia should have been more than a match for the competing GeForce 4 Ti series, Matrox made a serious blunder and neglected to include any sort of memory bandwidth optimization, crippling the Parhelia to the point where it struggled to match the GeForce 3. ATi and nVidia's subsequent efforts implemented the Parhelia's features more competently, and Matrox quickly dropped away to being a fringe player in the market.


    ATi Radeon 9700 (2002)

    What was genuinely stunning about this graphics card was that it supported the new Direct X 9.0 before that standard was even officially released. Not only that: thanks to a critical error on nVidia's part (see below), it was a Curb Stomp Battle against the GeForce FX in any game using Direct X 9 (in particular, Half-Life 2). Moreover, it still offered exceptional performance in older games, thanks to ATi ditching the until-then popular multitexturing designs in favor of a large array of single-texturing processors, which generally offered better performance and remains the standard approach for GPUs to this day. Sweetening the deal was a much more competent execution of the 256-bit memory bus that Matrox had attempted with the Parhelia, finally making anti-aliasing a commonly used feature. This cemented ATi as a serious competitor in graphics processors. A slightly revised version, the Radeon 9800, was released the following year, and ATi also started selling its graphics chips to third-party board makers, ending nVidia's monopoly on that front.


    nVidia GeForce FX (2003)

    After an unimpressive launch with the overheating, under-performing FX 5800 model, the succeeding FX 5900 was on par with ATi's cards in DirectX 7 and 8 games, but nVidia made some ill-advised decisions in implementing the shader processors across the series. Direct3D 9 required a minimum of 24-bit precision in computations, but nVidia's design was optimized around 16-bit math. It could do 32-bit, but only at half speed. nVidia had assumed that developers would write code specifically for its hardware. They didn't, and the result was a card that performed barely half as well as the competing Radeons in Half-Life 2 and Far Cry.

    The aforementioned FX 5800 introduced the idea of GPU coolers that take up a whole extra expansion slot by themselves, now standard on anything above an entry-level card. Unfortunately, nVidia got the execution of that wrong as well, using an undersized fan that constantly ran at full speed and made the card ridiculously loud. This gave way to a more reasonable cooler in the FX 5900, and some fondly remembered Self-Deprecation videos from nVidia. In a bit of irony, the GeForce FX was developed by the team that came from 3dfx, which nVidia had bought a few years earlier.


    nVidia GeForce 6800 (2004)

    Learning from its mistakes, nVidia this time released a series built for the then-latest DirectX 9.0c standard, ensuring it was more than capable of meeting the spec and turning the tables on ATi for a while: the GeForce 6800 Ultra had double the performance of the Radeon 9800 XT at the same $500 price point. This generation also marked the return of linking GPUs together, much like 3dfx's old SLI but now standing for Scalable Link Interface, enabled by the new PCI Express interface that displaced the existing Accelerated Graphics Port. ATi countered with CrossFire. While CrossFire offered more flexibility than SLI (which at the time only worked with identical GPUs), it was clunky to set up -- one needed to buy a master CrossFire card and a funky dongle to connect it to the other card -- and it supported only a very limited set of resolutions (ATi implemented CrossFire much better in the following year's Radeon X1900 family).


    ATi Xenos (2005)

    The GPU that powers the Xbox 360. What set it apart from its competitors was its unified shader architecture. Traditionally, vertex and pixel shading work is done on separate, fixed sets of pipelines. Unified shading lets the GPU assign any number of its pipelines to pixel or geometry work: if one part of a scene needs more pixel output, the GPU allocates more pipelines to it, and vice versa for geometry work. This let it use the same number of transistors more efficiently.


    nVidia GeForce 7950GX2 (2006)

    Refining the SLI technology, nVidia took its second-fastest GPU, built two graphics boards around it, and stacked them together, creating a dual-GPU solution that fit in a single PCI Express slot. They then took it Up to Eleven by allowing two of these cards to work in SLI, creating quad-SLI (although this wasn't officially supported at first). Since then, both nVidia and ATi have introduced at least one graphics card of this type in each subsequent generation. It should be noted, though, that dual-GPU versions of the Voodoo2, ATi Rage 128, and even NVIDIA's own 6600 GT existed earlier, but most of these (except the ATi card) were produced by the board makers, not the chip makers.


    nVidia GeForce 8800 (2006) and ATi Radeon HD 2900 (2007)

    These chips brought the unified shader design to the PC, and they took another step as well: instead of vector units, they incorporated large numbers of scalar processors. A vector unit works on several pieces of data at a time, while a scalar processor works on just one -- but far more scalar processors can be packed into the same space, and they turned out to be more efficient, reaching near-100% utilization on heavy workloads.

    Moreover, since scalar processors are essentially agnostic about the work they perform (you just hand one a datum and an operation; it doesn't care what it's for), this led to another kind of "shader", the compute shader, which handles work that doesn't fit the vertex, geometry, or pixel stages -- physics simulations and the like. This in turn led to the General-Purpose GPU, and both NVIDIA and ATi now make cards aimed purely at computational work. GPUs can now do so much computation that they handle much of the heavy lifting in the supercomputer market, to the point where one can build a machine with supercomputer-class number-crunching for around $2000.

    It should be noted, however, that GPUs are prized only for their raw computational power; they are not very good at decision-making (i.e., logic and control flow).


    Intel Larrabee (2008)

    In 2008, Intel announced it would try its hand at the dedicated graphics market once more, with a radical approach. Traditionally, lighting is approximated with shading techniques applied to each pixel; Intel's pitch was ray tracing, which at the time was a hugely expensive operation. The design itself was based on many small cores derived from the original Pentium architecture, shrunk down to a modern process and modified with graphics-related instructions. A special version of Enemy Territory: Quake Wars was used to demonstrate it. The project was axed as a graphics product in late 2009.

    nVidia tried its hand at "real time" ray tracing with the GeForce GTX 480, using a proprietary API. These attempts would not see wide adoption until 2018, with the release of the GeForce 20 series and its RTX hardware-accelerated ray tracing.


    Intel GMA X3000/X3100 (2007)

    It's not a very good performer, as expected of integrated graphics, but it's notable in that it took Intel eight years after the GeForce 256 for its integrated graphics to even include hardware transform & lighting, a critical feature of Direct X 7 and the defining feature of the GeForce 256 mentioned above. With it, Intel integrated graphics could at least run older games at a decent framerate.

    Note that the earlier GMA 950, while being touted as Direct X 9-compliant and bearing hardware pixel shaders, does not have HT&L or hardware vertex shaders.


    ATi Radeon 5870 (2009)

    This GPU finally allowed ATi to retake the GPU performance lead, having been pretty much continually behind nVidia for the previous five years. More notably however, it included gaming across three displays as a standard feature, and the special EyeFinity edition took things Up to Eleven by supporting gaming across six displays.


    PowerVR Series 5 (2005-2010)

    After several generations of failed attempts to re-enter the PC graphics market, PowerVR entered the embedded systems market. The Series 5 brought exceptional performance for mobile devices, including Apple's iPhone 4 and many Android powered tablets and phones. The second generation, SGXMP, is arguably the first multi-core GPU. A dual-core variant powers the iPad 2 while a quad-core version powers Sony's PS Vita.


    Intel HD Graphics 2000/3000 and AMD Radeon HD 6000D/G (2011)

    While neither of these will set world-record benchmark scores, they have the notable ability to accelerate computational work alongside the processor when needed, which is most commonly used to speed up video transcoding.

    1. Oddly enough, its 2D performance was actually very good, albeit not up to the standard of the Matrox G200, the 2D market leader at the time.
    2. By comparison, ATi and nVidia didn't introduce this feature until 2009 and 2012 respectively.