Some gory guts of Geforce GTX 470/480 explained
Not easily satisfied with what's printed in architecture briefs or reviewer's guides, PC Games Hardware has been nagging Nvidia with questions about its new Fermi architecture and the just-launched Geforce GTX 480 and 470.
To accompany the launch of Nvidia's new Geforce GTX 470/480 (see our test), we took the opportunity to fire some questions at Nvidia regarding architectural details that were not part of the officially supplied information.
Here's what the Californian guys had to say about the guts of their latest babies!
What's the official number of transistors? 3 billion was announced, 3.2 billion made it into the rumor mill and "more than 3 billion" is given in the papers. Well, it turns out that 3 billion is the number of little transistors included in each and every GF100 GPU.
Clock Domains: Nvidia states a 700/607 MHz engine clock for the GTX 480 and 470 respectively. But which parts of the engine run at this speed? Does this also include the ROPs and the L2 cache, or do those run in another clock domain? The answer we got was that the engine clock refers to all key units except for the shader core itself, which runs at the so-called hot clock (1400 MHz for GTX 480 and 1215 MHz for GTX 470). Also not included is the memory interface, but unfortunately Nvidia did not make it clear whether that means only the controllers or also the attached units like the L2 cache or the Octo-ROPs.
Regarding Load/Store units: As we've learnt, there are 16 of them per shader multiprocessor, yielding a convenient 256 for the whole chip and 240 for a GTX 480. Operating on atomic values, they'd seem a perfect fit to double as texture fetch units. But as Nvidia told us, that's not the case, because LD/ST is separate from the texture fetch path. The Load/Store units use a different path: through the L1, then the L2, then the framebuffer. The texture units first look in the texture cache (there's 12k each), then the L2 cache, then framebuffer (FB) memory.
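The unit counts above are simple multiplication, which a few lines of Python can make explicit. This is only a sketch: the 16 units per multiprocessor and the 15-multiprocessor GTX 480 come from the answers in this article, while the 16-multiprocessor full chip is inferred from the 256 figure, and the function and constant names are ours.

```python
LDST_UNITS_PER_SM = 16  # per shader multiprocessor, as stated by Nvidia

FULL_GF100_SMS = 16     # assumption: inferred from 256 units / 16 per SM
GTX_480_SMS = 15        # the 15-SM configuration mentioned for the GTX 480


def ldst_units(active_sms: int) -> int:
    """Total load/store units on a chip with the given number of active SMs."""
    return active_sms * LDST_UNITS_PER_SM


print(ldst_units(FULL_GF100_SMS))  # 256
print(ldst_units(GTX_480_SMS))     # 240
```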
L1 or Shared Memory? Does it remain valid that in "Geforce mode" the Shared Memory/L1 split is 48k/16k? Well, on this question we have to stick with the information we already had, because Nvidia only told us that GeForce products support both configurations, 48/16 and 16/48. What we actually wanted to know was which one is the default. Our older information was that in graphics mode the larger part, i.e. 48k, is devoted to Shared Memory, because the graphics kernels from the driver compiler consume memory in a regular rather than a random pattern and are thus a better fit for a Shared Memory configuration.
FMAs and all are fine, but what about the more transcendental functions? Nvidia's new architecture has four transcendental units per 32 ALUs instead of two per eight in the previous generation. Are those operations becoming less important? Are the units still used for texture interpolation?
According to Nvidia, in the Fermi architecture the transcendental ops and the texture interpolation hardware are actually separated now (guess you didn't know that). There are now four transcendental units per 32 ALUs (that you knew), i.e. one per eight ALUs instead of one per four, which is a 2:1 ratio change versus the previous generation. Nvidia said they felt this was a reasonable balance, given that the more decoupled nature of the Fermi design would allow them to use the units more efficiently. Whether separating texture interpolation from the transcendental units means it is now carried out in the normal shader ALUs, as AMD does, or whether there's extra hardware involved, we cannot tell just yet.
The longer your register file, the more the girls will like you! - True? Compared to GT200, fewer registers are available per individual core in the GF100/Fermi architecture. But don't graphics and CUDA programs tend to get longer and consume more register space, we wondered?
Nvidia says they've been looking across a variety of workloads, including long-running programs, and are pretty satisfied with the ratio of floating-point units to register space (FP:RF) in Fermi. In general, they said, they find the scalar architecture very helpful in minimizing RF requirements, and the addition of the L1 cache in Fermi improves spill performance. A spill happens when the register file is full and the data has to be stored somewhere else, and that's where the new L1 cache comes in. This effectively gives the architecture a capacity amplifier.
I(s)N'T it just a waste of space? Each core in Fermi consists of an FP and an INT pipeline. Does the INT pipeline get used at all in gaming/graphics mode? Maybe via Compute Shader or PhysX? If so, what would be examples where it's not sitting idle while gaming? Nvidia said they do see some use of the INT pipeline; for example, compare operations are fairly common and run on the integer unit.
Is cache mandatory for all units? Since Nvidia has implemented a fully coherent caching hierarchy in the GF100/Fermi architecture, we wondered whether the individual units are able to fetch directly from global memory at all, or whether they only issue a "fetch" and the memory subsystem decides by itself how best to service each request. Nvidia replied that the caching system supports cache hints, so clients can specify the desired caching behavior on a per-request basis.
What about DP rate? What's the maximum DP rate on GTX 480 and 470? And what's DP at all? It's an acronym for double precision and means that every value now gets 64 bits to represent it instead of the usual 32. That can come in handy with extra-extra large numbers or where extra-extra precise results are needed, such as in astrophysics or engineering. We understood that the Fermi architecture should theoretically do DP at half the SP rate, which would be a great leap, but there were some rumors beforehand that consumer-level products could have their DP throughput limited. Nvidia said that on their consumer GF100-based GPUs (of which Geforce GTX 480 and GTX 470 have just been launched), double precision runs at 1/8th the speed of single precision, unlike on Tesla GPU computing boards, which run it at ½ rate. Tesla is the brand and product name for Nvidia's supercomputing and number-crunching parts.
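To put the 1/8 ratio into numbers, here is a rough back-of-the-envelope sketch in Python. It assumes 32 ALUs per multiprocessor and 15 multiprocessors on the GTX 480, two floating-point operations per FMA per clock, and the 1400 MHz hot clock quoted above. The half-rate figure is purely hypothetical (the same chip at ½ rate) and not the specification of any actual Tesla board; all names are ours.

```python
def peak_gflops(alus: int, hot_clock_mhz: float, dp_ratio: float = 1.0) -> float:
    """Theoretical peak in GFLOPS: ALUs x 2 FLOPs per FMA x clock, scaled by dp_ratio."""
    return alus * 2 * hot_clock_mhz / 1000.0 * dp_ratio


GTX_480_ALUS = 15 * 32   # assumption: 15 SMs x 32 ALUs each
HOT_CLOCK_MHZ = 1400.0   # hot clock as quoted in this article

sp = peak_gflops(GTX_480_ALUS, HOT_CLOCK_MHZ)                 # single precision
dp_geforce = peak_gflops(GTX_480_ALUS, HOT_CLOCK_MHZ, 1 / 8)  # consumer 1/8 rate
dp_half = peak_gflops(GTX_480_ALUS, HOT_CLOCK_MHZ, 1 / 2)     # hypothetical 1/2 rate

print(sp, dp_geforce, dp_half)  # 1344.0 168.0 672.0
```

These are theoretical peaks only; real workloads will land below them.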
Geometry: 700*4 = 648*8? No way! At the presentation of the Geforce part of the architecture at CES, Nvidia stated that Fermi (no Geforce name was given then!) would have 8x the geometry throughput of GT200. How did they arrive at that number? Wasn't GT200 1 tri/clock, whereas Fermi is 4 tris/clock (both theoretical maximums)? Well, Fermi's theoretical peak for tessellated geometry is 4 drawn triangles per clk. Nvidia said it's hard to compare this exactly to GT200, since GT200 does not support tessellation. Considering GT200's lower clock, its lower peak triangle rate (which is only 0.5 drawn triangles per clk; 1 tri per clk is valid only for culled triangles) and its serial geometry processing pipeline, 8x was a reasonable estimate of the performance difference between the architectures. They added that 4 drawn triangles per clock is the theoretical peak, that the tessellated triangle rates actually achieved depend on many factors, including the tessellation program and the nature of the geometry, and that they are not expecting to hit the peak rate. Other sources pointed to 3.2 triangles per clock as a reasonable number for real-life peaks.
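The arithmetic behind the 8x claim can be checked quickly. The sketch below takes the 700 MHz engine clock of the GTX 480 and, as an assumption based on the headline of this question, 648 MHz for the GT200 part; the per-clock triangle rates are the ones Nvidia gave us. Function and variable names are ours.

```python
def mtris_per_second(core_clock_mhz: float, tris_per_clock: float) -> float:
    """Theoretical triangle throughput in millions of triangles per second."""
    return core_clock_mhz * tris_per_clock


fermi_peak = mtris_per_second(700, 4.0)     # GTX 480, 4 drawn tris/clk
fermi_real = mtris_per_second(700, 3.2)     # "real-life peak" from other sources
gt200_drawn = mtris_per_second(648, 0.5)    # GT200, 0.5 drawn tris/clk (648 MHz assumed)
gt200_culled = mtris_per_second(648, 1.0)   # GT200, 1 tri/clk for culled triangles only

print(round(fermi_peak / gt200_drawn, 2))   # 8.64
print(round(fermi_peak / gt200_culled, 2))  # 4.32
print(round(fermi_real / gt200_drawn, 2))   # 6.91
```

So the 8x figure only works out against GT200's drawn-triangle rate; against the culled rate, the advantage on paper is a little over 4x.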
What about the fillrate? Pixel fillrate, that is. Some fillrate tests from 3DMark runs indicated a pixel throughput not worthy of the mighty 48 ROP units a GTX 480 has to do its engine's bidding. Earlier in our conversations with Nvidia, they said a full-blown Fermi chip could have a throughput of 32 pixels per clock in the shader engine and 256 z-samples if the data is compressible. But how does this change with actual products like GTX 480 and GTX 470? According to Nvidia, this throughput can change at either the GPC or the SM level. A 15-SM configuration like the GTX 480 would be limited to 30 pixels per clock due to the SM count, for example.
The raw ROP throughput of 48 seems to be higher than the maximum number of pixels the shader engine can supply (max. 32). What's Nvidia's take on this? They said the ROP throughput is sample-based, whereas the shader engines are pixel-based [note that a pixel can have multiple samples of z (for depth information)]. This is important for AA rendering, where complex scenes will have significant portions that are uncompressed. For example, in 8xAA the peak GPC output rate is 32*8 = 256 samples per clk, whereas the peak ROP rate is 48 samples per clk. Improved performance in AA rendering was the main objective for the increased ROP horsepower.
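The pixel and sample numbers from the last two answers can be put side by side in a small Python sketch. The per-SM rate of 2 pixels per clock is inferred from "15 SMs give 30 pixels per clock"; the per-GPC rate of 8 pixels per clock assumes four GPCs sharing the 32 pixels per clock of the full chip, which Nvidia did not spell out in its reply. All names are ours.

```python
PIXELS_PER_SM_PER_CLK = 2    # assumption: inferred from 15 SMs -> 30 pixels/clk
PIXELS_PER_GPC_PER_CLK = 8   # assumption: 32 pixels/clk spread over 4 GPCs
ROP_SAMPLES_PER_CLK = 48     # peak ROP rate of the GTX 480, in samples


def shader_pixels_per_clk(gpcs: int, sms: int) -> int:
    """Pixel output is capped by whichever is lower: the GPC or the SM limit."""
    return min(gpcs * PIXELS_PER_GPC_PER_CLK, sms * PIXELS_PER_SM_PER_CLK)


def samples_per_clk(pixels_per_clk: int, aa_samples: int) -> int:
    """Each pixel carries aa_samples samples when rendering with anti-aliasing."""
    return pixels_per_clk * aa_samples


full_chip = shader_pixels_per_clk(gpcs=4, sms=16)  # 32 pixels/clk
gtx_480 = shader_pixels_per_clk(gpcs=4, sms=15)    # 30 pixels/clk

print(samples_per_clk(full_chip, 8))  # 256 samples/clk at 8xAA
print(samples_per_clk(gtx_480, 8))    # 240 samples/clk at 8xAA
print(ROP_SAMPLES_PER_CLK)            # 48 samples/clk the ROPs can take
```

Counted in samples rather than pixels, the shader engines can hand over far more than the 48 ROPs can absorb per clock once AA data is uncompressed, which is the point of Nvidia's answer.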
Note: This Q&A was done via email and reproduces passages from Nvidia's reply directly where we couldn't find better words to describe things. Thanks a lot to the guys who bore with us and answered our dull questions!