HollyDOL - Tuesday, September 13, 2016 - link
Ok, so to follow the tradition... can it run Crysis?
It's a nice boost; for server farms the most significant gain will be the jump in performance per watt.
Ryan Smith - Tuesday, September 13, 2016 - link
It can run the AI neural network that has been trained to play Crysis.
HollyDOL - Tuesday, September 13, 2016 - link
Yay, I am sold :-)
extide - Friday, September 16, 2016 - link
Haha, can it run Crysis... it can PLAY Crysis, lol
hahmed330 - Tuesday, September 13, 2016 - link
It can run an AI neural network that can create Crysis.
CiccioB - Tuesday, September 13, 2016 - link
It may be time to change that ancient motto to "can it play Crysis". With such computing power, now is the moment to create an AI able to complete the game on its own.
salimbest83 - Tuesday, September 13, 2016 - link
"how fast can it finish crysis?"BlueScreenJunky - Tuesday, September 13, 2016 - link
So the P40 dissipates 250W of power, you can stick 8 of them together in a server, and they're passively cooled? How does that even work?
Ryan Smith - Tuesday, September 13, 2016 - link
Large, very high RPM server fans. They sound like a jet engine and move about as much air, too.Wardrop - Tuesday, September 13, 2016 - link
Probably cooled by the chassis fans, like most server CPUs with "passive" heatsinks. The applications in which these cards will be used will probably have bespoke chassis.
HollyDOL - Tuesday, September 13, 2016 - link
Back at university there was a cooler for some high-end Alpha server (not the classical flat rack mount)... I'd say roughly 80 cm in diameter, three-phase 400 V power, and I think they told us it drew about 1.5 kW. In the server room it sounded like a very angry blizzard. Blowing off 2 kW of heat wouldn't be hard work for it. Modern cooling will have improved quite a bit since then, but you get the idea about high-performance server cooling ;-)
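(For a rough sense of scale, a back-of-the-envelope airflow estimate, assuming the exhaust air is allowed to run about 15 °C hotter than the intake; these numbers are illustrative, not from the article:
required airflow = P / (air density x specific heat x temperature rise)
= 2000 W / (1.2 kg/m^3 x 1005 J/(kg*K) x 15 K)
≈ 0.11 m^3/s ≈ 400 m^3/h ≈ 235 CFM
That is well within reach of a bank of high-RPM server fans, let alone an 80 cm three-phase blower.)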
CiccioB - Tuesday, September 13, 2016 - link
Wasn't GP100 the only architecture able to do packed calculations smaller than 32 bits? Wasn't it the only one with extra ALUs able to perform those calculations?
So, in the end, did NVIDIA use the same 32-bit ALUs with packed instructions for <32-bit operations?
I'm a bit confused.
Ryan Smith - Tuesday, September 13, 2016 - link
GP100 supports full speed FP32 and double speed FP16.
GP102/104/106 support full speed FP32 and quad speed INT8 (and very, very slow FP16).
CiccioB - Tuesday, September 13, 2016 - link
I'm still confused. You wrote in the Pascal article that NVIDIA put dedicated 2xFP16 units in GP100, but here we have standard FP32 units able to do 4xFP8. May I ask why the dedicated 2xFP16 units were needed?
Ryan Smith - Tuesday, September 13, 2016 - link
GP100's double speed FP16 is for fast neural network training. GP102 and later do not have this feature, as NVIDIA clearly doesn't believe it makes sense on these mixed consumer/professional parts.
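For readers wondering what "double speed FP16" looks like in code: a minimal CUDA sketch using the packed half-precision intrinsics from cuda_fp16.h. On GP100 each __hfma2 retires two FP16 fused multiply-adds at the fast rate; on GP102/104/106 the same code compiles but runs at the very slow FP16 rate mentioned above. The kernel name and the axpy example are illustrative, not from the article.

#include <cuda_fp16.h>

// y[i] = a * x[i] + y[i], with each __half2 carrying two FP16 values,
// so every __hfma2 performs two half-precision fused multiply-adds.
// Compile for sm_60 (GP100) or sm_61 (GP102/104/106) to compare rates.
__global__ void axpy_half2(int n, __half2 a, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);
}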
Yojimbo - Tuesday, September 13, 2016 - link
The INT8 is not 4xFP8. These are integer operations. There are two new instructions, IDP2A and IDP4A. IDP4A packs 4 8-bit integers into a 32-bit package, performs a dot product operation on two of these packages, and accumulates the result into a 32-bit integer. The entire thing is 8 operations. For instance, suppose a = [a1 a2 a3 a4] and b = [b1 b2 b3 b4], where the ai's and bi's are 8-bit integers. Then IDP4A computes, if I remember correctly, c := c + a (dot) b, which is c := c + a1*b1 + a2*b2 + a3*b3 + a4*b4. So there are 4 integer multiplications and 4 integer additions.
These new instructions are for applying a trained neural network to do tasks, which is called inferencing. The FP16 instruction is targeted towards training neural networks, although the information about NVIDIA's TensorRT does mention that it has the capability to optimize neural networks for inferencing using FP16 instructions. Presumably there's no advantage to doing so if the inferencing is going to take place on a P4 or P40. The P100, as well as the Tegra X1 and I believe the new "Tegra Next" chip used in the Drive PX 2, do have double throughput packed FP16 capability.
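For reference, the operation described above is exposed in CUDA as the __dp4a intrinsic on compute capability 6.1 parts (GP102/104/106). A minimal sketch, with the scalar fallback written out so the "8 operations" count is visible; the helper name is just for illustration.

#include <cstdint>

// c += dot(a, b), where a and b each pack four signed 8-bit integers
// into one 32-bit register.
__device__ int dot4_accumulate(int a, int b, int c)
{
#if __CUDA_ARCH__ >= 610
    return __dp4a(a, b, c);                    // single instruction on sm_61+
#else
    // Scalar equivalent on older GPUs: 4 multiplies and 4 additions.
    for (int k = 0; k < 4; ++k) {
        int8_t ak = static_cast<int8_t>(a >> (8 * k));
        int8_t bk = static_cast<int8_t>(b >> (8 * k));
        c += static_cast<int>(ak) * static_cast<int>(bk);
    }
    return c;
#endif
}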
CiccioB - Wednesday, September 14, 2016 - link
Thanks for the clear explanation.
Eric Klien - Tuesday, September 13, 2016 - link
It looks like the cheapest card that NVIDIA currently sells with INT8 is the Titan X. So they don't enable INT8 on all GP102/104/106 chips. (Which seems reasonable since they like to charge a lot more for AI-focused chips.)
Yojimbo - Tuesday, September 13, 2016 - link
Are you sure about that? I think maybe they just didn't market the ability on the GeForce cards. The Titan X is marketed as a machine learning card so they mentioned its ability to perform that particular calculation quickly. By the way I think all NVIDIA GPUs going back at least to Fermi (and probably much further back than that) "have INT 8". They would just take 8 clock cycles to compute the equivalent to the 1 clock cycle IDP4A instruction.
Eric Klien - Tuesday, September 13, 2016 - link
"They would just take 8 clock cycles to compute the equivalent to the 1 clock cycle IDP4A instruction." Right. There was no true INT8 capability in hardware until Pascal. Before Pascal, NVIDIA was basically just selling graphic cards and adding software to them to help with AI. Now NVIDIA is adding instructions in the hardware to help with AI. It will be interesting to see what they do with Volta which will likely come out next year. In general, I expect companies to be creating more and more specialized hardware to push AI forward."Are you sure about that?" I checked and this seems to be the case. A source of info was http://www.anandtech.com/show/10510/nvidia-announc...
Yojimbo - Tuesday, September 13, 2016 - link
INT8 is just an 8-bit integer calculation. The ability was there. It's like saying there was no true FP16 capability in hardware until Pascal (or the Tegra X1). That's not accurate. It just wasn't accelerated with special instructions. For instance, one could say that Maxwell had INT8 TOPS at 1/2 its FP32 FLOPS.
As far as the software, it's the key component. It's not just "adding software to help..." It's the reason AMD is not really a competitor in this space even though their hardware is fully capable. The software (CUDA, cuDNN, TensorRT, etc.) is more important than the IDP4A instruction.
As far as specialization, NVIDIA's strategy is not one of specialization. Adding instructions to accelerate certain key operations isn't really specialization; it's catering a general purpose processor to a popular use case. Other companies are going the specialization route, however: Google with their TPU, Nervana with the Nervana Engine, and Intel with Knights Mill and perhaps with FPGAs (and the Nervana Engine, now that they have bought Nervana). Which strategy is preferable will shake out over the next few years, I guess.
As for Volta, my guess is that it will focus mostly on the architectural efficiency of the SMs, something Pascal forwent in favor of adding features such as mixed precision, INT8, NVLink, and finer-grained pre-emption, as well as pipeline optimizations allowing for higher clock speeds. NVIDIA promises an improved NVLink and a significant improvement in performance per watt without a die shrink for Volta. That could already be a lot on the plate for what appears to be a short turnaround from Pascal to Volta, judging by the Summit supercomputer schedule. Perhaps they will add some more capabilities targeted at deep learning, but Volta will remain a general purpose processor. NVIDIA has stated, both to investors and to the customers it is trying to win, that building general purpose processors is its strategy.
"I checked and this seems to be the case. A source of info was http://www.anandtech.com/show/10510/nvidia-announc...
The article says: "With the exception of INT8 support, this is a bigger GP104 throughout." But the P4 is based on the GP104 and has the faster INT8 throughput. I'd like to find confirmation that either the IPD4A instruction cannot be run on the 1080 or that it runs at a reduced rate.
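One way to check that empirically on a 1080 would be a rough throughput micro-benchmark: time a long chain of __dp4a operations against an equivalent chain of FP32 FMAs and compare operations per second. A sketch along those lines; the kernel names, launch geometry, and iteration counts are arbitrary choices, not anything from the article.

#include <cstdio>
#include <cuda_runtime.h>

// Dependent chain of dp4a operations; the 'out' guard keeps the loop
// from being optimized away.
__global__ void dp4a_loop(int iters, int a, int b, int* out)
{
    int c = threadIdx.x;
    for (int i = 0; i < iters; ++i)
#if __CUDA_ARCH__ >= 610
        c = __dp4a(a, b, c);
#else
        c = a * b + c;                         // placeholder on pre-sm_61 parts
#endif
    if (c == 42) *out = c;
}

// Dependent chain of FP32 fused multiply-adds for comparison.
__global__ void fma_loop(int iters, float a, float b, float* out)
{
    float c = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        c = fmaf(a, b, c);
    if (c == 42.0f) *out = c;
}

int main()
{
    const int iters = 1 << 20;
    int* di;   cudaMalloc(&di, sizeof(int));
    float* df; cudaMalloc(&df, sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    // Warm-up launch, then a timed launch, for each kernel.
    dp4a_loop<<<1024, 256>>>(iters, 0x01020304, 0x04030201, di);
    cudaEventRecord(t0);
    dp4a_loop<<<1024, 256>>>(iters, 0x01020304, 0x04030201, di);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_dp4a = 0; cudaEventElapsedTime(&ms_dp4a, t0, t1);

    fma_loop<<<1024, 256>>>(iters, 1.0001f, 0.9999f, df);
    cudaEventRecord(t0);
    fma_loop<<<1024, 256>>>(iters, 1.0001f, 0.9999f, df);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_fma = 0; cudaEventElapsedTime(&ms_fma, t0, t1);

    // One dp4a counts as 8 integer ops, one FMA as 2 float ops.
    printf("dp4a: %.2f ms, fma: %.2f ms, INT8-ops : FP32-ops ratio ~ %.2f\n",
           ms_dp4a, ms_fma, (8.0 * ms_fma) / (2.0 * ms_dp4a));
    return 0;
}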
ajp_anton - Tuesday, September 13, 2016 - link
"this may have something to do with the tradeoffs the GDDR5X standard makes for its higher performance"What are these tradeoffs?
MrSpadge - Tuesday, September 13, 2016 - link
The prime candidate is power efficiency. However, GDDR5X is said to be more efficient at the same transfer speed. Maybe this doesn't apply at those relatively low GDDR5 clocks?
Ryan Smith - Tuesday, September 13, 2016 - link
One of the changes to GDDR5X was how error correction works. I'm not 100% sure whether GDDR5X can support Soft ECC like GDDR5 can.
Eric Klien - Wednesday, September 14, 2016 - link
GDDR5 can support soft-ECC. See http://www.anandtech.com/show/10516/nvidia-announc...
Eric Klien - Wednesday, September 14, 2016 - link
I meant GDDR5X can support soft-ECC. See http://www.anandtech.com/show/10516/nvidia-announc...
TheinsanegamerN - Tuesday, September 13, 2016 - link
So NVIDIA can stick a GPU that big onto a low profile card, but refuses to give us a decent 750 Ti replacement. That's rather annoying. I want a GPU that powerful in my tiny HTPC box!
(And yes, I know they get away with it based on how servers cool their GPUs. Still annoying that we can't even get a low profile 1060 or anything near a 50 watt TDP, yet this 2560-core part has a 50-75 watt TDP.)
MrSpadge - Tuesday, September 13, 2016 - link
The GTX 1060 can do it. In regular compute workloads mine uses ~95 W running at 2.0 GHz. I can lower its power target to 70 W and it still runs at ~1.8 GHz. In games you might see 1.5 - 1.7 GHz at 70 W, which is easy to cool unless your case is extremely constrained.
Michael Bay - Tuesday, September 13, 2016 - link
You see, the 750 Ti was the HTPC king. The 1060 can't be that, since it wants external power.
Yojimbo - Tuesday, September 13, 2016 - link
Well the 1050 should be coming out soon. It's rated at 75W TDP.
But two things. Firstly, the P4 is probably going to sell for a lot more than you'd be willing to pay for an HTPC GPU. Secondly, the amount of revenue they think they could get from such an HTPC GPU probably isn't very much, whereas the P4 is a key product in their strategy to capture the burgeoning machine learning market. It costs resources to design, market, and sell a product.
TheinsanegamerN - Tuesday, September 13, 2016 - link
Also going to point this out: "Tesla P40 requires about 50% more power per FLOP on paper"
Should that be less, not more, power?
MrSpadge - Tuesday, September 13, 2016 - link
No, it uses more. The reason is the higher clock speed (1.3 - 1.5 GHz vs. 0.8 - 1.0 GHz), resulting in higher voltage and less power efficiency for the bigger card.
Yojimbo - Tuesday, September 13, 2016 - link
He is comparing the P40 with the P4 there. The P4 is more efficient if you're able to get around the memory capacity limitations.
Yojimbo - Tuesday, September 13, 2016 - link
Interestingly, the P40 is not the direct successor to the M40, marketing-wise. The M40 was marketed towards training, while the P40 is being marketed towards inference.
In any case, the jump from the M4 to the P4 is impressive. And I wonder if Intel's Knights Mill will force NVIDIA to put double throughput packed FP16 capability on some of their less expensive (P40-class, for example) cards in the upcoming generations. Right now NVIDIA doesn't have a lot of competition on the training side of things.
danbob999 - Tuesday, September 13, 2016 - link
I am waiting for the Tesla P100D
surfnaround - Thursday, September 15, 2016 - link
NVIDIA DeepStream SDK... so finally a computer can "check"/"process" uploaded YouTube videos instead of a human, for all of the vile, sick, degrading, torturous videos. I am being SERIOUS: Google has a team of humans that get burned out dealing with the sickness that is depraved YouTube videos. Not talking about cat videos (as opposed to the ones that show cats being tortured, etc.), or let's-play videos, or the millions of vacuous videos...
I am talking about the depraved videos that never make it to YouTube, the ones that burn out the human people who have to vet them...