The DirectX 12 Performance Preview: AMD, NVIDIA, & Star Swarm
by Ryan Smith on February 6, 2015 2:00 PM EST- Posted in
- GPUs
- AMD
- Microsoft
- NVIDIA
- DirectX 12
DirectX 12 vs. Mantle, Power Consumption
Although the bulk of our coverage today is focused on DirectX 12 versus DirectX 11, we also wanted to take a moment to look at how DirectX 12 compares to AMD’s Mantle. Mantle offers an interesting point of contrast, both because it has been in beta longer than DirectX 12 and because it’s an even lower level API than DirectX 12. Since Mantle only needs to work on AMD’s GPUs and can be tweaked for AMD’s architectures, it offers AMD the chance to exploit their GPUs in a few additional ways that a common, cross-vendor API like DirectX 12 cannot.
With 4 cores we find that AMD achieves better results with Mantle than DirectX 12 across the board. The gains are never very great – a few percent here and there – but they are consistent and just outside our window of variability for the Star Swarm benchmark. With such a small gain there are a number of factors that can possibly explain this outcome – better developed drivers, better developed application, further benefits of working with a known hardware platform – so we can’t credit any one factor. But it’s safe to say that at least in this one instance, at this time, Star Swarm’s Mantle rendering path produces even better results than its DirectX 12 path on AMD cards.
On the other hand, Mantle doesn’t seem to be able to accommodate a two-core situation as well, with the 290X seeing a small but distinct performance regression from switching to Mantle from DirectX 12. Though we didn’t have time to look at an AMD APU for this article, it would be interesting to see if this regression occurs on their 2M/4C parts as well as it does here; AMD is banking heavily on low-level APIs like Mantle to help level the CPU playing field with Intel, so if Mantle needs 4 CPU cores to fully spread its wings with faster cards, that might be a problem.
Diving deeper, we can see that part of the explanation for our Mantle performance regression may come from the batch submission process. DirectX 12 is unexpectedly well ahead of Mantle here, with batch submission taking on average a bit more than half as long as it does under Mantle. As batch submission times are highly correlated to CPU bottlenecking on Star Swarm, this would imply that DirectX 12 would bottleneck later than Mantle in this instance. That said, since we’re so strongly GPU-bound right now it’s not at all clear if either API would be CPU bottlenecked any time soon.
Update: Oxide Games has emailed us this evening with a bit more detail about what's going on under the hood, and why Mantle batch submission times are higher. When working with large numbers of very small batches, Star Swarm is capable of throwing enough work at the GPU that the GPU's command processor becomes the bottleneck. For this reason the Mantle path includes an optimization routine for small batches (OptimizeSmallBatch=1), which spends extra CPU time to relieve the GPU, doing a second pass on the batches in the CPU to combine some of them before submitting them to the GPU. This bypasses the command processor bottleneck, but it increases the amount of work the CPU needs to do (though note that in AMD's case, it's still several times faster than DX11).
This feature is enabled by default in our build, and combining those small batches is the likely reason that the Mantle path holds a slight performance edge over the DX12 path on our AMD cards. The tradeoff is that in a 2 core configuration, the extra CPU workload from the optimization pass is just enough to cause Star Swarm to start bottlenecking at the CPU again. For the time being this is a user-adjustable feature in Star Swarm, and Oxide notes that in any shipping game the small batch feature would likely be turned off by default on slower CPUs.
If we turn off the small batch optimization feature, what we find is that Mantle's batch submission time drops nearly in half, to an average of 4.4ms. With the second pass removed, Mantle and DirectX 12 take roughly the same amount of time to submit batches in a single pass. However, as Oxide noted, there is a performance hit; the Mantle rendering path's performance goes from being ahead of DirectX 12 to trailing it. So given sufficient CPU power to pay the price for batch optimization, it can have a significant impact (16%) on performance under Mantle.
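To make the tradeoff concrete, here is a minimal sketch of what a second CPU pass that combines small batches could look like. It is an illustration only: Star Swarm's actual implementation is not described beyond Oxide's summary, and the `Batch` type, the `state_key` field, the `small_limit` threshold, and the rule of merging only adjacent batches with matching state are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    # Hypothetical stand-in for whatever render state two batches must
    # share before they can be merged.
    state_key: str
    # Number of draw commands carried by this batch.
    draws: int


def combine_small_batches(batches: List[Batch], small_limit: int = 4) -> List[Batch]:
    """Extra CPU pass: merge adjacent small batches that share state.

    small_limit is an assumed threshold for what counts as "small".
    Once a merged batch grows past it, it stops absorbing neighbours.
    """
    combined: List[Batch] = []
    for batch in batches:
        prev = combined[-1] if combined else None
        if (
            prev is not None
            and prev.state_key == batch.state_key
            and prev.draws <= small_limit
            and batch.draws <= small_limit
        ):
            # Fold this batch into the previous one, so the GPU's
            # command processor sees one submission instead of two.
            combined[-1] = Batch(prev.state_key, prev.draws + batch.draws)
        else:
            combined.append(Batch(batch.state_key, batch.draws))
    return combined


frame = [Batch("a", 1), Batch("a", 2), Batch("b", 1), Batch("a", 10), Batch("a", 1)]
merged = combine_small_batches(frame)
print(len(frame), "batches in,", len(merged), "batches out")
```

The loop itself is the cost: it runs on the CPU for every batch in every frame, which is consistent with the extra pass being a net loss when only two cores are available and a net win when the command processor is the limiting factor.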
Finally, we wanted to take a quick look at power consumption among cards and APIs. To once again repeat what we said earlier, Star Swarm is an imperfect, non-deterministic benchmark, and coupled with the in-development status of DirectX 12 everything here is subject to change. However we thought this was interesting enough to include in our evaluation.
As expected, the increased throughput from DirectX 12 and Mantle drives up system power consumption. With the CPU no longer the bottleneck, the GPU never gets a chance to idle and video card power consumption ramps up to full load.
245 Comments
dakishimesan - Friday, February 6, 2015 - link
Because DirectX 10 and WDDM 2.0 are tied at the hip, and by extension tied to Windows 10, DirectX 12 will only be available under Windows 10.
dakishimesan - Friday, February 6, 2015 - link
PS: great article.
FlushedBubblyJock - Sunday, February 15, 2015 - link
First thoughts: R9 290X dx11=8 frames mantle=46 frames TEST= TOTAL FRAUD
Although the difference there is what AMD told us mantle would do, only in this gigantic liefest is such hilarity achieved.
Another big industry lie-test blubbered out to the sheep at large.
0ldman79 - Monday, February 16, 2015 - link
It looks more like the people that coded that game are not very experienced and have spent far more time optimizing for future API than DX11.
Christopher1 - Monday, February 16, 2015 - link
Not necessarily. DX11 no matter how 'optimized' still does not get you as close 'to the metal' as Mantle does. So yes, there can be these kinds of extreme differences in FPS.
The_Countess666 - Thursday, February 19, 2015 - link
they are in fact very experienced. but they choose to do the things that previously DX11 bottleneck prevented them from doing in the past.
0ldman79 - Saturday, February 21, 2015 - link
That makes sense, still not quite an apples to apples comparison in that situation, though using previously unavailable features on the new API tends to show the differences.
The question still remains, will we see similar improvements on the current crop of DX11 games?
I don't think that will be the case, though I could be wrong.
Seems the gains are from multithreading, which is part of the DX11 or 11.1 spec.
RobATiOyP - Sunday, February 21, 2016 - link
Of course you won't see such a performance increase, because games have to be designed and tuned to what the platform is capable of. The console API's have allowed games, lower level access, Mantle, DX12 & Vulkan are about removing a bottleneck caused by the assumptions in DX11 & OpenGL API's which were designed when GPUs were novel items and much evolution has occured since. Those doubting the benchmark, please say why a graphics application would not want to do more draw calls per second!
Fishymachine - Monday, February 16, 2015 - link
DX11 can manage up 10k draw calls, Star Storms makes 100k. Also Assasins Creed Unity makes up to 50k in case you wanted a retail game that would skyrocket in low API(there's a spot where even 2 GTX980 get 17fps)
The_Countess666 - Thursday, February 19, 2015 - link
this engine was spefiically written to do all the things that previously DX11 doesn't allow game developers to do. it was designed to run headlong into every bottleneck that DX11 has.
it is in fact a great demonstrations of the weaknesses of DX11.
the fact that nvidia gets higher framerates in dx11 then ATI is because they optimized the hell out of this game. that isn't viable (costs too much, far too time consuming) for every game and was purely done by nvidia for marketing, but all it really does is further illustrate the need for a low level API where the burden of optimizations is shifted to the game engine developers where it belongs, not the driver developers.