Overhauled NVIDIA RTX 50 "Blackwell" GPUs reportedly up to 2.6x faster vs RTX 40 cards courtesy of revised Streaming Multiprocessors and 3 GHz+ clock

erek

[H]F Junkie
Joined
Dec 19, 2005
Messages
10,786
RTX 50-series Rumours

"Furthermore, the leaker maintains that RTX 50 Blackwell GPUs will have GDDR7 memory, support PCIe Gen 5, and boast a clock frequency of more than 3 GHz for the gaming models.
According to the leaker, early specification targets for the GB102 gaming GPU include 144 SMs, a 382-bit wide bus, GDDR7 memory, support for PCIe Gen 5 x16, and 96 MB of L2 cache. As a comparison, the RTX 4090 has the same bus width, an SM count of 128, and features 72 MB of L2 cache.
The server GB100 will reportedly be even more decked out with 256 SMs, a 512-bit wide bus, HBM3 memory, and 128 MB of L2 cache.
Finally, when it comes to performance, the RTX 50 GPUs could be anywhere from 2 to 2.6 times more performant than the RTX 40 boards. Considering how the fastest RDNA 3 board, the RX 7900 XTX, is significantly slower than the RTX 4090, RDNA 4 will need to pack a massive performance upgrade in order to compete with the RTX 50 if these reports are accurate.
As always, unconfirmed rumors of graphics hardware that is far away should always be taken with a giant grain of salt, as GPU specifications can change quite rapidly."

[Image: NVIDIA Hopper GH100 die, larger than Ampere GA100]


Source: https://www.notebookcheck.net/Overh...ocessors-and-3-GHz-clock-speeds.705159.0.html
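Taking the rumored numbers at face value, the on-paper spec deltas versus the RTX 4090 are modest — this is just arithmetic on the figures quoted above, not a performance prediction:

```python
# Spec-ratio comparison between the rumored GB102 and the RTX 4090,
# using only the numbers quoted in the rumor above.
rtx_4090 = {"sms": 128, "l2_mb": 72}
gb102    = {"sms": 144, "l2_mb": 96}   # rumored early targets

sm_uplift = gb102["sms"] / rtx_4090["sms"]      # 144/128 = 1.125 -> +12.5% SMs
l2_uplift = gb102["l2_mb"] / rtx_4090["l2_mb"]  # 96/72  ~= 1.333 -> +33% L2

print(f"SM uplift: {sm_uplift:.3f}x, L2 uplift: {l2_uplift:.3f}x")
```

A 12.5% SM bump on its own is nowhere near 2x, so if the rumor is right, most of the gain would have to come from the 3 GHz+ clocks and architectural changes.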
 
I hear it also comes with a voucher good for one *free reacharound.


*With purchase. Not valid in Quebec or Puerto Rico. Must be redeemed in person on your birthday, all participating persons must be of legal drinking age, and consent of all affected persons must be obtained in writing and notarized.
 
...
Finally, when it comes to performance, the RTX 50 GPUs could be anywhere from 2 to 2.6* times more performant than the RTX 40 boards. Considering how the fastest RDNA 3 board, the RX 7900 XTX, is significantly slower than the RTX 4090, RDNA 4 will need to pack a massive performance upgrade in order to compete with the RTX 50 if these reports are accurate.
As always, unconfirmed rumors of graphics hardware that is far away should always be taken with a giant grain of salt, as GPU specifications can change quite rapidly."
*With DLSS4 Frame generation technology!!!
 
That’s crazy. The 4090 is already stupidly fast. I’ll be shocked if the RTX 50 series is that much faster.
 
DLSS4 is time traveling frame technology. It travels into the future and grabs a frame for you, and brings it back to your time.

DLSS4 will take a random frame from the future. Wait for DLSS5, it will take one closer to the frame you need. DLSS6 will be 10% closer to what you need, and then they will discontinue DLSS after that! :)
 
Yea, I have a bridge to sell you for a 2-2.6x performance increase lmao. Heard the same last time. Realistically I am not going to expect more than this gen: 70% max. Everything else is probably ray tracing marketing DLSS4 stuff lmao.
 
Does it come with DLSS4 with double the fake frames? Because that's likely where most of the performance gain is.
 
The 4090 wasn't double the cost of the 3090 despite having double the performance.
It wasn't 2x, more like 75% faster. That's why I call horseshit on 2.6x lmao. I am sure it's a ray tracing number.
 
It wasn't 2x, more like 75% faster. That's why I call horseshit on 2.6x lmao. I am sure it's a ray tracing number.
Maybe, but the rumors have been that this is the most significant change in architecture since Pascal->Turing, which was basically only tacking on RT and Tensor cores onto Pascal. It may not be as dramatic as 2.6x, but it may be significant.
 
Yea, I have a bridge to sell you for a 2-2.6x performance increase lmao. Heard the same last time. Realistically I am not going to expect more than this gen: 70% max. Everything else is probably ray tracing marketing DLSS4 stuff lmao.

The 4090 mostly lived up to Nvidia's pre-release claims, but only if you used DLSS2/DLSS3 and RT. At native resolution, without frame generation or other gimmicks, the benefits were much smaller.
 
RTX 50-series Rumours

"Furthermore, the leaker maintains that RTX 50 Blackwell GPUs will have GDDR7 memory, support PCIe Gen 5, and boast a clock frequency of more than 3 GHz for the gaming models.
According to the leaker, early specification targets for the GB102 gaming GPU include 144 SMs, a 382-bit wide bus, GDDR7 memory, support for PCIe Gen 5 x16, and 96 MB of L2 cache. As a comparison, the RTX 4090 has the same bus width, an SM count of 128, and features 72 MB of L2 cache.
The server GB100 will reportedly be even more decked out with 256 SMs, a 512-bit wide bus, HBM3 memory, and 128 MB of L2 cache.
Finally, when it comes to performance, the RTX 50 GPUs could be anywhere from 2 to 2.6 times more performant than the RTX 40 boards. Considering how the fastest RDNA 3 board, the RX 7900 XTX, is significantly slower than the RTX 4090, RDNA 4 will need to pack a massive performance upgrade in order to compete with the RTX 50 if these reports are accurate.
As always, unconfirmed rumors of graphics hardware that is far away should always be taken with a giant grain of salt, as GPU specifications can change quite rapidly."


Source: https://www.notebookcheck.net/Overh...ocessors-and-3-GHz-clock-speeds.705159.0.html
I assume they mean 384-bit bus, not 382.

And if the server part is using HBM3, I would think that's a much higher number than 512-bit lol. What did they do, forget a 0? The A100 with 40-80GB of HBM2e is 5120-bit.

Man these rumor sites should proof-read their rumors.

Maybe, but the rumors have been that this is the most significant change in architecture since Pascal->Turing, which was basically only tacking on RT and Tensor cores onto Pascal. It may not be as dramatic as 2.6x, but it may be significant.
That's not exactly true.

Pascal was mainly the same as Maxwell, just with a die shrink that allowed additional SMs in each GPC and higher clocks. Architecturally though, largely the same as Maxwell.

Turing was definitely different from Pascal, not just because of the added Tensor and RT cores; it also had a real change to the SM. Maxwell and Pascal had 128 CUDA cores per SM. Turing dropped to 64 and made up for the reduction by adding more SMs. Comparing full Pascal to full Turing, the SM count more than doubled, which kept the total CUDA core count growing despite the per-SM count halving.

There was another change too. Starting with Fermi, each CUDA core had an FP and an INT unit, and that stayed the same through Pascal, with really the only difference between architectures being how capable those units were. Functionally, from Fermi through Pascal, each CUDA core could do either an FP or an INT operation, but not both at the same time. With Turing, the FP32 and INT32 datapaths were split, so while before each core could do one or the other, you could now do both at the same time.

With Ampere, FP32 capability was added to the INT32 datapath, so per SM partition you now had 16 dedicated FP32 units plus another 16 that could do either FP32 or INT32 each clock cycle.

Lovelace is largely the same as Ampere, just has a much larger cache, and an increase in SM count plus die shrink and clock speed increases.
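The datapath evolution described above can be sketched as a toy throughput model. The per-SM core counts are the simplified figures from the post, and the `int_fraction` knob is purely my own illustrative parameter (the share of the instruction mix that is INT32), not an NVIDIA term:

```python
# Toy model of peak FP32 ops per SM per clock across the architectures
# described above, given the fraction of the workload that is INT32
# (INT work steals slots on any shared FP/INT datapath).
def fp32_per_clock(arch, int_fraction=0.0):
    if arch == "pascal":
        # 128 cores, each FP32 OR INT32 -- INT work displaces FP work
        return 128 * (1 - int_fraction)
    if arch == "turing":
        # 64 dedicated FP32 + 64 dedicated INT32: INT work is "free"
        return 64
    if arch == "ampere":
        # 64 dedicated FP32 + 64 on a shared FP32-or-INT32 datapath
        return 64 + 64 * (1 - int_fraction)
    raise ValueError(arch)

# Pure-FP workload:
print(fp32_per_clock("pascal"), fp32_per_clock("turing"), fp32_per_clock("ampere"))
# -> 128.0 64 128.0
```

The interesting case is a mixed workload: with, say, 30% INT32, Pascal's effective FP rate drops while Turing's holds steady, which is exactly the trade the split datapaths were meant to win.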
 
Last edited:
I assume they mean 384-bit bus, not 382.

And if the server part is using HBM3, I would think that's a much higher number than 512-bit lol. What did they do, forget a 0? The A100 with 40-80GB of HBM2e is 5120-bit.

Man these rumor sites should proof-read their rumors.
I was thinking that they were talking about the per-stack number, but it's 1024 bits per stack for HBM3.
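To put the per-stack math in one place (the 1024-bit-per-stack HBM3 width and the A100's 5120-bit total come from the posts above; the stack counts below are just illustrative):

```python
# HBM bus width is per-stack: each HBM3 stack presents a 1024-bit interface,
# so the total bus width is simply stacks * 1024. A "512-bit" HBM3 config
# would be half a stack, which is why the rumored figure looks off.
HBM3_BITS_PER_STACK = 1024

def total_bus_bits(stacks: int) -> int:
    return stacks * HBM3_BITS_PER_STACK

print(total_bus_bits(5))  # A100-style five-stack layout -> 5120
```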
 
I heard the processor actually creates rips in time and pulls pre-rendered frames from the future to increase framerates. True story. That's how it gets 2.6x the performance.
Sure, the problem is they have to reimplement TCP over those frames to put them back in order.
 
The SM count change would come from the N5-to-N3B move, should they do that, as it scales at roughly 1.3x the density, though at 1.5x the cost per mm^2…
The increased cache is good, as ray tracing and all that jazz fills it up pretty fast.
I have a suspicion, though, that it will be maybe 1.5x faster in raster, with the 2.6x-faster parts being improvements to ray tracing and/or DLSS.
Nvidia has done a lot of work on their Variable Rate Shading APIs, which, when combined with DLSS and all that jazz, can really improve performance with almost no visual loss.
https://developer.nvidia.com/vrworks/graphics/variablerateshading
I expect to see this being in more stuff soon.
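Taking the quoted N3B figures at face value (~1.3x density at ~1.5x cost per mm^2, both rumor-level numbers), the per-transistor cost actually rises:

```python
# Back-of-the-envelope on the quoted N5 -> N3B numbers: ~1.3x density at
# ~1.5x cost per mm^2 means the cost per transistor goes UP, not down.
density_gain = 1.3   # relative transistor density vs N5 (quoted above)
cost_per_mm2 = 1.5   # relative wafer cost per mm^2 vs N5 (quoted above)

cost_per_transistor = cost_per_mm2 / density_gain
print(f"{cost_per_transistor:.3f}x")  # ~1.154x -> roughly 15% more per transistor
```

Which is one more reason to expect the headline 2.6x to lean on RT/DLSS rather than a huge raster jump: raw transistor budget is getting more expensive, not cheaper.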
 
Temporal Transmission Control Protocol. TTCP it is. Nvidia needs to get on this for more frames.
It renders the frames in the future and transmits them to the present to give infinite framerates at approximately 0ms latency using interdimensional quantum entanglement.
It is a thing in another 30 years; they should be announcing it next month for preview in their hosted cloud solutions.
 
It renders the frames in the future and transmits them to the present to give infinite framerates at approximately 0ms latency using interdimensional quantum entanglement.
It is a thing in another 30 years; they should be announcing it next month for preview in their hosted cloud solutions.
Nah, nothing quite so complex and technobabbly. Probably would use AI to guess at what the frames should be and then render them before the call to render them to get more frames... wait that's dlss3 isn't it?
 
If Nvidia has working RTX 5000 silicon, then their graphics cards aren't selling well. We don't even have an RTX 4060 or 4050 out, and we already know this much about Blackwell?
 
Wonder how good the design/simulation tools have gotten; perhaps good enough now to more or less rely on.
 