Is there an Intel CPU that can hit one teraflop yet?

aphexcoil

Limp Gawd
Joined
Jan 4, 2011
Messages
322
I'm not sure how fast the new Haswell-E 8-core Extreme CPU clocks in at, but I'm assuming somewhere north of 250 gigaflops.

It seems like we really haven't seen many gains in raw speed since Sandy Bridge. I realize that a lot of new instructions have been released and it will take software / compilers some time to catch up with them and utilize them, but are we getting close to a one-teraflop CPU?

Nvidia / AMD graphics processors can hit well above 1 teraflop, but as we all know, GPUs aren't as flexible as CPUs.

I'm hoping we can get a general purpose 1 teraflop CPU by 2018 from Intel. Do you guys see this as a realistic timeframe?
 
I'm hoping we can get a general purpose 1 teraflop CPU by 2018 from Intel. Do you guys see this as a realistic timeframe?


RAW x86 performance? Absolutely not. Perhaps 2028 would be more realistic.


Now that Intel and AMD are pushing APUs, they're starting to add the GFLOPS gained by the iGPU into the mix, skewing the rating. It'll only get worse, and you're right, it isn't the same thing at all. With Intel considering adding an iGPU on all platforms in the future, we'll definitely see multi-teraflop processors, but as you know it won't be relevant whatsoever.
 
I'm hoping we can get a general purpose 1 teraflop CPU by 2018 from Intel. Do you guys see this as a realistic timeframe?

Only if AMD (or someone else) comes out with a CPU that can actually compete, and I'm pretty sure that isn't going to happen.
 
Only if AMD (or someone else) comes out with a CPU that can actually compete, and I'm pretty sure that isn't going to happen.

Yeah, the lack of competition has been upsetting, to say the least. Intel has just been adding token improvements: a few new features and 3-5% overall speed increases every year. I remember how fierce the competition was in the '90s and early 2000s, and how you would occasionally see a solid 15-20% speed improvement with a new release.
 
RAW x86 performance? Absolutely not. Perhaps 2028 would be more realistic.

Ouch! 14 years just to get a CPU 4 times faster than today's best? I really hope you're wrong on this one. We should be at 5nm and below by 2028.
 
Competition isn't the only reason general-purpose CPUs aren't getting faster at that pace. I can name at least three factors working against it:

CPUs are quickly approaching a point where the silicon gates they are composed of are only a few atoms wide. While you could imagine a pathway only one or two atoms in width, the physics at those scales is very different: effects like quantum tunnelling become very real, in much the same way that static electricity becomes an incredibly powerful force at the microscopic scale.

There is also general demand. Current consumer demand for new CPUs mainly favors lower power usage, in order to provide the same processing speed in a smaller portable device. Mainstream computing has reached a plateau in terms of processing requirements: a 5-year-old computer can still run most software required by the average user, and even gamers are easily served by a computer 2 or 3 years old.

Finally, there is a change in high-end demand. The applications that actually do need more power are quickly driving the creation of more specialized processors instead of asking general-purpose CPUs to keep up. GPUs, APUs, DSPs, and other special-purpose chips have expanded greatly (for example, processors designed specifically for Bitcoin mining).
 
Demand and supply change with each other. It's no coincidence that when Intel purchased their market share and AMD dropped off the radar of high-end PC systems, the market began to grow stale and software stopped pushing the limits.

Yes, transistors are getting small, but Intel was damning any TDP limits by slapping two chips on the same package to compete before they bought the OEMs. If AMD were as big a competitor as they were in 2004, then we would be seeing 18-core unlocked Intel chips selling at the consumer level for the same price as today's 8-cores. They have the silicon now; they just have no reason to compete. We COULD have higher-TDP chips in our systems. Hell, AMD sells 200+ watt processors. Imagine if Intel were compelled to compete at the same TDP? Instead, 'power efficiency' is the prime market. I'm not saying that if AMD had stayed a huge competitor we wouldn't have smartphones and ultrabooks that need long battery life, but rather that the 'damn the electricity bill' high-end enthusiast market wouldn't have shrunk, the software that could have existed would have been made, and the feedback loop of software driving hardware driving software would have continued.
 
I'm in the lower-TDP boat. As we become more energy conscious, I think this will be the main goal for most people. I'm thinking of building a new PC and was looking at the GTX 780 and its recent price drops. I wanted to pick one up, but given the wattage difference between it and the 980, I'm willing to spend the extra money for a lower TDP.
 
TFLOPS is floating point performance and is only useful for certain types of math-heavy tasks. Most typical software depends on (integer) ALU/pipeline throughput. In any case...

Iris Pro 5200 has peak GPGPU performance of 832 GFLOPS. With the 4 SIMD units on the CPU, that's pretty close to 1 TFLOP of aggregate performance. Depending on the application and FP precision, the percentage of that peak can vary widely.
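For a rough sense of where that aggregate figure comes from, here is a back-of-envelope sketch. The 3.2 GHz clock and the per-core FMA width are my assumptions for a Haswell part like the i7-4770R, not numbers from the post:

Code:
# Rough aggregate peak for an Iris Pro 5200 system (single precision).
# Assumptions (mine): 4 Haswell cores at ~3.2 GHz, each doing
# 2 AVX2 FMA units x 8 SP lanes x 2 ops (mul+add) = 32 FLOPs/cycle.
igpu_gflops = 832.0                 # Iris Pro 5200 GPGPU peak, from the post
cores, clock_ghz = 4, 3.2
cpu_gflops = cores * clock_ghz * 2 * 8 * 2
print(f"CPU SIMD peak: {cpu_gflops:.0f} GFLOPS")                # ~410 GFLOPS
print(f"Aggregate:     {igpu_gflops + cpu_gflops:.0f} GFLOPS")  # ~1242 GFLOPS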
 
The POWER8 can do well over a teraflop... but it is priced like an E7 Xeon. It would be cool if there were a lower-end version of it, priced like a 5960X or an E5 Xeon, for general computers, but it will never happen. Even if it did, it could only run Linux and BSD natively, as it is based on a superior but less-supported uarch.
 
Right answer:

Intel will have not a 1 but a 3+ teraflop (double-precision floating point) processor next year, if you define a processor as the heart of a standalone system. Presently Intel has teraflop coprocessors sitting in Xeon Phi PCIe cards, but these are "co-" since they need a proper processor to run the system. The next teraflop chip, called Knights Landing, will have cores which support booting the OS.

http://www.pcworld.com/article/2366...ul-chip-ever-packs-emerging-technologies.html

The question is obviously what the price of such a processor will be, where it will be applied, and how it will be sold. It is quite possible such a processor will only be sold through special channels for special builds, e.g. for supercomputer builders. Thus it is likely pure fantasy to think one will be able to buy a teraflop PC, or parts for one, in shops any time soon.
 
Intel already has a > 1 TFLOP CPU, even without tricks like counting the GPU.

E5-2699v3 = 18 Haswell cores

Each core has 2 AVX2 floating point units, each capable of 1 FMA (two floating point ops) per cycle on a vector of 8 fp elements.

Let's assume no turbo at all and use the base frequency of 2.3 GHz:

2.3 GHz * 18 cores * 2 AVX2 units/core * 8 SP floats/unit * 2 ops/float (FMA) =

1.325 TFLOPS
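
Or, as a throwaway script (this is just the arithmetic above, nothing new):

Code:
# Peak single-precision throughput of the E5-2699 v3, per the figures above.
ghz, cores = 2.3, 18
avx2_units, sp_lanes, fma_ops = 2, 8, 2
tflops = ghz * cores * avx2_units * sp_lanes * fma_ops / 1000
print(f"{tflops:.3f} TFLOPS")  # 1.325 TFLOPS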
 
I'm hoping we can get a general purpose 1 teraflop CPU by 2018 from Intel. Do you guys see this as a realistic timeframe?

For what kind of application are you hoping to get that performance?

1 (marketing) TFlop/s will be reached much earlier than a sustainable TFlop/s across a set of applications. The key bottleneck for sustainable HPC performance is not the ability of the datapath (= ALU) to compute stuff, but the ability of the memory subsystem to load and store data at the required speed. That is one of the main reasons why some HPC codes run at 5% or less of the theoretical peak performance of the CPU.

A little example:
Add 2 vectors of floating point data into a third one (double precision), long enough that they don't fit into L1/L2/L3 cache.

For each floating-point operation (one add), the system needs to transfer 24 bytes of data (two 8-byte loads plus one 8-byte store). Notwithstanding overhead on the memory controller, a modern Xeon can transfer approx. 60 GB/s. So your beautifully fast 1 (marketing) TFlop/s CPU is able to deliver 2.5 GFlop/s of real performance, 0.25% of its peak.
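
A minimal sketch of that arithmetic, using the 60 GB/s and 24 bytes-per-add figures from the example above:

Code:
# Streaming add c[i] = a[i] + b[i] in double precision:
# 2 loads + 1 store = 3 * 8 bytes = 24 bytes moved per floating-point add.
bandwidth_gbs = 60.0      # assumed sustainable Xeon memory bandwidth
bytes_per_flop = 24.0
attainable = bandwidth_gbs / bytes_per_flop   # GFlop/s the memory can feed
peak = 1000.0                                 # the 1 (marketing) TFlop/s peak
print(f"{attainable:.1f} GFlop/s = {100 * attainable / peak:.2f}% of peak")
# -> 2.5 GFlop/s = 0.25% of peak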

Good real world performance requires a balanced system, not one component overoptimized.

Processors can be optimized for low latency (CPUs usually are) or high throughput (like GPUs). TFlop/s is a throughput measure, not a latency measure.

The memory system is often a barrier to good performance in commercial code as well. Take sorting: the merge phase in external sorting is usually limited by the speed of the virtual address translation hardware in the CPU (the TLB). Your wonderful 18 cores sit more or less constantly idle, waiting for the translation subsystem to deliver the physical address needed to access the next data item in the sort. Adding another 10 cores would add zero performance.

You want better performance in a wide range of use cases? Ask Intel to innovate in the memory interface, in both latency and bandwidth.

The famous Cray-1 supercomputer of the late seventies had a 7-CPU-cycle latency to access any area of its main memory. Today's desktop CPUs are rather in the 400-600 cycle range to access their main memory. For interactive (latency-driven) applications, the invention of caches mitigated a lot. For throughput-oriented apps requiring lots of Giga/Tera/Peta Flop/s, caches are of little help, as the data sets are way larger (yes, I know there are advanced blocking algorithms leveraging cache hierarchies to some extent).

BTW,
just got a GTX 980 (EVGA Superclocked): Fantastic 5.8 TFlop/s (single precision), lousy 212 GFlop/s (double precision). The GTX Titan is much more balanced if you need the accuracy of double precision.

cheers,
Andy
 
BTW,
just got a GTX 980 (EVGA Superclocked): Fantastic 5.8 TFlop/s (single precision), lousy 212 GFlop/s (double precision). The GTX Titan is much more balanced if you need the accuracy of double precision.

Not to get off topic, but it's like this because NVIDIA artificially inhibits DP FLOPS performance on GeForce GPUs, and has it fully unlocked on their Tesla GPUs.
My old GTX480 gets around 1.3TFLOPS SP and 168GFLOPS DP; 168÷1300 ≈ 1/8 instead of 1/2 like it should be.

212÷5800 ≈ 1/27
So NVIDIA is artificially limiting the newer GeForce GPUs by quite a bit more on DP FLOPS, sadly; primary reason is so companies will fill their systems with expensive Tesla GPUs instead of cheap GeForce GPUs. ;)

Compare this to the Tesla M2090 GPU: 666DP÷1333SP ≈ 1/2 (as all GPUs technically can be)
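
Spelling the ratios out (all GFLOPS figures are the ones quoted in this post):

Code:
# DP:SP ratios for the cards mentioned above (DP GFLOPS, SP GFLOPS).
cards = {
    "GTX 480":     (168.0, 1300.0),
    "GTX 980":     (212.0, 5800.0),
    "Tesla M2090": (666.0, 1333.0),
}
for name, (dp, sp) in cards.items():
    print(f"{name}: DP/SP ~ 1/{sp / dp:.0f}")
# GTX 480: ~1/8, GTX 980: ~1/27, Tesla M2090: ~1/2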

Anyways, good thread, back on topic!
 
Not to get off topic, but it's like this because NVIDIA artificially inhibits DP FLOPS performance on GeForce GPUs, and has it fully unlocked on their Tesla GPUs.
My old GTX480 gets around 1.3TFLOPS SP and 168GFLOPS DP; 168÷1300 ≈ 1/8 instead of 1/2 like it should be.

212÷5800 ≈ 1/27
So NVIDIA is artificially limiting the newer GeForce GPUs by quite a bit more on DP FLOPS, sadly; primary reason is so companies will fill their systems with expensive Tesla GPUs instead of cheap GeForce GPUs. ;)

Compare this to the Tesla M2090 GPU: 666DP÷1333SP ≈ 1/2 (as all GPUs technically can be)

Anyways, good thread, back on topic!

Technically untrue. If I recall correctly, the GM204 (and other non-Gxxx0 cores) are not built with double FP in mind, as they are entirely consumer targeted. GM200 (and GK110, etc) are the ones that have double FP built in, but in the GTX 780 and 780ti, they were intentionally gimped (the Titans weren't gimped). That is a departure from nVidia's previous practices, where double FP performance wasn't gimped in the case of the GTX 470, 480, 570, and 580. Not too sure about the previous GTX 2xx models though.
 
A little example:
Add 2 vectors of floating point data into a third one (double precision), long enough that they don't fit into L1/L2/L3 cache.

For each floating-point operation (one add), the system needs to transfer 24 bytes of data (two 8-byte loads plus one 8-byte store). Notwithstanding overhead on the memory controller, a modern Xeon can transfer approx. 60 GB/s. So your beautifully fast 1 (marketing) TFlop/s CPU is able to deliver 2.5 GFlop/s of real performance, 0.25% of its peak.

Definitely, but of course that's the worst possible case: every operand is used exactly once and must be read from and written to memory.

If you are just adding huge vectors of numbers, you can do it so quickly that you definitely don't care about FLOPS. You could add together every number in 1 TB of RAM in less than a minute, even on consumer-grade CPUs.

So anyone paying big $$$ for heavy-duty hardware is doing something more interesting than that, and that usually means much more than a 1:1 ratio of ALU ops to memory ops for each operand.

So real-world codes fall somewhere between the extremes of "use every operand once" and "entire working set fits in L1". Many *interesting* problems fall closer to the latter end of the scale in practice. Matrix multiplication, for example: a large matrix multiplication may be blocked into tiles of 64 x 64 = 4K elements, which approximately fit in L1. Each of those 4K elements is involved in ~128 operations just for that 64x64 multiply, so the bandwidth required is < 1% of the pathological "read everything from memory" case.
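
A minimal sketch of that blocking idea; the 64-element tile size matches the example above, and everything else (numpy, the helper name) is just illustration:

Code:
import numpy as np

def blocked_matmul(A, B, tile=64):
    """Cache-blocked matrix multiply of square matrices: each 64x64 tile
    is loaded once and then reused ~64 times, which is why memory traffic
    drops to a small fraction of the "stream every operand from RAM" case."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for k in range(0, n, tile):
            a = A[i:i+tile, k:k+tile]          # tile stays hot in cache
            for j in range(0, n, tile):
                C[i:i+tile, j:j+tile] += a @ B[k:k+tile, j:j+tile]
    return C

# Sanity check against numpy's own multiply.
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)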

That's not a toy example, either. Many practical LINPACK problems get > 80% of theoretical peak FLOPs on stuff like Haswell Xeons.

So a system definitely has to be balanced, but there is no single "balance" that makes sense; it's problem specific. Current high-core-count CPUs are well balanced for a class of problems that can use each operand 100+ times after reading it from memory. For other problems, more but smaller CPUs with local memory (hence more aggregate bandwidth) are better.

I still think it's pretty amazing that we have a > 1 TFLOP CPU which is latency optimized!
 
Intel already has a > 1 TFLOP CPU, even without tricks like counting the GPU. E5-2699v3 = 18 Haswell cores
Each core has 2 AVX2 floating point units, each capable of 1 FMA (two floating point ops) per cycle on a vector of 8 fp elements.
Let's assume no turbo at all and use the base frequency of 2.3 GHz:
2.3 GHz * 18 cores * 2 AVX2 units/core * 8 SP floats/unit * 2 ops/float (FMA) =
1.325 TFLOPS

This is a theoretical calculation. Let's see how it stands the reality test with Linpack (Intel-optimized, so I presume with AVX), which is the authoritative benchmark for numerical computations. There is no benchmark result for the 18-core processor, but there is one for a dual 10-core E5-2697 v3 system (i.e. 20 cores), and there is a result for the 8-core i7-5960X:

For the dual 20-core: 0.788 Teraflops
For the i7-5960X 8-core: 0.354 Teraflops

Extrapolating from this, we can see that in practice the 18-core will be pumping out about 3/4 of a teraflop. Which is amazing, but still not a full teraflop.

http://www.pugetsystems.com/blog/2014/09/08/Xeon-E5-v3-Haswell-EP-Performance-Linpack-595/
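
The extrapolation, written out; this is crude linear per-core scaling and ignores clock-speed differences between the parts:

Code:
# Crude per-core extrapolation from the Linpack results quoted above.
dual_20core_tflops = 0.788
per_core = dual_20core_tflops / 20
print(f"Estimated 18-core Linpack: {per_core * 18:.2f} TFLOPS")
# -> ~0.71 TFLOPS, about 3/4 of a teraflop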
 
This is a theoretical calculation. Let's see how it stands the reality test with Linpack (Intel-optimized, so I presume with AVX), which is the authoritative benchmark for numerical computations. There is no benchmark result for the 18-core processor, but there is one for a dual 10-core E5-2697 v3 system (i.e. 20 cores), and there is a result for the 8-core i7-5960X:

For the dual 20-core: 0.788 Teraflops
For the i7-5960X 8-core: 0.354 Teraflops

Extrapolating from this, we can see that in practice the 18-core will be pumping out about 3/4 of a teraflop. Which is amazing, but still not a full teraflop.

http://www.pugetsystems.com/blog/2014/09/08/Xeon-E5-v3-Haswell-EP-Performance-Linpack-595/

Right, but that's not how CPUs and GPUs are rated. They are rated by their peak FLOPS, not by a specific benchmark. Large clusters or supercomputers may differ.

That guy at the link seems confused: he says he's testing an E5-2687W in the specs, which is an Ivy Bridge part. So I'm not sure what to believe. If he is testing the Haswell-EP part, his theoretical calculation is off, since he didn't multiply by 2 for the two AVX pipelines.
 