Gigahertzes

skritch · Jan 2, 2004

Originally posted by xonik
the P4 EE has varying and usually marginal performance gains over a similarly clocked P4 3.2C.

I thought I'd already pointed out that the P4EE has a large L3, not L2, cache. Thus it's a red herring in this conversation, because L3 runs significantly slower than L2.

Looking at Barton vs. Thoroughbred comparisons, you'll see that the effect of a DOUBLED L2 cache lends a scant 2-3% performance gain at identical clocks.

Under what benchmarks? If it's an artificial benchmark, that's a huge "duh". Artificial benchmarks are designed NOT to exercise cache -- i.e., random memory access patterns, which can't be predicted by prefetch algorithms, and wouldn't speed up if you had 8GB of L2 cache. If it's not used again, it doesn't matter if it's in cache.

Show me that same result using application benchmarks.
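For anyone who wants to poke at the reuse argument, here is a rough sketch of a fully associative LRU cache fed two access patterns. Everything in it (the `hit_rate` helper, the line counts, the 1,500-line working set) is made up for illustration and is not a model of any real Athlon or P4 cache.

```python
import random
from collections import OrderedDict

def hit_rate(addresses, cache_lines):
    """Fraction of accesses served by a toy fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)

random.seed(0)
N = 20_000
# Hypothetical workload that loops over 1,500 lines: plenty of reuse.
looping = [i % 1500 for i in range(N)]
# Hypothetical workload that touches random lines in a huge range: almost no reuse.
scattered = [random.randrange(10_000_000) for _ in range(N)]

for lines in (1024, 2048):
    print(lines, hit_rate(looping, lines), hit_rate(scattered, lines))
```

In this toy setup, doubling the cache only pays off when the working set goes from not fitting to fitting; the random pattern gets next to nothing from either size.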

And just how is the L3 cache much slower than L2 cache when it's on-die? There is very, very little difference in latency between the two cache levels when they are on-die, to the point that they could be effectively combined. Just look at a picture of the die layout and you will see a very small difference in trace length from the cache to the ALU or FPU, leading to very small differences in latency and resultant performance. Your reasoning would have held water when L3 caches were on-chip or even discrete, but now it all falls apart.

Because L3 is the interface with the system RAM. It's not the speed with which it shares data with L2 and the CPU -- it's the speed with which it accesses system RAM. As long as it's the go-between for the CPU and the RAM, it's going to be slower as a matter of function, if not of design.

Finally, all current model Xeons have 512 kB of L2 cache, with varying levels of that forbidden L3 cache.

There are larger-L2 cache chips that can be bought. At least, that used to be the case. I haven't looked in the past year or so.

And I'm not against L3 -- it's a wonderful thing. But L2 is more so.

skritch · Jan 2, 2004

Originally posted by xonik
But if "L3 cache allows a slight speed improvement in accessing system RAM," then wouldn't it too be a part of this awful bottleneck?

See my larger response, above, for the answer to that question. Yes, it is, due to function, but that's not what I said in the sentence you responded to.

GonePostal · Jan 2, 2004

The P4 benefits more from increased cache because of its 20-stage pipeline vs. AMD's, what, 10 stages, or is it 13? The overall throughput of the P4 is higher, but if a hazard occurs in the pipeline then the whole process is stalled and you take an exaggerated penalty. The longer your pipeline, the larger the penalty taken. That is one reason increased cache improves performance. The main fact, though, is that cache is a factor of 10-100x faster than your standard DRAM and like 10000x faster than your mechanical/optical storage.
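A back-of-the-envelope sketch of the pipeline-depth point, assuming the flush penalty is roughly equal to the pipeline depth. The base CPI, branch fraction, and mispredict rate are invented round numbers, not measurements of any real chip, and `effective_cpi` is a hypothetical helper.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty_cycles):
    """Average cycles per instruction once pipeline flushes are counted."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles

# Assumed: 20% of instructions are branches, 5% of those are mispredicted,
# and a flush costs about as many cycles as the pipeline has stages.
for depth in (10, 20):
    print(depth, effective_cpi(1.0, 0.20, 0.05, depth))
```

With the same assumed mispredict rate, doubling the depth doubles the stall cycles added per instruction, which is the "exaggerated penalty" in a nutshell.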

I don't know the exact numbers for the P4, but for Intel's Itanium the L1 cache has a latency of 1 cycle (0 load!). L2's latency is 5 cycles for integer and 6 for floating point, and the L3 latency is 12 cycles.
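To see what those latencies mean on average, here is a minimal sketch that weights each level by the share of accesses it serves. The 1, 5, and 12 cycle figures are the ones quoted above; the hit fractions and the 200-cycle trip to RAM are assumptions picked purely for illustration.

```python
def average_latency(levels):
    """levels: list of (fraction_of_accesses_served, latency_cycles) pairs."""
    total = sum(fraction for fraction, _ in levels)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    return sum(fraction * latency for fraction, latency in levels)

# Assumed split of where accesses are served; only the cache latencies come from the post.
with_l3 = [(0.90, 1), (0.07, 5), (0.02, 12), (0.01, 200)]
# Same workload without an L3: the accesses it caught now go to RAM.
without_l3 = [(0.90, 1), (0.07, 5), (0.03, 200)]

print(average_latency(with_l3), average_latency(without_l3))
```

Under these made-up fractions, the slower L3 still roughly halves the average latency, because each access it catches avoids the far more expensive trip to RAM.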

Just because the caches are all being driven by the same clock does not mean their latency is the same. There are different implementations with different costs associated. The P4 was supposed to have 1 MB of L3 cache in the beginning, but even this was too expensive for Intel to swallow. If you think L2 and L3 cache are almost exactly the same, you are wrong.

Phantum · Jan 2, 2004

Wow, ok this has gone on long enough folks.

First off, the largest bottleneck in a computer would be the throughput between the CPU, RAM and hard drive. Have any of you had to start swapping from your HD because you're out of RAM? Damn, that sucks. Even PC133 is faster than that.

Second, the L3 cache on a P4EE runs at full core clock speed (I believe). Why wouldn't it? It runs at full core clock speed on the Itanium 2.

Trying to compare AMD and Intel is really unfair. It's like comparing apples and oranges. They're completely different architectures. Intel believes in long pipelines, sheer horsepower and ungodly large caches. AMD believes in more efficient clock cycles. Intel runs on a 200MHz quad pumped bus and AMD runs on a 166MHz double pumped bus. Don't even get me started on the HyperThreading issue. It's a lot like asking who's better at running, the sprinter or the 2 mile guy.
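For a sense of scale on the bus numbers, a quick sketch of peak theoretical bandwidth. It assumes both front-side buses are 64 bits (8 bytes) wide, which the post does not state, and `peak_bandwidth_mb_s` is just an illustrative helper.

```python
def peak_bandwidth_mb_s(base_clock_mhz, transfers_per_clock, bus_width_bytes=8):
    """Peak theoretical bus bandwidth in MB/s (assumes a 64-bit bus by default)."""
    return base_clock_mhz * transfers_per_clock * bus_width_bytes

p4_fsb = peak_bandwidth_mb_s(200, 4)      # 200 MHz, quad pumped
athlon_fsb = peak_bandwidth_mb_s(166, 2)  # 166 MHz, double pumped

print(p4_fsb, athlon_fsb)
```

These are ceilings, not measured throughput; real transfers land below them.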

GonePostal · Jan 2, 2004

Originally posted by xonik
And just how is the L3 cache much slower than L2 cache when it's on-die? There is very, very little difference in latency between the two cache levels when they are on-die, to the point that they could be effectively combined. Just look at a picture of the die layout and you will see a very small difference in trace length from the cache to the ALU or FPU, leading to very small differences in latency and resultant performance. Your reasoning would have held water when L3 caches were on-chip or even discrete, but now it all falls apart.

L3 cache is not made in the same way L2 cache is. They have different densities, speeds and latencies. Latency is not dependent only on trace length; it is mainly affected by circuit implementation. Increasing L2 to 2 MB would cause the area to explode, and this would cost way too much. Cost is the main factor in cache design. That is why everything is not implemented in L1, which runs at either 1 load or 0 load (as I stated before). It is too damn expensive.

rolo · Jan 2, 2004

One nice thing about the Pentium 4 is that access to the L1 cache is really really fast. Access to the L2 cache is very fast as well, with latency of only 2-3 clock cycles. It can also transfer up to 16 bytes of data on every clock cycle.

The Athlon (don't know about the 64-bit models), on the other hand, has horrible L2 latencies and throughput. 7-12 clock cycles and it can only transfer 8 bytes every other clock cycle. So while the Athlons usually have lots and lots of cache, they're really really slow. The P4's L2 cache is probably as good as the Athlon's L1 cache.
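Putting the per-clock figures above into bytes per second, as a rough sketch. The transfer sizes are the ones quoted in this post; the core clocks (3.2 GHz for the P4, 2.2 GHz for the Athlon) are assumptions chosen for the example, and `l2_bandwidth_gb_s` is a hypothetical helper.

```python
def l2_bandwidth_gb_s(bytes_per_transfer, clocks_per_transfer, core_clock_ghz):
    """Peak L2 bandwidth in GB/s given bytes moved every N core clocks."""
    return bytes_per_transfer / clocks_per_transfer * core_clock_ghz

p4_l2 = l2_bandwidth_gb_s(16, 1, 3.2)      # 16 bytes every clock
athlon_l2 = l2_bandwidth_gb_s(8, 2, 2.2)   # 8 bytes every other clock

print(p4_l2, athlon_l2)
```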

That said, having lots of L2 cache on the P4 does help. However, not all other chips will benefit from more L2 cache in the same way. The Athlons, the P3 and the P-M do not have the same latency and throughput characteristics as the P4, and so they benefit differently from having various amounts of L2 cache.

Why do you think the "P4 Celeron" is so slow? There's a huge disparity in how fast it can access its small L2 cache, and how fast it can access regular memory. As an academic exercise, it'd be interesting to see how a Celeron would perform with that 2MB of L3 cache strapped on it.

So hey, it would be nice if skritch would come up with some of his own "evidence" for once instead of pointing out one difference between two chips and saying "Oh, that's the full reason for the performance!" I mean, that's like saying "A Mercedes-Benz E55 AMG is way faster than a Ford Explorer, and is much smaller. Therefore, smaller cars are faster." This analogy kinda works if you look at what they use in the Indy 500, but then why aren't those little RC cars at Radio Shack breaking the sound barrier?

Suki243 · Jan 2, 2004

someone should sum all this up and post it as a sticky

malingjc · Jan 3, 2004

Maybe all of our processor geniuses here should make a new processor company...AMtel anyone?

emorphien · Jan 3, 2004

Originally posted by Phantum

Second, the L3 cache on a P4EE runs at full core clock speed (I believe). Why wouldn't it? It runs at full core clock speed on the Itanium 2.

That's all well and good if it runs at clock speed, but if it takes 100 clock cycles then that doesn't help a whole lot. That's the point they're making: it runs at clock speed, but a certain number of cycles still have to pass before the data arrives.
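The clock-versus-cycles distinction is just a division. A minimal sketch, assuming a 3.2 GHz core clock; the 12-cycle figure is the Itanium L3 number mentioned earlier in the thread and the 100-cycle one is the hypothetical from this post.

```python
def latency_ns(cycles, clock_ghz):
    """Wall-clock latency in nanoseconds for a given cycle count and clock."""
    return cycles / clock_ghz

for cycles in (12, 100):
    print(cycles, latency_ns(cycles, 3.2))
```

Same clock in both cases; the wait is what differs.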

rolo · Jan 3, 2004

Originally posted by Suki243
someone should sum all this up and post it as a sticky

The charlatans would still piss and moan and claim they were right regardless.

TheMostWantedPolishTwin · Jan 3, 2004

P4 has a weird architecture - being in fact a RISC processor with a CISC shell makes it run slower... and yeah - cache management is critical when it comes to performance...

[Neural Interface] · Jan 3, 2004

That was quite the informative read. This should be edited and stickied; very useful for those who are ignorant (AMD Roxxurs, Intel sucks, etc., etc.)

Maybe all of our processor geniuses here should make a new processor company...AMtel anyone?

If only the best of both could be combined, 3.2 64 would be very nice

skritch · Jan 3, 2004

Originally posted by TheMostWantedPolishTwin
P4 has a weird architecture - being in fact a RISC processor with a CISC shell makes it run slower... and yeah - cache management is critical when it comes to performance...

Er...it's not RISC. 326 opcodes in today's P4, some of them near-redundant. You don't find that in a RISC chip.

GonePostal · Jan 3, 2004

His post is correct, well, kind of. You could view the P4 as a RISC processor emulating the CISC instruction set through its micro-instruction set. I learned this in class from my computer organization prof. So I would hazard a guess that he is right.

TheMostWantedPolishTwin · Jan 3, 2004

Originally posted by skritch
Er...it's not RISC. 326 opcodes in today's P4, some of them near-redundant. You don't find that in a RISC chip.

I'm taking my second semester of computer architecture classes, and this semester I had quite a lot of lectures about Intel's CPU architecture, so I know what I'm saying... and these were the exact words of my professor and, believe it or not - he knows what he's talking about...

skritch · Jan 3, 2004

Originally posted by GonePostal
His post is correct, well, kind of. You could view the P4 as a RISC processor emulating the CISC instruction set through its micro-instruction set. I learned this in class from my computer organization prof. So I would hazard a guess that he is right.

Hm. Odd.

rolo · Jan 3, 2004

Since the Pentium Pro, the Pentiums have been RISC chips that just emulate the x86 IA-32 CISC instruction set. The CISC instructions get translated into 1-or-many RISC "micro ops" and are then re-ordered, scheduled and executed from there. In a way it's like the x86 machine code gets recompiled and optimized into an even lower level form of machine language. When armed with this knowledge, you can do a lot of really good optimization work at the assembly-language level. Unfortunately, the micro-op architectures are different between the 6th and 7th generations of Intel processors, and even between Intel and AMD processors. The "4-1-1" rule only holds for the PPro through the P3, and the P4 has a whole suite of do's and don'ts that are totally different.
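A toy sketch of why instruction ordering matters under the "4-1-1" rule, assuming the usual simplified reading of it: each cycle, the first decoder can take an instruction of up to 4 micro-ops, and the other two can each take only a 1-micro-op instruction, in program order. The `decode_cycles` function is hypothetical and ignores everything else a real decoder cares about (instruction length, fetch alignment, microcoded instructions).

```python
def decode_cycles(uop_counts):
    """Count decode cycles for an in-order stream under a toy 4-1-1 model.

    uop_counts: micro-ops per instruction, in program order (each 1 to 4).
    """
    if any(count < 1 or count > 4 for count in uop_counts):
        raise ValueError("toy model only handles instructions of 1 to 4 micro-ops")
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1                      # decoder 0 takes the next instruction, whatever its size
        for _ in range(2):          # decoders 1 and 2 take only single-micro-op instructions
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break               # a bigger instruction must wait for decoder 0 next cycle
    return cycles

# The same six hypothetical instructions in two different orders.
print(decode_cycles([2, 1, 1, 2, 1, 1]))
print(decode_cycles([1, 1, 2, 1, 1, 2]))
```

In this simplified model, putting the multi-micro-op instruction first in each group keeps all three decoders busy, while the other ordering wastes slots.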
