Name: Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Future
Item: Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Future
Author: Anand Lal Shimpi

Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Future

by Anand Lal Shimpi on 11/19/2008 8:00 PM EST

Posted in
CPUs

Post Your Comment
Please log in or sign up to comment.

Comments Locked

33 Comments

Back to Article

SiliconDoc - Saturday, December 27, 2008 - link
" The move to multi-core chip designs meant that the focus was no longer on feeding the individual core, but making sure all of the cores on the chip were taken care of. It’s all so very socialist (oh no! ;) ). "
Umm... wouldn't socialist be more like feeding one or two of the four cores, and all you get back is cache misses, or a stalled pipeline anyway ? I mean, if you're feeding the core, it is expected to do some work. So this is clearly not socialist.
Maybe you meant it's socialist in the sense that all 4 cores can't be kept fed and therefore working, so most the time they're standing around with the shovel, because yeah they are "on the job", but they ain't doing much, but the clock keeps ticking and ticking... gosh when is the city street core gonna move that and+ sign, and why does it take 5cas to get it done...look at that the bueaucratic core keeps pushing through dirty bits paper and asking for another copy...
:-0)
Yes, making the 256k cache work properly to "feed" all the cores and keep them working is the exact opposite of socialism. The exact opposite.
JonnyDough - Friday, November 21, 2008 - link
I think it's pretty obvious what's happening here. Intel is pedaling slowly, waiting for AMD to catch up. It's obvious to everyone that more cache would feed the cores better, but with nothing from AMD to answer - there's no need to make huge gains. They'd prefer to milk the market with incremental advances and sell us the same architecture more times, increasing cache as they will.
Bullfrog2099 - Friday, November 21, 2008 - link
Theres one more thing that Intel needs to be grilled about.
Will Bloomsfield and Westmere run on the same motherboard since they both use the same socket? Or will you have to buy a new mb if you want to upgrade to 32nm?
USSSkipjack - Thursday, November 20, 2008 - link
I am also very concerned about the cache sizes in the Nehalem architecture. We are doing interactive realtime software volume rendering and that has naturally a very high demand on the CPU. A current Q9450 is actually doing quite well due to its large cache (12 MB) it is also quite inexpensive for its power. Actually we need pure FP power and the ability to feed large amounts of data to the CPU in a short time. Just imagine gigabytes of data that need to be processed in realtime.
Of course faster is always better and due to the nature of our renderer it also scales very nicely with additional cores. So we were hoping for the release of an affordable 8 core (single package or die) product before the first quarter of next year. Instead we get Nehalem which seems to be inferior for our cause due to its smaller (and slower) cache. Hyperthreading wont help us much, due to the FP nature of almost everything we do.
Still I am somewhat encouraged by the rather good benchmarks we have seen so far. So my question is, in a direct comparison how does a very cache dependent application do in a Nehalem CPU versus a Q9750 or simillar current quad core?
Anyone got any benchmarks that shed light on that?
bollux78 - Thursday, November 20, 2008 - link
MAY BE with current knowledge (or lack of) about general physics and stuff that overcome any industrial attempt, there is nothing very exciting to do about processors. We REALLY should stop thinking stuff will go faster and faster and start to look back at real programmers that made miracles on insignificant hardware, like the amiga, lots of jap PCs etc. Video games should not be even commented, because their hardware is un-upgradeable, and yet, the guys make things run on them no matter what. see PS2 as an outrageous example. they made all NFS series on n and more, and nobody put more than 32mb of ram. I´m sorry guys (ans gals where applicable) but this is all bull, people wanting to tease all of us with new unncessary hardware, lots of useless modifications that make no sense for the end user etc. There should be more programmers and soft engineers and less wall street bastards.
mutarasector - Monday, November 24, 2008 - link
"We REALLY should stop thinking stuff will go faster and faster and start to look back at real programmers that made miracles on insignificant hardware, like the amiga, lots of jap PCs etc."

As a former Amiga software developer, I tend to agree with this statement. While I wouldn't go quite so far as to say we shouldn't expect to see hardware speed improvements, I would modify that expectation in that future hardware R&D shouldn't be oriented towards 'brute force' speed enhancements, but should be more granular refinements. To be sure, I think we are indeed seeing this from both AMD and Intel, particularly with cache structuring and power optimizations.

The *real* problem here is that hardware development is still largely driven by an agenda with an eye towards favoring Microsoft bloatware on monolithic architectures by AMD and Intel. This is ultimately a dead-end as it still fails to address the fundamental mindset changes required to push computing technologies forward in the long term. The first (and most important) change is to get OS and application development moving towards more tightly handcrafted/refined coding that assumes more responsibility for things currently implemented in silicon rather than contiuing down the path of "how can continue to take it up the whazoo endlessly refining our silicon to satisfy Microsoft bloatware requirements".

Microsoft and Google are racing towards iOS dominance (even though
at times I think M$ forgets this race is even on), but of the two, Google is clearly the leader here. What would be cool is if Google would get guys like Carl Sassenrath (Amiga Exec/OS developer) to glue it's Google apps together on Carl's REBOL platform.
Shmak - Friday, November 21, 2008 - link
While I follow your reasoning with newer, better, faster, and smaller being crammed down our throats, but speed issues cannot be resolved as easily through the software side of things. The software industry has come to rely on the current cycle of ever speedier hardware.

And rightfully so, as nobody wants to have to translate millions of lines of C into Assembler for better efficiency than compliers do. Not to mention the fact that every piece of software with a decent budget uses 3rd party platforms like Direct X. Sure somebody could probably write something more efficient for the specific aims of their game or whatever, but it would probably take them far longer than just integrating the platform as is. Not to mention the fact that many of these things are like "black boxes" that have to be opened up and figured out in order to be improved upon, and people generally don't like messing with code that isn't their own. Yet these 3rd party bits are necessary in the end.

The final thing is that when you program anything decent sized for a PC, you are depending on other people's code. The drivers, the OS, Open CL, whatever, all these things make it possible for consumers to use a variety of hardware. Game consoles and the Amiga could be fine tuned because the developers knew the machine they were testing on was EXACTLY the same as the one that was sitting in your office/living room however many years ago.
USSSkipjack - Thursday, November 20, 2008 - link
Yes, but... physics also limit the amount of data that a current CPU can process either way you see it. One can of course fake things, but there are applications (like ours) where this is not an option. Faking and smoke and mirrors and optimizations are something that works very well for games and even for movie effects work, but we can not do that.
It is already a miracle that we do the stuff we do with the current hardware (most others try to use GPUs or even expensive special purpose hardware for that, with all the issues that come with this).
The advantage of using the CPU is scalability. We do fine on a current Q9450, now imagine what we can do on two?
My issue is whether we should do that, or whether Nehalem will indeed bring benefits even for us, even it has a smaller cache. Maybe the new cache configuration and the architecture outdo the lack of cache. This is why I would like to see a direct comparison of Nehalem CPU with a Q9750 in apps that are particularily benefited by larger caches and/or that process a lot of data.
Also, does anyone know when the 8-core Nehalems will come to market?
chizow - Thursday, November 20, 2008 - link
Glad you grilled them about Nehalem's lack of L2 Anand. Lots of good info there, but you should've asked him point-blank if he knew Nehalem didn't show any improvement and in many cases, was slower than Penryn in games. It would've been interesting to hear Ronak's response to that. There's the Guru3D article that shows significant gains in Tri-SLI with an i7 but I haven't seen any other reviews that show nearly as much gain. Hopefully the L3 latency tweaks in Westmere improve gaming performance, but for now there doesn't seem to be much reason for gamers to upgrade from Core 2.
ltcommanderdata - Thursday, November 20, 2008 - link
Does having a L3 cache inherently impose a latency constraint on the L2 cache? Afterall, the last time Intel had an independent L2 cache it was on Dothan which was 2MB with 10 cycle latency. Now Nehalem's independent L2 cache is only 256k at 10 cycles and they say going to 512k would have made it 12 cycle.

So Westmere is really just going to be a die shrink? I was hoping it'll be something like Penryn, which even though it didn't change that much, I believe it still outperformed Conroe by 5-10% on average. I believe there are some more SSE instructions coming for Westmere for AES and other things.

Supposedly the OpenCL spec has been completed in record time thanks to pressure from Apple to get it out in time for Snow Leopard. It's only awaiting lawyer IP approval. Any chance of getting the details?
IntelUser2000 - Saturday, November 22, 2008 - link
To: ltcommanderdata

Actually you can't compare to Dothan. You have to compare to Conroe/Penryn. Conroe's L2 latency is at 14 cycles. I think it went up to make up for the complexity of the core(which is more than Dothan). Nehalem makes it even more complex.

The reason individual transistors can run at 200GHz+ within certain research labs but nowhere near with a commercial chip is they have to synchronize every part of the chip with the clock.

The CPU designers seem to take some chances when making a chip. Likely that's the reason for the delays for certain products as if you make a wrong decision then the prototypes might not come up as you wanted and you gotta make up for it.

That's probably the reason that Conroe didn't come with SMT as the Israeli team managing the chip wasn't experienced as the team that made the P4. They probably could have but risking it would not have been a good idea.

The Israeli team clings on proven technologies while the Hillsbro team makes up more radical ones, like Trace Cache, Out of Order, SMT, etc.
JonnyDough - Friday, November 21, 2008 - link
It should be exactly like Penryn. Die shrink = less heat = higher clocks = performance increase.
ltcommanderdata - Friday, November 21, 2008 - link
The point is that Penryn was not just a dumb shrink of Conroe with added cache as Presler was of Smithfield. Penryn wasn't a major redesign, but it did have architectural tweaks over Conroe including speeding up how the execution units divide numbers and execute shuffles. The FSB was also reworked to allow half multipliers while lower power states were added in mobile versions. VT support was enhanced and of course SSE4.1 was added.

I believe clock-for-clock Penryn is on average 5% faster than Conroe while the difference can be substantially higher for SSE4.1 optimized apps. When I say I hope Westmere is more like Penryn, I'm hoping for similar tweaks to be made to increase performance clock-for-clock, rather than just relying on 32nm to increase clock speeds. I don't believe Intel is releasing another SSE instruction set before AVX in Sandy Bridge, so I guess they'll have to dig deeper for a performance boost.
VaultDweller - Thursday, November 20, 2008 - link
"We’re finally getting wind of X58 motherboards at well below $300"

Oh, please do share! This is what I'm interested in. Without this I would not even consider touching Nehalem with a ten foot pole.

In the past I brushed off X38 and X48 completely, as it was so hard to find reasonable motherboards based on these chipsets. X58 is shaping up to be the same.

The problem is that when I found X38 to be too expensive, I was able to find my peace with a P35 board (a P5K Premium). If I had building a system when X48 was hot off the press, I could find comfort knowing that P45 was right around the corner. There is no such comfort with Nehalem - the only lower-priced chip platform on the radar is based on a different socket, like S754 all over again.

I don't want to cripple or limit the options for my next system build by going with LGA1156, but I don't want to pay $300-450 for a motherboard either.
heavyglow - Thursday, November 20, 2008 - link
this is exactly what im thinking. im concerned that intel will abandon LGA1156 and ill be left with nothing.
3DoubleD - Thursday, November 20, 2008 - link
I can think of the reverse scenario where AMD abandoned the 940 platform and released all FX processors on 939. Neither option is safe, just pick one you don't mind sticking with if you have to.
Kiijibari - Thursday, November 20, 2008 - link
It's so small because Nehalem is a 100% Server design.

Because of this Intel went ahead with the inclusive cache design. It comes in quite handy in MP systems, if you just have to probe one L3 only instead of 4 L1/L2 caches.

But there is one drawback, bigger L2 kills the benefit of the L3 size.
Neglecting the L1 Caches, Nehalem has an effective L3 size of 7 MB, as 4x256kb are just copied data from the L2.
Now imagine what would happen if intel would double the L2. Effective L3 cache size would have shrunk to 6MB, 2 MB waste .. that a lot of transistors.

To make L2 problems worse, Intel reintroduced Hyperthreading. Great technique, no doubt, but now we even have 2 threads struggling for the tiny, little 256kb cache.

I guess all the decisions pay off in a server environment, but to state that intel designed the small size L2 Caches because of the latency only is just a fine excuse for all the wanna-be gamers, who once heard that CL3 memory is better than CL5.

cheers

Kiiji
plonk420 - Thursday, November 20, 2008 - link
If 8core i7s will work on x58, i'll likely bite sooner rather than doing a "wait and see."

does this seem highly likely? or is it anyone's guess?
Casper42 - Thursday, November 20, 2008 - link
Speaking of which, I ran across this today on accident:

http://www.ecs.com.tw/ECSWebSite/Downloads/Product...">http://www.ecs.com.tw/ECSWebSite/Downlo...ilName=M...

The ECS X58B-A
Contains:
6 DDR3 Slots
2 x16 Slots
1 x4 Slot
2 x1 Slots
1 PCI Slot

The Manual makes mention of SLI as well which was surprising to me.

I can see that a machine with this ECS Board, a 920 proc and 2 x 9800GTX+ cards (Currently going for around $150 each) and you could have a pretty potent little machine for around $1000
iwodo - Thursday, November 20, 2008 - link
So we wont see new Mobile Part till 2010 ?

That doesn't sound right to me at all. If that is the case then the rumours about it being a 32nm part may be right.

However, the idea Intel not updating their Mobile Part for 18 months doesn't sound right to me at all.
blyndy - Thursday, November 20, 2008 - link
Yeah I think that Intel has failed to be consistant between Penryn/Nehalem, or at least bit off more than it could chew...

I mean, tick-tock is fine and all, but Penryn has really held up as an ideal architecture, as something to grow off, not as something that should be immediately succeeded by 'the biggest architectural redesigns since the Pentium 1'. After all Core/2 IS the fruit of the P3/Pentium M. Nehalem on the other hand smells unpleasantly P4-like, due in large part to hyperthreading.

HT's something that you either see as 'reduces single thread performance, consumes transistors adds arch bloat and adds heat' or as 'OMG MOAR CPUs IN TASK MANAGERZOMG!!!'. And it's funny because at the end of the day it still runs into the same question as the core-count issue --'are these additional execution units adequately utilisable by a DIVERSE set of applications (i.e. NOT JUST vid encode...)?'. Because we know threading is HARD.

OK I'll come clean-I just what to know when are they going to dust off the wolfdale masks and shrink 'em onto 32nm? :D
esgreat - Thursday, November 20, 2008 - link
Well tell me where does the mainstream user want for in performance other THAN video encoding and image processing?

Games? Games have shown to be much more limited to GPU side. Most CPU enhancements can't really make huge impacts on games nowadays. This is unless they try to utilize more cores, which is what they have done: provide the hardware so that software could use it. The software is fast right now because things like SSE were introduced years ago (although they weren't beneficial then).

As for video encoding, cutting down encoding time from 30 minutes to 10 minutes IS A BIG DEAL. And this is one application where many users (non gamers) would really use.

Enabling of multicore now means fantastic applications in the future.
cocoviper - Thursday, November 20, 2008 - link
Pretty good read.

I'm hoping we can get some more detailed info about Intel's 32nm process in the next couple months- especially what they're planning to do with Atom and 32nm.
CEO Ballmer - Wednesday, November 19, 2008 - link
AMD is still in the game?
I had written them off!

http://fakesteveballmer.blogspot.com">http://fakesteveballmer.blogspot.com
Derbalized - Thursday, November 20, 2008 - link
AMD is still in the game.
AMD is designing Intels next chip.
Probaly with an integrated memory controller also.
Derbalized - Thursday, November 20, 2008 - link
I probaly should have spelled probably right. LOL
piesquared - Wednesday, November 19, 2008 - link
Nope, don't give a shit. But do want to know what keeps happening to all these AMD and ATI reviews you keep promising over and over.
chizow - Thursday, November 20, 2008 - link
LOL. There was a brilliant post on DT that basically claims AMD has now shifted their focus to producing Roadmaps. A bit harsh, but honestly pretty accurate.

Wait til AMD actually releases a new product before getting all emo about a lack of AMD reviews.
whatthehey - Thursday, November 20, 2008 - link
You want an AMD review? Here's one for you: AMD's current products suck for the vast majority of users. The only place they're worthwhile is in the 8S server space; otherwise, they cost too much and deliver too little. Their dual-core parts were awesome when all they had to do was beat Pentium D, but Intel has progressed substantially since then and all AMD has got is a bloated, buggy, slow POS known as Phenom. At least the name is right: it's a phenomenal failure.

Or maybe you mean the various ATI reviews posted during the past couple months?
http://www.anandtech.com/video/showdoc.aspx?i=3441">http://www.anandtech.com/video/showdoc.aspx?i=3441
http://www.anandtech.com/video/showdoc.aspx?i=3437">http://www.anandtech.com/video/showdoc.aspx?i=3437
http://www.anandtech.com/video/showdoc.aspx?i=3420">http://www.anandtech.com/video/showdoc.aspx?i=3420
http://www.anandtech.com/video/showdoc.aspx?i=3415">http://www.anandtech.com/video/showdoc.aspx?i=3415
http://www.anandtech.com/video/showdoc.aspx?i=3405">http://www.anandtech.com/video/showdoc.aspx?i=3405

Oh, but that's not good enough for the AMD fanboyz! Everyone needs to baby AMD and talk about how awesome they are, when AMD is busily circling the drain and getting ready to spin off their fabrication to a separate company. ATI is doing pretty well, and AMD made some good hardware in the past; unfortunately, it doesn't look like they were able to continue to compete.

And honestly, it's no big surprise. Even Intel is having a tough time competing with their own products. Nehalem is a nice design, but as I've told others we are at the point where 95% of people don't need anything more than a three year old Athlon 64 X2. Quad-core only matters to a small number of desktop users at best, and here Intel and AMD are both looking to hex-core and octal-core in the not too distant future. That's great if you do video work or 3D rendering, but pretty much useless for everyone else.

I lust after the new Nehalem upgrades as much as the next guy, but invariably I come back to the realization that my pathetic Q6600 @ 3.00GHz (yes, I backed off from 3.6GHz when I realized that the extra voltage and stress on the system wasn't actually improving performance in any of the applications I use on a regular basis) was more than fast enough for any current program. About the only thing I need right now is an upgrade in the video card department, and I don't need Nehalem for that!
Griswold - Thursday, November 20, 2008 - link
After reading

"AMD's current products suck for the *vast majority* of users"

I knew that your entire posting would have less substance than a steaming pile of cow dung. Why is it that the most clueless people always type up the biggest shitstorm of incoherent garbage..?
whatthehey - Thursday, November 20, 2008 - link
Okay, my first sentence or two was off base, I admit. It's because piesquared made an assinine comment about an article. Anand gives an interesting piece about cache sizes, and some prick responds with, "nope - all I care about is hearing about AMD's same-old same-old designs!" Most of us like to think about the ramifications of cache sizes and CPU architectures, and frankly AMD doesn't have a lot to discuss in that area right now. Nehalem is a pretty major change to Intel's recent architectures, and as such it's worth discussing.

If you'd climb off your high horse for a minute and read the rest of my post (rather than getting your "I Love AMD" dander up, oh great and noble Griswold, defender of AMD), you'd see a lot of facts that are hard to argue with. Performance wise, AMD is sucking Intel's dust in pretty much every area except 8S heavily loaded servers, and there the bigger deal is they do better in performance per watt. Pricing on their CPUs is good, but only because they need to lower prices in order to compete - and Intel has been matching them quite well.

That said, anyone that doesn't see MASSIVE problems for AMD right now has some serious blinders on. This new Foundry Company split is going to put a stake in their heart, mark my words. I only hope someone steps in to fill the void when AMD inevitably fails and disappears, because you just can't compete with Intel by breaking your business into smaller pieces that will have more problems working together than they do when they're all under the same umbrella. If AMD is already behind schedules repeatedly with the current setup, how are they going to do better when they become fabless and have to go through a third party for the various stages of production?
Regs - Thursday, November 20, 2008 - link
AMD's primary problem in the industry is setting a tangible goal/dead line and actually meeting that goal or dead line. "Well at least they won't release it all buggy en' what not". Maybe that line works for video games, but not in this cut throat industry. Intel has been beating AMD to the punch time and again in the CPU market and eroding AMD's sales and market share. Which is why AMD has to retool, reorganize, and follow through with their roadmaps or else they'll have to figure out what products they can actually compete with. Shanghai, even though a little 18 months late, is a good sign of execution by AMD; delaying a CPU/GPU combination platform for a notebook until 2011 is not. Notebooks are a big source a revenue AMD will be passing up in the next 2 years to Intel and they're really going to need something highly competitive if they wish to earn any market share back by 2011.
Lonyo - Wednesday, November 19, 2008 - link
If the Lynnfield has x16 PCIe, and the diagram shows no SB/MCH, does that mean the P55 will be a single chip design and include the extra PCIe slots, and might it be possible to do triple SLI if manufacturers use the PCIe slots from the CPU as well as the chipset?

Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Future

Post Your Comment

33 Comments

Back to Article

SiliconDoc - Saturday, December 27, 2008 - link

JonnyDough - Friday, November 21, 2008 - link

Bullfrog2099 - Friday, November 21, 2008 - link

USSSkipjack - Thursday, November 20, 2008 - link

bollux78 - Thursday, November 20, 2008 - link

mutarasector - Monday, November 24, 2008 - link

Shmak - Friday, November 21, 2008 - link

USSSkipjack - Thursday, November 20, 2008 - link

chizow - Thursday, November 20, 2008 - link

ltcommanderdata - Thursday, November 20, 2008 - link

IntelUser2000 - Saturday, November 22, 2008 - link

JonnyDough - Friday, November 21, 2008 - link

ltcommanderdata - Friday, November 21, 2008 - link

VaultDweller - Thursday, November 20, 2008 - link

heavyglow - Thursday, November 20, 2008 - link

3DoubleD - Thursday, November 20, 2008 - link

Kiijibari - Thursday, November 20, 2008 - link

plonk420 - Thursday, November 20, 2008 - link

Casper42 - Thursday, November 20, 2008 - link

iwodo - Thursday, November 20, 2008 - link

blyndy - Thursday, November 20, 2008 - link

esgreat - Thursday, November 20, 2008 - link

cocoviper - Thursday, November 20, 2008 - link

CEO Ballmer - Wednesday, November 19, 2008 - link

Derbalized - Thursday, November 20, 2008 - link

Derbalized - Thursday, November 20, 2008 - link

piesquared - Wednesday, November 19, 2008 - link

chizow - Thursday, November 20, 2008 - link

whatthehey - Thursday, November 20, 2008 - link

Griswold - Thursday, November 20, 2008 - link

whatthehey - Thursday, November 20, 2008 - link

Regs - Thursday, November 20, 2008 - link

Lonyo - Wednesday, November 19, 2008 - link

Log in

Don't have an account? Sign up now