memtest under Windows

mikeblas

HCI Design's MemTest claims to test memory under Windows. How can it do so when Windows controls the virtual-to-physical memory mapping? The program might test some memory, but any other program can create memory demand, trigger a remapping, and invalidate the coverage the test believes it has.

What is it really doing?
 
Good question.

They have this caveat on their website:
No Windows program can directly check the RAM used by the OS; this is a fundamental limitation of using a modern OS. If you need to check every byte, consider purchasing MemTest Deluxe, which boots off of CD for unfettered access to RAM.
Unfortunately, it doesn't provide much insight into how much coverage you can actually get out of the tool, practically speaking.

My recollection, having very occasionally run memory testers on non-Windows OSes, is that to increase coverage you are supposed to decrease the memory pressure as much as possible prior to running the tester. The tester process then requests a (large) amount of memory to be allocated to it, and proceeds to test it.

I suppose, if the OS doesn't swap any memory to secondary storage, then there is a guarantee that you can test an amount of memory equal to what was allocated. Maybe I'm missing something here?
 
Somewhat off-topic: in the case where memory demand from other programs causes swapping to storage, if we model the memory mapping algorithm as an unconstrained random process, then the question of coverage becomes a rephrased version of the coupon collector problem, and we can put hard analytical bounds on the coverage probabilities.
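Here's a toy Monte Carlo version of that model, just to make the assumption concrete: every page the tester touches is treated as a uniformly random draw over the machine's physical pages, which is almost certainly not how a real VM manager behaves. (Under this model the expected number of draws needed to see all n pages is the classic coupon-collector n·H_n.)

Code:
#include <stdio.h>
#include <stdlib.h>

/* Toy model: every page the tester touches lands on a uniformly random
 * physical page. Counts how much of physical memory is never visited. */
#define PHYS_PAGES 16384   /* pretend-machine: number of physical pages */
#define TEST_PAGES 12288   /* pages the tester touches per pass (75%) */
#define PASSES     8
#define TRIALS     1000

int main(void)
{
    static unsigned char seen[PHYS_PAGES];
    long total_missed = 0;
    int full_coverage = 0;

    for (int t = 0; t < TRIALS; t++) {
        for (int i = 0; i < PHYS_PAGES; i++) seen[i] = 0;

        for (int p = 0; p < PASSES; p++)
            for (int i = 0; i < TEST_PAGES; i++)
                seen[rand() % PHYS_PAGES] = 1;   /* this page got "tested" */

        int missed = 0;
        for (int i = 0; i < PHYS_PAGES; i++) missed += !seen[i];
        total_missed += missed;
        if (missed == 0) full_coverage++;
    }

    printf("avg physical pages never touched: %.1f of %d\n",
           (double)total_missed / TRIALS, PHYS_PAGES);
    printf("trials with full coverage: %d of %d\n", full_coverage, TRIALS);
    return 0;
}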

This assumption may be totally unrealistic, as I'm unfamiliar with page replacement and virtual memory mapping algorithms. I get some small joy whenever I can restate questions as classic ones in probability theory.
 
I tried running their app, and it told me that "windows limits the amount of memory a single program can allocate" and that it could only test 2047 megs at a time. My machine has 64 gigs. This program can test the memory it allocates, sure. But first it has to be able to allocate that memory. LOL!
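For what it's worth, a 2047 MB ceiling smells like a 32-bit build: the process runs out of address space long before the machine runs out of RAM. A quick sketch of the difference using stock Win32 calls (GlobalMemoryStatusEx and VirtualAlloc); nothing here is specific to HCI's tool:

Code:
#include <windows.h>
#include <stdio.h>

/* Report installed RAM vs. the process's own address space, then try one
 * big allocation. A 32-bit build fails the 3 GB request no matter how
 * much physical RAM the first line reports. */
int main(void)
{
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);

    printf("physical RAM:          %llu MB\n", ms.ullTotalPhys / (1024 * 1024));
    printf("process address space: %llu MB\n", ms.ullTotalVirtual / (1024 * 1024));

    unsigned long long want = 3ULL * 1024 * 1024 * 1024;   /* 3 GB in one shot */
    void *p = VirtualAlloc(NULL, (SIZE_T)want, MEM_RESERVE | MEM_COMMIT,
                           PAGE_READWRITE);
    printf("3 GB allocation: %s\n", p ? "succeeded" : "failed");
    if (p) VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}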

An OS-free boot-to-test arrangement means you're running close to the metal and can map your own memory. When the OS is there, you've got no idea whether the memory you're testing is physically contiguous or not. The 1.999 gigs tested with this program might be on one stick, or spread across all of your sticks, or in any pattern in between.

In neither case can you tell which stick produced a given error byte, but in the latter case I think you have a much higher chance of a test passing after effectively testing only one stick, while the many other sticks in the system could be bad.

I wonder if the issue at the core of this thread is caused by in-OS testing being subject to this kind of mapping. Given the "recommend hci memtest over anything" response in that thread, I was really surprised to see it's not doing anything special (AFAICT) to map or pin memory.
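To be concrete about what "pin" could even mean here, a minimal sketch of one approach, using VirtualLock to keep a test buffer resident so it can't be paged out mid-test. Even then you only know the pages stay in RAM somewhere; you still can't see or choose which physical frames back them.

Code:
#include <windows.h>
#include <stdio.h>

/* Allocate a test region and pin it with VirtualLock so it can't be paged
 * out while the test runs. The working-set bump is needed first, because
 * VirtualLock won't lock more than the process's working-set quota. */
int main(void)
{
    SIZE_T len = (SIZE_T)512 * 1024 * 1024;   /* 512 MB test region */

    SetProcessWorkingSetSize(GetCurrentProcess(),
                             len + (16 << 20), len + (32 << 20));

    unsigned char *buf = VirtualAlloc(NULL, len, MEM_RESERVE | MEM_COMMIT,
                                      PAGE_READWRITE);
    if (!buf || !VirtualLock(buf, len)) {
        printf("couldn't allocate and pin %lu MB (error %lu)\n",
               (unsigned long)(len >> 20), GetLastError());
        return 1;
    }

    /* ... run write/read/compare passes over buf here ... */

    VirtualUnlock(buf, len);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}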
 
You could run 32 instances of the program in parallel. This is not a serious suggestion.

Not being able to pinpoint the failing memory module(s) seems like a major deficiency, particularly if there are a lot of them as might be the case for you.

I have used similar in-OS testers in OSX as a sanity test after installing new memory. I think I was able to allocate something like 13GB out of 16 without any swapping. If I actually suspected a memory failure then, as you suggest, I would switch to an OS-free at-boot tester.
 
If you suspect memory failure, could you use WHEA to monitor ECC and parity events while running workloads that exercise all of that memory?
 
It's not my memory. I'm asking about this because it feels like the software is not really doing a good job of its advertised function, and on top of that people believe it's doing a great job of that function.

I'm curious about the software -- so I asked here and not in the Memory forum thread. And to that end: does WHEA notice correctable/corrected ECC errors? Most people don't run ECC, and simple parity errors aren't correctable and result in a blue screen.
 
It seems that for Xeons you can set them up to count "memory ECC errors". My reading of this is that it's not counting correctable errors.

I presume WHEA or any other monitoring software would leverage this mechanism to do ECC reporting.

In any case, ECC error detection can't exhaustively guarantee that your memory is functioning correctly; I think you need to rely on an explicit write-read of each address if you want multiple sigma confidence that your memory is good.
 
What machine wouldn't reboot after a non-correctable memory error? I know Linux can be configured this way, but who in their right mind would do that?
 
I could only speculate. Outside of specifically investigating ECC errors, and maybe certain embedded applications, it's hard to imagine why you'd want to keep the machine running.
 
It's not my memory. I'm asking about this because it feels like the software is not really doing a good job of its advertised function, and on top of that people believe it's doing a great job of that function.
I agree. The promises seem to be oversold.

I figured that wanting to test 64GB meant you already suspected a problem, which is why I suggested monitoring particular memory events.
 
My memory isn't the question. I just ran the tool on my machine to see what it would / could do.
 
Is memtest86+ 5.0.1 not recommended anymore?

I've got it hosted on my LAN via TFTP; if I think there's an issue with memory, I just PXE boot it and let it run overnight.
 
This is kind of outside your beef with how this piece of software is advertised, but I know what you're getting at.
How about creating a huge ramdisk and torturing that instead? Sure, it won't be all of your RAM, but if you free up enough you increase your chance of detecting something.
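Something like this is what I have in mind, as a rough sketch. R:\torture.bin is just an assumed path for wherever the ramdisk mounts, and be aware the OS file cache can end up serving some of the reads back out of regular RAM, so it's fuzzier than it looks:

Code:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Fill a file on the ramdisk with a known pattern, then read it back and
 * compare, block by block so the buffers themselves stay small. */
#define BLOCK_WORDS (1 << 20)   /* 8 MB of uint64_t per block */

int main(void)
{
    static uint64_t out[BLOCK_WORDS], in[BLOCK_WORDS];
    const char *path = "R:\\torture.bin";   /* assumed ramdisk drive letter */
    int blocks = 1024;                      /* ~8 GB total; size to taste */

    for (size_t i = 0; i < BLOCK_WORDS; i++)
        out[i] = 0xA5A5A5A5A5A5A5A5ULL ^ i; /* simple per-word pattern */

    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); return 1; }
    for (int b = 0; b < blocks; b++)
        fwrite(out, sizeof(uint64_t), BLOCK_WORDS, f);
    fclose(f);

    f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }
    int bad = 0;
    for (int b = 0; b < blocks; b++) {
        size_t got = fread(in, sizeof(uint64_t), BLOCK_WORDS, f);
        if (got != BLOCK_WORDS || memcmp(in, out, sizeof(out)) != 0) bad++;
    }
    fclose(f);

    printf("%d of %d blocks mismatched\n", bad, blocks);
    return 0;
}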
 
If the OS is loaded, a memory tester is not testing memory. Instead, it is testing virtual memory.
 
I get that it's protected mode, that the term now encompasses pages that have been put away onto the pagefile, and that you only see the symptoms but can't pinpoint the broken part because, for all you know, the virtual<->physical translation itself might be broken. I'm not arguing with that. I realize this is even somewhat beneficial from a security point of view, if the translation uses randomly chosen physical areas.
However, programs like RamMap from Sysinternals do show the actual physical addresses that are occupied by, as in my example, a ramdrive area.
And, yeah, you are still looking through a layer of abstraction that might very well be obscuring the actual hardware fault. But it's not all bad.
I may come across as ignorant, but oddly enough even this broken methodology uncovered more RAM-related instability issues than an actual bare-metal real-mode tester did. Why do you think that is? Just my luck? Not arguing, just asking.
 
I don't think you're coming across as ignorant.

Do any modern memory testers run in real mode? Wouldn't they be limited to the real-mode address space, which is one megabyte? (It's been since, uh, the 80286? that I played with any of that, so ...) My understanding is that memtest86 starts, gets the machine into protected mode, and then starts poking at memory.

To know why different tests produce different results for the same faults, we need to understand how the tests are different. Testing memory isn't a standardized activity, so we can't think of it only as "testing memory"; we have to concern ourselves with the details of what each tool is doing. The differences are probably quite large: memtest86 is very minimal. When Windows or Linux load, they do tons of initialization, run drivers, and so on. They also have schedulers, which move the testing software from core to core or even socket to socket, and which let it go idle every once in a while and run off to service other things. They turn on modes that memtest86 probably doesn't support: turbo, throttling, and so on. We also don't know what read/write pattern, and what specific bit pattern, each program uses. (HCI doesn't say, but I think Memtest documents it and lets the user pick and choose, IIRC.)
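Just to illustrate what "read/write pattern" means, here is a minimal sketch of one classic family, moving inversions: write a pattern ascending, verify and flip descending, then verify the complement. I'm not claiming this is what either product actually runs.

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One "moving inversions" pass over a buffer with a given pattern. */
static int moving_inversions(uint64_t *buf, size_t words, uint64_t pattern)
{
    int errors = 0;

    for (size_t i = 0; i < words; i++)      /* pass 1: write pattern, ascending */
        buf[i] = pattern;

    for (size_t i = words; i-- > 0; ) {     /* pass 2: verify, write complement, descending */
        if (buf[i] != pattern) errors++;
        buf[i] = ~pattern;
    }

    for (size_t i = 0; i < words; i++)      /* pass 3: verify the complement, ascending */
        if (buf[i] != ~pattern) errors++;

    return errors;
}

int main(void)
{
    size_t words = (size_t)256 * 1024 * 1024 / sizeof(uint64_t);  /* 256 MB */
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) return 1;

    int errors = 0;
    errors += moving_inversions(buf, words, 0x0000000000000000ULL);
    errors += moving_inversions(buf, words, 0x5555555555555555ULL);
    errors += moving_inversions(buf, words, 0xAAAAAAAAAAAAAAAAULL);

    printf("%d mismatches\n", errors);
    free(buf);
    return 0;
}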

We also don't know what "instability issues" were present in the environment you were testing, or whether the different results are specific to your situation(s) or apply to all situations.

Figuring out the differences is (to me) academically interesting. Without knowing the specific differences between the tests, and without understanding common failure modes, it's hard to make any complete claim about which might be "better". One difference we do know: one runs under Windows and only allows testing 2047 megabytes at a time, while the other runs bare metal and tests the whole address space.
 
memtest86 says it can disable the memory cache; HCI apparently works around the cache instead by using deliberately non-local address sequences.
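Something in this spirit, maybe; this is my guess at the general idea, not HCI's actual algorithm. Hop through the buffer with a stride that's coprime to the word count, so every word is still touched exactly once per pass but consecutive accesses land far apart and mostly miss the cache:

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t words = (size_t)256 * 1024 * 1024 / sizeof(uint64_t);  /* 256 MB */
    const size_t stride = 65537;   /* prime, so coprime to the power-of-two count */
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (!buf) return 1;

    size_t idx = 0;
    for (size_t n = 0; n < words; n++) {    /* write every word once, out of order */
        buf[idx] = (uint64_t)idx * 0x9E3779B97F4A7C15ULL;  /* cheap per-address pattern */
        idx = (idx + stride) % words;
    }

    int errors = 0;                          /* idx has wrapped back to 0 here */
    for (size_t n = 0; n < words; n++) {     /* read back in the same hop order */
        if (buf[idx] != (uint64_t)idx * 0x9E3779B97F4A7C15ULL) errors++;
        idx = (idx + stride) % words;
    }

    printf("%d mismatches\n", errors);
    free(buf);
    return 0;
}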

Unsurprisingly, dedicated memory module testers also exist, although these won't catch interop failures specific to the memory controller and board the modules will actually be used with.
 
Sorry, I thought "real mode" implied no virtual memory, regardless of the amount of memory.
So where does memtest place itself? I see there's a "reserved" amount given on screen, around half a megabyte if I remember correctly. If it doesn't fit into cache, where else can it reside other than RAM?
How about things like the reserved "apertures" you can set in CMOS setup for the VGA chip or other onboard devices? Or a shadowed copy of the computer's BIOS? Memtest can't write to those areas even when booted without an OS, can it?

As for my observation: I have a small set of samples, so it's not a reliable observation, but I'd risk guessing that in "normal" operation additional variables come into play. Things like clock modulation kick in for energy-saving purposes, and power sags occur during sudden changes to the power supply's load (like hard drives waking up while you're playing a 3D game and a write/read/compare tester is working in the background). You yourself mentioned some of this.
I guess temperature changes, power events, or clock modulation are more likely to push a failing module's circuits completely out of spec.

The testing machine in question was otherwise unchanged, other than trying out various memory modules (DDR2). Modules of the same make and model, sent to me through an RMA procedure, made the computer stable.

After that debacle I run bootable memtest AND a thread or two of Prime95 in "blend" mode, while a hard drive write/read/compare also runs, and I'll throw a looping graphics benchmark into the mix, too.
 