4P BA Rig started failing all work units

Linden · Feb 21, 2014

This morning before leaving for work, HFM.net showed that one of my BA/4P boxes was not actively folding. I checked the client log, which showed 69 frames completed successfully, then a client-core communications error.

Code:

[15:37:29] Completed 172500 out of 250000 steps  (69%)
[15:38:27] CoreStatus = 8B (139)
[15:38:27] Client-core communications error: ERROR 0x8b
[15:38:27] Deleting current work unit & continuing...

This machine has been rock stable for months, so I suspected a bad work unit and deleted some files (the work folder, machinedependent.dat, and queue.dat), then restarted FAH.
Same problem.

Code:

[16:37:58] Project: 8101 (Run 6, Clone 10, Gen 373)
[16:37:58] 
[16:37:58] Assembly optimizations on if available.
[16:37:58] Entering M.D.
[16:38:05] Mapping NT from 48 to 48 
[16:38:07] CoreStatus = 86 (134)
[16:38:07] Client-core communications error: ERROR 0x86
[16:38:07] - Attempting to download new core...

I lowered the clock from the OC of refclock 220 to the default, refclock 200. No difference: work units still failed immediately.

Code:

[18:43:40] Project: 8104 (Run 0, Clone 26, Gen 376)
[18:43:40] 
[18:43:40] Entering M.D.
[18:43:47] Mapping NT from 48 to 48 
[18:43:50] CoreStatus = 8B (139)
[18:43:50] Client-core communications error: ERROR 0x8b
[18:43:50] Deleting current work unit & continuing...

I also cleaned out the FAH folder completely, and pasted in a saved, known good copy. It failed also, in the same way as above.

I've never had this happen before on any of my machines, that is, work unit failures at default clock. The first failure occurred seemingly out of the blue.

Troubleshooting - where should I start?
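
For reference when reading the logs above: the two CoreStatus values are hexadecimal, with the decimal form in parentheses (0x8B = 139, 0x86 = 134). A minimal Python sketch of decoding them follows. It assumes, and this is an assumption rather than anything the client log states, that the core's exit status follows the common Unix shell convention of 128 + signal number with Linux signal numbering; under that reading 139 would correspond to SIGSEGV and 134 to SIGABRT. The function name and the small signal table are illustrative only.

```python
import re

# The three CoreStatus lines quoted in the post above.
LOG = """\
[15:38:27] CoreStatus = 8B (139)
[16:38:07] CoreStatus = 86 (134)
[18:43:50] CoreStatus = 8B (139)
"""

# Assumption: Linux signal numbers, and "status = 128 + signal" as used by shells.
LINUX_SIGNALS = {6: "SIGABRT", 11: "SIGSEGV"}

CORE_STATUS = re.compile(r"\[(\d\d:\d\d:\d\d)\] CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def decode_core_status(log_text):
    """Return (time, decimal status, guessed signal name or None) per CoreStatus line."""
    results = []
    for time, hex_code, _dec_code in CORE_STATUS.findall(log_text):
        code = int(hex_code, 16)
        # Only statuses above 128 are interpreted as "killed by signal".
        signal_name = LINUX_SIGNALS.get(code - 128) if code > 128 else None
        results.append((time, code, signal_name))
    return results


for time, code, signal_name in decode_core_status(LOG):
    print(time, code, signal_name)
```

If the assumption holds, both failure types point at the core process crashing rather than at a particular work unit, which fits with the failures persisting across different projects.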

sbinh · Feb 21, 2014

Might want to run memtest

Grandpa_01 · Feb 21, 2014

I would start with a reboot and set the BIOS to optimised defaults. Boot back into the OS, reset the OC, and shut down. Then restart, start folding again, and see if the problem has gone away. I have run into the same scenario before (several times) after a power surge or sudden power loss, and that is how I was able to fix the issue. It seems to be a sporadic problem, but all of my SM G34 boards have experienced it at one time or another. So maybe it will help.

Linden · Feb 21, 2014

Grandpa, thanks. I will follow your guidance to the letter when I'm home tonight.
Sbinh, your advice is good as well, but I'll try Grandpa's recommendation first.

Grandpa, your thinking is along the same lines as mine. The subject machine has been rock stable for months, including during times of much higher ambient temperatures. Then all of a sudden, with no hardware changes, out of the blue, work units start failing at OC and default clocks. I was not monitoring HT retries at the time of failure, but before and after, HT retries are/were perfect.

BTW, the subject machine is '2.' in my signature.

Grandpa_01 · Feb 21, 2014

Yeah, I have had the same type of thing happen several times before, mostly due to my own self-caused power outages, but occasionally from power surges. The first few times it happened from surges I was baffled, because I did not know there had been surges, and it took me a few days to get the rigs running right again. Now one of the first things I do when any of the rigs have problems is reset everything to optimal defaults and start over; that usually fixes any problems. I believe there is a good possibility the same has happened in your case.

Patriot · Feb 21, 2014

That's interesting.... Have you tried installing BOINC to see if that solves the problem?

tear · Feb 21, 2014

sbinh said:
Might want to run memtest

^^ +1

Check idle temps as well to see if anything stands out (sudo tpc -temp).

Linden · Feb 21, 2014

I'm still at work. When I have some kind of results, positive or negative, I'll report back.

jojo69 · Feb 21, 2014

BOINC?

Linden · Feb 21, 2014

Grandpa, your advice was spot on. Following the steps you provided, the system resumed folding perfectly.

To you gentlemen suggesting BOINC: I respect your choice of distributed computing and salute you in your quest to improve the lot of humankind. But with that said, I will be remaining with FAH until at least the end of Big Advanced project work unit distribution in early 2015. I will re-evaluate at that time. Rest assured though, that in the interim, I am learning about BOINC DC projects.

Grandpa_01 · Feb 22, 2014

Good to hear, happy I was able to help. Keep it in your notes for future reference.

tear · Feb 23, 2014

Iiiinteresting.

Linden, next time it happens, can you ping me?
Given that re-application of OC helped, now I know where to look....

For starters, this would shed add'l light:

Code:

sudo modprobe nvram
sudo hexdump -vC /dev/nvram | pastebinit

Thanks much,
tear
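
One way such a dump could be put to use, assuming a second dump is captured while the machine is folding normally, is a byte-by-byte comparison of the two. The Python sketch below is only a guess at that workflow (the post does not say how the dump would be analysed); it parses the text format produced by `hexdump -vC` and lists the offsets whose values differ. The function names and the two sample dumps are made up for illustration and are not real NVRAM contents.

```python
def parse_hexdump(text):
    """Parse `hexdump -vC` output into a dict mapping byte offset -> value."""
    data = {}
    for line in text.splitlines():
        # Drop the |ascii| column, then split the offset and the hex bytes.
        tokens = line.split("|", 1)[0].split()
        if len(tokens) < 2:
            continue  # blank line, or the final line that holds only the length
        offset = int(tokens[0], 16)
        for i, token in enumerate(tokens[1:]):
            data[offset + i] = int(token, 16)
    return data


def diff_dumps(before_text, after_text):
    """Return (offset, before, after) for every byte that differs."""
    before, after = parse_hexdump(before_text), parse_hexdump(after_text)
    return [
        (offset, before.get(offset), after.get(offset))
        for offset in sorted(set(before) | set(after))
        if before.get(offset) != after.get(offset)
    ]


# Hypothetical sample dumps, for illustration only.
BEFORE = (
    "00000000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f  |................|\n"
    "00000010\n"
)
AFTER = (
    "00000000  00 01 02 03 04 ff 06 07  08 09 0a 0b 0c 0d 80 0f  |................|\n"
    "00000010\n"
)

for offset, old, new in diff_dumps(BEFORE, AFTER):
    print(f"offset 0x{offset:02x}: 0x{old:02x} -> 0x{new:02x}")
```

The `-v` flag in the command above matters for this kind of comparison, since without it hexdump collapses repeated lines into a `*`.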

Linden · Feb 23, 2014

Tear, I sent you a PM.

tear · Feb 24, 2014

Responded

MaddMutt · Mar 10, 2014

I have a dumb question for you guys. I have a SM H8QME-2 running 4 x 8439se @ 2.8 and I just got a SM H8DGI-F running 2 x 6276s @ 2.6. Is there a way for me to get more PPD out of the new system? I know that it's only 8 more cores and 200MHz slower, but I think it should be faster than my older setup.

My choice of OS is CentOS 6.5

Thank You For Your Time

Patriot · Mar 10, 2014

MaddMutt said:
I have a dumb question for you guys. I have a SM H8QME-2 running 4 x 8439se @ 2.8 and I just got a SM H8DGI-F running 2 x 6276s @ 2.6. Is there a way for me to get more PPD out of the new system? I know that it's only 8 more cores and 200MHz slower, but I think it should be faster than my older setup.
My choice of OS is CentOS 6.5

Thank You For Your Time

How much slower is it?
Despite DDR2, Socket F isn't much slower than G34 core for core, just more power hungry.
And Interlagos dropped per-core performance considerably while improving power savings.
16c IL = 12c MC. IL uses 1 double-wide FP unit per 2 Int units, which in theory should work well but in practice doesn't.
The system you replaced has 24c w/ 24int and 24FP.
The new system has 32c w/ 32 int and 16FP...and is lower clocked.

How much slower is it and why are you still folding?
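
The core and FP-unit counts above can be turned into a rough back-of-the-envelope comparison. The Python sketch below multiplies FP units by clock speed, using only the figures from the posts (24 FP units at 2.8 GHz versus 16 at 2.6 GHz). Treating "FP units x GHz" as a stand-in for folding throughput is a simplifying assumption for illustration: it ignores the double-wide FP units, memory speed, and everything else. The names are hypothetical.

```python
def fp_unit_ghz(fp_units, clock_ghz):
    """Crude throughput proxy: number of FP units times clock speed in GHz."""
    return fp_units * clock_ghz


# Figures taken from the posts in this thread.
OLD_4P = {"cores": 24, "fp_units": 24, "clock_ghz": 2.8}  # 4 x 8439se, Socket F
NEW_2P = {"cores": 32, "fp_units": 16, "clock_ghz": 2.6}  # 2 x 6276, G34 Interlagos

old_score = fp_unit_ghz(OLD_4P["fp_units"], OLD_4P["clock_ghz"])
new_score = fp_unit_ghz(NEW_2P["fp_units"], NEW_2P["clock_ghz"])
ratio = new_score / old_score

print(f"old 4P: {old_score:.1f} FP-unit-GHz")
print(f"new 2P: {new_score:.1f} FP-unit-GHz")
print(f"new / old: {ratio:.2f}")
```

By this crude measure the new 2P has roughly 62% of the old 4P's FP-unit-GHz, despite having 8 more cores.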

Spazturtle · Mar 10, 2014

Just wait for the new bigadv core they promised us that would use the new version of GROMACS that supports all the features of Bulldozer, thus improving performance.

Oh wait...

MaddMutt · Mar 10, 2014

Patriot said:
How much slower is it?
Despite DDR2, Socket F isn't much slower than G34 core for core, just more power hungry.
And Interlagos dropped per-core performance considerably while improving power savings.
16c IL = 12c MC. IL uses 1 double-wide FP unit per 2 Int units, which in theory should work well but in practice doesn't.
The system you replaced has 24c w/ 24int and 24FP.
The new system has 32c w/ 32 int and 16FP...and is lower clocked.

How much slower is it and why are you still folding?

The Socket F is running DDR2-800 and the G34 is running DDR3-1333, but they will run @ 1600.
How much slower is it? It is anywhere from 1min to 6min slower. I can pull 11:07min on a 8103, but the G34 will be stuck at 12:02min, 16:24min, or 18:36min. Do I need to throw out the ILs and pick up 2 x MCs?
I didn't replace the Socket F system, I now have two folding rigs.
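
The time-per-frame figures quoted above can be compared directly. The Python sketch below converts each mm:ss value to seconds and reports how far each G34 time falls behind the 11:07 Socket F time. It assumes all four times are for comparable work units, which the post only states for the 8103 figure; the function names are illustrative.

```python
def tpf_seconds(tpf):
    """Convert a 'mm:ss' time-per-frame string to seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(base_tpf, other_tpf):
    """Return (extra seconds per frame, percent slower) of other vs. base."""
    base, other = tpf_seconds(base_tpf), tpf_seconds(other_tpf)
    extra = other - base
    return extra, extra / base * 100


SOCKET_F_TPF = "11:07"
G34_TPFS = ["12:02", "16:24", "18:36"]

for tpf in G34_TPFS:
    extra, percent = slowdown(SOCKET_F_TPF, tpf)
    print(f"{tpf}: +{extra // 60}:{extra % 60:02d} per frame ({percent:.0f}% slower)")
```

On these figures the gaps are 0:55, 5:17, and 7:29 per frame.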

I fold because I found out at the end of AUG 2012 that I have Leukemia. I'm 45yrs old and medically retired from the Marine Corps. I did 4 tours of combat in Iraq and medically, they will not tell me of all the stuff I got exposed to.

Patriot · Mar 10, 2014

Are you using The Kraken? If DLB is not engaging on one and it is on the other, then the times are not very comparable. I would not recommend spending more money on a 2P setup or on older chips.

You can compare your TPF data to historical records to see if your 2P is just under-performing. Do you have 8 DIMMs for the IL 2P?
Said historical data:
https://docs.google.com/spreadsheet...dHdTdUdmUjhWalpXWVZ2S2xvejBDcHc&usp=drive_web

I don't recommend spending more money on f@h ... but towards other DC projects that also do bio med research.
A lot of progress in cancer treatments has happened this year. I hope you can benefit from them.

MaddMutt · Mar 10, 2014

I'm using thekraken, but on the ILs it's not kicking in like on the older system. The mobo has 8 DIMM slots per CPU. I only installed 8 sticks to start out with; I just got the other 8 today to fill all 16 slots. Hopefully this will help.
