4P BA Rig started failing all work units

Linden

This morning before leaving for work, HFM.net revealed that one of my BA/4P boxes was not actively Folding. I checked the client log, which showed 69 frames completed successfully, then a client-core communications error.
Code:
[15:37:29] Completed 172500 out of 250000 steps  (69%)
[15:38:27] CoreStatus = 8B (139)
[15:38:27] Client-core communications error: ERROR 0x8b
[15:38:27] Deleting current work unit & continuing...
This machine has been rock stable for months, so I suspected a bad work unit and deleted some files - Work, machineindependent.dat, queue.dat. Restarted FAH.
Same problem.
Code:
[16:37:58] Project: 8101 (Run 6, Clone 10, Gen 373)
[16:37:58] 
[16:37:58] Assembly optimizations on if available.
[16:37:58] Entering M.D.
[16:38:05] Mapping NT from 48 to 48 
[16:38:07] CoreStatus = 86 (134)
[16:38:07] Client-core communications error: ERROR 0x86
[16:38:07] - Attempting to download new core...
I lowered the clock from the OC refclock of 220 to the default refclock of 200. No difference: work units still failed immediately.
Code:
[18:43:40] Project: 8104 (Run 0, Clone 26, Gen 376)
[18:43:40] 
[18:43:40] Entering M.D.
[18:43:47] Mapping NT from 48 to 48 
[18:43:50] CoreStatus = 8B (139)
[18:43:50] Client-core communications error: ERROR 0x8b
[18:43:50] Deleting current work unit & continuing...
I also cleaned out the FAH folder completely, and pasted in a saved, known good copy. It failed also, in the same way as above.
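
A side note on the CoreStatus values in the logs above: the number in parentheses is just the hex code in decimal (0x8B = 139, 0x86 = 134). On Linux, a shell reports a process killed by signal N as exit status 128 + N, and signals 11 and 6 are SIGSEGV and SIGABRT. Whether the FAH client's CoreStatus follows that convention is an assumption here, not something the logs confirm. A minimal sketch of the arithmetic, with `decode_core_status` as a hypothetical helper name:

```shell
# Hypothetical helper: convert a hex CoreStatus (e.g. 8B) to decimal and,
# ASSUMING the shell-style 128+N convention applies, show the number N.
decode_core_status() {
    dec=$(printf '%d' "0x$1")          # hex -> decimal
    if [ "$dec" -gt 128 ]; then
        echo "0x$1 = $dec (128 + $((dec - 128)))"
    else
        echo "0x$1 = $dec"
    fi
}

decode_core_status 8B
decode_core_status 86
```

If the assumption holds, both failures would be the core crashing outright rather than rejecting a bad work unit, which would fit failures that survive a fresh FAH folder.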

I've never had this happen before on any of my machines, that is, work unit failures at default clock. The first failure occurred seemingly out of the blue.

Troubleshooting - where should I start?
 
I would start with a reboot and set the BIOS to optimized defaults. Boot back into the OS, reset the OC, and shut down. Then restart, start folding again, and see if the problem has gone away. I have run into the same scenario before (several times) after a power surge or sudden power loss, and that is how I was able to fix the issue. It seems to be a sporadic problem, but all of my SM G34 boards have experienced it at one time or another. So maybe it will help.
 
Grandpa, thanks. I will follow your guidance to the letter when I'm home tonight.
Sbinh, your advice is good as well, but I'll try Grandpa's recommendation first.

Grandpa, your thinking is along the same lines as mine. The subject machine has been rock stable for months, including during times of much higher ambient temperatures. Then all of a sudden, with no hardware changes, out of the blue, work units start failing at OC and default clocks. I was not monitoring HT retries at the time of failure, but before and after, HT retries are/were perfect.

BTW, the subject machine is '2.' in my signature.
 
Yeah, I have had the same type of thing happen several times before, mostly due to my own self-caused power outages, but occasionally from power surges. The first few times it happened from surges I was baffled, because I did not know there had been surges, and it took me a few days to get the rigs running right again. Now one of the first things I do when any of the rigs have problems is reset everything to optimal defaults and start over; that usually fixes any problems. I believe there is a good possibility the same has happened in your case.
 
That's interesting... Have you tried installing BOINC to see if that solves the problem?
 
I'm still at work. When I have some kind of results, positive or negative, I'll report back. :)
 
Grandpa, your advice was spot on. Following the steps you provided, the system resumed folding perfectly.

To you gentlemen suggesting BOINC: I respect your choice of distributed computing and salute you in your quest to improve the lot of humankind. But with that said, I will be remaining with FAH until at least the end of Big Advanced project work unit distribution in early 2015. I will re-evaluate at that time. Rest assured though, that in the interim, I am learning about BOINC DC projects.
 
Good to hear. Happy I was able to help; keep it in your notes for future reference.
 
Iiiinteresting.

Linden, next time it happens, can you ping me?
Given that re-application of OC helped, now I know where to look....

For starters, this would shed add'l light:
Code:
sudo modprobe nvram
sudo hexdump -vC /dev/nvram | pastebinit

Thanks much,
tear
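
Along the same lines, a snapshot of NVRAM taken while the machine is folding correctly gives something to compare against the next time it fails. A minimal sketch, assuming the `nvram` module exposes `/dev/nvram` as in the commands above; `nvram_snapshot` and `nvram_compare` are hypothetical helper names, and `od` is swapped in for `hexdump` only because it is available almost everywhere:

```shell
# Hypothetical helper: nvram_snapshot OUTFILE [DEVICE]
# Writes a hex dump of DEVICE (default /dev/nvram) to OUTFILE.
# Reading /dev/nvram normally needs root and a loaded nvram module.
nvram_snapshot() {
    dev="${2:-/dev/nvram}"
    od -A x -t x1 -v "$dev" > "$1"
}

# Hypothetical helper: nvram_compare GOOD.hex NOW.hex
# Prints "unchanged", or "changed" followed by the differing dump lines.
nvram_compare() {
    if cmp -s "$1" "$2"; then
        echo "unchanged"
    else
        echo "changed"
        diff "$1" "$2" || true   # diff exits 1 when files differ; not an error here
    fi
}

# Usage, as root:
#   nvram_snapshot nvram-good.hex      # while folding is healthy
#   nvram_snapshot nvram-now.hex       # after work units start failing
#   nvram_compare nvram-good.hex nvram-now.hex
```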
 
I have a dumb question for you guys. I have a SM H8QME-2 running 4 x 8439 SE @ 2.8 and I just got a SM H8DGI-F running 2 x 6276s @ 2.6. Is there a way for me to get more PPD out of the new system? I know that it only has 8 more cores and is 200MHz slower, but I think it should be faster than my older setup :(
My choice of OS is CentOS 6.5

Thank You For Your Time
 
How much slower is it?
Despite DDR2, Socket F isn't much slower than G34 core for core, just more power hungry.
And Interlagos dropped per-core performance considerably while improving power savings.
16c IL = 12c MC. IL uses 1 double-wide FP unit per 2 int units, which in theory should work well but in practice doesn't.
The system you replaced has 24c w/ 24 int and 24 FP.
The new system has 32c w/ 32 int and 16 FP, and is lower clocked.
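
To put rough numbers on the comparison above, here is a back-of-the-envelope sketch using only the figures in this post: the 16c IL = 12c MC rule of thumb, the 2.6 vs 2.8 clocks, and the assumption (from the "isn't much slower core/core" remark) that a Socket F core counts about the same as an MC core. It is a crude linear estimate, not a PPD prediction; `mc_equiv` is a hypothetical helper name.

```shell
# mc_equiv IL_CORES IL_GHZ REF_GHZ
# Converts Interlagos cores to "MC-equivalent cores at REF_GHZ" using the
# 16c IL = 12c MC rule of thumb and simple linear clock scaling (assumption).
mc_equiv() {
    awk -v c="$1" -v f="$2" -v ref="$3" \
        'BEGIN { printf "%.2f\n", c * 12 / 16 * f / ref }'
}

mc_equiv 32 2.6 2.8   # 2 x 6276 expressed in 2.8GHz MC-equivalent cores
```

By this estimate the 32c box lands at roughly 22 MC-equivalent cores against the 24 real cores of the 4P at 2.8, so a modest deficit would not be surprising.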

How much slower is it and why are you still folding?
 
Just wait for the new bigadv core they promised us that would use the new version of GROMACS that supports all the features of bulldozer thus improving performance.


Oh wait...
 
The Socket F is running DDR2-800 and the G34 is on DDR3-1333, but they will run @ 1600.
How much slower is it? It is anywhere from 1min to 6min slower. I can pull 11:07min on an 8103, but the G34 will be stuck at 12:02min, 16:24min, or 18:36min. Do I need to throw out the ILs and pick up 2 x MCs?
I didn't replace the Socket F system, I now have two folding rigs.
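
For anyone wanting to compare these frame times more precisely, a small sketch that converts mm:ss values to seconds and shows each G34 time relative to the 11:07 baseline. It assumes all the times were measured on the same project, which the post does not state; `tpf_seconds` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a mm:ss time per frame to seconds.
# awk does the split so values like "08" are not misread as octal.
tpf_seconds() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

base=$(tpf_seconds 11:07)              # Socket F 4P on project 8103
for t in 12:02 16:24 18:36; do         # G34 2P times quoted above
    s=$(tpf_seconds "$t")
    awk -v t="$t" -v s="$s" -v b="$base" \
        'BEGIN { printf "%s: +%ds (%.0f%% slower)\n", t, s - b, (s - b) * 100 / b }'
done
```

A spread that wide on one machine suggests repeating the comparison on a single project before drawing conclusions about the hardware.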

I fold because I found out at the end of AUG 2012 that I have Leukemia. I'm 45yrs old and medically retired from the Marine Corps. I did 4 tours of combat in Iraq and medically, they will not tell me of all the stuff I got exposed to.
 
Are you using The Kraken? If DLB is not engaging on one and it is on the other, then the times are not very comparable. I would not recommend spending more money on a 2p setup or on older chips.

You can compare your tpf data to historical records to see if your 2p is just under-performing. Do you have 8 dimms for the IL 2p?
said historical data:
https://docs.google.com/spreadsheet...dHdTdUdmUjhWalpXWVZ2S2xvejBDcHc&usp=drive_web


I don't recommend spending more money on f@h ... but towards other DC projects that also do bio med research.
A lot of progress in cancer treatments has happened this year. I hope you can benefit from them.
 
I'm using thekraken, but on the ILs it's not kicking in like on the older system. The mobo has 8 dimms per cpu. I only installed 8 sticks to start out with; I just got the other 8 today to fill all 16 dimms. Hopefully this will help.
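
One way to double-check DIMM population from the OS is `dmidecode`, which on most Linux systems lists each memory slot with either a size or "No Module Installed". Socket G34 has four memory channels per CPU, so a 2P board has eight channels to fill. The sketch below counts populated slots from `dmidecode -t 17` output; `count_populated_dimms` is a hypothetical helper name, and the exact output format can vary by dmidecode version and BIOS.

```shell
# Hypothetical helper: read `dmidecode -t 17` output on stdin and count
# memory slots that report a size rather than "No Module Installed".
count_populated_dimms() {
    grep -E '^[[:space:]]*Size:' | grep -vc 'No Module Installed'
}

# Usage (needs root):
#   sudo dmidecode -t 17 | count_populated_dimms
```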
 