Halting my GPU folding!

Wow, I didn't expect all these replies. Seems I sparked a pretty good debate!

SiriusB, have you lowered your GPU clocks to stock settings and restarted the machine? I've read about people having problems with particular units that magically vanish after a reboot.

The GPU fails on stock settings and using folding clock settings [low core/mem, higher shader clock]. The cards are not unstable at these settings. Prior to changing OS I ran both cards full time for weeks with no errors using those settings.

If I were to speculate, I'd say a component (one without a temp sensor) on SiriusB's cards is too hot now that the weather (ambient temperature) is warmer, and this is causing instability with a WU that stresses one part of his GPUs more than the other WUs do.

*For example, how hot does the graphics card's power regulation hardware get?

If the GPUs failed during the WU run I would be inclined to agree. However, I can restart my machine, pick up a "bad" WU and it will fail. Moreover, the WU doesn't DO anything. The GPU temps as monitored by GPU-Z barely even twitch from their idle levels. Regardless of temps, the WUs will fail.

Ok, so the next thing to consider is that Stanford only lists XP/2003/Vista/7 as compatible operating systems. As I said above, I've read on the folding forum about people having problems with particular projects that are solved with a reboot.

SBS 2008 is the only major change. The only hardware change is the PSU, which is a significant upgrade considering the last one, which as you know blew up. There was a brief thought that the PSU blowing did some minor damage somewhere, but I had errors before changing the PSU, so I have ruled that out.

I am reluctant to believe that errors of this kind are down to me to solve on my own, especially when it is only certain Work Units. Only Stanford knows the ins and outs of their clients and Work Units. I cannot be expected to troubleshoot an error when I am given absolutely no meaningful information. Presumably Stanford knows exactly what went wrong, but that data isn't reaching the folding forum mods or anyone else.

Perhaps there is a conflict between SBS 2008 and the CUDA drivers that only manifests itself with certain WUs. I am told the problem projects are highly optimised; perhaps different parts of CUDA are being used, which triggers the issue. Who knows? Maybe it is a bug/feature in SBS 2008, since it was not really designed for running high-end graphics. Again, who knows? Stanford certainly isn't saying.
 
I have to agree with you on one point - it's not your job to diagnose the problem. You're a volunteer, just like the rest of us. You can volunteer to try and diagnose the problem if you wish, and you have. Most people if they had such problems wouldn't have a clue where to start (though in this case most people probably wouldn't be running the CUDA-optimised versions in the first place).

I'm also of the opinion that some work is better than none, and rejecting unsuitable work is fine so long as it does not adversely impact the project. All distributed computing projects must be fault tolerant, since it is a given that some of the computations will fail. Folding@home is no exception. Indeed, the optimised SETI@home GPU apps also reject unsuitable work.
 
This is the huge flaw in the Stanford clients, I feel. They will let you grab the same WU over and over, then pause your client for 24 hours. Then, if you are really unlucky, you will get the same WU when it resumes; cue five more failures and another 24 hours.

Going by bruce's replies to me it would seem each WU gets assigned to around 10 people, perhaps more, so it makes no sense to force a WU on one person when I constantly return it as an EUE. I don't think we should be able to manually refuse WUs, as that would open the door for people to block WUs with bad PPD, but if, as in my case, a particular project gets an EUE over and over, then the server should go "hmm, perhaps not". In any case it is bad for the science if one person can't get away from WUs they can't crunch! :p
 
The WU assignment logic really needs updating/improving. The servers should be able to tell the difference between a machine which is consistently unstable, and a machine which folds solidly for months until it hits a certain WU. Even if that wasn't possible (which it is), it would surely be easy not to issue a WU to the same person for a second time if it comes back as UNSTABLE_MACHINE. There's no point giving you five 'tries' if you just get the same WU and therefore the same circumstances twice. Doing the same thing and expecting a different result is the definition of insanity :p
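The reissue rule being argued for above could be sketched in a few lines. This is a rough illustration only; the class, names and the threshold of five are my assumptions, not Stanford's actual assignment-server code:

```python
# Hypothetical sketch of the suggested assignment rule: never hand a work
# unit back to a client that has already returned it as an EUE, and treat
# "EUEs on many different WUs" (genuinely unstable machine) differently
# from "many EUEs on one WU" (a bad WU/client combination).

from collections import defaultdict


class AssignmentTracker:
    """Illustrative server-side bookkeeping, not Stanford's real code."""

    def __init__(self):
        self.failed = defaultdict(set)  # client_id -> set of WU ids returned as EUE

    def record_eue(self, client_id, wu_id):
        self.failed[client_id].add(wu_id)

    def eligible(self, client_id, wu_id):
        # Don't reissue a WU to a client that already EUE'd on it.
        return wu_id not in self.failed[client_id]

    def looks_unstable(self, client_id, threshold=5):
        # A client failing many *different* WUs is probably unstable;
        # one failing a single WU repeatedly probably is not.
        return len(self.failed[client_id]) >= threshold


tracker = AssignmentTracker()
tracker.record_eue("SiriusB", "p5900_r1")       # placeholder project/run id
print(tracker.eligible("SiriusB", "p5900_r1"))  # False - send it elsewhere
print(tracker.eligible("SiriusB", "p6000_r2"))  # True  - a different WU is fine
print(tracker.looks_unstable("SiriusB"))        # False - only one distinct WU failed
```

With bookkeeping like this, the "five tries then 24 hours" penalty would only ever apply to a machine failing varied work, not to one machine stuck on one bad project.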

That said, according to HFM, the client still detects my quad-core as 'Pentium II/III', so hoping for any more sophisticated changes is probably too optimistic...!
 
I'm getting a bit fed up as well of going away for a weekend or something and having the unstable machine thing happen. My machine is stable for months, but once you get that unstable machine message it just doesn't clear by itself and I need to restart the machine. It definitely needs improving so it can properly reset itself and get a new work unit; as said above, constantly repeating the same scenario and expecting a different outcome is a bit mad.

Fair enough to shut a machine down for 24 hrs if it is unstable on multiple work units, but this just seems to get locked and requires a system restart.
 
Oddly enough, I had an EUE today :rolleyes: Deleting the queue, changing the machine ID and dropping clocks to minimum didn't help. A restart, on the other hand, did seem to sort it out :)
 
Unfortunately a restart every time it EUEs is not particularly feasible. SBS 2008 takes a small age to restart - plus it's not much use as a server if it spends 50% of its time restarting!
 
GPU EUEs always require a restart, I find, unless you use some old driver whose version escapes me. What driver are you using?
 
Unfortunately a restart every time it EUEs is not particularly feasible. SBS 2008 takes a small age to restart - plus it's not much use as a server if it spends 50% of its time restarting!

Yep, same here. I'm using WHS, so every time it happens I need to check no one in the house is using the shared drives, then go out to the shed and restart everything. It doesn't take that long, but it's still a PITA :(

GPU EUEs always require a restart, I find, unless you use some old driver whose version escapes me. What driver are you using?

I'm using drivers which are a few months old, probably the latest ones. If anyone knows of a driver that doesn't require a restart, that would be much appreciated.
 
I have always used the latest drivers, so I doubt I have the same issue. My machine has been restarted a couple of times anyway and I don't believe it has helped. However, my machine is currently updating, so perhaps I may get lucky and whatever is wrong gets sorted out. I shan't be holding my breath, though! :D
 
This is the very reason I stopped GPU folding on my server. Just put a dumb card in and fold on the CPU only.
 
Try the 182.xx drivers, SB - they are slower than the newer ones, but they are from before all the other clients were bundled into the drivers (though after CUDA was integrated).
I can see how a server OS and CUDA could have similar code for resource sharing - might be worth a shot.
 
I have fully updated my server this evening and it has been restarted three times. I have started both GPUs folding and they both got a 6xxx WU. Oddly, now I want a "bad" WU to see if the updates plus a few restarts fixed anything! :p

When it fails [optimistic I know!] I will try the 182 drivers, Mr SS.
 
Good news!!!

My GPU is folding again and is not EUEing on those bloody WUs!

It seems the FAH_GPU_IDLE environment variable works. Though why it decided to work after a fifth restart I have no idea! I set this option days ago and restarted the machine at the time, but it didn't work. I did some experimenting this morning, and if I remove the above variable, folding falls over.

The server updating and restarting several times obviously did something to allow the GPUs to work with the GPU_IDLE variable. I will be monitoring my machine closely for the day to see if it hiccups anywhere. If I get a chance I will see if I can make the idle time as close to 0 as possible, minimising the amount of PPD lost.
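For anyone wanting to try the same workaround without setting a system-wide variable and rebooting, the client can be launched with the variable set for that process only. A minimal sketch, assuming the GPU client reads FAH_GPU_IDLE from its environment at startup; the executable path and the value "5" are placeholders, not recommendations:

```python
import os
import subprocess  # needed only for the commented-out launch line below

# Copy the current environment and add the idle setting for this launch only,
# rather than setting it machine-wide and restarting.
env = os.environ.copy()
env["FAH_GPU_IDLE"] = "5"  # illustrative value; tune towards 0 to minimise lost PPD

# Hypothetical path - substitute wherever your GPU client actually lives:
# subprocess.Popen([r"C:\FAH\GPU\Folding@home-gpu.exe"], env=env)

print(env["FAH_GPU_IDLE"])
```

Whether the client honours a per-process variable set this way is untested here; it is simply the usual way to scope an environment variable to one program.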

I still don't understand why my GPUs fail in the first place. This has tempted me to install XP on a spare HDD even sooner - I am now wondering if my motherboard is struggling rather than the GPUs. Though again, I am not entirely convinced of that myself. My P5B Deluxe has handled everything I have thrown at it over the years; it is a bloody good piece of hardware and I find it hard to believe a WU can make it fall over lol
 
How are your clients, Stelly? Judging by the stats your production has dropped too - I had you rated at at least 35K on your own.
 
Yeah, I have been away and got back today, and the missus decided to switch off ALL the folding machines. I'm going home soon to stamp my authority on the wench!

Stelly
 