Halting my GPU folding!

As per the title, I am ceasing my GPU folding. This will put a 15-16K dent in my and Stelly's PPD.

For whatever reason my GPUs absolutely hate projects 5784 and 5786 and unfortunately for me, these seem to make up 80% of the GPU projects out there, judging by how often I get them.

Each time I get one of those WUs it fails 5 times, because the client or server is too bloody thick to realise that 5 EUEs suggest I shouldn't keep getting that WU. To get my clients going again I have to delete all the WU work and queue files and change my machineid in the client config. If I am lucky I only have to do this fewer than 5 times before I get a WU I can work on, but it is common to need to do it 7 or 8 times. With two clients this gets rather tedious, plus I only have 15 possible machine IDs to work with!

I absolutely refuse to believe it is a hardware fault, as all the other GPU WUs run fine, including other 578x WUs. Plus this is affecting two different GTX260s from two different vendors. OC or stock speeds make absolutely no difference; the error still occurs.

I am willing to concede it may be a combination of those WUs, SBS2008 and the hardware, though again the fact that I can do the other WUs makes this less likely in my mind. I posted in the official folding forums and got no really useful advice from one of the moderators. Given they have a thread with page after page of problematic WUs, I would have expected more useful information, even if it was just "yes, there is a known issue with SBS2008" or something. But no, the guy assumes my GPU is overheating or drawing too much power on those two WUs. Yeah, I am sure my GPU is overheating before a WU even starts; as for power, the voltages don't even twitch. The thread is here if anyone is interested: http://foldingforum.org/viewtopic.php?f=19&t=13936&sid=a960f93b5d657b7ea4a8a6710524eeec

I am slightly irked in that thread too! :p

If I get a chance I may try one of my GPUs in my main rig, though it would be a bit of a ballache. If it works it would suggest SBS2008 is to blame, though God only knows why. The latest round of GPU WUs just seems incredibly sensitive, which is a crappy thing to have happen on the most productive platform in the project. Stelly did offer to look after my cards if it came to the point where I couldn't be arsed babysitting them anymore - this may be the way forward, as I do not want two GPUs grinding away in my bedroom! :eek: I would fold inside a small VM on the server, but as I understand it GPU folding via a VM is not possible.

Sorry for the rant gentlemen, this is just something I have been putting up with for a fortnight! Fold on :cool:
 
Hmm, I recently deleted my records of which WUs I've done (kept swapping GPUs and all the speeds got out of kilter and muddled up) but I've not had one of those projects in the last week.
Ordinarily to buck the trend you'd just need to close folding@home, delete the queue.dat file and restart. Occasionally you might need to do that a couple of times (in my experience) before you eventually get a suitable project - are you sure you're not going OTT with a placebo effect? :p

I have a very similar issue with A1 units on the SMP client: it racks up five EUEs, erroring out before the work even starts. Thankfully they seem to have stopped coming through, but I was going to write a quick vbscript to check whether that core had been downloaded and to delete the queue if so. Could you do something similar with this?
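Something along these lines would probably do it - a rough batch sketch rather than VBScript, assuming the SMP client lives in C:\FAHSMP and the A1 core file is named FahCore_a1.exe (substitute your own paths/filenames), and using pv for the delay/close the same way as my other batch files:

Code:
::one-shot check: if the A1 core has been downloaded, wipe the queue
if not exist C:\FAHSMP\FahCore_a1.exe goto end
::close SMP2
pv -c [email protected]
pv -d2000
::bin the queue and the offending core so the client fetches fresh work on restart
del C:\FAHSMP\queue.dat
del C:\FAHSMP\FahCore_a1.exe
pv -d500
::restart SMP2
cd C:\FAHSMP
start C:\FAHSMP\[email protected]
:end
exit

Run it every few minutes from Task Scheduler and it should catch the A1 core before the client gets too far.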


If you do decide to throw in the GPU towel, depending on their dimensions I might be able to shoehorn one or two in my rig ;), I only have three in there at the moment. I reckon I can fit in five with a bit of cajoling!
 
Oftentimes I have to delete the queue and work files and change the machine ID 5 or 6 times or more before it gets a WU I can crunch. The specific clone/gen numbers may change, but any 5784/6 will fail. There are a lot of the 5784s and 6s about, so I have to mess about to get a completely different project number. If I let it EUE 5 times, it's the exact same WU gen/clone every time.

There is no placebo effect. Some WUs work. Some don't. No two ways about it! :p
 
What I mean is, how do you know that changing the machine ID alters the chance of picking up the same WU? If it takes several deletions of queue.dat even with changing the machine ID, how many deletions of queue.dat does it take without you changing the machine ID? :p
Incidentally, make copies of your client.cfg file with different machine IDs if you genuinely do need to change machine ID to get a different WU. Saves you having to go into config each time (you can just swap out the client.cfg file instead, which can be done automatically (example)):

Batch code for closing SMP2 client, deleting client.cfg, replacing with different client.cfg, and restarting SMP2:
Code:
pv -d2000
::close SMP2
pv -c [email protected]
pv -d2000
::del SMP2 config file
del C:\FAHSMP\client.cfg
pv -d500
::copy backup config file configured for 7 logical cores to folding folder
xcopy C:\FAHSMP\Automation\client-smp7.cfg C:\FAHSMP\
pv -d500
::rename config file 
ren C:\FAHSMP\client-smp7.cfg client.cfg
pv -d500
::start SMP2
cd C:\FAHSMP
start C:\FAHSMP\[email protected]

exit

There are a couple of ways you could determine if it is trying to work on a dodgy project. You could monitor GPU temperature, and if it stays below a threshold for, say, more than 10 minutes (depending on how long it normally takes to upload and download a new project), have that trigger a batch file that closes the client, deletes queue.dat and anything else it needs to, and restarts - repeating ad infinitum until the GPU temperature increases.
Or, you could use HFM's email function in some way - you can set it to notify you when a client enters a 24-hour pause state - which I'm sure could be manipulated to trigger a queue deletion. Or perhaps monitor the core files that get downloaded: I notice FahCore_11 and FahCore_14 in my GPU folders. Maybe FahCore_14 is the core used in the dodgy projects, in which case you could delete that core and have a script continually checking for its existence. If it reappears, the script closes the GPU client, clears the queue, deletes the core file, restarts the GPU client, and goes back to watching for the next download. This method isn't too specific - it would almost certainly stop you working on all 57** projects, and possibly others too - but there are many options you can try :)
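For the core-file method, here's a rough batch sketch of the sort of thing I mean - the folder and exe names for the GPU client are placeholders (I've made them up, so point them at your own install), and it uses pv for the delays and process-closing the same way as the batch file above:

Code:
::watch the GPU client folder for the dodgy core reappearing
::C:\FAH-GPU and FAH-GPU-client.exe are placeholders - use your own paths/names
:watch
pv -d60000
if not exist C:\FAH-GPU\FahCore_14.exe goto watch
::dodgy core has been downloaded - close the GPU client
pv -c FAH-GPU-client.exe
pv -d2000
::clear the queue, the work files and the core itself
del C:\FAH-GPU\queue.dat
del /q C:\FAH-GPU\work\*.*
del C:\FAH-GPU\FahCore_14.exe
pv -d500
::restart the GPU client and go back to watching
cd C:\FAH-GPU
start C:\FAH-GPU\FAH-GPU-client.exe
goto watch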
 
I usually edit the config directly; it takes just a second. I have been working on a script to automate it all, but haven't had the time. FahCore_11 is the culprit, I'm afraid - I don't think I have ever EUEd on a project that uses 14.

As for changing the machine ID, I found simply deleting files isn't enough - the server just reassigns me the exact same WU. I know the machine ID change is working because, while I may still get a dodgy WU, the Run/Clone/Gen numbers change - ergo, a different WU. The problem is that the majority of WUs about for the GPU client seem to be the 5784s and 6s! Wouldn't mind if it was just the odd rare WU lol.

I fear that even with automation my GPUs are only going to be folding for a short amount of time per day, which is why it probably isn't worth it. I think I managed about 3 or 4 WUs today. Considering a WU on my GTXs has a frame time of about 60 seconds, I am about 30 WUs down per day.
 
Hmm, I know this isn't going to help, but I've had 5784s and 5786s on my GTX260 without issue. I'm not running SBS2008 though, that's for sure - just Vista 64. Just checked my benchmarks in HFM.NET:

Project ID: 5784
Core: GROGPU2
Credit: 783
Frames: 100

Name: GTX 260 WC
Path: C:\Users\Biffa\AppData\Roaming\Folding@home-gpu\
Number of Frames Observed: 300

Min. Time / Frame : 00:01:06 - 10,250.2 PPD
Avg. Time / Frame : 00:01:20 - 8,456.4 PPD

Project ID: 5786
Core: GROGPU2
Credit: 783
Frames: 100


Name: GTX 260 WC
Path: C:\Users\Biffa\AppData\Roaming\Folding@home-gpu\
Number of Frames Observed: 300

Min. Time / Frame : 00:01:13 - 9,267.3 PPD
Avg. Time / Frame : 00:01:22 - 8,250.1 PPD
 
That's the annoying thing. I know the errors are specific to my machine, I just don't know where the issue is. Server 2008 and Vista share largely the same code-base to my knowledge, so there shouldn't be a problem.

I would absolutely love a specific error to at least point me in the right direction, but "NaNs detected" tells me nothing. Worse still, Stanford admit this could be hardware or the WU itself.

It is such a pain because somewhere between the client, the WU, the OS, the mobo, and GPU an issue is arising. Throw in the fact a lot of WUs work just fine and... well.. go figure.

As said I will be trying one of the GPUs in my Windows 7 machine when I have a chance. Or I may throw a copy of XP on this box on a spare HDD and see what happens. Ball aches all round either way! :p
 
Tell me about it - I've been out trying to get weebeastie up and running, and the CPU on it runs bloomin' hot compared to the one on my main rig. Seems it needs a lot more volts to be stable.
 
Stanford probably aren't too keen on this, but they'll get over it: as all the 578x WUs come from the same server, you could make a firewall rule blocking all communication with http://171.67.108.21. That way, when Stanford tries to issue you a 578x, the client will just fail to connect until the assignment server gives you something else. That might take a little while, but it's better than being down for 24 hours!
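On SBS2008 you could add that from an elevated command prompt using the built-in firewall - something like this rough sketch (the rule name is arbitrary, and it's worth double-checking that IP is still the right work server before blocking it):

Code:
::block all outbound traffic to the 578x work server
netsh advfirewall firewall add rule name="Block FAH 578x server" dir=out action=block remoteip=171.67.108.21
::remove the rule again once the dodgy projects are sorted
netsh advfirewall firewall delete rule name="Block FAH 578x server"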
 
It's odd, I'm running three GPUs and haven't had a 5784/6 yet, though I've had 5785, 5782 and 5781, so they can't be that prevalent - just in your part of the world apparently!

I wonder what would happen if you use VMWare to install XP/7 on top of SBS2008, then run the GPU client from within there? It's a bit of a hassle to set up but means you don't have to start moving hardware around and once it's set up, should be minimal hassle.

Edit: oops just reread OP
 
SiriusB have you lowered your GPU clocks to stock settings and restarted the machine? I've read about people having problems with particular units that magically vanish after a reboot.
 
There are a couple of ways you could determine if it is trying to work on a dodgy project.

There's no such thing as a dodgy project. It has been known for a bad WU (among thousands) to be distributed that will fail to work regardless of the user's PC, but I've never had one in over 7 years of folding. My current and previous folding PCs have both run for months in the past without being modified or rebooted. I've been away from home for a month and the folding PC I've left on unattended is fine. I think it's bad for the science if people start avoiding WUs and putting the blame on Stanford.
 
What I mean by a dodgy project is one that is incompatible with his hardware/software/something, I'm not suggesting it won't run on any system.

However, it cannot be down to the end-user to try to figure out where the incompatibility is. Most folders do this as a side-thing, devoting very little personal time to it and simply letting their computer get on with it. If there are problems, they will simply stop folding, much like Sirius is considering doing. It's not helpful to suggest that problems are due to the end-user's hardware and leave it at that, when the machine completes every other project perfectly fine and doesn't even start working on this project before failing.
What's bad for the science is people no longer folding because Stanford puts the blame on the end-user instead of having some mechanism in place to either figure out what's going wrong, or at the least ensure incompatible projects are not forwarded to those who cannot accept them.
 
I get 5781 and other 578x projects quite often on my 8600M - been causing me grief with these UNSTABLE_MACHINE errors, even though temps are well below limits. So it's not just you (and I doubt it's SBS2008 either).
 
Miniyazz: you make a lot of good points, but I disagree with your use of the word incompatible. The project WILL run on the hardware that SiriusB has (unless he has a faulty component). In my experience it's clear that individual WUs stress systems in different ways, so it can quite easily occur that one type of work unit will fail where other ones don't. The user blames the WU or project, but really it's just that with that user's combination of clock speed and cooling their system cannot process that kind of WU. It's sensible to set up your overclocks and wait until your system has completed a range of different WUs without ANY problems before having confidence in it. Then, if you want to leave it alone without having to deal with problems and make changes, you must check all the temperatures* and lower your clocks a bit to give you headroom for different WUs that may come along and different ambient temperatures.

If I were to speculate, I'd say a component (one without a temp sensor) on SiriusB's cards is getting too hot now the weather (ambient temperature) is warmer, and this is causing instability with a WU that stresses one part of his GPUs more than the other WUs do.

*for example how hot does the graphics card power regulation hardware get?
 
The project will run on that hardware, and on that software, but not as a whole package for whatever reason. And I agree with your point that different WUs stress the system in different ways, but I don't believe it's a stress/power/temperature-related problem, as he states in the thread linked to earlier that it fails before even starting to work - that sounds very similar to the issue I had with A1 cores on SMP2, which failed with an error before ever starting processing.

Equally, he has the same error when using lower-than-stock clock speeds. It could possibly be a problem with a part of the graphics cards that has no temperature sensor, yes, but if it were a part which rapidly gets hot under load, it would have a heatsink to prevent that, and it would be cool enough from a cold boot to at least start folding these projects (which it doesn't do, even on a cold boot). And if the heatsink had come loose, it would likely fail on other WUs too - and almost certainly wouldn't fail on the other GPU as well.
 
Ok, so the next thing to consider is that Stanford only lists XP/2003/Vista/7 as compatible operating systems. As I said above, I've read on the folding forum about people having problems with particular projects that are solved with a reboot.
 
Miniyazz: you make a lot of good points, but I disagree with your use of the word incompatible. The project WILL run on the hardware that SiriusB has (unless he has a faulty component). In my experience it's clear that individual WUs stress systems in different ways, so it can quite easily occur that one type of work unit will fail where other ones don't. The user blames the WU or project, but really it's just that with that user's combination of clock speed and cooling their system cannot process that kind of WU. It's sensible to set up your overclocks and wait until your system has completed a range of different WUs without ANY problems before having confidence in it. Then, if you want to leave it alone without having to deal with problems and make changes, you must check all the temperatures* and lower your clocks a bit to give you headroom for different WUs that may come along and different ambient temperatures.

I highly doubt it's an issue with the GPU. Even if that WU stresses the GPU in a different way, I can't see a GPU which runs OK for hours on other WUs overheating in less than one second on these WUs.

It's some kind of compatibility problem which only crops up on certain combinations of hardware and software. I agree with miniyazz - Folding is supposed to run in the background. The onus should be on Stanford to pick up on compatibility problems based on the returns they have - or at least prevent the same WU from being issued five times to a machine that's perfectly stable on all other WUs.
I think it's bad for the science if people start avoiding WUs and putting the blame on Stanford.
Avoiding WUs isn't great, but it's better for the science to crunch some WUs than no WUs.
 