GTX295 F@H - Help Please

Soldato
Joined
23 Nov 2002
Posts
9,762
Location
Near Bristol
Hi Guys,

Basically I sold a BFG GTX295 Single PCB to someone who uses them for folding. The card had, of course, been tested and used for gaming, Vantage etc. with zero problems.

The guy got back to me saying: "GPU 0 keeps on generating application errors, but GPU 1 is fine. The application errors are accompanied by a message from windows that the video card is 'not responding, but has recovered', and thin black horizontal line which travels quickly vertically down the screen. The card then trashes the unit it was working on, and then carries on. This is an intermittent fault, which may not be visible when playing games."

The card was returned to me, but before I RMA it to BFG I wanted to test that it was indeed the card at fault. Being rather new to GPU folding, I followed the instructions here: http://folding.stanford.edu/English/WinGPUGuide, especially the part at the bottom. I connected up two monitors to the card to make sure it would use both GPUs.

But I get this issue here: http://www.markljlewis.com/folding295pic1.jpg

It is not the same issue as he described, but similar. One of the cores appears to be folding fine; the other states that the core is not running now. Its progress appears to reset every time I press display and seems to go from the speed shown to next to nothing. Both GPUs appear to be loaded, although the values mean nothing as they could be for either one.

Is there anything I can try before RMAing it? Just in case I've missed something obvious.

Thanks,

Mark
 
OK, a couple of things to try, to rule out the possibility that it's something other than the card.

1. First, it's not easy getting a 295 to fold :(
2. Don't use the GUI client; try downloading the console client from here
4. If at all possible, don't use Vista or Windows 7. It can work, but it's more of a pain to get going, especially if you are new. Use Windows XP if you have a dual-boot system, or pop an old hard drive in and install XP on that. The Vista/7 driver model is much changed from XP and isn't as happy about using cards with dual GPUs.
5. Disable SLI in the NVIDIA Control Panel
6. Whatever OS you use, make sure you extend the desktop to both GPUs
7. Create two folders, call them c:\folding\gpu0 and c:\folding\gpu1, and copy the console app into each folder.
9. Go to the first folder c:\folding\gpu0 and type "[email protected] -gpu 0 -verbosity 9"; that should start the first core folding
10. Now we need to configure the second client. Go to the second folder c:\folding\gpu1 and type: "[email protected] -configonly"
11. Hit enter for all the answers, but when it asks to "Change advance options (yes/no)?" say "yes"
12. Hit enter for all the advanced options until you get to "Machine ID (1-16)" and put 3
13. Then hit enter for all the other answers till you get out of the config.
14. Then type "[email protected] -gpu 1 -verbosity 9"
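
The two-client layout in the steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: `fah_console.exe` is a placeholder for the real console client file name, `plan_clients` is a made-up helper, and the only flags used are the `-gpu` and `-verbosity 9` ones from the steps. The second client still needs its one-off `-configonly` run to set a different Machine ID before it is launched.

```python
from pathlib import PureWindowsPath

# Placeholder name (assumption): substitute the actual console client executable.
CLIENT_EXE = "fah_console.exe"


def plan_clients(base_dir=r"c:\folding", gpu_count=2, exe=CLIENT_EXE):
    """Return one (working_dir, argv) pair per GPU, mirroring the manual steps."""
    plans = []
    for gpu in range(gpu_count):
        # One folder per GPU so each client keeps its own config, work and logs.
        workdir = PureWindowsPath(base_dir) / f"gpu{gpu}"
        argv = [str(workdir / exe), "-gpu", str(gpu), "-verbosity", "9"]
        plans.append((str(workdir), argv))
    return plans


if __name__ == "__main__":
    for workdir, argv in plan_clients():
        print(workdir, "->", " ".join(argv))
```

On the folding machine, each pair could then be handed to `subprocess.Popen(argv, cwd=workdir)` so that every client runs from its own folder.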

It's not unusual for 295s to have the sort of problem you are experiencing. Out of the 4 I had, one had the same problem. The difficulty is getting them configured. If everything is configured correctly and it still fails, then the card may well be faulty, but a failure to fold isn't necessarily grounds for an RMA unless you can prove that it's broken in other applications such as gaming.

If folding still proves troublesome, try Furmark in multi-GPU mode to stress test both cores and see if you get artifacts or crashes.
 
Well from following your instructions it appears to be working.

I first tried Furmark in multi-gpu mode with no problems along with a few loops of Vantage.

I then uninstalled the GUI client and disabled SLI, which I think the drivers call multi-GPU mode or something. I installed the console client using your instructions and it appears to be working. I've done two work units, one for each core.

I think I need to do a fair few to be certain.
 
Sounds good to me :)

You might get the odd one crashing (EUE) every now and then, esp if it gets too hot, but no more than one a week or so.
 
What does EUE stand for?

Is HFM.NET still the best monitoring software? I seem to recall not getting on with it when I last used it.
 
Early Unit End. Usually means either a duff WU or more likely an unstable o/c or overheating card. The odd one is frustrating but if it happens regularly then you need to find the cause & fix it.

As for HFM.NET most of the mega Folders have gone over to it. Might have been updated since you last used it. As I only run a single client FahMon does me as I'm familiar with it;)
 
Thanks.

Right I left it running while I was at Kemble this afternoon.

http://www.markljlewis.com/hfm.jpg

Strangely, however, HFM and the Folding@home stats don't match the logs. HFM states 4 WUs completed per core, but going through the logs for both cores using the 100% as a reference, it seems that 5 WUs have been completed.

Could someone check my logs to make sure I'm not miscounting or missing something obvious: http://www.markljlewis.com/fahlogs.zip
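
For anyone wanting to repeat the count without scrolling through the files, the tally described above (counting the 100% lines) could be sketched like this. It assumes each finished unit writes exactly one log line containing `100%`, which is an assumption about the log format; `count_completed_units` is a made-up helper and the sample lines are invented for illustration.

```python
def count_completed_units(log_text, marker="100%"):
    """Count log lines containing the marker, one per finished work unit."""
    return sum(1 for line in log_text.splitlines() if marker in line)


if __name__ == "__main__":
    # Invented sample lines for illustration only, not taken from a real log.
    sample = "\n".join([
        "[10:00:00] Completed 99%",
        "[10:05:00] Completed 100%",
        "[11:00:00] Completed 50%",
        "[12:05:00] Completed 100%",
    ])
    print(count_completed_units(sample))
```

Running it once per client log gives a per-core figure to compare against what HFM reports.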

Thanks guys!
 
The version I'm using at the moment has this:

Add Toggle Switch (F10) to change between 'ClientTotal' or 'CurrentClientRun' Completed Units Count.

If you see my logs, it has done three since the last client start, so that can't be right. It would appear it is set to ClientTotal and it is missing one unit for some reason. Unless of course it counts units after they have been sent, in which case you may be right.

Either way it isn't hugely important, I just wanted to be sure that one of the WU's wasn't failing for some reason. Hopefully it will show the correct number when Folding@Home stats update...
 
Well, the Folding@home website is showing 11 WUs, so god knows what's going on:

http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=Speed&teamnum=10

Should be no more than 10 according to the logs, but I guess that can't be a bad thing and might be related to when the internet went down or something.

Putting it in the storage and download pc in the loft to see how it goes for a longer run. It is hot up there as well, so should be a good test of the card.
 
Have a look in the View drop-down in HFM; it'll give you an option to show the number of completed units either since the client started or for the life of the client.
 
Unfortunately the card is installed elsewhere at the moment, but I'll give that a go tomorrow.

As long as it is not showing less than the logs, I think it should be fine.

I'd imagine what you mentioned is also changed by the F10 hotkey as I stated earlier, not that I tried it mind.
 
Ah right, didn't know that. Pretty neat feature for remote viewing, I'll have to set it up.

The loft was 37C at midday today, so I doubt I'll run it for 24 hours, just until the morning or when the temps start getting silly.

Assuming it lasts that long! lol
 
It's really handy. I work on an oil rig for 4 weeks at a time and it's really handy for keeping an eye on things while I'm away. Here's my page so you can see what the output looks like. Free webspace 4TW :D

Be interesting to see how it copes at that sort of temperature.
 
Nice, wouldn't want to see your electricity bill! lol

I woke up this morning to find HFM stating that 5 WUs per core had failed; only after I changed the completed-units setting (F10) did it show 1 completed per core. At least it is even and not one core causing a problem, as far as I can tell.

Looking at the logs I have this:

[00:30:21] Folding@home Core Shutdown: UNSTABLE_MACHINE x5

[00:30:28] EUE limit exceeded. Pausing 24 hours.
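
A quick way to pull these entries out of a long log is sketched below. The two search strings are copied from the lines above; `find_eue_events` is a hypothetical helper, and it assumes each event appears on its own log line.

```python
def find_eue_events(log_text):
    """Return counts of unstable-machine shutdowns and EUE-limit pauses."""
    counts = {"unstable_machine": 0, "eue_limit": 0}
    for line in log_text.splitlines():
        if "UNSTABLE_MACHINE" in line:
            counts["unstable_machine"] += 1
        if "EUE limit exceeded" in line:
            counts["eue_limit"] += 1
    return counts


if __name__ == "__main__":
    # Invented sample for illustration; feed in the real FAHlog text instead.
    sample = (
        "[00:10:00] Folding@home Core Shutdown: UNSTABLE_MACHINE\n"
        "[00:30:21] Folding@home Core Shutdown: UNSTABLE_MACHINE\n"
        "[00:30:28] EUE limit exceeded. Pausing 24 hours.\n"
    )
    print(find_eue_events(sample))
```

Comparing the counts from each client's log shows whether one core is failing more often than the other.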

Full logs can be found here: http://www.markljlewis.com/fahlogs2.zip. I'd appreciate it if someone knowledgeable could take a look at both.

I'm guessing that temps were not an issue as they didn't really get above 75C. If I had to guess, I'd say it was down to the latest NVIDIA drivers, the power supplies or some other problem related to this new setup, which is a completely different system from the one I used previously.

What I don't get is why it did one WU perfectly fine and then had this problem.

In stranger news, the stat page now has me at 23WU when I've only done 2 more than last time. :confused:

http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=Speed&teamnum=10
 
The stats only update every hour/every three hours and have a time delay of 6 hours.

More commonly an EUE is caused by an unstable overclock; try reducing it somewhat and see if it happens again.
 
That doesn't explain why I'm being credited with WUs I never completed.

Everything is at stock, currently running it using a different PSU (instead of the two I used previously), although ambient temps are getting high.
 