GTX295 F@H - Help Please

Right, different GTX295, different system. I'm getting a strange problem after completing a WU: it then just loops from "Attempting to send results" until it racks up too many UNSTABLE_MACHINE errors to continue. It happens on both cores. Driver issue, maybe?

[03:26:35] Completed 98%
[03:27:25] Completed 99%
[03:28:14] Completed 100%
[03:28:14] Successful run
[03:28:14] DynamicWrapper: Finished Work Unit: sleep=10000
[03:28:25] Reserved 84576 bytes for xtc file; Cosm status=0
[03:28:25] Allocated 84576 bytes for xtc file
[03:28:25] - Reading up to 84576 from "work/wudata_01.xtc": Read 84576
[03:28:25] Read 84576 bytes from xtc file; available packet space=786345888
[03:28:25] xtc file hash check passed.
[03:28:25] Reserved 25248 25248 786345888 bytes for arc file=<work/wudata_01.trr> Cosm status=0
[03:28:25] Allocated 25248 bytes for arc file
[03:28:25] - Reading up to 25248 from "work/wudata_01.trr": Read 25248
[03:28:25] Read 25248 bytes from arc file; available packet space=786320640
[03:28:25] trr file hash check passed.
[03:28:25] Allocated 560 bytes for edr file
[03:28:25] Read bedfile
[03:28:25] edr file hash check passed.
[03:28:25] Allocated 0 bytes for logfile
[03:28:25] Could not open/read logfile=<work/wudata_01.log>; Cosm status=-1
[03:28:25] GuardedRun: success in DynamicWrapper
[03:28:25] GuardedRun: done
[03:28:25] Run: GuardedRun completed.
[03:28:27] + Opened results file
[03:28:27] - Writing 110896 bytes of core data to disk...
[03:28:27] Done: 110384 -> 109756 (compressed to 99.4 percent)
[03:28:27] ... Done.
[03:28:27] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[03:28:27] Shutting down core
[03:28:27]
[03:28:27] Folding@home Core Shutdown: FINISHED_UNIT
[03:28:31] CoreStatus = 64 (100)
[03:28:31] Unit 1 finished with 99 percent of time to deadline remaining.
[03:28:31] Updated performance fraction: 0.993553
[03:28:31] Sending work to server
[03:28:31] Project: 6601 (Run 9, Clone 693, Gen 205)
[03:28:31] + Attempting to send results [July 25 03:28:31 UTC]
[03:28:31] - Reading file work/wuresults_01.dat from core
[03:28:31] (Read 110268 bytes from disk)
[03:28:31] Connecting to http://171.64.65.61:8080/
[03:28:36] Posted data.
[03:28:36] Initial: 0000; - Uploaded at ~21 kB/s
[03:28:36] - Averaged speed for that direction ~21 kB/s
[03:28:36] + Results successfully sent
[03:28:36] Thank you for your contribution to Folding@Home.
[03:28:36] + Starting local stats count at 1
[03:28:40] Trying to send all finished work units
[03:28:40] + No unsent completed units remaining.
[03:28:40] - Preparing to get new work unit...
[03:28:40] + Attempting to get work packet
[03:28:40] - Will indicate memory of 6135 MB
[03:28:40] - Connecting to assignment server
[03:28:40] Connecting to http://assign-GPU.stanford.edu:8080/
[03:28:42] Posted data.
[03:28:42] Initial: 40AB; - Successful: assigned to (171.64.65.61).
[03:28:42] + News From Folding@Home: Welcome to Folding@Home
[03:28:42] Loaded queue successfully.
[03:28:42] Connecting to http://171.64.65.61:8080/
[03:28:43] Posted data.
[03:28:43] Initial: 0000; - Receiving payload (expected size: 74336)
[03:28:44] - Downloaded at ~72 kB/s
[03:28:44] - Averaged speed for that direction ~72 kB/s
[03:28:44] + Received work.
[03:28:44] Trying to send all finished work units
[03:28:44] + No unsent completed units remaining.
[03:28:44] + Closed connections
[03:28:44]
[03:28:44] + Processing work unit
[03:28:44] Core required: FahCore_11.exe
[03:28:44] Core found.
[03:28:44] Working on queue slot 02 [July 25 03:28:44 UTC]
[03:28:44] + Working ...
[03:28:44] - Calling '.\FahCore_11.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3432 -version 623'

[03:28:44]
[03:28:44] *------------------------------*
[03:28:44] Folding@Home GPU Core
[03:28:44] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[03:28:44]
[03:28:44] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[03:28:44] Build host: amoeba
[03:28:44] Board Type: Nvidia
[03:28:44] Core :
[03:28:44] Preparing to commence simulation
[03:28:44] - Looking at optimizations...
[03:28:44] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[03:28:44] - Created dyn
[03:28:44] - Files status OK
[03:28:44] - Expanded 73824 -> 383588 (decompressed 519.5 percent)
[03:28:44] Called DecompressByteArray: compressed_data_size=73824 data_size=383588, decompressed_data_size=383588 diff=0
[03:28:44] - Digital signature verified
[03:28:44]
[03:28:44] Project: 6600 (Run 10, Clone 922, Gen 6)
[03:28:44]
[03:28:44] Assembly optimizations on if available.
[03:28:44] Entering M.D.
[03:28:51] Tpr hash work/wudata_02.tpr: 1542977918 4078650811 266603496 1470683992 346959629
[03:28:51]
[03:28:51] Calling fah_main args: 14 usage=100
[03:28:51]
[03:28:51] mdrun_gpu returned
[03:28:51] Going to send back what have done -- stepsTotalG=0
[03:28:51] Work fraction=0.0000 steps=0.
[03:28:55] logfile size=0 infoLength=0 edr=0 trr=25
[03:28:55] + Opened results file
[03:28:55] - Writing 637 bytes of core data to disk...
[03:28:55] Done: 125 -> 124 (compressed to 99.2 percent)
[03:28:55] ... Done.
[03:28:55] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[03:28:55]
[03:28:55] Folding@home Core Shutdown: UNSTABLE_MACHINE
[03:28:59] CoreStatus = 7A (122)
[03:28:59] Sending work to server
[03:28:59] Project: 6600 (Run 10, Clone 922, Gen 6)

[03:29:31] + Attempting to send results [July 25 03:29:31 UTC]
 
Very odd. I've tried loads of solutions (new drivers, downclocking) and nothing helps. It does one WU fine, then hits UNSTABLE_MACHINE until it reaches the EUE limit.

Even stranger, I removed the GTX295, switched to a GTX260 with a fresh driver install, and the same issue happens again! :confused:

Any suggestions? Both cards appear to be fine under Vantage and Furmark.

Board or PSU issue, maybe? Both are good models: an Asus P6T V2 and a BeQuiet 650W PSU.
 
The joy of the challenge, justification for spending money on the kit & the warm glow that we are contributing to a noble cause plus the added bonus of tiffies & stompage:)

Speed: I had an UNSTABLE_MACHINE unit recently, even though my card (a 260) has been rock solid for a very long time. It was at 70% and was killed by a reboot!
Maybe you've got some bad WUs; have you checked the FAH forum?
Have you tried a different PCI-e slot?

Just tried a different slot, similar issue:

[17:10:57] Completed 28%
[17:12:02] Run: exception thrown during GuardedRun
[17:12:02] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[17:12:02] Going to send back what have done -- stepsTotalG=15000000
[17:12:02] Work fraction=0.2890 steps=15000000.
[17:12:06] logfile size=0 infoLength=0 edr=0 trr=23
[17:12:06] + Opened results file
[17:12:06] - Writing 642 bytes of core data to disk...
[17:12:06] Done: 130 -> 129 (compressed to 99.2 percent)
[17:12:06] ... Done.
[17:12:06] DeleteFrameFiles: successfully deleted file=work/wudata_07.ckp
[17:12:06]
[17:12:06] Folding@home Core Shutdown: UNSTABLE_MACHINE

In Windows Event Viewer I found this: Display driver nvlddmkm stopped responding and has successfully recovered.

The timestamp is 18:12, which matches (the F@H log records UTC, while Event Viewer shows local time).
 
Right, I uninstalled the latest drivers and, unlike with the previous driver changes, also uninstalled the PhysX driver. I've gone back to 197.45; it has just completed its first WU and has finally moved on to a second.

So the problem could well be solved, hopefully I'm not speaking too soon!
 
It is becoming a PITA!

Second WU failed, same error:

[20:20:24] Completed 68%
[20:21:23] Completed 69%
[20:22:12] Run: exception thrown during GuardedRun
[20:22:12] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[20:22:12] Going to send back what have done -- stepsTotalG=10000000
[20:22:12] Work fraction=0.6980 steps=10000000.
[20:22:17] logfile size=0 infoLength=0 edr=0 trr=23
[20:22:17] + Opened results file
[20:22:17] - Writing 642 bytes of core data to disk...
[20:22:17] Done: 130 -> 127 (compressed to 97.6 percent)
[20:22:17] ... Done.
[20:22:17] DeleteFrameFiles: successfully deleted file=work/wudata_03.ckp
[20:22:17]
[20:22:17] Folding@home Core Shutdown: UNSTABLE_MACHINE
[20:22:19] CoreStatus = 7A (122)
[20:22:19] Sending work to server
[20:22:19] Project: 6606 (Run 5, Clone 506, Gen 191)

It did carry on folding with a new WU, though, before I stopped it.

The loft where it is running is hot, but the card seems fine at 77°C. Since the logs keep overwriting, these are the only project numbers I have at the moment:

Project: 5765 (Run 13, Clone 381, Gen 1082) Completed
Project: 6606 (Run 5, Clone 506, Gen 191) Failed @ 69%

I'm currently running it with a different PSU. It might be an issue with the 66XX projects, but I've not read anything about it.
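Since the logs keep overwriting, a small script could pull the project numbers and shutdown results out of FAHlog.txt before they're lost, and tally failures per project to test the 66XX theory. This is a minimal sketch, assuming Python 3 and the log line formats quoted in this thread; `summarize` and `failures_by_project` are hypothetical names, not part of the F@H client.

```python
import re
from collections import Counter

# Hypothetical helper, not part of the F@H client. The regexes are assumptions
# based only on the log lines quoted in this thread.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def summarize(log_text):
    """Return [((project, run, clone, gen), outcome), ...] in log order."""
    records = []
    current = None  # most recent Project line seen
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            current = tuple(int(g) for g in m.groups())
            continue
        m = SHUTDOWN_RE.search(line)
        if m and current is not None:
            records.append((current, m.group(1)))
            current = None  # don't attribute a later shutdown to the same WU
    return records


def failures_by_project(records):
    """Count shutdowns other than FINISHED_UNIT per project number."""
    return Counter(wu[0] for wu, outcome in records if outcome != "FINISHED_UNIT")
```

Running `failures_by_project(summarize(text))` on a saved copy of the log gives a per-project failure count. If the failures spread across unrelated projects, that leans more towards hardware than bad WUs, though it's only a rough indicator.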
 
EUE pause on a 6606; this is getting annoying now. Either both cards are faulty (unlikely, but possible) or there is something wrong elsewhere. Aside from the cards, I guess that leaves the installation and the motherboard.

[00:57:37] Completed 71%
[00:58:36] Completed 72%
[00:59:13] Run: exception thrown during GuardedRun
[00:59:13] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[00:59:13] Going to send back what have done -- stepsTotalG=10000000
[00:59:13] Work fraction=0.7260 steps=10000000.
[00:59:17] logfile size=0 infoLength=0 edr=0 trr=23
[00:59:17] + Opened results file
[00:59:17] - Writing 642 bytes of core data to disk...
[00:59:17] Done: 130 -> 127 (compressed to 97.6 percent)
[00:59:17] ... Done.
[00:59:17] DeleteFrameFiles: successfully deleted file=work/wudata_06.ckp
[00:59:17]
[00:59:17] Folding@home Core Shutdown: UNSTABLE_MACHINE
[00:59:20] CoreStatus = 7A (122)
[00:59:20] Sending work to server
[00:59:20] Project: 6606 (Run 0, Clone 206, Gen 207)


From what I've read, "CoreStatus = 7A (122)" means either a graphics hardware issue or a WU issue. Either way, I've deleted the whole folding GPU folder and started afresh. I should have some more GTX260s arriving in the next few days, so if I can't solve it, maybe new GPU hardware will.
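The CoreStatus line gives the same value twice, in hex and then in decimal (0x7A = 122). In the logs above, 64 (100) follows FINISHED_UNIT and 7A (122) follows UNSTABLE_MACHINE. A minimal sketch, assuming Python 3; `decode_core_status` is a hypothetical name, and the mapping covers only the two codes seen in this thread, so it can't tell a hardware fault from a bad WU.

```python
import re

# Only the two codes that appear in this thread's logs; not a complete table
# of F@H core status codes.
OBSERVED_CODES = {0x64: "FINISHED_UNIT", 0x7A: "UNSTABLE_MACHINE"}
CORE_STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def decode_core_status(line):
    """Return (code, name) for a 'CoreStatus = XX (NNN)' line, or None."""
    m = CORE_STATUS_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1), 16)  # first field is hexadecimal
    if code != int(m.group(2)):  # bracketed field is the same value in decimal
        raise ValueError("hex and decimal fields disagree: " + line)
    return code, OBSERVED_CODES.get(code, "unknown")
```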
 
It has happened again, and I have to say I'm getting annoyed. Either I'm doing something seriously stupid or there is something seriously wrong. Here is the very latest full log, from a fresh install of the GPU3 client; if someone could take a look, I'd appreciate it:

http://www.markljlewis.com/FAHlog.txt
 
I have, yes. Unfortunately I can't get it to test all the memory: it does about 750 with 25 tests and 700 with 100 tests. Either way, both passed.

I've improved the cooling in the case and changed the fan curve with MSI Afterburner so the fan runs at 65%+ regardless of temperature. Hopefully that will sort it, but only time will tell. I've had it do 10 or 11 WUs and then fail, so until it does 30+ I won't be happy it's stable.
 