Message boards : Graphics cards (GPUs) : acemd2 stops checkpointing?

ETQuestor
Message 17814 - Posted: 2 Jul 2010 | 6:11:02 UTC

Just got a new video card (ASUS ENGTX260), installed in the same system that has been successfully running GPUGRID for over a year (config below). With the new card, it's not erroring out (yet) and the "current CPU time" is increasing normally, but the "checkpoint CPU time" has stopped updating. Any idea if this is normal or if something is wrong?

Athlon 64 X2 6000+
Linux 2.6.34 x86_64 (Fedora 13)
NVIDIA driver Linux x86_64 256.35
ASUS ENGTX260 (GeForce GTX 260/216)

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 17816 - Posted: 2 Jul 2010 | 7:39:08 UTC - in response to Message 17814.

It might be hung. The progress indicator should always work.

gdf

ETQuestor
Message 17825 - Posted: 2 Jul 2010 | 16:23:21 UTC - in response to Message 17816.

It definitely seems hung, even though the "current CPU time" continues to climb and the card appears to be crunching away at full speed. I noticed that once the "checkpoint CPU time" stops increasing, the "fraction done" stops increasing, as well.

If I stop and restart the BOINC client, it starts over from the last checkpoint and seems to work normally for 30-60 minutes, then gets hung again.
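
The symptom described above (CPU time climbing while the checkpoint time stays frozen) can be checked mechanically. The following is a minimal sketch, not a tested tool: it assumes the task listing is plain text with the field labels "checkpoint CPU time", "current CPU time" and "fraction done" (as `boinccmd --get_tasks` is believed to print them), and the 600-second threshold is a hypothetical value chosen only for illustration.

```python
import re

# Field labels are an assumption about the task listing format;
# adjust them if your client prints the fields differently.
FIELD = re.compile(
    r"^\s*(checkpoint CPU time|current CPU time|fraction done):\s*([0-9.eE+-]+)\s*$"
)


def parse_task(text):
    """Collect the numeric fields of interest from one task's text block."""
    values = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


def checkpoint_stalled(task, max_gap_s=600.0):
    """Return True if CPU time has run more than max_gap_s past the last checkpoint.

    max_gap_s is a hypothetical threshold, not a project-defined value.
    """
    gap = task["current CPU time"] - task["checkpoint CPU time"]
    return gap > max_gap_s


# Illustrative sample only; these numbers are not taken from the thread.
sample = """\
   name: example_task
   checkpoint CPU time: 1200.000000
   current CPU time: 2700.000000
   fraction done: 0.310000
"""

task = parse_task(sample)
print(checkpoint_stalled(task))
```

Running a check like this every few minutes would show roughly when the checkpoint stops advancing, without having to watch the BOINC manager.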

Any tips on further troubleshooting or diagnosis? It seems like overheating is a very common problem, but how do I determine if that is the issue? I already have as many case fans packed into this thing as I can...

Thank you.

Snow Crash
Message 17826 - Posted: 2 Jul 2010 | 16:29:53 UTC - in response to Message 17825.

To see if it is heat, download and install GPUZ, nvidiaInspector, Precision, or RealTemp (all free; a quick search will turn up download sites).

They will not only tell you the GPU temp but also tell you what speeds you are actually running at.

Are you sure you are not soft crashing down into 2D mode?
Are you sure no power settings on your PC are throttling it down?
____________
Thanks - Steve

Profile skgiven
Volunteer moderator
Volunteer tester
Message 17828 - Posted: 2 Jul 2010 | 17:16:41 UTC - in response to Message 17826.
Last modified: 2 Jul 2010 | 17:17:11 UTC

ETQuestor,
How much RAM does your system have?
What else are you crunching?
What version of Boinc are you using?
Is your BIOS configured to put devices to sleep?

ETQuestor
Message 17831 - Posted: 2 Jul 2010 | 18:24:18 UTC - in response to Message 17828.

Snow Crash,

Do any of those info utilities support Linux? I don't have Windows on this system. I have no idea if I'm dropping to 2D mode (how do I check?) or have a problem with power settings, but I ran GPUGRID on a GeForce 9600 GSO in this same system with no problem for almost a year.

skgiven,

I have 3GB of RAM and the OS reports only 25% RAM utilization. The only other thing running is SETI@Home, which only uses CPU (they don't have a GPU app for Linux). I have BOINC 6.10.56 (x86_64 Linux). I don't think the BIOS is configured to put devices to sleep, but I'll check once I get home.

Thanks to both of you for your suggestions.

Profile nenym
Message 17832 - Posted: 2 Jul 2010 | 18:38:25 UTC - in response to Message 17831.
Last modified: 2 Jul 2010 | 18:39:43 UTC

(they don't have a GPU app for linux). I have BOINC 6.10.56 (x86_64 Linux)

You can try Crunch3r's app: 2.2 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=116 or 3.0 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=120
It needs a little work with ldd and ldconfig, but nothing difficult. For more info, see Lunatic's Seti forum.

ETQuestor
Message 17833 - Posted: 2 Jul 2010 | 19:29:36 UTC - in response to Message 17826.

I just found the "nvidia-smi" utility which is installed as part of the nvidia driver. It doesn't report back much, but it does read the GPU core temp, which is pegged at 79 degrees C. I *think* that is well within the operating range, so I think this means overheating is unlikely as the cause, right?
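
For logging temperature alongside clock speed (which would also help answer the 2D-mode question), a sketch along these lines might work. It assumes a much newer `nvidia-smi` than the one shipped with driver 256.35: the `--query-gpu` and `--format=csv` options used here may not exist on that driver, and some cards report `[N/A]` for these fields, which this simple parser does not handle. The sample reading is illustrative, not measured.

```python
import subprocess

# Query options from recent nvidia-smi releases; availability on older
# drivers (such as 256.35) is not guaranteed.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,clocks.sm",
    "--format=csv,noheader,nounits",
]


def parse_reading(line):
    """Split one CSV line into (temperature in C, SM clock in MHz)."""
    temp_c, sm_mhz = (field.strip() for field in line.split(","))
    return int(temp_c), int(sm_mhz)


def read_gpu():
    """Run nvidia-smi once and parse the first GPU's reading."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_reading(result.stdout.splitlines()[0])


# Parsing an illustrative line, so the example runs without a GPU present.
print(parse_reading("79, 1242"))
```

A steady temperature combined with a sudden drop in the reported clock would point toward the card falling back to a low-power state, not overheating.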

ETQuestor
Message 17842 - Posted: 3 Jul 2010 | 1:22:46 UTC - in response to Message 17816.

OK, this is getting frustrating. I managed to finish one WU, but they are mostly erroring out now. The common error is listed below. I also noticed that the hangs (and the final errors) always occur on multiples of 15 minutes of CPU time (e.g., current CPU time = 1800 seconds). That can't be a coincidence.


SWAN : FATAL : Failure executing kernel sync [transpose_float2] [700]
acemd2_6.04_x86_64-pc-linux-gnu__cuda: ../swan/swanlib_nv.cpp:203: void swanRunKernel(const char*, int3, int3, size_t, ...): Assertion `0' failed.
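
The 15-minute pattern noted above can be tested against a list of recorded hang or failure times. This is a small sketch under assumptions: the 900-second period comes from the post, while the 5-second tolerance and the sample times are hypothetical.

```python
def near_multiple(cpu_time_s, period_s=900.0, tolerance_s=5.0):
    """Return True if cpu_time_s lies within tolerance_s of a multiple of period_s."""
    remainder = cpu_time_s % period_s
    # Distance to the nearest multiple, whether just below or just above.
    return min(remainder, period_s - remainder) <= tolerance_s


# Hypothetical CPU times (seconds) at which tasks hung or errored out.
failure_times = [1800.0, 2698.5, 1234.0]

print([near_multiple(t) for t in failure_times])
```

If nearly every recorded time passes this check, the failures are likely tied to something periodic in the application or the client, not to random events.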

Profile GDF
Message 17861 - Posted: 4 Jul 2010 | 11:02:29 UTC - in response to Message 17842.

Can you show your computers?

gdf

ETQuestor
Message 17874 - Posted: 4 Jul 2010 | 20:25:59 UTC - in response to Message 17861.

GDF, I'm not sure if you are just asking me to identify which computer is mine or if you're asking me to do something.

My computer is http://www.gpugrid.net/show_host_detail.php?hostid=43352

Please clarify if you wanted me to do something specific and I'll be happy to do it.

Profile skgiven
Message 17879 - Posted: 5 Jul 2010 | 9:17:59 UTC - in response to Message 17874.

I think it could be a bit warm, and the WUs may be crashing when the system is in use. The tolerance of one card is not always the same as another's, and it changes over the card's life span.
Leave the case door off; it should let the card cool down a bit more. Then don't use the system while a task is running and see how it gets on.

ETQuestor
Message 17907 - Posted: 6 Jul 2010 | 18:17:58 UTC - in response to Message 17879.

Thanks for the advice, everyone. I invested some energy into cooling fans and whatnot a while ago, so I have a lot of air movement through my system. This is also a server, so it is 99.9% idle other than GPUGrid. Also, the core temp of the GTX 260 never exceeds 80 degrees C, which is well within normal operating range. I ran memtestg80 and consistently got errors in the high memory, so I am returning the card as defective.

ETQuestor
Message 17915 - Posted: 9 Jul 2010 | 23:59:42 UTC - in response to Message 17861.

I swapped in a replacement card and it's happily running. Looks like the old card was defective.
