Message boards : Graphics cards (GPUs) : acemd2 stops checkpointing?
Author | Message |
---|---|
I just got a new video card (ASUS ENGTX260) and installed it in the same system that has been successfully running GPUGRID for over a year (config below). With the new card, tasks are not erroring out (yet) and the "current CPU time" is increasing normally, but the "checkpoint CPU time" has stopped updating. Any idea whether this is normal or something is wrong? | |
ID: 17814 | Rating: 0 | |
It might be hung. The progress indicator should always work. | |
ID: 17816 | Rating: 0 | |
It definitely seems hung, even though the "current CPU time" continues to climb and the card appears to be crunching away at full speed. I noticed that once the "checkpoint CPU time" stops increasing, the "fraction done" stops increasing, as well. | |
ID: 17825 | Rating: 0 | |
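The symptom described above (CPU time still climbing while the checkpoint time and fraction done stay frozen) can be checked mechanically. The following is only a minimal sketch: the `Sample` type, the `first_stall` helper, and the 1800-second `max_gap` threshold are all hypothetical, and the readings are assumed to have been copied by hand from the task properties in the BOINC manager.

```python
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    """One manual reading of a task's properties (hypothetical structure)."""
    current_cpu: float     # "current CPU time", in seconds
    checkpoint_cpu: float  # "checkpoint CPU time", in seconds
    fraction_done: float   # "fraction done", 0.0 to 1.0


def first_stall(samples: List[Sample], max_gap: float = 1800.0) -> Optional[int]:
    """Return the index of the first reading whose CPU time has run more than
    max_gap seconds past the last checkpoint, or None if there is none.

    max_gap is an assumed threshold, not a value taken from GPUGRID.
    """
    for i, s in enumerate(samples):
        if s.current_cpu - s.checkpoint_cpu > max_gap:
            return i
    return None


readings = [
    Sample(600.0, 580.0, 0.05),
    Sample(1800.0, 1750.0, 0.15),
    Sample(3000.0, 1750.0, 0.15),  # checkpoint and progress frozen
    Sample(4200.0, 1750.0, 0.15),  # gap now exceeds the threshold
]
print(first_stall(readings))
```

A gap that keeps growing while the fraction done is flat matches the hang reported here; a healthy task should never trip the check as long as it checkpoints more often than `max_gap`.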
To see if heat is the cause, download and install GPUZ, nvidiaInspector, Precision, or RealTemp (all free; a quick search will find download sites). | |
ID: 17826 | Rating: 0 | |
ETQuestor, | |
ID: 17828 | Rating: 0 | |
Snow Crash, | |
ID: 17831 | Rating: 0 | |
(they don't have a GPU app for Linux). I have BOINC 6.10.56 (x86_64 Linux). You can try Crunch3r's app: 2.2 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=116 or 3.0 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=120. It needs a little work with ldd and ldconfig, but nothing difficult. For more info, see Lunatic's Seti forum. | |
ID: 17832 | Rating: 0 | |
I just found the "nvidia-smi" utility which is installed as part of the nvidia driver. It doesn't report back much, but it does read the GPU core temp, which is pegged at 79 degrees C. I *think* that is well within the operating range, so I think this means overheating is unlikely as the cause, right? | |
ID: 17833 | Rating: 0 | |
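For anyone who wants to log the temperature rather than read it once, here is a rough sketch of polling `nvidia-smi` from a script. It assumes a driver recent enough to support the `--query-gpu=temperature.gpu --format=csv,noheader,nounits` options; the driver used in this thread may well predate them, in which case the output would have to be parsed differently. The helper names `parse_temps` and `read_gpu_temps` are made up for this example.

```python
import subprocess
from typing import List


def parse_temps(output: str) -> List[int]:
    """Parse one integer temperature (degrees C) per non-empty line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temps() -> List[int]:
    """Ask nvidia-smi for the core temperature of every GPU.

    Assumes the installed driver supports the --query-gpu interface.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


if __name__ == "__main__":
    for index, temp in enumerate(read_gpu_temps()):
        print(f"GPU {index}: {temp} C")
```

Running this every minute or so while a work unit is crunching would show whether the temperature is still rising at the moment a task hangs.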
OK, this is getting frustrating. I managed to finish one WU, but they are mostly erroring out now. The common error is listed below. I also noticed that the hangs (and the final errors) always occur on multiples of 15 minutes of CPU time (e.g., current CPU time = 1800 seconds). That can't be a coincidence. | |
ID: 17842 | Rating: 0 | |
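The 15-minute pattern is easy to test against a list of observed hang times. This is just an illustrative sketch: the `on_interval` helper and its 5-second tolerance are assumptions, chosen to allow for the reported CPU time being a few seconds off an exact multiple of 900.

```python
def on_interval(cpu_seconds: float, interval: float = 900.0,
                tolerance: float = 5.0) -> bool:
    """True if cpu_seconds lies within `tolerance` seconds of a multiple of
    `interval` (900 s = 15 minutes). Both defaults are assumptions."""
    remainder = cpu_seconds % interval
    # Distance to the nearest multiple, whether just below or just above.
    return min(remainder, interval - remainder) <= tolerance


# Hypothetical hang times in seconds; only 1800 comes from the post above.
for t in (1800.0, 2703.2, 2250.0):
    print(t, on_interval(t))
```

If nearly every hang lands on the interval, that would point at something periodic, such as a regularly scheduled checkpoint or write, rather than a random fault.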
Can you show your computers? | |
ID: 17861 | Rating: 0 | |
GDF, I'm not sure if you are just asking me to identify which computer is mine or if you're asking me to do something. | |
ID: 17874 | Rating: 0 | |
I think the card could be running a bit warm, and the WUs may be crashing when the system is in use. The heat tolerance of one card is not always the same as another's, and it changes over the card's lifespan. | |
ID: 17879 | Rating: 0 | |
Thanks for the advice, everyone. I invested some energy into cooling fans and whatnot a while ago, so I have a lot of air movement through my system. This is also a server, so it is 99.9% idle other than GPUGrid. Also, the core temp of the GTX 260 never exceeds 80 degrees C, which is well within normal operating range. I ran memtestg80 and consistently got errors in the high memory, so I am returning the card as defective. | |
ID: 17907 | Rating: 0 | |
I swapped in a replacement card and it's happily running. Looks like the old card was defective. | |
ID: 17915 | Rating: 0 | |