Hello,
I have been running GPU BOINC jobs for some time successfully, but was always a bit worried that my graphics card could overheat. It would run at around 67deg Celsius. And so I had developed the habit of aborting the "Long Run" tasks preemptively, unaware until today that the server records this as a failure.
But then for some dumb reason I decided to update my driver, to version 296.10 . I don't really know why I did this. And then last night I had set my BOINC manager to start computing again as normal overnight.
But this morning I found my PC in a corrupted state, and needed to force a shutdown with the power button. I ran a file system check just in case, but luckily there were no FS errors detected in the process. And I'm still somewhat enthusiastic about running GPU BOINC jobs.
When I checked the Web site, what I found was that one work unit was completed on the GPU successfully, but that some computation error must have taken place. And then after this computation error, every other job which was launched on the GPU, failed again with the message that the resource (the GPU or its CUDA interface) could not be found. Thus I've amassed over 30 compute errors this way, even though the jobs mainly never really came to run properly on my GPU.
My personal estimation of what must have happened, was that when a GPU computation error does take place, it *can* leave the GC in a logic state in which it can no longer be used, until a reboot has been done. The role of my new driver might have been, that it was pushing the graphic card /harder/ than before, so that the GC may have actually gotten hotter than before (maybe 70deg Celsius)? But I'm still guessing that my actual computation error was probably just random.
I've run some tests on the GPU now, using the CUDA SDK which I have installed, and since my reboot the GPU is accepting CUDA programs fine again.
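A quick check of this kind can also be scripted without the SDK samples. The following is only a minimal sketch, assuming the CUDA driver library is installed under its usual name (`nvcuda.dll` on Windows, `libcuda.so.1` on Linux); the function name `cuda_device_count` is made up for illustration. It calls the driver API functions `cuInit` and `cuDeviceGetCount` and reports whether the GPU is currently reachable:

```python
import ctypes
import sys


def cuda_device_count():
    """Return the number of CUDA devices, or None if the driver API is unusable.

    Hypothetical helper: it only probes the driver, it does not run a kernel.
    """
    on_windows = sys.platform.startswith("win")
    # Assumed library names of the CUDA driver API on each platform.
    name = "nvcuda.dll" if on_windows else "libcuda.so.1"
    try:
        lib = ctypes.WinDLL(name) if on_windows else ctypes.CDLL(name)
    except OSError:
        return None  # driver library missing or not loadable

    # cuInit(0) must succeed before any other driver call; 0 means CUDA_SUCCESS.
    if lib.cuInit(0) != 0:
        return None

    count = ctypes.c_int(0)
    if lib.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    n = cuda_device_count()
    print("CUDA unavailable" if n is None else f"{n} CUDA device(s) visible")
```

A result of `None` would correspond to the "resource could not be found" state described above, although this probe cannot tell why the resource is missing.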
I'm just asking myself, IF it was entirely the fault of my new driver, then WHY would the BOINC client have started and completed one GPU job successfully? In that case even the first job should not have run properly.
But in any case what happened in an automated way, was that your server just downloaded one job after the other, and that they all reported back as unsuccessful, over 30 jobs, and that the server simply seems to have reached the end of its job queue for now.
My question is, whether as soon as the next job becomes available, should I just 'have at it' again, hoping that this type of computation error was just a fluke and probably won't recur? Or am I just taking too many risks with my GC?
And is there any way to go back into the Web site, and mark the jobs which had compute errors (which were _not_ Long Run tasks), just to retry them? Because right now I'm not getting any more jobs, because I seem to have reached the end of the queue.
Dirk
[I should add:] This morning my GPU fan was not running at its accelerated speed anymore, indicating that the GPU itself was no longer hot. And yet, it was an unavailable resource for new BOINC jobs overnight. And after the reboot, it instantly became an available resource again...
Oh yes, and I am running on Windows 7 Professional, 64-bit, with an Intel i7 950 quad-core, max 3.07GHz . I've seen many stability issues on this box, which can fray my nerves, but run into very few actual hardware-problems.
Oddly enough, this O/S is capable of displaying its desktop fine, without any error messages, and yet not really be responding due to a GC not working properly. And yet the mouse still moves its cursor around... I've seen it before, and chalk up this state to 'some of the CPU cores still running, but not all the cores or all the processes still running'.
[Further:] After reading elsewhere in the forum on the 296.10 driver, I've just learned that it was probably an error on my part, having installed it, to continue allowing my display to go to sleep. So one thing I've done for now, in preparation for a retry, is to change my power settings never to turn off the monitor. But that (GPU) resource might just as easily have been unavailable 30x over, due to the monitor sleeping... {:-/} |
|
|
skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
295 and 296 drivers are bad on W7. This is a well reported problem, in this and other forums.
Uninstall your driver and install an older driver.
Your best bet might be a driver between 280 and 285.
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help |
|
|
|
Sounds like a great idea.
But just before I installed the 296.10 driver, I had allowed Windows Update to install its version of the recently-approved drivers. I suspect that it was trying to install the 280.26 version, because that one's not listed on the nVidia Web site as a 'Beta', rather officially being a 'WHQL Release'.
And as many of you already know, we should *not* go through WU to update such things. It's a /bad/ thing to do, because apparently it does poor device-driver /management/. So after this major gaffe and lapse in attention span on my part last night, I was unable to open the nVidia Control Panel anymore.
I felt at the time that 296.10 might be a good way to keep WU from nagging me with its updates, because at least this version is theoretically ahead of the version that WU offers. Also, having had a screwed-up driver once, makes me leery of reinstalling the same one again.
The installers we download from the nVidia Web site at least do a clean install, which left me without apparent corruption again.
What I've just done, is to download the installer (.EXE) file from the nVidia Web site, for 280.26 . If I run into more problems like this, I will _use_ that .EXE File to downgrade.
Dirk |
|
|
|
I'm sorry to post to myself, but do have an afterthought.
I strongly tend to disfavor downgrades, if there's some chance to stick with an upgrade.
And I admit that one reason why I'm not downgrading just yet, is actually the need to rearrange all those icons on my 1920x1080 desktop. When we reinstall a GPU device driver, the size of the screen temporarily shrinks, so that not all the usual icons can be shown.
After I reboot, I find that many of the icons outside that smaller rectangle end up displaced, so that next I need to spend about 1/2 hour rearranging them again. I've actually taken a screen-shot of my desktop, just to make that a bit easier.
But then simply never to let one's monitor go to sleep - would be /even easier/ than that!
I should be able to tell you very soon, how well that worked out for me. Just to let any reader know, I'm not dumb enough at this point to leave my PC doing GPU computing overnight again. One reason is the mere fact that I wouldn't sleep very comfortably, knowing what could go wrong. Another reason is knowing, that if this gets screwed up again, as may well happen, it won't just mess up one work unit. It would mess up 30 work units in a row.
So just for tonight, I'm letting my two BOINC-capable computers do regular CPU computing. And then by tomorrow morning, I'm hoping that the servers' quota system allows me to try a fresh GPU work unit.
And then the work unit in question will need to be closely supervised by me, also when I'm fully awake, not directly after my first cup of coffee.
Dirk |
|
|
|
My experiment was based on this forum posting:
http://www.gpugrid.net/forum_thread.php?id=2921&nowrap=true#24161
What I found on my box was that driver 296.10 did not kill all the GPU work units when the monitor went to sleep. However, the monitor being asleep definitely prevents software - including BOINC - from launching any new GPU jobs, which BOINC in turn experiences as computation errors. And then, if there is in fact a backlog of 30+ failed attempts to use the GPU, all due to device driver policy, what can happen next is that desktop compositing no longer works properly, because the desktop management built into Windows 7 is also trying to access the GPU (open a render context).
Further, Windows 7 still seems to have a weakness these days, in that some aspects of the UI work synchronously and with no timeout. This means that if the GPU is unavailable, and maybe not clear about this in its state, for 10 minutes, the mouse cursor will have a spinning circle and indicate a Busy state - for 10 minutes ! And of course in practice nobody waits for 10 minutes, for his desktop just to recover. So at that point any user might fear that his computer is screwed up in a major way, which wouldn't really be true. And yet this can cripple a session, because such a desktop has become unresponsive, after having been responsive, when a program was launched which would normally display an effect.
Thus, if one can set the monitor never to go to sleep, it should be perfectly feasible to run GPU work units even with the 296.10 driver version.
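Besides the power settings, a running program can itself ask Windows to keep the display on, through the `SetThreadExecutionState` API in `kernel32`. The sketch below is only an illustration under assumptions: the helper names `display_flags` and `keep_display_awake` are made up, and whether such a request actually avoids the 296.10 behavior described here is untested. The request lasts only while the calling script keeps running.

```python
import ctypes
import sys
import time

# Flag values defined by the Windows API for SetThreadExecutionState.
ES_CONTINUOUS = 0x80000000
ES_SYSTEM_REQUIRED = 0x00000001
ES_DISPLAY_REQUIRED = 0x00000002


def display_flags(enable):
    """Build the flag word: keep system and display awake, or release the request."""
    flags = ES_CONTINUOUS
    if enable:
        flags |= ES_SYSTEM_REQUIRED | ES_DISPLAY_REQUIRED
    return flags


def keep_display_awake(enable=True):
    """Apply the request on Windows; return False on other systems (nothing done)."""
    if not sys.platform.startswith("win"):
        return False
    # c_uint avoids any signed-integer overflow when passing 0x80000000.
    ctypes.windll.kernel32.SetThreadExecutionState(ctypes.c_uint(display_flags(enable)))
    return True


if __name__ == "__main__":
    if keep_display_awake(True):
        try:
            while True:          # stay alive for as long as GPU work should run
                time.sleep(60)
        except KeyboardInterrupt:
            keep_display_awake(False)  # hand control back to the power plan
```

Stopping the script with Ctrl+C releases the request, so the normal power plan applies again.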
--------
What I found out, is that the GPU temperature is in fact slightly hotter with this driver, being at 69deg Celsius instead of at 67deg Celsius. But I don't see this as dangerous yet, because it doesn't prompt the GC to rev up the fan to full speed. The fan runs at roughly 50% at this GPU temperature.
There seems to be a considerable performance improvement with the 296.10 driver. With the previously-working 275.xx driver, if the BOINC manager predicted that a work unit would take 18 hours to complete, this was low-balling the real time, and that work unit would take over 20 hours to complete.
With the higher driver version, a work unit predicted to take 18 hours to complete, actually just now completed within 10 hours and 8 minutes!
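As a rough worked calculation of what these figures imply, the snippet below compares the two run times. It assumes that the two work units, both predicted at 18 hours, were of comparable size, which is not confirmed; since the older run took "over 20 hours", the results are lower bounds.

```python
old_actual_h = 20.0          # 275.xx driver: "over 20 hours", so a lower bound
new_actual_h = 10 + 8 / 60   # 296.10 driver: 10 hours and 8 minutes

speedup = old_actual_h / new_actual_h
time_saved_pct = (1 - new_actual_h / old_actual_h) * 100

print(f"speed-up: at least {speedup:.2f}x")                    # at least 1.97x
print(f"run time reduced by at least {time_saved_pct:.0f}%")   # at least 49%
```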
--------
One behavior of my computer did surprise me tonight though. Even though I had newly set it in my _power settings_ , never to let the monitor go to sleep, it went to sleep briefly anyway, at 1:00am on a Sunday morning. This is certainly Windows strangeness at its best. But unless I can find another setting to prevent this from happening, it might undermine the feasibility of using the 296.10 driver after all.
I had forbidden it in the general power settings, and was unable to find another such setting under the _screensaver_ itself. Is anybody out there aware, /where else/ I'd need to set this?
Dirk
P.S. Actually, there is some discussion on how /really/ to prevent the monitor from going to sleep, elsewhere on the Web (using Group Policies)... This means that further trials are in order on my box... |
|
|
|
In the new experiment: did the WUs crash again, after the screen went blank?
And regarding desktop icons: there are tools to do this, see here and possibly in the comments.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
In the new experiment: did the WUs crash again, after the screen went blank?
Perhaps my writing style was not to-the-point enough. No, in my new experiment none of the work units crashed again, even though the monitor had gone to sleep for a few seconds. I find this to explain why, in my first trial with the 296.10 driver, the first work unit started and ran to completion fine, although subsequent work units had crashed. My screen-saver had been running at the time BOINC became active, but shut the monitor off an hour later...
Also, I run the "3D Text" screensaver that ships with Windows, because it has minimal GPU usage. That way, if I interrupt the screensaver and go back into my desktop, usually this ensures that a small reserve of GPU cores is available, because BOINC was not using them, for example available to do some desktop compositing, in place of the screensaver running. This assumes though, that the graphics driver interface itself is available...
Strangely, with my setup, a sleeping monitor simply means the client cannot accept new jobs. It does not seem to kill already-running jobs.
But I would not count on that working! For example, it can also happen that the suspend/resume behavior fails to kill the work unit one time, but that it succeeds the next time.
I suppose that the perfect solution for BOINC usage, would in fact be to downgrade now. But I don't see BOINC as so central in the use of my computer. I like the idea that my drivers are fully up-to-date, because I'm also interested in Game Design, in the latest "PhysX" capabilities and so forth...
And so a risk presents itself to the BOINC programmers, in that their insistence on earlier driver versions might leave them behind at some point in the future. Unless that is, they can convince nVidia to change the behavior of the GC in the case of the monitor sleeping.
FWIW, before Friday it had never happened to me that a graphics card had crashed, because I had never had a GC so packed with its own computing power. So I felt that in any case this was worth reporting.
And regarding desktop icons: there are tools to do this, see here and possibly in the comments.
MrS
Thank you for the tip. But because at times I'm stubborn as a mule (and distrustful of certain software tools), I'm more likely to use the screen-shot I took, even if doing so is not as convenient as some tools out there.
BTW, my 3 other computers are Linux-based, for which platform I do not expect to find 3rd-party tools like that. But then again under Linux, no such (icons) problem has ever cropped up before.
Cheers,
Dirk |
|
|