Alez
Joined: 17 Nov 12 | Posts: 10 | Credit: 185,958,753 | RAC: 0
My GTX 660 Ti and GTX 650 just suddenly started erroring out on every task, all with the same error as far as I can tell.
Name: 2x11_8-NOELIA_hfXA_long-0-2-RND7200_1
Workunit: 3977330
Created: 1 Jan 2013, 5:44:59 UTC
Sent: 1 Jan 2013, 10:14:59 UTC
Received: 1 Jan 2013, 10:23:43 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 1 (0x1)
Computer ID: 138949
<core_client_version>7.0.33</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
]]>
Everything was working fine until last night. NVIDIA driver 306.97.
Any ideas what's wrong?
Reply:
Quote from Alez: "My GTX 660 Ti and GTX 650 just suddenly started erroring out on every task... Everything was working fine until last night. NVIDIA driver 306.97. Any ideas what's wrong?"
Sometimes the card (driver, or the OS) gets stuck, and only a restart can resolve it.
Have you tried a system restart?
Alez
Joined: 17 Nov 12 | Posts: 10 | Credit: 185,958,753 | RAC: 0
Just reset GPUGrid and am about to restart the system. I was wondering if there was a known error, as I've already trashed 32 units and didn't want to keep trashing more.
skgiven (volunteer moderator, volunteer tester)
Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
I appear to have had a similar problem. It started last night, just after midnight CET.
http://www.gpugrid.net/results.php?hostid=139265&offset=0&show_names=0&state=5&appid=
Long tasks just started failing, one after the other. Most failed after ~200 sec. They might have been failing only on my GTX660Ti and not my GTX470s; a task was running on one of the 470s. After I restarted, the same task started to run on my GTX660Ti and now seems to be progressing normally...
GPUGrid stopped sending me work, so I will have to run some jobs from other projects and wait for my rating to improve before getting new tasks (only ~4 h if the one task I have completes and reports successfully).
As well as the possibility that this was caused by bad tasks, it could have been caused by a CPU BOINC project, by BOINC itself, or be down to the driver (306.97 in my case). W7 x64.
Of the failed WUs, two tasks also failed on other systems:
http://www.gpugrid.net/workunit.php?wuid=3977079
http://www.gpugrid.net/workunit.php?wuid=3977023
However, some resends ran successfully, suggesting it's not an issue with GPUGrid itself.
____________
FAQs
HOW TO:
- Opt out of Beta Tests
- Ask for Help
Alez
Joined: 17 Nov 12 | Posts: 10 | Credit: 185,958,753 | RAC: 0
Reset the project, did a clean NVIDIA driver update to 310.70, and rebooted. So far I've got one task, and it seems to be running to completion. 15% more to go and we will see...
skgiven (volunteer moderator, volunteer tester)
Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
The task I had failed!
http://www.gpugrid.net/result.php?resultid=6235285
Name: 2x12_4-NOELIA_hfXA_long-0-2-RND6878_0
Workunit: 3977346
Created: 19 Dec 2012, 20:37:02 UTC
Sent: 1 Jan 2013, 5:58:06 UTC
Received: 1 Jan 2013, 17:01:28 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 98 (0x62)
Computer ID: 139265
Report deadline: 6 Jan 2013, 5:58:06 UTC
Run time: 35,731.98
CPU time: 30,914.86
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v6.16 (cuda42)
ERROR: file deven.cpp line 1106: # Energies have become nan
Perhaps it was one of the earlier tasks that failed on completion?
It wasn't resent.
Reply:
Quote from skgiven: "The task I had failed! ... ERROR: file deven.cpp line 1106: # Energies have become nan ... Perhaps it was one of the earlier tasks that failed on completion? It wasn't resent."
Since then it has been resent to another host, so we will see.
We have 17,880 unsent workunits (and as few as 2,174 in progress) at the moment, so a resend takes more time than usual.
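For context, the "Energies have become nan" abort is the application's own sanity check firing when the simulation has become numerically unstable (commonly triggered by hardware trouble such as overheating or an unstable overclock, rather than by the science itself). A minimal sketch of what such a guard looks like, with hypothetical names; this is not the actual ACEMD/GPUGRID code:

```python
import math

def check_energies(energies):
    """Raise if any energy term is NaN or infinite.

    `energies` maps a term name to its value for one simulation step.
    Illustrates the kind of check behind the "Energies have become nan"
    error; the real application aborts the task (exit status 98) instead.
    """
    for name, value in energies.items():
        if math.isnan(value) or math.isinf(value):
            raise RuntimeError(
                f"ERROR: # Energies have become nan ({name}={value})")

# A healthy step passes silently; a corrupted one aborts.
check_energies({"bond": 120.5, "angle": 310.2, "potential": -55432.1})
try:
    check_energies({"bond": float("nan"), "potential": -55432.1})
except RuntimeError as e:
    print("caught:", e)
```

Once a value goes NaN it contaminates every later step, which is why the application stops immediately rather than finishing and failing validation.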
skgiven (volunteer moderator, volunteer tester)
Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
I have identified the root of the problem I was encountering: the GTX660Ti's fan was stuck at 40%. I had it on a profile so the fan speed would increase with temperature, but after updating MSI Afterburner a couple of days back the profile was no longer applied to the GTX660Ti; it was only applied to the GTX470.
That's what I get for 'upgrading' software without any real need.
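A stuck fan like this can be caught early by polling the driver: `nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv,noheader` prints one line per GPU such as `40 %, 62`. A small parser plus an illustrative heuristic for spotting a fan profile that is not being applied (the 40%/80 °C thresholds are my assumption, not a documented rule):

```python
def parse_fan_temp(csv_line):
    """Parse one GPU line of `nvidia-smi --query-gpu=fan.speed,temperature.gpu
    --format=csv,noheader`, e.g. "40 %, 62" -> (40, 62)."""
    fan_field, temp_field = [f.strip() for f in csv_line.split(",")]
    fan_pct = int(fan_field.rstrip("%").strip())
    temp_c = int(temp_field)
    return fan_pct, temp_c

def fan_looks_stuck(fan_pct, temp_c, max_safe_temp=80):
    # Heuristic: a hot card whose fan is still loafing suggests the
    # fan profile is not being applied (thresholds are illustrative).
    return temp_c >= max_safe_temp and fan_pct <= 40

print(parse_fan_temp("40 %, 62"))  # -> (40, 62)
print(fan_looks_stuck(40, 85))     # -> True: hot card, idle fan
```

Running a check like this on a schedule would have flagged the GTX660Ti's stuck fan before tasks started erroring out.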
skgiven (volunteer moderator, volunteer tester)
Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
I've had another error on that system (GTX660Ti now at 62°C):
6286250 4012377 2 Jan 2013 | 13:14:43 UTC 2 Jan 2013 | 20:26:00 UTC Error while computing 18,859.67 1,537.28 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
Stderr output
<core_client_version>7.0.42</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan
called boinc_finish
</stderr_txt>
]]>
It also failed on another system using the 3.1 app.
6285719 79738 2 Jan 2013 | 8:55:50 UTC 2 Jan 2013 | 10:36:53 UTC Error while computing 9.51 0.05 --- Long runs (8-12 hours on fastest card) v6.16 (cuda31)
6286250 139265 2 Jan 2013 | 13:14:43 UTC 2 Jan 2013 | 20:26:00 UTC Error while computing 18,859.67 1,537.28 --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
6287663 142106 2 Jan 2013 | 23:35:09 UTC 7 Jan 2013 | 23:35:09 UTC In progress --- --- --- Long runs (8-12 hours on fastest card) v6.16 (cuda42)
I went through earlier WU failures, and while most WUs eventually succeeded, most of the resends failed on at least one other system, some failing numerous times. The issue seems to be the same for Long and Short WUs:
http://www.gpugrid.net/results.php?hostid=139265&offset=0&show_names=0&state=5&appid=
While the errors are mostly early in the runs, some occur late into the run. It's also an issue for both apps (3.1 and 4.2), and there seem to be quite a few 'error while downloading' failures.
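Incidentally, the `MDIO: cannot open file "restart.coor"` line in that stderr output is normally harmless: it just means no checkpoint exists yet, so the task starts from the beginning rather than resuming. The usual checkpoint-or-fresh-start pattern, sketched with a hypothetical file name and format (ACEMD's actual `restart.coor` is a binary coordinate file, not this):

```python
import os
import pickle

CHECKPOINT = "restart.state.pkl"  # hypothetical; stands in for "restart.coor"

def load_state():
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    try:
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Same situation as the "MDIO: cannot open file" message:
        # not an error, just the first run of this task on this host.
        return {"step": 0}

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

state = load_state()
state["step"] += 100  # simulate some work, then checkpoint it
save_state(state)
```

So that message alone doesn't explain the failures; it is the `Energies have become nan` line that follows which actually kills the task.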
skgiven (volunteer moderator, volunteer tester)
Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
These probably belong in the 'Energies have become nan' thread, but:
6294105 4017217 139265 4 Jan 2013 | 23:14:14 UTC 5 Jan 2013 | 12:31:59 UTC Error while computing 42,955.90 3,395.41 --- Long runs (8-12 hours on fastest card) v6.17 (cuda42)
6293240 112581 4 Jan 2013 | 18:40:13 UTC 4 Jan 2013 | 18:49:20 UTC Error while computing 2.16 2.09 --- Long runs (8-12 hours on fastest card) v6.17 (cuda42)
6294105 139265 4 Jan 2013 | 23:14:14 UTC 5 Jan 2013 | 12:31:59 UTC Error while computing 42,955.90 3,395.41 --- Long runs (8-12 hours on fastest card) v6.17 (cuda42)
6296657 --- --- --- Unsent --- --- ---