Message boards : Graphics cards (GPUs) : Error units - Noelia

Ebonydogx
Joined: 19 Jun 12
Posts: 11
Credit: 51,704,550
RAC: 0
Level
Thr
Message 26457 - Posted: 26 Jul 2012 | 2:14:23 UTC
Last modified: 26 Jul 2012 | 2:33:31 UTC

Noelia, welcome aboard. I've been looking for your WUs and finally got some. Unfortunately, I just had 11 Noelia WUs crash after about 13 seconds each with a computational error. Entries from my event log for one WU are below:

7/25/2012 9:03:49 PM | GPUGRID | Starting task run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2 using acemdlong version 616 (cuda42) in slot 6
7/25/2012 9:04:04 PM | GPUGRID | Computation for task run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2 finished
7/25/2012 9:04:04 PM | GPUGRID | Output file run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2_1 for task run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2 absent
7/25/2012 9:04:04 PM | GPUGRID | Output file run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2_2 for task run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2 absent
7/25/2012 9:04:04 PM | GPUGRID | Output file run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2_3 for task run2_replica29-NOELIA_sh2fragment_run-0-4-RND8749_2 absent

Wu's are of this variety:
run9_replica7-NOELIA_sh2fragment_run-0-4-RND8072_2
Workunit 3598139

Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan

called boinc_finish

</stderr_txt>
]]>


Thought you'd want to know - let me know if I should have forwarded other details.

Edit: Win7 64, 2x560ti, AMD FX 6100 6 core @ 3.3 GHZ, 850 watt psu
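
For anyone who wants to check their own failed results for the same signature, here is a minimal sketch in Python. It only assumes the stderr text looks like the dump above; the function name `summarize_stderr` and the sample string are illustrative and not part of any BOINC or GPUGRID tool.

```python
import re

def summarize_stderr(text):
    """Return (exit_code, energies_nan) parsed from a pasted stderr dump.

    exit_code is None when no "exit code N (0x..)" line is found.
    """
    match = re.search(r"exit code (-?\d+) \(0x[0-9a-fA-F]+\)", text)
    exit_code = int(match.group(1)) if match else None
    # The failing tasks in this thread all report this message.
    energies_nan = "Energies have become nan" in text
    return exit_code, energies_nan

# Sample copied from the stderr output quoted in this post.
sample = """<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan
called boinc_finish
</stderr_txt>"""

print(summarize_stderr(sample))
```

A result that returns exit code 98 together with the NaN flag matches the failures reported here; anything else is probably a different problem.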

neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 7,876,790,536
RAC: 0
Level
Tyr
Message 26458 - Posted: 26 Jul 2012 | 2:49:30 UTC

Hmm, I've experienced the same error with two back-to-back NOELIA WUs. My PC finished a PAOLA WU just before the NOELIA WUs with no error. For now, I'll suspend the GPUGRID project until something is posted about this...

[PUGLIA] kidkidkid3
Joined: 23 Feb 11
Posts: 96
Credit: 1,258,555,544
RAC: 2,745,270
Level
Met
Message 26462 - Posted: 26 Jul 2012 | 5:40:05 UTC - in response to Message 26458.

Hi Noelia,
same error (twice) also for me in
http://www.gpugrid.net/result.php?resultid=5664840
http://www.gpugrid.net/result.php?resultid=5664472

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
ERROR: file deven.cpp line 1106: # Energies have become nan
called boinc_finish
</stderr_txt>
]]>


I'll stop or cancel your WU until something is posted about this error.
k.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 26463 - Posted: 26 Jul 2012 | 6:10:54 UTC

Ditto.

All run2 WUs, if I'm not mistaken, and they are failing on every other host they were sent to.

Definitely a problem with these.

werdwerdus
Joined: 15 Apr 10
Posts: 123
Credit: 1,004,473,861
RAC: 0
Level
Met
Message 26464 - Posted: 26 Jul 2012 | 7:40:27 UTC

yep some errors, also gpu utilization is pretty low, currently at 79% on my GTX 470

task rundig8_run5-NOELIA_smd2-1-5-RND4856_0 using acemdlong version 616 (cuda42)


noelia
Joined: 5 Jul 12
Posts: 35
Credit: 393,375
RAC: 0
Message 26469 - Posted: 26 Jul 2012 | 15:34:01 UTC - in response to Message 26464.

Hi guys,

I apologize for the inconvenience. This is the first time I have run the system after doing the equilibration phase in acemdbeta (the first step mentioned in this other thread: http://www.gpugrid.org/forum_thread.php?id=3088 ), and it works quite differently from when we run it locally, which is why all the simulations were crashing within a few seconds. The procedure is now automated, so this should not be a problem in the future when running this way. Thank you for your time :)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 26472 - Posted: 26 Jul 2012 | 17:17:15 UTC - in response to Message 26469.

Shouldn't these workunits be processed by the 6.47 beta client?

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level
Phe
Message 26483 - Posted: 27 Jul 2012 | 5:12:26 UTC

Just had another one recently. It was sent about 5 or so hours ago. I hope these are out of the system now, because I've produced 53 errors on these tasks.

Just seems like a lot of wasted bandwidth on my end and on yours.

I understand things happen, but please take these out of the hopper if you have not done so already.

Cheers

Ebonydogx
Joined: 19 Jun 12
Posts: 11
Credit: 51,704,550
RAC: 0
Level
Thr
Message 26489 - Posted: 27 Jul 2012 | 18:32:04 UTC - in response to Message 26483.

Started processing one replacement Noelia wu. run10_replica37-NOELIA_sh2fragment_fixed-0-4-RND1582_0

It is only 10% complete after about 90 mins. At start-up the wu projected 9:36 to complete but is now on track for a bit over 14 hours, and prolly significantly more. http://www.gpugrid.net/workunit.php?wuid=3601656

This wu will never qualify for max bonus bc there just isn't enough time to process and return within 24 hours. Same problem that Nathan wus had back in Feb/March if I remember correctly.

I'll let this wu run another couple of hours, see how it is tracking, then update this post.

In the meanwhile, may I suggest you visit with Nathan on proc time as he has lived through this before & was able to adjust the wu's so proc returned to "8-12 hours on fastest cards." On my 560 ti's his wu's typically take about 8 hours to crunch and another 10-12 mins to upload.

Thank you!
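
As a rough cross-check of the projection above, here is a small Python sketch of a straight-line extrapolation from elapsed time and percent complete. It assumes progress is linear over the whole run, which may not hold for these tasks, and the function name `projected_total_hours` is just illustrative. The BOINC manager's own estimate can differ from this.

```python
def projected_total_hours(elapsed_minutes, percent_done):
    """Linearly extrapolate total run time (in hours) from progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total_minutes = elapsed_minutes * 100 / percent_done
    return total_minutes / 60

# Figures from this post: about 90 minutes elapsed at 10% complete.
print(f"{projected_total_hours(90, 10):.1f} hours")
```

With the numbers in this post the straight-line estimate comes out at about 15 hours, in the same range as the "bit over 14 hours" the client is showing.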

Ebonydogx
Joined: 19 Jun 12
Posts: 11
Credit: 51,704,550
RAC: 0
Level
Thr
Message 26490 - Posted: 27 Jul 2012 | 21:12:25 UTC - in response to Message 26489.
Last modified: 27 Jul 2012 | 21:13:19 UTC


I'll let this wu run another couple of hours, see how it is tracking, then update this post.


Edit: after 4.5 hours, still on track to finish in a bit over 14 hours

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 26491 - Posted: 28 Jul 2012 | 0:30:20 UTC - in response to Message 26490.

I'll let this wu run another couple of hours, see how it is tracking, then update this post.

Edit: after 4.5 hours, still on track to finish in a bit over 14 hours

After the release of the GTX 6xx series, I wouldn't consider a GTX 560 Ti one of "the fastest cards". Unlike the GTX 560 Ti 448, it has only 256 shaders usable by the GPUGrid client because it is a CC2.1 card (the Ti 448 'limited edition' is a CC2.0 card, so all of its shaders can be used by the GPUGrid client).
At the moment the fastest cards are:
GTX 690, 680, 670, 590, 580, 570, 480, 470, 560 Ti 448, 465

Ebonydogx
Joined: 19 Jun 12
Posts: 11
Credit: 51,704,550
RAC: 0
Level
Thr
Message 26493 - Posted: 28 Jul 2012 | 3:56:45 UTC - in response to Message 26491.

No argument on which cards are currently fastest out there. Nevertheless, the 560ti can comfortably finish all existing tasks in 8-12 hours with the exception of this latest group from Noelia.

That's all I meant.
