
Message boards : Graphics cards (GPUs) : WUs fall after resume [NOELIA]

Author Message
[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 27997 - Posted: 9 Jan 2013 | 22:32:35 UTC

Hi,
I turn off my PCs and suspend all WUs for GPUGrid (using a cron job).

When I resume the WUs, they fail after a while:

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
SIGABRT: abort called
Stack trace (15 frames):
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
/lib64/libc.so.6(+0x38030)[0x7f4d28f25030]
/lib64/libc.so.6(gsignal+0x35)[0x7f4d28f24fb5]
/lib64/libc.so.6(abort+0x148)[0x7f4d28f26438]
/lib64/libc.so.6(+0x30f92)[0x7f4d28f1df92]
/lib64/libc.so.6(+0x31042)[0x7f4d28f1e042]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d28f116c5]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]

Exiting...

</stderr_txt>
]]>


my PCs

Error Hosts and Others

But this does not happen on all PCs.

All operating systems (Gentoo Linux) are cloned.


Thanks

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28008 - Posted: 10 Jan 2013 | 21:56:30 UTC

I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job.

MrS
____________
Scanning for our furry friends since Jan 2002

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28010 - Posted: 10 Jan 2013 | 23:15:22 UTC - in response to Message 28008.
Last modified: 10 Jan 2013 | 23:16:23 UTC

I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job.

MrS

I don't want to quit the client, only suspend this project so that another one can be processed, and then shut down the PC.

I suspend the project, not the WU(s).

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28018 - Posted: 11 Jan 2013 | 22:50:22 UTC - in response to Message 28010.

Yes.. that's why I'm suggesting trying something else :)

MrS

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28034 - Posted: 14 Jan 2013 | 12:30:54 UTC
Last modified: 14 Jan 2013 | 12:38:47 UTC

Doh!

I turned on my PC and resumed a task... after 5 minutes the WU crashed.

Same problem:
http://www.gpugrid.net/result.php?resultid=6313064

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 7
# Number of cores: 56
MDIO: cannot open file "restart.coor"
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 7
# Number of cores: 56
SWAN: FATAL : swanMemcpyDtoH failed

acemd.linux64.2352: swanlib_nv.c:390: error: Assertion `0' failed.
SIGABRT: abort called
Stack trace (13 frames):
../../projects/www.gpugrid.net/acemd.linux64.2352(boinc_catch_signal+0x4d)[0x482bed]
/lib64/libc.so.6(+0x38030)[0x7f9e9f122030]
/lib64/libc.so.6(gsignal+0x35)[0x7f9e9f121fb5]
/lib64/libc.so.6(abort+0x148)[0x7f9e9f123438]
/lib64/libc.so.6(+0x30f92)[0x7f9e9f11af92]
/lib64/libc.so.6(+0x31042)[0x7f9e9f11b042]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x491b33]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x474510]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x413c60]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x407cba]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x40857e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9e9f10e6c5]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x407a19]

Exiting...

</stderr_txt>
]]>


SWAN: FATAL : swanMemcpyDtoH failed ---- what is it?

Over 6 hours lost.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 28038 - Posted: 14 Jan 2013 | 15:36:18 UTC - in response to Message 28034.
Last modified: 14 Jan 2013 | 15:37:05 UTC

Probably something to do with using the cuda3.1 app:

Sent: 9 Jan 2013 | 21:04:19 UTC
Reported: 14 Jan 2013 | 12:08:27 UTC
Status: Error while computing
Run time (sec): 62,671.43
CPU time (sec): 662.52
Credit: ---
Application: ACEMD2: GPU molecular dynamics v6.16 (cuda31)

Even so, it seems very long for a 'normal' length task; most of your NOELIA_hfXA tasks ran in 8.5K to 9K seconds (albeit on the 4.2 app). On the 3.1 app I would have expected it to take around twice that, but it ran for ~9 times as long. Perhaps there was a problem with the task, or it was a long WU that somehow ended up in the wrong queue?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28046 - Posted: 14 Jan 2013 | 19:41:30 UTC
Last modified: 14 Jan 2013 | 19:44:30 UTC

I tried this:

The following WUs failed when the system restarted:
http://www.gpugrid.net/result.php?resultid=6329901 (NATHAN)
http://www.gpugrid.net/result.php?resultid=6327036 (NOELIA)
http://www.gpugrid.net/result.php?resultid=6326898 (NOELIA)

Where is the problem?

I'm running Gentoo Linux:

emerge --info
Portage 2.1.11.31 (default/linux/amd64/10.0, gcc-4.7.2, glibc-2.15-r3, 3.6.11-gentoo x86_64)
=================================================================
System uname: Linux-3.6.11-gentoo-x86_64-Pentium-R-_Dual-Core_CPU_E6500_@_2.93GHz-with-gentoo-2.1
Timestamp of tree: Wed, 09 Jan 2013 18:45:01 +0000
ld GNU ld (GNU Binutils) 2.22
distcc 3.1 x86_64-pc-linux-gnu [enabled]
app-shells/bash: 4.2_p37
dev-lang/python: 2.7.3-r2, 3.2.3
dev-util/pkgconfig: 0.27.1
sys-apps/baselayout: 2.1-r1
sys-apps/openrc: 0.11.8
sys-apps/sandbox: 2.5
sys-devel/autoconf: 2.68
sys-devel/automake: 1.11.6, 1.12.4
sys-devel/binutils: 2.22-r1
sys-devel/gcc: 4.5.4, 4.6.3, 4.7.2
sys-devel/gcc-config: 1.7.3
sys-devel/libtool: 2.4-r1
sys-devel/make: 3.82-r4
sys-kernel/linux-headers: 3.6 (virtual/os-headers)
sys-libs/glibc: 2.15-r3
Repositories: gentoo science
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -march=core2 -pipe"
CHOST="x86_64-pc-linux-gnu"
[...]


NVIDIA GPU 0: GeForce GTX 660 (driver version unknown, CUDA version 5.0, compute capability 3.0, 133801984MB, 134215644MB available, 1982 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 660 (driver version 310.19, device version OpenCL 1.1 CUDA, 2048MB, 134215644MB available)

Nvidia-Drivers : 310.19


This happens on all my PCs, and only with GPUGrid.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28048 - Posted: 14 Jan 2013 | 22:47:46 UTC - in response to Message 28046.

Did you already try ending BOINC instead of suspending the project, prior to a restart?

MrS

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28050 - Posted: 15 Jan 2013 | 0:09:11 UTC - in response to Message 28048.

Did you already try ending BOINC instead of suspending the project, prior to a restart?

MrS

Yes, I did.

If the WU goes on... nothing happens. The WU is reported without errors.



ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28055 - Posted: 15 Jan 2013 | 21:42:56 UTC - in response to Message 28050.

Not sure if I understand you correctly. So it solves your problem?

MrS

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28057 - Posted: 15 Jan 2013 | 23:05:43 UTC - in response to Message 28055.

Not sure if I understand you correctly. So it solves your problem?

MrS


No.

The WU(s) fail if I resume them after rebooting the PC(s).

As I wrote, I suspend the project via crontab + the BOINC command line (suspend), then turn the PC off.

The PCs cannot stay switched on 24 hours a day. When I resume the project (and the WU(s)), the WU(s) fail.

I don't know why.


All the failed tasks point at SWAN:

SWAN: FATAL : swanMemcpyDtoH failed

SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.



and
MDIO: cannot open file "restart.coor"


... restart.... resume ...

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28058 - Posted: 16 Jan 2013 | 0:37:58 UTC
Last modified: 16 Jan 2013 | 0:45:51 UTC

I solved it (maybe...)

The CUDA libraries weren't linked properly for the acemd.2562.x64.cuda42 app.

I reinstalled cuda-toolkit-4.x.x

The libcufft.so.4 link was missing
The libcudart.so.4 link was missing

Now:

ldd acemd.2562.x64.cuda42
linux-vdso.so.1 (0x00007fff46dff000)
libcufft.so.4 => /opt/cuda/lib64/libcufft.so.4 (0x00007fccc23a2000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fccc17a2000)
libcudart.so.4 => /opt/cuda/lib64/libcudart.so.4 (0x00007fccc1543000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fccc133f000)
libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007fccc1038000)
libm.so.6 => /lib64/libm.so.6 (0x00007fccc0d41000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgcc_s.so.1 (0x00007fccc0b2b000)
libc.so.6 => /lib64/libc.so.6 (0x00007fccc0783000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fccc0566000)
libz.so.1 => /lib64/libz.so.1 (0x00007fccc0350000)
librt.so.1 => /lib64/librt.so.1 (0x00007fccc0147000)
/lib64/ld-linux-x86-64.so.2 (0x00007fccc43c6000)
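
A quick way to spot unresolved links like the two mentioned above is to filter the ldd output for "not found" entries. This is only a minimal sketch: the helper name missing_libs is made up for illustration, and it assumes the usual ldd line format "name => not found".

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as unresolved.
# Reads ldd output on stdin, so it can be tried without the real binary.
# missing_libs is a hypothetical helper name, not part of any tool.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Possible usage from the project directory (path is an assumption):
#   ldd ./acemd.2562.x64.cuda42 | missing_libs
```

An empty result only means the dynamic linker can resolve every library; it says nothing about whether the driver and toolkit versions actually match.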


I tried to suspend/resume after a reboot, and the WU(s) keep running now...

I will check more thoroughly tomorrow.

:)



[edit] No luck...
After 10 minutes this WU (NOELIA) failed.

Another WU (NATHAN) is still running...


I think I give up :(

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 28059 - Posted: 16 Jan 2013 | 1:02:03 UTC - in response to Message 28057.
Last modified: 16 Jan 2013 | 1:11:23 UTC

You can ignore this error, MDIO: cannot open file "restart.coor"
Perhaps this is another lib problem?

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28063 - Posted: 16 Jan 2013 | 12:04:01 UTC - in response to Message 28059.
Last modified: 16 Jan 2013 | 12:16:17 UTC

You can ignore this error, MDIO: cannot open file "restart.coor"
Perhaps this is another lib problem?



OK.
Today I resumed the project, and the NATHAN WU is still crunching without errors.

As soon as possible I'll reset the project on all PCs and check all dynamic libraries for all GPUGrid apps.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28065 - Posted: 16 Jan 2013 | 22:49:00 UTC

So you suspend GPU-Grid via the cron job, set the PC to standby, restart, resume GPU-Grid and occasionally (or mostly) get these errors (as you said before). I'm not an expert in this, but to me your error messages sound like "something was in a state it should not have been in - be it driver, CUDA, GPU memory or whatever". I think we can agree on this, right?

When you suspend GPU-Grid and have "leave applications in memory while suspended" active, BOINC and GPU-Grid think they can just continue exactly where they left off upon resuming. However, your PC went into standby in the meantime. The main memory contents should be preserved, but what about GPU memory, GPU caches and registers? I would not rule out the possibility that something goes wrong here. Some state is reset during standby, but the app is not expecting this, as it was only temporarily suspended.

So I'm asking you again to test the following: instead of suspending GPU-Grid with your cron-job just shut down the entire BOINC core client. Now upon resuming from standby BOINC and GPU-Grid know they're starting new from the last checkpoint, which should definitely work. Much better than giving up, isn't it?
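
The test suggested above could be scripted roughly as follows. This is only a sketch under assumptions: boinccmd --quit is the documented way to tell a running client to exit, but the process name boinc_client, the 60-second limit and both helper names are guesses for illustration, and whether this avoids the crash is exactly what the test is meant to find out.

```shell
#!/bin/sh
# Sketch: ask the BOINC core client to exit, then wait until it is gone
# before the machine is powered off.
# BOINCCMD can be overridden (e.g. with "true") for a dry run.
BOINCCMD="${BOINCCMD:-boinccmd}"

# wait_gone NAME TRIES: poll once per second until no process called NAME
# is running. Returns 0 once it is gone, 1 after TRIES attempts.
wait_gone() {
    name=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        pgrep -x "$name" >/dev/null 2>&1 || return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# quit_client [NAME]: NAME defaults to boinc_client, which is an
# assumption; the client's process name differs between installations.
quit_client() {
    $BOINCCMD --quit && wait_gone "${1:-boinc_client}" 60
}
```

Calling quit_client from the cron job right before the shutdown command would make sure the science app has stopped together with the client, so the next start resumes from the last checkpoint on disk.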

MrS

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28067 - Posted: 16 Jan 2013 | 23:50:30 UTC - in response to Message 28065.


When you suspend GPU-Grid and have "leave applications in memory while suspended" active

"Leave applications in memory..." is not set.


with your cron-job

It is a simple command line: boinccmd --project http://www.gpugrid.net/ suspend

see
$ boinccmd --help

or "Command-line options " ---> http://boinc.berkeley.edu/wiki/Client_configuration
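
For reference, the suspend/resume pair used by the cron job can be wrapped in a small function. This is a minimal sketch: the function name project_op and the BOINCCMD override are made up for illustration, while the --project URL suspend|resume form is the one quoted above.

```shell
#!/bin/sh
# Sketch: suspend or resume one BOINC project via boinccmd.
# BOINCCMD can be overridden (e.g. with "echo") to see what would be run.
BOINCCMD="${BOINCCMD:-boinccmd}"
PROJECT_URL="${PROJECT_URL:-http://www.gpugrid.net/}"

# project_op OP: OP must be "suspend" or "resume".
# project_op is a hypothetical helper name, not a BOINC command.
project_op() {
    case "$1" in
        suspend|resume)
            $BOINCCMD --project "$PROJECT_URL" "$1"
            ;;
        *)
            echo "usage: project_op suspend|resume" >&2
            return 2
            ;;
    esac
}
```

A cron entry would then call project_op suspend some time before the scheduled shutdown, and project_op resume after boot; the schedule itself is left out here because the thread does not give it.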




instead of suspending GPU-Grid with your cron-job just shut down the entire BOINC core client

I can't suspend the entire client. I must suspend GPUGrid and all GPU projects for my own reasons. Crunching still goes on for a while on the CPU only, then the PC shuts down (not standby or anything like it).
None of the PCs has a monitor, keyboard or mouse; they are all controlled via ssh (Secure Shell).



but the app is not expecting this, as it was only temporarily suspended


I never had problems like this until a few months ago.
I'll run some tests.



skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 28068 - Posted: 17 Jan 2013 | 9:52:13 UTC - in response to Message 28067.
Last modified: 17 Jan 2013 | 9:57:01 UTC

Try turning LAIM off, and if that doesn't work, shut down BOINC completely, as MrS said.

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28071 - Posted: 17 Jan 2013 | 12:00:22 UTC - in response to Message 28068.

Try turning LAIM off, and if that doesn't work shut down Boinc completely, as MrS said.


If I turn off the client while GPUGrid is running (not suspended), the WU(s) fail.

:D

Only the GPUGrid app has this problem. Other GPU projects work fine (turn off/on, suspend, resume, etc.).

;|

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28072 - Posted: 17 Jan 2013 | 12:25:15 UTC
Last modified: 17 Jan 2013 | 12:27:30 UTC

I'm trying an update to nvidia-drivers-313.18

http://www.nvidia.com/object/linux-display-amd64-313.18-driver.html

I see several bug fixes, for example:


Fixed a regression that could cause OpenGL applications to crash while compiling shaders.


...I hope this goes well.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 28075 - Posted: 17 Jan 2013 | 19:56:57 UTC - in response to Message 28071.

If I turn off the client while GPUGrid is running (not suspended), the WU(s) fail.

Well.. this doesn't happen under Windows. Maybe suspend+resume just causes the same error to appear later?

The new driver could help, but not the OpenGL fixes, as CUDA is something completely separate.

MrS

[VENETO] sabayonino
Joined: 4 Apr 10
Posts: 50
Credit: 645,641,596
RAC: 0
Message 28102 - Posted: 21 Jan 2013 | 20:28:36 UTC
Last modified: 21 Jan 2013 | 20:29:24 UTC

I think I have partially solved it.

nvidia-drivers was compiled without "acpid" support.

Fixing this solved the problem on two PCs. Now I can suspend and resume the project without errors.

The other 3 PCs still have the same problem.

I'm still trying to solve it...

Thanks
