Message boards : Graphics cards (GPUs) : WUs fall after resume [NOELIA]
Author | Message |
---|---|
Hi, <core_client_version>7.0.29</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. SIGABRT: abort called Stack trace (15 frames): ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d] /lib64/libc.so.6(+0x38030)[0x7f4d28f25030] /lib64/libc.so.6(gsignal+0x35)[0x7f4d28f24fb5] /lib64/libc.so.6(abort+0x148)[0x7f4d28f26438] /lib64/libc.so.6(+0x30f92)[0x7f4d28f1df92] /lib64/libc.so.6(+0x31042)[0x7f4d28f1e042] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d28f116c5] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9] Exiting... </stderr_txt> ]]> my PCs: Error Hosts and Others, but this does not happen on all PCs. All operating systems (Gentoo Linux) are cloned. Thanks | |
ID: 27997 | Rating: 0 | |
I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job. | |
ID: 28008 | Rating: 0 | |
I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job. I don't want to quit the client, only suspend this project so that another one can be processed, and then shut down the PC. I suspend the project, not the WU(s). | |
ID: 28010 | Rating: 0 | |
Yes.. that's why I'm suggesting trying something else :) | |
ID: 28018 | Rating: 0 | |
Doh! <core_client_version>7.0.29</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # Using device 0 # There are 2 devices supporting CUDA # Device 0: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073283072 bytes # Number of multiprocessors: 7 # Number of cores: 56 # Device 1: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073414144 bytes # Number of multiprocessors: 7 # Number of cores: 56 MDIO: cannot open file "restart.coor" # Using device 0 # There are 2 devices supporting CUDA # Device 0: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073283072 bytes # Number of multiprocessors: 7 # Number of cores: 56 # Device 1: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073414144 bytes # Number of multiprocessors: 7 # Number of cores: 56 SWAN: FATAL : swanMemcpyDtoH failed acemd.linux64.2352: swanlib_nv.c:390: error: Assertion `0' failed. SIGABRT: abort called Stack trace (13 frames): ../../projects/www.gpugrid.net/acemd.linux64.2352(boinc_catch_signal+0x4d)[0x482bed] /lib64/libc.so.6(+0x38030)[0x7f9e9f122030] /lib64/libc.so.6(gsignal+0x35)[0x7f9e9f121fb5] /lib64/libc.so.6(abort+0x148)[0x7f9e9f123438] /lib64/libc.so.6(+0x30f92)[0x7f9e9f11af92] /lib64/libc.so.6(+0x31042)[0x7f9e9f11b042] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x491b33] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x474510] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x413c60] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x407cba] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x40857e] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9e9f10e6c5] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x407a19] Exiting... </stderr_txt> ]]> SWAN: FATAL : swanMemcpyDtoH failed ---- What is it? Over 6 hours lost. | |
ID: 28034 | Rating: 0 | |
Probably something to do with using the cuda3.1 app: | |
ID: 28038 | Rating: 0 | |
I tried this : emerge --info Portage 2.1.11.31 (default/linux/amd64/10.0, gcc-4.7.2, glibc-2.15-r3, 3.6.11-gentoo x86_64) ================================================================= System uname: Linux-3.6.11-gentoo-x86_64-Pentium-R-_Dual-Core_CPU_E6500_@_2.93GHz-with-gentoo-2.1 Timestamp of tree: Wed, 09 Jan 2013 18:45:01 +0000 ld GNU ld (GNU Binutils) 2.22 distcc 3.1 x86_64-pc-linux-gnu [enabled] app-shells/bash: 4.2_p37 dev-lang/python: 2.7.3-r2, 3.2.3 dev-util/pkgconfig: 0.27.1 sys-apps/baselayout: 2.1-r1 sys-apps/openrc: 0.11.8 sys-apps/sandbox: 2.5 sys-devel/autoconf: 2.68 sys-devel/automake: 1.11.6, 1.12.4 sys-devel/binutils: 2.22-r1 sys-devel/gcc: 4.5.4, 4.6.3, 4.7.2 sys-devel/gcc-config: 1.7.3 sys-devel/libtool: 2.4-r1 sys-devel/make: 3.82-r4 sys-kernel/linux-headers: 3.6 (virtual/os-headers) sys-libs/glibc: 2.15-r3 Repositories: gentoo science ACCEPT_KEYWORDS="amd64" ACCEPT_LICENSE="* -@EULA" CBUILD="x86_64-pc-linux-gnu" CFLAGS="-O2 -march=core2 -pipe" CHOST="x86_64-pc-linux-gnu" [...] NVIDIA GPU 0: GeForce GTX 660 (driver version unknown, CUDA version 5.0, compute capability 3.0, 133801984MB, 134215644MB available, 1982 GFLOPS peak) OpenCL: NVIDIA GPU 0: GeForce GTX 660 (driver version 310.19, device version OpenCL 1.1 CUDA, 2048MB, 134215644MB available) Nvidia-Drivers : 310.19 This happens for all my PCs and only for GPUGrid | |
ID: 28046 | Rating: 0 | |
Did you already try ending BOINC instead of suspending the project, prior to a restart? | |
ID: 28048 | Rating: 0 | |
Did you already try ending BOINC instead of suspending the project, prior to a restart? Yes, I did. If the WU keeps running, nothing happens: the WU is reported without errors. | |
ID: 28050 | Rating: 0 | |
Not sure if I understand you correctly. So it solves your problem? | |
ID: 28055 | Rating: 0 | |
Not sure if I understand you correctly. So it solves your problem? No. WU(s) fail if I resume them after rebooting the PC(s). As I wrote, I suspend the project via crontab + the BOINC command line (suspend). Then I turn off the PCs, which cannot stay switched on 24 hours a day. When I resume the project (and the WU(s)), the WU(s) fail. I don't know why. All the failures point to SWAN: SWAN: FATAL : swanMemcpyDtoH failed SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. and MDIO: cannot open file "restart.coor" ... restart.... resume ... | |
ID: 28057 | Rating: 0 | |
I solved it (maybe...) ldd acemd.2562.x64.cuda42 linux-vdso.so.1 (0x00007fff46dff000) libcufft.so.4 => /opt/cuda/lib64/libcufft.so.4 (0x00007fccc23a2000) libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fccc17a2000) libcudart.so.4 => /opt/cuda/lib64/libcudart.so.4 (0x00007fccc1543000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fccc133f000) libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007fccc1038000) libm.so.6 => /lib64/libm.so.6 (0x00007fccc0d41000) libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgcc_s.so.1 (0x00007fccc0b2b000) libc.so.6 => /lib64/libc.so.6 (0x00007fccc0783000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fccc0566000) libz.so.1 => /lib64/libz.so.1 (0x00007fccc0350000) librt.so.1 => /lib64/librt.so.1 (0x00007fccc0147000) /lib64/ld-linux-x86-64.so.2 (0x00007fccc43c6000) I tried to suspend/resume after a reboot and the WU(s) keep running now ... I will check more thoroughly tomorrow. :) [edit] Nothing ... After 10 minutes this WU (NOELIA) failed. Another WU (NATHAN) is still running ... I think I give up :( | |
ID: 28058 | Rating: 0 | |
You can ignore this error, MDIO: cannot open file "restart.coor" | |
ID: 28059 | Rating: 0 | |
You can ignore this error, MDIO: cannot open file "restart.coor" OK. Today I resumed the project and NATHAN's WU is still crunching without errors. As soon as possible I'll try to reset the project on all PCs and check the dynamic libraries for all GPUGrid apps. | |
ID: 28063 | Rating: 0 | |
So you suspend GPU-Grid via the cron-job, set the PC to standby, restart, resume GPU-Grid and occasionally (or mostly) get these errors (as you said before). I'm not an expert in this, but to me your error messages sound like "something was in a state it should not have been in - be it driver, CUDA, GPU memory or whatever". I think we can agree on this, right? | |
ID: 28065 | Rating: 0 | |
No "leave applications in memory..." is set with your cron-job It is a simple CLI command line: boinccmd --project http://www.gpugrid.net/ suspend (see $ boinccmd --help or "Command-line options" ---> http://boinc.berkeley.edu/wiki/Client_configuration). instead of suspending GPU-Grid with your cron-job just shut down the entire BOINC core client I can't suspend the entire client. I must suspend GPUGrid and all GPU projects for my own reasons. Crunching still goes on for a while with CPU only. then shutdown the PC (not standby or something like..) None of the PCs has a monitor, keyboard or mouse; they are controlled via ssh (Secure Shell). but the app is not expecting this, as it was only temporarily suspended I never had problems like this until a few months ago. I'll do tests. | |
ID: 28067 | Rating: 0 | |
Try turning LAIM ("leave applications in memory") off, and if that doesn't work shut down BOINC completely, as MrS said. | |
ID: 28068 | Rating: 0 | |
Try turning LAIM off, and if that doesn't work shut down Boinc completely, as MrS said. If I turn off the client while GPUGrid is running (not suspended), WU(s) fail. :D Only the GPUGrid app has this problem. Other GPU projects work fine (turn off/on, suspend, resume, etc.) ;| | |
ID: 28071 | Rating: 0 | |
I'm trying to update to nvidia-drivers-313.18: "Fixed a regression that could cause OpenGL applications to crash while compiling shaders." ...I hope this goes well. | |
ID: 28072 | Rating: 0 | |
If I try to turn off the client when GPUGrid is running (not suspended), WU(s) fail. Well.. this doesn't happen under Windows. Maybe suspend+resume just causes the same error to appear later? The new driver could help, but not the OpenGL fixes, as CUDA is something completely separate. MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 28075 | Rating: 0 | |
I think I partially solved it. | |
ID: 28102 | Rating: 0 | |
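
For anyone who wants to compare the two approaches discussed in this thread (suspending only the GPUGrid project from cron, versus asking the whole client to exit before power-off, as MrS suggests), here is a minimal sketch. It is not a tested fix for the crash. `boinccmd --project <URL> suspend` is the command quoted in the thread, and `boinccmd --quit` asks the client to exit. The function names, the `DRY_RUN` and `WAIT_SECONDS` variables and the 60-second default are assumptions made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names, DRY_RUN, WAIT_SECONDS and the 60 s default
# are illustrative assumptions, not part of BOINC or GPUGrid.
PROJECT_URL="http://www.gpugrid.net/"

# Run a command, or only print it when DRY_RUN=1 (handy before editing the crontab).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# What the original poster does today: suspend only this project.
suspend_gpugrid() {
    run boinccmd --project "$PROJECT_URL" suspend
}

# Counterpart used after the next boot.
resume_gpugrid() {
    run boinccmd --project "$PROJECT_URL" resume
}

# What MrS suggests trying instead: ask the whole client to exit,
# then wait a while so running tasks can stop before the power-off.
quit_client_before_poweroff() {
    run boinccmd --quit
    run sleep "${WAIT_SECONDS:-60}"
}
```

Calling the functions with `DRY_RUN=1` set only prints the commands, so the sequence can be checked over ssh before it goes into a cron job.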
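
The logs posted above mix one message that, according to the thread, can be ignored (`MDIO: cannot open file "restart.coor"`) with the `SWAN ... FATAL` lines that accompany the actual failures. The sketch below separates the two and also reproduces the arithmetic behind `code 193 (0xc1, -63)`: 193 is 0xc1 in hexadecimal and -63 when read as a signed 8-bit value (193 - 256). The helper names `triage_stderr` and `signed_status` are hypothetical, and the pattern is taken only from the logs quoted in this thread, not from any official list of error strings.

```shell
#!/bin/sh
# Sketch only: triage_stderr and signed_status are hypothetical helpers;
# the pattern comes from the logs quoted in this thread.

# Reads a task's stderr text on standard input and prints:
#   FATAL  if a "SWAN: FATAL" or "SWAN : FATAL" line is present,
#   OK     otherwise (a lone 'MDIO: cannot open file "restart.coor"' is ignored).
triage_stderr() {
    if grep -Eq 'SWAN ?: FATAL'; then
        echo FATAL
    else
        echo OK
    fi
}

# Convert an exit status (0-255) to its signed 8-bit reading,
# which matches the third number in "193 (0xc1, -63)".
signed_status() {
    if [ "$1" -gt 127 ]; then
        echo $(( $1 - 256 ))
    else
        echo "$1"
    fi
}
```

For example, piping a saved stderr file into `triage_stderr` shows whether it contains one of the fatal SWAN lines or only the harmless restart message.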