Message boards : Number crunching : some hosts won't get tasks
This one is a real head-scratcher for me. Ever since the new app was released, two of my hosts have not been able to receive tasks. They don't give any error, or any other obvious sign that anything is wrong; they just always get the "no tasks available" response.
ID: 57146
We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it other than that it happens.
ID: 57148
We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-DDoS system, but I don't think we know much about it other than that it happens.

That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address). In that case, a scheduler request would fail, but occasionally get through. I've worked around that problem for a long time and nothing has changed in that regard. These systems are spread across 3 physical locations, and one of them (the 8x 2070) is actually the only host at its IP, so it's not competing with any other system. So that's not the issue here. I have no problem making scheduler requests, and the client is always asking for work, but these two for some reason always get the response that no tasks are available. It seems unlikely that they would be THAT unlucky to never get a resend when 3 other systems are occasionally picking them up.
ID: 57149
That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present; the second problem is longer-term.
ID: 57153
That situation is different. What you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

Can you link to some additional information about this second case? I've never seen that discussed here, only the one I mentioned. But again, the server is responding, so it's not actually being blocked from communication; the server just always responds that there are no tasks, even when there probably are at times.
ID: 57154
It has been over a year since I last saw it mentioned. I searched my own posts, but unfortunately the search function does not work correctly.
ID: 57156
I found this interesting Message #54344 from Retvari Zoltan.
ID: 57157
Thanks for digging, but those are both describing different situations. In the case from Zoltan, the user was getting a message that tasks won't finish in time. I am not getting any such message, only that tasks are not available.
ID: 57158
I'm assuming GPUGrid is your only GPU project?
ID: 57159
GPUGRID is the only GPU project with a non-zero resource share.
ID: 57160
this "feels" like a similar issue being described here. maybe not exactly the same, but something similar at least. | |
ID: 57175
Richard, do you remember anything about this?

Yes, I remember it well. Message 150509 was one of my better bits of bug-hunting. But I also draw your attention to Message 150489: "All my machines have global_prefs_override.xml files, so are functioning normally in spite of the oddities."
ID: 57176
I have an override file as well. What's the significance of that? It has my local settings, but what's the significance to this issue?
ID: 57179
I have an override file as well. What's the significance of that? It has my local settings, but what's the significance to this issue?

An override file always takes precedence over any project preference file. It is completely local to the host.
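For reference, a minimal sketch of how to check that on a typical Linux install (the path assumes the stock boinc-client package and its default data directory; adjust for your setup):

# the local override file lives in the BOINC data directory
cat /var/lib/boinc-client/global_prefs_override.xml

# after editing it, tell the running client to re-read it without a restart
boinccmd --read_global_prefs_override

If that file exists, its settings win over anything set in the project's web preferences.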
ID: 57180
The significance lies in the fact that Einstein has re-written large parts of their server code in Drupal. In some respects, their re-write didn't exactly correspond with the original Berkeley PHP (or whatever it was) version of the code.
ID: 57181
I figured the exact Einstein issue was not causing any problem at GPUGRID, just that some aspects of that situation feel similar to what's happening now.
ID: 57182
...the other two hosts have not received anything since July 1st, always getting the "no tasks available" message...

July 1st is exactly the date when the new application version ACEMD 2.12 (cuda1121) was launched, and both your problematic hosts haven't received any task of this new version. Simply a coincidence? I think not.
ID: 57183
...the other two hosts have not received anything since July 1st, always getting the "no tasks available" message...

I agree with this, but so far I can find no difference between the setup of the two bad hosts that would prevent them from getting work. It's the same as on the hosts that are getting work.
ID: 57184
I was thinking of some subtle change in the requirements for task sending on the server side, more than on your hosts' side...
ID: 57185
I was thinking of some subtle change in the requirements for task sending on the server side, more than on your hosts' side...

Yeah, but if the hosts look the same from the outside, they should meet the same requirements. I think it's something not so obvious, where the server isn't telling me what the problem is.
ID: 57186
I suggest you force BOINC / GPUGrid to assign a new host ID for your non-working hosts.
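(For anyone wanting to try this, a rough outline of one way to do it on Linux; the data-directory path is an assumption based on the stock boinc-client package, and the <hostid>0</hostid> edit is the commonly advised approach rather than anything official, so back up client_state.xml first:)

sudo systemctl stop boinc-client
sudo cp /var/lib/boinc-client/client_state.xml /var/lib/boinc-client/client_state.xml.bak
# in the GPUGRID <project> block of client_state.xml, change the existing
#   <hostid>NNNNNN</hostid>   to   <hostid>0</hostid>
sudo nano /var/lib/boinc-client/client_state.xml
sudo systemctl start boinc-client

On the next scheduler contact the project should create a fresh host record, and the old host can later be merged into the new one on the website.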
ID: 57195
Might be a solution. Easy enough to do, and you can always merge the old host ID back into the new ID.
ID: 57196
I suggest you force BOINC / GPUGrid to assign a new host ID for your non-working hosts.

Could be a solution, but right now not much work is available anyway, so I will wait until work is plentiful again and reassess. If I'm still not getting work when there are thousands of tasks ready to send, then I'll do it. I'd really prefer not to, though.
ID: 57197
Last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. The same error seems to happen on others' hosts too, yet one has successfully completed it:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
10:36:11 (18462): acemd3 exited; CPU time 0.000578
10:36:11 (18462): app exit status: 0x7f
10:36:11 (18462): called boinc_finish(195)
</stderr_txt>
]]>

Perhaps some bugs waiting to be solved?
ID: 57198
Last ACEMD3 work unit seen was 27077654 (8th July 2021). It errored out. The same error seems to happen on others' hosts too, yet one has successfully completed it:

You need to install the Boost 1.74 package from your distribution or from a PPA. I have no idea what system you have since your computers are hidden, and the install process will vary from distribution to distribution; on Ubuntu there is a PPA for it. That will fix your error.
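As a concrete example, on a reasonably current Debian/Ubuntu release the stock package is usually enough (the package name follows Debian/Ubuntu naming and the acemd3 path is just illustrative; older releases may need a PPA or a manual Boost build):

# check which Boost libraries the science app actually wants
ldd /var/lib/boinc-client/projects/www.gpugrid.net/bin/acemd3 | grep boost

# install the matching runtime library
sudo apt install libboost-filesystem1.74.0

On RPM-based systems the equivalent library package (or a source build of Boost 1.74) is needed instead.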
ID: 57199
OK, thanks for the info. My computers run mostly CentOS 6/7, but there is one Linux Mint and one Win10 also.
ID: 57200
I think it's resolved now.
ID: 57267
Ian,

Tue 07 Sep 2021 09:03:21 BST | | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91, CUDA version 11.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Tue 07 Sep 2021 09:03:21 BST | | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91.03, device version OpenCL 1.2 CUDA, 5942MB, 3974MB available, 5153 GFLOPS peak)

- all of which seems to match your settings, but I've still never been sent a task beyond version 212. Any ideas?
ID: 57279
I have to assume that their CUDA version is "11.2.1", the .1 denoting the Update 1 release, based on the fact that their app plan class is cuda1121.
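If you want to see what your own host reports, two quick checks (the coproc_info.xml path assumes the stock Linux data directory):

# driver-reported CUDA version, shown in the top banner
nvidia-smi | head -n 4

# what the BOINC client detected and sends to the scheduler
grep -i cudaversion /var/lib/boinc-client/coproc_info.xml

A value of 11020 corresponds to CUDA 11.2, 11040 to CUDA 11.4, and so on.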
ID: 57282
OK, I'll see your 465 and raise you 470 (-:

Wed 08 Sep 2021 12:04:41 BST | | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57, CUDA version 11.4, compute capability 7.5, 4096MB, 3972MB available, 5530 GFLOPS peak)
Wed 08 Sep 2021 12:04:41 BST | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57.02, device version OpenCL 3.0 CUDA, 5942MB, 3972MB available, 5530 GFLOPS peak)

It sounds plausible: coproc_info had <cudaVersion>11020</cudaVersion>; it now has <cudaVersion>11040</cudaVersion>. No tasks on the first request, but as you say, they're as rare as hen's teeth. I'll leave it trying and see what happens.
ID: 57283
OK, so I've got a Cryptic_Scout task running with v217 and cuda1121.
ID: 57285
Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.
ID: 57286
Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

The nvidia-smi command will quickly confirm this.
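For example (option names as I remember them; check nvidia-smi --help if your driver version differs), either of these shows which GPU each compute process is on:

# one-shot process monitor; the first column is the GPU index
nvidia-smi pmon -c 1

# or just filter the default output for the science app
nvidia-smi | grep acemd3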
ID: 57287
Yup, so it has.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| 55%   87C    P2   126W / 125W |   1531MiB /  5941MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:05:00.0 Off |                  N/A |
| 31%   37C    P8    11W / 125W |      8MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                 89MiB |
|    0   N/A  N/A     49977      C   bin/acemd3                        302MiB |
|    0   N/A  N/A     50085      C   ...nux-gnu__GW-opencl-nvidia     1135MiB |
|    1   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

acemd3 running on GPU 0 is conclusive. And so to bed.
ID: 57288
After not crunching for several months, I started back again about a month ago. It took some time due to limited work units, but I received some GPUGrid WUs starting the first week of October; now I haven't received any since October 6th. I have tried snagging one when some are showing as available, and only receive the message "No tasks are available for New version of ACEMD" in the BOINC Manager event log. Any ideas what I may have changed or not set correctly? (I am receiving and crunching Einstein and Milkyway WUs. GPUGrid's resource share is set 15 times higher than Einstein's and 50 times higher than Milkyway's.)
ID: 57583
No tasks available. Your system looks fine to me.
ID: 57584