Author |
Message |
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
|
If anybody wants to help debug a new application, please enable the above mentioned app. |
|
|
|
I don't see anything new on https://www.gpugrid.net/apps.php yet? |
|
|
Azmodes
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0
|
GPUGRID 10/26/2021 2:01:26 PM No tasks are available for Python apps for GPU hosts |
|
|
|
One system queued up and waiting.
____________
|
|
|
Azmodes
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0
|
Got a 2080 Ti and two 2070 Supers ready to roll. |
|
|
|
I don't see anything new on https://www.gpugrid.net/apps.php yet?
OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist.
____________
|
|
|
|
I don't see anything new on https://www.gpugrid.net/apps.php yet?
OT, but I've always found it weird that this link is not linked from anywhere on the main GPUGRID site. can only find it via a google search or previous bookmark. if you're just browsing through the GPUGRID site it doesn't exist.
Yes, there is the link, just go to the home page and click on "Join us" and on the page that opens in the "Configuring your participation" section in point 2 click on "apps" and you will find it. |
|
|
|
Thanks. You’re right it’s there.
But I’ll follow up that it’s a very odd place for it. Nearly all other BOINC projects put a link near/with credit statistics, directly on the main page, or as a link on the bottom of every page.
____________
|
|
|
Bill F
Joined: 21 Nov 16 Posts: 32 Credit: 146,419,535 RAC: 54,024
|
Well I am checked and enabled including "run Test Apps" we will see if I get a task assigned.
Thanks
Bill F
|
|
|
dthonon
Joined: 26 Aug 21 Posts: 1 Credit: 495,055,253 RAC: 398,988
|
This application is enabled in my preferences, and I accept test applications, but I am not getting any python task :
Wed 27 Oct 2021 15:51:04 | GPUGRID | Scheduler request completed: got 0 new tasks
Server status shows 10 tasks waiting to be sent. |
|
|
|
Why are you not running more test tasks for the new app? Almost all of the tasks ended on one of Ian’s hosts … or is that enough feedback for now?
Anyway, credit calculation looks almost random to me. At least for these tasks. Any chance you will fix that before this gets into production? (Comparison: 700 sec runtime awarded ~100k vs. an admittedly lower-end card with 110k sec runtime getting 565k credit. Seems out of proportion.) |
|
|
|
wow, I didnt even notice, I was out all day. I just set the system (7x GPU) to check for work every 100s or so and only checking for beta GPU work, so it doesn't surprise me that it got so many. it would ask for 7 at once, and I guess it got lucky that it asked for some work before anyone else.
beta tasks have always paid a lot of credit here for some reason.
but as with previous beta tasks, I see no indication that these tasks actually did anything on the GPU. my guess is that they ran some stuff on the CPU then finished. I've asked before what their intentions are with these tasks, and it's clear they are doing some type of machine learning kind of thing, but they don't appear to be even using the GPU at all, which is very strange when they are labelled as a cuda app.
____________
|
|
|
|
The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs. |
|
|
|
The apps finally appeared on the application page yesterday afternoon. So far, they are for Linux only (not mentioned in the original announcement), and with the same cuda101 / cuda 1121 variants as the current acemd runs.
cuda100* but yeah, looks to be the same app as listed in the Anaconda Python 3 category, same versioning.
____________
|
|
|
Azmodes
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0
|
So, uh, that was it? |
|
|
|
Not quite, but...
Got a new Python task. It failed:
14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
Not yet, but I'll install it before the replacement task I got on report has a chance to start.
We shouldn't need to do that.
(and now my second Linux machine has got one too) |
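For anyone who wants to check a host before the next task arrives, here is a minimal Python sketch that reports which command-line tools are missing from PATH. It only uses the standard library; the function name `find_missing_tools` is made up for this illustration and is not part of the GPUGRID app.

```python
import shutil

def find_missing_tools(tools):
    """Return the command names from `tools` that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# 'git' is the tool the failed task complained about.
missing = find_missing_tools(["git"])
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools found")
```

Note that the BOINC client may run under a different user with a different PATH, so a check from your own shell is only a hint.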
|
|
|
That looks better - I'd say the GPU is running:
But what's [ObstacleTower (as boinc)]? It's appeared on my task bar, and opens to a tiny, all black, window? |
|
|
|
Second machine has acquired an ObstacleTower, too.
Interesting snip from stderr in running (repeated many times):
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
(raylet) /var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet)   warnings.warn(
(raylet) Traceback (most recent call last):
(raylet)   File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet)     import ray.new_dashboard.utils as dashboard_utils
(raylet)   File "/var/lib/boinc-client/slots/5/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet)     import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
WARNING:gym_unity:New seed 57 will apply on next reset.
WARNING:gym_unity:New starting floor 0 will apply on next reset. |
|
|
Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
|
This is being solved server-side, no need to install software of course.
Not quite, but...
Got a new Python task. It failed:
14:05:29 (821885): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-i6sgww_u
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
Not yet, but I'll install it before the replacement task I got on report has a chance to start.
We shouldn't need to do that.
(and now my second Linux machine has got one too)
|
|
|
abouh (Project administrator, Project developer, Project tester, Project scientist)
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env
We use it as a testbench within our efforts to train populations of interacting artificially intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, as do the deep learning models learning how to solve the simulation.
Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.
____________
|
|
|
|
The Obstacle Tower environment is a simulated environment for machine learning (Reinforcement Learning) research. Note that in order to research how to train and deploy embodied agents in the real world, it is common to work in 3D world simulations like this one. This is the GitHub page of the project: https://github.com/Unity-Technologies/obstacle-tower-env
We use it as a testbench within our efforts to train populations of interacting artificially intelligent agents able to develop complex behaviours and solve complex tasks. The environment runs on the GPU, as do the deep learning models learning how to solve the simulation.
Most of the bugs we are trying to solve are related to the environment. It is installed via git, but the git-related issues are being solved on the server side, as mentioned. The reported stderr message "ModuleNotFoundError: No module named 'aiohttp.signals'" should be solved now. The small black screen is also related to the environment.
do you have any plans to utilize the Tensor cores present on many newer Nvidia GPUs? these are designed for machine learning tasks.
____________
|
|
|
|
Thanks for the feedback - on that basis, I'll keep pushing them through.
Had an odd finish:
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/6/model.state_dict.3073'
(raylet) /var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
(raylet)   warnings.warn(
(raylet) Traceback (most recent call last):
(raylet)   File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/agent.py", line 22, in <module>
(raylet)     import ray.new_dashboard.utils as dashboard_utils
(raylet)   File "/var/lib/boinc-client/slots/6/gpugridpy/lib/python3.8/site-packages/ray/new_dashboard/utils.py", line 20, in <module>
(raylet)     import aiohttp.signals
(raylet) ModuleNotFoundError: No module named 'aiohttp.signals'
INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)
"Environment shut down with return code 0" sounds like a happy ending, but "called boinc_finish(195)" is 'Child failed'. |
|
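The mismatch between "return code 0" and boinc_finish(195) is easier to spot when the final exit code is pulled out of stderr automatically. The following is a rough Python sketch, not a BOINC tool: the regular expression and the small lookup table are assumptions built only from the log lines and exit code names quoted in this thread.

```python
import re

# Only the codes that appear in this thread; BOINC defines many more.
KNOWN_EXIT_CODES = {0: "success", 195: "EXIT_CHILD_FAILED"}

FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")

def boinc_finish_code(stderr_text):
    """Return the last boinc_finish() code found in the text, or None."""
    matches = FINISH_RE.findall(stderr_text)
    return int(matches[-1]) if matches else None

sample = """\
INFO:mlagents_envs.environment:Environment shut down with return code 0.
15:21:11 (827067): ./gpugridpy/bin/python exited; CPU time 1598.264794
15:21:11 (827067): app exit status: 0x1
15:21:11 (827067): called boinc_finish(195)
"""
code = boinc_finish_code(sample)
print(code, hex(code), KNOWN_EXIT_CODES.get(code, "unknown"))
```

The wrapper's code is the one that decides how the task is reported, so it is the line to look at, whatever the Unity environment said when it shut down.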
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
Tried a LOT of the PythonGPU tasks today. Still no joy for a successful run.
Think they are getting further along though since I think I see progress in how far they get before the environment collapses and errors out. |
|
|
|
The next round of testing has started.
e1a10-ABOU_PPOObstacle6-0-1-RND2533_0 - I was going to say 'is running', but it's crashed already. After only 20 seconds, I got an apparently normal finish, followed by
upload failure: <file_xfer_error>
<file_name>e1a10-ABOU_PPOObstacle6-0-1-RND2533_0_0</file_name>
<error_code>-131 (file size too big)</error_code> |
|
|
|
Got another from what looks like the same batch. Limit is
<max_nbytes>100000000.000000</max_nbytes>
I'll catch the output and see how big it is.
Edit - couldn't catch it ('report immediately' operated too fast). But I watched the next one in the slot directory: the output file was created right at the end, but was cleaned up almost immediately. I read it as 169 MB, but can't be certain. |
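As a quick sanity check on the numbers above, this Python sketch reads the limit out of a `<max_nbytes>` element and compares it with an observed file size. The helper names are hypothetical, and reading "169 MB" as decimal megabytes is an assumption.

```python
import re

def parse_max_nbytes(xml_fragment):
    """Extract the <max_nbytes> value (in bytes) from a template fragment."""
    m = re.search(r"<max_nbytes>\s*([0-9.]+)\s*</max_nbytes>", xml_fragment)
    if m is None:
        raise ValueError("no <max_nbytes> element found")
    return int(float(m.group(1)))

def exceeds_limit(size_bytes, limit_bytes):
    """True if an upload of size_bytes would be rejected as too big."""
    return size_bytes > limit_bytes

limit = parse_max_nbytes("<max_nbytes>100000000.000000</max_nbytes>")
observed = 169 * 10**6  # assumption: "169 MB" read as decimal megabytes
print(limit, exceeds_limit(observed, limit))
```

Under either reading of "MB", a file of that size is well over the 100,000,000 byte limit, which fits the -131 (file size too big) error.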
|
|
abouh (Project administrator, Project developer, Project tester, Project scientist)
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
Yes the file should be 170M approx. |
|
|
|
|
|
|
Well, I got one for you to study:
e1a8-ABOU_PPOObstacle7-0-1-RND2466_3
That was done by manually increasing the maximum allowed size in BOINC. I think that's an internal setting in the BOINC system - specifically, the workunit generator or its template files - rather than the Python package.
I've suspended work fetch for now - please let us know when the next iteration is ready to test.
Edit - this is what the upload file contained:
It seems a bit odd to return the ObstacleTower zip back to you unchanged? |
|
|
abouh (Project administrator, Project developer, Project tester, Project scientist)
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
The git-related errors should be solved now.
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
We will study the errors related to downloading the Obstacle Tower environment. Thank you for the feedback.
____________
|
|
|
Azmodes
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0
|
Got one that ended in 195 (0xc3) EXIT_CHILD_FAILED after 15 minutes:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3
Please update conda by running
$ conda update -n base -c defaults conda
13:14:06 (11501): /usr/bin/flock exited; CPU time 470.306190
13:14:06 (11501): wrapper: running ./gpugridpy/bin/python (run.py)
path: ['/var/lib/boinc-client/slots/34', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git/ext/gitdb', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python38.zip', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/lib-dynload', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages', '/var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/gitdb/ext/smmap']
git path: /var/lib/boinc-client/slots/34/gpugridpy/lib/python3.8/site-packages/git
Traceback (most recent call last):
File "run.py", line 340, in <module>
main()
File "run.py", line 53, in main
print("GPU available: {}".format(torch.cuda.is_available()))
NameError: name 'torch' is not defined
13:14:10 (11501): ./gpugridpy/bin/python exited; CPU time 1.602758
13:14:10 (11501): app exit status: 0x1
13:14:10 (11501): called boinc_finish(195)
</stderr_txt>
]]> |
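The NameError above means `torch` was never bound as a name by the time line 53 of run.py ran. How run.py actually imports it is not shown in this thread, so the following is only a sketch of one defensive pattern: wrap the import in a function so that a missing PyTorch yields `False` instead of a crash. The function name `cuda_available` is made up for this example.

```python
import importlib

def cuda_available():
    """Return True only if PyTorch imports and reports a usable CUDA device."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return False
    return bool(torch.cuda.is_available())

print("GPU available: {}".format(cuda_available()))
```

For a real job it is probably better to fail loudly when PyTorch is missing, since training cannot proceed without it; the sketch only shows how to avoid the unbound name.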
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
Got five PythonGPU tasks to finish and report after about ten minutes that were valid. |
|
|
|
My machine is a dual boot machine (Win10/Ubuntu 20.04). Are there plans for a Windows app for these tasks or should I boot into Linux to get some of these tasks? |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
Haven't heard of any posts by admin types that Windows apps will be made.
That stated, often the new beta apps are tested first on Linux to get the bugs out and then the Windows apps are generated. |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
This task looks to have run through all of its parameter set to complete normally at around 3000 seconds and was validated for ~ 200K credits.
https://www.gpugrid.net/result.php?resultid=32660133 |
|
|
PDW
Joined: 7 Mar 14 Posts: 15 Credit: 5,626,524,525 RAC: 7,675,017
|
Did you notice if it used the GPU and if it did what percentage ?
I had one that ran for about 3 hours before failing, never saw the fans running during that time. |
|
|
|
just ran this one on my RTX 3080Ti: https://www.gpugrid.net/result.php?resultid=32660184
16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper (7.7.26016): starting
16:19:48 (1841951): wrapper: running /usr/bin/flock (/home/ian/BOINC/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /home/ian/BOINC/projects/www.gpugrid.net/miniconda &&
/home/ian/BOINC/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 0%| | 0/35 [00:00<?, ?it/s]
Extracting : tk-8.6.8-hbc83047_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : urllib3-1.25.8-py37_0.conda: 3%|2 | 1/35 [00:00<00:11, 3.04it/s]
Extracting : libedit-3.1.20181209-hc058e9b_0.conda: 6%|5 | 2/35 [00:00<00:10, 3.04it/s]
Extracting : libgcc-ng-9.1.0-hdf63c60_0.conda: 9%|8 | 3/35 [00:00<00:10, 3.04it/s]
Extracting : ld_impl_linux-64-2.33.1-h53a641e_7.conda: 11%|#1 | 4/35 [00:00<00:10, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 14%|#4 | 5/35 [00:00<00:09, 3.04it/s]
Extracting : python-3.7.7-hcff3b4d_5.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : tqdm-4.46.0-py_0.conda: 17%|#7 | 6/35 [00:00<00:06, 4.16it/s]
Extracting : ca-certificates-2020.1.1-0.conda: 20%|## | 7/35 [00:00<00:06, 4.16it/s]
Extracting : wheel-0.34.2-py37_0.conda: 23%|##2 | 8/35 [00:00<00:06, 4.16it/s]
Extracting : libstdcxx-ng-9.1.0-hdf63c60_0.conda: 26%|##5 | 9/35 [00:00<00:06, 4.16it/s]
Extracting : certifi-2020.4.5.1-py37_0.conda: 29%|##8 | 10/35 [00:00<00:06, 4.16it/s]
Extracting : readline-8.0-h7b6447c_0.conda: 31%|###1 | 11/35 [00:00<00:05, 4.16it/s]
Extracting : ncurses-6.2-he6710b0_1.conda: 34%|###4 | 12/35 [00:00<00:05, 4.16it/s]
Extracting : conda-package-handling-1.6.1-py37h7b6447c_0.conda: 37%|###7 | 13/35 [00:00<00:05, 4.16it/s]
Extracting : chardet-3.0.4-py37_1003.conda: 40%|#### | 14/35 [00:00<00:05, 4.16it/s]
Extracting : zlib-1.2.11-h7b6447c_3.conda: 43%|####2 | 15/35 [00:00<00:04, 4.16it/s]
Extracting : six-1.14.0-py37_0.conda: 46%|####5 | 16/35 [00:00<00:04, 4.16it/s]
Extracting : pycparser-2.20-py_0.conda: 49%|####8 | 17/35 [00:00<00:04, 4.16it/s]
Extracting : libffi-3.3-he6710b0_1.conda: 51%|#####1 | 18/35 [00:00<00:04, 4.16it/s]
Extracting : pycosat-0.6.3-py37h7b6447c_0.conda: 54%|#####4 | 19/35 [00:00<00:03, 4.16it/s]
Extracting : cffi-1.14.0-py37he30daa8_1.conda: 57%|#####7 | 20/35 [00:00<00:03, 4.16it/s]
Extracting : _libgcc_mutex-0.1-main.conda: 60%|###### | 21/35 [00:00<00:03, 4.16it/s]
Extracting : pyopenssl-19.1.0-py37_0.conda: 63%|######2 | 22/35 [00:00<00:03, 4.16it/s]
Extracting : idna-2.9-py_1.conda: 66%|######5 | 23/35 [00:00<00:02, 4.16it/s]
Extracting : pysocks-1.7.1-py37_0.conda: 69%|######8 | 24/35 [00:00<00:02, 4.16it/s]
Extracting : xz-5.2.5-h7b6447c_0.conda: 71%|#######1 | 25/35 [00:00<00:02, 4.16it/s]
Extracting : setuptools-46.4.0-py37_0.conda: 74%|#######4 | 26/35 [00:00<00:02, 4.16it/s]
Extracting : ruamel_yaml-0.15.87-py37h7b6447c_0.conda: 77%|#######7 | 27/35 [00:00<00:01, 4.16it/s]
Extracting : cryptography-2.9.2-py37h1ba5d50_0.conda: 80%|######## | 28/35 [00:00<00:01, 4.16it/s]
Extracting : openssl-1.1.1g-h7b6447c_0.conda: 83%|########2 | 29/35 [00:00<00:01, 4.16it/s]
Extracting : sqlite-3.31.1-h62c20be_1.conda: 86%|########5 | 30/35 [00:00<00:01, 4.16it/s]
Extracting : pip-20.0.2-py37_3.conda: 89%|########8 | 31/35 [00:00<00:00, 4.16it/s]
Extracting : yaml-0.1.7-had09818_2.conda: 91%|#########1| 32/35 [00:00<00:00, 4.16it/s]
Extracting : requests-2.23.0-py37_0.conda: 94%|#########4| 33/35 [00:00<00:00, 4.16it/s]
Extracting : conda-4.8.3-py37_0.tar.bz2: 97%|#########7| 34/35 [00:00<00:00, 4.16it/s]
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.3
Please update conda by running
$ conda update -n base -c defaults conda
16:21:21 (1841951): /usr/bin/flock exited; CPU time 61.036800
16:21:21 (1841951): wrapper: running ./gpugridpy/bin/python (run.py)
Running command git clone -q https://github.com/Unity-Technologies/obstacle-tower-env.git /tmp/pip-req-build-wwv7ghqo
/home/ian/BOINC/slots/15/gpugridpy/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
warnings.warn(
Downloading...
From: https://storage.googleapis.com/obstacle-tower-build/v4.1/obstacletower_v4.1_linux.zip
To: /home/ian/BOINC/slots/15/obstacletower_v4.1_linux.zip
0%| | 0.00/170M [00:00<?, ?B/s]
1%| | 2.10M/170M [00:00<00:08, 19.9MB/s]
6%|▌ | 10.5M/170M [00:00<00:02, 56.2MB/s]
11%|█▏ | 19.4M/170M [00:00<00:02, 70.8MB/s]
16%|█▋ | 27.8M/170M [00:00<00:02, 70.6MB/s]
22%|██▏ | 37.7M/170M [00:00<00:01, 76.7MB/s]
28%|██▊ | 47.7M/170M [00:00<00:01, 79.0MB/s]
34%|███▎ | 57.1M/170M [00:00<00:01, 82.8MB/s]
38%|███▊ | 65.5M/170M [00:00<00:01, 80.4MB/s]
43%|████▎ | 73.9M/170M [00:00<00:01, 81.2MB/s]
49%|████▊ | 82.8M/170M [00:01<00:01, 83.4MB/s]
54%|█████▎ | 91.2M/170M [00:01<00:00, 80.8MB/s]
59%|█████▉ | 101M/170M [00:01<00:00, 81.3MB/s]
65%|██████▍ | 110M/170M [00:01<00:00, 83.7MB/s]
70%|██████▉ | 119M/170M [00:01<00:00, 79.0MB/s]
75%|███████▍ | 127M/170M [00:01<00:00, 80.2MB/s]
80%|████████ | 137M/170M [00:01<00:00, 79.2MB/s]
85%|████████▌ | 145M/170M [00:01<00:00, 80.1MB/s]
90%|█████████ | 154M/170M [00:01<00:00, 79.1MB/s]
96%|█████████▌| 163M/170M [00:02<00:00, 82.6MB/s]
100%|██████████| 170M/170M [00:02<00:00, 78.6MB/s]
16:21:54 (1841951): ./gpugridpy/bin/python exited; CPU time 22.798227
16:21:59 (1841951): called boinc_finish(0)
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>e1a6-ABOU_PPOObstacle6-0-1-RND7771_2_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
ran for about 2 mins and errored out. file size too big? how big could the file get in 2 minutes? lol. looks like everyone in this WU chain is having the same issue though. https://www.gpugrid.net/workunit.php?wuid=27085637 Bad WU?
and I saw no evidence that it ever touched the GPU, refreshing nvidia-smi every 2 seconds showed no process running on the GPU. must still be using only the CPU.
Can an admin please directly comment if these are actually using the GPU or not? I know an admin mentioned that they were only doing CPU work "as a test". Is that still the case? Having GPU tasks that only use the CPU core is very confusing.
____________
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
The ones that have partially run and were validated only used 31% of the gpu in nvidia-smi.
The one task that appears to have successfully run through to normal completion was done while I was out of the house and did not see it run unfortunately.
Will have to wait for more to observe. |
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
Looks like the tasks fluctuate between a few seconds at 1% utilization before returning to hovering around 10-13% utilization. I was watching one on a 2070 and it was running for almost 60 minutes in nvidia-smi. They are marked as C+G type in that program.
I think I killed it when I pulled up htop to look at how much cpu it was using because it finished with an error instantly at the same time as htop populated the screen. |
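For logging utilization and memory while away from the machine, rather than watching the screen, a small Python sketch around nvidia-smi's CSV query mode could look like this. The function names are hypothetical, and the parser assumes plain integer fields; a GPU that reports `[N/A]` for a field would need extra handling.

```python
import shutil
import subprocess

# Ask nvidia-smi for GPU utilization (%) and used memory (MiB), one CSV line per GPU.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse 'util, mem' lines into (utilization %, memory MiB) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats

def read_gpu_stats():
    """Query nvidia-smi once; return [] if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)

# Parsing a sample line shaped like the values reported in this thread.
print(parse_gpu_stats("13, 290\n"))
```

Calling `read_gpu_stats()` in a loop with a short sleep and appending the result to a file would give a record of a whole task run.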
|
|
abouh (Project administrator, Project developer, Project tester, Project scientist)
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
The contents of the obstacletower.zip downloaded file are necessary to generate the data required for the machine learning agent to train. That is why the file itself is not modified. Only used to generate the training data.
The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.
Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.
____________
|
|
|
|
The expected behaviour is for the file to be downloaded, used during the job completion and then deleted. Should not be returned.
That makes much more sense. Standing by for the next round of debugging... :-) |
|
|
|
That's the next bad news for me as my GPU is maxed out at 6GB. Without upgrading my GPU and that's not likely gonna be soon, I suppose I have to give up on these types of tasks - at least for the time being. Thanks for the update though |
|
|
|
Some jobs have already finished successfully. Thank you for the feedback. Current jobs being tested should use around 30% GPU and around 8000MiB GPU memory.
why such low GPU utilization? and 8000? or do you mean 800? 8GB? or 800MB?
____________
|
|
|
|
I can only speculate about the former. But your latter question likely resolves to 8,000 MiB (mebibytes), which is just another convention for counting bytes, if he indeed meant to write 8,000.
While k (kilo), M (mega), G (giga) and T (tera) are the SI prefixes, computed in base 10 as 10^3, 10^6, 10^9 and 10^12 respectively, the binary prefixes Ki (kibi), Mi (mebi), Gi (gibi) and Ti (tebi) are computed in base 2 as 2^10, 2^20, 2^30 and 2^40. As such, M/Mi = 10^6/2^20 ≈ 95.37%, a difference of about 4.63% between the SI and binary prefixes.
1 kB = 1000 B
1 KiB = 1024 B |
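To put the 8,000 MiB figure in concrete terms, here is a short Python sketch of the conversion between the two conventions. The constant and function names are chosen just for this example.

```python
# Binary (IEC) and decimal (SI) prefixes, expressed in bytes.
MIB = 2**20   # 1 MiB = 1,048,576 B
GIB = 2**30
MB = 10**6
GB = 10**9

def mib_to_bytes(mib):
    """Convert a size in mebibytes to bytes."""
    return mib * MIB

reported = mib_to_bytes(8000)
print(reported)        # size in bytes
print(reported / GB)   # same size in decimal gigabytes
print(reported / GIB)  # same size in binary gibibytes
print(MB / MIB)        # ratio between a megabyte and a mebibyte
```

So 8,000 MiB is 8,388,608,000 bytes, which is about 8.39 GB in SI units or 7.8125 GiB in binary units.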
|
|
|
yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB
____________
|
|
|
|
ah, all right. didn't mean to offend you if that's what I did. still don't understand their beta testing procedure anyway. so far not many tasks have been run, only a few of them successfully, but meanwhile nearly no information has been shared, rendering the whole procedure rather opaque and leaving others in the dark wondering about their piles of unsuccessful tasks. and the little information that is indeed shared seems to conflict a lot with the user experience and observations. for a ML task 8 GB isn't atypical though |
|
|
|
I agree that lots of memory use wouldnt be atypical for AI/ML work. and also agree that the admins should be a little more transparent about what these tasks are doing and the expected behaviors. it seems so far they have tons and tons of errors, then the admins come back and say they fixed the errors, then just more errors again. I'd also like to know if these are using the Tensor cores on RTX GPUs.
____________
|
|
|
|
I think the Beta testing process is (as usual anywhere) very much an incremental process. It will have started with small test units, and as each little buglet surfaces and is solved, the process moves on to test a later segment that wasn't accessible until the previous problem had been overcome.
Thus - Abouh has confirmed that yesterday's upload file size problem was caused by including a source data file in the output - "Should not be returned".
I also noted that some of Keith's successful runs were resends of tasks which had failed on other machines - some of them generic problems which I would have expected to cause a failure on his machine too. So it seems that dynamic fixes may have been applied too. Normally, a new BOINC replication task is an exact copy of its predecessor, but I don't think that can be automatically assumed during this Beta phase.
In particular, Keith's observation that one test task only used 200 MB of GPU memory isn't necessarily a foolproof guide to the memory demand of later tests. |
|
|
|
which is why I asked for clarification in light of the disparity between expected and observed behaviors.
____________
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375
|
yeah I know the conversions and such. I'm just wondering if it's a typo, Keith ran some of these beta tasks successfully and did not report such high memory usage, he claimed it only used about 200MB
Yes, I have watched tasks complete fully to a proper boinc finish end and I never saw more than 290MB of gpu memory reported in nvidia-smi at a max 13% utilization.
Unless nvidia-smi has an issue in reporting gpu RAM used, the 8GB of memory post is out of line. Or the tasks the scientist-developer mentioned haven't been released to us out of the laboratory yet.
|
|
|
abouh (Project administrator, Project developer, Project tester, Project scientist)
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0
|
We are progressing in our debugging and have managed to solve several errors, but as mentioned in a previous post, it is an incremental process.
We are trying to train AI agents using reinforcement learning, which generally interleaves stages in which the agent collects data (a less GPU-intensive process) and stages in which the agent learns from that data. The nature of the problem, in which data is progressively generated, accounts for a lower GPU utilisation than in supervised machine learning, although we will work to progressively make it more efficient once debugging is completed.
Since the Obstacle Tower environment (https://github.com/Unity-Technologies/obstacle-tower-env), the source of data, also runs on the GPU, during the learning stage the neural network and the training data, together with the environment, occupy approximately 8,000 MiB (mebibytes; this was not a typo) of GPU memory when checked locally with nvidia-smi.
Basically, the python script has the following steps:
step 1: Defining the conda environment with all dependencies.
step 2: Downloading obstacletower.zip, a necessary file used to generate the data.
step 3: Initialising the data generator using the contents of obstacletower.zip.
step 4: Creating the AI agent and alternating data collection and data training stages.
step 5: Returning the trained AI agent, and not obstacletower.zip.
The GPU is used only after reaching steps 4 and 5. Some of the jobs that succeeded but barely used the GPU were there to test that the problems in steps 1 and 2 had indeed been solved (most of them solved by Keith Myers).
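The alternation described in step 4 can be illustrated with a toy Python sketch. This is not the project's run.py: the two-armed bandit, its win probabilities, the epsilon-greedy rule and all function names are stand-ins chosen to show the collect-then-learn rhythm without Unity or a GPU.

```python
import random

def collect(estimates, n_samples, rng, epsilon=0.2):
    """Data collection stage: act in a toy 2-armed bandit, record (action, reward)."""
    win_prob = [0.3, 0.7]  # hypothetical environment standing in for Obstacle Tower
    batch = []
    for _ in range(n_samples):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # explore at random
        else:
            action = max(range(2), key=lambda a: estimates[a])  # exploit best guess
        reward = 1.0 if rng.random() < win_prob[action] else 0.0
        batch.append((action, reward))
    return batch

def learn(estimates, batch, lr=0.1):
    """Training stage: move each action's value estimate toward observed rewards."""
    for action, reward in batch:
        estimates[action] += lr * (reward - estimates[action])
    return estimates

rng = random.Random(0)
estimates = [0.0, 0.0]
for iteration in range(50):
    batch = collect(estimates, n_samples=20, rng=rng)  # light on the GPU in the real job
    estimates = learn(estimates, batch)                # the GPU-heavy part in the real job
print(estimates)
```

In the real job the learning stage updates a neural network on the GPU, so utilization would be expected to rise and fall as the two stages take turns.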
We noticed that most recent failed jobs returned the following error at step 3:
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.
We are working to solve it. If step 3 is completed without errors, jobs reaching steps 4 and 5 should be using GPU.
We hope that helps shed some light on our work and the recent results. We will try to answer any further questions and keep you informed about our progress.
____________
|
|
|
|
Thanks for the more detailed answer.
regarding the 8GB of memory used.
-which step of the process does this happen?
-was Keith's nvidia-smi screenshot that he posted in another thread showing low memory use, from an earlier unit that did not require that much VRAM?
-will these units fail from too little VRAM?
-what will you do or are you doing about GPUs with less than 8GB VRAM, or even with 8GB?
-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?
____________
|
|
|
|
-do you have some filter in the WU scheduler to not send these units to GPUs with less than 8GB?
It's certainly possible to set such a filter: referring again to Specifying plan classes in C++, the scheduler can check a CUDA plan class specification like
if (!strcmp(plan_class, "cuda23")) {
    if (!cuda_check(c, hu,
        100,        // minimum compute capability (1.0)
        200,        // max compute capability (2.0)
        2030,       // min CUDA version (2.3)
        19500,      // min display driver version (195.00)
        384*MEGA,   // min video RAM
        1.,         // # of GPUs used (may be fractional, or an integer > 1)
        .01,        // fraction of FLOPS done by the CPU
        .21         // estimated GPU efficiency (actual/peak FLOPS)
    )) {
        return false;
    }
}
We last discussed that code in connection with compute capability, but I think we're still having problems implementing filters via tools like that. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
At step 3, initialising the environment requires a small amount of GPU memory (somewhere around 1GB). At step 4 the AI agent is initialised and trained, and a data storage class and a neural network are created and placed on the GPU. This is when more memory is required. However, in the next round of tests we will lower the GPU memory requirements of the script while debugging step 3. Eventually for steps 4 and 5 we expect it to require the 8G mentioned earlier.
Keith's nvidia-smi screenshot showing a job with low memory use was a job that returned after step 2, to verify problems in steps 1 and 2 had been solved.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
So was this WU https://www.gpugrid.net/result.php?resultid=32660133 the one that was completed after steps 1 and 2?
Or after steps 4 and 5?
I never got to witness this one in realtime.
I had nvidia-smi polling update set at 1 second and I never saw the gpu memory usage go above 290MB for that screenshot. It was not taken from the task linked above.
The BOINC completion percentage just went to 10% and stayed there and never showed 100% completion when it finished. Think that is an issue with BOINC historically. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
The environment and the Python interface have compatible versions.
Is the reason why I was able to complete a workunit properly because of having my local python environment match the zipped wrapper python interface?
I use several PyPI applications that have probably set up the Python environment variable.
Is there something I can dump out of the host that completed the workunit properly that will help you debug the application package? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
This one completed the whole python script. Including steps 4 and 5. Should have used the GPU.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Thanks for confirming the one I completed used the gpu. |
|
|
GregerSend message
Joined: 6 Jan 15 Posts: 76 Credit: 25,247,649,732 RAC: 13,432,972 Level
Scientific publications
|
Did a check on one host running GPUGridpy units.
e4a6-ABOU_ppo_gym_demos3-0-1-RND1018_0
Run time 4,999.53
GPU Memory: nvidia-smi report 2027MiB
No check-pointing yet but works well. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We sent out some jobs yesterday and almost all finished successfully.
We are still working on avoiding the following error related to the Obstacle Tower environment:
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
The environment does not need user interaction to launch
The Agents' Behavior Parameters > Behavior Type is set to "Default"
The environment and the Python interface have compatible versions.
However, to test the rest of the code we tried with another set of environments that are less problematic (https://gym.openai.com/). The successful jobs used these environments. While we find and test a solution for the Obstacle Tower locally we will continue to send jobs with these environments to test the rest of the code.
Note that reinforcement learning (RL) techniques are independent of the environment. The environment represents the world where the AI agent learns intelligent behaviours. Switching to another environment simply means applying the learning technique to a different problem that can be equally challenging (placing the agent in a different world). Thus, we will now finish debugging the app with these Gym environments simply because they are less prone to errors and, once we know the only possible source of problems is the environment, consider solving the others.
____________
|
|
|
|
I had a few failures:
https://www.gpugrid.net/result.php?resultid=32660680
and
https://www.gpugrid.net/result.php?resultid=32660448
seems to be a bad WU on both instances since all wingmen are erroring in the same way.
mainly used ~6-7% GPU utilization on my 3080Ti, with intermittent spikes to ~20% every 10s or so. power use near idle, GPU memory utilization around 2GB, and system memory use around 4.8GB. make sure your system has enough memory in multi-GPU systems.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you for the feedback. We had detected the error in
https://www.gpugrid.net/result.php?resultid=32660448
but not the one in
https://www.gpugrid.net/result.php?resultid=32660680
Having alternating phases of lower and higher GPU utilisation is normal in Reinforcement Learning, as the agent alternates between data collection (generally low GPU usage) and training (higher GPU memory and utilisation). Once we solve most of the errors we will focus on maximizing GPU efficiency during the training phases.
____________
|
|
|
|
have you considered creating a modified app that will use the RTX (and other) GPU's onboard Tensor cores? it should speed up things considerably.
https://www.quora.com/Does-tensorflow-and-pytorch-automatically-use-the-tensor-cores-in-rtx-2080-ti-or-other-rtx-cards
I'm guessing in addition to making the needed configuration changes, you'd need to adjust your scheduler to only send to cards with Tensor cores (GeForce RTX cards, TitanV, Tesla/QuadroRTX cards from Volta forward)
____________
|
|
|
|
information for pytorch here:
https://github.com/NVIDIA/apex
https://nvidia.github.io/apex/
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We are using PyTorch to train our agents, and for now we have not considered using mixed precision, which seems to be required for the Tensor cores.
It could be an interesting possibility to reduce memory requirements and speed up training processes. I have to admit that I do not know how it affects performance in reinforcement learning algorithms, but it is an interesting option.
____________
|
|
|
|
Getting errors in the test5 run, like
e2a16-ABOU_ppod_gym_test5-0-1-RND0379_1
e2a10-ABOU_ppod_gym_test5-0-1-RND0874_1
And on the test6 run. This time, the error seems to be in placing the expected task files in the slot directory, prior to starting the main run.
e3a17-ABOU_ppod_gym_test6-0-1-RND2029_0
e3a11-ABOU_ppod_gym_test6-0-1-RND1260_4
Both have
File "run.py", line 393, in <module>
main()
File "run.py", line 106, in main
feature_extractor_network=get_feature_extractor(args.nn),
File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/pytorchrl/agent/actors/feature_extractors/__init__.py", line 19, in get_feature_extractor
raise ValueError("Specified model not found!")
ValueError: Specified model not found!
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
I got one that worked today. Then 6 more that didn't on the same PC
https://www.gpugrid.net/workunit.php?wuid=27086033 |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
I got another. So far it is running
Over 4 CPU threads at 1st then 1 thread for 1st 4min
13% completed back to 10% then no more progression
At 10% then GPU load at 3-5%, 875 MB VRAM
78min so far. |
|
|
|
I've got several GPU Python beta tasks at my triple GPU Host #480458
Several of them have succeeded after around 5000 seconds execution time.
But three of these tasks have exceeded this time.
Task e1a20-ABOU_ppod_gym_test-0-1-RND4563_6 failed after 11432 seconds.
Task e1a6-ABOU_ppod_gym_test-0-1-RND1186_1 failed after 18784 seconds.
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.
This last task is theoretically running on device 1.
But it seems to be effectively running at device 0, sharing the same device with an ACEMD3 regular task e14s132_e10s98p1f905-ADRIA_AdB_KIXCMYB_HIP-0-2-RND7676_5. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
I've got the same thing going on. BOINC says the task is running on Device2 while in reality it is sharing Device0 along with an Einstein GRP task.
This is the task https://www.gpugrid.net/result.php?resultid=32661276 |
|
|
|
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.
The risk of beta testing: It finally failed after 42555 seconds.
I hope this is somehow useful for debugging... |
|
|
|
Task e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 is currently running even longer.
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'
The same for two of your predecessors on this workunit. Is there any way we could avoid re-inventing the wheel (slowly) for errors like this? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The excessively long training time problem and the problem related to
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/3/model.state_dict.73201'
have been fixed now. Most jobs sent today are being completed successfully. The reported issues were very helpful for debugging.
Progress:
The core research idea is to train populations of reinforcement learning agents that learn independently for a certain amount of time and, once they return to the server, pool their learned knowledge with other agents to create a new generation of agents equipped with the information acquired by previous generations. Each GPUgrid job is one of these agents doing some training independently. In that sense, the first 4 characters of the job name identify the generation and the number of the agent (e.g. e1a2-ABOU_ppod_gym_test-0-1-RND2391_3 refers to epoch or generation number 1 and agent number 2 within that generation).
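As an illustration of that naming scheme, here is a minimal sketch that pulls the generation and agent numbers out of a task name. The regular expression and the function name are assumptions made for this example, not code from the project; the pattern also accepts multi-digit numbers such as e23a16.

```python
import re

# Assumed pattern: 'e<generation>a<agent>-' at the start of the task name.
JOB_NAME = re.compile(r"^e(?P<generation>\d+)a(?P<agent>\d+)-")


def parse_job_name(name):
    """Return (generation, agent) parsed from a task name."""
    match = JOB_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    return int(match["generation"]), int(match["agent"])


generation, agent = parse_job_name("e1a2-ABOU_ppod_gym_test-0-1-RND2391_3")
print(f"generation {generation}, agent {agent}")
```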
The recent debugging has allowed more and more of these jobs to finish. An experiment currently running has already reached a 3rd generation of agents.
As mentioned in an earlier post, we are working now with OpenAI gym environments (https://gym.openai.com/)
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?
Even when Device#0 is already occupied by another task from another project?
That leaves at least one device doing nothing because BOINC thinks it is occupied.
|
|
|
|
Are you working on fixing the issue that the tasks only run on Device#0 in BOINC?
+1
At this other example, Device 0 is running 1 Gpugrid ACEMD3 task and 2 Python GPU tasks.
Meanwhile, Device 1 and Device 2 remain idle.
|
|
|
|
weird, I thought this problem had been fixed already. I guess I never realized since I've only been running the beta tasks on my single GPU system.
____________
|
|
|
|
Count me in on this, too.
My client is running e8a16-ABOU_ppod_gym_test7-0-1-RND1448_0 on device 1.
I have GPUGrid excluded from device 0, so I can run tasks from other projects in the faster PCIe slot while testing. But ...
Well, despite running on the wrong card, it finished and passed the GPUGrid validation test. I've swapped over the exclusion, and BOINC and GPUGrid are now in agreement that card 0 is the card to use.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Hard to tell from the error code snippet whether the tasks are hardwired to run on Device#0 or whether the error snippet is just the result of where the task actually has run.
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Well, I have a new python task running by itself now on Device#2.
So it may mean they have fixed the issue where the tasks always ran on Device#0.
See this new output in the stderr.txt that looks like it is allocating to Device#2
It hasn't been there in any other of my tasks till just now for this new task.
Found GPU: True, Number 2 - 2
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, we have fixed the issue. It should be fine now. Please, let us know if you encounter any new device placement error. We just ran the tests and, as you mention, we print the device number in the stderr file.
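For readers curious how a task can honour the device that BOINC assigns, here is a minimal sketch. It is not the project's actual fix: it assumes the slot directory contains BOINC's init_data.xml with a gpu_device_num element, and the helper names are made up for this example.

```python
import os
import re


def gpu_device_from_init_data(xml_text):
    """Return the BOINC-assigned GPU index, or None if it is absent or negative."""
    match = re.search(r"<gpu_device_num>\s*(-?\d+)\s*</gpu_device_num>", xml_text)
    if match is None:
        return None
    device = int(match.group(1))
    return device if device >= 0 else None


def pin_to_device(device, environ=os.environ):
    """Expose only the assigned GPU to CUDA.

    This must run before the CUDA runtime is initialised. Afterwards the
    process sees a single GPU, which frameworks address as 'cuda:0'.
    """
    environ["CUDA_VISIBLE_DEVICES"] = str(device)
    return "cuda:0"


# Hypothetical fragment of init_data.xml, for illustration only.
sample = "<app_init_data><gpu_device_num>2</gpu_device_num></app_init_data>"
device = gpu_device_from_init_data(sample)
print(f"BOINC assigned GPU {device}")
```

An alternative is to leave the environment untouched and pass the index straight to the framework (for example a "cuda:2" device string); which of the two the app uses is not stated in this thread.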
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Thank you for fixing this issue. I don't know whether you test in a multi-gpu environment or not. I suspect a lot of projects don't.
But there are lots of us that run many multi-gpu hosts that have been bit by this bug often. |
|
|
|
Thank you very much for your continuous support. |
|
|
|
Overnight, every one of my six varied, currently active Linux hosts received at least one task of the kind ...-ABOU_ppod_gym_test9-0-1-...
All the tasks gave a valid result, none of them errored. This is promising!
My triple-GPU host happened to receive several tasks in a short time, and three of them were executed concurrently.
It caught my attention that there was a drastic change in overall system temperatures when transitioning from executing highly GPU/CPU-intensive PrimeGrid tasks to the Gpugrid tasks.
On the other hand, every GPU was effectively executing its own task, as shown at the following nvidia-smi screenshot:
This confirms the Keith Myers observation that the previous task-to-GPU assignment problem in multi-GPU systems is solved. Well done! |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
I enabled Python on a 2nd PC with a 1070 and 1080 and they all error out
https://www.gpugrid.net/result.php?resultid=32662330
Output in format: Requested package -> Available versions
Then it lists tons of packages and versions.
When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.
I'm guessing there is some incompatibility between packages I have installed? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
You needn't install any packages. The tasks are entirely packaged with everything they need in the work unit bundle. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
Supposedly, but then they should work. Another PC of mine also with Ubuntu 18.04, driver 470 and Pascal arch works OK. These tasks were all completed by others. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
I can only guess the tasks are confusing the locally installed old Python 2.7 library with the Python 3.8 contained in the bundle.
Python 2.7 is deprecated in current Linux distributions with minimum Python 3.6 in the distros now.
You might want to either uninstall Python or upgrade it to the 3 series. I don't think uninstalling though is desired as I believe a lot of stock applications are Python based and you would lose those. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I think you can uninstall python2 without damage. At least I could on Ubuntu 20.04.3, though I had only BOINC and Folding installed on it.
But I then made the mistake of trying to purge all python versions. It made the system unbootable, and I had to re-install it. |
|
|
|
When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.
To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:
sudo apt install python-is-python3
sudo apt install python3-pip
And after that, you can uninstall unnecessary old packages with the command:
sudo apt autoremove |
|
|
|
I also have two closely similar Linux machines:
132158
508381
Don't be fooled by the host IDs: 132158 is an inherited ID from an earlier generation of hardware, and is actually slightly younger than 508381. Both run the same version of Linux Mint 20.2, installed from the same ISO download, and the same basic software environment - but I do make tweaks to the installed packages separately, as I encounter different testing needs.
Yesterday, I was away from home, but both machines downloaded tasks from the ppod_gym_test9 batch. 132158 failed to run them, 508381 succeeded.
The problem occurs during the learner.step in Python, with a ValueError raised at line 55 during initialisation:
File "/var/lib/boinc-client/slots/4/gpugridpy/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (146, 8)) of distribution Normal(loc: torch.Size([146, 8]), scale: torch.Size([146, 8])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=<AddmmBackward0>)
The two 'file extraction' logs for the GPUGrid Python download seem to be different. I'll try to compare the software environment of the two machines and work out where the difference is coming from. |
|
|
|
Well, I've looked through the software installations for both machines, but I can't see any significant differences. Both have Python 3.8 installed (probably with the operating system), and no sign of any Python 2.x; I've installed a few sundries from terminal (libboost, git, some 32-bit libs for CPDN), but the same list on both machines.
The 'file extraction' logs are different for every task, and sometimes the same filename appears more than once (is duplicated) in the list for a single task.
For the tasks I ran successfully on host 508381, that was the only host that attempted them. The tasks that failed on host 132158 were issued to the full limit of 8 hosts, and failed on all of them.
I can only assume that the difference between success and failure resulted from differences in the task data make-up, and not from differences in the installed software on my hosts.
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
When I check the Python version on this PC I get 'Python 2.7.17'. On the PC that works, Python is not installed at all.
To rule out that something is getting confused by the old, deprecated Python version, you can upgrade to Python 3 with the following Terminal commands:
sudo apt install python-is-python3
sudo apt install python3-pip
And after that, you can uninstall unnecessary old packages with the command:
sudo apt autoremove
The python-is command didn't work.
So I followed the instructions here starting with Option 1
https://phoenixnap.com/kb/how-to-install-python-3-ubuntu
At the end I did the python --version to check. Same 2.7.17 even though it seemed to complete.
So I tried option 2 from source. That worked OK too with 3.7.5
I get to the end and see the note about checking for specific versions. Uh, oh.
python --version = 2.7.17
python3 --version = 3.6.9
python3.7 --version = 3.7.5
So now I have 3 versions installed haha. Maybe one will work, dunno. But we'll need some more tasks to find out. |
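The per-interpreter check above can also be scripted. This is a minimal sketch; the function name and the default list of command names are assumptions for illustration. It reports which interpreters are on the PATH and what each says for --version.

```python
import shutil
import subprocess


def interpreter_versions(names=("python", "python2", "python3", "python3.7")):
    """Map each command name to its '--version' text, or None if not on PATH."""
    versions = {}
    for name in names:
        path = shutil.which(name)
        if path is None:
            versions[name] = None
            continue
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        # Python 2 prints its version to stderr, Python 3 to stdout.
        versions[name] = (result.stdout or result.stderr).strip()
    return versions
```

Calling interpreter_versions() and printing the returned dictionary shows at a glance whether plain "python" still resolves to a 2.x interpreter.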
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,864,495,916 RAC: 2,584,743 Level
Scientific publications
|
sudo apt install python-is-python3
sudo apt install python3-pip
Thanks, that worked and I now have python 3.8.10 installed on my two GG computers with cuda 11.4.
I just noticed that one computer had previously attempted to run a python WU but it failed. https://www.gpugrid.net/result.php?resultid=32727968
The stderr said this among many other things:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.11.0
Please update conda by running
$ conda update -n base -c defaults conda
I tried running that command but it said "conda: command not found."
The rig that didn't run a python WU installed many more lines of files. The rig that did run the failed python WU installed less than half of the files.
What are all of the prerequisites I need to run these python WUs? |
|
|
|
What are all of the prerequisites I need to run these python WUs?
I read Keith Myers Message #58061
Then, I executed:
sudo apt install cmake
Chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5
The same WU had previously failed at five other hosts. |
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,864,495,916 RAC: 2,584,743 Level
Scientific publications
|
sudo apt install cmake
Done. Fingers crossed. Thx |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
What are all of the prerequisites I need to run these python WUs?
I read Keith Myers Message #58061
Then, I executed:
sudo apt install cmake
Chance or not, the following Python task worked for me: e1a1-ABOU_rnd_ppod3-0-1-RND4818_5
The same WU had previously failed at five other hosts.
I was hoping to get a response from the researcher before interfering with the process. Happy someone beat me to it.
So once again we crunchers need to help along the process by installing missing software on our hosts to properly crunch the work the researchers are sending out.
Would be nice if the researchers ran some of their work on some test systems of their own before releasing it to the public, or as we are also known as . . . "beta-testers" |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone, sorry for the late reply.
We detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.
The error was related to atari_py, as some users reported; more specifically, to installing this Python package from GitHub (https://github.com/openai/atari-py), which allows some Atari 2600 games to be used as a test bench for reinforcement learning (RL) agents.
Sorry for the inconvenience. Even though the AI agent part of the code has been tested and works, every time we need to test our agents in a new environment we need to modify the environment initialisation part of the code to include the new environment, in this case atari_py.
I just sent another batch of 5 test jobs; 3 have already finished, and the others seem to be working without problems but have not yet finished.
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
____________
|
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,864,495,916 RAC: 2,584,743 Level
Scientific publications
|
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
I cannot open these links. Please use the [url][/url] tags to make them linkable.
I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this, so I can't say exactly how much, but I think it said 22 GB each, while my System Monitor reported much less, on the order of 17 GB, which has since been relinquished. How much RAM should we have to run pythonGPU?
https://www.gpugrid.net/result.php?resultid=32730780
https://www.gpugrid.net/result.php?resultid=32730783
BTW, I installed cmake and latest python 3.8. Should I uninstall cmake as a better test?
I recommend making its CPU use require 1 and not 0.963. |
|
|
ToniVolunteer moderator Project administrator Project developer Project tester Project scientist Send message
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level
Scientific publications
|
Those are private links, but you can see the result ID. |
|
|
|
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
I cannot open these links. Please use the [url][/url] tags to make them linkable.
I have 2 running now and am surprised how much memory they report using. They finished and reported as I wrote this, so I can't say exactly how much, but I think it said 22 GB each, while my System Monitor reported much less, on the order of 17 GB, which has since been relinquished. How much RAM should we have to run pythonGPU?
https://www.gpugrid.net/result.php?resultid=32730780
https://www.gpugrid.net/result.php?resultid=32730783
BTW, I installed cmake and latest python 3.8. Should I uninstall cmake as a better test?
I recommend making its CPU use require 1 and not 0.963.
real memory? or virtual memory allocation? high virt is normal, and on the order of tens of GB, even for acemd3 tasks.
re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.
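For reference, a minimal sketch of what such a file could contain, generated here from Python so the structure is easy to check. The app name "PythonGPU" is an assumption; confirm the short app name in your own client_state.xml before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, cpu_usage=1.0, gpu_usage=1.0):
    """Build the text of a BOINC app_config.xml reserving CPU and GPU per task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# "PythonGPU" is an assumed app name, used only for illustration.
config_text = build_app_config("PythonGPU")
print(config_text)
```

The resulting text goes into app_config.xml in the GPUGRID project directory; the client picks it up after Options > Read config files in BOINC Manager, or after a restart.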
____________
|
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,864,495,916 RAC: 2,584,743 Level
Scientific publications
|
re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.
I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963. |
|
|
|
a trivial question garners a trivial response :)
does it solve your problem?
____________
|
|
|
|
re: CPU use for the task, this is easily configured client-side with an app config file, and it will force 1:1 no matter what the project defines. I'd recommend that.
I wasn't asking you for a trivial response. I'm asking the people that create these work units why they don't specify 1 instead of 0.963.
Because the GPUGrid staff don't set that figure. There's an algorithm in the (Berkeley written) BOINC server code which generates the figure to use from a range of outdated, stupid, data.
I discussed this at some length almost three years ago, in https://github.com/BOINC/boinc/issues/2949 - with examples drawn from GPUGrid, among other projects. I think that this was about the point that Berkeley stopped reading a single word of what I write. Someone else can get to grips with it this time. |
|
|
|
thanks Richard, I had a thought that it was likely the output of some automated function within BOINC since nearly all projects end up with something like this by default if they don't manually set the figures.
____________
|
|
|
Bill F Send message
Joined: 21 Nov 16 Posts: 32 Credit: 146,419,535 RAC: 54,024 Level
Scientific publications
|
Too bad they never worked to implement all or part of Richard's suggestion / GitHub issue. While I don't claim to be able to see the bigger picture in BOINC it sounds like a good path for automated adjustments when new GPU hardware is released.
Bill F
____________
In October of 1969 I took an oath to support and defend the Constitution of the United States against all enemies, foreign and domestic;
There was no expiration date.
|
|
|
|
I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).
Is it normal behaviour that the WU uses more than 7GB of RAM? |
|
|
|
I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).
Is it normal behaviour that the WU uses more than 7GB of RAM?
Yes.
____________
|
|
|
|
I got one WU today (e1a10-ABOU_rnd_ppod_13-0-1-RND2740_2).
Is it normal behaviour that the WU uses more than 7GB of RAM?
Yes.
Thanks for answering. |
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 137 Credit: 122,463,514 RAC: 19,685 Level
Scientific publications
|
What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit. |
|
|
Bill F Send message
Joined: 21 Nov 16 Posts: 32 Credit: 146,419,535 RAC: 54,024 Level
Scientific publications
|
What is typical run time for these tasks? I am at 1 day and x hours processing and only 34% of the way done. I have 2 days left before deadline. I am running a GTX 1080 plain, not TI that is OC'd a bit.
Your times may be about right. I have a GTX 1060 with 6GB and my times were similar.
|
|
|
|
This task is giving wild results for its estimated time remaining.
This morning, it was saying over 400 days remaining.
Application Python apps for GPU hosts 4.03 (cuda1131)
Name e23a16-ABOU_rnd_ppod_demo_sharing_large-0-1-RND7660
State Running
Received 6/19/2022 6:33:56 AM
Report deadline 6/24/2022 6:34:02 AM
Resources 0.949 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 23:13:06
CPU time since checkpoint 00:02:47
Elapsed time 06:42:45
Estimated time remaining 357d 12:44:26
Fraction done 22.580%
Virtual memory size 5.81 GB
Working set size 1.05 GB
Directory slots/10
Process ID 6376
Progress rate 3.240% per hour
Executable wrapper_6.1_windows_x86_64.exe
I've seen other tasks start out claiming over 300 days remaining, and then finish in between 5 and 6 days.
Is there something wrong in the data sent as task input, or is it the wild first ten tasks for a new application version? |
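As a rough cross-check of the figures quoted above, the remaining time can be extrapolated from the elapsed time and the fraction done alone. This is a minimal sketch that assumes progress continues at the same average rate, which may not hold for these tasks; the function names are made up for this example. With 06:42:45 elapsed at 22.580% done it gives roughly 23 hours, not hundreds of days.

```python
def hms_to_seconds(text):
    """Convert 'HH:MM:SS' into seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def remaining_from_fraction(elapsed_s, fraction_done):
    """Linear extrapolation: time still needed if the average rate holds."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1 - fraction_done) / fraction_done


elapsed = hms_to_seconds("06:42:45")
remaining_hours = remaining_from_fraction(elapsed, 0.22580) / 3600
print(f"about {remaining_hours:.1f} hours remaining")
```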
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,773,367,558 RAC: 502,871 Level
Scientific publications
|
Yup, non beta task but I've seen over 3k day ETAs recently.
Name e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212_1
Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e23a60-ABOU_rnd_ppod_demo_sharing_large-0-1-RND1212
State Running High P.
Received 6/20/2022 12:17:53 PM
Report deadline 6/25/2022 12:17:53 PM
Estimated app speed 311.15 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.99 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint 05:44:17
CPU time 05:47:46
Elapsed time 02:17:30
Estimated time remaining 2764d,01:55:33
Fraction done 11.890%
Virtual memory size 18,693.93 MB
Working set size 3,824.01 MB |
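As a sanity check on the figures above, here is a small sketch comparing two back-of-the-envelope estimates: task size divided by app speed, and a linear projection from the fraction done. Neither is claimed to be the formula the BOINC client actually uses, and both function names are hypothetical. The point is only that neither gets anywhere near 2764 days.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def naive_runtime_days(task_size_gflops: float, app_speed_gflops_per_sec: float) -> float:
    """Whole-task run time if the app really ran at the estimated speed."""
    return task_size_gflops / app_speed_gflops_per_sec / SECONDS_PER_DAY


def progress_remaining_hours(elapsed_seconds: float, fraction_done: float) -> float:
    """Remaining time assuming progress continues at the average pace so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_seconds = elapsed_seconds / fraction_done
    return (total_seconds - elapsed_seconds) / SECONDS_PER_HOUR


# Values copied from the task properties above.
naive_days = naive_runtime_days(1_000_000_000, 311.15)
elapsed_seconds = 2 * 3600 + 17 * 60 + 30  # 02:17:30
progress_hours = progress_remaining_hours(elapsed_seconds, 0.11890)

print(f"size / speed:        {naive_days:.1f} days for the whole task")
print(f"progress projection: {progress_hours:.1f} hours remaining")
```

The size/speed figure is already pessimistic at over a month, and the progress-based one is well under a day, which fits the observation that these tasks finish inside their deadlines.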
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Just ignore the ETA estimates. Garbage data.
The tasks finish fine and well within their deadlines. |
|
|
|
Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.
Note there seems to be no thread for discussing non-beta Python tasks. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.
Note there seems to be no thread for discussing non-beta Python tasks.
No, not at all. BOINC just has no mechanism for dealing with hybrid CPU-GPU tasks.
The Python on GPU tasks are the first of their kind.
It will take the BOINC devs a lot of time to accommodate them correctly.
If they are getting in the way of your other work, I suggest stopping them or limiting them to only a single task at any time by changing your cache size to absolute minimal values. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
Were the very long ETA estimates intended to make GPUGRID Python GPU work run before any GPU work from other BOINC projects? They seem to be very good at doing that.
Note there seems to be no thread for discussing non-beta Python tasks.
The other threads are here:
https://www.gpugrid.net/forum_thread.php?id=5323
https://www.gpugrid.net/forum_thread.php?id=5319 |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1367 Credit: 7,967,208,214 RAC: 3,050,375 Level
Scientific publications
|
FYI, the Python on GPU tasks are the same as the beta Python tasks currently.
Both tasks are using the latest application code.
The devs said they would still keep the beta plan class available, just not in use, for whenever a new application might be developed.
So everyone is getting the standard Python work even if they have beta selected. |
|
|
DragoSend message
Joined: 3 May 20 Posts: 18 Credit: 907,251,710 RAC: 529,067 Level
Scientific publications
|
Does anybody know how many CPU threads would be ideal to run them efficiently? I gave them 12 threads exclusively, but the task still runs primarily on the CPU, with my 3070 Ti kicking in only sporadically for a second or two. |
|
|
|
Giving it more cores won't necessarily make it run faster. As few as 4 cores per task works fine on my EPYC system. But if you are running other projects on the CPU, it will slow them down as the processes compete with each other for CPU time.
By default the program will use however many cores you have, and you really can't change this with any BOINC settings.
Also, I would recommend putting Linux on that system instead of Windows. These tasks run much faster on Linux.
____________
|
|
|
DragoSend message
Joined: 3 May 20 Posts: 18 Credit: 907,251,710 RAC: 529,067 Level
Scientific publications
|
Gotcha! My Linux Laptop finishes them in 10 hours, my much faster Windows PC needs 18! Thanks again Ian & Steve |
|
|