Message boards : Number crunching : Pablo, When is enough enough???
Author | Message |
---|---|
Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below): IF ElapsedTime > 2*PredictedTime THEN Abort&Flag ELSE WTF | |
ID: 51134 | Rating: 0 | |
"Two days ago I completed a WU that ran for 21 days, the next one took 14 days, and today there are two running at 5 and 8 days (below):" When running GPUGrid, you really should run only 1 work unit per GPU. Without knowing what the other work unit on that same GPU is, my guess is that you are starving the GPUGrid work unit of GPU time. What GPUs are in those machines? | |
ID: 51137 | Rating: 0 | |
So gpuGRID has been shown to be incapable of sharing a GPU??? | |
ID: 51138 | Rating: 0 | |
"When running GPUGrid, you really should only run 1 work unit per GPU. Without knowing what the other work unit is on that same GPU, you are probably starving the GPUGrid work unit with time on the GPU. What GPUs are in those machines?" I don't have any problems running two tasks per GPU. My Ryzen 1700 + 2x GTX 1070 system runs 4 long jobs in parallel and needs 30,000-50,000 sec for completion. This config keeps the GPU busy and at an approximately constant temperature (-> longer lifetime), whereas a single-job config would put more thermal stress on the card, especially when jobs are rare (many idle/zero-load periods in between). ____________ I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. | |
ID: 51139 | Rating: 0 | |
"This config keeps the GPU working and at approximately constant temperature (-> longer lifetime) whereas a single job config would put more thermal stress on the card especially when jobs are rare (many down/zero load times in between)." My temps are never very high. All my cards are hybrids, so they never go above 50C. They have extended warranties for a total of 10 years (longer than most people keep their cards), so I have no worries about them failing and not being able to replace them. It is troubling that his machines are taking anywhere from 5 to 21 days to do what most do in a few hours. The only way to know would be to run those exact work units on another machine and see if the results are the same. | |
ID: 51140 | Rating: 0 | |
"So gpuGRID has been shown to be incapable of sharing a GPU???" It is capable of sharing the GPU, but enabling SWAN_SYNC means that you dedicate your GPU to the GPUGrid app to make it as fast as possible (= maximize GPU utilization). In that case you should not share it with other project(s). "Both of these computers have SWAN_SYNC enabled but their app_config.xml was changed after enabling SWAN_SYNC so they still show 0.5 GPU even though its mate completed days ago and it's been running alone." The app_config tells the BOINC manager what to expect from the given app; it does *not* instruct the app how much of a resource (GPU, CPU) to use. (There's no way to configure the GPUGrid app for a given GPU utilization percentage.) So if you enable SWAN_SYNC, you should set 1.0 GPU and 1.0 CPU in app_config.xml like this:
<app_config>
  <app>
    <name>acemdlong</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdbeta</name>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
BTW the deadline is 5 days, so if a workunit takes much more than that you should abort it and look for a solution, as it should take 5-6 hours on a GTX 1070. | |
ID: 51141 | Rating: 0 | |
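For anyone who maintains this file on several rigs, here is a minimal sketch that generates the same app_config.xml layout from a list of app names. The function name `build_app_config` and its parameters are my own invention for illustration, not part of BOINC or GPUGrid; only the XML element names and the three app names come from the post above.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, gpu_usage=1.0, cpu_usage=1.0):
    """Return app_config.xml text with one <app> entry per name.

    Hypothetical helper: it mirrors the structure shown in the post
    and does not talk to BOINC in any way.
    """
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "fraction_done_exact")  # empty flag element
        versions = ET.SubElement(app, "gpu_versions")
        ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
        ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config(["acemdlong", "acemdshort", "acemdbeta"]))
```

The output would still have to be saved as app_config.xml in the project's folder and re-read by the BOINC client before it has any effect.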
Pablo et alia, Please provide some feedback on these WUs that take days and weeks to complete. IF ElapsedTime > 2*PredictedTime THEN Abort&Flag ELSE WTF If I abort them will they just queue up and go out again??? I don't want you to miss the best binding pocket the universe has ever seen. {BTW, this thread is not about SWAN_SYNCing.} | |
ID: 51142 | Rating: 0 | |
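The rule proposed above can be written out as a small check. This is only a sketch of the idea, assuming someone has the elapsed and predicted run times in seconds; `TaskStatus` and `should_abort` are hypothetical names, and nothing here is an existing BOINC or GPUGrid feature.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    """Hypothetical snapshot of one running work unit."""
    name: str
    elapsed_s: float    # wall-clock run time so far, in seconds
    predicted_s: float  # run time estimated for this host, in seconds


def should_abort(task, factor=2.0):
    """Apply the rule: ElapsedTime > factor * PredictedTime."""
    if task.predicted_s <= 0:
        # No usable prediction, so the rule cannot be applied.
        return False
    return task.elapsed_s > factor * task.predicted_s


if __name__ == "__main__":
    day, hour = 86400.0, 3600.0
    stuck = TaskStatus("example-wu", elapsed_s=8 * day, predicted_s=6 * hour)
    print(stuck.name, "abort and flag" if should_abort(stuck) else "keep running")
```

A factor of 2 is the value from the post; a real implementation would probably want a more forgiving threshold for hosts that share the GPU or are often suspended.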
On my SuSE Linux host, with a GTX 750 Ti graphic board, I find in the stderr.txt the following line: | |
ID: 51147 | Rating: 0 | |
"On my SuSE Linux host, with a GTX 750 Ti graphic board, I find in the stderr.txt the following line:" This means that SWAN_SYNC is successfully enabled on your system. When it is not, the line should read: CUDA Synchronization mode BLOCKING {But this matter fits better in the "SWAN_SYNC in Linux client" thread, http://www.gpugrid.net/forum_thread.php?id=4813, as Aurum suggests.} | |
ID: 51148 | Rating: 0 | |
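A quick way to check this on a saved stderr.txt is to look for the synchronization-mode line. The sketch below assumes the line looks like the two variants quoted in this thread (with either spelling of "Synchronisation" and with or without a colon); the helper name `sync_mode` is hypothetical.

```python
import re

# Accepts "Synchronisation"/"Synchronization", with or without a colon.
_SYNC_RE = re.compile(r"CUDA Synchroni[sz]ation mode:?\s*(\w+)", re.IGNORECASE)


def sync_mode(stderr_text):
    """Return the reported mode in upper case (e.g. 'SPIN'), or None."""
    match = _SYNC_RE.search(stderr_text)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    sample = "# CUDA Synchronisation mode: SPIN\n# SWAN Device 2 :\n"
    mode = sync_mode(sample)
    if mode == "SPIN":
        print("SWAN_SYNC looks enabled")
    elif mode == "BLOCKING":
        print("SWAN_SYNC looks disabled")
    else:
        print("no synchronization line found")
```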
Today's long running WU, to abort or not to abort: | |
ID: 51149 | Rating: 0 | |
"If we should just abort them if they take too long can you automate the process as I suggested in the lead post???" There is a clue in your never-ending task 15602360: http://www.gpugrid.net/workunit.php?wuid=15602360 This task has failed or not finished on many different systems, so you can freely abort a task like this. It is clearly a defective one. As you suggest, some in-task protection for avoiding such problems would be appreciated. "If I abort them will they just queue up and go out again???" As seen with this particular defective task, it was resent to many systems. It has now been automatically retired by the project, as it reached a total of 10 resends with no successful result. | |
ID: 51150 | Rating: 0 | |
Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used. | |
ID: 51151 | Rating: 0 | |
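For the record, the back-of-the-envelope figure works out as follows. This is just the arithmetic from the post, under its own assumption that each of the nine other hosts ran the task for 9 days; `gpu_years` is a hypothetical helper name.

```python
def gpu_years(hosts, days_each, days_per_year=365.0):
    """Total GPU time spent, expressed in GPU-years."""
    return hosts * days_each / days_per_year


if __name__ == "__main__":
    # 9 hosts x 9 days = 81 GPU-days, a bit under a quarter of a GPU-year.
    print(f"{gpu_years(9, 9):.2f} GPU-years")
```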
"Wow! If the other nine spent 9 days each that's a quarter of a GPU-year that could've been better used." Even when you abort, I'm pretty sure the results are still uploaded for their analysis. | |
ID: 51152 | Rating: 0 | |
BOINC will sometimes report run times that aren't correct, although they are not usually off by days. I would first try to make sure these problem tasks are really running that long. If they are, that's almost certainly a problem with your host and not the app; problems like that usually get reported pretty quickly by multiple users. You may have a problem similar to what's reported here. | |
ID: 51153 | Rating: 0 | |
"I would think there should be a builtin rule that says:" I second that approach. Right now some extremely long-running (or endlessly looping?) tasks keep timing out and so waste many days of GPU time. For example: e1s363_8_gen-PABLO_V3_p27_sj403_IDP-1-4-RND1764 timed out after running five days on a GTX 1080 Ti (single GPU task for that computer, id:274120). It looks like that task was then assigned to others after erroring out or being cancelled by operators. | |
ID: 51156 | Rating: 0 | |
"...almost certainly a problem with your host and not the app, problems like that usually get reported pretty quickly by multiple users." Thanks for alerting me to that. I'm taking Rig-12 offline. From its stderr.txt file: # CUDA Synchronisation mode: SPIN # SWAN Device 2 : # Name : GeForce GTX 1070 # ECC : Disabled # Global mem : 8117MB # Capability : 6.1 # PCI ID : 0000:03:00.0 # Device clock : 1784MHz # Memory clock : 4004MHz # Memory width : 256bit # GPU [GeForce GTX 1070] Platform [Linux] Rev [3212] VERSION [80] # SWAN Device 2 : # Name : GeForce GTX 1070 # ECC : Disabled # Global mem : 8117MB # Capability : 6.1 # PCI ID : 0000:03:00.0 # Device clock : 1784MHz # Memory clock : 4004MHz # Memory width : 256bit # Simulation unstable. Flag 5 value 11 # Simulation unstable. Flag 6 value 28 # Simulation unstable. Flag 7 value 11 # Simulation unstable. Flag 9 value 18654 # Simulation unstable. Flag 10 value 20896 # The simulation has become unstable. Terminating to avoid lock-up # The simulation has become unstable. Terminating to avoid lock-up (2) # Attempting restart (step 5000) Those 3 Gigabyte 1070s are probably my oldest cards. It may be time to retire them. Nick, how did you spot this??? {It's so hard for me to use this web site. It took me 15 minutes to get that link to appear and another few minutes to get its Tasks list to appear.} | |
ID: 51157 | Rating: 0 | |
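Since instability messages like the ones in that stderr.txt are the tell-tale sign here, a small scan over saved stderr files could flag a suspect host sooner. The sketch below assumes the message wording is exactly as in the excerpt above; `instability_report` is a hypothetical helper, not something the GPUGrid app provides.

```python
import re

_FLAG_RE = re.compile(r"Simulation unstable\. Flag (\d+) value (\d+)")
_TERMINATED = "The simulation has become unstable"


def instability_report(stderr_text):
    """Summarize instability messages found in one stderr.txt dump.

    Uses a regex over the whole text, so it works whether the log is
    one entry per line or flattened into a single line.
    """
    flags = [(int(f), int(v)) for f, v in _FLAG_RE.findall(stderr_text)]
    return {
        "flags": flags,  # list of (flag number, value) pairs
        "terminated": _TERMINATED in stderr_text,
    }


if __name__ == "__main__":
    sample = (
        "# Simulation unstable. Flag 5 value 11\n"
        "# Simulation unstable. Flag 6 value 28\n"
        "# The simulation has become unstable. Terminating to avoid lock-up\n"
        "# Attempting restart (step 5000)\n"
    )
    report = instability_report(sample)
    print(len(report["flags"]), "flags, terminated:", report["terminated"])
```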
"Nick, How did you spot this???" I also have problems getting the site to load, but once it does it usually works OK. I guess I had a little better luck than you. 200+ errors on a single machine stood out pretty quickly, since I hadn't experienced or read reports here about a bad batch of tasks. ____________ Team USA forum | Team USA page Join us and #crunchforcures. We are now also folding: join team ID 236370! | |
ID: 51168 | Rating: 0 | |