
Message boards : Number crunching : ATM free energy calculation -> GPU overheated and kicked off bus

Author Message
Jari Kosonen
Joined: 5 May 22
Posts: 24
Credit: 12,458,305
RAC: 0
Message 60114 - Posted: 19 Mar 2023 | 22:04:37 UTC

The NVIDIA GPU was running at 93C and was then suddenly kicked off the bus, due to overheating I am guessing:

Mon Mar 20 06:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   93C    P0    N/A /  N/A |    170MiB /  4096MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2706      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      5626      C   python                            164MiB |
+-----------------------------------------------------------------------------+
Mon Mar 20 06:00:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  ERR! | 00000000:02:00.0 Off |                  N/A |
|ERR!  ERR!  ERR!    N/A / ERR! |     GPU is lost      |    ERR!         ERR! |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
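
The two snapshots above can also be collected by a script instead of by eye. What follows is only a rough sketch, assuming the readings come from `nvidia-smi --query-gpu=temperature.gpu,utilization.gpu --format=csv,noheader,nounits`. The names `parse_gpu_sample` and `read_gpu_sample` are made up for illustration, and since the exact text printed once the GPU is lost may vary, any line that is not two integers is simply treated as "no reading".

```python
import subprocess
from typing import Optional, Tuple

# Query for temperature (deg C) and utilization (%) as bare CSV numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_sample(line: str) -> Optional[Tuple[int, int]]:
    """Return (temp_c, util_percent), or None if the line is not two integers."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        # e.g. "[N/A]" or an error message instead of a number
        return None


def read_gpu_sample() -> Optional[Tuple[int, int]]:
    """Run nvidia-smi once; None means no usable reading (for example, GPU lost)."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = out.stdout.strip().splitlines()
    return parse_gpu_sample(lines[0]) if lines else None
```

Calling `read_gpu_sample()` every few seconds and logging the result with a timestamp would give the same picture as the two tables, with `None` marking the moment the GPU stopped answering.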

Keith Myers
Joined: 13 Dec 17
Posts: 1335
Credit: 7,538,317,459
RAC: 13,749,196
Message 60115 - Posted: 19 Mar 2023 | 23:42:45 UTC - in response to Message 60114.

Yes, a fan that is not running will do that. Fix the fan so that it runs and try again.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 592,501
Message 60116 - Posted: 20 Mar 2023 | 9:28:03 UTC
Last modified: 20 Mar 2023 | 9:33:42 UTC

What brand and model GPU? E.g., MSI and Gigabyte fans don't last very long but EVGA fans do. Easy to replace.
Might just need to blow out the dust and clean the PCIe connector with isopropyl alcohol.

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 9,572,812,024
RAC: 20,230,570
Message 60117 - Posted: 20 Mar 2023 | 10:41:54 UTC - in response to Message 60114.

The NVIDIA GPU running at 93C, then suddenly kicked off from the bus due to overheating

This symptom would also fit an overheated laptop used for crunching.
More details are given in my Message #52937

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 1,418
Message 60119 - Posted: 20 Mar 2023 | 12:08:01 UTC

it's a laptop with a very old GPU. just look at his profile and hosts. this is the same problem he had with the acemd3 tasks where the GPU overheated and dropped off the bus.

overheating is not necessarily due to fans not spinning, though that's possible. the fans might not be connected to the GPU itself. sometimes laptops use fans that are shared between the GPU and CPU or are chassis controlled.

time to retire this system IMO. or use it for PythonGPU only (lower GPU utilization and heat output). laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that.
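
For anyone who still wants to try such a machine, the usual workaround is a small watchdog that pauses GPU work before the card gets hot enough to drop off the bus. Below is a minimal sketch of the decision logic only, with hysteresis so it does not flap on and off. The thresholds (85 and 70) and the function names are assumptions for illustration, not values from this thread; the resulting mode is meant to be passed to `boinccmd --set_gpu_mode`, which accepts `always`, `auto`, or `never`.

```python
from typing import List, Optional


def next_gpu_mode(temp_c: Optional[int], current_mode: str,
                  pause_at: int = 85, resume_at: int = 70) -> str:
    """Return "never" (pause GPU work) or "auto" (allow it) for one reading.

    pause_at / resume_at are illustrative thresholds in deg C.
    """
    if temp_c is None:
        # No reading at all: stay paused rather than guess.
        return "never"
    if temp_c >= pause_at:
        return "never"
    if temp_c <= resume_at:
        return "auto"
    # Between the two thresholds: keep whatever state we are already in.
    return current_mode


def boinccmd_args(mode: str) -> List[str]:
    """Build the command line that would apply the chosen GPU mode."""
    return ["boinccmd", "--set_gpu_mode", mode]
```

In a loop, one would read the temperature, call `next_gpu_mode`, and run `boinccmd_args(mode)` through `subprocess.run` only when the mode changes. This only reduces the heat load; it does not fix a cooling system that cannot cope with 24/7 use.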

Keith Myers
Message 60122 - Posted: 20 Mar 2023 | 16:16:01 UTC

I just saw the reported 0 rpm fan speed in the nvidia-smi output and commented. I didn't look into the actual hardware.
Yes, don't use this hardware for anything but Python, and even that is questionable.

Erich56
Joined: 1 Jan 15
Posts: 1131
Credit: 9,960,232,676
RAC: 31,919,313
Message 60124 - Posted: 20 Mar 2023 | 19:57:55 UTC - in response to Message 60119.

... laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that.

Recently I was playing with the idea of buying a new laptop with an RTX 3070 or even a 3080 inside, but exactly what you are saying kept me from doing it.

Jari Kosonen
Message 60323 - Posted: 16 Apr 2023 | 4:22:02 UTC - in response to Message 60124.
Last modified: 16 Apr 2023 | 4:22:53 UTC

Run only the selected applications:
ACEMD 3: no
ACEMD 4: no
ATM (beta): no
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): yes


I tried to switch ATM and ACEMD off because they are the main cause of the overheating.
But even with them switched off, boincmgr still downloads them, so I cannot
avoid the problem.
The laptop and GPU models may be too old,
but overheating is the main problem.
I don't know if it is possible to get the GPU back on the bus once it
has been kicked off due to overheating. I always have to
reboot to get it back.

Keith Myers
Message 60324 - Posted: 16 Apr 2023 | 6:37:59 UTC - in response to Message 60323.

You missed one critical setting in Project Preferences.

If no work for selected applications is available, accept work from other applications? no

If you don't set this to no, you will get any and ALL other applications when your desired app tasks aren't available.
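
The interaction between the two settings can be pictured with a toy model. This is a simplified sketch of the behaviour described above, not the actual BOINC scheduler code; the function name `eligible_apps` and its arguments are made up for illustration.

```python
from typing import Iterable, List, Set


def eligible_apps(selected: Set[str], available: Iterable[str],
                  accept_other_apps: bool) -> List[str]:
    """Toy model: which apps may send work to this host.

    selected          -- apps ticked "yes" in the project preferences
    available         -- apps that currently have work
    accept_other_apps -- the "accept work from other applications?" setting
    """
    available = list(available)
    preferred = [app for app in available if app in selected]
    if preferred or not accept_other_apps:
        # Either a selected app has work, or the fallback is switched off.
        return preferred
    # No selected app has work and the fallback is on: anything goes.
    return available
```

With only the Python and Quantum Chemistry apps selected and no work available for them, the model returns ATM and ACEMD tasks when the fallback is on and nothing at all when it is off, which matches the downloads reported above.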

Just an FYI, there hasn't been any Quantum Chemistry work in about 3 years.

Drago
Joined: 3 May 20
Posts: 18
Credit: 797,919,060
RAC: 3,208,145
Message 60531 - Posted: 1 Jun 2023 | 15:41:34 UTC

I am using a gaming laptop with an RTX 3060 to crunch ATM, which works fine. As mentioned before, heat needs to be controlled more carefully in a laptop. When I let the GPU crunch, the CPU stays idle, so the only heat is generated by the GPU. Since Python tasks also use a few CPU threads at the same time, I manually set the CPU frequency to 1300 MHz. That accelerates the calculation, because otherwise the CPU would stay at 400 MHz, but it doesn't increase the heat much compared with leaving it at idle. The GPU runs at 80 degrees C. The system is an Asus laptop with a Ryzen 7 5800H.

Erich56
Message 60532 - Posted: 2 Jun 2023 | 16:01:02 UTC - in response to Message 60531.

I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU.

how can the CPU stay idle while crunching an ATM task?

Keith Myers
Message 60533 - Posted: 2 Jun 2023 | 18:53:17 UTC - in response to Message 60532.

I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU.

how can the CPU stay idle while crunching an ATM task?

All of these Python-based applications are very 'bursty'. In other words, they use the CPU and GPU only infrequently, flipping back and forth between the two processing elements.

Drago
Message 60534 - Posted: 2 Jun 2023 | 20:06:12 UTC

Exactly! That's why I wrote "otherwise in idle", meaning only the ATM task is being crunched.

Keith Myers
Message 60535 - Posted: 3 Jun 2023 | 19:48:53 UTC - in response to Message 60534.

By throttling the CPU speed down to idle to save watts and heat, the only consequence is longer-running tasks, which may risk missing the credit bonuses.

Drago
Message 60536 - Posted: 3 Jun 2023 | 20:22:55 UTC

Not necessarily. With this setup my ATM beta tasks finish after around 6 hours on this Linux host, awarding 1.1 million credit, and neither the CPU nor the GPU overheats. If I were to let them loose as they were programmed, the CPU would stay at 400 MHz even when the ATM task needs it. So manually raising it to 1300 MHz decreases CPU calculation times. That way the CPU temperature doesn't exceed 75 degrees and the GPU stays at around 80. It depends on the project you run and your personal boldness what temperatures you're willing to accept. I like the CPU to peak at 75 degrees and the GPU at a little over 80. That's why I usually run only GPU work or only CPU work. Both together may be too much. An exception is Milkyway, which can be run in parallel if the CPU gets throttled to 1000 MHz. This is just an example to show that ATM tasks can be run on a laptop.
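
The post does not say which tool is used to set the frequency. One possible way on Linux, shown here purely as an assumption, is the cpufreq sysfs interface, where `scaling_min_freq` and `scaling_max_freq` take values in kHz. The helper names below are hypothetical, writing the files needs root, and the driver may clamp or reject values outside the hardware range.

```python
from pathlib import Path
from typing import List, Tuple

# Standard location of the per-CPU cpufreq directories on Linux.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def mhz_to_khz(mhz: int) -> int:
    """cpufreq sysfs files are expressed in kHz."""
    return mhz * 1000


def plan_fixed_freq(mhz: int, cpu_ids: List[int],
                    root: Path = CPUFREQ_ROOT) -> List[Tuple[Path, str]]:
    """List the (file, value) writes that would hold the given CPUs at mhz."""
    value = str(mhz_to_khz(mhz))
    plan = []
    for cpu in cpu_ids:
        base = root / f"cpu{cpu}" / "cpufreq"
        # Max is listed before min; the order may matter on some drivers.
        plan.append((base / "scaling_max_freq", value))
        plan.append((base / "scaling_min_freq", value))
    return plan


def apply_plan(plan: List[Tuple[Path, str]]) -> None:
    """Perform the writes (requires root; raises OSError on failure)."""
    for path, value in plan:
        path.write_text(value)
```

For example, `apply_plan(plan_fixed_freq(1300, [0, 1, 2, 3]))` would request 1300 MHz on the first four logical CPUs. Building the plan separately from applying it makes it easy to print and check the writes before touching the system.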

My laptop with an Intel 1280P, a GTX 1650 and Win 11 behaves differently. If I run only CPU work on it, the temperature goes straight up to 93 degrees. If I use the GPU together with the CPU, the CPU gets throttled automatically to around 2200 MHz, keeping the CPU at around 75 degrees and the GPU at around 73. Sometimes it needs a little kick from the Intel Extreme Tuning Utility, though.


