Message boards : Number crunching : ATM free energy calculation -> GPU overheated and kicked off bus
Author | Message |
---|---|
The NVIDIA GPU running at 93C, then suddenly kicked off from the bus due to overheating, I am guessing:

Mon Mar 20 06:00:36 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   93C    P0    N/A /  N/A |    170MiB /  4096MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2706      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      5626      C   python                            164MiB |
+-----------------------------------------------------------------------------+

Mon Mar 20 06:00:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  ERR! | 00000000:02:00.0 Off |                  N/A |
|ERR!  ERR! ERR!    N/A / ERR!  |          GPU is lost |     ERR!        ERR! |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ID: 60114 | Rating: 0 | |
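The two snapshots above are five seconds apart, so a small watcher that polls the temperature could flag the problem before the GPU drops off the bus. The following is a minimal sketch, not anything used in this thread: the 85 C limit and the function names are hypothetical, and it assumes `nvidia-smi` is on the PATH and supports the `--query-gpu` CSV output.

```python
import subprocess

# nvidia-smi's machine-readable query mode; one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def _to_int(text):
    """Convert a field to int, or None when nvidia-smi reports a non-number."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_gpu_line(line):
    """Parse one 'temperature, utilization' CSV line into a dict."""
    fields = [f.strip() for f in line.split(",")]
    fields += [""] * (2 - len(fields))  # tolerate short lines
    return {"temp_c": _to_int(fields[0]), "util_pct": _to_int(fields[1])}


def is_too_hot(reading, limit_c=85):
    """True when a temperature was reported and it is at or above limit_c.

    limit_c=85 is an assumed, hypothetical threshold, not a vendor value.
    """
    temp = reading["temp_c"]
    return temp is not None and temp >= limit_c


def read_gpus():
    """Run nvidia-smi once and return one reading per GPU.

    Raises subprocess.CalledProcessError if nvidia-smi exits with an error.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

Calling `read_gpus()` in a loop with a short sleep, and suspending GPU work whenever `is_too_hot()` returns true, would be one way to use it; how to suspend the work is left open here.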
Yes, a fan that is not running will do that. Fix the fan so that it runs and try again. | |
ID: 60115 | Rating: 0 | |
What brand and model GPU? E.g., MSI and Gigabyte fans don't last very long but EVGA fans do. Easy to replace. | |
ID: 60116 | Rating: 0 | |
"The NVIDIA GPU running at 93C, then suddenly kicked off from the bus due to overheating" This symptom would also fit an overheated crunching laptop. More details are given in my Message #52937. | |
ID: 60117 | Rating: 0 | |
It's a laptop with a very old GPU. Just look at his profile and hosts. This is the same problem he had with the acemd3 tasks, where the GPU overheated and dropped off the bus. | |
ID: 60119 | Rating: 0 | |
I just saw the reported 0 rpm fan speed in the nvidia-smi output and commented. Didn't look into the actual hardware. | |
ID: 60122 | Rating: 0 | |
"... laptops in general aren't good candidates for BOINC due to the limited cooling under 24/7 use. laptop cooling systems really aren't designed for that." Recently I was toying with the idea of buying a new laptop with an RTX 3070 or even a 3080 inside, but exactly what you are saying stopped me from doing it. | |
ID: 60124 | Rating: 0 | |
Run only the selected applications ACEMD 3: no | |
ID: 60323 | Rating: 0 | |
You missed one critical setting in Project Preferences. | |
ID: 60324 | Rating: 0 | |
I am using a gaming laptop with an RTX 3060 to crunch ATM, which works fine. As mentioned before, the heat needs to be controlled more carefully in a laptop. When I let the GPU crunch, the CPU stays at idle, so the only heat is generated by the GPU. Since Python tasks also use a few CPU threads at the same time, I manually set the CPU frequency to 1300 MHz. This speeds up the calculation, because otherwise the CPU would stay at 400 MHz, but it doesn't add much heat compared with leaving it at idle. The GPU runs at 80 degrees C. The system is an Asus laptop with a Ryzen 7 5800H. | |
ID: 60531 | Rating: 0 | |
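The post above does not say how the CPU frequency was set. On Linux, one common route is the cpufreq sysfs interface, where `scaling_min_freq` and `scaling_max_freq` take values in kHz. The sketch below only illustrates that route under the assumption that the host exposes these files; the helper names are hypothetical, the default is a dry run, and actually writing the files requires root.

```python
from pathlib import Path, PurePosixPath

# Standard location of the per-core cpufreq files on Linux.
CPUFREQ_ROOT = PurePosixPath("/sys/devices/system/cpu")
KNOBS = ("scaling_min_freq", "scaling_max_freq")


def mhz_to_khz(mhz):
    """cpufreq sysfs files expect kHz, so 1300 MHz becomes 1300000."""
    return int(mhz * 1000)


def plan_frequency(cpu_ids, mhz, knob="scaling_min_freq"):
    """Return (path, value) pairs that would set `knob` on each core.

    Use "scaling_min_freq" to raise the floor (e.g. 1300 MHz instead of
    idling at 400 MHz) or "scaling_max_freq" to cap the clock.
    """
    if knob not in KNOBS:
        raise ValueError(f"unknown knob: {knob}")
    value = str(mhz_to_khz(mhz))
    return [(CPUFREQ_ROOT / f"cpu{n}" / "cpufreq" / knob, value) for n in cpu_ids]


def apply_plan(plan, dry_run=True):
    """Print the planned writes, or perform them when dry_run is False."""
    for path, value in plan:
        if dry_run:
            print(f"would write {value} to {path}")
        else:
            Path(str(path)).write_text(value)  # needs root privileges


# Example: show what raising the floor to 1300 MHz on 16 threads would write.
apply_plan(plan_frequency(range(16), 1300))
```

Keeping the plan separate from the write makes it easy to inspect the values first. The 16 logical cores in the example match a Ryzen 7 5800H with SMT enabled, which is an assumption about this host.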
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU." How can the CPU stay idle while crunching an ATM task? | |
ID: 60532 | Rating: 0 | |
"I am using a gaming laptop with an RTX 3060 to crunch ATM ... When I let the GPU crunch the CPU stays in idle so the only heat is generated by the GPU." All of these Python-based applications are very 'bursty'. In other words, they use the CPU and GPU only intermittently, flipping back and forth between the two processing elements. | |
ID: 60533 | Rating: 0 | |
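The 'bursty' pattern described above can be put into numbers from a series of GPU utilization samples, for example values collected once per second. This is only an illustrative sketch: the 50% threshold and the sample values are made up, not measurements from an ATM task.

```python
def busy_fraction(samples, threshold_pct=50):
    """Fraction of samples in which utilization is at or above the threshold."""
    if not samples:
        return 0.0
    busy = sum(1 for s in samples if s >= threshold_pct)
    return busy / len(samples)


def count_switches(samples, threshold_pct=50):
    """Number of times consecutive samples flip between busy and idle."""
    states = [s >= threshold_pct for s in samples]
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical GPU utilization samples in percent.
samples = [98, 0, 97, 2, 0, 99]
print(busy_fraction(samples))   # 0.5
print(count_switches(samples))  # 4
```

A high switch count together with a busy fraction well below 1.0 would match the flip-flopping described in the post, while a steady compute load would show few switches.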
Exactly! That's why I wrote "otherwise left at idle", meaning that only the ATM task is being crunched. | |
ID: 60534 | Rating: 0 | |
By throttling the CPU speed down to idle to save watts and heat, the only consequence is longer-running tasks, which may risk missing the credit bonuses. | |
ID: 60535 | Rating: 0 | |
Not necessarily. With this setup my ATM beta tasks finish after around 6 hours on this Linux host, awarding 1.1 million credits, and neither the CPU nor the GPU overheats. If I were to let them loose the way they were programmed, the CPU would stay at 400 MHz even when the ATM task needs it. So manually raising it to 1300 MHz decreases CPU calculation times. That way the CPU temperature doesn't exceed 75 degrees and the GPU stays at around 80. It depends on the project you run and your personal boldness what temperatures you're willing to accept. I like the CPU to peak at 75 degrees and the GPU a little over 80. That's why I usually run only GPU or only CPU work. Both together may be too much. An exception is Milkyway, which can be run in parallel if the CPU gets throttled to 1000 MHz. This is just an example to show that ATM tasks can be run on a laptop. | |
ID: 60536 | Rating: 0 | |