Message boards : Number crunching : NVidia-Linux Adjustments for heat
Author | Message |
---|---|
Not wanting to get banned or put on that Stop-Forum-Spam list (yes, it happened once), I started a new thread on cooling GPUs.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.116                Driver Version: 390.116                   |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1070    Off  | 00000000:01:00.0  On |                  N/A |
    |100%   52C    P2   107W / 151W |   1520MiB /  8117MiB |     89%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 106...  Off  | 00000000:02:00.0 Off |                  N/A |
    |100%   74C    P2    67W / 120W |   1292MiB /  3019MiB |     96%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 106...  Off  | 00000000:03:00.0 Off |                  N/A |
    |100%   62C    P2    83W / 120W |    384MiB /  3019MiB |     90%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 106...  Off  | 00000000:04:00.0 Off |                  N/A |
    |100%   67C    P2    98W / 120W |   1294MiB /  3019MiB |     97%      Default |
    +-------------------------------+----------------------+----------------------+
    |   4  GeForce GTX 106...  Off  | 00000000:05:00.0 Off |                  N/A |
    |100%   66C    P2    84W / 120W |   1292MiB /  3019MiB |     91%      Default |
    +-------------------------------+----------------------+----------------------+
    |   5  GeForce GTX 1070    Off  | 00000000:06:00.0 Off |                  N/A |
    |100%   54C    P2    79W / 151W |   1322MiB /  8119MiB |     86%      Default |
    +-------------------------------+----------------------+----------------------+

I ran the following script:

    #!/bin/bash
    sudo nvidia-xconfig -a --cool-bits=4
    /usr/bin/nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:1]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:2]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:3]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:3]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:4]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:4]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:5]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:5]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:6]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[fan:7]/GPUTargetFanSpeed=100"

Problem #1 (not sure it is a real problem)
The assignments for fans 6 and 7 generated an error when the script was executed. However, as you can see above, all the fans are running at 100%. Gpu0 and gpu5 are 1070s, which have two fans each, but nvidia-smi shows only one fan per card. Compounding the problem, gpu0 was an EVGA card with an aftermarket (also EVGA) hybrid cooler. The pump and radiator fan always ran at full speed, but it was that rear fan, the "hybrid", that always caused problems. Anyway, it seems that all the fans are running: it is obvious by looking, and the output of nvidia-smi shows significant cooling along with high usage.
Problem #2 (real problem)
A reboot loses everything and I cannot SSH in to run that script. I have to put a terminal and keyboard on the system, bring up a terminal window, and be sure to leave the terminal window open. Fortunately, this is not Windows and thus not subject to rebooting on every update. Since the system has automatic login, one would think that $DISPLAY was defined at the reboot, but I cannot run the script from SSH (PuTTY). However, I am looking at this. There is bound to be a way. | |
ID: 52275 | Rating: 0 | |
From what I have tried, nvidia-settings will only work if a monitor is physically attached, or a monitor dummy plug is fitted. | |
ID: 52276 | Rating: 0 | |
Also, forgot to mention, the first command in your script | |
ID: 52279 | Rating: 0 | |
"Also, forgot to mention, the first command in your script"
You are correct. I put it in because the script did not work the first time I ran it and I thought that was the problem. Not sure what is going on (IANE on Linux), but every single fan control statement generates an error the first time the script is run after a reboot, and the fans are not set to 100. The second time the script is run, all the fans spin at the proper speed but fans 6 and 7 generate errors. I am guessing there has to be a delay between the "enable" and the "speed":

    /usr/bin/nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"

"From what I have tried, nvidia-settings will only work if a monitor is physically attached, or a monitor dummy plug is fitted."
I used PuTTY on win10 to ssh into the 18.04 system, but nvidia-settings does not work:

    jstateson@tb85-nvidia:~/Desktop$ /usr/bin/nvidia-settings -a "[fan:7]/GPUTargetFanSpeed=100"
    Unable to init server: Could not connect: Connection refused
    ERROR: The control display is undefined; please run `/usr/bin/nvidia-settings --help` for usage information.
    jstateson@tb85-nvidia:~/Desktop$ echo $DISPLAY

    jstateson@tb85-nvidia:~/Desktop$

Even though I enabled auto login when setting up 18.04, I see a login screen after a reboot. Going to try something like the following:
1. Get the script working so it does not need to be run twice.
2. Put the script someplace where it gets run automatically after either login or reboot.
3. Put a dummy HDMI plug on one of the GPUs. | |
ID: 52282 | Rating: 0 | |
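The "control display is undefined" error in the post above appears because `$DISPLAY` is empty in an SSH session. A minimal sketch of one common workaround follows. It assumes the auto-logged-in desktop owns X display `:0` and that its X cookie is in `~/.Xauthority`; both are assumptions, not facts from the thread (on some Ubuntu/GDM setups the cookie may live elsewhere, for example under `/run/user/<uid>/gdm/Xauthority`). `run` and `set_fan` are hypothetical helper names, and the commands must be run as the user who owns the desktop session.

```shell
#!/bin/sh
# Sketch: point nvidia-settings at the local X server from an SSH login.
# Assumptions (not from the thread): display :0, cookie in ~/.Xauthority.
X_DISPLAY="${X_DISPLAY:-:0}"
X_AUTH="${X_AUTH:-$HOME/.Xauthority}"

# Run a command, or only print it when DRY_RUN=1 (handy for checking the
# command line on a machine without the NVIDIA driver installed).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_fan GPU FAN PERCENT: enable manual fan control on GPU, then set the
# target speed of fan interface FAN, as two separate calls like the thread.
set_fan() {
    run env DISPLAY="$X_DISPLAY" XAUTHORITY="$X_AUTH" \
        /usr/bin/nvidia-settings -a "[gpu:$1]/GPUFanControlState=1"
    run env DISPLAY="$X_DISPLAY" XAUTHORITY="$X_AUTH" \
        /usr/bin/nvidia-settings -a "[fan:$2]/GPUTargetFanSpeed=$3"
}
```

For example, `DRY_RUN=1; set_fan 0 0 100` prints the two command lines without touching the hardware; note that the printed form drops the quotes around the attribute strings.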
OK, got it working at login:

    #!/bin/bash
    #sudo nvidia-xconfig -a --cool-bits=4
    let NumGPU=6
    let NumFAN=6
    # Enable manual fan control on every GPU.
    for (( n=0; n < NumGPU; n++ ))
    do
        /usr/bin/nvidia-settings -a "[gpu:$n]/GPUFanControlState=1"
        /bin/ping -c 1 127.0.0.1    # used as a short delay between calls
    done
    # Set every fan interface to 100%.
    for (( n=0; n < NumFAN; n++ ))
    do
        /usr/bin/nvidia-settings -a "[fan:$n]/GPUTargetFanSpeed=100"
        /bin/ping -c 1 127.0.0.1    # used as a short delay between calls
    done

Once I saw it was working, I edited ".profile" and appended the script (except the top two lines). When I rebooted, that bash ".profile" ran and the fans all kicked in, and I used my HDMI dummy plug. This worked under bash in Ubuntu 18.04; I am not sure about others. Also, that login screen I saw was just the screensaver lockout. I was too slow going back into the garage to see that it did log in automatically. Also, it is possible that my original driver installation attempt of

    sudo sh ./NVIDIA-Linux-x86_64-430.34.run

failed because I used "sh" instead of "bash", but that is a guess. pimping my system... | |
ID: 52283 | Rating: 0 | |
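The script above uses `ping` as a delay, and an earlier post reports that every assignment failed on the first run after a reboot. One alternative is to retry each command until it succeeds. The sketch below is only that, a sketch: `retry` and `set_all_fans` are hypothetical names, it assumes `nvidia-settings` exits non-zero when it cannot apply an attribute (not verified for every driver version), and it assumes one fan interface per GPU with matching indices, as on the Pascal cards in this thread.

```shell
#!/bin/sh
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every try failed.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# set_all_fans PERCENT: count the GPUs the driver reports, then enable
# manual control and set the speed for each (fan index == GPU index).
set_all_fans() {
    speed=$1
    num_gpu=$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l)
    n=0
    while [ "$n" -lt "$num_gpu" ]; do
        retry 5 2 /usr/bin/nvidia-settings -a "[gpu:$n]/GPUFanControlState=1"
        retry 5 2 /usr/bin/nvidia-settings -a "[fan:$n]/GPUTargetFanSpeed=$speed"
        n=$((n + 1))
    done
}
```

Calling `set_all_fans 100` from the login script would replace both loops; querying the GPU count avoids hard-coding `NumGPU=6` when cards are added or removed.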
You only have one fan interface on Pascal cards, so you don't need an extra fan designator for a card's second fan. Simply enabling the fan for each card and setting its speed is enough. Probably should have used my Pascal-only machine as an example:

    #!/bin/bash
    /usr/bin/nvidia-settings -a "[gpu:0]/GPUPowerMizerMode=1"
    /usr/bin/nvidia-settings -a "[gpu:1]/GPUPowerMizerMode=1"
    /usr/bin/nvidia-settings -a "[gpu:2]/GPUPowerMizerMode=1"
    /usr/bin/nvidia-settings -a "[gpu:3]/GPUPowerMizerMode=1"
    nvidia-smi -i 0 -pl 200
    nvidia-smi -i 1 -pl 200
    nvidia-smi -i 2 -pl 200
    /usr/bin/nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:1]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:2]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:3]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:3]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=2000" -a "[gpu:0]/GPUGraphicsClockOffset[3]=40"
    /usr/bin/nvidia-settings -a "[gpu:1]/GPUMemoryTransferRateOffset[3]=1800" -a "[gpu:1]/GPUGraphicsClockOffset[3]=100"
    /usr/bin/nvidia-settings -a "[gpu:2]/GPUMemoryTransferRateOffset[3]=2000" -a "[gpu:2]/GPUGraphicsClockOffset[3]=40"
    /usr/bin/nvidia-settings -a "[gpu:3]/GPUMemoryTransferRateOffset[3]=1000" -a "[gpu:3]/GPUGraphicsClockOffset[3]=80"

This host has a GTX 1080 Ti, 1080, 1080 and 1070 Ti in it. Notice the fan designator matches the gpu number. Even if a Pascal card has two physical fans on it, it only has ONE fan interface. Only the newer Turing cards have TWO fan interfaces. So your script needs to be rewritten to get rid of the errors, which come from trying to manipulate a non-existent interface.

Your script should look like this:

    #!/bin/bash
    /usr/bin/nvidia-settings -a "[gpu:0]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:1]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:1]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:2]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:2]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:3]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:3]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:4]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:4]/GPUTargetFanSpeed=100"
    /usr/bin/nvidia-settings -a "[gpu:5]/GPUFanControlState=1"
    /usr/bin/nvidia-settings -a "[fan:5]/GPUTargetFanSpeed=100"

As somebody else already stated, you only need to invoke the coolbits tweak once. It rewrites xorg.conf to add the coolbits into the monitor section for each card.

[Edit] You need to add a persistence invocation to the script if you are doing anything with nvidia-smi. It needs to be run as root when you invoke it. Adjusting fans with nvidia-settings does not need it, though. Just a bit of info for later, if you decide to overclock the cards to get back the performance penalty the drivers impose when they detect a compute load.

    /usr/bin/nvidia-smi -pm 1

You would also need to change your coolbits bit mask to 28 for clock settings. | |
ID: 52284 | Rating: 0 | |
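The coolbits values in this thread (4 and 28) are bit masks. As far as I recall from NVIDIA's driver README (worth checking against the README shipped with your driver version), value 4 enables manual fan control, 8 enables the clock offset attributes, and 16 enables overvoltage, so 28 is simply 4 + 8 + 16. A small sketch; the variable and function names are illustrative only.

```shell
#!/bin/sh
# Coolbits bit values as described in NVIDIA's driver README
# (assumption: verify against the README for your driver version).
COOLBITS_FAN=4          # manual fan control
COOLBITS_CLOCKS=8       # clock offsets such as GPUGraphicsClockOffset
COOLBITS_OVERVOLT=16    # overvoltage

# coolbits_mask VALUE...: OR the given bit values together.
coolbits_mask() {
    mask=0
    for bit in "$@"; do
        mask=$((mask | bit))
    done
    echo "$mask"
}
```

With these definitions, `coolbits_mask "$COOLBITS_FAN" "$COOLBITS_CLOCKS" "$COOLBITS_OVERVOLT"` gives the 28 mentioned above, which could then be passed once to `sudo nvidia-xconfig -a --cool-bits=28`.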
"From what I have tried, nvidia-settings will only work if a monitor is physically attached, or a monitor dummy plug is fitted."
Running the command manually? I've started up Linux PCs many times and used the GUI to set OC and fan speeds without a monitor plugged in. | |
ID: 52285 | Rating: 0 | |
"I've started up Linux PCs many times and used the GUI to set OC and fan speeds without a monitor plugged in."
It would be nice to know how you do this. A couple of questions:
Are you controlling the Linux PC from another Linux PC? (Controlling the host from another Linux PC would, I suspect, be easier to set up.)
Are you using X forwarding or enabling an XDMCP server on the host?
Do you invoke the Nvidia Settings GUI remotely or via a script to set the OC and fan speeds?
The underlying issue, as I understand it, is how the remote display is set up in xorg.conf on the host. The X server needs to "think" it is outputting to a real display for nvidia-settings to work. Any tips you could offer would be most appreciated. | |
ID: 52286 | Rating: 0 | |
"OK, got it working at login"
Great to see you got it going. Always nice to see pics of custom setups! | |
ID: 52287 | Rating: 0 | |
"I've started up Linux PCs many times and used the GUI to set OC and fan speeds without a monitor plugged in."
I've remotely controlled PCs with TeamViewer. Reboot for whatever reason, remote in with TV and set the OC via the GUI. It was obviously installed with a monitor attached, but after that a monitor hasn't been required, at least in more recent versions of Ubuntu. 18.04 is fine. Some older FAH guides at Overclock.net mention editing xorg.conf to create a monitor, maybe for the issue you're describing. One per GPU. | |
ID: 52290 | Rating: 0 | |
"Some older FAH guides at Overclock.net mention editing xorg to create a monitor"
I have done a bit more reading after viewing your post and came across a similar solution. One solution may be adding connected-monitor="DFP-0" to xorg.conf (and a few other steps). I won't be able to try this until next week. | |
ID: 52291 | Rating: 0 | |
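The xorg.conf approach mentioned above needs one `Device` section per GPU, each with a `BusID`. The bus ids printed by nvidia-smi are hexadecimal, while xorg.conf expects decimal `PCI:bus:device:function`, which is easy to get wrong once a bus number passes 9. The sketch below converts the id and prints an illustrative fragment. It is untested on a headless host: `ConnectedMonitor` and `Coolbits` are option names from NVIDIA's X driver README, but the display name `DFP-0`, the identifier naming, and the helper names are assumptions, and a working xorg.conf also needs matching `Screen` sections (normally written by nvidia-xconfig).

```shell
#!/bin/sh
# busid_to_xorg BUSID: convert an nvidia-smi bus id such as
# "00000000:0A:00.0" (hex) to the xorg.conf form "PCI:10:0:0" (decimal).
busid_to_xorg() {
    id=${1#*:}          # drop the PCI domain   -> "0A:00.0"
    bus=${id%%:*}       # bus number            -> "0A"
    rest=${id#*:}       # device and function   -> "00.0"
    dev=${rest%%.*}     # device number         -> "00"
    fn=${rest#*.}       # function number       -> "0"
    echo "PCI:$((0x$bus)):$((0x$dev)):$((0x$fn))"
}

# device_section INDEX BUSID: print a Device section for one GPU.
# Fragment only; "DFP-0" is a guess at the connector name.
device_section() {
    cat <<EOF
Section "Device"
    Identifier "Device$1"
    Driver     "nvidia"
    BusID      "$(busid_to_xorg "$2")"
    Option     "Coolbits" "4"
    Option     "ConnectedMonitor" "DFP-0"
EndSection
EOF
}
```

For the second card in the nvidia-smi listing at the top of the thread, `device_section 1 00000000:02:00.0` prints a section with `BusID "PCI:2:0:0"`.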
"I've remotely controlled PCs with TeamViewer"
I just looked at TV but it is not applicable to what I need and I cannot justify the monthly subscription. Currently I use the Splashtop and RealVNC free personal versions, but I am paying that cheap yearly subscription for Splashtop's mobile connect. Both remote desktops are limited to 5 systems, but only RealVNC strictly enforces that. Splashtop does not support Linux, and RealVNC has problems, at least for me. I spent a long time trying to get RealVNC and TightVNC to work with 16 and tried an upgrade to 17 but finally gave up. I have not bothered with 18.04 after reading about Wayland compatibility and "Before we disable Wayland, we need to switch to open source linux video driver instead of Nvidia third party driver". I take that to mean I lose CUDA and OpenCL with open-source drivers.
I used VNC on all my systems back about 2009: Win7, Dotsch_UX (Ubuntu 9), but converted my Linux boxes to win10 when I figured out how to upgrade to free win10 (Thanks Microsoft!). However, there are real advantages to running Linux, especially on a BOINC farm. I have a pair of 18.04 systems. One is NVidia, the other AMD, and I will be converting two more to Linux. These are all headless and I would like to use VNC for access instead of PuTTY. | |
ID: 52292 | Rating: 0 | |
You don't need to pay for a Teamviewer subscription if you are not a business. You can have as many computers as you want for free. I've used it for years. | |
ID: 52293 | Rating: 0 | |
"You don't need to pay for a Teamviewer subscription if you are not a business. You can have as many computers as you want for free. I've used it for years."
My bad, I assumed "free" meant free trial. I just scrolled down a lot further and now see it is free for students and personal use without limit, which is nice. Can it access from a different subnet? That was why I signed up ($12 a year) for Splashtop, so my iPhone and tablet could access my systems while out of town. However, I do not need that feature for my BOINC farm systems. I just need to access them occasionally using the GUI from my Windows desktop at home.
[EDIT] This worked! http://stateson.net/images/tv_worked.png
I took a picture with my iPhone of the code and password in the dialog box on the Ubuntu monitor and entered that info into my Windows TV app. However, I need to bring this up without a monitor on the Linux system. I assume it can be done. It would have to be installed as a service and the passwords would have to be persistent. | |
ID: 52294 | Rating: 0 | |
As long as both computers (this means phones too) have an internet connection, the connection will be established. | |
ID: 52295 | Rating: 0 | |
"18.04 after reading about Wayland compatibility and"
This is very outdated information that was only applicable to Ubuntu 17.10, when Ubuntu had a brief dalliance with making Wayland the default display server. That went over like a lead balloon, as too many applications only work with X11. The default for Ubuntu 18.04 and later versions is X11. You can switch to Wayland at login via the config wheel if desired. The default Nvidia driver in the Ubuntu 18.04.2 LTS distro is now the proprietary Nvidia driver, version 430.34. It has been added to the SRU (stable release updates) now. So you get full CUDA and OpenCL support out of the box. | |
ID: 52299 | Rating: 0 | |
"You don't need to pay for a Teamviewer subscription if you are not a business. You can have as many computers as you want for free. I've used it for years."
Yup, until they decide your multiple computer setup is now a business and take weeks to respond to your request to re-activate it. Some of our equipment at work uses Radmin, which everyone prefers over VNC for situations where the desktop account remains unlocked. It might have a GPU driver hook like RDP, though, that would abort any GPU task. It's $50 but worth it not to have input lag like VNC/TV. | |
ID: 52300 | Rating: 0 | |
I had problems with TV as it thought I was using it commercially. So I have switched to DWService: free to use, and access is from any web browser. Windows and Linux agents are available. | |
ID: 52301 | Rating: 0 | |
Added PCIe splitter to get 7 GPU on a 6 slot mombo and lost control of fans on two GPUs. | |
ID: 52350 | Rating: 0 | |
"You don't need to pay for a Teamviewer subscription if you are not a business. You can have as many computers as you want for free. I've used it for years."
That was not my experience. TeamViewer constantly had popups accusing me of not playing fairly, implying I was a commercial user. No provision to say I'm a charity. Now I use NoMachine 6.6.8, but it has problems making the served screen a reasonable size; it is often smaller than popup windows, so I can't push any buttons at the bottom. When it upgraded to 6.7.6, my Linux rigs got either the Blue or White Screen of Death on my screen. Had to revert. TightVNC is the best I've tried but they don't do Linux. I'm looking for a better remote desktop program. Maybe x11vnc??? http://www.karlrunge.com/x11vnc/
My solution to the heat is to turn off everything from 1:00 until 6:00. I also have TOU electric rates that go up 7x during those hours. BOINC is programmed to do that and works great. But it's weird to walk around and hear all the fans still spinning since it's over 90°F. So I manually turn them off. I wish I knew how to write a script that would shut down at 1:00 and power on at 6:00.
____________ | |
ID: 52351 | Rating: 0 | |
"I wish I knew how to write a script that would shutdown at 1:00 and power on at 6:00."
Shutdown: https://forums.linuxmint.com/viewtopic.php?t=22251
Power on: it can be configured in the BIOS. You should check your local time against the BIOS clock, as Linux tends to keep the BIOS clock in GMT (=UTC), so the wake-up time should be set accordingly. | |
ID: 52352 | Rating: 0 | |
re: my problem with only 5 out of 7 GPUs having fan control. | |
ID: 52353 | Rating: 0 | |
The posts from the high-GPU-count Seti users who have multiple GPUs in mining-type motherboards usually come down to fixing the problems by replacing the USB cables from the risers with higher quality shielded cables. | |
ID: 52354 | Rating: 0 | |
Has anyone tried using nfancurve? | |
ID: 52355 | Rating: 0 | |
looks promising. The "to do" list on the first link does mention support still to be added for headless applications. So limited in that respect. Still looking for a fully scripted solution for a headless environment. | |
ID: 52356 | Rating: 0 | |
I have checked that script. It works well for one GPU, a 1080 Ti (sorry, I don't have more to check). | |
ID: 52357 | Rating: 0 | |
I'd like to figure out how nvidia-settings uses the fan control interface identifier on the Turing cards to differentiate the two interfaces. | |
ID: 52358 | Rating: 0 | |
"I'd like to figure out how nvidia-settings uses the fan control interface identifier on the Turing cards to differentiate the two interfaces."
Does the link below help? I don't have a Turing-based card so I can't test this. https://linustechtips.com/main/topic/1048251-turing-rtx-linux-cli-fan-control/ | |
ID: 52359 | Rating: 0 | |
"I'd like to figure out how nvidia-settings uses the fan control interface identifier on the Turing cards to differentiate the two interfaces."
Good find! It's not just Turing cards that have the asynchronous fan control as a "feature". I have an EVGA 1080 Ti with ICX2 cooling that has separate interfaces for controlling the "front" and "rear" fans. On older drivers, I can only control the rear fan speed. From watching the fans under load: if I increase the rear fan speed to 70% using nvidia-settings, the front fan stops spinning. At the time I didn't think that was a good situation, so I left the fan control on auto on that card.
FYI, I've been controlling heat on my cards by lowering the power limit in watts using nvidia-smi. It's not an ideal solution but it works for me. http://stefanocappellini.com/monitor_gpu_nvidia-smi/ | |
ID: 52360 | Rating: 0 | |
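When some cards expose one fan interface and others two, the fan index no longer matches the GPU index. The sketch below builds the mapping from a list of per-GPU fan counts. It assumes, and this is only an assumption, that nvidia-settings numbers fan interfaces consecutively in GPU order; running `nvidia-settings -q fans` on the host should show the real numbering. `fan_plan` and `fan_commands` are hypothetical names, and the commands are only printed, not executed.

```shell
#!/bin/sh
# fan_plan COUNT...: given the number of fan interfaces on each GPU, in GPU
# order, print one "gpu fan" pair per fan interface.
fan_plan() {
    gpu=0
    fan=0
    for count in "$@"; do
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "$gpu $fan"
            fan=$((fan + 1))
            i=$((i + 1))
        done
        gpu=$((gpu + 1))
    done
}

# fan_commands PERCENT COUNT...: print the nvidia-settings command line for
# every fan interface in the plan.
fan_commands() {
    speed=$1
    shift
    fan_plan "$@" | while read -r g f; do
        echo "/usr/bin/nvidia-settings -a \"[gpu:$g]/GPUFanControlState=1\" -a \"[fan:$f]/GPUTargetFanSpeed=$speed\""
    done
}
```

For a host with one single-interface card followed by one dual-interface card, `fan_commands 100 1 2` prints three lines: fan 0 on gpu 0, then fans 1 and 2 on gpu 1.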
FYI, here's how I set fan speed at login. | |
ID: 52361 | Rating: 0 | |
"FYI, here's how I set fan speed at login."
That is a good wiki and gives me a few things to try. Best guide I have seen for extracting the EDID.
"I've been controlling heat on my cards by lowering the power limit in watts by using nvidia-smi. It's not an ideal solution but it works for me."
I use this method to maximize the GPU efficiency. In my case, I can reduce the power draw by 40% and only get a reduction in output of 15%. The card runs cooler and quieter and power usage is lower, for only a relatively small loss in output. | |
ID: 52362 | Rating: 0 | |
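The figures in the post above can be turned into a quick calculation: keeping 60% of the power while keeping 85% of the output means roughly 85/60, or about 1.4 times, the work per watt. The sketch below does that arithmetic in whole numbers and derives a wattage to pass to `nvidia-smi -pl`. The function names are hypothetical, and the 151 W and 120 W caps are taken from the nvidia-smi listing at the top of the thread rather than from this poster's cards.

```shell
#!/bin/sh
# scaled_limit WATTS PERCENT: whole watts left after keeping PERCENT of WATTS.
scaled_limit() {
    echo $(( $1 * $2 / 100 ))
}

# efficiency_gain_pct POWER_KEPT_PCT OUTPUT_KEPT_PCT: change in work per
# watt, in percent, truncated to a whole number.
efficiency_gain_pct() {
    echo $(( $2 * 100 / $1 - 100 ))
}

# Example (not executed here): cap GPU 0, a 151 W card, at 60% of its limit.
#   sudo nvidia-smi -pm 1
#   sudo nvidia-smi -i 0 -pl "$(scaled_limit 151 60)"
```

`scaled_limit 151 60` gives 90 and `efficiency_gain_pct 60 85` gives 41. The driver refuses limits below a card's minimum power limit, which `nvidia-smi -q -d POWER` reports, so a computed value may need to be raised to that floor.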
"I'd like to figure out how nvidia-settings uses the fan control interface identifier on the Turing cards to differentiate the two interfaces."
No, I had already figured all that out on my own when I got my first 2080. I use nvidia-settings in a script to set the fan speeds on all my cards. When the Python code polls the card interfaces with nvidia-settings, it only picks up a single interface in the Python application. You would also have to change the code for the application window to have more radio buttons for the two interfaces and their control. I am not a programmer; I just dabble in reading code and somewhat understand the logic. But where in the code this fails on Turing escapes me for now. Anyone want to take a crack at the Python script? | |
ID: 52363 | Rating: 0 | |
"I wish I knew how to write a script that would shutdown at 1:00 and power on at 6:00."
"Shutdown: https://forums.linuxmint.com/viewtopic.php?t=22251"
Thanks. I installed Gshutdown but it seems only able to handle one-time events. I need something that will shut down all computers at 1:00, either M-F if the TOU electric rate is the issue or every day if heat is the issue. I've heard of cronjob scripts, so maybe that's what I need to learn about. I've seen the wake-on-timer line in the BIOS. I'll test drive that too.
____________ | |
ID: 52374 | Rating: 0 | |
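A cron job combined with `rtcwake` (part of util-linux) is one way to get the scheduled shutdown and power-on asked about above: `rtcwake -m off -s SECONDS` powers the machine off and programs the hardware clock to wake it after the given number of seconds. The sketch below only computes that number. It assumes the 1:00 to 6:00 window means 13:00 to 18:00 local time, which the post does not state, and that the motherboard supports waking from power-off via the RTC alarm. The helper names are hypothetical.

```shell
#!/bin/sh
# to_minutes HH:MM: minutes since midnight. A leading zero is stripped so
# that "08" and "09" are not read as invalid octal numbers.
to_minutes() {
    h=${1%%:*}
    m=${1#*:}
    h=${h#0}; h=${h:-0}
    m=${m#0}; m=${m:-0}
    echo $((h * 60 + m))
}

# span_seconds START END: seconds from START to END (both HH:MM), wrapping
# past midnight when END is not later than START.
span_seconds() {
    start=$(to_minutes "$1")
    end=$(to_minutes "$2")
    diff=$((end - start))
    if [ "$diff" -le 0 ]; then
        diff=$((diff + 1440))
    fi
    echo $((diff * 60))
}
```

`span_seconds 13:00 18:00` gives 18000, so a line such as `0 13 * * 1-5 /usr/sbin/rtcwake -m off -s 18000` in root's crontab would power off at 13:00 Monday to Friday and wake five hours later; replacing `1-5` with `*` covers every day.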
"I wish I knew how to write a script that would shutdown at 1:00 and power on at 6:00."
"Shutdown: https://forums.linuxmint.com/viewtopic.php?t=22251"
If you have the computer on a UPS, then you can schedule a crontab to shut down the host per your schedule and then bring it back up when directed. The UPS interface is already in place to shut the system down and bring it back up for a power event and recovery. The setting in the BIOS I use is in the power APM events settings. I set the BIOS to "Last State". | |
ID: 52378 | Rating: 0 | |