Ketzer7:
Hi all!
I have two boxes in the lab here that I have been trying to get running with GPUGrid again.
One has 3 GTX 280s installed in it, and the other has 2 GTX 295s and a GTX 280. Previously (maybe 6 months+ ago), all of these cards were able to run GPUGrid just fine and return good results. No crashes or other stability problems with any of them; they're all plain vanilla EVGA cards.
The problem I am having now is that no matter what I try, I can't get any of these cards to finish a WU without a computation error. I've observed two different behaviors, which I'll outline below.
1. I installed CentOS 6 x64 with all of the necessary libraries, and installed the CUDA 4.0 dev driver for Linux x64, version 270.41.19, with BOINC 6.10.58. With this setup, I found that one of the GTX 280s seemed to be able to run and complete WUs without error (the card with the monitor attached), but the other two cards would just fail within seconds of starting and report computation errors.
2. I tried starting over with a fresh reinstall of CentOS 6 x64, this time with the recommended CUDA 3.1 dev driver, 256.40, again with BOINC 6.10.58. In this combination, none of the GTX 280s can complete a WU; they all fail with a computation error within seconds of starting.
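A quick way to confirm which driver module is actually loaded after each install (worth checking, since the two setups differ only in driver) is something like:

    cat /proc/driver/nvidia/version   # version string of the loaded NVIDIA kernel module
    lsmod | grep nvidia               # confirm the module is loaded at all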
If memory serves me correctly, all of these cards worked with the 6.12 and 6.13 GPUGrid clients a couple of months ago, but I notice the client version is now 6.14.
Is there something new in 6.14 that breaks CC1.3 cards on Linux x64? It doesn't seem to be an OS/driver/BOINC version problem, as I have another box in the lab with 3 GTX 480s in it, and that box runs like a champ with all cards crunching and returning good results. I've seen other volunteers who seem to still be running CC1.3 cards without problems, but I think all of them have been running on Windows, unfortunately. So I'm beginning to wonder if it's something to do with the Linux version of the GPUGrid client.
In any case, I'm hoping maybe someone has experience or insight with this and can offer some help. I hate to let these cards go to waste if I am just doing something wrong. Although I get the impression that for Linux x64 I may be SOL unless I am running all CC2.0 cards.
Sorry for the long post. Many thanks in advance.

skgiven (volunteer moderator):
If this is app related, there is little I can do to help, so I will concentrate on other possibilities.
Using old Linux drivers with a newer Linux operating system might be an issue - you did say some tasks worked with the more recent driver.
A few suggestions:
Suspend the CPU project, restart and try GPUGrid tasks.
Use the up to date repository drivers.
Swap out 2 cards and try one at a time to confirm that each can crunch GPUGrid tasks in the existing setup. It's probably best to select the normal-length tasks for this (to save time). Clean the cards while you are at it. If they fail in the same way, you might try running MW or Einstein tasks just to tax the card and see whether it's a general problem (GDDR, overheating) or something more specific to GPUGrid (app, driver, libraries). If the cards work individually, try 2 cards together.
I see other people on Linux with CC1.3 cards running GPUGrid tasks successfully:
RAC 190,194.87, total credit 4,941,079, BOINC 6.10.58
CPU: GenuineIntel Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Family 6 Model 26 Stepping 5] (8 processors)
GPU: [2] NVIDIA GeForce GTX 275 (895MB)
OS: Linux 2.6.38-10-generic (reported 3 Aug 2011 14:07:53 UTC)
So, perhaps you could try another Linux version?
After looking at your error messages I see:
"process exited with code 193",
"process exited with code 98",
"process exited with code 1"
Often when there are multiple error types a bad driver install is to blame, or the cards are overheating and failing for apparently different reasons.
The tasks that report failing after 0.00 seconds are mostly 193 errors and a few 1 errors. The tasks reported as failing after around 2 seconds are all 98 errors.
From http://boincfaq.mundayweb.com/index.php?view=238&language=1,
"Code 193 is a segmentation violation error.
You either have problems with your memory or swap file, or the application attempts to access a memory location that it is not allowed to access, or attempts to access a memory location in a way that is not allowed (for example, attempting to write to a read-only location, or to overwrite part of the operating system).
Use a memory checking program like memtest86+ to rigorously test your memory.
And always when you have this error, report it on the forums of the application it happens with. It may well be an error in the application's code."
Are there any RAM or HDD problems? Run scans.
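A rough sketch of such scans, assuming smartmontools is installed (memtest86+ itself has to be booted from a CD or the boot menu rather than run from a shell):

    sudo smartctl -H /dev/sda      # quick SMART health verdict for the first disk
    sudo smartctl -a /dev/sda      # full SMART attribute dump
    sudo badblocks -sv /dev/sda    # read-only surface scan; can take hours

Repeat for /dev/sdb and any other disks in the RAID1 array.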
Any issues with folder permissions (access)?
Hopefully one of the more frequent and experienced Linux users will be able to help.
Good luck,

Ketzer7:
Thanks skg!
Yeah, I am really not sure what is going on with it.
Comparing my two computers, the only real difference between them is the CPU and the GPUs; haf-x runs perfectly with the GTX 480s, but fortress doesn't run at all on the 280s with the same OS, same driver, and same BOINC version. The installs of the OS and driver were basically identical between the two as well.
I'm fairly certain there are no problems with the memory or disks; when I first built fortress, I think I let memtest86+ run for 24+ hours with no errors. Also, I keep all of my crunching boxes at stock clock speeds and timings to avoid any corruption errors that overclocking could cause. The disks are also configured in a RAID1 array, so if one of them has a problem and/or dies, the other should be able to take over and keep the system going until I can repair it.
The cards themselves don't seem to be overheating either; when I look at the Nvidia control panel app in Gnome, the thermal monitor shows all of them sitting at basically idle temperatures, as if they weren't even running a task at all.
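For anyone wanting to check temperatures from a terminal instead of the Gnome applet, something like this should work with the NVIDIA driver installed (attribute and GPU index syntax per nvidia-settings; repeat for each card):

    nvidia-settings -q [gpu:0]/GPUCoreTemp   # core temp of the first GPU
    nvidia-smi -q                            # full status dump; includes temps on supported cards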
From the example system you showed, it appears to be running Ubuntu, judging by the kernel version. I'm downloading it right now and will give it a try to see if it makes any difference. I had similar problems with the CC1.3 cards back in March, when I was running Fedora 14 x64, so maybe it's something specific to Red Hat-based distros? I noticed that the top GPUGrid volunteer is running Linux (albeit with CC2.0 cards), and judging from his kernel versions he is also on Ubuntu, so it might be worth a shot.
I'll try some of your other suggestions as well. Like you said, hopefully some of the more experienced Linux users will have some ideas. I just want to get all of the cards running again. :-(
Thanks again!

Another volunteer:
Hello: I think the quickest solution is to install Ubuntu 11.04.
I have a GTX 295 that has been running Ubuntu without any problems for over six months. Greetings.

Ketzer7:
Heh, I would if I could get it to install and run...
I just tried installing Ubuntu 11.04 on both of my crunching boxes with GT200 cards in them.
The first one (fortress) got most of the way through the install without any problems and then just crapped out with an error telling me it couldn't install GRUB to /dev/sda. I was like wtfbbq? So I killed it and tried again from scratch, and that time it finished installing fine and came up with the little screen saying I needed to reboot. I pushed reboot, and it basically locked up at that screen. So I reset the box manually, and now when it comes up it acts like there is no bootloader; it just sits there. <sigh>
So I went to my other machine, which has a Skulltrail motherboard in it, to see if the problem was specific to the first box. The CD starts booting and gets to the point where I think it should be detecting the hardware (just a flashing cursor in the upper left corner of the monitor, after a purple splash screen with the accessibility icon at the bottom), and then it just sits there and doesn't do anything else. X-[
Ugh.. the fight continues...

Dagorath:
I recall similar problems when I tried to install Ubuntu 10.x over Fedora. I solved it by using GParted to delete all existing partitions; after that, Ubuntu installed and ran OK. A command-line equivalent is sketched below.
Oops! Don't go and delete your Windows partitions if you have any.
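Roughly the same scrub can be done from a live CD terminal by zeroing the MBR, which holds both the boot code and the primary partition table. A minimal sketch, assuming /dev/sda is the disk you mean to wipe (this is destructive, so triple-check the device name first):

    sudo dd if=/dev/zero of=/dev/sda bs=512 count=1   # wipes MBR boot code and partition table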

Ketzer7:
Well, between any reinstalls I have done, I always go into the Intel south bridge RAID controller, delete the array I have defined, and then recreate it, which should wipe out all of the MBR and partition data. At least that's what I always thought it did, but maybe not...

Dagorath:
I don't know if that removes the MBR and partitions or not. I've never thought much of Ubuntu's installer. When I'm in that wtf? kind of position, I try to go right back to basics and assume nothing. Here are a few ideas...
1) Did you verify the integrity of the ISO file you downloaded with an MD5 or SHA check?
2) Did you verify the disc after you burned the ISO image to it?
3) The machine on which the install finished with no apparent error but won't boot must have install logs and boot error logs. Can you boot with a live CD, mount the disk manually (if required to), and look at the logs for errors/clues? (See the sketch after this list.)
4) I know it should not make a difference, but have you tried disabling the RAID array before installing Ubuntu? Same with the video cards... yank 'em out and drop back to the onboard video. If you can get Ubuntu to install and boot that way, then add the items back one at a time and see which one causes a problem.
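A minimal sketch of item 3, run from a live CD terminal. It assumes the installed root filesystem landed on /dev/sda1 and that the log paths are the usual Ubuntu ones, so adjust to your layout:

    sudo mkdir -p /mnt/target
    sudo mount -o ro /dev/sda1 /mnt/target       # mount the installed system read-only
    less /mnt/target/var/log/installer/syslog    # the Ubuntu installer's own log
    less /mnt/target/var/log/dmesg               # kernel messages from the last boot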

skgiven:
I tried to install Ubuntu 11.04 x64 on a system with a GT240 and had similar problems; the install failed, and it might have left me at the flashing cursor too (this was a few weeks ago now). At the time I was under the impression it was PSU related, so I thought nothing of it. I used the on-die GPU and managed to boot with no issues. It ran fine for a few days, right up until I tried to install updates (including GRUB). When I rebooted it gave me failure messages, so I reinstalled again (just OTT) and that seems to have fixed it. So I'm just saying there might be an issue with 11.04 and the drivers it uses during the install.
Ketzer7, as you have an i7-900 series CPU, can I suggest you try the install with a different GPU (a spare ATI card, for example)?
Does anyone think altering the PCIe settings (x4/x16) in the BIOS would make any difference to this Ubuntu install?

Ketzer7:
Thanks for the replies, guys. As it turned out, the installer finally came up on the Skulltrail system after a REALLY long time sitting at the blinking cursor, so I'm not sure what's up with that. It was all for naught, though, because partway through the install it crapped out with the same "failed to install bootloader to /dev/sda" error I got on fortress. So that's twice this specific problem has happened, on two entirely different systems.
@Dagorath:
The Ubuntu ISO I downloaded looks good. I checked it with md5sum, sha1sum, and sha256sum, and each returned a result matching the published value for the Ubuntu 11.04 x64 (amd64) ISO. I had a hell of a time finding the sums files on Ubuntu's site, though; I finally had to dig into an FTP mirror to get them.
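For reference, the check boils down to something like this, run in the directory holding the ISO and the downloaded sums files (the filename below is the stock 11.04 desktop amd64 name, so adjust if yours differs):

    md5sum -c MD5SUMS 2>/dev/null | grep ubuntu-11.04-desktop-amd64.iso
    sha256sum ubuntu-11.04-desktop-amd64.iso    # compare by eye against SHA256SUMS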
I also used ImgBurn when I created the disc and had it verify the burn after it was done, and that came back good as well.
Last night I started wondering the same thing about the RAID1 array and whether it might be causing a problem. Ubuntu's drivers don't seem to have any issue picking it up and manipulating it during the install, but who knows. What made me start thinking this is that both of the boxes I tried installing to, skulltrail and fortress, hit that failing-to-install-the-bootloader error the first time around. The only thing common to them is that they are running RAID1 arrays on an Intel SB RAID controller.
Unfortunately, neither of these boxes has on-board video on the mobo, so I'd have to leave at least one card in, but I could try that as well.
@skg:
I noticed something similar the first time I tried installing it on fortress. I got that failed-to-install-bootloader error, but then I started the install over again (right OTT, like you said), and the second time it seemed to complete OK; the system just wouldn't boot up afterwards. So I'm not sure what is up with that. Dagorath may be right that I need to scrub the disks somehow to cleanse them of their Red Hat ways, but dunno.
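For what it's worth, when the install completes but the box won't boot, GRUB can usually be reinstalled by hand from the live CD. A rough sketch, assuming the Ubuntu root partition is /dev/sda1 (check with sudo fdisk -l first; with the Intel fakeraid in play the device may instead appear under /dev/mapper/, which could be exactly where the installer is tripping up):

    sudo mount /dev/sda1 /mnt
    sudo mount --bind /dev /mnt/dev
    sudo mount --bind /proc /mnt/proc
    sudo mount --bind /sys /mnt/sys
    sudo chroot /mnt
    grub-install /dev/sda    # target the disk's MBR, not a partition
    update-grub
    exit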
I'm going to give Ubuntu one more try and see how far I get, but in the background I've been trying to get a WinXP Pro x64 installer from work, and I may just go with that. Unfortunately, it seems the dark side is better in this regard for getting the cards up and running with GPUGrid.

skgiven:
I stuck a GTX 260 into an Ubuntu 11.04 x64 system and installed the latest NVidia drivers. So far, no problems; it's running a long task that should take around 16 to 17 hours.