Graphics Card Necromancy: EVGA RTX 2080 XC (Reference Board)

RazorWind

It's been a while since we worked on a graphics card together, in part because this one took me almost a year to figure out, but I've finally got something resembling a complete story about it, so here we go.

On the bench we have an EVGA RTX 2080 XC, which is the same board design as the Founder's Edition, but with EVGA's heatsink design. The card came to me from another [H] member, who told the story that he had bought it new right after the 20 series was released, but that it had never delivered the same performance that most folks were seeing from their 2080s. Assuming that this had to do with thermal throttling, he had installed an aftermarket heatsink on it, which failed to improve its performance significantly. Eventually, he got tired of the much larger heatsink, and attempted to remove it, but in the process, tore one of the inductors apart, as we can see here.
crackedinductor.jpg

After some discussion on the forum, he concluded repairing this was beyond his ability and sold the card to me. As an aside, I wish everyone selling used, disassembled graphics cards packed them as well as he did. He really did an excellent job.
disassembled.jpg
pcbonly.jpg

Once the card arrived, the hunt for a replacement inductor began. I should note that it's possible the card might have worked anyway, but the inductors are the only thing standing between the power stages, which chop up the 12V input, and the load, which in this case is the memory. It's not a technically correct explanation, but you can imagine them as a sort of shock absorber: they take the big meaty pulses of 12V and smooth them out into a steadier flow of ~1.3V. That being the case, and this being a theoretically expensive card, I didn't feel it would be prudent to test the card before replacing the obviously damaged part.
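If you want some numbers to go with the shock absorber analogy, here's a rough sketch of the arithmetic for one phase of a buck converter. The switching frequency is an assumption on my part (I never put a scope on this card's memory VRM), so treat the output as ballpark only.

[CODE=python]
# Rough buck-converter ripple estimate for the memory rail.
# ASSUMPTIONS: 12V in, ~1.35V out, the 0.47uH inductor from this card,
# and a switching frequency around 400 kHz (typical, not measured here).
V_IN = 12.0      # volts, from the PCIe slot / 8-pin connectors
V_OUT = 1.35     # volts, approximate GDDR6 rail
L = 0.47e-6      # henries, the damaged inductor's rated value
F_SW = 400e3     # hertz, assumed switching frequency

duty = V_OUT / V_IN                          # fraction of each cycle the phase is "on"
ripple = (V_IN - V_OUT) * duty / (L * F_SW)  # peak-to-peak inductor current ripple, in amps

print(f"duty cycle   ~ {duty:.1%}")                   # roughly 11%
print(f"ripple swing ~ {ripple:.1f} A peak-to-peak")  # several amps, smoothed by the inductor
[/CODE]

The takeaway is just that, with the inductor torn apart, the memory would be looking at those raw switching pulses instead of a smooth ~1.3V, which is why I didn't want to power the card up as-is.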

Now, inductors are generally commodity items that can be sourced from almost any supplier in fairly standard ratings and sizes. At least, that's usually true. Here, though, we have a problem: while the markings indicate that these inductors have a relatively common inductance rating of 0.47uH (that is, 470nH), they're an oddball physical size. You could often get away with something merely close enough, but this card needs them to have an unusually low profile in order for the heatsink to fit. I took some measurements and came up with the following.

measurements.jpg

The thing that makes these really odd is that low Z-measurement. Most inductors you see on graphics cards are more like 4-5mm, but these are extra flat. A little extra digging reveals that this is an unusual, but apparently standard, SMD size - 1284. The markings on them indicate the inductance, with the R indicating the decimal point and the numbers theoretically being microhenries. So, this one is 0.47uH. I'm not sure what the L is for. It could be a tolerance or temperature rating, or maybe a manufacturing date code. If anyone happens to know, please enlighten me.
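If you've never decoded these markings before, the convention is simple enough to script. This little helper is purely illustrative (it's not specific to this card, and it just assumes the common "R marks the decimal point, value in microhenries" scheme):

[CODE=python]
def decode_inductor_marking(marking: str) -> float:
    """Decode a typical SMD power inductor marking into microhenries.

    Assumes the common convention where 'R' stands in for the decimal point
    and the digits are the value in uH: 'R47' -> 0.47 uH, '1R0' -> 1.0 uH,
    '2R2' -> 2.2 uH. Any trailing letters (like the mystery 'L' on these
    parts) are ignored, since I don't know what they mean.
    """
    core = "".join(ch for ch in marking.upper() if ch.isdigit() or ch == "R")
    return float(core.replace("R", "."))

print(decode_inductor_marking("R47L"))  # 0.47 (uH), i.e. 470 nH
print(decode_inductor_marking("1R0"))   # 1.0 (uH)
[/CODE]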

I went off to DigiKey, and while I could find plenty of options in the right inductance and amperage specs from the usual suspects such as Murata, TDK and Kemet, none of them were flat enough to fit in this spot. I found the same thing at Mouser, and even on eBay I struck out. It was almost as if these things were made specifically for this card, which is hardly out of the question, given that this is an Nvidia card. They LOVE to do stuff like that.

Eventually, it occurred to me that the manufacturing of this card probably happens entirely in Asia, so I checked Alibaba. You can buy literally anything from Alibaba. It took me a couple of hours of sifting through Alibaba storefronts, but I eventually ran across a listing for products made by Sanhe Transformer, a smaller outfit in Tianjin that makes, among other things, inductors. As I looked through their various part numbers, I concluded that this pretty much had to be the original source of the inductors on the card. The markings in the picture they had looked almost exactly the same as the originals, minus the L.

aliababa.jpg

The trouble was, Alibaba is a marketplace primarily intended for companies to sell stuff to each other, usually in large quantities. Notice that the minimum order quantity is 100 pieces. I tried everything I could think of, including just asking them whether there was a way I could order just a handful of samples, but they didn't seem to be set up to do that. So, after a few emails back and forth with a very nice lady named Echo...

emailfromecho.jpg

And a couple of weeks waiting on the mail...

I had a spool of 100 shiny new SMD 1284 package 0.47uH inductors on my workbench.

newinductors.jpg
 

heatsinkoff.jpg
With the new parts in hand, I set to work replacing the damaged inductor. Flux on!
flux_on.jpg

Heating up...

heatingup.jpg

And it's off. Smokey!
andit'soff.jpg
itsoff.jpg
withoutinductor.jpg

Then, I cleaned up the pads...
padcleanup.jpg

...and installed the new inductor.

newinductoron.jpg

I cleaned the flux off, and cleaned the grease residue off of the core. You can see here how the markings on the new inductor are slightly different, but it's otherwise a pretty close match to the other ones.
cleanedupcore.jpg

The previous owner was good enough to include the original thermal pads, so I put them back on the heat spreaders, using the greasy spots and component imprints as a guide.
pads_heat_spreader.jpg
pads_backplate.jpg

On this side, it has like, the biggest thermal pad ever.
biggestthermalpadever.jpg

It took me a while to figure out how to plug this connector in. I hear I'm not alone in that. ;) Fuck this connector.
fuckthisconnector.jpg

Finally, I put the heatsink back on, and got ready to test the card. I left the back plate off for now, since it seemed possible I might have to take it apart again.
leftbackplateoff.jpg
 
With the card plugged into the test bench, I hit the button, and...
pluggedin.jpg

We've got signs of life!
wehavepicture.jpg

I was pretty sure the card would at least kind of work, since I felt like the [H] member who sold it to me is probably trustworthy, so I installed the backplate, booted into Windows and ran some benchmarks. Perhaps unsurprisingly, it does graphics card things.
benchmark.jpg
 
Nice work, now time to sell it to some miner on ebay! :p
The thought has occurred to me, but unlike most of the cards I repair, which just sit on a shelf afterwards, I actually intend to use this card myself. The pictures above are from last summer, right after I received it, and you may remember I said it took me almost a year to figure out how to fix it completely. Before we continue the story, we need to circle back to the very beginning, and think about two very important details:

1. This was a very early EVGA RTX 2080, from right when they released

and

2. The reason the card's original owner put that aftermarket heatsink on it in the first place was that it never performed as well as the reviews said it should.

The more astute among us may remember that the launch of the 20 series, particularly EVGA's launch, did not go smoothly. There were at least three different and fairly widespread defects, which Gamers Nexus squawked about very vociferously at the time.


The most egregious defect was the "space invaders" thing that I gather was more common on the 2080 Ti. A second was a bug in the early BIOS that caused the cards to underperform for reasons that are not really explained, and there was a third defect which caused similar poor performance, but was not fixed with the BIOS update that fixed the second one. The explanation given is that it's somehow hardware related, and EVGA fixed any cards that were affected under their reputedly excellent warranty (I've never had to test this).

As it turned out, this card is one of the cards affected by that third defect, but the previous owner never realized it until after he'd damaged it, and it was presumably too late to RMA it.
 
:( How bad is the performance deficit?
 
:( How bad is the performance deficit?
The card actually worked well enough as it was that, particularly given what it cost me, I'd have been OK with it just being like that. The only real drawback was that, given how precious most people are regarding used graphics cards, it would have been a tough sell getting anyone to buy it off of me if I wanted to get rid of it.

Can you imagine having to explain this to someone in FS/FT? "See, when I got it, it had this one part ripped in half. I repaired that using parts from an obscure parts house in China, but it's still got this obscure defect that..."

Anyway, I think the deficit is about 20% in raw performance, depending on exactly how you measure it. So, actual performance was somewhere between a 2060 and a 2070. I did run a couple of TimeSpy passes, but now I can't figure out how to retrieve them from the online results thingy. With the defect, I actually really liked this card for use in my Global Deathplague Home Workstation™. It was still plenty fast for doing the modestly intense things I do for a living, and because it was unstressed, it ran very cool and quiet. I set up a custom fan profile so that the fans wouldn't stop completely, and it stayed around 35 C most of the time.

Moving on, the defect is characterized by a few distinct symptoms:

1. The "GPU Boost" logic doesn't work. This card's "base" clock is 1515 MHz, and that's as high as it would go. Boost speed, theoretically, was supposed to be around 1850.
1515mhz.jpg
2. The power target slider in PX1 displays zero, and you can't adjust it. I assume Afterburner is affected as well, but I don't like Afterburner, so I didn't try that. See the picture above.
3. If one attempts to overclock the card using the core or memory sliders, the display driver crashes. It eventually recovers, but the changes never get applied.
4. The core voltage never goes over 800mv
5. It took me forever to notice this, in part because I wasn't even sure if this card even has this feature, but the RGB disco lights didn't work.
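For the curious, most of the first few symptoms can be spotted from software without PX1. My own poking around was done with NvAPI from C# (more on that further down), but here's a rough equivalent sketch using the NVML Python bindings (pynvml). The GPU index and the interpretation comments are assumptions; it just reads clocks and the power limit and leaves the judgment to you.

[CODE=python]
# Quick check for the "stuck at base clock / no power target" symptoms via NVML.
# This is a sketch using pynvml (pip install nvidia-ml-py), NOT the NvAPI-based
# C# tool described later in the thread, so it may not fail in exactly the same way.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)   # assumes the 2080 is GPU 0
    name = pynvml.nvmlDeviceGetName(gpu)
    if isinstance(name, bytes):                  # older pynvml versions return bytes
        name = name.decode()

    cur = pynvml.nvmlDeviceGetClockInfo(gpu, pynvml.NVML_CLOCK_GRAPHICS)     # MHz, current
    top = pynvml.nvmlDeviceGetMaxClockInfo(gpu, pynvml.NVML_CLOCK_GRAPHICS)  # MHz, max rated
    print(f"{name}: core {cur} MHz (max {top} MHz)")
    # On this card the core sits at the 1515 MHz base clock under load;
    # 1515 / 1850 is about 0.82, which lines up with the ~20% deficit above.

    try:
        limit_w = pynvml.nvmlDeviceGetPowerManagementLimit(gpu) / 1000.0     # watts
        print(f"power limit: {limit_w:.0f} W")
    except pynvml.NVMLError as err:
        print(f"power limit not readable: {err}")  # a blank/zero power target is symptom #2
finally:
    pynvml.nvmlShutdown()
[/CODE]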

If you watch GN's video, they talk about how in some cases, this behavior is supposed to be fixed by flashing an updated BIOS onto the card, using a special software tool that EVGA released via their forum. The previous owner told me that he'd already tried this, but just to be sure, I tried it again, with no success.

Wondering if there were just something wrong with the supposedly revised BIOS image EVGA provided, I tried a number of different BIOS images, using a hacked version of nvflash.

* The latest official EVGA 2080 XC bios
* The bios revision for the 2080 XC Ultra
* 2080 Black Edition
* Nvidia Founder's Edition
Edit: I think I may have tried one more as well - the MSI or Gigabyte "Gaming" reference board one, as I recall.

The only effect this had was that the base clock would change to whatever the installed BIOS specified. For instance, the XC Ultra's base clock is 1525 or 1530 MHz, and the card would report that it was running at that speed. So, the problem was apparently not that the card was incapable of setting the frequency; it was just choosing not to.

It was at this point that I concluded that, obviously, there was something still wrong with the card itself. Given the behavior and the missing power target information, it seemed likely that whatever mechanism monitors power consumption wasn't talking to the part of the GPU that reports it to the driver.
 
One of the problems with troubleshooting an issue like this is that there's basically zero documentation available for how any of the subsystems on the board actually work. The first thing I naively suspected could be the problem was the current sensing on the main voltage control IC. With little to go on, I speculated that, if this were to report nothing back to the driver, you might get this kind of behavior.

The controller on this board is a Ubiq Semiconductor UP9512. Unfortunately, much like the CD3217 USB controllers that Louis Rossmann likes to complain about, Ubiq apparently doesn't sell these to anyone but Nvidia, so there isn't even a basic datasheet for it that's publicly available. Buildzoid seems to have gotten his hands on one, and I was able to confirm that his information is probably good by comparing the circuitry on my card with a datasheet for a similar controller that at least has a pinout diagram available. Here's Buildzoid's video about this.


Unfortunately, I couldn't find anything obviously wrong here. The resistors and caps around my UP9512 seem to match Buildzoid's description, with connections present to each phase for current monitoring, VRef and so forth. What I couldn't figure out was a way to test the I2C-like communication between the UP9512 and the GPU. As it turned out, this didn't matter anyway.

Another avenue of attack I tried was to write a piece of software to interrogate the driver in greater detail than is available via PX1 or the control panel. This was kind of neat, in that there's actually a lot more information available to you as a software developer than the consumer tools provide, and it seems like someone with more free time than me could do all sorts of fun stuff with this. For instance, smarter cooling system control, where you monitor the sensors on the card and use that information to control pumps or fans in the system that aren't connected to the card itself.

Nvidia's relevant developer docs are here:
https://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/index.html

To save myself the suffering of doing this in C++, I made use of Soroush Falahati's wonderful C# wrapper for NvAPI. There's a pretty neat sample application in there that will just write out a lot of the relevant stuff I needed to the console, but it's really verbose.
https://github.com/falahati/NvAPIWrapper
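Just to illustrate the sort of "smarter cooling control" I mean, here's a minimal sketch. It uses the NVML Python bindings rather than the C#/NvAPI route I actually went down, the fan curve numbers are invented, and it only prints a duty cycle; actually driving an external pump or fan header is left to whatever controller you happen to have.

[CODE=python]
# Minimal sketch of "external" cooling control: read the GPU temperature and map
# it to a duty cycle for a fan or pump that is NOT connected to the card itself.
import time
import pynvml

def duty_for_temp(temp_c: float) -> int:
    """Toy fan curve: 30% below 40C, ramping linearly to 100% at 75C."""
    if temp_c <= 40:
        return 30
    if temp_c >= 75:
        return 100
    return int(30 + (temp_c - 40) * (100 - 30) / (75 - 40))

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
        # In a real setup you'd hand this duty cycle to whatever actually drives
        # the fan or pump (motherboard header, Arduino, standalone fan controller, etc.).
        print(f"GPU {temp} C -> external fan/pump duty {duty_for_temp(temp)}%")
        time.sleep(2)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
[/CODE]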

A bit of inspection via the debugger revealed that certain data structures related to power management were present on a healthy Turing card, but were absent on my defective one. Here's the output from my diagnostic application on a healthy 2080. On the defective card, none of the performance control limitation options exist, so you get a mostly blank output.
customapp_2080_healthy.jpg

And here's another example from my 2080 Ti.
healthy_2080ti.png

After looking at this for a while, I began to think that what's probably happening is that the monitoring mechanism isn't communicating, which causes an exception of some kind to be thrown inside the driver. When this happens, it's caught, and the performance control data structures get initialized with nulls. I suspect this may also have something to do with why the driver crashes when trying to set the core and memory offsets, since there's a null structure where there should be something meaningful, and Nvidia perhaps never considered the possibility of those structures being empty.
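As an aside, NVML exposes a rough cousin of those performance-limit structures as "clocks throttle reasons". It's not the same plumbing NvAPI goes through, so I can't promise it misbehaves the same way on a card like this, but a dump on a healthy card looks something like this sketch (Python/pynvml again, with the flag names looked up defensively in case a given pynvml build doesn't define all of them):

[CODE=python]
# Dump NVML's "clocks throttle reasons" bitmask, roughly analogous to the
# performance-limit info the NvAPI tools report. Different code path than NvAPI,
# so treat any comparison with the defective card's behavior as speculative.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(gpu)

candidate_flags = [
    "nvmlClocksThrottleReasonGpuIdle",
    "nvmlClocksThrottleReasonSwPowerCap",
    "nvmlClocksThrottleReasonHwSlowdown",
    "nvmlClocksThrottleReasonSwThermalSlowdown",
    "nvmlClocksThrottleReasonApplicationsClocksSetting",
]
active = [flag for flag in candidate_flags
          if getattr(pynvml, flag, 0) and (mask & getattr(pynvml, flag))]

print(f"raw mask: 0x{mask:x}")
print("active limit reasons:", active or ["none"])
pynvml.nvmlShutdown()
[/CODE]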
 
Honestly, it sounds like a good candidate for mining. Even if there's a performance drop in core clocks, that probably wouldn't affect ETH mining much, if at all, and miners might like paying a discounted price for the card. The only slightly unfortunate thing is that IMO it's one of those cards that's relatively better at gaming than mining to begin with, so it's not like getting a bargain on a Radeon VII or something that's selling for exorbitant prices due to mining right now.
 
Wonder if that control chip may just be flawed or defective and worked just enough to pass QA at the factory.
 
Honestly, it sounds like a good candidate for mining. Even if there's a performance drop in core clocks, that probably wouldn't affect ETH mining much, if at all, and miners might like paying a discounted price for the card. The only slightly unfortunate thing is that IMO it's one of those cards that's relatively better at gaming than mining to begin with, so it's not like getting a bargain on a Radeon VII or something that's selling for exorbitant prices due to mining right now.
I've actually been just using it in my daily driver machine for the last few months, since I've been working from home. I'm even typing this post with it. It's more than fast enough for doing normal work stuff, even with the defect, and it's a great big unstressed GPU with a really good heatsink on it, so it's cool and quiet.
Wonder if that control chip may just be flawed or defective and worked just enough to pass QA at the factory.
The hypothesis I operated on for a little while was that maybe a few cards got manufactured with the wrong firmware version on the controller, where something like its I2C address is different from that of non-defective cards.

It was at this point that I basically gave up for a while. Part of this was laziness - there were some clear avenues of investigation that I didn't even try, such as soldering another device onto the I2C bus with tiny wires, sort of like an old school Xbox mod chip. I should have tried this, but I didn't, because the required reading for that always felt a little bit too much like work, and I have plenty of work these days. More recently, I've also considered putting the card in the microwave for a few minutes, but I'm worried it might knock my test bench machine out of quantum alignment if I try that.
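For the record, the software half of that mod chip idea is the easy part. Here's roughly what I had in mind, assuming a Raspberry Pi (or anything else with a user-accessible I2C master) wired onto the card's bus with those tiny wires. I never actually built this, and I don't know what address the UP9512 answers on, so it just does a plain 7-bit address scan; the usual i2cdetect caveats about probing unknown devices apply.

[CODE=python]
# Sketch of the "mod chip" plan: scan the card's I2C/SMBus for devices that ACK,
# to see whether (and where) the VRM controller is answering. Assumes a Raspberry
# Pi wired to the bus as /dev/i2c-1. Never actually built -- a plan, not a report.
from smbus2 import SMBus

found = []
with SMBus(1) as bus:                  # /dev/i2c-1 on a Pi
    for addr in range(0x03, 0x78):     # valid 7-bit address range
        try:
            bus.read_byte(addr)        # devices that exist will ACK this read
            found.append(hex(addr))
        except OSError:
            pass                       # no ACK at this address

print("devices answering:", found or "none")
[/CODE]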

Then, a few weeks ago, GamersNexus published this video.


Listen to the section around the 6:00-7:00 mark. He talks about how they were told by engineers, presumably at EVGA or Nvidia, that the defect in question has to do with damage to the "twenty" (actually fourteen) pin fan connector on the reference boards.

So, I pulled the card back out of the system, tore it down, and I found... this. Look closely. Can you see what's wrong?

bentpins.jpg
 
I was going to link you this but I couldn't find the video that actually referenced this. :\
 
Are those (un)intentional solder bridges just south of that connector, or is that something else I see?
 
Are those (un)intentional solder bridges just south of that connector, or is that something else I see

You mean the legs of the connector? This connector is an SMD part. Those are the legs that hold it on the board.

or did you mean some of them are bridged? I don’t see any, but I was wrong once before?
 
You mean the legs of the connector? This connector is an SMD part. Those are the legs that hold it on the board.

or did you mean some of them are bridged? I don’t see any, but I was wrong once before?
Might be legs, hard to tell on my potato.
Screenshot_20210421-072052~2.png
 
Might be legs, hard to tell on my potato.
View attachment 349698
Oh, yeah, that's just the light reflecting off of the traces. They look fine in person.

fanconnector.jpg

If you look closely, you can also see damage to the heatsink side of the connector. See how the hole for pin 2 is bigger?

Obviously, after that, I unbent pin 1. This was easier said than done. My finest tweezers weren't stiff enough to bend it back, and my next size up was too clumsy to not butcher pin 2. I got it eventually, though.

Here's after the first pass.
unbent1.jpg

And here's after the second. Not perfect, but good enough.
unbent2.jpg

I put the card back together, reinstalled it, and booted the system up. If I'm honest, I didn't really think that just unbending that pin would fix the problem, and it turned out that I was (mostly) right. The defect behavior remained apparently unchanged. One thing I did notice, though, is that PX1, which I was using to control the fans on the card, would no longer crash the driver the first time it started.

Figuring that there must be more to the problem, I decided it couldn't hurt to reach out via email to Steve at GamersNexus and ask him for any additional information he might have about this issue. Perhaps unsurprisingly, I haven't heard back.

It was at this point that I was sort of stumped again. At least, I didn't have any ideas I thought were good enough to go through all the steps of getting out the cameras, shutting the system down, taking the card apart, and so forth. So I just let it be for a few days. Like many of you who are working from home, I suspect, I don't reboot my desktop machine every day. In fact I really only reboot it when Windows tells me I absolutely have to. Thus, it was a few days before I rebooted again.

When I finally did reboot, though, I was greeted with this message after PX1 started up (not actually my screenshot, but you get the idea):
x1a.jpg


I actually hit the close button and forgot about it a few times, but after a week or two of dismissing this every few days, something possessed me to just let it do its thing, so I hit the start button, instead. It took a couple of seconds for it to do whatever it did, and then...

Holy shit, the RGB Disco lights came on!
rgblightson.jpg

Could it be actually fixed, just like that?

Yes, apparently. Here it is running the GPU-Z render test.
px1_success.jpg

I confirmed with a few minutes of Doom later that day. It seems like it works just fine. The only real problem is that two years of doing this has conditioned me to dread the sound of the coil whine and the fans going, so the sound it makes when it's working hard causes me to feel a distinctly unpleasant flavor of anxiety. Maybe I should get a real case?
 
Not sure if it's too small, but I use retractable ballpoint pens to unbend pins. Put the pen around the pin, then you can gently lever it back into place. Works for CPU pins, but might be hard for connectors like that.
 
Huh, I didn't even think of that. That's a good idea, but I don't think it would have worked in this case, unless you have a really fine pen. This connector is smaller than the ones you normally see on graphics cards.

I ended up using my angle tweezers to get in there and separate the bent pin from underneath, and then used my straight tweezers to straighten it out. Mostly. If I were going to offer fixing this as a service for hire, I'd probably just replace the connectors.
tweezers.jpg
 
Mechanical pencils with long tips work well, as long as the pin isn't too big for your pen(cil).
 
Space invaders is bad VRAM chips. Mining kills them, and upping VRAM voltage shortens their lifespan because of temps and over-cycling the RAM.
 