PCIe bifurcation related, a niche technical question.

Swang4004

I've been doing extensive research for a work-related project. It's a niche application that needs a boatload of PCIe lanes. Between using bridge/switch chips or a motherboard with PCIe bifurcation, the bifurcation route seems to be not only far cheaper but also more reliable, provided the correct parts are selected from the start. I've seen time and time again people splitting 8x and 16x slots into 8x and 4x increments by making use of custom risers and compatible BIOS firmware. My question is this:

Is there some specific technical reason you can't bifurcate a 16x, 8x, or 4x slot into a large number of 1x lane groups?

For example, the application I am looking into would benefit far more from 16 individual 1x lanes out of a 16x slot than it would from four 4x lanes. The same holds true for 8x and 4x slots, with the expectation that only 8 or 4 1x lanes would be available respectively. For clarification, the project relates to accessing NVMe drives in large numbers. Throughput is not the issue at hand, and I understand that an NVMe drive on a single 1x lane will run at 25% of its rated speed. Nowhere in the PCI Express Card Electromechanical, PCI Express Mini Card Electromechanical, or Serial ATA specifications is there a requirement that NVMe drives use 4x lanes to operate. I recall seeing in the standards that the drive needs to negotiate a 1x, 2x, or 4x link. I've even found specific cases noted where drives on certain systems are paired up and share a 4x connection, with 2 lanes going to each drive in the pair, so I don't expect trouble from the drive side.
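If it helps anyone sanity-check the negotiation side on an existing Linux box, here's a minimal sketch (assuming Linux sysfs and Python 3; device paths will differ per system) that reports the negotiated link width and speed for each NVMe controller, which is how I'd confirm a drive really trained at a 1x link:

```python
#!/usr/bin/env python3
"""Report negotiated PCIe link width/speed for each NVMe controller.

Minimal sketch assuming Linux sysfs; the paths are standard but not
guaranteed to exist on every kernel/platform.
"""
import glob
import os

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"

for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    # /sys/class/nvme/nvmeX/device is a symlink to the PCI device directory
    pci_dev = os.path.realpath(os.path.join(ctrl, "device"))
    width = read(os.path.join(pci_dev, "current_link_width"))   # e.g. "1" or "4"
    speed = read(os.path.join(pci_dev, "current_link_speed"))   # e.g. "8.0 GT/s PCIe"
    max_w = read(os.path.join(pci_dev, "max_link_width"))
    print(f"{os.path.basename(ctrl)} @ {os.path.basename(pci_dev)}: "
          f"{width}x (max {max_w}x) at {speed}")
```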

Now, if it were technically possible, I expect I would need to design some rather odd bifurcation PCBs with a REFCLK fanout chip onboard. I assume the harder trick would be the BIOS support. Looking around, I see that something like the Supermicro X11DPX-T would probably fit the bill, as it has a gratuitous quantity of lanes provided both CPUs are installed. Does anyone have any insight on this or other avenues to explore?
 
I find PCIe bifurcation interesting, although I don't have an actual use for it, but I haven't seen anyone bifurcating below x4 links. You end up needing specific support from the PCIe root complex and the system firmware, and I think it's just not there. If you really need the storage density implied by one NVMe drive per PCIe lane on a dual-CPU system, I think you're going to end up with PCIe switch chips doing 16x to the CPU(s) and 1x to each drive. Although, looking around, even those often don't have enough ports to go from a 16x uplink to 16 x1s. It might be easier to use multiple, less dense, storage servers.
 
This sounds like what Liqid was designed to do - and they have storage shelves to tie in to a set of PCIe HBAs. And that just takes software support then, since the PCIe firmware is built into the HBAs.
 
All 1x huh? Planning on building a mining rig? :p

I haven't seen any boards that support bifurcation down to the 1x level. That doesn't mean they aren't out there, but every one I have used supports choosing 16x (if the slot is large enough), 8x, or 4x. In other words:

For 16x slots with Bifurcation support:

16x
8x 8x or
8x 4x 4x or
4x 4x 8x or
4x 4x 4x 4x

For 8x slots with Bifurcation support:
8x or
4x 4x

I use bifurcation quite extensively on my server.

1.) One 8x slot in 4x-4x mode to power an adapter that runs two U.2 NVMe drives, each at 4x
2.) Three of these NVMe adapters, each in a 16x slot, so I can add 12 M.2 NVMe drives, each at 4x

Bifurcation is a little trickier to get working right in some cases (for reasons I will explain below) but once you do get it working, it tends to be more stable, lower latency and generally higher performance than using PLX chips.

Issues with bifurcation usually boil down to implementation and documentation.

Lots of motherboards simply do not support the feature, though it is much more common in actual server hardware.

Of the ones that do, it can be poorly documented whether or not they support it; they may support it with some BIOS revisions but not with others, or there may be mixups in the implementation. On my Supermicro X9DTE-F the feature only seems to work on the very final BIOS revision. It's in the menu on previous versions, but it does not appear to work. They also mixed up the slot numbers in the BIOS, initially making me think it didn't work, when in reality I just had to enable it on the "wrong" slot to make it work.
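If anyone runs into the same slot-numbering confusion, one thing that can help on Linux (assuming the firmware actually exposes ACPI slot information, which not every board does) is dumping the physical-slot-to-PCI-address mapping from sysfs rather than trusting the BIOS labels. A quick sketch:

```python
#!/usr/bin/env python3
"""Dump physical slot number -> PCI bus address from /sys/bus/pci/slots.

Sketch only: relies on the platform exposing ACPI slot information,
which some boards/firmware revisions do not.
"""
import glob
import os

slots = sorted(glob.glob("/sys/bus/pci/slots/*"))
if not slots:
    print("No slot information exposed by this firmware.")
for slot in slots:
    try:
        with open(os.path.join(slot, "address")) as f:
            address = f.read().strip()   # e.g. "0000:3b:00"
    except OSError:
        address = "unknown"
    print(f"Slot {os.path.basename(slot)}: {address}")
```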

Further complications can also include custom vendor implementations.

For instance, for the dual U.2 adapter, I initially tried to use a $100 Supermicro AOC-SLG3-2E4R adapter card intended for bifurcation-compatible systems. It refused to work, leaving me assuming that bifurcation was not working on my motherboard. Turns out that this specific model of adapter is only for certain Supermicro motherboards (not the one I have) and will only work if connected to the motherboard with an I2C cable.

Because of this, I mistakenly ordered a $200 Supermicro AOC-SLG3-2E4 board (apparently dropping an R in the model number means you get a PLX chip?). Essentially the same board as the above, but with a PLX chip for non-bifurcation-compatible motherboards. It worked just fine, but then someone on the ServeTheHome forums filled me in on the I2C problem with the initial Supermicro adapter board, and I decided to try a third option.

The third option was a basic Unicaca 2x U.2 adapter card for use with bifurcation, which I only paid $20 for. This wound up working as intended with bifurcation, so I decided to stop using the PLX-based Supermicro card in order to get lower latency and less heat from the PLX chip.

I still have the PLX-based Supermicro card. I should really sell it in the FS/FT section or something...
 
No mining rigs here. I work at a place that performs data destruction for a humbling number of drives as part of asset disposal/reuse. As more media formats have been created, it's made the job more complex and has impacted scalability. Long gone are the days where you could simply throw together a rack (or ten) full of SAS shelves, top them with a server full of HBAs, and then fill the shelves with whatever mix of SAS or SATA drives the line hauled in today. We constantly have problems due to the sheer number of media formats, which is in no way helped by the industry reusing connectors. Notable examples of reused connectors include SAS3 and U.2, mSATA (SATA protocol) and mSATA (NVMe protocol), and so on.

While many would simply say "just grab some tri-mode cards and be done with it," to that I say, "how many tri-mode cards do you think you need to run 500 drives at the same time?" Just the cost and how to source that many boards isn't something we want to consider, and when you additionally consider that a tri-mode card can't run all that many NVMe drives at a time, you then have to add a LOT of servers to the mix just to get enough PCIe slots, with all the support, licensing, power, and space issues/costs that adds.

Small things add up that normally don't matter to most people, like slotting an mSATA drive into a SATA adapter. It's one screw, like 20 seconds, who cares? Then you realize you have 150 to slot into adapters, before lunch. Later you have to take them back out when they are done wiping, and odds are good that at least a handful or two won't detect, so an hour from now you may still be trying to figure out which specific ones didn't detect. Are the drives bad? Are the adapters worn out? Are they NVMe-format drives that need to go in the other adapter?

M.2 has the same issues, but in addition, while it's relatively simple to slot an M.2 B-key drive into a SATA adapter, M.2 M-key isn't an easy adaptation. Do you manually power off a full server, slot each of the cards out, then slot in a fresh set and wait for it to boot? That'll wear out the connectors in no time. Do you use expensive USB adapters that are slow and prone to issues? Maybe just get a pricey server with an M.2 M-key backplane? Now you have to assemble the drives into caddies, which takes time, and they wear out easily. With M.3 on the way you will soon see servers come in with 30 or 40 M.3 drives EACH, by the pallet. In a perfect world you always get the server or shelf the drives go into, and the items all work, so you use the system to perform the wipe. This is not a perfect world.

This is why I'm looking to split PCIe lanes into the largest number of devices possible. Who cares if a drive takes 4x as long to run? The computer's time and trouble is meaningless by comparison. The point is to reduce the amount of tech time spent slotting in drives, reduce the amount of time spent troubleshooting ones that don't work, and generally simplify the process for the poor souls that have to run this hardware day after day. The eventual goal is to fab up different boards, with each type of board making use of a specific connector, like mSATA, that you slot any mSATA drive into. After you slot the drive in, the board detects the type of drive, connects the drive to the PCIe bus or SAS HBA as appropriate, and presents the drive to the OS.
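Just to sketch the routing decision I have in mind (pure illustration, not working firmware; the pin names and detection inputs are hypothetical assumptions about how the board could be wired, and the logic would live on whatever small controller ends up on each carrier board):

```python
from enum import Enum

class DriveType(Enum):
    NONE = "empty"
    SATA = "route to SAS HBA"
    NVME = "route to PCIe lane"

def classify_slot(present: bool, sata_activity: bool, pcie_clkreq_low: bool) -> DriveType:
    """Hypothetical decision logic for one slot on a drive-carrier board.

    present         -- card-detect / presence pin asserted
    sata_activity   -- SATA PHY saw OOB signalling (COMINIT/COMWAKE) after power-up
    pcie_clkreq_low -- drive pulled CLKREQ# low, asking for a PCIe reference clock
    All three inputs are assumptions about how detection could be wired.
    """
    if not present:
        return DriveType.NONE
    if pcie_clkreq_low:
        return DriveType.NVME   # switch the MUX toward the bifurcated PCIe lane
    if sata_activity:
        return DriveType.SATA   # switch the MUX toward the SAS/SATA HBA port
    return DriveType.NONE       # nothing sane detected; leave the slot isolated

if __name__ == "__main__":
    # Example: a populated slot whose drive requested a PCIe reference clock
    print(classify_slot(present=True, sata_activity=False, pcie_clkreq_low=True))
```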

I've generally been looking at Supermicro hardware to base this monstrosity on, as their hardware has been good for odd solutions like this for years and is easy to mod. As it is, I'll need to overhaul the entire front end of the case, as the normal backplane isn't going to cut it. Custom PCBs for each media format will have to be fabbed, as well as a bifurcation card. Mounting brackets need to be designed, and faceplates to clearly indicate slot type and status with LEDs are on the list. This is not a small deal, and so far the only issue I've not got figured out conceptually is whether you can (BIOS support permitting) split PCIe lanes in such a manner.

I have looked into PLX chips too. They are hit and miss in my experience, and they are extra complexity. Whenever possible I try to avoid things like that, as less complex is always preferable. The cost of PLX-based cards quite quickly adds up, though not as badly as tri-mode cards. Lastly, with PLX cards I don't recall ever seeing a single PCIe 16x split into 16 PCIe 1x links either; it's always in 4s or greater, though to be honest I haven't looked super deeply at them.

Like I said, this is really, really niche. The IT industry at large doesn't seem to make anything like this because almost nobody would need these capabilities. The media problem isn't getting any better on its own, and it seems like every 6 months or so a new format gets announced, so even if this all works out I will periodically have to make new boards for new formats.
 

Interesting.

The original PCIe spec had a consideration for hotplug, but I don't know how well implemented and used it is.

My gut would be to use a SAS controller with a large number of connectors (like my LSI SAS 9305-24i) and just get a large number of cheap SATA-to-M.2 adapters, and maybe some M.2-to-U.2 adapters, and between those try to figure it out, but I can see where that would be a pain too.

I hope you figure out a solution!

Something tells me that your answer probably lies in the products they sell to the mining crowd. There are a large number of PCIe breakout/expansion type of products they use which may wind up being useful to you.
 
Call Liqid. I think their storage setup should do what you want. And since you’re buying empty shelves…
 
PCIe does indeed have a hot insertion implementation. It's actually called "surprise insertion/removal," and most of the standard comes down to properly sequencing power-up and other electrical considerations. The one hard limit I've read of is that the PCIe bus wasn't originally designed for this, so a driver fix had to be implemented for OSes to handle the capability. Windows 7 and older can't handle the surprise insertion/removal process and will hard BSOD if you try it, as only newer OSes got these updates. It is very rarely implemented, as the need for such a feature is very slim. The plan is a SAS controller, extenders if needed, and bifurcation boards. The bifurcation boards would pass a cable to each drive-connect board alongside a SAS connector, and the board's onboard MUX and type-selection hardware can do its job. If it just is not possible to get 16 1x lanes from a single 16x slot, this is still possible, but with 1/4 of the total drives at a time, and even a monster of a board like a Supermicro X11DPX-T would only allow for around 23 drives at a time. If you can split it to 1x lanes, that is a significant gain.
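On the OS side, at least under Linux, you don't strictly need full slot hotplug support to pick up a freshly swapped drive; a bus rescan usually does it. A minimal sketch (assuming Linux, root, and that the slot stays powered; the device address shown is just a placeholder):

```python
#!/usr/bin/env python3
"""Remove a PCI device and rescan the bus so a newly swapped drive re-enumerates.

Sketch only: assumes Linux, root privileges, and standard sysfs paths.
The device address below is a placeholder, not any real system's value.
"""
import os
import sys

def remove_device(bdf: str) -> None:
    """Tell the kernel to forget a device (e.g. before pulling the drive)."""
    with open(f"/sys/bus/pci/devices/{bdf}/remove", "w") as f:
        f.write("1")

def rescan_bus() -> None:
    """Ask the kernel to re-enumerate all PCI buses (picks up the new drive)."""
    with open("/sys/bus/pci/rescan", "w") as f:
        f.write("1")

if __name__ == "__main__":
    bdf = sys.argv[1] if len(sys.argv) > 1 else "0000:3b:00.0"  # placeholder BDF
    if os.path.exists(f"/sys/bus/pci/devices/{bdf}"):
        remove_device(bdf)
    rescan_bus()
```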

I've not seen Liqid specifically, but I've seen other companies make products like theirs. The external PCIe enclosures add considerable cost and space needs, and while they do make use of PCIe switch hardware to simplify the lane usage, it would still need a lot of modification to make use of. Spares/maintenance become an issue very easily as well.

In general, for SAS/SATA shelving we don't have a need to buy them, as we get far more of them in for processing than there is demand for. Downstream clients mostly want replacement drives and the occasional power supply or DAE. It is common to get a shelf without drives as well, and if it is of reasonable capability/condition we can make use of it internally instead of scrapping it. There is budget allotted for this project, but it's not on the order of Fortune 500 company IT spending levels.
 
Yeah, your requirements make a lot of sense, but I haven't seen much that would address them.

For PCIe surprise insertion, my (rough) understanding is that the system firmware needs to more or less know what could possibly end up connected, so it leaves room in the structures for the new devices. But that's just what I picked up from looking at adjacent topics while getting PCI(e) working with static hardware on my (terrible) hobby OS. I would think this might just work if you can bifurcate, and may need more fiddling with PLX-style bridge chips, but the PLX datasheets mention hot plugging, so maybe?

If it were up to me, I think what you'd want for this is modular boards with four M.2, or four mSATA, or four whatever. The connectors would be oriented vertically so you could just slot the devices in with no need to put in a retention screw like normal. The board would connect to the server with a mini-SAS connector for SATA-mode drives, some sort of PCIe connector for PCIe drives (OCuLink, Thunderbolt, non-standard stuff works too, but hopefully nothing easy to confuse with mini-SAS), and power. Like you said, it seems relatively straightforward to detect SATA vs. PCIe and route the pins appropriately. If bifurcation is an option, I think you still need a retimer or something. Otherwise a bridge chip to do the PCIe x4 -> 4 x1s seems more findable than x16 -> 16 x1.

You'd want to have some LEDs per device to show SATA/PCIe mode, device detected, wipe complete, etc., and maybe a button to start. I'm not sure if you could use the PLX's I2C to drive these (plus custom software, of course). OTOH, you don't want these boards to be too expensive, as the connectors will wear out.

Alternatively, maybe just USB? It looks like Realtek has a chip that does NVMe or SATA; there are some not terribly priced 'NVMe docks' on AliExpress that might be worth looking into if that's a possibility. I can't imagine the fun of trying to figure out which USB device is which, but maybe?
 

Hmm. I wonder if the solution might be as you suggest, USB adapters, but instead of using one large host with an impossible-to-keep-track-of number of USB devices, maybe using a large number of cheap, lightweight systems, like Raspberry Pis or something like that.

How do you wipe the drives? Just dd over them? hdparm commands?

Man, secure erase stuff is so time-consuming. It's definitely a great motivation for SED encryption support, and just removing the key to wipe.
 
@toast0

For the surprise insertion, you and I are on the same page as far as memory structures go. From what I can tell, the PLX chips do the work of bifurcating a connector when a motherboard isn't equipped to do so. I suspect if you did use a PLX chip it would need to manage the hot insertion internally, so I expect there are chips that can't do this because they lack that feature, though I have not deeply explored them.

I was looking to fit the modular boards in the same footprint as a 3.5" HDD, so as to reduce the amount of modification to a normal rackmount case. Additionally, by making a bunch of smaller PCBs instead of massive backplanes, the signal integrity won't have to be fleshed out to nearly the same degree, though it will still need to be kept in mind. Between 1 and 4 drives would be on a board, depending on interface. I don't think I would need a retimer, though a clock fanout chip looks to be a necessity. I've worked with retimers and they can fix a lot of issues, but by keeping the traces to minimal lengths with small PCBs and ensuring impedance and length matching, I hope to not need them. If I were planning to daisy-chain these together or make massive case-spanning backplanes, retimers would probably be necessary for even the slowest of speeds. I plan to keep it as simple as possible by having each connector forward only the lanes and signals needed for any particular board. You are pretty much on the same track I am right now regarding the boards otherwise.

All the LEDs I was planning to make use of are for diagnosing drives and boards at a physical level:
1. A single LED on the board, tapped in after the fuse, to indicate that the PCB is powered. (Yes, fuses. I've seen some stuff in this environment.) This one should be on the whole time the system has power.
2. An LED for each slot to indicate that the slot has detected that a drive has been plugged in and is powered.
3. An LED for each slot to indicate whether it's a SATA or NVMe drive.

I probably won't include an activity LED, as it's not as easy as you would think to set that up for some drives. For example, the SATA-protocol mSATA drives have a DAS/DSS pin to supply a buffered activity LED, but the NVMe type doesn't seem to. It wouldn't be strictly necessary either, as the software is far easier to use to determine if a drive is in use than looking at a rack of LEDs.
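If the LEDs do end up hanging off I2C like toast0 suggested, the host-side software is pretty trivial. A rough sketch, assuming a PCA9555-style 16-bit GPIO expander and the Python smbus2 package; the bus number, address, and bit layout are made up for illustration, not from any real board:

```python
from smbus2 import SMBus  # pip install smbus2

# All of these values are illustrative placeholders, not a real board's map.
I2C_BUS = 1            # e.g. /dev/i2c-1
EXPANDER_ADDR = 0x20   # PCA9555-style 16-bit GPIO expander
REG_OUTPUT0 = 0x02     # output port 0 register
REG_CONFIG0 = 0x06     # configuration register (0 = pin is an output)

# Hypothetical bit layout: two LEDs per slot on port 0
#   bit 0: slot 0 "drive detected"   bit 1: slot 0 "NVMe (1) / SATA (0)"
#   bit 2: slot 1 "drive detected"   bit 3: slot 1 "NVMe / SATA"  ...and so on

def set_slot_leds(bus: SMBus, slot: int, detected: bool, is_nvme: bool) -> None:
    """Update the two status LEDs for one slot (read-modify-write of port 0)."""
    current = bus.read_byte_data(EXPANDER_ADDR, REG_OUTPUT0)
    mask = 0b11 << (slot * 2)
    value = (int(detected) | (int(is_nvme) << 1)) << (slot * 2)
    bus.write_byte_data(EXPANDER_ADDR, REG_OUTPUT0, (current & ~mask) | value)

if __name__ == "__main__":
    with SMBus(I2C_BUS) as bus:
        bus.write_byte_data(EXPANDER_ADDR, REG_CONFIG0, 0x00)  # port 0 all outputs
        set_slot_leds(bus, slot=0, detected=True, is_nvme=True)
```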

A few parts will also be needed, like a PCIe-rated MUX, a boatload of differential ESD diodes, and for the mSATA boards at least a 3.3V and a 1.5V regulator to supply those voltages, as a standard SATA power connector doesn't supply them. Changes to the SATA specification also mean that the 3.3V line isn't supplied with power on all systems.

@Zarathustra[H]

A purely USB process isn't really effective, as the software we use relies on the ATA/SCSI command sets to wipe the drives securely. A few adapters we have right now for M.2 M-key drives use USB to connect, and they are glacially slow by comparison, as you can't issue these commands over USB. It also adds a layer of abstraction to the drive, complicating detection and interaction. If we didn't need as many drives, it'd be a quick workaround though.

The software in use is similar in concept to the SCSI Toolbox software, though very different. As far as wiping goes, a combination of different processes is permitted, providing they adhere to the standards guides available from NIST. The biggest trick is you actually need logs to prove you ran the process, which means part/serial numbers all need to join up.
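For whatever it's worth, the bookkeeping side of that is scriptable even before any custom hardware exists. A rough sketch (assuming Linux, smartmontools installed, and root; the CSV layout is just an example, not what our software actually produces) that captures model and serial for every attached drive before a wipe run:

```python
#!/usr/bin/env python3
"""Capture model and serial numbers for all attached drives before a wipe run.

Sketch only: assumes Linux, root, and smartmontools installed. The CSV layout
is an illustrative example, not any particular compliance format.
"""
import csv
import datetime
import glob
import subprocess

def identify(dev: str) -> dict:
    """Parse 'smartctl -i' output into a small dict of identity fields."""
    out = subprocess.run(["smartctl", "-i", dev],
                         capture_output=True, text=True).stdout
    info = {"device": dev, "model": "", "serial": ""}
    for line in out.splitlines():
        if line.startswith(("Device Model:", "Model Number:")):
            info["model"] = line.split(":", 1)[1].strip()
        elif line.startswith("Serial Number:"):
            info["serial"] = line.split(":", 1)[1].strip()
    return info

if __name__ == "__main__":
    devices = sorted(glob.glob("/dev/sd?") + glob.glob("/dev/nvme?n1"))
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"wipe-manifest-{stamp}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["device", "model", "serial"])
        writer.writeheader()
        for dev in devices:
            writer.writerow(identify(dev))
```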

As for a large number of lightweight systems, we would need to spend a lot of time making the software compatible. A side project I have at home involves a Pi 4 CM, and even then you get a single PCIe 1x lane to work with. Modifying a CM to handle this type of task would be difficult to scale well. You could do one drive without much trouble, but even 4 or 5 would be a big ask of that type of system. A setup like what I am trying to build can be used by the techs in the same manner as we are used to handling other drives, without resorting to a wholly different software/hardware architecture. Retooling/training isn't cheap, and supporting multiple systems that are doing essentially the same task complicates matters.

SEDs are nice in concept, though I would still wipe an SED drive; some say I am paranoid. Most places still don't seem to make use of them, and we regularly get them in where the default key wasn't changed. I guess people think they are magic and don't need any steps taken to secure them.

A lot of the reason clients work with us is that most IT asset disposal companies just grind all the hardware up and charge you to do so. With a few places like us, a company can recoup some of the original cost of the hardware because we buy it from them. We separate the broken/low-value materials out to be recycled, but wipe and resell the valuable hardware. They are happy because they know the stuff actually got wiped and they get paid. We are happy because some of this stuff still has a lot of value, so we have jobs. A LOT of this hardware is still in good shape, as many companies dispose of things as soon as the warranty expires, and it can have many years of life left. Lastly, there are all the environmental concerns; if it's not broken, it's a waste to throw it away. We do have some very sensitive clients that still require the drives to be destroyed no matter what, but they have contracts detailing how the equipment has to be securely transported, and they want camera recordings of the grinder eating the drives, etc. It's actually a complex industry, and it's not that old, which is why there isn't a big market for things like this.
 
Lots of places are required to always buy SED drives (even for apps that don't need them), and thus don't bother with a key manager for the places where they aren't required.
 
I hadn't considered that aspect, and the cost difference is little to none for many SED vs. non-SED drives. It would allow for the use of that option later without a huge cost replacing noncompliant drives.
 
Bingo. Move it into something secure - enable encryption. If not - meh. But you don't have to worry about mixing non-SED and SED drives in the same DC.
 

I'm not sure if bifurcating the x16 slot below x4 is possible without adding extra chips into the mix. IIRC the advertised CPU specs from Intel/AMD for their mainstream desktop chips (I don't watch HEDT/server closely enough to say anything there) bottom out at the CPU's x16 being split into x4s.
 
I guess I'll need to do a deeper dig into the chipset spec then and see if this is possible. Thanks for the input so far on this, everybody.
 