ninjaburger
n00b | Joined: Dec 15, 2014 | Messages: 4
So, I'm just a lurker here -- created an account just to post this, honestly -- and usually I find my answers to problems quickly enough by looking over other HF posts, but I'm currently at my wit's end and have *no* idea what to do except start asking for help. Maybe somebody here will recognize some of these symptoms and have a troubleshooting idea I haven't tried.
Thanks for taking a look, if you have the time! I really appreciate it.
** SYSTEM SPECS **
- Chassis: SuperMicro 846E16-R1200B (Intel C602)
- Motherboard: SuperMicro X9DR7/E-LN4F (BIOS R 3.0a)
- CPU: Dual Xeon E5-2620 v2
- RAM: 16GB 1333MHz (8x 2GB)
- GPU: Matrox G200eW
- RAID Controller: Adaptec 6805 (FW 19147)
- SAS Backplane: SuperMicro BPN-SAS2-846EL1 (LSI SAS2x36 expander, Rev 0717?)
- Boot drive: 250GB SATA (direct to mobo)
- RAID drives: 5x2TB WD Re SAS (WD2001FYYG, mixed FW VR07 & VR02)
- OS: Windows Server 2008 R2 (fully updated)
** IMMEDIATE ISSUE **
Drives fail off the Adaptec 6805 controller like it's their job.
Literally any I/O pressure, or any substantial time spent at idle, will cause them to spontaneously drop off the controller with the Adaptec error:
Physical drive removed: controller: 1 ( Adaptec 6805 #XXXXXXXXXXX Physical Slot: 1 ), channel: 0, deviceID: 9, enclosure ID: 0, slot ID: 1, WWN: XXXXXXXXXXXXXXXX, vendor: WD, model: WD2001FYYG-01SL3, S/N: XXXXXXXXXXXX, firmware level: VR02.
or something very much like it, depending on which slot failed this time, which leads to logical device failures, degraded arrays, and sad faces.
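For anyone who wants to dig through the logs, here is a rough sketch of how the removal events could be tallied by slot and by drive firmware, to see whether the failures cluster anywhere. It assumes each event sits on a single line in the format quoted above. The function name `tally_removals` is made up for illustration, and the second and third sample lines are invented placeholders, not real log entries.

```python
import re
from collections import Counter

# Field layout taken from the "Physical drive removed" event quoted above.
# "slot ID:" is matched case-sensitively so "Physical Slot:" is skipped.
REMOVED = re.compile(
    r"Physical drive removed:.*?"
    r"slot ID:\s*(?P<slot>\d+),.*?"
    r"firmware level:\s*(?P<fw>[^.,\s]+)"
)

def tally_removals(lines):
    """Count removal events per backplane slot and per drive firmware."""
    by_slot, by_fw = Counter(), Counter()
    for line in lines:
        match = REMOVED.search(line)
        if match:
            by_slot[int(match.group("slot"))] += 1
            by_fw[match.group("fw")] += 1
    return by_slot, by_fw

# First line is the event from the post; the other two are INVENTED
# placeholders in the same format, only here to exercise the parser.
SAMPLE = [
    "Physical drive removed: controller: 1 ( Adaptec 6805 #XXXXXXXXXXX "
    "Physical Slot: 1 ), channel: 0, deviceID: 9, enclosure ID: 0, "
    "slot ID: 1, WWN: XXXXXXXXXXXXXXXX, vendor: WD, "
    "model: WD2001FYYG-01SL3, S/N: XXXXXXXXXXXX, firmware level: VR02.",
    "Physical drive removed: controller: 1 ( Adaptec 6805 #XXXXXXXXXXX "
    "Physical Slot: 1 ), channel: 0, deviceID: 11, enclosure ID: 0, "
    "slot ID: 3, WWN: XXXXXXXXXXXXXXXX, vendor: WD, "
    "model: WD2001FYYG-01SL3, S/N: XXXXXXXXXXXX, firmware level: VR07.",
    "Some unrelated event line that should be ignored.",
]

if __name__ == "__main__":
    slots, firmware = tally_removals(SAMPLE)
    print("Removals per slot:", dict(slots))
    print("Removals per firmware:", dict(firmware))
```

If the real logs show removals spread evenly across slots and both firmware levels, that would point away from any single drive or bay and back toward the controller / expander pairing.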
When this machine was first built it was running ONLY a boot disk over the Adaptec controller as a Simple Volume, and it BSOD'd pretty frequently, though it did limp along. Troubleshooting it was impossible in that setup -- eventually I deployed a fresh install of Server 2008 to a separate SATA drive on the motherboard so I could at least test the RAID, and have since discovered that *nothing* will stay alive on this controller / expander.
I have tried updating every piece of firmware on every piece of hardware I can find, tried different combinations of drive models, tried every PCI slot, disabled every piece of hardware not directly in line with the SAS drives, replaced cabling, replaced drives, replaced the 6805 controller with another unit (under warranty), etc. I have no idea what is left to test, short of buying a new brand of controller or a new backplane (both of which are cost prohibitive for the role this system is being repurposed for).
** SYMPTOMS **
After creating any new logical device (via either the Adaptec BIOS utility or maxView Storage Manager), performance is extremely slow (<75MB/s read/write on single drives and all types of arrays) and erratic, and if pushed (via the ATTO or AJA disk benchmark utilities) the drive will disconnect within minutes. If left relatively idle, the drives may stay online for as long as several days, but they will disconnect eventually.
This behavior is easier to replicate in Windows (where I can push some I/O) but even if you leave the system in the BIOS utility with a freshly built logical device (no OS partitions), the drives will eventually disconnect.
I have tried creating JBODs, Simple Volumes, RAID 0, RAID 1, and RAID 5 arrays with 5, 4, 3, 2, and 1 disk arrangements, swapping disks in and out of these tests and into different slots on the expander. I have tried different cabling arrangements, using the J1 (Aux) port on the backplane, different cables, different brands of cables, etc.
** ADDITIONAL BACKGROUND **
The machine was built by BOXX in early 2013, and they have not had much luck helping me out. Neither has Adaptec, with whom we don't have a direct support contract; I've only been able to speak with them via their "ASK" system. They also recommended getting a replacement 6805 card, which did not help. We bought the machine from BOXX as a renderfarm manager, so it didn't need much local storage and, as I said earlier, it limped along in that capacity for a year until we decided to repurpose it as a file server.
I'm not an IT / admin or hardware professional, I'm just the studio's resident nerd. I have about 15 years of experience building custom systems for film and animation production and have been running RAIDs on various hardware for a long time, but I am by no means an expert and, quite frankly, this one has defeated me completely.
Any advice from the community would be hugely appreciated. I have logs from the most recent round of testing available if anyone is interested and would like more detail or clarification.
Thanks in advance for your time!