
How to build a 40TB file server

The single most valuable asset at Oyster.com is our photo collection. Take away the intellectual property and what's left is, essentially, markup (with a bit of backend to snazz it up). So we need a solid backup solution for the original high-res photos. The old servers were about to run out of capacity, and their slightly outdated specs did not make transferring huge datasets any easier or faster. Thinking a few months ahead, we were looking at a 40TB data set. In strict accordance with the KISS methodology, we opted against LTO and S3 and decided to build a big BOX. (For starters, 40TB on S3 costs around $60,000 annually. The components to build the Box: about 1/10th of that.)

Areca 1882ix-24 RAID Controller

Coincidentally, a great new product was just about to hit the market, reinforcing our decision with its timely relevance: the dual-core Areca ARC-1882ix RAID Host Bus Adapter, which comes with an on-board DDR3 SDRAM socket supporting chips up to 4GB. We had already opted for RAID Level 6 (striped, with distributed parity and error checking, tolerating two simultaneous disk failures), and since the dual-core RAID-on-Chip processes two streams of parity calculations simultaneously, it seemed ideal.
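For a rough sense of the RAID 6 trade-off, here is a quick back-of-the-envelope sketch. The drive count matches our 24 bays, but the per-disk capacity is purely illustrative rather than our exact bill of materials.

```python
def raid6_usable_tb(num_disks: int, disk_tb: float) -> float:
    """RAID 6 reserves two disks' worth of capacity for parity, so usable
    space is (n - 2) * disk_size, regardless of which two disks fail."""
    if num_disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (num_disks - 2) * disk_tb

# Illustrative numbers only: 24 bays populated with hypothetical 2TB drives.
print(raid6_usable_tb(24, 2.0))  # 44.0 TB usable -- comfortably above the 40TB target
```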

The first challenge in putting together the big box was getting the internal SAS connectors properly seated in the backplane adapter sockets, the bottom few being especially cumbersome to reach. Thankfully, our hardware technicians' exceptional manual dexterity made disassembling the internal fan panel frame unnecessary.

Internal mini-SAS connectors (SFF-8087)

The housing assembly comes with six individual backplanes, each accommodating four SAS or SATA disks. Each backplane is secured to the drive bay assembly with three thumbscrews, their shape and material designed to fall within the required torque range when screwed on "as tight as possible." As we found out the hard way, it is absolutely critical that each backplane card is seated fully snug in its slot and secured dead tight with the thumbscrews. A shoddy connection, it turns out, is not always immediately obvious. One backplane card that had not been secured quite tight enough caused intermittent timeouts on a particular drive bay and degraded overall system performance; because the array remained functional, troubleshooting was an opaque nightmare.

One of the six backplane cards

One of the most important differences between this system and a run-of-the-mill high-performance enterprise server with a couple of hard disks is everything that gets added: six backplane cards, one 24-channel RAID controller, 24 hard disks, and the internal connectors between them, each one a new potential point of failure (at least 37 additional ones in total). Every component's installation must strictly conform to spec, because small deviations compound quickly in a system this tightly coupled, turning into problems that persist for hours, days, or weeks, and into many more lost megabytes per second.

If the configuration of every component is optimized, the small individual gains add up to a significant performance boost. At the risk of stating the obvious, things are much less likely to go wrong in a stable, fine-tuned system that performs at full capacity.

Big Box with 24 hot-swappable drive bays

One of those things is aligning the physical array dimensions with the file system's allocation units. In our case, with a stripe size of 64K (since most of our image files are relatively large) on 512-byte blocks, we also format with a 64K cluster ("allocation unit") size. This eliminates the overhead of the RAID controller's on-board logic having to keep the logical disk in sync with the physical layout.
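The relationship is simple enough to sanity-check with a few lines of code. The stripe and cluster sizes below are the 64K values we use; the partition offset is a hypothetical figure, since the exact offset depends on how the volume was created.

```python
def is_aligned(partition_offset_bytes: int, cluster_bytes: int, stripe_bytes: int) -> bool:
    """A single-cluster write stays inside one stripe only if the cluster size
    divides the stripe size evenly and the partition starts on a stripe boundary."""
    return (stripe_bytes % cluster_bytes == 0) and (partition_offset_bytes % stripe_bytes == 0)

STRIPE = 64 * 1024    # 64K stripe size configured on the controller
CLUSTER = 64 * 1024   # 64K NTFS allocation unit size
OFFSET = 1024 * 1024  # hypothetical 1MiB partition offset

print(is_aligned(OFFSET, CLUSTER, STRIPE))  # True -- no cluster straddles two stripes
```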

A few important things about the driver must be mentioned. Windows inherited its native SAS/SATA RAID controller driver framework from parallel SCSI (SCSIport), and it has several serious drawbacks. I stumbled upon an interesting investigative white paper which goes into great detail about these issues. The preferred driver model for modern SATA/SAS cards is STORport, introduced to address the inadequacies of SCSIport and the SCSI-protocol limitations it inherited. The manufacturer's STORport driver is not certified by Microsoft, so by default the OS installs the inferior SCSIport driver. Switching to the STORport driver visibly improved stability and performance during the project's earliest stages, instantly bumping write speed by several dozen MB/s.

Gig-E is the bottleneck

Having spent some extra time on research, fine-tuning, and optimization, we were glad to find that the gigabit network, rather than the all-too-often disappointing I/O, had become the new server's bottleneck. By aggregating both on-board gigabit network interfaces we can expect transfer rates of around 200MB/s (the disks themselves are 3Gbps), which is plenty for maintaining a light, low-maintenance incremental backup. Another big advantage of this system in terms of capacity scaling is the external SAS connector, which can accommodate another external box of disks to expand the array into. While not the fastest solution possible, it strikes a great balance between performance, value, and reliability (redundancy), which is exactly what we are looking for.
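The 200MB/s figure follows from simple unit conversion. The sketch below is just that arithmetic; the efficiency factor is an assumed allowance for protocol and link-aggregation overhead, not a measured number.

```python
def aggregate_throughput_mb_s(links: int, gbps_per_link: float, efficiency: float = 0.8) -> float:
    """Convert aggregated link speed from gigabits/s to megabytes/s.
    1 Gbps = 1000 Mbit/s = 125 MB/s on the wire, before any overhead."""
    raw_mb_s = links * gbps_per_link * 125
    return raw_mb_s * efficiency

# Two bonded on-board gigabit NICs; 0.8 is an assumed real-world efficiency.
print(aggregate_throughput_mb_s(2, 1.0))  # 200.0 MB/s -- in line with the expected rate
```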