How to build a 40TB file server

by Anton on December 21, 2011

The single most valuable asset at Oyster.com is our photo collection. Take away the intellectual property and what’s left is, essentially, markup (with a bit of backend to snazz it up). So we need a solid backup solution for the original high-res photos. The old servers were about to run out of capacity, and their slightly outdated specs did not make transferring huge datasets any easier or faster. Thinking a few months ahead, we were looking at a 40TB data set. In strict accordance with the KISS methodology, we opted against LTO and S3 and decided to build a big BOX. (For starters, 40TB on S3 costs around $60,000 annually. The components to build the Box — about 1/10th of that.)
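
As a rough sanity check on that figure, here is the arithmetic, assuming 2011-era S3 standard storage pricing of roughly $0.125 per GB-month and ignoring request and transfer fees:

```python
# Rough S3 cost estimate -- assumed 2011-era pricing, storage only.
tb = 40
gb = tb * 1024
price_per_gb_month = 0.125               # assumption: ~2011 S3 standard storage tier
annual_cost = gb * price_per_gb_month * 12
print(f"~${annual_cost:,.0f} per year")  # ~$61,440, in the ballpark of $60k
```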

Areca 1882ix-24 RAID Controller

Coincidentally, a great new product was just about to hit the market, reinforcing our decision with its timely relevance: the dual-core Areca ARC-1882ix RAID Host Bus Adapter, which comes with an on-board DDR3 SDRAM socket supporting modules of up to 4GB. Since we had already opted for RAID Level 6 (striped, with distributed parity and error checking, tolerating two simultaneous disk failures), and the dual-core RAID-On-Chip can process the two streams of parity calculations simultaneously, it seemed ideal.
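
For a sense of what those two parity streams are, here is a minimal Python sketch of RAID 6's P/Q parity over a single stripe. It is illustrative only: the controller does this in silicon, and real implementations use lookup tables rather than bit-by-bit GF(2^8) multiplication.

```python
# Minimal sketch of RAID 6 dual parity (P + Q) for one stripe.
# P is the plain XOR of the data blocks; Q is a Reed-Solomon syndrome over
# GF(2^8), which is what lets the array survive any two simultaneous failures.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using the polynomial 0x11D."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
    return product

def raid6_parity(data_blocks):
    """Compute the P and Q parity blocks for a list of equal-sized data blocks."""
    size = len(data_blocks[0])
    p, q = bytearray(size), bytearray(size)
    for index, block in enumerate(data_blocks):
        g = 1
        for _ in range(index):        # generator for this disk: 2**index in GF(2^8)
            g = gf_mul(g, 2)
        for offset in range(size):
            p[offset] ^= block[offset]
            q[offset] ^= gf_mul(g, block[offset])
    return bytes(p), bytes(q)

# Example: a stripe spread over four data disks, eight bytes per block.
stripe = [bytes([d] * 8) for d in (0x11, 0x22, 0x33, 0x44)]
p, q = raid6_parity(stripe)
print(p.hex(), q.hex())
```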

The first challenge in putting the big box together was getting the internal SAS connectors properly seated into the backplane adapter sockets, the bottom few being especially cumbersome to reach. Thankfully, our hardware technicians’ exceptional manual dexterity made disassembling the internal fan panel frame unnecessary.

Internal mini-SAS connectors (SFF-8087)

The housing assembly comes with six individual backplanes, each accommodating four SAS or SATA disks. Each backplane is secured to the drive bay assembly with three thumbscrews, their shape and material designed to fall within the required torque range when screwed on “as tight as possible.” As we found out the hard way, it is absolutely critical to ensure that each of the backplane cards is seated fully snug in its slot and secured dead tight with the thumbscrews. A shoddy connection, it turns out, is not always immediately obvious. We observed intermittent timeouts on a particular drive bay, as well as degraded overall system performance, caused by one of the backplane cards not having been secured quite tightly enough; the array remained functional the whole time, which made troubleshooting an opaque nightmare.

One of the six backplane cards

One of the most important differences between this system and a run-of-the-mill high-performance enterprise server with a couple of hard disks is the sheer number of added components: six backplane cards, one 24-channel RAID controller, 24 hard disks, and the internal connectors, each a new potential point of failure (at least 37 additional ones in total). Every single component’s installation must strictly conform to spec; in a system this densely interconnected, small deviations compound quickly, costing hours, days, or weeks of troubleshooting and many lost megabytes per second.

If the configuration of all components is optimized, the small individual gains add up to a significant performance boost. At the risk of stating the obvious, things are much less likely to go wrong in a stable, fine-tuned system which performs at max capacity.

Big Box with 24 hot-swappable drive bays

One of those things is aligning the physical array geometry with the file system’s allocation units. In our case, with a stripe size of 64K (since most of our image files are relatively large) on 512-byte sectors, we also format with a 64K cluster (“allocation unit”) size. This spares the RAID controller’s on-board logic the overhead of keeping the logical disk layout in sync with the physical one.
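
A toy model makes the payoff of that alignment concrete. The sketch below (illustrative numbers and a hypothetical helper, not anything from our actual setup) shows how a single cluster write stays on one stripe unit when the partition is aligned, but straddles two when it is not:

```python
# Toy model: does one 64K cluster write map onto one stripe unit or two?
STRIPE_UNIT = 64 * 1024   # bytes written to one disk before the controller moves on
CLUSTER = 64 * 1024       # NTFS allocation unit chosen at format time

def stripe_units_touched(cluster_index, partition_offset, data_disks=22):
    """Return the set of data disks a single cluster write lands on."""
    start = partition_offset + cluster_index * CLUSTER
    end = start + CLUSTER - 1
    return {(start // STRIPE_UNIT) % data_disks, (end // STRIPE_UNIT) % data_disks}

# Aligned partition (1 MiB offset): each cluster sits in exactly one stripe unit.
print(stripe_units_touched(5, partition_offset=1024 * 1024))   # one disk
# Misaligned partition (legacy 63-sector offset): clusters straddle two stripe units.
print(stripe_units_touched(5, partition_offset=63 * 512))      # two disks
```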

A few important things about the driver must be mentioned. Windows operating systems inherited their native SAS/SATA RAID controller driver framework (SCSIport) from SCSI technology, and it has several serious drawbacks. I stumbled upon an interesting investigative white paper which goes into great detail about these issues. The preferred driver for modern SATA/SAS cards is the STORport driver, developed in response to the inadequate state of the native drivers, which inherited the limitations of the SCSI protocol. The vendor’s STORport driver is not certified by Microsoft, so by default the OS installs the inferior SCSIport driver. Switching to the STORport driver visibly improved stability and performance during the project’s earliest stages, instantly bumping write speed by several dozen MB/sec.

Gig-E is the bottleneck

Having spent some extra time on research, fine-tuning, and optimizing the new server, we were glad to find that the gigabit network had become the bottleneck, rather than the all-too-commonly disappointing I/O. By aggregating both on-board gigabit network interfaces we can expect transfer rates of around 200MB/s (the disks are 3Gbps), which is plenty for maintaining a light, low-maintenance incremental backup. Another big advantage of this system in terms of capacity scaling is the external SAS connector, which can accommodate another external box of disks to expand our array into. While not the fastest solution possible, it strikes a great balance between performance, value, and reliability (redundancy), which is exactly what we are looking for.
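
For a rough sense of what that buys us (the 80% efficiency figure below is an assumption for protocol overhead, not a measurement):

```python
# Back-of-the-envelope throughput for two aggregated Gig-E links.
link_bytes_per_sec = 1_000_000_000 / 8   # one gigabit link is ~125 MB/s of raw payload
links = 2
efficiency = 0.8                         # assumed protocol/SMB overhead
aggregate = links * link_bytes_per_sec * efficiency
dataset_bytes = 40e12                    # the full 40TB set, worst case
days = dataset_bytes / aggregate / 86400
print(f"~{aggregate / 1e6:.0f} MB/s aggregate; a full 40TB copy would take ~{days:.1f} days")
```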

  • jberg

    Why not use linux and bacula?

    • clone1018

      They’re just tools, use what ya like.

    • Matt Freeman

      Out of the box, it’s probably one thing I would consider Windows Server for (well, Storage Server edition or whatever it is called – with no CALs). Plug and play.

      • Matt

        There are plenty of Linux distributions that are set up from the get-go for SAN / NAS servers. A few clicks in the GUI, or edit a few text files, and you are ready to go: iSCSI, AoE, SMB, NFS, etc. Openfiler is one that comes to mind. Of course, you don’t need to worry about software costs.

  • mark54g

    Was a distributed system impossible to manage? Why go with such a monolithic system? Hadoop looks like it would be more ideal.

  • http://dar.sh darsh

    Would love to know which drive model you decided on for this project.

    • anon

      You typically want to have different drive vendors and different manufacture dates to prevent them all (or too many) failing at the same time.

  • http://lucisferre.net Chris Nicola

    Not to harp on TCO stuff, but have you worked out how much build time and maintenance labour is going into your solution?

    • Fadzlan

      Yes, and what about redundancy? What about off site redundancy? And what about the manpower to handle that?

  • Jean

    What about backups/off-site? A burning big box wouldn’t be beautiful.

  • jon w

    S3 gives you multi-host, multi-region redundancy. Putting it all in one box is asking for trouble. What if the raid controller grows a bug and corrupts on write? It’s happened. What if there’s a fire in the building that has both your server and your back-up? Eggs, meet basket!

    • none

      So then build another $6000 box in another building to act as a clone/backup. It’s still a lot cheaper.

  • http://www.sdragons.org Hugh S. Myers

    Contrary to the nay-sayers who’ve posted, I think you’ve done the right thing. And done it in fairly spectacular fashion. I suppose part of my positive eval is due to a built in hardware bias, but I really don’t like the idea of putting the company fortune in someone else’s hands—hands that don’t know or care about what they have a hold of…

    • trust

      At some point you have to trust someone.

      If you build your own, you have to trust your power company, hardware manufacturer, and colocation service provider.

      Might as well move your trust relationship up the application stack as far as possible and focus your time and energy on your business’s core competency.

      • Chris Gonnerman

        I disagree. Not on the trust issue… it’s just a matter of, do I let my data out of my sight or not? I’ve seen too many situations where an outsourced service provider has been cruising down the highway to Hell, but you couldn’t see that from the outside… and so when they well and truly went to Hell, the customer didn’t know until it was too late. If your server is in your facility, you have means to know whether it’s being managed right or not. If it’s in their facility, you really don’t know.

  • http://minecraftforum.net/ WedTM

    It is an awesome build; however, I tend to agree that you’ve forgotten the maintenance, the redundancy, the “what ifs”.

    Not to mention that S3 grows with you. When you max out on your big box, BAM need another one, just to store a single additional photo.

  • wormik

    Two thumbs up! Don’t get messed up with the _expensive_ cloud solutions offered nowadays. You can always build a custom, horizontally scalable solution using some DFS – when you need it. IMHO the maintenance costs of this solution will be rather low as long as you go with some 4hr HW replacement support plan (which will not be so costly). I’d set aside 2 man-days per month for maintenance, which would be some 24k / year.

    best wishes

  • Phufighter

    Heh. I think this is awesome. Good job! I do share some concerns with others about the many points of failure, and the fact that it is a single copy, but as a startup, it’s important to watch costs now. I guess the next thing you can think of is how to use multiple ethernet connections to speed things up a bit :)

    And while 100MByte/sec is fine for a home system, I always shake my head a bit when I see how fast mid-level enterprise disk systems run. I’m testing one now that tops out at 5300 MBytes/sec on sequential writes! But with 96 drives in RAID 6… Anyway – good job! :)

  • charlie

    the problem w/ RAID systems is once you hit the max limit. ZFS on FreeBSD FTW!

  • http://www.zadajpytanie.net Porady

    What about TCO? (cost of server, cost of energy, sysadmin salary etc.)

    • Andreas Karlsson

      The TCO must be much lower than using S3. If we say this solution will be used over the course of three years, then S3 is 60k yearly while this machine is 2k + maintenance, energy, sysadmin, …

      While I do not know the exact costs, I can easily see how you could spend less than 58k on all those extras. Especially since I assume they have to run more than one server in the company. If this is their only server, it is no longer as obvious that it will be much cheaper than S3.

  • stan

    I think you did the right thing and have done something similar for video. However, I would strongly suggest looking into incorporating a Cache-A LTO system alongside the big drive for archive.
    Backup and archive are not the same, and ideally you need both. Add a catalog system to your flow and you’re in good shape, so you can find your tape archives and backup materials.

  • http://www.jonathanlevin.co.uk Jonathan Levin

    I love the idea.
    Personally, I would have changed the OS and filesystem to ZFS.
    I might also have stuck a FusionIO in there as a read cache. Although 4GB of cache on the RAID card does look very good.
    Good job!
    Tell us how it goes.

  • http://vidarara.com jp

    But what if the room where that server is gets on fire? Are you considering having another one in a different location in sync w/this one?

    Cheers,

    jp

  • David Bernick

    So if 2 drives fail, the box is dead, right? There’s not a small likelihood that two drives will fail simultaneously (or during rebuild), especially as it takes a few days to sync a large hard drive. This is why erasure coding is a good thing.

    • Andreas Karlsson

      No, three disks have to die, since they are running RAID 6. Running RAID 5 would, for the reasons you specify, be insane with this many disks, since a disk failing during recovery is not too uncommon.

      • http://davidweekly.org/ David E. Weekly

        Parity RAID is one of those ideas that sounds really good in theory but is a terrible idea in practice. Hard drives from a given batch tend to fail around the same time; rebuilds take a long time – *especially* when the drives are still busy serving production data. Your “window of vulnerability” is therefore a lot larger than most people estimate – right when your other drives (likely from the same batch) are most likely to kick the bucket.

        Put another way: data is valuable, storage is cheap. Parity is a way of trying to weasel out of having to store a whole extra copy of the data, but it’s a false economy. If these 40TB of photos are really 100% of the value of your company, do you really want to put all of your jewels in one, shiny $6k basket? A smarter decision would have been to instead buy three separate storage servers (and a hot spare) and use e.g. mogilefs to keep three copies of every image you have, each one on a different server. Yes, you’d have to spend 3-4 times as much on hard drives, but we’re still talking a one-time spend of under $20k to preserve and deliver _the whole value of your business_. Seems an obvious tradeoff to me.

  • Tom

    S3 would cost more, but you could also use it to cut your server load down by serving images directly from S3. More importantly, if the company should change hands, a custom-built RAID thingy will be a liability, not an asset.

  • behindtext

    first off, nicely done, i love to build stuff too :). some thoughts

    - have you looked into the backblaze pod hardware architecture? it’s cheap to build but takes a bit of magic to maintain, cannot speak to windows support for the backplane adapters

    - avoiding hardware RAID tends to be a good thing: if your RAID card dies, you really should have a cold spare of the exact same type of card to recover smoothly. i have had a number of raid cards die on me.

    - using FreeBSD or DragonflyBSD can help you avoid using hw RAID by using FreeBSD + ZFS or DragonflyBSD + HammerFS. your admin might not be a *nix person, so i understand not using it

  • RaelC

    Have you looked at Coraid? Their ATA-over-Ethernet solution offers the lowest cost/GB and very good horizontal scalability. Many large installations (PB-sized clusters) currently use Coraid as either primary or backup storage.

  • http://about.me/sbram Scott B

    Comparing this solution to S3 is like comparing an apple to an apple tree…

  • Anton

    Didn’t mean to give the impression that this is the *only* backup, it is not. It is the “warmest” one — first line. It does not share the same physical location with the primary storage box either, but is less than 2 miles away. So we can easily have a 2 hour recovery time without throwing away those astronomical monthly service fees. (Although many technologists will always prefer paying big bucks for the comfort of being cushioned from every angle by SLAs and such — nothing wrong with that, just a different approach.)

    As far as TCO goes, it cannot get any lower, since we already have one or two systems guys handling *all* servers, as well as office workstations, etc. This backup box takes up such a small fraction of their time that it’s almost negligible — several thousand annually at most. The same goes for power, etc. — it is just one of *many* servers.

    The disks are all Enterprise Class 2TB SATA-II, several different models. We were purchasing them right after the monsoon floods in Thailand constricted supply so our choices were somewhat limited as time was a factor.

    RAID 6 has come a long way since its inception, but it is still a trade-off between raw storage capacity and processor utilization. The HW RAID industry is now old enough that we no longer have to wait for new products to mature, as we did when the technology itself was in its infancy. Old habits certainly die hard, but getting the “latest and greatest” was a conscious choice made for this specific problem, not submission to some immature fascination with “elite” new products, or however that may be… This card has the best specs for RAID 6 currently on the market — bottom line, period.

    Big kudos to all who made suggestions and participated in the discussion; keep it coming!


  • barky81

    So… the website that serves all this image data — does it also run on a single box? And is it, too, in your office? And what, you have a single cableco internet connection, too?

    Oh, no? Then maybe a comprehensive TCO analysis might be less one-sided than you make it appear here…

    I used to be a “box guy” too (Why else would anyone read this?)…but I am in recovery.

    • Anton

      There is always a “box guy” at the other end of any outsourced business scheme. It’s a question of how many entities stand between you and the box. We prefer to handle it ourselves — this way there is no one to blame if something goes wrong, which makes it less likely to happen. For big corporations the opposite is often true — they need someone to blame because their fragmented, bloated infrastructure becomes impossible to efficiently manage.

      Regarding the “website” and “cableco” — this article is about a server for the raw original high-res images before they get processed for web, i.e. watermarked, re-sized, etc.. It is separated relatively far both in terms of infrastructure and workflow from the nginx-based web server cluster which actually serves the images. For a reference point, the colo provider maintains 6MW of generator-backed utility power. Hopefully this gives you a better idea of our facilities.

      A “comprehensive TCO analysis” in this case may very well end up costing more than the process being analyzed.

  • http://www.itoctopus.com itoctopus

    Probably in a decade from now that file server will be the hard disk of a smartphone!

  • Nocko

    The Areca was a bad call. We’ve had two Areca 1222s fail in the same month in separate DCs. They have zero support resources in the States. Warranty service means a slow trip to Asia. Keep extra cards on hand or buy from a different company (we’ve replaced our Areca cards with 3ware/LSI). Good luck!

  • Dude

    A. CrashPlan?
    B. The question is if you’ll be happy 3 years from now.

  • vask

    Could you please expand on “40TB on S3 costs around $60,000 annually. The components to build the Box — about 1/10th of that”?

    Could you provide a list of the hardware (no specific brands) and their corresponding costs?
    For the 24 x SAS 2TB HDDs alone, I calculate about $10,000…


  • سنوسي اردي

    thanks
