If you use Python on Windows and have programs or servers that allocate a lot of items on the heap (we do both), you should upgrade to Python 2.7.4, especially if you do anything with HTTPS/SSL connections.

Python versions 2.7.3 and below use an older version of OpenSSL, which has a serious bug that can cause minutes-long, CPU-bound hangs in your Python process. Apart from the process taking over your CPU, the symptom we saw was a socket.error with the message “[Errno 10054] An existing connection was forcibly closed by the remote host”. This is because the HTTPS request is opened before the OpenSSL hang kicks in, and it takes so long that the remote server times out and closes the connection.

The cause of the bug is actually quite arcane: the Windows version of OpenSSL uses a Win32 function called Heap32Next to walk the heap and generate random data for cryptographic purposes.

However, a call to Heap32Next is O(N) if there are N items in the heap, so walking the whole heap is an O(N²) operation! With 10 million items on the heap, this takes about 5 minutes. The first connection to an HTTPS server (which uses OpenSSL) essentially brings Python to a grinding halt for this time.
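To make the quadratic blow-up concrete, here is a toy cost model in Python. It is not the Win32 implementation: it only assumes, as described above, that each Heap32Next-style call costs work proportional to the number of items on the heap. The function name and the step counting are purely illustrative.

```python
def heap_walk_steps(n_items):
    """Count steps to visit every item when each 'next' call costs n_items steps.

    Toy model only: it stands in for a walk where every call is O(N).
    """
    steps = 0
    for _ in range(n_items):   # one Heap32Next-style call per item
        steps += n_items       # each call is assumed to cost O(N)
    return steps


# Doubling the number of items quadruples the total work.
small = heap_walk_steps(1000)
large = heap_walk_steps(2000)
print(large // small)
```

Under this model, going from 1 million to 10 million heap items multiplies the work by 100, which is roughly how a barely noticeable pause turns into minutes.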

There’s a workaround: call the ssl.RAND_status() function on startup, before you’ve allocated the big data on your heap. That seemed to fix it, though we didn’t dig too deep to guarantee the fix. We were still running on Python 2.6, and given that the just-released 2.7.4 addressed this issue by using a newer version of OpenSSL, we fixed this by simply upgrading to Python 2.7.4. Note that even Python 2.7.3 has the older version of OpenSSL, so be careful.
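A minimal sketch of that workaround, assuming it runs as the very first thing at startup, before the large caches are loaded. ssl.RAND_status() is a real standard-library call; the wrapper name touch_openssl_rng is made up for illustration, and as noted above we did not verify that this fully avoids the hang.

```python
import ssl


def touch_openssl_rng():
    """Ask OpenSSL for its RNG status so any seeding work happens right now."""
    return ssl.RAND_status()


# Call this first thing at startup, while the heap is still small.
seeded = touch_openssl_rng()
```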

Other interesting things we found while hunting down this bug:

  • At first we thought this was a bug in Python’s SSL handling, and it turns out there’s a strangely similar bug in Python 2.6’s SSL module. This was interesting, but it wasn’t our problem.
  • Microsoft’s Raymond Chen has a very good historical explanation of why walking the heap with Heap32Next is O(N²), and why OpenSSL shouldn’t really be using this function.
  • You can reproduce the Heap32Next hang just by allocating a ton of Python objects (e.g. x = [{i: i} for i in range(500000)]) and seeing the first HTTPS request take ages, with the CPU sitting at around 100%.
  • A blog post with graphs showing Heap32Next’s O(N) behaviour, as well as the connection to OpenSSL.
  • What’s new in Python 2.7.4 notes the update to the bug-fixed OpenSSL version 0.9.8y on Windows.
  • This is the second bug we’ve found due to running something of an eccentric architecture (6GB of website data cached in Python dicts). The other one was related to garbage collection, and incidentally the handling of that was improved in Python 2.7 too. Yes, I know, somebody will leave a comment about how we should be using memcached for this, and they’d probably be right, except for this. :-)


Most of Oyster.com is powered by Python and web.py, but — perhaps surprisingly — this is the first time we’ve had to think about garbage collection. Actually, I think the fact that we’ve only run into this issue after several years on the platform is pretty good. So here’s the saga…

Observing a system alters its state

It started when we noticed a handful of “upstream connection refused” lines in our nginx error logs. Every so often, our Python-based web servers were not responding in a timely fashion, causing timeouts or errors for about 0.2% of requests.

Thankfully I was able to reproduce it on my development machine — always good to have a nice, well-behaved bug. I had just narrowed it down to our template rendering, and was about to blame the Cheetah rendering engine, when all of a sudden the bug moved to some other place in the code. Drat, a Heisenbug!

But not at all random

It wasn’t related to rendering at all, of course, and after pursuing plenty of red herrings, I noticed it was happening not just randomly across 0.2% of requests, but (when hitting only our homepage) exactly every 445 requests. On such requests, it’d take 4.5 seconds to render the page instead of the usual 15 milliseconds.

But it can’t be garbage collection, I said to myself, because Python uses simple, predictable reference counting for its garbage handling. Well, that’s true, but it also has a “real” garbage collector to supplement the reference counting by detecting reference cycles. For example, if object A refers to object B, which directly or indirectly refers back to object A, the reference counts won’t hit zero and the objects will never be freed — that’s where the collector kicks in.
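Here is a small, self-contained illustration of that A-to-B-to-A situation. The Node class and the function name are invented for the example; the cycle collector is switched off briefly so the demonstration is deterministic, and it relies on CPython's reference-counting behaviour.

```python
import gc
import weakref


class Node:
    """Tiny object that can hold a reference to another Node."""


def cycle_survives_refcounting():
    """Return (alive_after_del, alive_after_collect) for a two-object cycle."""
    was_enabled = gc.isenabled()
    gc.disable()                 # keep the cycle collector out of the way for now
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a  # A -> B -> A: a reference cycle
        probe = weakref.ref(a)   # lets us check whether A is still alive
        del a, b                 # refcounts never reach zero because of the cycle
        alive_after_del = probe() is not None
        gc.collect()             # the cycle detector finds and frees both objects
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        if was_enabled:
            gc.enable()
```

The objects outlive the del statement and only disappear once gc.collect() runs, which is exactly the job of the supplemental collector.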

Sure enough, when I disabled the supplemental GC the problem magically went away.

A RAM-hungry architecture

Stepping back a little, I’ll note that we run a slightly unusual architecture. We cache the entire website and all our page metadata in local Python objects (giant dict objects and other data structures), which means each server process uses about 6GB of RAM and allocates about 10 million Python objects. This is loaded into RAM on startup — and yes, allocating and creating 10M objects takes a while. You’re thinking there are almost certainly better ways to do that, and you’re probably right. However, we made a speed-vs-memory tradeoff when we designed this, and on the whole it’s worked very well for us.

But when the garbage collector does decide to do a full collection, which happened to be every 445 requests with our allocation pattern, it has to linearly scan through all the objects and do its GC magic on them. Even if visiting each object takes only a couple hundred nanoseconds, with 10 million objects that adds up to multiple seconds pretty quickly.

Our solution

Response time (ms) vs time, before and after the fix

So what’s the solution? We couldn’t just disable the GC, as we do have some reference cycles that need to be freed, and we can’t have that memory just leaking. But it’s a relatively small number of objects, so our short-term fix was simply to bump up the collection thresholds by a factor of 1000, reducing the number of full collections so they happen only once in a blue moon.
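A sketch of what raising the thresholds can look like with the standard gc module. Exactly which of the three thresholds we scaled is not spelled out above, so this example makes an assumption and scales only the first one; raise_gc_thresholds is a hypothetical helper name.

```python
import gc


def raise_gc_thresholds(factor=1000):
    """Multiply the generation-0 threshold so collections are triggered less often."""
    gen0, gen1, gen2 = gc.get_threshold()
    gc.set_threshold(gen0 * factor, gen1, gen2)
    return gc.get_threshold()
```

Because collections of the older generations are triggered by counting collections of the younger ones, raising the first threshold also makes full collections correspondingly rarer.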

The longer-term, “correct” fix (assuming we decide to implement it) will be to wait till the GC counts near the thresholds, then temporarily stop the process receiving requests and do a manual collection, and then start serving again. Because we have many server processes, nginx will automatically move to the next process if one of them’s not listening due to this full garbage collection.
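That longer-term idea might be sketched roughly as follows. Everything here is an assumption rather than something we run: the stop_accepting and start_accepting callbacks are hypothetical hooks into the server, and watching the oldest generation's counter with a 90% margin is only one plausible reading of "near the thresholds" (CPython's real trigger for a full collection has additional conditions).

```python
import gc


def full_collection_is_near(counts, thresholds, margin=0.9):
    """True once the oldest generation's counter is within `margin` of its threshold."""
    return counts[2] >= thresholds[2] * margin


def collect_out_of_rotation(stop_accepting, start_accepting):
    """Stop taking requests, run a full collection, then resume serving."""
    stop_accepting()
    try:
        return gc.collect()      # number of unreachable objects found
    finally:
        start_accepting()        # always rejoin, even if the collection raised


def maybe_collect(stop_accepting, start_accepting):
    """Intended to be called between requests."""
    if full_collection_is_near(gc.get_count(), gc.get_threshold()):
        return collect_out_of_rotation(stop_accepting, start_accepting)
    return None
```

While the process is out of rotation, nginx would route requests to the other server processes, so no request has to wait for the multi-second pause.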

One other thing we discovered along the way is that we can disable the GC when our server process starts up. Because we allocate and create so many objects on startup, the GC was actually doing many (pointless) full collections during the startup sequence. We now disable the collector while loading the caches on startup, then re-enable it once that’s done — this cut our startup time to about a third of what it had been.
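The startup trick can be wrapped in a small context manager. This is a minimal sketch: gc_paused and load_caches are illustrative names, and load_caches is only a stand-in for the real cache loading.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable the cycle collector for the duration of a bulk load."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()          # restore the collector once loading is done


def load_caches(n_items):
    """Stand-in for the real startup load: builds many small dicts."""
    return [{i: i} for i in range(n_items)]


with gc_paused():
    cache = load_caches(100000)
```

Reference counting still frees ordinary garbage while the collector is paused; only cycle detection is deferred until the block exits.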

To sum up

In short, when you have millions of Python objects on a long-running server, tune the garbage collector thresholds, or do a manual gc.collect() with the server out of the upstream loop.


Here at Oyster.com most of our codebase is written in Python, which is of course open source, and we use many open source libraries, including web.py, Babel, and lxml. Now it’s time to give a (tiny bit) back.

Three third-party services we use heavily are Sailthru, QuickBase, and myGengo. When we started with them, they didn’t have Python libraries available, or the Python libraries that were available kinda sucked, so we rolled our own.

Note that the idea here is not a fully-fledged API, but an “at least what we needed” wrapper. It may well be what you need, too, so check out the source code on GitHub. Below are some quick examples.

Sailthru

Sailthru is the email provider we use to send transactional and mass emails.

>>> import sailthru
>>> sailthru.send_mail('Welcome', 'bob@example.com', name='Bobby')
>>> sailthru.send_blast('Weekly Update', 'Newsletter', 'The CEO', 'ceo@example.com',
...                     'Your weekly update', '<p>This is a weekly update!</p>')

QuickBase

We use QuickBase as a nice user interface to enter and manage portions of our hotel data, and use this Python module to sync between our PostgreSQL database and QuickBase.

>>> import quickbase
>>> client = quickbase.Client(username, password, database='abcd1234')
>>> client.do_query("{'6'.EX.'Foo'}", columns='a')
[{'record_id': 1234, ...}]
>>> client.edit_record(1234, {6: 'Bar'}, named=False)
1

myGengo

Our site is (partially) translated into 5 languages, and myGengo provides an easy-to-use automated translation API. We also run the gettext strings in our web templates through myGengo.

>>> import mygengo
>>> client = mygengo.Client(api_key, private_key)
>>> client.get_account_balance()
'42.50'
>>> client.submit_job('This is a test', 'fr', auto_approve=True)
{'job_id': '1234', ...}
>>> client.get_job(1234)
{'body_tgt': "Il s'agit d'un test", ...}

Pull requests welcome

You’re welcome to send pull requests on our GitHub page, or comment below to send other feedback! If you like or use any of these APIs, we’d love to hear from you.


In March 2012, Apple announced and released the new iPad with a Retina display. There was only a short time between the announcement and when the device arrived in stores. This left a very short time for developers to get a working app out the door. And we wanted to take advantage of the wave of publicity, so we tried to get our app ready the same day the devices shipped.

There had been speculation for months that the next iPad would have a Retina display. But we didn’t want to start working on a Retina version without official specs. It wasn’t even confirmed that Apple would release a new iPad yet. Finally, on March 7th, Apple announced the new iPad with a Retina display. They released the official specs, and later that evening they released Xcode with a Retina simulator. The device would ship on March 16th. We had just over a week!

Porting to Retina

The biggest improvement that we cared about was the “Retina display”, Apple’s name for a screen whose pixels are so small that the average human eye can’t pick them out individually. This means that angled lines look smoother and more natural. The display has twice the resolution in both the vertical and horizontal directions. But the size of the screen remained the same, so each pixel covers one quarter of the area.

To minimize issues with the transition to Retina, Apple’s iOS APIs measure things in what they call “points”, rather than straight pixels. On the old iPads one point represents one pixel; on the new iPad one point represents four pixels (a 2×2 block).

Close-up on a non-Retina device

The same image on a Retina display

This makes it relatively simple to upgrade an app. Most of the work actually falls to the artists to generate larger sized images for UI elements. You bundle the new Retina assets in the app, and you don’t even have to modify the code — if the old asset is called “foo.png”, then name the new one “foo@2x.png”, and everything just magically works. But our app is photo heavy and relies on pulling down a lot of images off the web. This took some work to fix.

Huge photo sizes

The biggest amount of work involved making sure we request the right size photo. We have 31 different image sizes of each of our 750,000 hotel photos, and they are not all double the previous size. We chose to solve this in the client — if the app detects it’s running on a Retina device, it requests the Retina-ready photo.

The one challenge I ran into was some button icons that were being served by our web server. The original buttons were 16×16. We already had some 35×35 versions, and so the artist decided to use these, instead of creating new 32×32 ones. Since they were going on a button I figured it would be ok. Until I saw them … and they were just over twice as big as they should have been. The problem was iOS doubling the image size automatically. I didn’t want that — I wanted to use the same magic as UIImage does with “@2x” in the filename. The magic is in the “scale” attribute of the UIImage object. Set scale = 2 and the image is no longer doubled. I added this code to our image downloader, to mimic the “@2x” iOS supports.

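// Decode the downloaded bytes; UIImage assumes a scale of 1.0 by default.
UIImage *image = [[UIImage alloc] initWithData:imageData];
if ([url hasSuffix:@"@2x"])
{
    // Re-wrap the same bitmap at scale 2.0 so iOS treats it as a Retina asset.
    image = [UIImage imageWithCGImage:[image CGImage] scale:2.0 orientation:UIImageOrientationUp];
}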

But let’s not quadruple the file size

We did have to create one new size for our photos: the full-screen 2048×1536 images. This was no big deal, although it did take a few days to churn through all 750,000 hotel photos. But 2048×1536 is one big image, and contains a lot of data. With over 3 million pixels the image file sizes were four times as big as before. We didn’t want our app to become slower due to image load time, so we reduced the image quality on the larger sized images. We figured the Retina display would hide some of the JPEG artifacts, and settled on a 65% quality level. The resulting file size was only about 1.5 times as big as before (instead of 4 times).
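The numbers in this section can be checked with a few lines of arithmetic. This sketch assumes the previous full-screen size was half the Retina resolution in each direction, as described earlier; the variable names are illustrative.

```python
def pixel_count(width, height):
    """Number of pixels in an image of the given dimensions."""
    return width * height


retina = pixel_count(2048, 1536)
previous = pixel_count(2048 // 2, 1536 // 2)   # half the resolution in each direction

pixel_ratio = retina // previous               # how many times more pixels
bytes_per_pixel_ratio = 1.5 / pixel_ratio      # files grew 1.5x instead of 4x
print(retina, pixel_ratio, bytes_per_pixel_ratio)
```

In other words, at the 65% quality level each pixel costs a bit more than a third of the bytes it did before, which is what keeps downloads manageable.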

We worked fast and furious the last few days, to hit our goal of uploading to the app store on Friday. There were a few minor layout issues that had to be tweaked. While the simulator was nice (although I could only see about half of it on my 21-inch 1600×1200 monitor!) we had to test the program on a real device. But like everyone else we had to wait until they were released on Friday. After receiving the new iPad and verifying everything worked, we packed up the app and shipped it off.

Reviewed by Apple in six hours

Our app was verified by Apple in just six short hours! Our Retina version was available to the general public late Friday evening. Which means folks could see all of our thousands of awesome photos in glorious Retina display right from the very beginning. Also, being one of the first “retinized” apps, we made it on several of Apple’s top lists. This boosted our download rate immensely. It was worth the effort to take advantage of the small window Apple gave us, and makes our app that much better.


How to build a 40TB file server

by Anton on December 21, 2011

The single most valuable asset at Oyster.com is our photo collection. Take away the intellectual property and what’s left is, essentially, markup (with a bit of backend to snazz it up). So we need a solid backup solution for the original high-res photos. The old servers were about to run out of capacity and their slightly outdated specs did not make transferring huge datasets any easier or faster. Thinking a few months ahead, we were looking at a 40TB data set. In strict accordance with KISS methodology, we opted against LTO and S3, and decided to build a big BOX. (For starters, 40TB on S3 costs around $60,000 annually. The components to build the Box cost about 1/10th of that.)

Areca 1882ix-24 RAID Controller

Coincidentally, a great new product was just about to hit the market, and its timing reinforced our decision: the dual-core Areca ARC-1882ix RAID host bus adapter, which has an on-board DDR3 SDRAM socket supporting modules of up to 4GB. We had already opted for RAID Level 6 (striped, with distributed parity for error checking, tolerating two disk failures), and a dual-core RAID-on-Chip can process two streams of parity calculations simultaneously, so it seemed ideal.

The first challenge in putting together the big box was getting internal SAS connectors properly seated into the backplane adaptor sockets, the bottom few being especially cumbersome to reach. Thankfully, our hardware technicians’ exceptional manual dexterity rendered having to disassemble the internal fan panel frame unnecessary.

Internal mini-SAS connectors (SFF-8087)

The housing assembly comes with six individual backplanes, each accommodating four SAS or SATA disks. Each backplane is secured to the drive bay assembly with three thumbscrews, their shape and material designed to fall within the required torque range when screwed on “as tight as possible.” As we found out the hard way, it is absolutely critical to ensure that each of the backplane cards is seated ‘full snug’ in the slot and secured dead tight with the thumbscrews. A shoddy connection is not always immediately obvious, it turns out. We observed intermittent timeouts on a particular drive bay as well as degraded overall system performance caused by one of the backplane cards not having been secured quite tight enough — however, the array was still functional, making troubleshooting an opaque nightmare.

One of the six backplane cards

One of the most important differences between this system and your run-of-the-mill high-performance enterprise server with a couple of hard disks is the addition of six backplane cards, one 24-channel RAID controller, 24 hard disks, and internal connectors, each of them a new potential point of failure (at least 37 additional ones). Every single component’s installation must strictly conform to spec, as the delicately balanced system amplifies even fractional deviations, resulting in problems that persist for hours, days, and weeks, and in many lost megabytes per second.

If the configuration of all components is optimized, the small individual gains add up to a significant performance boost. At the risk of stating the obvious, things are much less likely to go wrong in a stable, fine-tuned system which performs at max capacity.

Big Box with 24 hot-swappable drive bays

One of those things is aligning the physical array dimensions with the file system’s allocation units. In our case the array uses a 64K stripe size (since most of our image files are relatively large) on 512-byte blocks, so we also format the volume with a 64K cluster (“allocation unit”) size. This eliminates the overhead of the RAID controller’s on-board logic having to keep the logical disk synced with the physical one.
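As a rough illustration of what "aligned" means here, the following sketch checks whether a cluster size divides evenly into a stripe. It is a simplified model that assumes the partition itself starts on a stripe boundary; the helper name and constants are illustrative, with the sizes taken from the paragraph above.

```python
KIB = 1024
STRIPE_BYTES = 64 * KIB    # RAID stripe size
CLUSTER_BYTES = 64 * KIB   # file system allocation unit
SECTOR_BYTES = 512         # disk block size


def clusters_per_stripe(stripe_bytes, cluster_bytes):
    """Return how many whole clusters fit in one stripe, or None if they don't align."""
    if stripe_bytes % cluster_bytes != 0:
        return None        # some cluster would straddle a stripe boundary
    return stripe_bytes // cluster_bytes


aligned = clusters_per_stripe(STRIPE_BYTES, CLUSTER_BYTES)
sectors_per_stripe = STRIPE_BYTES // SECTOR_BYTES
print(aligned, sectors_per_stripe)
```

With matching 64K sizes, every cluster maps onto exactly one stripe, so a single cluster write never has to touch two stripes.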

A few important things regarding the driver must be mentioned. Windows operating systems inherited their native SAS/SATA RAID controller driver framework (SCSIport) from SCSI technology, and it has several serious drawbacks. I stumbled upon an interesting investigative white paper which goes into great detail about these issues. The preferred driver for modern SATA/SAS cards is the STORport driver, developed by the manufacturers’ consortium in response to the inadequate state of native drivers, which inherited limitations of the SCSI protocol. The STORport driver is not certified by Microsoft, therefore the OS by default installs the inferior SCSIport driver. Switching to the STORport driver visibly improved stability and performance during the project’s earliest stages, instantly bumping write speed by several dozen MB/sec.

Gig-E is the bottleneck

Having spent some extra time on research, fine-tuning, and optimizing the new server, we were glad to find that the gigabit network had become the bottleneck, rather than the all-too-commonly disappointing I/O. By aggregating both on-board gigabit network interfaces we can expect transfer rates of 200MB/s (the disks are 3Gbps), which is great for us to maintain a light, low-maintenance incremental backup. Another big advantage of this system in terms of capacity scaling is the external SAS connector which can accommodate another external box of disks to expand our array into. While not the fastest solution possible, it strikes a great balance between performance, value, and reliability (redundancy), which is exactly what we are looking for.


Cohabitation with Python and C++

by Chris on November 23, 2011

Back in the day, Oyster.com was a C++ shop. One day we decided to convert to Python. We didn’t convert everything to Python, which left us with the task of bridging the gap between them. Of course there were issues when setting up this communication between the C++ and Python libraries.

We made the decision to convert to Python for several reasons. One main reason was to take advantage of some good but free Python libraries. We converted almost all of our code — it’s better and easier to maintain code in one language than in two. But one of our backend engines was complicated and it worked, so we decided to leave it in C++.

In simple cases, bridging the gap from Python to C++ is relatively easy. Python provides a routine that will convert data from Python’s managed memory to C++. PyArg_ParseTuple is used to convert incoming Python objects into C data types. We had to then iterate over the lists using PyList_GetItem to convert lists to arrays. We wrote the conversion function, everything worked, and we pushed the results into production.

But we experienced periodic crashes which we could not track down. While investigating the crashes we discovered that our multithreaded system was essentially only handling one incoming request at a time! Our C++ code had lots of dependencies and sometimes could take a while to return a result. It turned out that the entire Python server would block waiting for the C++ code to return.

The problem was Python’s Global Interpreter Lock (GIL). The GIL prevents multiple native threads from executing Python bytecodes at once. Apparently this is done because Python’s memory management isn’t thread safe.

While there are a few things that can be done to allow Python to play nicely with C++ in a multithreaded environment, we were in a time crunch to get the problem solved. The problem wasn’t so much the multithreading, but the amount of time that was spent inside the C++ code. If the C++ code is quick then Python won’t block for very long.

We solved our problem by redesigning the flow — instead of calling our C++ code directly inside Python, we switched to running our C++ code in a separate process and talking to it via a simple HTTP API. Our multithreading and GIL issues disappeared and became multi-processing issues (where there are much clearer, safer boundaries).
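The shape of that redesign can be sketched in Python. This is not our actual API: the handler below is a stand-in for the separate C++ engine process, and the URL path, JSON fields, and function names are all invented for illustration. The point is only that the Python side makes an ordinary HTTP request, during which other Python threads are free to run.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class EngineStandIn(BaseHTTPRequestHandler):
    """Stand-in for the separate engine process (hypothetical API)."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_stand_in_engine():
    """Run the stand-in on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), EngineStandIn)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Talk to the local process directly, ignoring any configured proxy.
_opener = build_opener(ProxyHandler({}))


def call_engine(base_url, path, timeout=5.0):
    """Client side: an ordinary HTTP request instead of an in-process C++ call."""
    with _opener.open(base_url + path, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A usage example would start the stand-in, then call call_engine("http://127.0.0.1:%d" % server.server_port, "/rank?hotel=42") and shut the server down afterwards. The timeout gives the Python side a clear failure mode if the engine is slow, which an in-process call never offered.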

The lesson we learned from this is that multithreading is difficult, even when it looks simple. Python and C++ can play nicely with each other, as long as the C++ call is quick. You can’t let the call into C++ block for too long, as you need to let Python release the GIL occasionally to let other threads run. While it may be possible to solve this problem, we found that our new multi-process HTTP solution works great (in fact it probably works better this way), and we didn’t have the time to delve into a solution involving Python’s GIL.

As always, tread carefully when doing any sort of multithreading.


CherryPy, ctypes, and being explicit

by Ben on October 31, 2011

Here at Oyster.com part of our web stack consists of web.py and CherryPy, and on the whole we’ve found they’re fast and stable. However, a little while ago, the CherryPy server started intermittently raising an exception and bombing out — a WindowsError due to an invalid handle on an obscure SetHandleInformation() call.

Auto-restart is not a solution

At first this was only happening once in a very long while, but after certain changes it would start happening a few times a day. We’ve got a script in place that restarts our servers when they die, but because of the aggressive caching we do, our web servers load tons of stuff from the database on startup, and hence take a while to load. So just letting our auto-restart scripts kick in wasn’t a solution.

On further digging, we found there was already a relevant CherryPy bug open, with someone else getting the same intermittent exception. They were working around it by changing an unrelated line of code, so something smelled fishy.

HANDLE != uint32

I noticed SetHandleInformation() was being called with ctypes, and had just recently been using ctypes for a named mutex class I’d written (to make Python’s logging module safe for writes from multiple processes).

ctypes is great for calling C DLLs when you just want a thin Python-to-C wrapper. Its defaults are good: for instance, Python integers get converted to 32-bit ints in C, which is normally what you want. SetHandleInformation()’s first parameter is a handle, which I (and apparently CherryPy) assumed was just an integer, so it was getting passed to C as a 32-bit value. However, it’s actually defined as a HANDLE, which is typed as a void pointer, so on our 64-bit Windows machines it was actually a 64-bit value.

SetHandleInformation() was looking for the high 32 bits of the handle on the stack or in a register someone else owned, and of course sometimes those 32 undefined bits weren’t zero. Crash bang.
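The size mismatch is easy to see from Python on any platform. This sketch only compares type sizes; it uses ctypes.c_void_p as a stand-in for HANDLE (which is a void pointer, as noted above) and does not call any Windows API.

```python
import ctypes

int_bytes = ctypes.sizeof(ctypes.c_int)        # what a bare Python int is passed as
handle_bytes = ctypes.sizeof(ctypes.c_void_p)  # HANDLE is a void pointer

# On a 64-bit build the callee reads more bytes than the default conversion filled in.
undeclared_bytes = handle_bytes - int_bytes
print(int_bytes, handle_bytes, undeclared_bytes)
```

On a 64-bit interpreter the last number is non-zero, and those are the bytes whose contents nobody had set.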

On being explicit

Once we realized what was happening, the fix was easy enough — ctypes lets you override the default conversions by specifying argument and return types explicitly. So we changed a straight ctypes call:

windll.kernel32.SetHandleInformation(sock.fileno(), 1, 0)

to a ctypes call with an explicit type spec, like this:

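from ctypes import windll, wintypes

# Declare the real signature so the handle is passed as a pointer-sized HANDLE.
SetHandleInformation = windll.kernel32.SetHandleInformation
SetHandleInformation.argtypes = [wintypes.HANDLE, wintypes.DWORD, wintypes.DWORD]
SetHandleInformation.restype = wintypes.BOOL
SetHandleInformation(sock.fileno(), 1, 0)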

Lo and behold, we were now telling ctypes to respect the function’s signature, and everything worked fine. We told the CherryPy folks and they were quick to implement this fix and resolve the bug.

So don’t be scared of ctypes, but just remember, it doesn’t memorize Windows.h, so avoid pain and suffering by telling it your types. Explicit isn’t for raunchy movies — it’s point #2 in the Zen of Python.


Oyster Hotel Panoramas

by Bret on September 30, 2011

Here at Oyster we’re all about photos. We like photos of people, pools, property and power outlets. The more photos the better as our goal is to give you a real and ideally complete picture of the hotels we cover.  A single photo – while worth a thousand words – will generally show you a small window of about 40° horizontally and 27° vertically. What if we could give you a 360° view of the hotel? A panorama adds a lot of perspective and helps create a better sense of the space. We weren’t sure what the results would be but we decided to purchase a couple Gigapan pano heads and send them out to our photographers and see what we could make of it.

And we made this. Well that’s kind of cool, not very interactive but cool. When looking for some panoramic software we didn’t really have a big list of requirements. We wanted something that would work well and look good. We didn’t want any QuickTime or Java implementations, sticking to Flash and hopefully some in-progress HTML5/js solution. The first program we investigated was Pano2VR. Pano2VR definitely did the job, but when investigating how we might handle UI changes down the line it looked as though we’d have to generate each panorama over again. As an engineer this seemed like a less than ideal solution.

While not writing it off I continued to search for panorama software. I came across this website dedicated to panoramas, Panoramas.dk, and the panoramas I was looking at felt smoother. Digging a bit I found out they use krPano. After giving it a quick run-through it became obvious that this was going to be our solution. It ran off a suite of command-line tools and configurable template files, allowing you to separate your UI into an XML file, while packaging up the rest of your panorama into a swf. Now if we decide to change anything in the UI we just have to worry about modifying that one file – perfect!

Taking a look at the pano above you’ll see that we’re almost there. You might’ve noticed a rather visible seam where the left and right ends of the photo meet. Up until now we’d just been working out what panorama software we would use and it was time to get that stitching quality up. The software package that came with Gigapan wasn’t cutting it. Sure we could pass it off to our photo editors to try and do something about it, but the more streamlined this process was the better. GigaPan Stitch was out, and I found AcroPano barely usable, since I had to manually set points of similarity between each image as it was added. I did not get far enough to see what a stitch would look like. I finally stumbled upon Microsoft Research Image Composite Editor or MR ICE for short. Not only did ICE get rid of the seam but it also did a far superior job of selecting photos when blending moving objects. If you look closely at the photo below you can see a minivan on the road near the middle of the image that appears to be vanishing, while the ICE stitched photo correctly composites two images without the moving object.

So far we’re just working with individual panoramas for each hotel, but one great feature of krPano is the ability to create virtual tours. Though not something we’re currently working on we do have the capability with krPano to create a sort of walkthrough of a hotel. There are also some neat features with hotspots and javascript callbacks that could create some pretty interesting experiences. Take a look at the first batch of panoramas:

Bell Rock Inn

L’Auberge de Sedona

Best Western Plus Arroyo Roble Hotel & Creekside Villas


Who Can I Trust To Send My Email?

by Mike on September 12, 2011

Oyster.com has a large average transaction size.  Customers visit our site and commonly hand over thousands of dollars.  Eventually they will expect their hotel room to be available, but what they want immediately after paying us is an email telling them that we got their order, their room has been confirmed, and everything is OK.  This is a very compelling reason to have a stable email provider.

There is also the ever-growing number of people on our email lists.  They have told us they love our website and would like to purchase something from us if we would just send them a nicely personalized email that maybe has a coupon in it.  That is another great reason to be able to get our emails out reliably and track their performance.

So who can I trust to send my email?  There are more companies around than you would believe who are fighting for the business.  I looked at about a dozen of them.  The criteria used were:

  1. Cost.  Oyster is a startup and doesn’t like to pay BigCompany prices.
  2. Deliverability.  Spammers have made email deliverability a huge pain in the rear.  Getting your message into your customer’s inbox is not trivial.
  3. API.  I am not interested in a solution that I can’t extend.  We push all our services and tools very hard and that’s impossible without a good API.
  4. Features.  Of course, the fewer things I have to write, the more time I can spend actually working on our website instead of our email campaigns.

The Competitors

The cheapest is Amazon.  $0.10 CPM (cost-per-thousand) will stretch a startup’s budget a long way.  Their deliverability is a question mark though, and they are very low on features.  I am trying to hire an email vendor, not build one.

A step up are companies like MailChimp.  $0.50 CPM, but you get a lot of features for your money.  Deliverability was a concern, but they will give you a dedicated IP address for $1,000 and that will at least keep the spammers off the exact same address.  Things started breaking down though when I dug into their API.  It is very campaign-centric.  How do I pull up a list of the last 10 mails I sent a specific customer?  They do get some points for having the best designed website in the industry.

Another bump up the pricing ladder takes you to a fellow NYC startup, Sailthru.  They probably have the worst designed website in the industry, but I heard good things about them so I got a demo account and took it for a spin.  They had a simple and powerful REST API that would play nicely with our codebase.  They definitely understood that deliverability is paramount.    There were some rough edges around the product (like the documentation), but I got quick support whenever I had a question.

Which of these companies hired a web designer?

The top tier includes some of the big names in the industry.  StrongMail, ExactTarget, CheetahMail, BlueHornet, Silverpop, and on and on.  This is the world of long web demos where they spend half an hour showing you their amazing WYSIWYG editor (which a) sucks and b) we’re not going to use).  Then comes a breakdown of every possible report you could generate.  The conversation gets awkward when you ask for access to a sandbox where you can try it for yourself.

There were some diamonds in the rough.  StrongMail and ExactTarget both had strong offerings.  StrongMail can pull data straight from your database.  They had a convoluted API, but everything you need is in there.  There were also some nice goodies like multi-variate A/B testing support built in.  Prices start in the $4-$5 CPM range, but they come down as your volume grows.

The Winner Is...

The competition was fierce.  Salespeople were sending me Yankees tickets and taking me out to fancy restaurants.  At the end of the day we chose Sailthru.  They had some features that were very useful for us.  Every mail we sent was archived on the web and we easily tied links to them into our customer service portal.  They have a developer-friendly API and were happy to add the hooks and functions that we asked for.  The templating system is simple and flexible.  There are a couple of warts, but it is a great system for the money.

So how is it going?

We’ve been using Sailthru for a while now.  Overall I’ve been very happy with them.  Senderscore reports our emails as highly deliverable.

98% Acceptance Rate

 

Besides deliverability, there are a couple of key features that make Sailthru work for us.  The first is that their templating system does not get in our way.  Our transactional mails go out in different languages and currencies, so it is not going to work if we have to build emails in their system.  Most of our Sailthru templates look like this:

{body}

We process everything on our side with Cheetah templates and our internationalization code and then send the entire email up to them.  We get all the benefits of the reports, link tracking, and other goodies in Sailthru, but we can leverage our localization tools to build the mail.

International emails mean basically all the content changes.

The second is that their API is simple and complete.  It is a REST API and we wrote our own interface in Python, but they have a bunch of clients available.  We use it for everything.  Sending transactional mails is as easy as POSTing the body of the mail.  Campaign mails are a two-step process because they are heavily personalized.

The issue is: how do you get different content for 100,000 people into your email provider?  A solution like StrongMail can connect right to your database, but I haven’t seen anyone else who does that.  Sailthru has data feeds that you can set up, but they are only good for specifying the same content for everyone.  Sailthru’s solution is through their API ‘job’ call.  This is used for a lot of bulk tasks like massive updates to your lists.  In this case it allows you to set data fields on all your users.

Every night we build up a file with customized article selections for each user and post it to Sailthru.  A callback lets us know it has been processed and we are ready to start a campaign mail whenever it is convenient for us.  My data fields look like:

Data fields get populated in Sailthru

And they result in an email that looks like this:

Email is a critical component of our business, and we evaluate our provider regularly.  So far Sailthru’s combination of deliverability, features, API and price point has been tough to beat.

Oyster Shots on the Front End

by Alex on July 29, 2011

In our last post Ben brought you up to speed on some of the inner workings of our latest addition to the site, Oyster Shots. Building the user interface for this new feature presented its own set of challenges. With Oyster Shots we wanted to create as immersive an experience as possible, allowing users to navigate our mountains of photographic content in a new and fun way.

Photo Sizing

One of the main goals we had with Oyster Shots was to provide the best photo-browsing experience to users with a wide range of screen resolutions: from nerds like us with huge monitors, to laptop displays, and even to those desktop displays still kicking around at 1024×768. We do this both on the photo detail view and the results view using a couple of different techniques:

The Client-Side Part

First off, we scale the photos as you resize your browser window using a combination of CSS and JavaScript. For the photo detail view, we have a fluid layout where the sidebar has a fixed width and the photo expands to fill as much space as is available. For the result view we used percentage measurements to define the width of image columns. At most window sizes four columns of photos looked pretty great, but at some smaller screen resolutions, the photos got too small. To address that we used JavaScript to change the number of columns based on window size: if the window size is less than 1410 pixels, you get three columns; if it’s more you get four. To keep the spacing between photos consistent regardless of size, we used an outer container with its width set by percentage and an inner box with fixed margins in pixels.

.photo-result-container {
    display: inline-block;
    max-width: 610px;
    min-width: 245px;
    vertical-align: top;
    /* percentage width to fit four photo results per row */
    width: 25%;
}
.photo-result {
    background-color: #fff;
    border: 1px solid #ccc;
    /* margins on the inside keep the spacing consistent,
    and don't mess with the width of the container */
    margin: 0 16px 44px 0;
    padding: 8px 0;
    position: relative;
    vertical-align: top;
}
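
The JavaScript side of the column switch could be sketched roughly as follows. The 1410-pixel threshold and the three/four column counts come from the description above; the helper names and the plain resize listener are assumptions for illustration, not the code that actually runs on the site.

```javascript
// Threshold from the post: narrower than 1410px gets three columns, otherwise four.
var COLUMN_BREAKPOINT = 1410;

// Hypothetical helper: number of result columns for a given window width.
function columnsForWidth(windowWidth) {
    return windowWidth < COLUMN_BREAKPOINT ? 3 : 4;
}

// Hypothetical helper: percentage width for each .photo-result-container.
function columnWidthPercent(windowWidth) {
    return (100 / columnsForWidth(windowWidth)) + '%';
}

// In a browser, apply the width on load and again on every resize.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
    var applyColumns = function () {
        var width = columnWidthPercent(window.innerWidth);
        var containers = document.querySelectorAll('.photo-result-container');
        for (var i = 0; i < containers.length; i++) {
            containers[i].style.width = width;
        }
    };
    window.addEventListener('resize', applyColumns);
    applyColumns();
}
```

Because the fixed pixel margins live on the inner .photo-result box, changing only the container's percentage width leaves the spacing between photos untouched.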

We also use CSS and JavaScript to resize the photos themselves. Setting the image’s width to 100% and leaving the height at “auto” means the image will fill its container horizontally and maintain its original aspect ratio. Obviously we don’t want to use the same image file for every possible screen size: using a huge image and scaling it down would add undue page weight, while using a small image and scaling it up would look pretty terrible.

A Photo Detail Page for high-resolution display, and low-resolution

We have a defined set of image sizes that are available for all of our photos. Using a custom jQuery plugin, we check the size of the images when the browser window resizes to see if the image file needs to be replaced. If the new display size of the image is larger than its native resolution, we swap the src attribute for a larger image so that the scaling is cleaner. By tracking whether the user is growing or shrinking the browser window, we can start to load new image sizes early.
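
As a rough illustration of the size-selection step, here is a minimal sketch. The list of widths is invented for the example, and picking "the smallest size that is at least as wide as the display" is one reasonable reading of the rule, not necessarily what our plugin does.

```javascript
// Hypothetical set of available image widths in pixels, sorted ascending.
var AVAILABLE_WIDTHS = [245, 480, 610, 960, 1280, 1920];

// Return the smallest available width that covers the displayed width, so the
// browser only ever scales down. Falls back to the largest size when the
// display is wider than anything available.
function pickImageWidth(displayWidth, availableWidths) {
    for (var i = 0; i < availableWidths.length; i++) {
        if (availableWidths[i] >= displayWidth) {
            return availableWidths[i];
        }
    }
    return availableWidths[availableWidths.length - 1];
}

// The swap condition from the text: the image is now displayed larger than
// the native resolution of the file currently loaded.
function needsLargerImage(currentNativeWidth, displayWidth) {
    return displayWidth > currentNativeWidth;
}
```

On each resize, a plugin along these lines would call needsLargerImage with the current file's native width and, if it returns true, set the src to the file matching pickImageWidth.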

Conventional wisdom seems to be that you’re not supposed to scale images in the browser. The two most oft-cited reasons are:

  1. You serve larger images than are needed: We dealt with this by having multiple sizes and serving the one that most closely matches the current display size.
  2. Browsers don’t do a good job of scaling images: We found that not to be the case. Rescaling images in a modern browser to a size close to its native resolution yields results that are quite acceptable.

One of these images was rescaled in Photoshop using bicubic resampling, one was rescaled in Internet Explorer. Can you tell which is which?

The Server-Side Part

Changing image sizes as a user resizes his or her browser window is one thing, but ideally we’d like to serve up the optimal image size on page load as well. To do this, we set a cookie with the user’s display size. This cookie is written on page load and every time the browser window is resized. We use this cookie on a number of pages that have fluid-sized images, including the Oyster Shots results page, to determine how large an image will be displayed when the HTML is loaded. Starting with the dimensions of the browser window, we do a few calculations: subtracting the height of the page header, the width of the sidebar and so on, to determine how much page real estate the photo will have to display. We find the closest available image size, and generate the HTML on the server side to use that image. That way when the resizing plugin on the client side takes over, the image does not need to be loaded again. Of course this can fail if a user is browsing in multiple windows, or clears cookies, for instance, but that is a minority of cases, and the image would just be reloaded once the client side sizing logic kicks in.
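
The calculation itself is simple arithmetic. Below is a minimal sketch of it, written in JavaScript only to match the other examples in this post. The cookie format ("1440x900"), the header and sidebar dimensions, and the list of image widths are all assumptions made up for illustration.

```javascript
// Hypothetical layout constants and image widths, in pixels.
var HEADER_HEIGHT = 120;
var SIDEBAR_WIDTH = 300;
var AVAILABLE_WIDTHS = [245, 480, 610, 960, 1280, 1920];

// Parse a cookie value such as "1440x900"; returns null if missing or malformed.
function parseDisplaySize(cookieValue) {
    var match = /^(\d+)x(\d+)$/.exec(cookieValue || '');
    if (!match) return null;
    return { width: parseInt(match[1], 10), height: parseInt(match[2], 10) };
}

// Space left for the photo once the fixed page chrome is subtracted.
function photoArea(size) {
    return {
        width: Math.max(0, size.width - SIDEBAR_WIDTH),
        height: Math.max(0, size.height - HEADER_HEIGHT)
    };
}

// Available width closest to the target width.
function closestWidth(target, widths) {
    var best = widths[0];
    for (var i = 1; i < widths.length; i++) {
        if (Math.abs(widths[i] - target) < Math.abs(best - target)) {
            best = widths[i];
        }
    }
    return best;
}

// Width to render into the initial HTML; uses defaultWidth when there is no cookie.
function initialImageWidth(cookieValue, defaultWidth) {
    var size = parseDisplaySize(cookieValue);
    if (!size) return defaultWidth;
    return closestWidth(photoArea(size).width, AVAILABLE_WIDTHS);
}
```

The default covers first-time visitors and cleared cookies; in those cases the client-side plugin corrects the size after load, as described above.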

Quick Browsing

Another goal we had was to be able to browse photos quickly. From the Oyster Shots results page, once you click a photo to get to the detail view, you can start browsing through the results by clicking the arrows at the top right, clicking the photo itself, or using the left and right arrow keys on your keyboard. You may notice how quick it is to navigate from one photo to the next. This is because the detail pages are all loaded via Ajax, which allowed us to make a few key performance improvements.

The advantages of using Ajax to update only a portion of the page content, rather than causing a complete refresh, are well-known. What we did to go beyond that was to load multiple photo detail pages in a single request. The HTML for a dozen photo pages is stored in memory. That way, navigating to the next or previous photo in a series usually only requires reading a JavaScript variable to get the HTML content.
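
A minimal sketch of that batching idea is below. The batch size of twelve comes from the "dozen" above; the function names and the fetchBatch(startIndex, count, callback) signature are hypothetical, standing in for whatever Ajax call returns the HTML for a run of pages.

```javascript
// In-memory store of photo-page HTML, keyed by position in the result set.
var pageCache = {};
var BATCH_SIZE = 12; // "a dozen photo pages" per request

// Store a batch returned by the server, starting at startIndex.
function storeBatch(startIndex, htmlPages) {
    for (var i = 0; i < htmlPages.length; i++) {
        pageCache[startIndex + i] = htmlPages[i];
    }
}

// Hand the page HTML to callback. A cache hit is just a property lookup;
// a miss loads the whole batch containing the requested index.
function getPage(index, fetchBatch, callback) {
    if (pageCache.hasOwnProperty(index)) {
        callback(pageCache[index]);
        return;
    }
    var start = Math.floor(index / BATCH_SIZE) * BATCH_SIZE;
    fetchBatch(start, BATCH_SIZE, function (htmlPages) {
        storeBatch(start, htmlPages);
        callback(pageCache[index]);
    });
}
```

With this shape, stepping from photo 13 to photo 14 never touches the network, while jumping to photo 30 triggers one request that also covers its neighbours.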

Loading multiple photo pages at once also allows us to preload images. When you’re looking at a photo detail page, we can look at the source of the next or previous image in the series and start to load it ahead of time.
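
Preloading of this kind usually relies on the standard browser trick of creating an Image object and assigning its src. The sketch below assumes each stored page object exposes the photo URL as an imageSrc property, which is a name invented for this example; the ImageCtor parameter exists only so the functions can be exercised outside a browser.

```javascript
// Sources we have already asked the browser to fetch.
var preloaded = {};

// Start loading an image in the background. Returns true if a new load began.
function preloadImage(src, ImageCtor) {
    if (!src || preloaded.hasOwnProperty(src)) return false;
    ImageCtor = ImageCtor || (typeof Image !== 'undefined' ? Image : null);
    if (!ImageCtor) return false;
    var img = new ImageCtor();
    img.src = src; // assigning src is what triggers the download
    preloaded[src] = img;
    return true;
}

// Preload the previous and next photos relative to the one being viewed.
// `pages` maps result index -> page object with a (hypothetical) imageSrc.
function preloadNeighbours(pages, index, ImageCtor) {
    var started = 0;
    var neighbours = [index - 1, index + 1];
    for (var i = 0; i < neighbours.length; i++) {
        var page = pages[neighbours[i]];
        if (page && preloadImage(page.imageSrc, ImageCtor)) started++;
    }
    return started;
}
```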

The photo pages are stored in memory as objects that have a handful of properties, including a string that is the HTML for the page itself. Storing and manipulating a large number of photo pages in memory called for a couple of unique solutions, the first of which was a basic issue of organization. In the many ways that Oyster Shots works with these objects, we will at times need to access a particular photo page arbitrarily according to the unique image id, the id of the hotel that image belongs to, or the index (the order in which the image appears in the result set).

To allow for this flexibility, we maintain a single container object that stores the page objects and is keyed by the index, and two supplemental indexes where page objects are keyed separately by image id and hotel id. We used an object rather than an array for the primary collection.

//set up the main collection and the two supplemental indexes:
var images = {};  //main collection
var imageKeys = {}; //indexed by image id
var hotelIdImageKeys = {}; //indexed by hotel id

//update the indexes when we add an image to the collection:
function addImage(index, imageId, imageObj) {
    images[index] = imageObj;
    imageKeys[imageId] = index;
    var hotelId = imageObj.hotelId;
    if(!hotelIdImageKeys[hotelId])
        hotelIdImageKeys[hotelId] = [index];
    else
        hotelIdImageKeys[hotelId].push(index);
}

//now looking up image pages by image id or hotel id is easy:
function getImageById(imageId) {
    if(!imageKeys.hasOwnProperty(imageId))
        return false;
    var index = imageKeys[imageId];
    return images[index];
}
function getImagesByHotelId(hotelId) {
    if(!hotelIdImageKeys.hasOwnProperty(hotelId))
        return false;
    var hotelImages = [];
    var indexes = hotelIdImageKeys[hotelId];
    for(var i = 0, len = indexes.length; i < len; i++)
        hotelImages.push(images[indexes[i]]);
    return hotelImages;
}

Maintaining these indexes in this way allows us to quickly find a particular page object by its id, its hotel id, or its position in the search results with just a couple of object attribute lookups rather than looping through the entire set of images.

Another issue that arose is how to change the stored HTML, if needed, once it’s been retrieved from the server. We store the HTML as strings rather than DOM fragments because string manipulation is faster than DOM manipulation (even in a fragment) and strings take up far less memory than DOM fragments. Unfortunately that means that all of jQuery’s handy DOM methods are off the table for working with this content. To get around this, we marked sections of the page that would need to be changed on the fly with HTML comments. Then it was just a simple find and replace operation with no DOM manipulation required.

/*
    the markers are HTML comments like:
    <!--begin section-->
    and
    <!--end section-->
*/
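function replaceHtmlAtMarkers(html, replace, beginMarker, endMarker) {
    var begin = html.indexOf(beginMarker);
    if(begin === -1)
        return html;
    //look for the end marker only after the begin marker
    var end = html.indexOf(endMarker, begin + beginMarker.length);
    //check for a missing marker before adjusting the offset
    if(end === -1)
        return html;
    end += endMarker.length;
    //splice by position so "$" patterns in the replacement are not interpreted
    return html.substring(0, begin) + replace + html.substring(end);
}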

There were tons of problems we had to solve while building out the front end for Oyster Shots, and these are merely examples of a few of them. If you’re finding our new blog useful or interesting, leave a comment and let us know!


*Turn your monitor upside-down or stand on your head to read the answer:
ɹǝɹoןdxǝ ʇǝuɹǝʇuı uı pǝzısǝɹ sɐʍ ʇɟǝן ǝɥʇ uo ǝbɐɯı ǝɥʇ :ɹǝʍsuɐ
