Optimizing Website integration with Amazon’s S3 Service
By Matt | February 28, 2009
At Participatory Culture Foundation we use Amazon’s S3 Service to host our static content — css, js, and images.
This accomplishes two things: it improves performance for our visitors, since Amazon offers better speed and reliability than we can afford on our own servers, and it does so at a lower cost.
In this post we’ll look at how much bandwidth and how many files and redirects we use without S3, then with various combinations of local and redirected files, up to code as optimized as I have been able to make it. Fully optimized, our servers transfer only about 6.8% of the bytes that the “unoptimized” site would. Optimizing this single popular page to use S3 efficiently saves PCF about $1,000 a year in hosting costs.
1)
Let’s look at a redacted Squid log after the February, 2009 redesign of www.getmiro.com when not using S3 at all:
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6"
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 200 5004 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 200 21515 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 200 7709 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com//i/blue_bg.png HTTP/1.1" 200 1198 "http://www.getmiro.com//css/styles.css" "Mozilla/5.0
"GET http://www.getmiro.com//i/nav_back.gif HTTP/1.1" 200 530 "http://www.getmiro.com//css/nav.css" "Mozilla/5.0
[blah blah blah...]
That’s 37 files, for a total of 338,359 Bytes.
2)
Now let’s look at what happens if we load CSS from our server, but use Apache to redirect images and JS to the S3 service:
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "http://www.getmiro.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6"
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 200 5004 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 200 21515 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 200 7709 "http://www.getmiro.com/" "Mozilla/5.0
"GET http://www.getmiro.com//i/blue_bg.png HTTP/1.1" 302 688 "http://www.getmiro.com//css/styles.css" "Mozilla/5.0
"GET http://www.getmiro.com//i/nav_back.gif HTTP/1.1" 302 690 "http://www.getmiro.com//css/nav.css" "Mozilla/5.0
[blah blah blah...]
Now it’s four files, plus 33 redirects — and only 78,705 bytes.
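Tallies like these can be reproduced from a log with a few lines of script. What follows is only a rough Python sketch: the `tally` helper and its regular expression are illustrative assumptions based on the log layout shown above, not the method used to produce the numbers in this post. It is run here on just the six entries quoted in the excerpt, so its totals cover those six lines rather than the full 37 requests.

```python
import re

# Matches the request, status and byte-count fields of one log entry, e.g.
#   "GET http://www.getmiro.com/ HTTP/1.1" 200 21781 ...
ENTRY = re.compile(r'"GET (\S+) HTTP/1\.\d" (\d{3}) (\d+)')

def tally(log_text):
    """Return (files, redirects, total_bytes) for the entries found in log_text."""
    files = redirects = total = 0
    for _url, status, size in ENTRY.findall(log_text):
        if status in ("301", "302"):
            redirects += 1   # our server only sent a redirect
        else:
            files += 1       # our server sent the file itself
        total += int(size)
    return files, redirects, total

# The six entries from the excerpt above (user agents left out).
sample = '''
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "http://www.getmiro.com/"
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 200 5004 "http://www.getmiro.com/"
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 200 21515 "http://www.getmiro.com/"
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 200 7709 "http://www.getmiro.com/"
"GET http://www.getmiro.com//i/blue_bg.png HTTP/1.1" 302 688 "http://www.getmiro.com//css/styles.css"
"GET http://www.getmiro.com//i/nav_back.gif HTTP/1.1" 302 690 "http://www.getmiro.com//css/nav.css"
'''

print(tally(sample))
```

Feeding the same function a full page load's worth of log lines should give the file, redirect and byte counts quoted for each scenario.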
3)
Now let’s use Apache to redirect the CSS to be pulled from Amazon S3.
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 302 684 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 302 690 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 302 688 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com//i/blue_bg.png HTTP/1.1" 302 687 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com//i/nav_back.gif HTTP/1.1" 302 689 "http://www.getmiro.com/" "Mozilla/4.0
[blah blah blah...]
Still four files and 33 redirects, but down to 46,506 bytes.
So far it’s all been pretty standard stuff in Apache using mod_rewrite redirects. When Apache sees a request for a CSS sheet, it redirects the request to S3.
RewriteRule ^/css/(.*) http://s3.getmiro.3.0.com.s3.amazonaws.com/css/$1
And then a css sheet may have a line like this:
background: url(../i/screen_dropshadow.png) -20px -36px no-repeat;
Now an observant user may note in the logs above that I switched from using Firefox to IE. Why? The browsers interpret the CSS differently.
Firefox interprets “../i/” relative to where the CSS style sheet is LOADED from — in our case http://s3.getmiro.3.0.com.s3.amazonaws.com/css/.
Internet Explorer interprets “../i/” relative to where the CSS style sheet is CALLED from — in our case http://www.getmiro.com/css/.
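Those familiar with Unix path notation know that “../i/” from “getmiro.com/css/” gets you to “getmiro.com/i/”. So in IE every image referenced from the CSS is still requested from our server and has to be redirected, which is what the log above shows, while Firefox fetches those images straight from S3.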
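The two interpretations can be illustrated by resolving the same relative path against each base URL. This is a small Python sketch using the standard library's `urljoin`; the variable names are made up for the illustration, and the style sheet name `styles.css` is taken from the logs above.

```python
from urllib.parse import urljoin

relative = "../i/screen_dropshadow.png"

# Base URL in the Firefox case: where the style sheet was finally loaded from.
loaded_from = "http://s3.getmiro.3.0.com.s3.amazonaws.com/css/styles.css"
# Base URL in the Internet Explorer case: where the style sheet was requested from.
called_from = "http://www.getmiro.com/css/styles.css"

firefox_url = urljoin(loaded_from, relative)
ie_url = urljoin(called_from, relative)

print(firefox_url)  # the image comes straight from S3
print(ie_url)       # the image request goes back to our server, which redirects it
```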
4)
Now we get fancy.
In implementing S3, we have a bash script which handles the synchronization between our servers and S3. So in that script, let’s intercept the CSS sheets, run a simple sed substitution, and upload the modified files to a special location:
# getmiro css
# This substitutes ../i with http://s3.getmiro.3.0.com.s3.amazonaws.com/i in the getmiro css code
# and uploads them to a special directory in amazon. This is in turn re-written by Apache to point there.
# Having the full url hard coded saves tens of thousands of redirects and gigs of bandwidth.
# It's also more efficient than "php-ifying" css to do the url substitution.

# First, copy the css to a working directory:
cp /data/getmiro/css/*.css /scripts/getmiro_css

# It's safer to just modify files we know about, rather than automate finding and modifying without foreknowledge:
sed -i 's/\.\.\/i/http:\/\/s3.getmiro.3.0.com.s3.amazonaws.com\/i/g' /scripts/getmiro_css/download-features.css
sed -i 's/\.\.\/i/http:\/\/s3.getmiro.3.0.com.s3.amazonaws.com\/i/g' /scripts/getmiro_css/index.css
sed -i 's/\.\.\/i/http:\/\/s3.getmiro.3.0.com.s3.amazonaws.com\/i/g' /scripts/getmiro_css/nav.css
sed -i 's/\.\.\/i/http:\/\/s3.getmiro.3.0.com.s3.amazonaws.com\/i/g' /scripts/getmiro_css/styles.css

# And let's upload them:
/usr/local/s3sync/s3sync.rb -r -p -v /scripts/getmiro_css/ s3.getmiro.3.0.com:css/s3_coded/
This happens in the background, so the web developers don’t have to worry about modifying the CSS sheets to include the hard link. It transforms lines like:
background: url(../i/screen_dropshadow.png) -20px -36px no-repeat;
into
background: url(http://s3.getmiro.3.0.com.s3.amazonaws.com/i/screen_dropshadow.png) -20px -36px no-repeat;
It’s necessary to escape the periods in the sed pattern (\.\.\/i) in case a developer does hard code the Amazon link. The \. means a literal period; in a regex an unescaped . is a wildcard that matches any one character, so a bare ../i would match any two characters before /i, including the “om” in “.com/i”. The \/ is simply an escaped slash, needed because / is also the delimiter of the sed command.
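To see why the escaping matters, here is a short Python sketch that applies both versions of the pattern to a relative line and to an already hard-coded line. The `rewrite` helper and the variable names are illustrative only; Python's `re.sub` stands in for sed, and since Python has no pattern delimiter the slash needs no escaping.

```python
import re

S3_IMAGES = "http://s3.getmiro.3.0.com.s3.amazonaws.com/i"

def rewrite(css_line, pattern):
    """Replace every match of pattern in css_line with the S3 image prefix."""
    return re.sub(pattern, lambda m: S3_IMAGES, css_line)

strict = r"\.\./i"  # two literal periods, then /i (what the sed command matches)
loose = r"../i"     # any two characters, then /i

relative = "background: url(../i/screen_dropshadow.png) -20px -36px no-repeat;"
hard_coded = ("background: url(http://s3.getmiro.3.0.com.s3.amazonaws.com"
              "/i/screen_dropshadow.png) -20px -36px no-repeat;")

print(rewrite(relative, strict))    # relative path becomes the full S3 URL
print(rewrite(hard_coded, strict))  # already hard coded: left untouched
print(rewrite(hard_coded, loose))   # "om/i" matches too, so the URL gets mangled
```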
In Apache, we change the redirect to this:
RewriteRule ^/css/(.*) http://s3.getmiro.3.0.com.s3.amazonaws.com/css/s3_coded/$1
With this change implemented:
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 302 708 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 302 702 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 302 706 "http://www.getmiro.com/" "Mozilla/4.0
Much better! One file and three redirects, for 23,897 bytes. That one change represents nearly a 50% reduction in bandwidth usage on our physical servers compared with the previous example, and only about a third of the bandwidth we would use if the CSS sheets were served locally, even with the hard links to S3 in them.
5)
Finally, one more tweak.
The default Apache redirect includes HTML code saying where a file has moved. But this isn’t necessary — a web browser just needs the correct headers to tell it where to go.
So replacing the redirect to S3 we used before, we use this:
RewriteRule ^/css/(.*) /custom_messages/css_rewrite.php
This is css_rewrite.php:
<?php
/*
This rewrites css just using headers. This saves about 300 bytes per redirect,
which saves a heck of a lot of bandwidth over time when we're doing 3 css rewrites
for every page view... works out to 30+ MB / day!

Invoke by: RewriteRule ^/css/(.*) /custom_messages/css_rewrite.php
*/
$new_server = "http://s3.getmiro.3.0.com.s3.amazonaws.com/css/s3_coded/";
// Replace everything up to the last slash with the S3 location, keeping the file name.
$new_url = preg_replace('/^.*\//', $new_server, $_SERVER['REQUEST_URI']);
// header() sends the header itself and returns nothing, so there is nothing to echo.
header("HTTP/1.1 301 Moved Permanently");
header("Location: $new_url");
?>
This produces a very minimal redirect: under 400 bytes rather than over 700 bytes.
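The URL mapping that the script performs can be checked in isolation. The following Python sketch mirrors the `preg_replace` call; `redirect_target` and `redirect_headers` are hypothetical names used only for this illustration and are not part of the site's code.

```python
import re

NEW_SERVER = "http://s3.getmiro.3.0.com.s3.amazonaws.com/css/s3_coded/"

def redirect_target(request_uri):
    """Swap everything up to the last slash for the S3 prefix, keeping the file name."""
    return re.sub(r"^.*/", lambda m: NEW_SERVER, request_uri)

def redirect_headers(request_uri):
    """Status line and Location header of a permanent redirect with no body."""
    return ["HTTP/1.1 301 Moved Permanently",
            "Location: " + redirect_target(request_uri)]

for line in redirect_headers("/css/nav.css"):
    print(line)
```

Because the pattern is greedy, a request with a doubled slash such as //css/styles.css ends up at the same place as /css/styles.css.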
I haven’t done a complete analysis to know whether this is significantly slower than a native Apache redirect; an initial review shows it is not slower for any given page load. This step would need a very, very busy site, however, to make a meaningful performance or cost impact. I’m noting it because there could be other situations where this type of optimization is useful. Most important are the variables Apache (or IIS) can pass to other programs like PHP. See this link for a list of them.
"GET http://www.getmiro.com/ HTTP/1.1" 200 21781 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
"GET http://www.getmiro.com//css/styles.css HTTP/1.1" 301 397 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com//css/nav.css HTTP/1.1" 301 394 "http://www.getmiro.com/" "Mozilla/4.0
"GET http://www.getmiro.com/css/index.css HTTP/1.1" 301 396 "http://www.getmiro.com/" "Mozilla/4.0
Now just one file, three redirects and 22,968 bytes.
Bottom line?
Let’s take a typical day when the getmiro.com homepage is called 20,000 times.
Scenario   Size (bytes)   Total Daily Bandwidth   Estimated Daily Cost**
1          338,359        6.5GB                   $4.73
2          78,705         1.5GB                   $1.09
3          46,506         887MB*                  $0.65
4          23,897         455MB                   $0.33
5          22,968         438MB                   $0.32
* Unadjusted for Firefox's interpretation of CSS paths
** This estimate is based on purchasing enough fixed bandwidth (Mbps)
to cover our peak daily usage. Our communication costs for our
physical servers is approximately times as much as Amazon S3 based on
actual transfers.
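The bandwidth column is just the per-page size multiplied by 20,000 page views. Here is a rough Python sketch of that arithmetic; the names are illustrative, and it assumes the MB figures in the table are megabytes of 1,048,576 bytes, which is the unit that lines up with the 887MB, 455MB and 438MB entries.

```python
PAGE_VIEWS_PER_DAY = 20_000
MIB = 1024 * 1024  # assumption: the table's "MB" are 2**20-byte megabytes

# Bytes served by our own servers per homepage load, from scenarios 1-5 above.
bytes_per_view = {1: 338_359, 2: 78_705, 3: 46_506, 4: 23_897, 5: 22_968}

def daily_mib(size_in_bytes, views=PAGE_VIEWS_PER_DAY):
    """Daily transfer in MiB for a page of the given size."""
    return size_in_bytes * views / MIB

for scenario, size in bytes_per_view.items():
    share = size / bytes_per_view[1]
    print(f"scenario {scenario}: {daily_mib(size):8.1f} MiB/day, "
          f"{share:6.1%} of scenario 1")
```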
So without S3 or any optimization, we’d be looking at a monthly cost around $142.00.
With S3 and with all our optimization, we’re looking at a monthly cost around $54.00.
For a small non-profit, that’s a nice savings over time.
Topics: Linux, Networking, Sysadmin Tools, Web Hosting Tools