<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description></description><title>Progressive Data Solutions</title><generator>Tumblr (3.0; @pdsolutions)</generator><link>http://blog.pdatasolutions.com/</link><item><title>What Second Life can teach your datacenter about scaling Web apps</title><description>&lt;a href="http://arstechnica.com/business/data-centers/2010/02/what-second-life-can-teach-all-companies-about-scaling-web-apps.ars/"&gt;What Second Life can teach your datacenter about scaling Web apps&lt;/a&gt;</description><link>http://blog.pdatasolutions.com/post/368148233</link><guid>http://blog.pdatasolutions.com/post/368148233</guid><pubDate>Tue, 02 Feb 2010 21:19:20 -0700</pubDate></item><item><title>Scaling lessons from Second Life</title><description>&lt;a href="http://arstechnica.com/business/data-centers/2010/02/what-second-life-can-teach-all-companies-about-scaling-web-apps.ars"&gt;Scaling lessons from Second Life&lt;/a&gt;: &lt;p&gt;&lt;a href="http://news.ycombinator.com/item?id=1095467"&gt;Comments&lt;/a&gt;&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/367863154</link><guid>http://blog.pdatasolutions.com/post/367863154</guid><pubDate>Tue, 02 Feb 2010 18:35:18 -0700</pubDate></item><item><title>Generating Thousands of PDFs on EC2 with Ruby « Rails Dog</title><description>&lt;a href="http://railsdog.com/blog/2009/12/generating-pdfs-on-ec2-with-ruby/"&gt;Generating Thousands of PDFs on EC2 with Ruby « Rails Dog&lt;/a&gt;</description><link>http://blog.pdatasolutions.com/post/297817369</link><guid>http://blog.pdatasolutions.com/post/297817369</guid><pubDate>Wed, 23 Dec 2009 20:35:35 -0700</pubDate></item><item><title>Hadoop on Rails</title><description>&lt;p&gt;A few months ago, I wrote an &lt;a title="article" target="_self" href="http://blog.pdatasolutions.com/post/191978092/ruby-on-hadoop-quickstart"&gt;article&lt;/a&gt; about using Ruby with 
Hadoop, and more specifically, the Amazon Elastic MapReduce (EMR) service. I hope some of you found that article helpful.&lt;/p&gt;
&lt;p&gt;I figured it was time to post a follow-up - using Rails with Hadoop. Much of my work over the past few months has been building a system to efficiently store, process, and display large amounts of log data. Naturally I wanted to use Rails, but I also knew that I needed to use EMR. I was tasked with building the system myself, so rather than spend more time building and maintaining a Hadoop cluster, I opted to use EMR up-front and focus on the entire process.&lt;/p&gt;
&lt;p&gt;So what have I done with Hadoop and Rails? Essentially, I’ve built a system that processes large amounts of log data end-to-end with Ruby, Hadoop/Pig, and Rails.&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" href="http://img.skitch.com/20091210-qak4pqyx98f1jtknugauacyb55.jpg"&gt;&lt;img src="http://img.skitch.com/20091210-qak4pqyx98f1jtknugauacyb55.jpg" width="400px"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;So what’s going on in this image? Here is a description of each step. Note that all Ruby scripts and processing run within a Rails app, often via script/runner to get access to the app’s data model:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Log data is collected and stored in S3. A Ruby script on an EC2 instance, started by cron, downloads the log data for the previous day, consolidates the many separate files into one, and then compresses that file with bzip2.&lt;/li&gt;
&lt;li&gt;A Ruby script sends the compressed file back to S3, storing it in a new bucket.&lt;/li&gt;
&lt;li&gt;The compressed file is also sent to the Rackspace CloudFiles service, for off-site backup. &lt;/li&gt;
&lt;li&gt;After log file consolidation and backup are complete, a Ruby script starts an Elastic MapReduce job.&lt;/li&gt;
&lt;li&gt;Data for the job, created during steps 1 and 2, is transferred from S3 to the temporary Hadoop cluster created by Elastic MapReduce. The data is processed using a Pig script which is also stored in an S3 bucket.&lt;/li&gt;
&lt;li&gt;Results of the EMR processing are stored in S3, in a separate bucket.&lt;/li&gt;
&lt;li&gt;Later, after the Elastic MapReduce job is complete, the output is downloaded via a Ruby script to the tmp/ directory within the Rails app.&lt;/li&gt;
&lt;li&gt;Once the data is downloaded, it is processed within the context of the Rails app, and loaded into a MySQL database residing on Amazon’s Relational Database Service.&lt;/li&gt;
&lt;/ol&gt;
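&lt;p&gt;The consolidation in step 1 can be sketched roughly as follows. This is a minimal illustration, not the production script: the method names are hypothetical, shelling out to the bzip2 command-line tool is an assumption, and the S3 download is left out entirely.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical helper: append every downloaded log file, in name order,
# to one combined file and return the path of that file.
def consolidate(paths, out_path)
  File.open(out_path, "wb") do |out|
    paths.sort.each do |path|
      File.open(path, "rb") { |input| IO.copy_stream(input, out) }
    end
  end
  out_path
end

# Hypothetical helper: compress with the bzip2 command-line tool, which is
# assumed to be installed. bzip2 replaces the file with a ".bz2" version.
def compress_bzip2(path)
  system("bzip2", "-f", path) or raise "bzip2 failed for #{path}"
  "#{path}.bz2"
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Keeping the two helpers separate means the compression step could be swapped out later without touching the consolidation code.&lt;/p&gt;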
&lt;p&gt;Notice that each step makes use of Ruby and / or Rails. Ruby really is the glue that holds this system together, and it’s a very powerful glue. A lot of what I am doing is date-specific, and Ruby’s date library and methods make parsing and handling dates much easier than it would be in shell scripts.&lt;/p&gt;
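&lt;p&gt;As a small illustration of the date handling, here is a sketch of how a script might work out which day’s logs to fetch. The one-prefix-per-day key layout and the method names are assumptions made for the example, not the actual bucket structure.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;require "date"

# Hypothetical key layout: one S3 prefix per day, e.g. "logs/2009/12/09/".
def log_prefix_for(day)
  day.strftime("logs/%Y/%m/%d/")
end

# The cron job runs after midnight and processes the previous day.
def previous_day(today = Date.today)
  today - 1
end

# Month and year boundaries are handled by Date itself.
puts log_prefix_for(previous_day(Date.new(2010, 1, 1)))  # prints logs/2009/12/31/
&lt;/code&gt;&lt;/pre&gt;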
&lt;p&gt;The other language used in this system, Pig, is used to filter, count, and group the large datasets. Once Pig has done its work, running on EMR, the output is just a series of text files that are parsed by Ruby, then stored in MySQL using ActiveRecord relationships. Hadoop / Pig does the heavy lifting, while Ruby / Rails controls everything.&lt;/p&gt;
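&lt;p&gt;Parsing that output can be sketched as below. The three-column record layout (date, request path, hit count) is an assumption for illustration; the tab separator is Pig’s default when results are stored with PigStorage.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Assumed record layout: date, request path, hit count, separated by tabs.
def parse_pig_line(line)
  day, path, hits = line.chomp.split("\t")
  { day: day, path: path, hits: Integer(hits) }
end

# Turn the text of one Pig output file into an array of hashes,
# skipping blank lines.
def parse_pig_output(text)
  text.each_line.reject { |line| line.strip.empty? }.map { |line| parse_pig_line(line) }
end

rows = parse_pig_output("2009-12-09\t/index\t42\n2009-12-09\t/about\t7\n")
puts rows.length  # prints 2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Inside the Rails app, each hash could then be handed to an ActiveRecord model’s create method; which model that is depends on the app’s schema.&lt;/p&gt;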
&lt;p&gt;Each part of the system is designed to grow as needed. If log combination and compression take too long, that step can be moved to a larger, more powerful EC2 instance. Once it gets too big for a single instance, it could be moved into its own EMR process, using as many machines as necessary.&lt;/p&gt;
&lt;p&gt;Likewise, if the Hadoop/Pig processing takes too long, more machines can be added by adjusting one line in the controlling script. Even the MySQL storage can be increased or moved to a more powerful server if needed, thanks to RDS and its simple API.&lt;/p&gt;
&lt;p&gt;The biggest challenge in getting this system up and running was learning Pig. Once you understand that Pig is really for filtering, grouping, and counting data, you realize its power. Pig is not Turing complete, though: there are no loops, for instance, which makes certain types of problems difficult to solve. I worked around some of the problems I encountered by moving some of the processing into the Rails app.&lt;/p&gt;
&lt;p&gt;There are issues with this system that will need to be addressed soon. The log file combination and compression will need to be improved. I’ll probably switch from bzip2 compression to splittable LZO, as detailed in this &lt;a title="article" target="_blank" href="http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/"&gt;article&lt;/a&gt;. Twitter is doing some pretty cool things with Pig / Hadoop and they make a strong case for using LZO. Another issue I’ll be looking at soon is how to streamline the EMR job process. I’m adding more jobs and at some point I’ll have to abstract what I’m doing into some sort of framework. There’s just too much code duplication there.&lt;/p&gt;
&lt;p&gt;Let me know if you have questions.&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/277838137</link><guid>http://blog.pdatasolutions.com/post/277838137</guid><pubDate>Thu, 10 Dec 2009 12:32:54 -0700</pubDate></item><item><title>Production Rails Tuning with Passenger: PassengerMaxProcesses</title><description>&lt;a href="http://blog.scoutapp.com/articles/2009/12/08/production-rails-tuning-with-passenger-passengermaxprocesses"&gt;Production Rails Tuning with Passenger: PassengerMaxProcesses&lt;/a&gt;: &lt;p&gt;tuning phusion passenger performance&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/275228799</link><guid>http://blog.pdatasolutions.com/post/275228799</guid><pubDate>Tue, 08 Dec 2009 16:34:22 -0700</pubDate></item><item><title>masterkain's ImageMagick-sl at master - GitHub</title><description>&lt;a href="http://github.com/masterkain/ImageMagick-sl"&gt;masterkain's ImageMagick-sl at master - GitHub&lt;/a&gt;: &lt;p&gt;working build script for ImageMagick on Snow Leopard&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/268653444</link><guid>http://blog.pdatasolutions.com/post/268653444</guid><pubDate>Thu, 03 Dec 2009 22:15:50 -0700</pubDate></item><item><title>Passing environment variables to Ruby from Phusion Passenger « Phusion Corporate Blog</title><description>&lt;a href="http://blog.phusion.nl/2008/12/16/passing-environment-variables-to-ruby-from-phusion-passenger/"&gt;Passing environment variables to Ruby from Phusion Passenger « Phusion Corporate Blog&lt;/a&gt;: &lt;p&gt;getting mini_magick gem and ImageMagick to work with passenger&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/268324913</link><guid>http://blog.pdatasolutions.com/post/268324913</guid><pubDate>Thu, 03 Dec 2009 18:05:04 -0700</pubDate></item><item><title>Pollux: Automatically Organize and Fix Your Music Library</title><description>&lt;a href="http://www.polluxapp.com/index.php"&gt;Pollux: Automatically Organize 
and Fix Your Music Library&lt;/a&gt;: &lt;p&gt;free app to clean itunes library&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/267850725</link><guid>http://blog.pdatasolutions.com/post/267850725</guid><pubDate>Thu, 03 Dec 2009 10:11:49 -0700</pubDate></item><item><title>TidySongs.com – Tidy Up Your Music!</title><description>&lt;a href="http://www.tidysongs.com/"&gt;TidySongs.com – Tidy Up Your Music!&lt;/a&gt;: &lt;p&gt;clean itunes library&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/267850730</link><guid>http://blog.pdatasolutions.com/post/267850730</guid><pubDate>Thu, 03 Dec 2009 10:11:49 -0700</pubDate></item><item><title>ActiveJS: The Cross Platform JavaScript MVC</title><description>&lt;a href="http://activerecordjs.org/record.html"&gt;ActiveJS: The Cross Platform JavaScript MVC&lt;/a&gt;: &lt;p&gt;ActiveRecord-like library for browsers&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/266715307</link><guid>http://blog.pdatasolutions.com/post/266715307</guid><pubDate>Wed, 02 Dec 2009 14:58:33 -0700</pubDate></item><item><title>R A C K A M O L E</title><description>&lt;a href="http://rackamole.com/"&gt;R A C K A M O L E&lt;/a&gt;: &lt;p&gt;rack app for tracking site usage&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/261547594</link><guid>http://blog.pdatasolutions.com/post/261547594</guid><pubDate>Sat, 28 Nov 2009 21:04:14 -0700</pubDate></item><item><title>Interview with Ezra Zygmuntowicz – Engine Yard</title><description>&lt;a href="http://howsoftwareisbuilt.com/2009/11/09/interview-with-ezra-zygmuntowicz-engine-yard/"&gt;Interview with Ezra Zygmuntowicz – Engine Yard&lt;/a&gt;: &lt;p&gt;interview with Ezra about the history of EngineYard&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/255896977</link><guid>http://blog.pdatasolutions.com/post/255896977</guid><pubDate>Tue, 24 Nov 2009 12:01:37 -0700</pubDate></item><item><title>Neven Mrgan's 
tumbl</title><description>&lt;a href="http://mrgan.tumblr.com/post/125490362/glyphboard2"&gt;Neven Mrgan's tumbl&lt;/a&gt;: &lt;p&gt;glyphboard for iphone&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/255896986</link><guid>http://blog.pdatasolutions.com/post/255896986</guid><pubDate>Tue, 24 Nov 2009 12:01:37 -0700</pubDate></item><item><title>#haml</title><description>&lt;a href="http://haml-lang.com/"&gt;#haml&lt;/a&gt;: &lt;p&gt;haml for rails&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/251212371</link><guid>http://blog.pdatasolutions.com/post/251212371</guid><pubDate>Fri, 20 Nov 2009 15:56:23 -0700</pubDate></item><item><title>Apple's Mistake</title><description>&lt;a href="http://feedproxy.google.com/~r/PaulGrahamUnofficialRssFeed/~3/r0i05TVeB0o/apple.html"&gt;Apple's Mistake&lt;/a&gt;: &lt;p&gt;“Software isn’t like music or books. It’s too complicated for a third party to act as an intermediary between developer and user. And yet that’s what Apple is trying to be with the App Store: a…&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/251026133</link><guid>http://blog.pdatasolutions.com/post/251026133</guid><pubDate>Fri, 20 Nov 2009 11:56:47 -0700</pubDate></item><item><title>Hackido: Install Ruby on Rails on Ubuntu Karmic Koala 9.10</title><description>&lt;a href="http://www.hackido.com/2009/11/install-ruby-on-rails-on-ubuntu-karmic.html"&gt;Hackido: Install Ruby on Rails on Ubuntu Karmic Koala 9.10&lt;/a&gt;: &lt;p&gt;setup rails on base ubuntu 9.10 image&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/248727887</link><guid>http://blog.pdatasolutions.com/post/248727887</guid><pubDate>Wed, 18 Nov 2009 12:51:33 -0700</pubDate></item><item><title>Simple CouchDB multi-master clustering via Nginx — Ephemera</title><description>&lt;a href="http://ephemera.karmi.cz/post/247255194/simple-couchdb-multi-master-clustering-via-nginx"&gt;Simple CouchDB multi-master clustering via Nginx — 
Ephemera&lt;/a&gt;: &lt;p&gt;nginx front to couchdb&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/248169607</link><guid>http://blog.pdatasolutions.com/post/248169607</guid><pubDate>Tue, 17 Nov 2009 23:22:18 -0700</pubDate></item><item><title>Delete Flash Cookies - OS X Daily</title><description>&lt;a href="http://osxdaily.com/2009/11/13/delete-flash-cookies/"&gt;Delete Flash Cookies - OS X Daily&lt;/a&gt;: &lt;p&gt;how to remove flash cookies&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/243061284</link><guid>http://blog.pdatasolutions.com/post/243061284</guid><pubDate>Fri, 13 Nov 2009 17:41:57 -0700</pubDate></item><item><title>Twitter Data Dump: InfoChimps Puts 1B Connections Up For Sale</title><description>&lt;a href="http://www.readwriteweb.com/archives/twitter_data_dump_infochimp_puts_1b_connections_up.php"&gt;Twitter Data Dump: InfoChimps Puts 1B Connections Up For Sale&lt;/a&gt;: &lt;p&gt;large twitter datasets for sale&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/242973040</link><guid>http://blog.pdatasolutions.com/post/242973040</guid><pubDate>Fri, 13 Nov 2009 15:56:48 -0700</pubDate></item><item><title>NSA to store yottabytes of surveillance data in Utah megarepository</title><description>&lt;a href="http://www.crunchgear.com/2009/11/01/nsa-to-store-yottabytes-of-surveillance-data-in-utah-megarepository/"&gt;NSA to store yottabytes of surveillance data in Utah megarepository&lt;/a&gt;: &lt;p&gt;&lt;a href="http://news.ycombinator.com/item?id=915971"&gt;Comments&lt;/a&gt;&lt;/p&gt;</description><link>http://blog.pdatasolutions.com/post/230933107</link><guid>http://blog.pdatasolutions.com/post/230933107</guid><pubDate>Mon, 02 Nov 2009 10:17:58 -0700</pubDate></item></channel></rss>
