GoTo Con AMS - Day 2
Posted: | More posts about amsterdam gotocon General
Day two of GoTo Con Amsterdam started with a keynote by former Squeak developer Dan Ingalls. He introduced the Lively kernel - a component architecture for HTML5 that runs in any browser and allows easy composition, sharing and programming of items. Having seen Squeak years ago and being thrilled by its concepts even back then it was amazing to see what you can do with Lively kernel in a browser. If you are a designer and have some spare minutes to spend, consider taking a closer look at this project and dedicating some of your time to help them get better graphics and shapes for their system.
After the keynote I had a hard time deciding on whether to watch Ross Gardler's introduction to the Apache way or Friso van Vollenhoven's talk on building three Hadoop clusters in a year - too many interesting stuff in parallel that day. In the end I went for the Hadoop talk - listening to presentations on what Hadoop is actually being used for is always interesting - especially if it involves institutions like RIPE who have the data to analyze the internet downtime in Egypt.
Frise gave a great overview of Hadoop and how you can even use it for personal purposes: Apache Whirr makes it easy to use Hadoop in the cloud by enabling simple EC2 deployment, the Twitter API is a never ending source for more data to analyze (if you don't have any yourself).
After Jim Webber's presentation on the graph database neo4j I joined the talk on HBase use cases by Michael Stack. He introduced a set of HBase usages, problems people ran into and lessons learned. HBase in itself is built on top of HDFS - and as such inherits its advantages, strengths and some of its weaknesses.
It is great for handling 100s of GB up to PB of data in an online, random access but strong consistency kind of model. It provides a ruby based shell, comes with a java api, map reduce connectivity, pig, hive and cascading integration, provides metrics through the hadoop metrics subsystem that are exposed via JMX and through Ganglia, provides server side filters and co-processors, hadoop security, versioning, replication and more.
Stumbleupon deals with 1B stumbles a month, has 20M users (growing), users spend approx 7h a month stumbling. For them HBase is the de-factor storage engine. It's now 2.5years in production and enabled a "throw nothing away" culture, streamlined development. Analysis is done on a separate HBase cluster from the online version. Their lessons learnt: Educate engineering on how it works, study production numbers (small changes can make a for big payoff), over provisioning makes your life easier and gets your weekends back.
... is a distributed, scalable time series database that collects, stores and serves metrics on the fly. For stumbleupon it is their ears and eyes into the system that quickly replaced the usual mix of ganglia, munin and cacti.
As announced earlier this year, Facebook's messaging system is based on HBase. Also facebook metrics and analytics are stored in HBase. The full story is available in a SIGMOD paper by Facebook.
In short - for Facebook Messaging HBase has to deal with 500M users, millions of messages and billions of instant messages per day. Most interesting piece of the system here was their migration path that by running both systems in parallel made switching over really smooth albeit still technologically challenging.
Their lessons learnt include the need to study production and adjust accordingly, to iterate on the schema to get it right. They also made the experience that there were still some pretty gnarly bugs - however with the help of the HBase community those could be sorted out bit by bit. They also concentrated on building a system that allows for locality - inter rack communication can kill you.
They keep their version of the bing webcrawl in HBase. They have high data ingest volumns (up to multiple TB/hour) from multiple streams. Atop their application also has a wide spectrum of access patterns (from scans down to single cell access). Yahoo right now runs the single larges known HBase cluster on top of 980 2.4 GHz nodes with 16 cores and 24GB Ram each in addition to 6x2TB of disk. Their biggest table has 50B documents, most of the data is loaded in bulk though.
... uses HBase as backend for their image hosting service. In contrast to the above HBase users they don't have a dedicated dev team but are highly motivated and skilled ops. Being cost senstive and with a little bit of bad luck with them really everything went bad that could go bad - from crashing JVMs, bad RAM crashes, bad glibc with a race condition, etc. Their lessons learnt include that it's better to run more smaller nodes than less big nodes. In addition lots of RAM is always great to avoid swapping.
The final talk I attended that day was on tackling the folklore around high performance computing. The speakers re-visited common wisdom that is generally known in the Java Community and re-evaluated it's applicability to recent hardware architectures. Make sure to check out their slides for details on common mis-conceptions when it comes to optimization patterns. Basic take away from this talk is to know not only your programming language and framework but also the VM you are implementing your code for, the system your application will run on and the hardware your stuff will be deployed to: Hardware vendors have gone to great length optimizing their systems, but software developers have been amazing at cancelling out those optimizations quicker then they were put in.
All in all a great conference with lots of inspiring input. Thanks to the organizers for their hard work. Looking forward to seeing some of you over in Vancouver for Apache Con NA 2011.