December Apache Hadoop Get Together Berlin

2011-11-24 20:14
First of all, please note that meetup organisation is being transitioned over to our Xing meetup group, so to be notified of future meetings, make sure to join that group. Please also register for the December event: in contrast to past meetups, space will be limited this time, so grab a ticket early. If you cannot make it, please let the organiser know so he can issue additional tickets.

For those of you currently following this blog only for announcements:

When: December 7th 2011, 7 p.m.

Where: Smarthouse GmbH, Erich-Weinert-Str. 145, 10409 Berlin

Speaker: Martin Scholl
Title: On Firehoses and Storms: Event Thinking, Event Processing

Speaker: Douwe Osinga
Title: Overview of the Data Processing Pipeline at Triposo

Looking forward to seeing you at the next Apache Hadoop Get Together Berlin in December.

Apache Con Wrap Up

2011-11-16 20:45
First things first - slides, audio and abstracts of Apache Con are all online now on their Lanyrd page. So if you missed the conference, or had to skip a session because it conflicted with another interesting one, this is your chance to catch up.

For those of you who are speaking German, there's also a summary of Apache Con available on heise Open. (If you don't speak German, I have been told that the Google Translate version of the site captures the gist of the article reasonably well.)

Cloudera in Berlin

2011-11-14 20:24
Cloudera is hosting another round of training sessions in Berlin in November this year. In addition to the training on Apache Hadoop, this time around there will also be training on Apache HBase.

Register online via:

Being in San Francisco

2011-11-06 04:47
I spent the last two weeks together with Thilo in San Francisco - and neighboring areas. I had asked beforehand for recommendations on where to go and what to do, and had purchased a "Rough Guide to California" as well as a "Lonely Planet" guide for San Francisco. In addition I shared my arrival and departure times with a few people I know here. As was to be expected, our schedule quickly grew until it exploded, so we ended up doing greedy optimisation, stuffing things in and taking them out again as we went along. The result of all that: an amazing two weeks that went by way too fast, and the conclusion that we do need to return and bring far more time next time around.

A huge thanks for the warm welcome, for shared local knowledge on where to go, for invitations for lunch or dinner, as well as for fun hours at the Castro during Halloween. Special thanks to Datameer, who when asked for accommodation recommendations kindly offered to host us - it makes such a big difference to get the chance to stay in a local neighborhood and avoid hotels altogether. All in all it's amazing to fly across an ocean, cross multiple time zones and arrive in a city that almost feels like being home.

Following is a brief overview of what our final schedule turned out to be - I'm happy to share the pictures we took privately. After getting our car on Sunday, we went over to Berkeley to see their impressive campus - and got to see part of the Berkeley Occupy movement.

Day one was reserved for sports

After biking the bridge we went down to Sausalito and got some delicious food at Fish - a restaurant serving all sorts of healthy and tasty fish dishes. After that we went out on a canoe from Sea Trek Kayaking, taking a closer look at the houseboat community and making a brief tour over to the Sausalito ferry port. Back at the shore, we cycled into Sausalito and took the ferry to San Francisco.

Day 2 was booked for Highway #1 to Santa Cruz

We headed down scenic Highway 1, past the cliffs at Half Moon Bay, and went out for a hike in Butano State Park to hug some redwood trees and take pictures of a fairy-tale-like forest. Then we headed over to Pigeon Point Lighthouse - even if you are not into staying at hostels you should stop there: the place offers a great view of the ocean and of the lighthouse itself. Finally we went down to beautiful Santa Cruz.

Day 3 Muir Woods

Not much to be said here: it's always impressive to walk underneath huge redwood trees and to go hiking in the surrounding mountains for a better view.

Weekend for Yosemite

We took the route via Highway 140 - the drive itself was already interesting, as it took us through quite different countryside than what we had seen until then. We got through Mariposa and in the end over to Midpines. We stayed at the friendly Bug Rustic Mountain - classified as a hostel, it features a spa with sauna and whirlpool. We were lucky enough to arrive on the Friday before Halloween and got to see a screening of "The Rocky Horror Picture Show" - including a printed transcript with the tag lines to be shouted at the screen, for those who do not know them already.

On Saturday we went hiking in amazing Yosemite park. Only then did I realize how close together different landscapes can be: it took us only half a day by car to get from coast and ocean to high mountains. We chose to hike to the upper Yosemite Fall - after returning from 7 miles of rather steep trails on a sunny and fortunately quite clear day, we were absolutely tired.

Day 7 Halloween

We spent Halloween morning over in St. Helena and went to the Francis Ford Coppola Winery. Hard to believe, but several tasty types of Californian wine do exist - at least that's what Thilo confirmed while in St. Helena.

The evening was booked for dressing up and going to the Castro. Though presumably calmer than in past years, the area is still a must-go for Halloween if you are into watching people walk around in impressive costumes.

Day 8 for Alcatraz

Not much to be added here - don't miss it if you visit San Francisco.

Day 9 for Chinatown

After several busy days we only went to Chinatown that day and tried to recover for the rest of the afternoon - and went out for Día de los Muertos in the evening.

Day 10 for watching whales

I'm happy I took sea-sickness precautions on that one. It was a bit of a rough ride on the catamaran, but in the end we got to see whales close to the Farallon Islands. The naturalist did a great job, not only explaining the life of the whales but also providing some background on the islands.

Day 11 for Point Reyes and North Beach

In the morning we followed a recommendation to go see North Beach - if you like Friedrichshain or Kreuzberg in Berlin, or Dresden Neustadt, do not miss North Beach in San Francisco.

The afternoon was reserved for driving over to Point Reyes Lighthouse. Though we did not spot any whales in the water, the landscape was amazing all by itself.

Day 12 - final walk through San Francisco

We took some time to walk along Hayes Street and through Golden Gate Park up to the Cliff House, and took a Muni bus back to the city - the way back was just too long to walk.

Thanks again to Stefan, St.Ack, Jon, JD, Ted, Ellen, Doug, Anne, Johan, Doris, Jens, Lance, Felix, Markus and everyone else who helped make this trip as awesome as it really was. Sorry to everyone who I did not manage to meet or get in touch with - hopefully we can fix that next time I'm here, or next time you are in Berlin.

Apache Hadoop Get Together - Hand over

2011-11-02 16:20
Apache Hadoop receives lots of attention from large US corporations that use the project to scale their data processing pipelines:

“Facebook uses Hadoop and Hive extensively to process large data sets. [...]” (Ashish Thusoo, Engineering Manager at Facebook), "Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features [...]" (Jay Kreps, Principal Engineer, LinkedIn), "Hadoop enables [Twitter] to store, process, and derive insights from our data in ways that wouldn't otherwise be possible. [...]" (Kevin Weil, Analytics Lead, Twitter). Found on the Yahoo developer blog.

However, the system's use is not limited to large corporations: with 101tec, Zanox and nurago, local German players are also using the project to enable new applications. Add components like Lucene, Redis, CouchDB, HBase and UIMA to the mix and you end up with a set of major open source components that allow developers to rapidly build systems that until a few years ago were possible only in Google-like companies or in research.

The Berlin Apache Hadoop Get Together, started in 2008, has made it possible to learn how the average local company leverages this software. It is a platform to get in touch informally and to exchange knowledge and best practices across corporate boundaries.

After three years of organising the event it is time to hand it over to new caring hands: David Obermann from Idealo kindly volunteered to take over the organisation. He is a long-term attendee of the event and will continue it in roughly the same spirit as before: technical talks on success stories by users and new features by developers - not restricted solely to Hadoop, but also taking into account related projects.

A huge thank you to David for taking up the work of co-ordinating the event and finding a venue and a sponsor for the videos! If any of you attending think you have an interesting story to share, would like to support the event financially or just want to help out, please get in touch with David.

Looking forward to the next Apache Hadoop Get Together Berlin. Watch this space for updates on when and where it will take place.

One Ring to rule them all

2011-10-28 19:23
One Ring to find them

One Ring to bring them all

and in the darkness bind them:

Apache Con NA

2011-10-25 10:50
Title: Apache Con NA
Location: Vancouver
Link out: Click here
Start Date: 2011-11-07
End Date: 2011-11-11

See you in Vancouver at Apache Con NA 2011

2011-10-24 13:49
In mid-November Apache hosts its famous yearly conference - this time in Vancouver, Canada. They kindly accepted my presentation on Apache Mahout for intelligent data analysis (mostly focused on introducing the project to newcomers and showing what happened within the project in the past year - if you have any wish concerning topics you would like to see covered in particular, please let me know), as well as a more committer-focused one on talking people into creating patches (with the goal of highlighting some of the issues newcomers who want to contribute to free software projects run into, and of initiating a discussion on what helps convince them to keep up the momentum and overcome obstacles).

Looking forward to seeing you in Vancouver for Apache Con NA.

GoTo Con AMS - Day 2

2011-10-23 10:47
Day two of GoTo Con Amsterdam started with a keynote by former Squeak developer Dan Ingalls. He introduced the Lively Kernel - a component architecture for HTML5 that runs in any browser and allows easy composition, sharing and programming of items. Having seen Squeak years ago and having been thrilled by its concepts even back then, it was amazing to see what you can do with the Lively Kernel in a browser. If you are a designer and have some spare minutes, consider taking a closer look at this project and dedicating some of your time to helping them get better graphics and shapes for their system.

After the keynote I had a hard time deciding whether to watch Ross Gardler's introduction to the Apache Way or Friso van Vollenhoven's talk on building three Hadoop clusters in a year - too much interesting stuff in parallel that day. In the end I went for the Hadoop talk - listening to presentations on what Hadoop is actually being used for is always interesting, especially if it involves institutions like RIPE, who have the data to analyze the internet downtime in Egypt.

Friso gave a great overview of Hadoop and of how you can even use it for personal purposes: Apache Whirr makes it easy to use Hadoop in the cloud by enabling simple EC2 deployment, and the Twitter API is a never-ending source of data to analyze (if you don't have any yourself).

After Jim Webber's presentation on the graph database Neo4j I joined the talk on HBase use cases by Michael Stack. He introduced a set of HBase usages, problems people ran into and lessons learned. HBase itself is built on top of HDFS - and as such inherits its advantages, strengths and some of its weaknesses.

It is great for handling hundreds of GB up to PBs of data in an online, random-access but strongly consistent model. It provides a Ruby-based shell, comes with a Java API, MapReduce connectivity, and Pig, Hive and Cascading integration, exposes metrics through the Hadoop metrics subsystem via JMX and Ganglia, and provides server-side filters and co-processors, Hadoop security, versioning, replication and more.


StumbleUpon deals with 1B stumbles a month and has 20M users (growing); users spend approximately 7 hours a month stumbling. For them HBase is the de facto storage engine. It has now been in production for 2.5 years and has enabled a "throw nothing away" culture and streamlined development. Analysis is done on an HBase cluster separate from the online version. Their lessons learned: educate engineering on how it works, study production numbers (small changes can make for a big payoff), and over-provisioning makes your life easier and gets your weekends back.


... is a distributed, scalable time series database that collects, stores and serves metrics on the fly. For StumbleUpon it is their ears and eyes into the system, and it quickly replaced the usual mix of Ganglia, Munin and Cacti.


As announced earlier this year, Facebook's messaging system is based on HBase. Facebook metrics and analytics are also stored in HBase. The full story is available in a SIGMOD paper by Facebook.

In short - for Facebook Messaging, HBase has to deal with 500M users and with millions of messages and billions of instant messages per day. The most interesting piece of the system here was their migration path: by running both systems in parallel they made switching over really smooth, albeit still technologically challenging.

Their lessons learned include the need to study production and adjust accordingly, and to iterate on the schema to get it right. They also found that there were still some pretty gnarly bugs - however, with the help of the HBase community those could be sorted out bit by bit. They also concentrated on building a system that allows for locality - inter-rack communication can kill you.


They keep their version of the Bing web crawl in HBase. They have high data ingest volumes (up to multiple TB/hour) from multiple streams. On top of that, their application has a wide spectrum of access patterns (from scans down to single-cell access). Yahoo right now runs the single largest known HBase cluster, on 980 nodes with 16 2.4 GHz cores, 24 GB RAM and 6x2 TB of disk each. Their biggest table has 50B documents; most of the data is loaded in bulk though.


... uses HBase as the backend for their image hosting service. In contrast to the HBase users above they don't have a dedicated dev team, but they do have highly motivated and skilled ops. Being cost-sensitive, and with a little bit of bad luck, really everything went bad for them that could go bad - from crashing JVMs and bad RAM to a glibc bug with a race condition. Their lessons learned include that it's better to run more small nodes than fewer big ones. In addition, lots of RAM is always great to avoid swapping.

The final talk I attended that day was on tackling the folklore around high-performance computing. The speakers revisited common wisdom that is generally accepted in the Java community and re-evaluated its applicability to recent hardware architectures. Make sure to check out their slides for details on common misconceptions when it comes to optimization patterns. The basic takeaway from this talk is to know not only your programming language and framework, but also the VM you are implementing your code for, the system your application will run on, and the hardware your stuff will be deployed to: hardware vendors have gone to great lengths optimizing their systems, but software developers have been amazingly good at cancelling out those optimizations quicker than they were put in.

All in all a great conference with lots of inspiring input. Thanks to the organizers for their hard work. Looking forward to seeing some of you over in Vancouver for Apache Con NA 2011.

GoTo Con AMS - Day 1

2011-10-22 20:44
Last week GoTo Con took place in Amsterdam. Being a sister conference to GoTo in Aarhus, the Amsterdam event focused on the broad topics of agile development, architectural challenges, backend and frontend development, and platforms like the JVM and .NET. In addition the Amsterdam event featured a special Apache track tailored towards presentations on the development model at Apache and the technologies developed there.

Keynote: Dart

The first day started with a keynote by Kasper Lund, who introduced Google's new language Dart. Kasper was involved with developing V8 at Google. Based on his (and other project members') experiences with large JavaScript projects, the idea was born to create a new language for the browser. The goal was to build a language that has fewer pitfalls than JavaScript, is easier to provide tool support for, and makes reasoning about code easier. Dart comes with class-based single inheritance, lexical scoping and optional typing. It is by design single-threaded. Isolates cleanly introduce the concept of isolated workers that communicate through message passing only and thus can be run in parallel by the VM. One concept that seemed particularly interesting for an interpreted language was that of snapshots: an application can be serialized after it has loaded and initialized, and the result can even be transferred, shortening load time substantially.

So far Dart is just a tech preview - on the agenda of the development team we find items such as better support for rest arguments, enums, reflection, pattern matching, and tooling for test coverage and profiling. All code is freely available, and the language specification and tutorials are open as well. The developers would love to get more feedback from external teams.

Twitter JVM tuning best practices

In his presentation on JVM tuning, Attila Szegedi went into quite some detail on the kinds of measures Twitter usually takes when optimizing code that runs on the JVM and exhibits performance issues. Broadly speaking there are three dimensions along which the usual culprits for bad performance hide:

  • Memory footprint of the application.
  • Latency of requests.
  • Thread coordination issues.

Memory footprint reduction

A first step should always be to verify that memory is actually responsible for the issues seen. Running the JVM with verbose GC logging turned on helps identify how often full GC cycles happen on the machine and how effective they are. The next step is to consider the simple solution: evaluate whether the application can simply be given more memory. If that does not help or is impossible, start thinking about how to shrink memory requirements: use caching to avoid having to load all data in memory at once, and trim down the data representation used in your implementation. When looking into what to trim, know exactly what amount of memory various objects need and how many of these objects you actually keep in memory - this analysis should also go into detail when using code generated by frameworks like Thrift.
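As a minimal sketch of the "trim down the data representation" advice (class and method names below are mine, not from the talk): on a typical 64-bit JVM a boxed java.lang.Long costs roughly 16-24 bytes plus a reference, while an element of a long[] costs exactly 8 bytes, so switching a large boxed collection to a primitive array can shrink the footprint several-fold without changing the results the code computes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative example: the same series of counts stored two ways.
// The boxed variant pays per-element object headers and references;
// the primitive array stores just the 8-byte values.
public class FootprintSketch {

    public static long sumBoxed(List<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;   // each element is unboxed here
        return sum;
    }

    public static long sumPrimitive(long[] values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Long> boxed = new ArrayList<Long>(n);
        long[] primitive = new long[n];
        for (int i = 0; i < n; i++) {
            boxed.add((long) i);
            primitive[i] = i;
        }
        // Same answer, a fraction of the heap - the kind of representation
        // trimming to try before simply reaching for a bigger heap.
        System.out.println(sumBoxed(boxed) == sumPrimitive(primitive));
    }
}
```

The exact per-object overhead depends on JVM version, pointer compression and alignment; the point is the ratio, not the precise byte counts.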

Latency fights

Taking a simple view, latency optimization boils down to making a trade-off between memory usage and time. A slightly less naive view is to understand that there are actually three goals to optimize:

Tuning an application means taking the product of the three and shifting focus while keeping the product stable. Optimization is assumed to increase the resulting product.

The biggest threat to latency are full GC cycles. Things to keep in mind when tuning and optimizing: though the type of GC to run is configurable, this configuration does not apply to the cleanup of eden space - eden is always cleaned up with a stop-the-world GC. In general this is not too grave, as cleaning up objects that are no longer referenced is very cheap. However, it can turn into a problem when there are too many surviving objects.

When it comes to selecting GC implementations: optimize for throughput by delaying GC for as long as possible - this is especially handy for bulk jobs. When optimizing for responsiveness, use low-pause collectors: they incur a somewhat constant penalty but avoid single requests with extremely long response times. This is most handy for online jobs.

Other options to look into: use -XX:+UseAdaptiveSizePolicy and -XX:MaxGCPauseMillis to allow the JVM to size the heap on its own based on your target characteristics. Use -XX:+PrintHeapAtGC to view heap statistics at each collection - especially watch out for from-space being less than 100% full - and use -XX:+PrintTenuringDistribution to keep an eye on the number of ages and the size distribution. In general, give an app as much memory as possible - when using the concurrent mark-and-sweep collector, make sure to over-provision by about 25 to 30% to give the app a GC cushion for operation. If you can spare one CPU, set -XX:CMSInitiatingOccupancyFraction to 0 and let the GC run all the time.
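Spelled out as a command line (the flag spellings are my best mapping of the talk's notes onto HotSpot flags of the JDK 6/7 era - check your JVM version's documentation before copying; the pause target and jar name are placeholders), an invocation combining these options might look like:

```shell
# Hypothetical HotSpot invocation illustrating the flags discussed above.
java -verbose:gc \
     -XX:+UseAdaptiveSizePolicy \
     -XX:MaxGCPauseMillis=100 \
     -XX:+PrintHeapAtGC \
     -XX:+PrintTenuringDistribution \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=0 \
     -jar app.jar
```

Setting CMSInitiatingOccupancyFraction to 0 makes CMS cycle continuously, which is exactly the "spare one CPU" trade-off mentioned above.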

Thread coordination

The last issue generally causing delays is thread coordination. The facilities for multi-threaded programming in Java are still pretty low level - even worse, developers generally know little beyond synchronized: not much about the atomic data types that are available, let alone other features of the java.util.concurrent package.
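To illustrate the atomic data types the paragraph refers to, here is a small hypothetical example (mine, not from the talk): a counter shared by several threads, implemented with AtomicLong instead of a synchronized block.

```java
import java.util.concurrent.atomic.AtomicLong;

// A shared hit counter using a lock-free compare-and-set under the hood
// instead of monitor locking via synchronized.
public class AtomicCounterSketch {
    private final AtomicLong hits = new AtomicLong();

    public long increment() {
        return hits.incrementAndGet();   // atomic read-modify-write
    }

    public long current() {
        return hits.get();
    }

    public static void main(String[] args) throws InterruptedException {
        final AtomicCounterSketch counter = new AtomicCounterSketch();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 10_000; j++) counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        // No lost updates despite four concurrent writers.
        System.out.println(counter.current());   // prints 40000
    }
}
```

The same guarantee with synchronized would serialize every increment through a monitor; the atomic variant stays correct while avoiding that lock.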

Make sure you check out the speaker's slides - they certainly contain valuable information for developers who want to scale their Java applications.


Akka

Another talk that was pretty interesting to me was the introduction of Akka - a project I had only heard about before but did not have any deep technical background knowledge on. The goals when building it were fault tolerance, scalability and concurrency - basically an easy way to scale up and out. Built in Scala, Akka also comes with Java bindings.

Akka is built around the actor model for easier distribution. Actors are isolated, communicate only via messages and share no memory - making it easy to run them in a distributed way without having to worry about synchronization. Distribution across machines is currently based on protocol buffers and NIO. However, the resulting network topology is still hard-wired at development time.
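The actor idea described above can be sketched in plain Java: one thread owns the state, and the outside world interacts with it only by enqueueing messages. This is an illustration of the model under that assumption, not the Akka API, and all names are mine.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal actor: its mailbox is a queue, its state (total) is touched
// only by the single thread running the actor, so no locking is needed.
public class SummingActor implements Runnable {
    // Sentinel message telling the actor to stop (so Long.MIN_VALUE
    // cannot be used as data in this sketch).
    public static final long STOP = Long.MIN_VALUE;

    private final BlockingQueue<Long> mailbox = new LinkedBlockingQueue<Long>();
    private long total;   // owned exclusively by the actor thread

    public void tell(long message) {
        mailbox.add(message);   // the only way to interact with the actor
    }

    public void run() {
        try {
            while (true) {
                long msg = mailbox.take();
                if (msg == STOP) return;
                total += msg;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public long total() {
        return total;   // safe to read after joining the actor thread
    }

    public static void main(String[] args) throws InterruptedException {
        SummingActor actor = new SummingActor();
        Thread t = new Thread(actor);
        t.start();
        for (long i = 1; i <= 100; i++) actor.tell(i);
        actor.tell(STOP);
        t.join();   // join establishes the happens-before for reading total
        System.out.println(actor.total());   // prints 5050
    }
}
```

Because no memory is shared between sender and actor except through the queue, the same actor could in principle sit on another machine with the queue replaced by a network channel - which is exactly the distribution story Akka builds on.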

The goal of new Akka developments is to make roll-out dynamic and adaptive. For that they came up with ZooKeeper-based virtual address resolution, configurable load-balancing strategies and the option to reconfigure at runtime.

Concluding remarks

The first day was filled with lots of technical talks - several remained at the overview/introductory level, which is a good thing when learning about new technologies. In addition there were a few presentations on new features of upcoming and past releases, for instance for Java 7 and Spring 3.1 - it's always nice to learn about the rationale behind changes and improvements.

As for the agile talks - most of them propagated pretty innovative ideas that need a lot of courage to put into practice. However, in several cases I could not help but get the feeling that the processes presented were very specific to the environment they were established in and would not survive sudden stress - be it a decline in revenue or team issues. In addition, quite a few ideas that were introduced as novelties are already inherent in existing processes: trust and natural communication really are the goal when establishing things like Scrum - in the end, the meetings are just the tool to get there. Clarity with regard to vision and business value is core to prioritizing the work to be done. Understanding and finding suitable metrics to measure and monitor the business value of a product should be at the heart of any development project.

Overall the first day brought together a good crowd of talented people exchanging interesting ideas, news on current projects and technical details of battle-field stories. Being still rather small, the Amsterdam edition of GoTo Con certainly made it easy to get in touch with speakers as well as other attendees over a cup of coffee and discuss the presented topics. Huge thanks to the organizers for putting together an interesting schedule, booking a really tasty meal and having a friendly answer to every question from confused attendees.