Apache Hadoop Get Together Berlin

2012-07-23 20:41
As seen on Xing - the next Apache Hadoop Get Together is planned to take place in August:

When: 15 August, 6 p.m.

Where: Immobilien Scout GmbH, Andreasstr. 10, 10243 Berlin

As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be time for discussion.

It is important to indicate attendance. Only registered visitors will be permitted to attend.

Register here: https://www.xing.com/events/hadoop-get-together-1114707

Talks scheduled thus far:

Dragan Milosevic

Robust Communication Mechanisms in zanox Reporting Systems

Annoyingly often, we wanted to improve just one component of our distributed reporting system but had to update almost everything, because of an RPC version mismatch between the updated component and the rest of the system. To mitigate this problem and to significantly simplify the integration of new components, we extended our RPC protocol to perform a version handshake before the actual communication starts. This RPC extension is accompanied by serialisation/deserialisation methods that are backward compatible: they can successfully deserialise any older serialised version of the exchanged objects. Together, these extensions let us operate multiple versions of frontend and backend components side by side, and decide autonomously what to update or improve in our distributed reporting system, and when.
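To make the idea in the abstract more concrete, here is a minimal sketch of a version handshake combined with backward-compatible deserialisation. It is not the zanox implementation, which the abstract does not describe in detail. The function names, the JSON wire format and the `currency` field added in version 2 are all assumptions made up for illustration.

```python
import json


def negotiate_version(client_versions, server_versions):
    """Handshake: pick the highest protocol version both sides understand."""
    common = set(client_versions) & set(server_versions)
    if not common:
        raise ValueError("no common protocol version")
    return max(common)


def serialise_report(report, version):
    """Serialise a (hypothetical) report object in the negotiated version."""
    payload = {"version": version, "clicks": report["clicks"]}
    if version >= 2:
        # Assumption: version 2 introduced a currency field.
        payload["currency"] = report.get("currency", "EUR")
    return json.dumps(payload)


def deserialise_report(data):
    """Read any known version; fill defaults for fields older versions lack."""
    payload = json.loads(data)
    report = {"clicks": payload["clicks"]}
    if payload["version"] >= 2:
        report["currency"] = payload["currency"]
    else:
        report["currency"] = "EUR"  # default for version 1 messages
    return report


# An updated backend (versions 1 and 2) talking to an old frontend (version 1).
agreed = negotiate_version(client_versions=[1], server_versions=[1, 2])
message = serialise_report({"clicks": 3, "currency": "USD"}, agreed)
restored = deserialise_report(message)
```

Because the reader accepts every older version, a component can be upgraded on its own while its peers keep sending the old format.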

Two other talks are planned and I will provide you with further information soon.

A big Thank You goes to Immobilien Scout GmbH for providing the venue at no cost for our event and for sponsoring the videotaping of the presentations.

Looking forward to seeing you in Berlin,


Need your input: Failing big data projects - experiences from the wild

2012-07-18 20:11
A few weeks ago my talk on "How to fail your big data project quick and rapidly" was accepted at the O'Reilly Strata conference in London. The basic intention of this talk is to share some anti-patterns, embarrassing failure modes and "please don't do this at home" kind of advice with those entering the buzzwordy space of big data.

Inspired by Thomas Sundberg's presentation on "failing software projects", the talk will be split into five chapters and highlight the top two failure factors for each.

I only have so much knowledge of what can go wrong when dealing with big data. In addition, no one likes talking about what did not work in their environment. So I'd like to invite you to share your war stories in a public etherpad - either anonymously or including your name so I can give credit. Some ideas are already sketched out - feel free to extend, adjust, re-rank or change.

Looking forward to your stories.

Note to self: Clojure with Vim and Maven

2012-07-17 20:07
Steps to get a somewhat working Clojure environment with vim:

Note: There is more convenient tooling for Emacs (see also getting started with Clojure and Emacs) - it's just that my fingers are more used to interacting with vim...

2nd note: This post is not an introduction or walkthrough on how to get Clojure set up in vim - it's not even particularly complete. This is intentional - if you want to start tinkering with Clojure: Use Emacs! This is just my way to re-discover through Google what I did the other day but forgot in the meantime.

Apache Sling and Jackrabbit event coming to Berlin

2012-07-12 20:59
Interested in Apache Sling and/or Apache Jackrabbit? Then you might like to hear that from September 26th to 28th there will be an event in town on these two topics - mainly organised by Adobe, but labelled as a community event, meaning that a number of active community members will attend the conference: adaptTo().

From their website:

In late September 2012 Berlin will become the global heart beat for developers working on the Adobe CQ technical stack. pro!vision and Adobe are working jointly to set up a pure technical event for developers that will be focused on Apache Sling, Apache Jackrabbit, Apache Felix and more specifically on Adobe CQ: adaptTo(), Berlin. September 26-28 2012.

Preparation done - clock is ticking

2012-05-31 21:24
The clock is ticking - only one more weekend to go before Berlin Buzzwords opens its doors for the main conference (check out the Wiki for the Sunday evening Barcamp and the Sunday Movie Hackday). Looking forward to an amazing week with awesome speakers and great attendees.

One word of warning beforehand: Given all the buzz around the conference, from now until the middle of next week I won't take any major decisions, I most likely won't be able to follow through on any additional organisation, and I probably won't remember everyone I meet on-site.

In case I do take decisions - don't trust any of them. If you do need help organising some meetup or dinner - I'm happy to help out with recommendations on where to go and who to ask, and I'm also happy to put you in touch with people relevant to your area of interest. However, when it comes to selecting the restaurant, deciding on the day and time, booking a table and informing everyone involved, you are on your own. In case you have any questions, requests or advice, please send a copy to my inbox to make sure it will be dealt with (though it might take some time for me to get back to today's inbox zero level, I'll make sure I get through all of it).

Other than that - thanks to ntc and Nick the Barcamp is all set up, the conference is well on track, and thanks to many external helping hands we've again got a convincing line-up of satellite events. In addition I made sure the Apache Mahout people got a time and place to meet, and I managed to review all proposals that sounded interesting at Strata London (great stuff on the business side of big data - go there if you want to learn more on the business side of the topics covered by Berlin Buzzwords and more). Everything else will have to wait at least until the end of next week.

CU in Berlin - bring sun and warm weather with you :)

Last minute Getting Around information for Berlin Buzzwords

2012-05-29 12:47
I've been sharing information on how to get around in Berlin more often than I'd like to type it out - putting it here for future reference.

Before going to Berlin make sure to put an app on your phone that helps with finding the right public transport mix to use for going from one place to another:

If you want to get around for sightseeing - other than making sure to pack a travel guide, consider renting a bike for a day or two. It's rather safe to ride one in Berlin, and there are several routes that are all green and calm. Check out bbbike.de to plan your routes - though not the prettiest website, it does have comprehensive information on road conditions and lets you avoid cobblestones or less well lit streets. Try it out - it served me very well.

To actually rent a bike - ask your hotel, usually they have decent offers or can point you at a local bike shop that has rental offers. Prices should be roughly 10,- Euros a day or 50,- a week.

One warning to pedestrians and anyone renting a car: Bicycles are very common in Berlin, in particular in summer. Watch out when turning, and don't underestimate their speed. When walking on the sidewalks, watch out for lanes reserved for bikes - usually they are red with white stripes but can look slightly different - see also some images on flickr.

Teddy in Zürich

2012-05-28 20:20
A few beautiful sunny though windy days in Zürich in late April:

View from the path between Ütliberg and Adliswil/Felsenegg:

Strolling through the city and sitting next to Zürichsee enjoying the sun afterwards:


A boat trip to Rapperswil - started cold and cloudy, finished warm and sunny:


Teddy in Poznan

2012-05-27 20:03
Some images taken in Poznan after GeeCon - big Thanks! to Dawid for giving advice on where to go for sightseeing, exhibitions and going out.

The tour started close to river Warta - it being a sunny day it seemed like a perfect fit to just walk through the city, starting along the river headed towards the cathedral:


After that Poznan Citadel was a great place to spend lunch time - sitting somewhere green and shady:

The afternoon was dedicated to discovering the city center, several local churches and the national gallery:


GeeCon - Testing hell and how to fix it

2012-05-26 08:08
The last regular talk I went to was on testing hell at Atlassian – in particular the JIRA project. What happened to JIRA might actually be familiar to developers who have to deal with huge legacy projects that predate the JUnit and dependency injection era: Over time their test base grew into a monster that was hard to maintain and didn't help at all with making developers confident at check-in time that they would not break anything.

On top of 13k unit tests they had accumulated 4k functional tests and several hundred Selenium user interface tests, spread over 65 Maven modules with 554 dependencies that represented quite a technology mix from old to new, including different libraries for solving the same task. They used 60+ remote agents for testing, including AWS instances, orchestrated by a Bamboo installation, with different plans for every supported version branch, all tested in parallel.

Most expensive were platform tests that were executed every two to four weeks before each release – those tested JIRA with differing CPU configurations, JVMs, Browsers, databases, deployment containers. Other builds were triggered on commit, by dependencies or nightly.

The problem was that builds would take 15 minutes for unit tests, one hour for functional tests and several hours for all the rest – that means developers got feedback only after they were home, essentially blocking other developers' work. For unit tests that resulted in fix turnaround times of several hours, for integration tests several days. Development would slow down, developers became afraid of commits, it became difficult to release – in summary, morale went down.

Their problems: Even tiny changes caused test avalanches. As tests were usually red, no one would really care. Developers would not run tests because of the effort involved and got feedback only after leaving work.

Some obvious mistakes:

Tests were separate from the code they tested – in their case in a separate Maven module. So on every commit the whole suite had to run. Also, back when the code was developed, dependency injection had only just started to catch on, which meant the code was entangled, closely coupled and hard to test in isolation. There were opaque fixtures hard-coded in XML configuration files that captured application scope but had to be maintained in the tests.

Their strategy to better testing:

  • Introduce less fragile UI tests based on the page objects pattern to depend less on the actual layout and more on the functionality behind.
  • Put test fixtures into the test code, by introducing REST APIs for modification and backdoors that are only open in the test environment.
  • Put flickering tests into quarantine and either fix them quickly or delete them – if no one fixes them, they are probably useless anyway.
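The page objects pattern from the first bullet can be sketched as follows. This is only an illustration and not Atlassian's code: `FakeBrowser` is a hypothetical stand-in for a real browser driver so that the example runs without any UI tooling, and the element ids are invented.

```python
class FakeBrowser:
    """Hypothetical stand-in for a browser driver: maps element ids to text."""

    def __init__(self, elements):
        self.elements = dict(elements)

    def text_of(self, element_id):
        return self.elements[element_id]

    def type_into(self, element_id, value):
        self.elements[element_id] = value


class IssuePage:
    """Page object: the only place that knows how the page is laid out."""

    SUMMARY_FIELD = "summary"    # invented element id
    STATUS_LABEL = "status-val"  # invented element id

    def __init__(self, browser):
        self.browser = browser

    def set_summary(self, text):
        self.browser.type_into(self.SUMMARY_FIELD, text)

    def summary(self):
        return self.browser.text_of(self.SUMMARY_FIELD)

    def status(self):
        return self.browser.text_of(self.STATUS_LABEL)


def check_summary_can_be_edited(page):
    """A test written against functionality, not against element ids."""
    page.set_summary("Build is red")
    return page.summary() == "Build is red" and page.status() == "Open"


page = IssuePage(FakeBrowser({"summary": "", "status-val": "Open"}))
result = check_summary_can_be_edited(page)
```

If the layout changes, only the constants in `IssuePage` need updating, while the tests that use it stay untouched.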

After those simple measures they started splitting the software into multiple real modules to limit the scope of development and raise the responsibility of development teams. That comes with the advantage of having tests close to the real code, but also with the cost of a more complex CI hierarchy. However, in well organised software with such a project hierarchy, commits tended to go into leaf modules only – which reduced the number of builds quite a bit.

There is a tradeoff between speed and control: Modularizing means you no longer have everything in one workspace, but in turn it means faster development for most of your tasks. For large refactorings no one will stop you from putting all code into one IDEA workspace.

The goal for Atlassian was to turn the pyramid of tests upside down: Have many fast unit tests, fewer REST/HTML tests and even fewer Selenium tests. The philosophy was to only provide REST tests if there is no way at all to cover the same function in a unit test.

In terms of speeding up execution they started batching tests against one instance to avoid installation time, merged tests, used in-process databases, mocked IO and webservers where possible. Also putting more hardware in does help, so does avoiding sleeping in tests.
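As an illustration of the in-process database idea, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `issues` table and the helper functions are invented for the example; JIRA itself is a Java application and the talk did not name the database they used.

```python
import sqlite3


def create_issue_store():
    """Create a fresh in-memory database; nothing to install or clean up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (id INTEGER PRIMARY KEY, summary TEXT NOT NULL)"
    )
    return conn


def add_issue(conn, summary):
    """Insert one issue and return its generated id."""
    cursor = conn.execute("INSERT INTO issues (summary) VALUES (?)", (summary,))
    conn.commit()
    return cursor.lastrowid


def count_issues(conn):
    return conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]


# Each test gets its own database, so tests cannot influence each other.
store = create_issue_store()
first_id = add_issue(store, "Tests take too long")
total = count_issues(store)
store.close()
```

Creating the store takes a negligible amount of time compared to provisioning an external database, which is where most of the saving comes from.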

In terms of splitting code – in addition to splitting by responsibility, code can also be split by maturity, to keep what is evolving quickly close together until it is stable.

The day finished with a really inspiring keynote by Kevlin Henney on Cool Code – showing several pieces of software that were either miserably failing or incredibly cool. His intention when reading code is to extend a coder's vocabulary when it comes to programming. That's why even the obfuscated C code competition makes for an interesting read, as it tells you things about language features you otherwise might never have learned about. One very important conclusion from his talk: “If you don't have the time to read, you have neither time nor tools to write.” - though made by Stephen King about literature, this statement might as well apply to software; after all, to some extent what we produce is some kind of art, some kind of literature in its own right.

GeeCon - Solr at Allegro

2012-05-25 08:07
One talk that was particularly interesting to me was on Solr usage at Allegro (the Polish eBay). In terms of numbers: They have 20Mio offers in Poland, another 10Mio active offers in partnering countries. In addition, their index holds 50Mio inactive offers in Poland and 40Mio closed offers outside that country. They serve 8Mio updates a day, that is roughly 100 updates a second. Those are related to the start/end of the bidding phase, buy now actions, cancelled bids and the bids themselves.

They have 105Mio requests per day; at peak time in the evening that is 3.5k requests per second. Of those, 75% are answered in less than 5ms, 90% in less than 20ms.
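The per-second figures can be sanity-checked with a little arithmetic. The sketch below only uses the daily totals quoted in this post; the helper name is made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day):
    """Average rate per second for a given daily total."""
    return per_day / SECONDS_PER_DAY


updates_per_second = per_second(8_000_000)     # about 93, i.e. "roughly 100"
requests_per_second = per_second(105_000_000)  # about 1,215 on average
peak_factor = 3_500 / requests_per_second      # peak is about 2.9x the average
```

So the evening peak of 3.5k requests per second is a bit less than three times the daily average load.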

To achieve that performance they are using Solr. Coming from a database-based system and going via a proprietary search product, they are now happy users of Solr, with much better support both from the community and from contractors than with their previous paid-for solution.

The speakers went into some detail on how they solved particular technical issues: They decided to go for an external data feeder to avoid putting the database itself under too much load, even when just indexing the updates. On updates they have to reconstruct the whole document, as an update in Solr right now means deleting the old document and indexing the new one. In addition, commits are pretty expensive, so they ended up delaying commits for as long as the SLA would allow (one minute) and committing the updates as a batch.
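The delayed commit strategy can be sketched roughly like this. It is a simplified model and not Allegro's feeder: `commit_fn` is a hypothetical callback standing in for "send the batch to the search server and commit", and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class BatchingIndexer:
    """Collect full-document updates and commit them at most once per window."""

    def __init__(self, commit_fn, max_delay_seconds=60.0, clock=time.monotonic):
        self._commit_fn = commit_fn      # hypothetical: index + commit a batch
        self._max_delay = max_delay_seconds
        self._clock = clock
        self._pending = {}               # doc id -> latest full document
        self._oldest = None              # arrival time of oldest pending update

    def add(self, doc_id, document):
        if not self._pending:
            self._oldest = self._clock()
        # An update replaces the whole document, so only the latest one matters.
        self._pending[doc_id] = document
        self.flush_if_due()

    def flush_if_due(self):
        """Commit if the oldest pending update has waited as long as allowed."""
        if self._pending and self._clock() - self._oldest >= self._max_delay:
            self.flush()
            return True
        return False

    def flush(self):
        batch = list(self._pending.values())
        self._pending.clear()
        self._oldest = None
        if batch:
            self._commit_fn(batch)
```

A real feeder would also call `flush_if_due()` from a timer, so that a quiet period without new updates cannot delay a pending batch beyond the SLA.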

They tried to shard indexes by the category used for faceting – that did not work particularly well, as with their user behaviour it resulted in too many cross-shard requests. Index size was an issue for them, so they reduced the amount of data indexed and stored in Solr to the absolute minimum – all else was outsourced to a key-value store (in their case MongoDB).

Caching proved to be the component that needed the most tweaks – they put Varnish in front (Solr speaks XML over HTTP, which is simple enough to find caches for) – and given the indexing delay they had in place anyway, they could tune eviction times accordingly. The result was cache hit rates of about 30 to 40 percent. When it comes to internal caches: High eviction and low hit rates are a problem. Watch the Solr Admin Console for more statistics. Are there too many unique objects in your index? Are caches too small? Are there too many unique queries? They ended up binding users to Solr backends by making the routing sticky based on the user's cookie – as users tend to drill down on the same dataset over and over again, in their case that raised hit rates substantially. When tuning filter queries: Keep them as independent as possible – don't use many unique combinations of the same filters over and over again. Instead, filter individually to make better use of that cache.
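Sticky routing by cookie can be as simple as hashing the cookie value onto the list of backends. The sketch below is an assumption about how such routing might look, not a description of Allegro's setup; the backend names are invented.

```python
import hashlib


def pick_backend(user_cookie, backends):
    """Map a user's cookie to one backend, always the same one for that user."""
    digest = hashlib.sha256(user_cookie.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Invented backend names for illustration.
backends = ["solr-1", "solr-2", "solr-3"]
first = pick_backend("session-4711", backends)
second = pick_backend("session-4711", backends)
```

Because the same user always lands on the same backend, that backend's internal caches already hold the data the user keeps drilling into. Note that with plain modulo hashing most users are remapped when the number of backends changes.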

For them Solr proved to be a stable, efficient and flexible system that is easy to monitor, maintain and change. It ran without failure for the first 8 months, and the whole architecture was designed and prototyped (at near production quality) by one developer in six months.

Currently the system is running on 10 Solr slaves (+ power backup) compared to 25 nodes before. A full index takes 4 hours, bottlenecked at the feeder; potentially that could be pushed down to one hour. Updates of course flow in continuously.