GeeCon - Randomized testing

2012-05-21 08:02
I arrived late during lunch time on Thursday for GeeCon – however just in time to listen to one of the most interesting talks when it comes to testing. Did you ever have the issue of writing code that runs well in your development environment but crashes as soon as it's rolled out at customers only to find out that their Locale setting was causing the issues? Ever had to deal with random test failure because against better advise your tests did depend on execution order that is almost guaranteed to be different on new JVM releases?

The Lucene community has encountered many similar issues. In effect they are faced with having to test a huge number of different configuration combinations in order to make sure that their software runs in all client setups. In recent months they developed an approach called randomised testing to tackle this problem: Essentially on each run “random tests” are run multiple times, each time with a slightly different configuration, input, in a different environment (e.g. Locale settings, time zones, JVMs, operating systems). Each of these configurations are pseudo random – however on test failure the framework will reveal the seed that was used to initialize that pseudo random number generator and thus allow you to reproduce the failure deterministically.

The idea itself is not new: published in a paper by Ntafos, used in fuzzers to identify security holes in applications this kind of technique is pretty well known. However applying it to write tests is a new idea used at Lucene.

The advantage is clear: With every new run of the test suite you gain confidence that your code is actually stable to any kind of user input. The downside of course is that you will discover all sorts of different issues and bugs not only in your code but also in the JVM itself. If your library is being used in all sorts of different setups fixing these issues upfront however is crucial to avoid users being surprised that it does not work well in their setup. Make sure to fix these failures quickly though – developers tend to ignore flickering tests over time. Adding randomness – and thereby essentially increasing the number of tests in your testsuite – will add the amount of effort to invest in fixing broken code.

Dawid Weiss gave a great overview of how random tests can be used to harden a code base. He introduced the testframework written at carrot search that isolated the random test features: It comes with a RandomizedRunner implementation that can be used to subsitute junit's own runner. It's capable of tracking test isolation by tracking spawned threads that might be leaking out of tests. In addition it provides utilities for instance for creating random strings, locals, numbers as well as annotations to denote how often a test should run and when it should run (always vs. nightly).

So when having tests with random input – how do you check for correctness? The most obvious thing to do is when being able to check the exact output. When testing a sorting method, not matter what the implementation and the input is – the output should always be sorted, which is easy enough to check. Also checking against simpler, but maybe in practice more expensive algorithms is an option.

A second approach is to do sanity checks: Math.abs() at least should always return positive integers. The third approach is to do no checking at all in some cases. Why would that help? You'd be surprised by how many failures and exceptions you get by actually using your API in unexpected ways or giving your program unexpected input. This kind of behaviour checking does not need any assertions.

Note: Really loved the APad/ iMiga that Dawid used to give his talk! Been such a long time since I last played with my own Amiga...

Second steps with git

2012-04-22 20:34
Leaving this here in case I'll search for it later again - and I'm pretty sure I will.

The following is a simplification of the git workflow detailed earlier - in particular the first two steps and a little background.

Instead of starting by cloning the upstream repository on github and than going from there as follows:

#clone the github repository
git clone

#add upstream to the local clone
git remote add upstream git://

you can also take a slightly different approach and start with an empty github repository to push your changes into instead:

#clone the upstream repository
git clone git://

#add upstream your personal - still empty - repo to the local clone
git remote add personal

#push your local modifications branch mods to your personal repo
git push personal mods

That should leave you with branch mods being visible in your personal repo now.

Note to self - Java heap analysis

2012-02-09 21:30
As I keep searching for those URLs over and over again linking them here. When running into JVM heap issues (an out of memory exception is a pretty sure sign, so can be the program getting slower and slower over time) there's a few things you can do for analysis:

Start with telling the effected JVM process to output some statistics on heap layout as well as thread state by sending it a SIGQUIT (if you want to use the number instead - it's 3 - avoid typing 9 instead ;) ).

More detailed insight is available via jConsole - remote setup can be a bit tricky but is well doable and worth the effort as it gives much more detail on what is running and how memory consumption really looks like.

For an detailed analysis take a heap dump with either jmap, jConsole or by starting the process with the JVM option -XX:+HeapDumpOnOutOfMemoryError. Look at it either with jhat or the IBM heap analyzer. Also netbeans offers nice support for searching for memory leaks.

On a more general note on diagnosing java stuff see Rainer Jung's presentation on troubleshooting Java applications as well as Attila Szegedi's presentation on JVM tuning.

Dorkbot Berlin

2012-01-30 23:18
c-base - 8p.m. on a Monday evening - the room is packed (and pretty cloudy as well): Time for Dorkbot, a short series of talks on "People doing strange things with electricity" hosted by Frank Rieger.

First talk up on stage was Gismo on Raumfahrtagentur - a Berlin maker-space located in Wedding. Originating from the presenter's interest in electrical bikes a group of ten people interested in hardware hacking got together. Projects include but are not limited to 3D printing, 3D scanning, textile hacking, a collaborative podcast. Essentially the idea is to provide room and infrastructure to be used collaboratively by a group of members. From an organisational point of view the group is incorporated as a GmbH - however none of the projects is mainly targeted to commercialization: It's main target group are hobbyists, researchers and open hardware/software people. If interested: Each Monday evening there is a "Sunday of the Kosmonauts" where externals are invited to come visit.

Second talk was on the project Drinkenlights (Klackerlaken) - a way for children to learn the basics of electronics without any soldering (hardware available for three Euros max). Experiences made with giving the ingredients for creating these toys to children of varying ages were interesting: From kids of about five years playing around up to ten/eleven year olds that when in school seemingly had to re-learn being creative without being given much direction or instruction on the task at hand.

In the third talk Martin Kaltenbrunner introduced his Tworse Key - a nice symbiosis of old technology (a morse key) and new media (Twitter). Essentially built on top of an Arduino Ethernet board it made it possible to turn morse messages into Tweets. Martin also gave a brief overview of related art projects and briefly touched upon the changes that open source and open hardware bring to art: There are projects that open all design and source code to the public to benefit from a wider distribution channel (without having to actually produce anything), working on designs in a collaborative way and get improvements back to the original project. All of these form a stark contrast to the existing idea of having one single author whose contribution is to build a physical object that is then presented in exhibitions - providing both, new possibilities and new challenges to artists.

In the last presentation Milosch introduced his new project ETIB whose goal it is to bring hardware hacking geeks together with textile geeks to work on integrating circuits into clothes.

If you are interested in hacking spaces in general and what is happening in that direction in Berlin, mark this Friday in your calendar: c-base will be hosting a Hackerspace meetup - so if you want to know how hackerspaces work or want to create one yourself, this event might be interesting to you.

One day later

2012-01-05 23:57

Fun little new toy

2012-01-03 23:48
Yesterday Thilo invited me to attend an "Electronics 101" workshop including an introduction to soldering that was scheduled to start at 7p.m. this evening at the offices of IN-Berlin e.V.. As part of my studies back in university I do have a little bit of background in Electronics, but never before had tried any serious soldering (apart from fixing one of our audio cables) so I thought, why not.

The workshop turned out to be a lot of fun: The organisers Mitch Altman and Jimmie Rodgers had brought several pre-packaged kits for people to work on. Quite a few of them based on Arduino so after putting them together you can actually continue having fun with writing little programs. After giving a brief but very well done, easy to understand introduction to digital electronics Mitch showed attendees how to use a soldering iron (make sure to check out his comic "soldering is easy" if you want to know more) and got everyone started. Both Jimmie and Mitch did a great job answering any upcoming questions, fixing issues and generally helping out with any problems. Even those that never used a soldering iron before quickly got up to speed and in the end went home with that nice experience of having built something that you cannot only program but can touch and hold in your hands.

I got myself a LoL shield (still to be done), and a Diavolino. Still missing is the FTDI TTL-232R cable for getting the device hooked up to our laptops and be able to re-program it (though most likely that will be easier to find than a >1G Ohm resistor Thilo is looking for to be able to calibrate his Geiger counter).

Results of my first session are below:

The board First pins attached Last pins attached

Also thanks to Sven Guckes organising and announcing this workshop on short notice. And thanks to Thilo for talking me into that.

Update: Images of the event are available online.


2011-12-30 02:07

Restate my assumptions.

One: Mathematics is the language of nature.

Two: Everything around us can be represented and understood through numbers.

Three: If you graph the numbers of any system, patterns emerge. Therefore, there are patterns everywhere in nature.

The above is a quote from today's "Hackers in movies" talk at 28c3 - which amongst others also showed a brief snippet of the movie Pi. For several years I stayed well away from that one famous Hackers' conference in Berlin that takes place annually between Christmas and New Year. 23C3 was the last congress I attended until today. Though there were several fantastic talks and mean presentation quality was pretty good the standard deviation of talk quality was just too high for my taste. In addition due to limited room sizes with 4 tracks there were quite a few space issues.

In recent years much of that has changed: The maximum number of tickets is strictly enforced, there is an additional lounge area in a large tent next to the entrance, for the sake of having larger rooms the number of tracks was reduced to three. Streaming works for the most part making it possible for those who did not get one of the 3000 full conference tickets to follow the program from their preferred hacker space. In addition fem does an amazing job of recording, mastering, encoding and pushing videos online: Hacker Jeopardy - a show that wasn't over until early Thursday morning (about 3a.m.?) - was up on Youtube at least on Thursday at 7a.m if not earlier.

Several nice geeks got me talked into joining the crowd briefly this evening for a the last three talks in "Saal 1" depicted above: You cannot be in Berlin during 28c3 and not see the so-called "fnord Jahresrückblick" by Fefe and Frank Rieger, creators of the Alternativlos podcast.

Overall it is amazing to watch BCC being invaded by a large group of hackers. It's fun to see quite a few of them on Alexanderplatz, watch people have fun with a TV B Gone in front of large electronics stores. It's great to get to watch highly technical but also several political talks 4 days in a row from 11 a.m. until at least 2p.m. the following day that are being given by people who are very passionate about what they do and the projects they spend their time on.

If you are into tinkering, hacking, trying out sorting algorithms and generally having fun with technology make sure you check out the 28c3 Youtube channel. If you want to learn more on Hacker culture, mark the days between Christmas and New Year next year and attend 29c3 - don't worry if you do not speak German - the majority of talks is in English, most of the ones that aren't are being translated on the fly by volunteers. If you are good at translations, feel free to volunteer yourself for that task. Speaking of volunteering: My respect to all angels (helping hands), heralds (those introducing speakers), noc (network operating center), poc (phone operating center), the organisation team and anyone who helps keep make that event as enjoyable to attendees as it is.

Update: Thank you to the geeks who after staying in our apartment for #28c3 helped get it back to a clean state - actually cleaner than it was before. You rock!

Apache Mahout Hackathon Berlin

2011-03-21 21:39
Last year Sebastian Schelter from Berlin was added to the list of committers for Apache Mahout. With two committers in town the idea was born to meet some day, work on Mahout. So why not just announce that meeting publicly and invite others who might be interested in learning more about the framework? I got in touch with c-base - a hacker space in Berlin well suited to host a Hackathon - and quickly got their ok for the event.

As a result the first Apache Mahout Hackathon took place at c-base in Berlin last weekend. We had about eight attendees - arriving at varying times: I guess 11a.m. simply is way too early to get up for your average software developer on a Saturday. I got a few people surprised by the venue - especially those who were attending a Hackathon for the very first time and had expected c-base to be some IT company ;)

We started the day with a brief collection of ideas that everyone wanted to work on: Some needed help to use Mahout - topics included:

  • How to use Apache Mahout collaborative filtering with complex models.
  • How to use Apache Mahout via a web application?
  • How to use classification (mostly focussed on using Naive Bayes from within web applications).
  • Is HBase a solution for scalable graph mining algorithms?
  • Is there a frequent itemset algorithm that respects temporal changes in patterns?

Those more into Mahout development proposed a slightly different set of topics:

  • PLSI and Map/Reduce?
  • Build customisable sampling strategies for distributed recommendations.
  • Come up with a more Java API friendly configuration scheme for Mahout clusterings.
  • Complete the distributed SVD recommender.

Quickly teams of two to three (and more) people formed. First several user side questions could be addressed by mixing more experienced Mahout developers with newbie users. Apart from Mahout specifics also more basic questions of getting involved even by simply contributing to the online documentation, answering questions on the mailing lists or just providing structured access to existing material that users generally have trouble finding.

Another topic that is being overlooked all too when asking users to contribute to the project is the process of creating, submitting, applying and reviewing patches itself: Being deeply involved with free software projects dealing with patches, integration of issue tracker and svn with the project mailing lists all seems very obvious. However even this seemingly basic setup sometimes looks confusing and complex to regular users - that is very common but not limited to people who are just starting to work as software developers.

Thanks to Thilo Fromm for taking the group picture.

In the evening people finally started hacking more sophisticated tasks - working on the first project patches. On Sunday only the really hard core developers remained - leading to a rather focussed work on Mahout improvements which in the end led to first patches sent in from the Mahout Hackathon.

Note to self: svn:ignore usage

2011-02-25 20:47
Putting the information here to make retrieving it a bit easier next time.

When working with svn and some random IDE I'd really love to avoid checking in any files that are IDE specific (project configuration, classpath, etc.). The command to do that:

svn propedit svn:ignore $directory_to_edit

After issuing this command you'll be prompted to enter file patterns for files to ignore or the directory names.

More detailed information in the official documentation on svn:ignore.

Devoxx – Day 2 HBase

2010-12-09 21:25
Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.


Dmitry Ryaboy from Twitter explained how to scale high load and large data systems using Cassandra. Looking at the sheer amount of tweets generated each day it becomes obvious that with a system like MySQL alone this site cannot be run.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them being rather straight forward, others more involved. At Twitter each Tweet is annotated by a user_id, a time stamp (ok if skewed by a few minutes) as well as a unique tweet_id. In order to come up with a solution for generating the latter one they built a library called snowflake. Though rather simple algorithm even works in a cross data-centre set-up: The first bits are composed of the current time stamp, the following bits encode the data-centre, after that there is room for a counter. The tweet_ids are globally ordered by time and distinct across data-centres without the need for global synchronisation.

With gizzard Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (to be introduced for caching tweet timelines due to its explicit support for lists as data structures for values that are not available in memcached).

FlockDB for large scale social graph storage and analysis. Rainbird for time series analysis, though with OpenTSDB there is something comparable available for HBase. Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited thus making it easy to provide pre-computed, optimised version in the back-end. As with the caching problem a tradeoff between hit rate on the pool of pre-computed items vs. storage cost can be made based on the observed query distribution.

In the back-end of Twitter various statistical and data mining analysis are run on top of Hadoop HBase To compute potentially interesting followers for users, to extract potentially interesting products etc.
The final take-home message here: Go from requirements to final solution. In the space of storage systems there is not such thing as a silver bullet. Instead you have to carefully evaluate features and properties of each solutions as your data and load increase.


When implementing Facebook Messaging (a new feature that was announced this week) Facebook decided to go for HBase instead of Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system. The decision was made to use HBase.

A team of 15 developers (including operations and frontend) was working on the system for one year before it was finally released. The feature supports for integration of facebook messaging, IM, SMS and mail into one single system making it possible to group all messages by conversation no matter which device was used to send the message originally. That way each user's inbox turns into a social inbox.


Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles of the Adobe Media Player. Users would be associated with a vector giving more information on what types of genre the meda they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view in order to increase consumption rate. Adobe built a clustering system that would interface Mahout's canopy- and k-means implementations with their HBase backend for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more on the usage of flash on the internet. Using Google to search for flash content was no good as only the first 2000 results could be viewed thus resulting in a highly skewed sample. Instead they used a mixture of nutch and HBase for storage to retrieve the content. Analysis was done with respect to various features of flash movies, such as frame rates. The analysis revealed a large gap between the perceived typical usage and the actual usage of flash on the internet.

The third use case involves analysis of images and usage patterns on the Photoshop-in-a-browser edition of The forth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform solution including analysis, campaigning and more. When purchased by Adobe the system was very successful business-wise. However the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance, faster report generation.