2010-11-22 23:17
I am currently on my way back from three weeks of conferences, sitting in the IC train from Schiphol. After lots of new input and many interesting projects I learned about, it is finally time to head home and put what I have learned to good use.


As seems normal with open source conferences, I got far more input on interesting projects than I can ever expect to apply in my daily work. Still, it is always inspiring to meet other developers in the same field – or in quite different fields – and learn what projects they are working on and how they solve various problems.

A huge Thank You goes to the DICODE EU research project for sponsoring the Apache Con and Devoxx trips, and another thanks to Sapo for inviting me to Lisbon and covering travel expenses. A special thank you to the assistant at neofonie who made the travel arrangements for Atlanta and Antwerp: everything worked without problems, right down to me having a power outlet in the train that is finally taking me back.

Apache Mahout 0.4 release

2010-11-03 15:21
Last Sunday the Apache Mahout project published the 0.4 release. Nearly every piece of the code has been refactored and improved since the 0.3 release. The release was timed to land just before Apache Con NA in Atlanta, so it was published on October 31st - the Halloween release, sort of.

Especially noteworthy are the following improvements:

  • Model refactoring and CLI changes to improve integration and consistency
  • Map/Reduce job to compute the pairwise similarities of the rows of a matrix using a customizable similarity measure
  • Map/Reduce job to compute the item-item-similarities for item-based collaborative filtering
  • More support for distributed operations on very large matrices
  • Easier access to Mahout operations via the command line
  • New vector encoding framework for high speed vectorization without a pre-built dictionary
  • Additional elements of supervised model evaluation framework
  • Promoted several pieces of old Colt framework to tested status (QR decomposition, in particular)
  • Can now save random forests and use them to classify new data

New features and algorithms include:

  • New ClusterEvaluator and CDbwClusterEvaluator offer new ways to evaluate clustering effectiveness
  • New Spectral Clustering and MinHash Clustering (still experimental)
  • New VectorModelClassifier allows any set of clusters to be used for classification
  • RecommenderJob has been evolved to a fully distributed item-based recommender
  • Distributed Lanczos SVD implementation
  • New HMM based sequence classification from GSoC (currently as sequential version only and still experimental)
  • Sequential logistic regression training framework
  • New SGD classifier
  • Experimental new type of NB classifier, and feature reduction options for existing one

There were many, many more small fixes, improvements, refactorings and cleanups. Go check out the new release, give the new features a try and report back to us on the user mailing list.

Scrumtisch November

2010-11-03 04:31
Title: Scrumtisch November
Location: Berlin
Link out: Click here
Description: The next Scrum meetup - Scrumtisch - is scheduled to take place at the end of November in Berlin-Friedrichshain, at La Vecchia Trattoria. See you there if you are interested in agile development. Please make sure to register with Marion first.
Start Time: 18:30
Date: 2010-11-29

Apache Mahout @ Devoxx Tools in Action Track

2010-11-01 09:32
This year's Devoxx will feature several presentations from the Apache Hadoop ecosystem, including Tom White on the basics of Hadoop (HDFS, MapReduce, Hive and Pig) as well as Michael Stack on HBase.

In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.

Please let me know if you are going to Devoxx - it would be great to meet some more Apache people there, and maybe have dinner on one of the conference days.

Apache Mahout @ Lisbon Codebits

2010-10-31 09:36
In the second week of November I'll spend a few days in Lisbon - I never would have thought that I'd return so quickly after visiting this beautiful city during my summer vacation. I'll be there for Codebits - thanks to Sapo for inviting me.

Back in summer I learned only after returning to Germany that someone from Portugal had been trying to meet with other Apache people exactly while I was down there. I contacted him, proposing an Apache Dinner to see how many other committers and friends could be reached. In addition, Filipe asked me whether I could imagine flying down to Sapo to give a talk on Mahout, as the devs there would be interested in it. Well, I told him that if I got travel support, I'd be happy to come. That ten-minute chat quickly turned into an invitation to a great conference in Lisbon. Looking forward to meeting you there. (And looking forward to weather that, compared to Germany, is way warmer and sunnier right now. :) )

First steps with git

2010-10-30 19:47
A few weeks ago I started using git not only for tracking changes in my own private repository but also for Mahout development and for reviewing patches. My setup is probably a bit unusual, so I thought I'd describe it first before diving deeper into the specific steps.

Workflow to implement

With my development I wanted to follow Mahout trunk very closely, integrating and merging upstream changes whenever I resumed working on the code. I wanted to be able to work from two client machines at two distinct physical locations. And I was fine with publishing any changes or intermediate progress online.

The tools used

I set up a clone of the official Mahout git repository on github, both as a place to check changes into and as a place to publish my own changes.

On each machine used, I cloned this github repository. After that I added the official Mahout git repository as upstream repository to be able to fetch and merge in any upstream changes.

Command set

After cloning the official Mahout repository into my own github account, the following commands set up the repository on a single client machine. See also the Github help on forking git repositories.

#clone the github fork (URL is a placeholder - substitute your github user)
git clone git@github.com:<your-user>/mahout.git

#add the official mirror as upstream to the local clone (URL illustrative)
git remote add upstream git://git.apache.org/mahout.git

One additional piece of configuration that helped make life easier was to set up a list of files and file patterns to be ignored by git.
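Such an ignore list lives in a .gitignore file at the repository root. A minimal sketch for a Maven plus IntelliJ/Eclipse based Java project like Mahout - the specific patterns below are my own illustrative choice, not from the original setup:

```shell
# Write a minimal .gitignore; patterns are typical for a Maven/IDE
# Java project and purely illustrative.
cat > .gitignore <<'EOF'
target/
*.iml
.idea/
.classpath
.project
.settings/
EOF
```

git then silently skips matching files in status output and refuses to add them accidentally.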

Each distinct changeset (be it a code review, code style changes or steps towards my own changes) would then get its own local branch. To share the branches with other developers, and to make them accessible from my second machine, I would use the following commands on the machine used for the initial development:

#create the branch
git branch MAHOUT-666

#publish the branch on github
git push origin MAHOUT-666

To get all changes both from my first machine and from upstream into the second machine all that was needed was:

#select correct local branch
git checkout trunk

#get and merge changes from upstream
git fetch upstream
git merge upstream/trunk

#get changes from github
git fetch origin
git merge origin/trunk

#get branch from above
git checkout -b MAHOUT-666 origin/MAHOUT-666

Of course, pushing changes into the Apache repository via git is not possible. So I would still end up creating a patch, submitting it to JIRA for review, and in the end applying and committing it via svn. As soon as these changes finally made it into the official trunk, all branches created earlier became obsolete.
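The patch step itself can be sketched as follows. To make the commands runnable end to end, the sketch builds a throwaway repository first; the issue number and file name are made up:

```shell
# Throwaway repository standing in for the Mahout clone (illustrative).
mkdir patch-demo && cd patch-demo && git init -q
git config user.email you@example.com && git config user.name you
git checkout -q -b trunk
echo 'original' > Foo.java && git add Foo.java && git commit -qm 'trunk state'

# Topic branch holding my change (MAHOUT-666 is an illustrative number).
git checkout -q -b MAHOUT-666
echo 'fixed' > Foo.java && git commit -qam 'my change'

# Everything the branch adds on top of trunk becomes the JIRA patch.
git diff trunk > MAHOUT-666.patch
```

The resulting file is a plain unified diff that can be attached to the JIRA issue and later applied to an svn checkout.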

What still makes me stick with git, especially for reviewing patches and working on multiple changesets, is its capability to create branches quickly and completely locally. This feature totally changed my established workflow for keeping changesets separate:

With svn I would create a separate checkout of the original repository from the remote server, make my changes or just apply a patch for review. To speed things up, or to be able to work offline, I would keep one svn checkout clean, copy it to a different location and apply the patch only there.

In combination with an IDE this workflow meant re-importing each checkout as a separate project. Even though both Idea and Eclipse are reasonably fast at importing and setting up projects, it would still cost some time.

With git all I do is one clone. After that I can create branches locally without contacting the server again. I usually keep trunk clean of any local changes - patches are applied to separate branches for review, and the same goes for my own code modifications. That way all work can happen while disconnected from the version control server.
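That review flow looks roughly like this. Again the sketch builds its own throwaway repository so it can be followed end to end; branch name, issue number and file contents are all made up:

```shell
# Throwaway repository standing in for the Mahout clone (illustrative).
mkdir review-demo && cd review-demo && git init -q
git config user.email you@example.com && git config user.name you
git checkout -q -b trunk
echo 'original' > Foo.java && git add Foo.java && git commit -qm 'trunk state'

# Fake the patch that would normally be downloaded from JIRA.
echo 'fixed' > Foo.java && git diff > MAHOUT-123.patch
git checkout -q -- Foo.java            # trunk working tree is clean again

# One local branch per patch under review; trunk itself stays untouched.
git checkout -q -b review-MAHOUT-123
git apply MAHOUT-123.patch
```

Switching between reviews is then just a `git checkout` away, with no second checkout or project re-import needed.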

When combined with IntelliJ Idea the fun is even greater: the IDE regularly scans the filesystem for updated files, so after each git checkout it automatically adjusts to the changed source code - avoiding project re-creation. The same is of course possible with Eclipse - it just involves one additional click on the Refresh button.

For me git sped up my work processes and supported use cases that would otherwise have involved sending patches back and forth between separate mailboxes. Working with patches and changesets feels far more natural when it is supported by the version control system itself. In addition, it is of course a great relief to be able to check in, diff, log, checkout etc. even when disconnected from the network - which for me is still one of the biggest advantages of any distributed version control system.

Lance Norskog recently pointed out one more step that is helpful:
You didn't mention how to purge your project branch out of the github fork. From Deleting a remote branch or tag

This command is a bit arcane at first glance… git push REMOTENAME :BRANCHNAME. If you look at the advanced push syntax above it should make a bit more sense. You are literally telling git “push nothing into BRANCHNAME on REMOTENAME”. And, you also have to delete the branch locally also.
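In concrete commands this looks as follows; the sketch uses a local bare repository as a stand-in for the github fork, and all names are illustrative:

```shell
# Local bare repository playing the role of the github fork (illustrative).
mkdir remote-demo && cd remote-demo
git init -q --bare origin.git
git clone -q origin.git work 2>/dev/null && cd work
git config user.email you@example.com && git config user.name you
git checkout -q -b trunk
echo 'a' > f.txt && git add f.txt && git commit -qm 'init'
git push -q origin trunk

# A published topic branch, as in the workflow above.
git branch MAHOUT-666 && git push -q origin MAHOUT-666

# Once the change is in trunk: "push nothing into MAHOUT-666 on origin",
# then delete the local copy as well.
git push -q origin :MAHOUT-666
git branch -d MAHOUT-666
```

After this, `git ls-remote --heads origin` no longer lists the branch on the remote.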

Scientific debugging

2010-10-28 20:09
Quite some years ago I read Why Programs Fail - A Systematic Guide to Debugging, a book on the art of debugging written by Andreas Zeller (professor at Saarland University in Saarbrücken, researcher working on mining software archives, automated debugging, mutation testing and mining models, and one of the authors of the famous Data Display Debugger).

One aspect I found particularly intriguing about the book was something I had until then known only from Zen and the Art of Motorcycle Maintenance: scientific debugging - finding bugs in a piece of software in a structured way, using the scientific method.

When working with other software developers, one theme that always struck me as very strange is how some of them - especially junior developers - debug programs: when faced with a problem, they come up with some hypothesis as to what the cause could be and, without verifying that hypothesis, jump directly to implementing a fix for the perceived problem. Not too seldom this led either to half-baked solutions somewhat similar to the Tweaking the Code behaviour on "The Daily WTF", or - even worse - to no solution at all after days of implementation time were gone.

For the rest of this posting I'll make a distinction between the terms

  • defect for a programming error that might cause unintended behaviour.
  • failure for an observed mis-behaviour of a program. Failures are caused by specific defects.

For all but trivial bugs (for your favourite, ever-changing and vague definition of "trivial") I try to apply a more structured approach to identifying the causes of failures:

  1. Identify a failure. Reproduce the failure locally whenever possible. To get the most out of the process, write an automated test that reproduces the failure.
  2. Form a hypothesis. Most developers do this sub-consciously anyway; this time we explicitly write down what we think the underlying defect is.
  3. Devise an experiment. The goal is to show experimentally that your hunch about the defect is correct. This may involve debug statements, more extensive logging, or your favourite debugger with a breakpoint set exactly where you think the defect lies. The experiment could also involve wireshark or tcpdump if you are debugging even a simple distributed system. For extremely trivial defects, on the other hand, the experiment may just be to fix the defect and watch the test run through successfully.
  4. Observe results.
  5. Reach a conclusion. Interpret the results of your experiment. If they reject your hypothesis, move on to the next potential cause of the failure. If they don't, either devise further experiments in support of your hypothesis (if the last one didn't convince you, or your boss) or fix the defect just found. In most cases you should tidy up any remains of your experiment before moving on.

Taking a closer look at the steps involved, the method is actually pretty straightforward - which makes it easy to use while still being very effective. Combined with automated testing, it even helps when squashing bugs in code that was not written by the person fixing it. One way to make the strategy even stronger is to keep a written protocol of the debugging session with pen and paper. Not only does this help avoid checking the same hypothesis over and over again, it is also a way to quickly note down all hypotheses: while debugging, the person doing the analysis may think of new possible causes faster than they can actually be checked. Noting them down helps keep one's mind free and open, as well as remembering all the options.

CfP: Data Analysis Dev Room at Fosdem 2011

2010-10-27 06:56
Call for Presentations: Data Analysis Dev Room, FOSDEM
5 February 2011
1pm to 7pm
Brussels, Belgium

This is to announce the Data Analysis DevRoom, co-located with FOSDEM: the first meetup on analysing and learning from data, taking place in Brussels, Belgium.

Important Dates (all dates in GMT +2):

  • Submission deadline: 2010-12-17
  • Notification of accepted speakers: 2010-12-20
  • Publication of final schedule: 2011-01-10
  • Meetup: 2011-02-05

Data analysis is an increasingly popular topic in the hacker community. This trend is illustrated by declarations such as:

"I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it."

-- Hal Varian, Google’s chief economist

The event will comprise presentations on scalable data processing. We invite you to submit talks on the following topics:

  • Information retrieval / Search
  • Large Scale data processing
  • Machine Learning
  • Text Mining
  • Computer vision
  • Linked Open Data
  • Related open source / data projects (sample, not exhaustive), including MapReduce, Pig, Hive, ...

Closely related topics not explicitly listed above are welcome.

High quality, technical submissions are called for, ranging from principles to practice.

We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Submissions should be based on free software solutions.

Proposals should be submitted no later than 2010-12-17. Acceptance notifications will be sent out on 2010-12-20.

Please include your name, bio and email, the title of the talk, and a brief abstract in English. Please indicate the level of experience with the topic your audience should have (e.g. whether your talk is suitable for newbies or targeted at experienced users).

The presentation format is short: 30 minutes including questions. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Note: "DataDevRoom sponsors" will not be endorsed as "FOSDEM sponsors" and hence not listed in the sponsors section on the website.

Follow @DataDevRoom on twitter for updates. News on the conference will be published on our website.

Program Chairs:

  • Olivier Grisel - @ogrisel
  • Isabel Drost - @MaineC
  • Nicolas Maillot - @nmaillot

Please re-distribute this CFP to people who might be interested.

Video: Max Heimel on sequence tagging w/ Apache Mahout

2010-10-26 19:58
Some time ago Max Heimel from TU Berlin gave a presentation on the new HMM support in the Mahout 0.4 release at the Apache Hadoop Get Together in Berlin:

Mahout Max Heimel from Isabel Drost on Vimeo.

Thanks to JTeam for sponsoring video taping, thanks to newthinking for providing the location and thanks to Martin Schmidt from newthinking for producing the video.

Machine Learning Gossip Meeting Berlin

2010-10-25 18:51
This evening the first Machine Learning Gossip meeting is scheduled to take place at 9 p.m. at Victoriabar: professionals from research working on advancing machine learning algorithms and from industry projects putting machine learning to practical use will meet for some drinks, food and hopefully lots of interesting discussions.

If successful, the meeting is supposed to take place on a regular schedule. Ask Michael Brückner for the date and location of the next meetup.