Apache Hadoop at FOSDEM 2010

2009-12-11 09:19
Though the official schedule is not yet online: I will be giving an introductory talk about Apache Hadoop at next year's FOSDEM (Free and Open source Software Developers' European Meeting) in Brussels. This will be the 10th birthday of the event - I am looking forward to a fun weekend, meeting free and open source software developers from all over Europe.

If you are an Apache Hadoop developer and would like me to include a particular topic in the talk - please feel free to contact me. If you are an Apache Hadoop user and would like to learn more about the project, please come to the talk and ask questions. And if you are an Apache Hadoop newbie - feel free to join us as well.

In addition there will be a NoSQL Dev Room at FOSDEM. The call for presentations is already up. So if you are doing fun stuff with CouchDB, HBase and friends, or are a developer of one of these projects - submit a talk and join us in early February in Brussels.

Open Source Expo 09

2009-11-16 22:17
I spent last Sunday and the following Monday at the Open Source Expo Karlsruhe - co-located with web-tech and php-conference, organized by the Software & Support Verlag. Together with Simon Willnauer I ran the Lucene/Mahout booth at the expo.

So far the conference is still very small (about 400 visitors) compared to free software community events. However, the focus was set more on professional users, and accordingly several projects showed that free software can be used successfully for various business use cases. Visitors were invited to ask Sun about their free software strategy; questions concerning OpenJDK or MySQL were not uncommon. Large distributors like SuSE or Mandriva were present as well, but also smaller companies, e.g. ones providing support for Apache OFBiz.

The Apache Lucene project was invited as an exhibitor as well. Together with PRC and ConCom we organized an Apache banner, and Lucid Imagination sponsored several Lucene t-shirts to be distributed at the conference. At the very last minute we put together flyers with information on the projects (abstract, links to project pages and mailing lists, current users).

We arrived on Saturday, late in the evening, and went with a friend of mine for some Indian food at a really good restaurant close to the hotel. Big thanks to her for being our tourist guide - hope to see you back in Waldheim in December ;)

Sunday was pretty quiet - only a few guests arrived on the weekend. I was invited by David Zuelke to give a brief introduction to Mahout during his Hadoop MapReduce tutorial workshop - thanks, David. Though lunch was already being served, people stayed to hear my presentation on large scale machine learning with Mahout. Afterwards I was contacted by one of Katharina Morik's students, who was pretty interested in the project. Back at her research group people are working on RapidMiner - a tool for easy machine learning that comes with a graphical user interface making it simple to explore various algorithm configurations and data workflow setups. It would be interesting to see how this tool helps people to understand machine learning, and to learn what form of contribution might be interesting and appropriate for research groups joining Mahout - maybe not code-wise, but in terms of discussions and background knowledge.

Monday was a bit busier, with more people attending the conferences. Simon got a slot to present Lucene at the Open Stage track and showed off the new features of Lucene 2.9. Those already using Lucene could be tricked into telling their Lucene success story at the beginning of the talk. At the booth we had a wide variety of people: from students trying to find a crawling and indexing system for their information retrieval course homework, up to professionals with various questions on the Apache Lucene project. The experience of people at the conference varied widely, which proved to be a pretty good reality check: being part of the Lucene and ASF community, one might be tempted to think that not knowing about Lucene is almost impossible. Well, it seems to be less impossible than at least I expected.

One last success: as the picture shows, YaCy is now powered by Lucene as well - at least in terms of t-shirts ;)

Open Source Expo

2009-10-29 07:38
Title: Open Source Expo
Location: Karlsruhe
Description: There will be a booth at the Open Source Expo introducing interested visitors to the Apache projects Lucene and Mahout. Of course we are also happy to answer any questions on the ASF in general.
Start Date: 2009-11-15
End Date: 2009-11-16

First NoSQL Meetup in Germany

2009-09-09 18:58
On October 22nd 2009 the first NoSQL Meetup Germany is going to take place in the newthinking store in Berlin: http://nosqlberlin.de

Please submit your presentation proposals by September 22nd; accepted speakers will be notified soon after.

If you would like to sponsor the event, feel free to contact us - sponsoring would allow us to provide videos after the event and free drinks for everyone during it.

Hope to see you soon in Berlin.

September Apache Hadoop Get Together @ Berlin

2009-08-23 20:48
The upcoming Apache Hadoop Get Together Berlin is to take place on September 29th in the newthinking store. Details are up on the event page at Upcoming and will be sent out to the mailing list soon.

AMQP Erlang user group talk

2009-07-10 15:56
Last Wednesday at the Erlang User Group Berlin, Matthias Radestock from the RabbitMQ project gave a talk on RabbitMQ, AMQP and messaging in general. Slides are available online.

First Matthias motivated the need for an open standard for messaging: so far there are a few providers of middleware systems like Tibco and IBM, but those solutions are usually closed, expensive and cumbersome to handle. In short, they do not fit into a world where people rely on open standards for communication, free software for development and lightweight implementations.

AMQP aims to provide an open standard for messaging - that is, decoupled communication between processes that may reside on separate boxes or in different datacenters. There are a few providers of AMQP implementations: some examples are iMatix (focused on low latency communication), Apache Qpid together with the corresponding project inside Red Hat, and RabbitMQ.

RabbitMQ is implemented in Erlang (after all, the talk was hosted by the Erlang User Group Berlin ;) ). With about 7000 lines of code the code base is rather compact. The goal was not to build a super-fast implementation, but one that is scalable and highly available.
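
To give an impression of what talking to an AMQP broker like RabbitMQ looks like in practice, here is a minimal Python sketch using the pika client (the queue name and the broker on localhost are illustrative assumptions, not details from the talk; the calls follow the pika 1.x API):

    import pika

    # Connect to a RabbitMQ broker - assumed to be running on localhost.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # Declare a queue and publish a message to it via the default exchange.
    channel.queue_declare(queue="hello")
    channel.basic_publish(exchange="", routing_key="hello", body="Hello, AMQP!")

    # Fetch the message back - producer and consumer could just as well
    # live in different processes, boxes or datacenters.
    method, properties, body = channel.basic_get(queue="hello", auto_ack=True)
    print(body)  # b'Hello, AMQP!'

    connection.close()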

So far there is no facility for building reliable cross datacenter communication built into RabbitMQ. Yet, there are several projects available that aim at providing just that.

Lucene slides online

2009-06-30 10:04
The slides of the Lucene talk at the last Apache Hadoop Get Together Berlin are available online: Lucene Slides. Especially interesting to me are the last few slides, which detail both index size and machine setup:

The installation is running on two standard PCs with two dual-core processors each (usual speed, bought in January 2008 for about 4000 Euro). They have 32 GB RAM, 24 GB of which are used as a ramdisk for the index. Without the ramdisk initial queries, especially those accessing fields, are slower but still acceptable. The index contains about 19 million documents - that is 80 GB of indexed text plus billions of annotated tags.

Data serialization

2009-06-26 08:39
XML, JSON and others are currently the standard data exchange formats. Their main benefit is being human-readable while still structured enough to be easily parsable by programs. Their problems are overhead in size and parsing time. In addition, at least XML is not really as human-readable as it could be.
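
To make the size overhead concrete, here is a tiny Python comparison; the record and its binary layout are made up purely for illustration:

    import json
    import struct

    # The same record, once as JSON and once in a fixed binary layout.
    record = {"id": 42, "temperature": 21.5, "valid": True}

    as_json = json.dumps(record).encode("utf-8")
    # "<id?": little-endian 4-byte int, 8-byte double, 1-byte bool
    as_binary = struct.pack("<id?", record["id"], record["temperature"], record["valid"])

    print(len(as_json))    # ~46 bytes - field names are repeated in every record
    print(len(as_binary))  # 13 bytes - compact, but neither self-describing nor flexible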

An alternative is binary formats. Yet those often are not platform independent (offering either C++ or Java or Python bindings only) or are not upgradable (what if your boss comes along and wants you to add yet another field - do you need to reprocess all your data?).

There are a few libraries that promise to solve at least some of these problems. Usually you specify your data format in an IDL, generate (byte-)code from it, and use mechanisms provided by the library to upgrade your format later on.
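
The upgrade mechanisms usually boil down to tagging each field on the wire, so that a reader can simply skip fields it does not know yet. The following Python sketch illustrates the idea with a deliberately naive wire format (real libraries such as Thrift and Protocol Buffers use far more compact encodings):

    import struct

    def encode(fields):
        """fields: dict mapping integer tag -> bytes payload."""
        out = b""
        for tag, payload in fields.items():
            # Each field is written as (tag, length, payload bytes).
            out += struct.pack("<HI", tag, len(payload)) + payload
        return out

    def decode(data, known_tags):
        """Return only the fields this (possibly older) reader knows about."""
        result, offset = {}, 0
        while offset < len(data):
            tag, length = struct.unpack_from("<HI", data, offset)
            offset += struct.calcsize("<HI")
            payload = data[offset:offset + length]
            offset += length
            if tag in known_tags:  # unknown tags are skipped, not an error
                result[tag] = payload
        return result

    # A newer writer adds field 3; an older reader knowing only tags 1 and 2
    # still decodes the message - no reprocessing of existing data required.
    message = encode({1: b"alice", 2: b"engineer", 3: b"berlin"})
    print(decode(message, known_tags={1, 2}))  # {1: b'alice', 2: b'engineer'}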

Yesterday at the Berlin Apache Hadoop Get Together, Torsten Curdt gave a short introduction to two of these solutions: Thrift and Protocol Buffers. He explained why Joost decided to use one of these libraries and why they went with Thrift instead of Protocol Buffers.

This morning I have gathered a list of data exchange libs that are currently available:

  • Thrift ... developed at Facebook, now in the Apache Incubator, active community, bindings for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
  • ProtoBuf ... developed at Google, mainly one developer only, bindings for C++, Java and Python.
  • Avro ... started by Doug Cutting, skips code generation - schemas are applied at runtime (see the sketch below this list).
  • ETCH ... developed at Cisco, now in the Apache Incubator, bindings for Java, C#, JavaScript.
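
Since Avro applies schemas at runtime, using it from Python needs no generated classes at all - you hand the library a JSON schema and start writing records. A rough sketch based on the avro package's Python binding (treat the exact call names as an assumption on my part):

    import io
    import json

    import avro.schema
    from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

    # The schema is plain JSON, parsed at runtime - no code generation step.
    schema = avro.schema.parse(json.dumps({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "age", "type": "int"},
        ],
    }))

    # Serialize a record to an in-memory buffer...
    buf = io.BytesIO()
    DatumWriter(schema).write({"name": "alice", "age": 30}, BinaryEncoder(buf))

    # ...and read it back using the same schema.
    buf.seek(0)
    print(DatumReader(schema).read(BinaryDecoder(buf)))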

There are some performance benchmarks online, as well as another recent, extensive comparison of the serialization performance of the various frameworks.

June 2009 Apache Hadoop Get Together @ Berlin

2009-06-21 21:33
Just a brief reminder: next week on Thursday the next Apache Hadoop Get Together is scheduled to take place in Berlin. There are quite a few interesting talks on the agenda:

  • Torsten Curdt: "Data Legacy - the challenges of an evolving data warehouse"
  • Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI): "SCAIView - Lucene for Life Science Knowledge Discovery"
  • Uri Boness from JTeam in Amsterdam: "Solr - From Theory to Practice"

See http://upcoming.yahoo.com/event/2488959/ for more information.

For those interested in NoSQL meetups, the discussion over at the NoSQL mailing list might be worth a look: http://blog.oskarsson.nu/2009/06/nosql-debrief.html

Scrum Table Berlin

2009-06-21 21:26
Last week I attended the Scrum Table Berlin. This time around Philippe gave a presentation on "backlog colours", that is, the types of work items tracked in the backlog.

The easiest type to track are features - items that generate revenue and are on the wish list of the customer. The second type of items he sees are infrastructure items - things needed to implement several features but invisible to the customer. The third type are bugs; basically these diminish the value of features one had already classified as done earlier in the process. The fourth and last type are technical debt items - shortcuts taken or bad design choices (either knowingly, as an intentional decision made to meet some deadline, or unintentionally due to lack of experience).

A very simple classification could be the following matrix:

Name            Visibility  Value     Cost
Feature         Visible     Positive  Positive
Infrastructure  Invisible   Positive  Positive
Bug             Visible     Negative  Positive
Technical Debt  Invisible   Negative  Positive

All four types of items exist in the real world. The interesting part is making these visible, assigning costs to each of them and scheduling these items in the regular sprint intervals.

The full presentation can be downloaded: http://scrumorlando09.pbworks.com/f/kruchten_backlog_colours.pdf