Data serialization

2009-06-26 08:39
XML, JSON and others are the current standard data exchange formats. Their main benefit is being human-readable while still structured enough to be easily parsed by programs. Their drawbacks are overhead in size and parsing time. In addition, at least XML is not really as human-readable as it could be.

Binary formats are an alternative. Yet those are often not platform independent (bindings for either C++ or Java or Python, but not all of them) or not upgradable (what if your boss comes along and wants you to add yet another field? Do you have to reprocess all your existing data?).
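To make the trade-off concrete, here is a small sketch using nothing but the Python standard library; the record and the fixed byte layout are made up for illustration. The JSON form stays readable and self-describing, while the hand-packed binary form is smaller but only works as long as every reader knows the exact layout, and adding a field breaks it:

  import json
  import struct

  # a made-up record; any structured data would do
  record = {"id": 42, "name": "Ada", "score": 3.14}

  # human-readable and self-describing, but comparatively large
  as_json = json.dumps(record).encode("utf-8")

  # compact fixed layout: 4-byte int, 16-byte name, 8-byte double;
  # adding another field means changing the format string and touching
  # every reader as well as all previously written data
  as_binary = struct.pack("<i16sd", record["id"],
                          record["name"].encode("utf-8"), record["score"])

  print(len(as_json), len(as_binary))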

There are a few libraries that promise to solve at least some of these problems. Usually you specify your data format in an IDL, generate (byte)code from it and rely on mechanisms provided by the library to upgrade your format later on.
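As a rough sketch of how that looks in practice, here is what working with Protocol Buffers feels like from Python. The .proto schema and the generated module name person_pb2 are hypothetical; SerializeToString and ParseFromString are the methods the real generated classes provide:

  # person.proto (hypothetical schema):
  #     message Person {
  #         required string name = 1;
  #         optional int32 id = 2;   // new optional fields can be added later
  #     }
  # compiled with: protoc --python_out=. person.proto

  import person_pb2  # hypothetical generated module

  p = person_pb2.Person()
  p.name = "Ada"
  p.id = 42

  blob = p.SerializeToString()   # compact, language-neutral binary

  q = person_pb2.Person()
  q.ParseFromString(blob)        # readers in C++ or Java parse the same bytes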

Yesterday at the Berlin Apache Hadoop Get Together, Torsten Curdt gave a short introduction to two of these solutions: Thrift and Protocol Buffers. He explained why Joost decided to use one of these libraries and why they went with Thrift instead of Protocol Buffers.

This morning I gathered a list of the data exchange libraries that are currently available:

  • Thrift ... developed at Facebook, now in the Apache Incubator, active community, bindings for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
  • ProtoBuf ... developed at Google, maintained mainly by a single developer, bindings for C++, Java and Python.
  • Avro ... started by Doug Cutting, skips code generation.
  • ETCH ... developed at Cisco, now in the Apache Incubator, bindings for Java, C#, JavaScript.

There are some performance benchmarks online, as well as another recent, extensive comparison of the serialization performance of various frameworks.
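The numbers in such benchmarks depend heavily on the data and the bindings used, so take them with a grain of salt. To get a rough feeling on your own machine you can start by timing the serializers that ship with the Python standard library (json and pickle, which need no extra dependencies); the record used here is made up:

  import json
  import pickle
  import timeit

  record = {"id": 42, "name": "Ada", "tags": ["a", "b", "c"], "score": 3.14}

  def via_json():
      # full round trip: serialize and parse again
      json.loads(json.dumps(record))

  def via_pickle():
      pickle.loads(pickle.dumps(record))

  for name, fn in (("json", via_json), ("pickle", via_pickle)):
      print(name, timeit.timeit(fn, number=100000))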

Large Scalability - Papers and implementations

2009-06-23 12:08
In recent years the Googles and Amazons of this world have released papers on how to scale computing and processing to terabytes of data. These publications have led to the implementation of various open source projects that benefit from that knowledge. However, mapping the various open source projects to the original papers and identifying the tasks these projects solve is not always easy.

With no guarantee of completeness, this list provides a short mapping from open source projects to the corresponding publications.

There are further overviews available online as well as a set of slides from the NOSQL debrief.

  • Map Reduce: Hadoop Core Map Reduce. Distributed programming (a toy sketch of the model follows below the list); see also: Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips.
  • GFS: HDFS (Hadoop Distributed File System). Distributed file system for unstructured data.
  • Bigtable: HBase, Hypertable. Distributed storage for structured data; see also: When to use HBase.
  • Chubby: ZooKeeper. Distributed lock and naming service.
  • Sawzall: Pig, Cascading, JAQL, Hive. Higher level languages for writing map reduce jobs.
  • Protocol Buffers: Protocol Buffers, Thrift, Avro; more traditional: Hessian, Java serialization. Data serialization; see also: early benchmarks.
  • Some NoSQL storage solutions: CouchDB, MongoDB. CouchDB is a document database.
  • Dynamo: Dynomite, Voldemort, Cassandra. Distributed key-value stores.
  • Index: Lucene. Search index.
  • Index distribution: katta, Solr, nutch. Distributed Lucene indexes.
  • Crawling: nutch, Heritrix, droids, Grub, Aperture. Crawling linked pages.
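To make the Map Reduce entry above a little more concrete, here is a toy sketch of the programming model in plain Python. This is not Hadoop code; it only shows the shape of a job: a map step emitting (key, value) pairs, a shuffle that groups by key, and a reduce step folding each group. The word-count example and all names in it are made up for illustration.

  from collections import defaultdict

  documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

  def map_step(doc):
      # emit a (word, 1) pair for every word in the document
      for word in doc.split():
          yield word, 1

  def reduce_step(word, counts):
      # fold all counts for one key into a single result
      return word, sum(counts)

  # shuffle: group the intermediate pairs by key
  groups = defaultdict(list)
  for doc in documents:
      for word, count in map_step(doc):
          groups[word].append(count)

  results = [reduce_step(word, counts) for word, counts in groups.items()]
  print(sorted(results))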



June 2009 Apache Hadoop Get Together @ Berlin

2009-06-21 21:33
Just a brief reminder: next Thursday the next Apache Hadoop Get Together takes place in Berlin. Quite a few interesting talks are scheduled:

  • Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
  • Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI): "SCAIView - Lucene for Life Science Knowledge Discovery".
  • Uri Boness from JTeam in Amsterdam: Solr - From Theory to Practice.


See http://upcoming.yahoo.com/event/2488959/ for more information.

For those interested in NOSQL meetups, the discussion over at the NOSQL mailing list might be worth a look: http://blog.oskarsson.nu/2009/06/nosql-debrief.html

Scrum Table Berlin

2009-06-21 21:26
Last week I attended the Scrum Table Berlin. This time around Philippe gave a presentation on "backlog colours", that is, the types of work items tracked in the backlog.

The easiest type to track are features, that is, items that generate revenue and are on the customer's wishlist. The second type he sees are infrastructure items: things needed to implement several features but invisible to the customer. The third type are bugs, which diminish the value of features that had already been classified as done earlier in the process. The fourth and last type are technical debt items: shortcuts taken or bad design choices, made either knowingly as an intentional decision to meet some deadline or unintentionally due to lack of experience.

A very simple classification could be the following matrix:

Name            Visibility   Value      Cost
Feature         Visible      Positive   Positive
Infrastructure  Invisible    Positive   Positive
Bug             Visible      Negative   Positive
Technical Debt  Invisible    Negative   Positive


All four types of items exist in the real world. The interesting part is making them visible, assigning a cost to each of them and scheduling them in the regular sprint intervals.

The full presentation can be downloaded: http://scrumorlando09.pbworks.com/f/kruchten_backlog_colours.pdf

Open Street Map @ FSFE meetup

2009-06-21 20:59
At the last meeting of the local FSFE group here in Berlin, Sabine Stengel from cartogis gave a presentation on Open Street Map. Instead of focussing on the technical side, she described the legal issues and showed the broad variety of commercial projects that are possible with this type of mapping information.

It was interesting to learn how detailed and high quality the information provided by volunteers really is. I think it will be interesting to see how the project keeps traction once "everything is mapped": how it remains attractive to stay involved and to keep the information up to date over a longer period of time.

Keeping changesets small

2009-06-21 20:48
One trick of successful and efficient software development is tracking changes to the sources in a source code management system, be it a centralized one like svn or perforce or a decentralized one like git or mercurial. I started working with svn during my Diploma thesis project in 2003 and continued to use it as a researcher at HU Berlin. Today I am using svn at work as well as for Apache projects, and have come to like git for personal sandboxes.

One thing that has bothered me for the last few months is the question of how to keep changesets small enough to be easy to code-review, but still complete in the sense that they contain everything needed to implement at least part of a feature fully.

So what makes a clean changeset for me? It contains at least one unit test to show that the implementation works. It contains all source code needed to make that test pass without breaking anything else in the source tree. Ideally it contains every change needed to implement one specific feature. The second sort of changeset that comes to mind may be rather large and contains all changes that are part of refactoring the existing code.

There are a few bad practices that in my experience lead to large unwieldy changesets:

  • Making two or more changes in one checkout. Usually this happens whenever you check out your code and start working on a cool new feature, but get distracted by some other incoming feature request, by a bugfix or by mixing your changes with a patch from another developer you are about to review. Mixing changes makes it extremely difficult to keep track of which change belongs to which task. Usually the result is forgetting to check in some of your changes and breaking the build.
  • Refactoring while working on a feature. Imagine the following situation: you are happily working along implementing your new feature. But suddenly you realize that some of your code should be refactored to better match your needs. And whoops: suddenly it is no longer clear whether changes were made as part of the refactoring steps (which might even be automated) or for your new feature. I try to do refactorings either before my changes, in a new checkout, or after I have finished changing the code (if that is feasible).
  • Defining the feature too broadly. I tend to get large changesets whenever I try to do too much in one go, that is, the "feature" I am trying to implement is simply too large. Usually it is possible to break the task up into a set of smaller tasks that are easier to manage.


If you are using git, there is a nice way to avoid a second checkout of the project: you can simply "stash" away your changes up to the current point in time, do whatever is needed for the task that distracted you, check that in, and then re-apply the stashed changes for your previous task, as sketched below.
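For illustration, that sequence looks roughly like this with standard git commands (the commit message is of course made up):

  git stash                          # park the unfinished feature work
  # ... take care of whatever distracted you, edit files ...
  git add -p                         # stage only the hunks belonging to that fix
  git commit -m "Fix bug in parser"  # hypothetical commit message
  git stash pop                      # bring the parked feature changes back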

This list will be updated as I learn about (that is "make") more mistakes that can be cleanly classified into new categories.

Scrum Tisch

2009-06-04 11:27
Title: Scrum Tisch
Location: Divino FHain
Description: Philippe will present his talk from the Orlando Scrum Gathering, where he speaks about backlog and time-box, about value versus cost, about visible features versus invisible features (and in particular software architecture), about defects and technical debt, and more generally about release planning and sprint planning for non-trivial and long-lived software development projects.
Start Time: 18:00
Date: 2009-06-16