gnuplot tutorial link

2009-09-02 12:30
Since I keep catching myself searching for the gnuplot tutorial over and over again, the direct link is stored here to save future searching:

Converting a git repo to svn

2009-08-17 10:15
Unlikely though it may seem, there are cases when one might want to convert a git repo to svn and still keep all revisions intact. The Google Open Source blog has a nice explanation of how to do that.

AMQP Erlang user group talk

2009-07-10 15:56
Last Wednesday at the Erlang user group Berlin, Matthias Radestock from the RabbitMQ project gave a talk on RabbitMQ, AMQP and messaging in general. Slides are available online.

First Matthias motivated the need for an open standard for messaging: So far, there are only a few providers of middleware systems, like Tibco and IBM. But those solutions are usually closed, expensive and cumbersome to handle. In short, they do not fit into a world where people rely on open standards for communication, free software for development and lightweight implementations.

AMQP aims to provide an open standard for messaging - that is, decoupled communication between processes that may reside on separate boxes or in different datacenters. There are a few providers of AMQP implementations. Some examples are iMatix, focussed on low latency communication, Apache Qpid and the corresponding project inside of RedHat, and RabbitMQ.

RabbitMQ is implemented in Erlang (after all, the talk was hosted by the Erlang User Group Berlin ;) ). With about 7000 lines of code the code base is rather compact. The goal was not to build a super-fast implementation, but one that is scalable and highly available.

So far there is no facility for building reliable cross datacenter communication built into RabbitMQ. Yet, there are several projects available that aim at providing just that.

Solr at AOL

2009-07-02 13:06
Grant Ingersoll has posted a very interesting interview with Ian Holsman on Solr at Relegence, now AOL. It describes the business side of the decision to switch to an open source solution, provides some insight into the size of the installation and details the technological reasons that drove the decision to switch from a proprietary implementation to Solr:

Large Scalability - Papers and implementations

2009-06-23 12:08
In recent years the Googles and Amazons of this world have released papers on how to scale computing and processing to terabytes of data. These publications have led to the implementation of various open source projects that benefit from that knowledge. However, mapping the various open source projects to the original papers and identifying the tasks these projects solve is not always easy.

With no guarantee of completeness, this list provides a short mapping from open source project to publication.

There are further overviews available online as well as a set of slides from the NOSQL debrief.

Publication / topic | Open source projects | Purpose and further reading
Map Reduce | Hadoop Core Map Reduce | Distributed programming on rails, 5 Hadoop questions, 10 Map Reduce Tips
GFS | HDFS (Hadoop File System) | Distributed file system for unstructured data
Bigtable | HBase, Hypertable | Distributed storage for structured data, When to use HBase
Chubby | ZooKeeper | Distributed lock and naming service
Sawzall | Pig, Cascading, JAQL, Hive | Higher-level language for writing map reduce jobs
Protocol Buffers | Protocol Buffers, Thrift, Avro, more traditional: Hessian, Java serialization | Data serialization, early benchmarks
Some NoSQL storage solutions | CouchDB, MongoDB | CouchDB: document database
Dynamo | Dynomite, Voldemort, Cassandra | Distributed key-value stores
Index | Lucene | Search index
Index distribution | Katta, Solr, Nutch | Distributed Lucene indexes
Crawling | Nutch, Heritrix, Droids, Grub, Aperture | Crawling linked pages

Keeping changesets small

2009-06-21 20:48
One trick of successful and efficient software development is tracking changes to the sources in a source code management system, be it a centralized system like svn or perforce or a decentralized system like git or mercurial. I started working with svn during my Diploma thesis project in 2003 and continued to use it as a researcher at HU Berlin. Today I am using svn at work as well as for Apache projects and have come to like git for personal sandboxes.

One thing that bothered me for the last few months was the question of how to keep changesets small enough to be easy to code-review, yet complete in that they contain enough to implement at least part of a feature fully.

So what makes a clean changeset for me? It contains at least one unit test to show that the implementation works. It contains all source code needed to make that test work and not break anything else in the source tree. It might contain every change needed to implement one specific feature. The second sort of changeset that comes to my mind might be rather large and contains all changes that are part of refactoring the existing code.

There are a few bad practices that in my experience lead to large unwieldy changesets:

  • Making two or more changes in one checkout. Usually this happens whenever you check out your code and start working on a cool new feature but get distracted by some other incoming feature request, by some bugfix or by mixing your changes with a patch from another developer you are about to review. Mixing changes makes it extremely difficult to keep track of which change belongs to which task. Usually the result is not checking in some of your changes and breaking the build.
  • Refactoring while working on a feature. Imagine the following situation: You are happily working along implementing your new feature. But suddenly you realize that some of your code should be refactored to better match your needs. And whoops: Suddenly it is no longer clear whether changes were made due to the refactoring steps (that might even be automated) or due to your new feature. I tend to at least try to do refactorings either before my changes, in a new checkout, or after I have finished changing the code (if that is feasible).
  • The definition of a feature is too large. I tend to get large changesets whenever I try to do too much in one go. That is, the "feature" that I am trying to implement simply is too large. Usually it is possible to break the task up into a set of smaller tasks that are easier to manage.

If you are using git, there is a nice way to avoid checking out the project a second time: You can simply "stash" away the changes made up to the current point in time, do all that is needed for what distracted you, check that in and re-apply the changes for your previous task.

This list will be updated as I learn about (that is "make") more mistakes that can be cleanly classified into new categories.

Open Source Development is good for you

2009-05-21 09:08
GSoC (Google Summer of Code) - one of the open source programs of Google - has started again in 2009. Students come to work for open source projects during the summer and, on success, are paid a fair amount of money by Google.

This program is an ideal opportunity for students to get into open source projects: You get a mentor and you have a pre-defined task to work on with a goal you set yourself. And in the end there is money.

At the beginning of the GSoC student ranking, Ted Dunning posted a very interesting mail on why, in his view, students should participate in open source development:

  • It is a perfect chance to work together with senior developers that are passionate about what they do.
  • Usually universities teach the theoretical side of life, which is good. But if working in industry later, students need experience with current development best practices and tools. They need to be aware of test driven development, they need to know how to use source control systems, continuous integration tools, build management frameworks, bug tracking tools. Open source projects usually are a great place to try out these technologies and learn how to best apply them.
  • Working on open source, students need to coordinate with their peers. They need to learn that development is not only about coding, but about communication as well.
  • Last but not least, this is a chance to choose for yourself what you are working on and achieve so much more than when starting yet another brand new single developer project.

In the end all this adds up to learning and practicing the skills needed to successfully work on software development projects with more than just a few developers.

Apache Con Europe 2009 - part 1

2009-03-29 18:41
The past week members, committers and users of Apache software projects gathered in Amsterdam for another Apache Con EU - and to celebrate the 10th birthday of the ASF. One week dedicated to the development and use of Free Software and the Apache Way.

Monday was BarCamp day for me, the first BarCamp I ever attended. Unfortunately not all participants proposed talks, so some of the atmosphere of an unconference was missing. The first talk by Danese Cooper was on "HowTo: Amsterdam Coffee Shops". She explained the ins and outs of going to coffee shops in Amsterdam and gave both legal and practical advice. There were presentations of the Open Street Map project and of several Apache projects. One talk discussed transferring the ideas of Free Software to other parts of life. Ross Gardler started a discussion on how to advocate contributions to Free Software projects in science and education.

Tuesday for me meant having some time for Mahout during the Hackathon. Specifically I looked into enhancing matrices with meta information. In the evening there were quite a few interesting talks at the Lucene Meetup: Jukka gave an overview of Tika, Grant introduced Solr. After Grant's talk some of the participants shared numbers on their Solr installations (number of documents per index, query volume, machine setup). To me it was extremely interesting to gain some insight into what people actually accomplish with Solr. The final talk was on Apache Droids, a still incubating crawling framework.

The Wednesday tracks were a little unfair: The Hadoop track (videos available online for a small fee) was right in parallel to the Lucene track. The day started with a very interesting keynote by Raghu from Yahoo! on their storage system PNUTS. He went into quite some technical detail. Obviously there is interest in publishing the underlying code under an open source license.

After the Mahout introduction by Grant Ingersoll I changed rooms to the Hadoop track. Arun Murthy shared his experience on tuning and debugging Hadoop applications. After lunch Olga Natkovich gave an introduction to Pig - a higher-level language on top of Hadoop that allows for the specification of filter operations, joins and basic control flow of map reduce jobs in just a few lines of Pig Latin code. Tom White gave an overview of what it means to run Hadoop on the EC2 cloud. He compared several options for storing the data to process. Today it is very likely that there will soon be quite a few more providers of cloud services in addition to Amazon.

Allen Wittenauer gave an overview of Hadoop from the operations point of view. Finally, Steve Loughran covered the topic of running Hadoop on dynamically allocated servers.

The day finished with a pretty interesting BOF on Hadoop. There still are people who do not clearly see the differences between Hadoop-based systems and database-backed applications. The best way to find out whether the model fits: Set up a trial cluster and experiment yourself. No one can tell which solution is best for you except for yourself (and maybe Cloudera setting up the cluster for you :) ).

After that the Mahout/UIMA BOF was scheduled - there were quite a few interesting discussions on what UIMA can be used for and how it integrates with Mahout. One major take home message: We need more examples integrating both. We developers do see the clear connections. But users often do not realize that many Apache projects should be used together to get the biggest value out.

Books I found particularly helpful

2009-03-12 18:44
During the last few years I have read quite a few books that one could easily file under the category "Hacking books". Some of them were particularly interesting to me and have influenced the way I write code. The following list certainly is not complete at all - but it is a nice starting point.

  • Effective C++ - I have comparably little experience with C++ but this book really helped me understand some of the particularities.
  • Effective Java - even though I have been developing in Java for a few years, reading and revisiting Effective Java helps with understanding and dealing with some of the quirks of the JVM.
  • Mythical Man Month - although it is classical literature for people dealing with software projects, very well known and easy to understand, it is scary to see that the exact same mistakes are still common in today's software projects.
  • Concurrent programming in Java - quick start on concurrent programming patterns - primarily focussed on Java. Fortunately no collection of recipes but thorough background information.
  • Working effectively with legacy code - I really like to have a look into this book from time to time. Shows great ways of untangling bad code, refactoring it and making it testable.
  • XP books by Kent Beck - if you ever had any questions on what XP is and how you should implement it: These are the books to read. Don't trust what people call XP in practice as long as they are not willing to refine and improve their "agile processes". Keep on working on what stops you from delivering great code.
  • Why programs fail - a guide to systematic debugging - If you ever had to debug complex programs - and I bet you have - this is the book that explains how to do this systematically. How to even have fun along the way.
  • Zen and the art of motorcycle maintenance - Not particularly about software development, but the techniques described apply stunningly well to it.
  • Release It! - just about to read that one. But already the first few pages are not only valuable and interesting but also entertaining.
  • Implementation Patterns - forgot that yesterday.
  • Presentation Zen - another one I forgot. Really helped me to make better presentations.

There are still quite a few good books on my list. If you have any recommendations - please leave them in the comments.

There are a few other book lists online in various blogs. Two examples are the ones below:

Basic statistics of a set of values

2009-03-09 11:21
Just noting this down so that I can find it the next time I search for it:

Problem: You have a set of values (for instance, the time it took to process various queries). You want a quick overview of how the values are distributed.

Solution: Store the values in a file, one value per line, read the file with R and output summary statistics.

R: times <- scan("times.txt", list(0))   # "times.txt" stands in for your own file
Read 30000 records
R: summary(times[[1]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   6.00   12.00   13.00   16.75   14.00 8335.00
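
For machines without R, roughly the same overview can be sketched in Python using only the standard library. This is a minimal sketch, not a replacement for summary(): the helper names read_values and summarize, the file name times.txt and the sample values are all made up for illustration. The quartiles use the "inclusive" method, which to my knowledge interpolates the same way as R's default quantile type. It needs Python 3.8 or newer.

```python
import statistics
from pathlib import Path


def read_values(path):
    """Read one numeric value per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [float(line) for line in lines if line.strip()]


def summarize(values):
    """Return the six figures that R's summary() shows for a numeric vector."""
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between neighbouring values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


if __name__ == "__main__":
    # Made-up sample values; for real data use read_values("times.txt"),
    # where "times.txt" is a hypothetical file name.
    sample = [6, 12, 12, 13, 13, 14, 14, 90]
    for name, value in summarize(sample).items():
        print(f"{name:>8} {value:10.2f}")
```

As in the R output, a maximum far away from the third quartile is the quick hint that a few outliers dominate the mean.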

That's it.