Mahout in Action

2010-11-30 21:07
Flying to Atlanta I finally had a few hours of time to finalize my review of the Mahout in Action MEAP edition. The book is intended for potential users of Apache Mahout, a project focussed on implementing scalable machine learning algorithms.
Describing machine learning algorithms and their application to practitioners is a non-trivial task: usually there is more than one algorithm available for seemingly identical problem settings. In addition, each algorithm usually comes with multiple parameters for fine-tuning its behaviour to the problem setting at hand.
Sean Owen does an awesome job explaining the basic concepts behind building recommender systems. In a very intuitive way he highlights the properties of each algorithm and its options. Using one example setting taken from a real-world problem – parents buying music CDs for their children based on more or less background information – he shows how each of the available recommender algorithms behaves.
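To give a flavour of what that looks like in practice, here is a minimal sketch of a user-based recommender built on Mahout's Taste API. The class names are from contemporary Mahout releases; the input file ratings.csv (lines of userID,itemID,preference) is a made-up example, not one from the book.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
      public static void main(String[] args) throws Exception {
        // Preferences loaded from a simple CSV file (hypothetical example data).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Compare users by the correlation of their ratings.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the ten most similar users as the neighbourhood.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend three items for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }
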
The second section of the book covers the available implementations for clustering documents, that is grouping documents by similarity – a problem that is very common when it comes to grouping texts into topics and detecting upcoming new topics in a stream of publications. Robin Anil and Ted Dunning make it very easy to understand what clustering is all about and explain how to configure and use the current implementations in Mahout in various practical settings.
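The similarity idea at the heart of clustering is easy to make concrete: documents are represented as term vectors and compared with a distance measure. The toy sketch below uses Mahout's vector and distance-measure classes for that comparison; the hard-coded vectors merely stand in for real TF-IDF document vectors.

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DocumentDistanceSketch {
      public static void main(String[] args) {
        // Two tiny term vectors standing in for real document vectors.
        Vector doc1 = new DenseVector(new double[] {1.0, 0.0, 2.0, 0.0});
        Vector doc2 = new DenseVector(new double[] {1.0, 0.0, 1.5, 0.0});
        DistanceMeasure measure = new CosineDistanceMeasure();
        // 0.0 means identical direction, larger values mean less similar documents.
        System.out.println("distance = " + measure.distance(doc1, doc2));
      }
    }
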
The book looks very promising. It is well suited for engineers looking for an explanation of how to successfully use Mahout to solve real-world problems. In contrast to existing publications it makes it easy to grasp the basic concepts even without wading through complicated computations. The book is especially targeted at Mahout users. However, it does give the important background information on the available algorithms that is needed to decide exactly which implementation and which configuration to use. Looking forward to the last section on classification algorithms.

Apache Con – Wrap up

2010-11-29 23:27
After one week of lots of interesting input the ASF's user conference was over. With its focus on Apache software users, quite a few talks were not too well suited for conference regulars but rather targeted at newbies who want to know how to successfully apply the software. As a developer of Mahout with close ties to the Lucene and Hadoop communities, what is of course most interesting to me are stories of users putting the software into production.
The conference was well organised. The foundation features way more projects at the moment than can reasonably be covered in just one conference; as a result Apache Con covers only a subset of what is being developed at Apache, and more and more smaller community-organised events are being run by individuals as well as corporations. Still, Apache Con is a nice place to get in touch with other developers – and to get a glimpse of what is going on in projects outside one's own regular community.

Christmas Scrumtisch

2010-11-29 22:59
Today the last Scrumtisch Berlin in 2010 took place in Friedrichshain. Thanks to Marion Eickmann and Andrea Tomasini for organising the Scrum user group regularly for the past years.

Though no presentation had been scheduled ahead of time, the Scrumtisch was well attended by over twenty people, mostly from companies based in Berlin who are either using Scrum already or are currently in a transition phase.

We went straight into collecting and voting on topics for discussion. In total we ended up having eight potential topics listed, including but not limited to


  • Scrum and non-feature teams: does it work, and if so, how?
  • Transitioning to Scrum: which stakeholders in a company must be convinced first?
  • Scrum with teams made up of people who buy into Scrum and people who don't: does that work?
  • Can Scrum be combined with people who want to telecommute?
  • Scrum and platform development: how do the two get combined?
  • Scrum in systems engineering and embedded development: how to set up teams?


After voting we had two clear winners: Scrum in teams that don't completely buy into the method, and telecommuting in Scrum teams.

Scrum with broken teams


The situation described: the attendee proposing the topic is Scrum master of a team that does not completely buy into Scrum. There are a few developers who like being self-organising and who love short feedback cycles. However, there are a few others who would rather stick to their technological niche, get tasks assigned to them and avoid taking over tasks from others.

During the discussion we found out that in this company Scrum had been introduced as a grass-roots movement a little over a year ago. The introduction of the method led to success clearly visible in the company; in turn, the method was tried on a larger team as well. However, at the moment the team is at a point where it is about to break apart into developers who are happy with change and flexible enough to adapt to shifts in technology, and a second half that would rather continue developing the old way.

One very important point was raised by one of the attendees: with Scrum having been introduced so fast, compared to how long the company had been working the old way before, it may well be time to slow down a bit: to sit down with the team in a relaxed environment and find out more about how everyone assesses the current situation. Find out more about what people like about the new approach, and about what should be changed and still needs improvement. In the end it's not a process issue but a people problem - there is a need to get the team on board.

Team-building activities might help as well - let the team experience what it means to be able to rely on each other. What does it mean to learn new things in a short time, to co-operate to solve tasks so far untackled?

If team members start ignoring the sprint backlog and work on other tasks instead, there is a question of whether there is enough trust in the product owner's decisions. On the other hand, with pressure resting on the team's shoulders there might be a need to stop the train, fix all open issues and continue only after the project is back in shape. However, this too needs all team members working towards a common goal - with everyone willing to take up any open task.

Scrum and telecommuting


Basically the question was whether it works at all (a clear yes from the audience) and if so, which best practices to use. To be more precise: does Scrum still work if some of the team members work from home a few days a week but are in the office the rest of the time? The risk of course lies in losing information while the team builds common knowledge - and thus becoming less productive.

There are technical tools that can help the process: electronic Scrum boards (such as Greenhopper for JIRA or Agilo) as well as tele-conferencing systems, wikis, social networking tools, and screen sharing for easier pair programming. However, any tool used must entail less overhead than the benefit it provides to the team. Communication will become more costly - but if and to what extent this translates into a loss in productivity varies greatly.

There must be a clear commitment from both sides - the telecommuter as well as the team on-site - to keep the remote person in the loop. Actually it is easier with teams that are completely remote. This experience is familiar from any open source project: with people working in different time zones it comes naturally to take any decision on a mailing list. However, with some people having the chance to communicate face-to-face, decisions suddenly become way less transparent. At Apache we even go as far as telling people that any decision that is not taken on the mailing list never really was taken at all. A good insight into how distributed teams at Apache work was given earlier by Bertrand Delacrétaz.

For team-building reasons it may make sense to start out with a co-located team and split off people interested in working from home later on. That way people have a chance to get to know each other face-to-face, which makes later digital-only communication way easier.

Thanks again to Marion and Andrea for organising today's Scrumtisch. If you are using Scrum and happen to be in Berlin - send an e-mail to Marion to let her know you are interested in the event, or simply join us at the published date.

Teddy in Atlanta

2010-11-28 23:24
While I was happily attending Apache Con US in Atlanta/GA, my teddy had a closer look at the city: he first went to the Centennial Olympic Park and took a picture of the World of Coca-Cola (wondering what strange kinds of museums there are in the US).



After that he headed over to Midtown, having a quiet time in Piedmont Park, and finally had a closer look at the private houses still decorated for Halloween. Seems like it was squirrel day: he told me he met more than ten squirrels.



I found quite a few impressive pictures of the arts museum on my camera after his trip out – as well as several images taken on the campus of Georgia Tech. It's amazing to see what facilities are available to students there – especially compared to the equipment of German universities.

Apache Con – last day

2010-11-27 23:23
Day three of Apache Con started with interesting talks on Tomcat 7, including an introduction to the new features of that release. Those include better memory leak prevention and detection capabilities – the implementation of these capabilities has led to the discovery of various leaks that appear under more or less weird circumstances in famous open source libraries and in the JVM itself. Better management and reporting facilities are also part of the new release.

As I started the third day over at the Tomcat track, I unfortunately missed the Tika and Nutch presentations by Chris Mattmann – so I am happy that at least the slides were published online: $LINK. The development of Nutch was especially interesting for me as that was the first Apache project I got involved with back in 2004. Nutch started out as a project with the goal of providing an open source internet-scale search engine. Based on Lucene as its indexing kernel, it also provides crawling, content extraction and link analysis.

With the focus on building an internet-scale search engine, the need for a distributed processing environment quickly became apparent. Initial implementations of a Nutch distributed file system and a map-reduce engine led to the creation of the Apache Hadoop project.

In recent years it was comparably quiet around Nutch. Besides Hadoop, content extraction was also factored out of the project, into Apache Tika. At the moment development is gaining momentum again. Future development is supposed to focus on building an efficient crawling engine: the project wants to leverage Apache HBase as its storage backend, Tika for content extraction and Solr as its indexing backend.
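For illustration, this is roughly what the content extraction handled by Tika looks like from the client side; the Tika facade class shown here is part of current Tika releases, and the file name is an arbitrary example.

    import java.io.File;

    import org.apache.tika.Tika;

    public class TikaSketch {
      public static void main(String[] args) throws Exception {
        // The facade detects the document type and returns the extracted plain text.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("crawled-page.html"));
        System.out.println(text);
      }
    }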

I loved the presentation by Geoffrey Young on how Ticketmaster used Solr to replace their old MySQL-based search system for better performance and more features. Indexing documents representing music CDs presents some special challenges when it comes to domain modeling: there are bands with names like “!!!”. In addition, users are very likely to misspell certain artist names. In contrast to large search providers like Google, these businesses usually have neither the human resources nor enough log data to provide comparable coverage, e.g. when implementing spell-checking. A very promising and agile approach taken instead was to parse the log files for the most common failing queries and learn from those which features users need: there were many queries including geo information, coming from users looking for an event at one specific location. As a result, geo information was added to the index, leading to happier users.
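A sketch of what such a log-driven analysis could look like is shown below; the log format with a hits=0 marker, the file name and the threshold are assumptions purely for illustration, not Ticketmaster's actual setup.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class FailingQueryReport {
      public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader reader = new BufferedReader(new FileReader("solr-queries.log"));
        String line;
        while ((line = reader.readLine()) != null) {
          // Only queries that returned no results are interesting here.
          if (!line.contains("hits=0")) {
            continue;
          }
          String query = extractQuery(line);
          Integer seen = counts.get(query);
          counts.put(query, seen == null ? 1 : seen + 1);
        }
        reader.close();
        // Print the queries that failed most often.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
          if (entry.getValue() >= 10) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
          }
        }
      }

      // Pull the q= parameter out of a log line; details depend entirely on the log format.
      private static String extractQuery(String line) {
        int start = line.indexOf("q=");
        if (start < 0) {
          return line;
        }
        int end = line.indexOf('&', start);
        return end < 0 ? line.substring(start + 2) : line.substring(start + 2, end);
      }
    }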

Apache Con – Mahout, commons and Lucene

2010-11-26 23:21
On the second day the track most interesting to me provided an overview of some of the Apache Commons projects. Though seemingly small in scope and light-weight in implementation and dependencies, these projects provide vital features not yet well supported by the Sun JVM. There is Commons Math, featuring a fair amount of algebraic, numeric and trigonometric functions (among others), and the Commons Exec framework for executing processes external to the JVM without running into the danger of creating deadlocks or wasting resources.
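As an example, this is roughly what running an external command through Commons Exec looks like, instead of hand-rolling Runtime.exec() and its stream handling; the command line and timeout are just placeholders.

    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecutor;
    import org.apache.commons.exec.ExecuteWatchdog;

    public class ExternalProcessSketch {
      public static void main(String[] args) throws Exception {
        CommandLine cmdLine = CommandLine.parse("ls -l");
        DefaultExecutor executor = new DefaultExecutor();
        // Kill the child process if it runs for longer than 60 seconds.
        executor.setWatchdog(new ExecuteWatchdog(60 * 1000));
        int exitValue = executor.execute(cmdLine);
        System.out.println("exit value: " + exitValue);
      }
    }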

After that the Mahout and Lucene presentations were up. Grant gave a great overview of various use cases of machine learning in the wild, rightly claiming that anyone using the internet today makes use of some machine learning powered application every day – be it e-mail spam filtering, the Gmail priority inbox, recommended articles on news sites, recommended items to buy at shopping sites or targeted advertisements shown when browsing. The talk concluded with a more detailed presentation of how to successfully combine the features of Mahout and Lucene/Solr to build next generation web services that integrate user feedback into their user experience.

ApacheCon - Keynotes

2010-11-25 23:20
The first keynote was given by Dana Blankenhorn – a journalist and blogger regularly publishing tech articles with a clear focus on open source projects. His keynote focussed on the evolution of open source projects, with a special emphasis on Apache.

Coming from a research background myself, I found the keynote given by Daniel Crichton from NASA very interesting: according to the speaker, scientists are facing challenges that are all too well known to large and distributed corporations. Most areas of science are currently becoming more and more dependent on data-intensive experiments. Examples include but are not limited to

  • The field of biology, where huge numbers of experiments are needed to decipher the inner workings of proteins, or to understand the fundamental concepts underlying the data encoded in DNA.
  • In physics, hadron collider experiments generate huge amounts of data. With the facilities for running such experiments expensive to build and the amount of data generated far too large to be analysed by just one team, groups of scientists suddenly face the issue of exchanging data with remote research groups. They run into the requirement of integrating their systems with those of other groups. All of a sudden, data formats and interfaces have to be standardised somehow.
  • Running space missions used to be limited to a very small number of research institutions in a very small number of countries. However, this is about to change as more countries gain the knowledge and facilities to run space missions. Again this leads to the need to collaborate towards one common goal.

Not only are the software systems used so far distinct and incompatible, even the data formats usually are. The result is scientists spending most of their time re-formatting, converting and importing datasets before being able to get any real work done. At the moment research groups are not used to working collaboratively in distributed teams. Usually experiments are run on specially crafted, one-off software that cannot easily be re-used, that does not adhere to any standards and that is being re-written over and over again by every research group. Re-using existing libraries oftentimes is a huge cultural shift, as researchers seemingly are afraid of external dependencies and of giving up control over part of their system.

One step in the right direction was taken by NASA earlier this year: they released their decision making support system OODT under a free software license (namely the Apache Software License) and put the project under incubation at Apache. The project is currently about to graduate to its own top level Apache project. This step is especially remarkable as successfully going through the Incubator also means having established a healthy community that is not only diverse but also open to accepting incoming patches and changes to the software. This means not only giving up control over your external dependencies but also having the project run in a meritocratic, community driven model. For the contributing organisation, this boils down to no longer having total control over the future roadmap of the project. In return this usually leads to higher community participation and higher adoption in the wild.

Apache Con – Hadoop, HBase, Httpd

2010-11-24 23:19
The first Apache Con day featured several presentations on NoSQL databases (track sponsored by Day software), a Hadoop track as well as presentations on Httpd and an Open source business track.

Since its inception Hadoop was always intended to be run in trusted environments, firewalled from hostile users or even attackers. As such it never really supported any security features. This is about to change with the new Hadoop release, which includes better Kerberos-based security.

When creating files in Hadoop, a long awaited feature was append support. Basically, up to now writing to Hadoop was a one-off job: open a file, write your data, close it and be done. Re-opening and appending data was not possible. This situation is especially bad for HBase, as its design relies on being able to append data to an existing file. There have been earlier efforts to add append support to HDFS, as well as integration of such patches by third party vendors. However, only with the current Hadoop release is append officially supported by HDFS.
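From the client side, append support looks roughly like the sketch below; the path is a placeholder, and on releases of that era append typically also has to be enabled explicitly in the cluster configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Reopen an existing file and add more data to its end.
        Path file = new Path("/logs/events.log");
        FSDataOutputStream out = fs.append(file);
        out.writeBytes("one more record\n");
        out.close();
        fs.close();
      }
    }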

A very interesting use case of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This includes not only linking to free software under incompatible licenses but also developers copying pieces of free code, e.g. copying entire classes or functions that were originally available only under copyleft licenses into internal projects.

The speaker went into some detail explaining the initial problems they had run into: obviously it is not a good idea to mix and match Hadoop and HBase versions freely. Instead it is best practice to use only versions claimed to be compatible by the developers. Another common mistake is to leave the parameters of both projects at their defaults. The default parameters are supposed to be fool-proof; however, they are optimised to work well for Hadoop newbies who want to try out the system on a single node and obviously need more attention in a distributed setting. Other anti-patterns include storing only tiny little files in the cluster, thus quickly running out of memory on the namenode (which stores all file information, including block mappings, in main memory for faster access).

In the NoSQL track Jonathan Gray from Facebook gave a very interesting overview of the current state of HBase. As it turned out, only a few days later Facebook would announce its internal use of HBase for the newly launched Facebook messaging feature.

HBase has adopted a release cycle with separate development and production releases to get the system into interested users' hands more quickly: users willing to try out new experimental features can use the development releases of HBase; those who are not should go for the stable releases.

After focussing on improving performance in the past months, the project is currently focussing on stability: data loss is to be avoided by all means. Zookeeper is to be integrated more tightly for storing configuration information, thus enabling live reconfiguration (at least to some extent). In addition HBase is aiming to integrate stored-procedure-like behaviour: as explained in Google's Percolator paper $LINK, batch oriented processing gets you only so far. If data gets added constantly, it makes sense to give up some of the throughput batch-based systems provide and instead optimise for shorter processing cycles by implementing event-triggered processing.

On the recommendation of one of neofonie's sys-ops I visited some of the httpd talks: first Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically re-writing HTTP response content to match your application. There is even a spell checker for request URLs: given that marketing has handed your flyer to the press with a typo in the URL, chances are that the spell-checking module can fix such requests automatically. Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit could be found – so instead of returning a 404 right away, the server first tries to find the document by taking into account common misspellings.

Apache Con – Hackathon days

2010-11-23 23:17
This year on Halloween I left for a trip to Atlanta/GA. Apache Con US was supposed to take place there, featuring two presentations on Apache Mahout – one by Grant Ingersoll explaining how to use Mahout to provide better search features in Solr, and one by myself with a general introduction to the features Mahout provides, giving a bit more detailed information on how to use Mahout for classification.

I spent most of Monday in Sally Khudairi's media training. In the morning session she explained the ins and outs of successfully marketing your open source project: one of the most important skills is being able to provide a dense but still accessible explanation of what your project is all about and how it differs from other projects potentially in the same space. As a first exercise attendees met in pairs, interviewing each other about their respective projects. When I summarised the information I had gotten, Sally quickly pointed out additional pieces of valuable information I had totally forgotten to ask about:


  • First of all the full name of the interviewee, including the surname.
  • Second, the background of the person with respect to the project. It seemed all too natural that someone you meet at Apache Con in a media training almost certainly is either a founder of or a core committer to the project. Still it is interesting to know more about how long he has been contributing and whether he maybe even co-founded the project.


After that first exercise we went into detail on various publication formats. When releasing project information, the first format that comes to mind is the press release. For software projects at the ASF these are created in a semi-standardised format containing

  • Background on the foundation itself.
  • Some general background on the project.
  • A few paragraphs on the news to be published on the project in an easily digestible format.
  • Contact information for more details.


Some of these parts can be re-used across different publications and occasions. It does make sense to keep these building blocks as a set of boilerplates ready to use when needed.

After lunch Michael Coté from RedMonk visited us. Michael has a development background; currently he works as a business analyst for RedMonk. It is fairly simple to explain technical projects to fellow developers, so to get some experience in explaining our projects to non-technical people as well, Sally invited Michael to interview us. By the end of the interview Michael asked each of us whether we had any questions for him. As understanding what machine learning can do for your average Joe programmer is not at all trivial, I simply asked him for strategies for better explaining or show-casing our project. One option that came to his mind was to come up with one – or a few – example show cases where Mahout is applied to freely available datasets. Currently most data analysis systems are rather simple or based only on a very limited set of data. Showing on a few selected use cases what can be done with Mahout should be a good way to get quite some media attention for the project.

During the remaining time of the afternoon I started working on a short explanation of Mahout and our latest release. The text was reviewed by the Mahout community and then published by Sally on the blog of the Apache Software Foundation. I also used it as the basis for an article on heise open that got published that same day.

The second day was reserved for a mixture of attending the Barcamp sessions and hacking away at the Hackathon. Ross had talked me into giving an overview of various Hadoop use cases, as that had been requested by one of the attendees. However, it turned out the attendee wasn't really interested in specific use cases: the discussion quickly turned into the more fundamental question of how far the ASF should go in promoting its projects. Should there be a budget for case studies? Should there even be some sort of marketing department? Clearly that is out of scope for the foundation, and in addition it would run contrary to the ASF being neutral ground for vendors to collaborate towards common goals while still separately making money providing consulting services, selling case studies and so on.

During the Hackathon I became a mentor for Stanbol, a new project just now entering incubation. In addition I spent some time finally catching up with the Mahout mailing list.