Learning to Rank Challenge

2010-03-09 19:49
In a recent blog post, Jeff Dalton listed the machine learning challenges that are currently running. Especially interesting for those working on search engines and interested in learning new rankings from data should be the Yahoo! Learning to Rank Challenge, to be held in conjunction with ICML 2010 in Haifa, Israel. The goal is to show that your algorithm scales on real-world data provided by Yahoo!. Tasks are split in two: the first focusses on traditional learning to rank procedures, the second on transfer learning. Tracks are open to participants from industry and research.

A second challenge was published by the machine learning theory blog. The challenge is hosted by Yahoo! as well and deals with Key scientific challenges in statistics and machine learning.

Both programs look pretty interesting - would be great to see lots of people from the community participating and comparing their systems.

Mahout at Berlin ignite

2010-03-01 22:24
This evening the first Berlin ignite event took place in the "Festsaal" in Berlin X-Berg. Organiser of the event was Matt Biddulph from Nokia Gate 5. We had eleven fantastic talks (ok, to be more precise: At least ten fantastic ones, my own can only be judged by the audience ;) ).

Topics included things you can learn when starting to collect data, themes from (agile) project management, RepRap machines (see also the Rep Rap FOSDEM 2010 talk), bots and robots. The talks finished with a presentation of a Part time scientist's vision of getting to the moon - an article on the project is available on heise newsticker.

The room was filled with more than 120 people, resulting in a location packed with interested attendees. It was great seeing the talks on such diverse topics. Hope to have more events of this format here in Berlin. Thanks go to Matt, all speakers and everyone involved in making the event a big success.

For those who didn't make it to the event, slides and audio should go online soon. At least the slides on Mahout are available online.

Preliminary schedule online for ignite Berlin

2010-02-23 19:13
Today the first talks scheduled for ignite Berlin were published. If you would like to give a talk yourself: submission still seems to be open.

FOSDEM 2010 - 10 years FOSDEM

2010-02-03 19:33
I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

The final schedule of FOSDEM 2010 is up: Looks like bad news - 306 interesting talks within just one weekend. Lots of interesting talks in the main track including Greg Kroah-Hartman on "Write and Submit your first Linux kernel Patch", David Recordon from Facebook on "Scaling Facebook with OpenSource tools", Bernard Li on "Ganglia: 10 years of monitoring clusters and grids", Andrew Tanenbaum with his "MINIX 3: a Modular, Self-Healing POSIX-compatible Operating System" talk, Benoît Chesneau on "CouchDB! REST and Database!" and many, many more.

In addition there will be many interesting DevRooms, including one on NoSQL, one on Free Java, the Mono DevRoom featuring a talk by Miguel de Icaza...

Looks like a weekend packed with interesting talks and discussions. If you are going there and are interested in an ad-hoc Hadoop-Beer-drinking meetup, make sure to contact me before the event.

Mahout in Action

2010-01-11 20:22
As noted earlier by Grant Ingersoll, the first chapters of Mahout in Action are already online at Manning:

Sean, Robin, keep up the great work! I would love to read more of the book in the near future.

With a little help from my friends

2009-12-31 23:55
The end of the year 2009 is quickly approaching. To me it feels a little like it ran away far too quickly. So instead of taking part in the annual review of past events, I would like to use it as an opportunity to say thank you: The past twelve months were a lot of fun with lots of interesting, nice people from all over the world. I got the chance to meet quite a bit of the Mahout community, I got lots and lots of new developers from all over Germany - or more precisely the EU - to attend the Apache Hadoop Get Together in Berlin. The interest in Mahout has grown tremendously over the past year.

All of this would not have been possible without the help of many people: First of all I'd like to thank Thilo Fromm - for making me happy whenever I was disappointed, for solacing me when I was sad, for patiently listening to me nervously whining before each and every talk, for kindly reviewing my slides and last but not least for helping me fix some of the problems that bugged me. Oh - and thanks for helping me fix, within minutes, the issue in the zookeeper c-client that had puzzled me for days.

Another big thanks goes to my family, first and foremost my mum, who kindly took care of organizing quite a bit of my paperwork and kept me on schedule with so many "unimportant" tasks like getting an appointment with some hospital to finally get the screws taken out of my knee ;)

A special thanks goes to the growing Mahout community as well as to the Lucene people - you know who you are - keep up the great work: You rock!

Furthermore there are students at TU Berlin who have shown that with Mahout it is "dead-simple" to write an application that, given a stream of documents, groups them by topic and makes the result searchable in Solr. Thanks to you for solving the minor and major problems, for communicating with the community and for being transparent about the problems you ran into. Looking forward to continuing to work together with you next year.

Finally a big thank you to all of the speakers, sponsors and attendees of the Apache Hadoop Get Together, the NoSQL conference and the Apache Dinner Berlin - without you these events would never have been possible. Looking forward to seeing you again in January/March 2010!

I hope I didn't forget too many people - just in case: I am pretty grateful for all the input, help and feedback I got this year.

PS: Another thanks to the spaceboyz visiting Berlin for 26C3 for helping Thilo tidy up our apartment after Congress was over this year ;)

Mahout 0.2 released

2009-11-18 10:52
Apache Mahout 0.2 has been released and is now available for public download at http://www.apache.org/dyn/closer.cgi/lucene/mahout

Up to date maven artifacts can be found in the Apache repository at

Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. http://www.apache.org/licenses/LICENSE-2.0

Mahout is a machine learning library meant to scale: Scale in terms of community to support anyone interested in using machine learning. Scale in terms of business by providing the library under a commercially friendly, free software license. Scale in terms of computation to the size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problem settings like clustering, collaborative filtering and classification over Terabytes of data over thousands of computers.

Implemented with scalability in mind, the latest release brings many performance optimizations so that the library performs well even in a single node setup.

The complete changelist can be found here:


New Mahout 0.2 features include

  • Major performance enhancements in Collaborative Filtering, Classification and Clustering
  • New: Latent Dirichlet Allocation (LDA) implementation for topic modelling
  • New: Frequent Itemset Mining for mining top-k patterns from a list of transactions
  • New: Decision Forests implementation for Decision Tree classification (In Memory & Partial Data)
  • New: HBase storage support for Naive Bayes model building and classification
  • New: Generation of vectors from Text documents for use with Mahout Algorithms
  • Performance improvements in various Vector implementations
  • Tons of bug fixes and code cleanup

Getting started: New to Mahout?

For more information on Apache Mahout, see http://lucene.apache.org/mahout

A very BIG Thank You to all those who made this release happen!

Open Source Expo 09

2009-11-16 22:17
I spent last Sunday and the following Monday at Open Source Expo Karlsruhe - co-located with web-tech and php-conference organized by the Software-and-Support Verlag. Together with Simon Willnauer I ran the Lucene/Mahout booth at the expo.

So far the conference is still very small (about 400 visitors) compared to free software community events. However, the focus was set to be more on professional users; accordingly, several projects showed that free software can be used successfully for various business use cases. Visitors were invited to ask Sun about their free software strategy. Questions concerning OpenJDK or MySQL were not uncommon. Large distributors like SuSE or Mandriva were present as well. Smaller companies, e.g. those providing support for Apache OFBiz, were present too.

The Apache Lucene project was invited as an exhibitor as well. Together with PRC and ConCom we organized an Apache banner. Lucid Imagination sponsored several Lucene T-Shirts to be distributed at the conference. At the very last minute, information (abstract, links to projects and mailing lists, and current users) was put together on flyers.

We arrived late on Saturday evening. Together with a friend of mine we went for some Indian food at a really good restaurant close to the hotel. Big thanks to her for being our tourist guide - hope to see you back in Waldheim in December ;)

Sunday was pretty quiet - only a few guests arrived on the weekend. I was invited by David Zuelke to give a brief introduction to Mahout during his MapReduce Hadoop tutorial workshop. Thanks, David. Though lunch was served already, people did stay to hear my presentation on large scale machine learning with Mahout. I got contacted by one of the students of Katharina Morik who was pretty interested in the project. Back at her research group people are working on RapidMiner - a tool for easy machine learning. It comes with a graphical user interface that makes it simple to explore various algorithm configurations and data workflow setups. It would be interesting to see how this tool helps people to understand machine learning. It would also be very interesting to learn what form of contribution to Mahout might be appropriate for research groups. Maybe not code-wise but more in terms of discussions and background knowledge.

Monday was a bit busier, with more people attending the conferences. Simon got a slot to present Lucene at the Open Stage track and show off the new features of Lucene 2.9. Those using Lucene already could be tricked into telling their Lucene success story at the beginning of the talk. At the booth we had a wide variety of people: from students trying to find a crawling and indexing system for their information retrieval course homework up to professionals with various questions on the Apache Lucene project. The experience of people at the conference varied widely. That proved to be a pretty good reality check. Being part of the Lucene and the ASF community, one might be tempted to think that not knowing about Lucene is almost impossible. Well, it seems to be less impossible than I, at least, expected.

One last success: As the picture shows, Yacy now is powered by Lucene as well - at least in terms of T-Shirt ;)

Apache Con US Wrap Up

2009-11-16 22:10
Some weeks ago I attended ApacheConUS09 in Oakland, California. In the meantime, videos of one of the sessions have been published online:

You can find a wrap up of the most prominent topics at the conference at heise (unfortunately German-only).

By far the largest topics at the conference:
  • Lucene - there was a meetup with over 100 attendees as well as two main tracks with Lucene-focussed talks. New features of Lucene 2.9.* were at the center of interest: the new range search capabilities, segment search that improves caching, a new token stream API that makes annotating terms more flexible, as well as a lot of performance improvements. Shortly after the conference, Lucene 2.9.1 and Solr 1.4 were released, so end-users switching to the new versions now benefit from better performance and several new features.
  • Hadoop - large scale data processing currently is one of the biggest topics. Be it logfile analysis, business intelligence or ad-hoc analysis of user data. Hadoop was covered by a user meetup as well as one track on the first conference day. The track started with an introduction by Owen O'Malley and Doug Cutting. It continued with talks on HBase, Hive, Pig and other projects from the Hadoop ecosystem.

But also projects like Apache Tomcat and Apache HTTPD were well covered within one to two sessions each.

Currently a hot topic within the foundation is the challenge of bringing the community together face-to-face. Apache projects have become so numerous that covering them all within 3+2 days of conference and trainings no longer seems feasible. One way to mitigate these problems might be to motivate people to do more local meetups, potentially supported by ConCom, as has already happened in the Lucene and Hadoop communities. A related topic is the task of community building and community growth within the ASF. Google Summer of Code has been a great way to integrate new people. However, the model does not scale that well for the foundation. With ComDev a new project was founded with the goal of working on community development issues, talking to research and getting students into open source early on. The project is largely supported by Ross Gardler, who already has experience with teaching and promoting open source and free software in the research context, being part of the open source watch project in the UK.

Apache Con US 09 brought together a large community of Apache software developers and users from all over the world who gathered in California, not only for the talks but also for face-to-face communication, coding together and exchanging ideas.

Update: Slides of my Mahout talk are now online.