Mahout in Action

2010-01-11 20:22
As noted earlier by Grant Ingersoll, the first chapters of Mahout in Action are already online at Manning:





Sean, Robin, keep up the great work! I would love to read more of the book in the near future.

How much of Scrum is implemented?

2010-01-06 22:36
I have started using Scrum for various purposes: it has inspired the way software is developed at my current employer, I use it to organize a student project at university, and we are even using it at home to line up all personal tasks (preparing breakfast, doing the laundry, meeting with friends...) for each week.

Since I am constantly looking for ways to evaluate, refine and improve how we work, I am also looking for ways to judge which aspects of a Scrum implementation can actually be improved. One pretty common way to do this evaluation is the so-called "Nokia test": a set of questions about your project management that lets you judge your implementation of Scrum. As an example, let's take a closer look at our "Scrum housework" implementation.


Question 1 - Iterations

  • No iterations - 0

  • Iterations > 6 weeks - 1

  • Variable length < 6 weeks - 2

  • Fixed iteration length 6 weeks - 3

  • Fixed iteration length 5 weeks - 4

  • Fixed iteration 4 weeks or less - 10




Currently we are doing one-week iterations - planning ahead any longer just seems impossible, except for events like going to conferences or regular birthdays. So that makes 10 points for iterations.


Question 2 - Testing within the Sprint

  • No dedicated QA - 0

  • Unit tested - 1

  • Feature tested - 5

  • Features tested as soon as completed - 7

  • Software passes acceptance testing - 8

  • Software is deployed - 10




Hmm. Admittedly there is no real testing in place, except for smoke tests for stuff like emptying the dishwasher.

Question 3 - Agile Specification

  • No requirements - 0
  • Big requirements documents - 1
  • Poor user stories - 4
  • Good requirements - 5
  • Good user stories - 7
  • Just enough, just in time specifications - 8
  • Good user stories tied to specifications as needed - 10


We do not have big documents describing how to set up the Christmas tree. But at the beginning of each sprint there is a set of user stories, with acceptance criteria specified where needed. So something like "Tidy up computer desk" would be augmented with the information: "To the extent that there are no items except for the laptop on the desk afterwards and the desk was dusted". That probably makes a 10.


Question 4 - Product Owner
  • No Product Owner - 0
  • Product Owner who doesn’t understand Scrum - 1
  • Product Owner who disrupts team - 2
  • Product Owner not involved with team - 2
  • Product owner with clear product backlog estimated by team before Sprint Planning meeting (READY) - 5
  • Product owner with release roadmap with dates based on team velocity - 8
  • Product owner who motivates team - 10



We are both familiar with Scrum. However, due to the nature of the tasks and due to the lack of people in the loop, we exchange the product owner role regularly. We are also still missing a real product backlog - currently it is loosely defined as a pile of post-it notes, with an estimate beside each item that defines its complexity. So I would give some 3 points on that one.


Question 5 - Product Backlog

  • No Product Backlog - 0
  • Multiple Product Backlogs - 1
  • Single Product Backlog - 3
  • Product Backlog clearly specified and prioritized by ROI before Sprint Planning (READY) - 5
  • Product Owner has release burndown with release date based on velocity - 7
  • Product Owner can measure ROI based on real revenue, cost per story point, or other metrics - 10



We only have one product backlog, though it is very informal. So that makes 2 points.






Question 6 - Estimates

  • Product Backlog not estimated - 0
  • Estimates not produced by team - 1
  • Estimates not produced by planning poker - 5
  • Estimates produced by planning poker by team - 8
  • Estimate error < 10% - 10



Naturally, those doing the tasks are the ones producing the estimates. Thanks to Agile42 in Berlin we now even have a set of planning poker cards: Yeah! That makes 8 points. Just as an example: getting returnable bottles back to the shop makes for 8 complexity points, going to the cinema is 16, just like storing the Christmas decorations back in their boxes, and preparing breakfast is just about 3 points ;)
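For readers who have never played planning poker: everyone reveals a card at once, and if the estimates are close, the team settles on one; if they are far apart, the outliers explain their reasoning and the team re-votes. A minimal sketch of that consensus rule (the card deck and the "one step apart" threshold are my own assumptions, not part of any standard):

```python
# A common planning poker deck; adapt to whatever cards your team uses.
DECK = [1, 2, 3, 5, 8, 13, 20]

def consensus(cards):
    """Return the agreed estimate if all revealed cards are at most one
    deck step apart (taking the larger card), else None to signal that
    the team should discuss and re-vote."""
    idx = sorted(DECK.index(c) for c in cards)
    if idx[-1] - idx[0] <= 1:
        return DECK[idx[-1]]
    return None

print(consensus([5, 8, 8]))   # close enough -> 8
print(consensus([2, 13, 5]))  # wide spread  -> None, discuss and re-vote
```

Taking the larger of two adjacent cards is just one possible tie-break; some teams discuss any disagreement at all.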


Question 7 - Sprint Burndown Chart
  • No burndown chart - 0
  • Burndown chart not updated by team - 1
  • Burndown chart in hours/days not accounting for work in progress (partial tasks burn down) - 2
  • Burndown chart only burns down when task in done (TrackDone pattern) - 4
  • Burndown only burns down when story is done - 5
  • Add 3 points if team knows velocity
  • Add two points if Product Owner release plan based on known velocity



We do have a whiteboard with post-it notes on it that are checked out and moved to done as soon as they are done - there is no arguing about the laundry being done before it is washed, dried, ironed and back in the closet ;) So that makes for 5 points. In addition we know our velocity, which makes for another 3 points:





Naturally we are pretty capable of telling what can reasonably be expected to get done within the coming sprints. That adds another 2 points for the release plan being based on that velocity.
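The velocity-based forecast mentioned above boils down to very simple arithmetic: average the story points completed in past sprints, then divide the remaining backlog by that average. A sketch with made-up numbers (none of these figures are from our actual board):

```python
import math

# Story points completed in the last few sprints (hypothetical values).
completed = [21, 18, 24]

# Velocity: average points finished per sprint.
velocity = sum(completed) / len(completed)  # 21.0

# Remaining backlog, estimated in the same complexity points.
backlog_points = 84

# Forecast: round up, since a partially finished sprint still takes a sprint.
sprints_needed = math.ceil(backlog_points / velocity)
print(sprints_needed)  # 4
```

With one-week sprints, that forecast translates directly into a release date four weeks out - which is exactly what the "release plan based on known velocity" bullet asks for.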


Question 8 - Team Disruption

  • Manager or Project Leader disrupts team - 0
  • Product Owner disrupts team - 1
  • Managers, Project Leaders or Team leaders telling people what to do - 3
  • Have Project Leader and Scrum roles - 5
  • No one disrupting team, only Scrum roles - 10



There are events and people interrupting running sprints: say, NoSQL meetups that get planned spontaneously, or new articles that have to be written and printed within less than a week. But such events are rather rare and are kept to a minimum thanks to the short sprint length. So that makes for 3 points.


Question 9 - Team

  • Tasks assigned to individuals during Sprint Planning – 0
  • Team members do not have any overlap in their area of expertise – 0
  • No emergent leadership - one or more team members designated as a directive authority -1
  • Team does not have the necessary competency - 2
  • Team commits collectively to Sprint goal and backlog - 7
  • Team members collectively fight impediments during the sprint - 9
  • Team is in hyperproductive state - 10



Currently we are in a state where we have identified and started removing impediments: declining tasks that cannot reasonably be done within the given timeframe, getting a real product backlog up, and tracking even minor tasks like writing e-mails to organize the Apache Hadoop Get Together. So that makes for 9 points.

In total that makes for 55 points (it is 23:34, I am a little tired but cannot sleep due to caffeine - so do check my maths). How does your team score on the Nokia test?
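For anyone who wants to double-check the tally (or compute their own), here is the sum spelled out, using the per-question scores from this post:

```python
# Per-question scores from the walkthrough above.
scores = {
    "iterations": 10,
    "testing": 0,
    "agile specification": 10,
    "product owner": 3,
    "product backlog": 2,
    "estimates": 8,
    "burndown": 5 + 3 + 2,  # done-based burndown + known velocity + release plan
    "team disruption": 3,
    "team": 9,
}

total = sum(scores.values())
print(total)  # 55
```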

Third "December Hadoop Get Together" video online

2010-01-05 19:29
In the following video taken at the last Hadoop Get Together in Berlin Jörg Möllenkamp explains why Hadoop is interesting for Sun - and why Sun Hardware might be a good fit for Hadoop applications:

Hadoop Jörg Möllenkamp from Isabel Drost on Vimeo.



In a blog post published after the event, Jörg gives more details on the idea of Parasitic Hadoop that he introduced at the meetup.

Second December Hadoop Get Together video

2010-01-03 14:57
Richard Hutton from nugg.ad explained how they scaled their ad recommendation system to an increasing number of users with the help of Hadoop. To learn more on their use case and details on which problems they solved with Hadoop, watch the video below:

Hadoop Richard Hutton from Isabel Drost on Vimeo.

With a little help from my friends

2009-12-31 23:55
The end of the year 2009 is quickly approaching. To me it feels a little like it ran away far too quickly. So instead of taking part in the annual review of past events, I would like to use it as an opportunity to say thank you: The past twelve months were a lot of fun with lots of interesting, nice people from all over the world. I got the chance to meet quite a bit of the Mahout community, I got lots and lots of new developers from all over Germany - or more precisely the EU - to attend the Apache Hadoop Get Together in Berlin. The interest in Mahout has grown tremendously over the past year.

All of this would not have been possible without the help of many people: First of all I'd like to thank Thilo Fromm - for making me happy whenever I was disappointed, for comforting me when I was sad, for patiently listening to me nervously whining before each and every talk, for kindly reviewing my slides and, last but not least, for helping me fix some of the problems that bugged me. Oh - and thanks for helping me fix, within minutes, the issue in the ZooKeeper C client that had puzzled me for days.

Another big thanks goes to my family, first and foremost my mum, who kindly took care of organizing quite a bit of my paperwork and kept me on schedule with so many "unimportant" tasks, like getting an appointment with some hospital to finally get the screws taken out of my knee ;)

A special thanks goes to the growing Mahout community as well as to the Lucene people - you know, who you are - keep up the great work: You rock!

Furthermore there are the students at TU Berlin who have shown that with Mahout it is "dead simple" to write an application that, given a stream of documents, groups them by topic and makes the result searchable in Solr. Thanks for solving the minor and major problems, for communicating with the community, and for transparently reporting problems. I am looking forward to continuing to work with you next year.

Finally a big thank you to all of the speakers, sponsors and attendees of the Apache Hadoop Get Together, the NoSQL conference and the Apache Dinner Berlin - without you these events would never have been possible. Looking forward to seeing you again in January/March 2010!

I hope I didn't forget too many people - just in case: I am pretty grateful for all the input, help and feedback I got this year.

PS: Another thanks to the spaceboyz visiting Berlin for 26C3 for helping Thilo tidy up our apartment after Congress was over this year ;)

First December Apache Hadoop Berlin video online

2009-12-31 20:27
The video of Nikolaus Pohle's talk at the December Apache Hadoop Get Together Berlin is online already - more to come soon.

hadoop nikolaus pohle from Isabel Drost on Vimeo.



Thanks to Martin from newthinking for video taping and uploading. Thanks to StudiVZ for sponsoring the video.

Screws are out

2009-12-26 16:45


(Photos: before, some time in between, after)







On December 22nd those screws got taken out of my knee: I had to be at the hospital early in the morning (early as in: arrive at 6:45am). In return I was allowed to go home the same afternoon: finally some time for reading and refining MAHOUT-85 ;)

Winter arrived in Berlin

2009-12-18 20:10
Finally winter seems to have arrived in Berlin as well:





Looks a little like Christmas is drawing closer. The only disadvantage of the weather: one of the brakes on my bike froze after very few minutes. Luckily for me, my bike has one of those old-fashioned back-pedal brakes ;)

Summary - December Get Together

2009-12-16 22:23
Today the seventh Apache Hadoop Get Together took place in Berlin. The room was again packed, with more than 40 people from various companies with and without practical Hadoop experience: there were people from Nokia Gate 5, Sun, nurago, StudiVZ, Dawanda, Last.fm and nugg.ad. There were people from academia, e.g. HPI Potsdam, and a few freelancers interested in the topic or providing help with Hadoop.

We had three very interesting talks. The first one was given by Richard Hutton from nugg.ad on their usage of Hadoop. They provide targeted advertisement services to their clients, so naturally they need to process lots of user interactions to be able to draw reliable conclusions. nugg.ad started out with a traditional system setup: Erlang loggers in front, data fed into well-known data warehouse infrastructure, analysed, and the results pushed back to the frontends. However, this architecture would only scale so far. So at the beginning of 2009 they started migrating their systems over to Hadoop. (Thanks from the speaker to Tom White for publishing the Hadoop book at O'Reilly, which obviously helped the developers a lot.) Today, nugg.ad's analysis time is down from one to two days to one to two hours. I will link the slides of the talk as soon as I have the pdf version available.

The second talk was given by Jörg Möllenkamp on what Sun is doing with Hadoop. Sun does have "special" hardware - special in that they have systems with up to 512 virtual processors on one chip. With Solaris they have an operating system that scales to that architecture. But now they are looking for applications that can use such hardware efficiently as well. Hadoop is well suited for distributing computations, so it looked like a great fit for Sun. Slides are available online.



The last talk was given by Nikolaus Pohle from nurago. They switched to Hadoop only recently. Coming from online market analysis, they have to analyse lots of user interaction data. Currently they are moving away from a MySQL-based architecture to a distributed system based on HDFS and Map/Reduce. To make writing M/R jobs easier for their employees, they built their own abstraction language on top of Hadoop that helps formulate recurring jobs. That sounds a lot like what Pig or Cascading already do - but it is specially targeted at the type of jobs they have. Slides are available online. There is also a pdf version for users who prefer open formats.

If anyone should be interested in it, I also put my introductory slides online.

The next meetup will be in March 2010. It will feature a talk by Zanox on their Hadoop usage, one by eCircle from Munich, and one by Nokia. You are very welcome to join us. If you would like to give a presentation yourself, please do contact me. If you would like to sponsor the event, please send me an e-mail.

A big Thank You to all the speakers - Nikolaus Pohle from nurago, Jörg Möllenkamp from Sun and Richard Hutton from nugg.ad - without you, the event would not be possible. Another big Thank You to newthinking for providing the venue for free. And, last but not least, another big Thank You to StudiVZ for sponsoring the videos. They will be linked to from here as well as from the StudiVZ blog as soon as they are available.