March 2010 Apache Hadoop Get Together Berlin

2010-01-29 08:40
This is to announce the next Apache Hadoop Get Together that will take place at the newthinking store in Berlin.


  • When: March 10th, 4 p.m.
  • Where: newthinking store Berlin


As always there will be 20-minute slots for talks on your Hadoop topic. After each talk there will be plenty of time for discussion. You can order drinks directly at the bar in the newthinking store. If you like, you can order pizza. After the event we will go to Cafe Aufsturz for some beer and something to eat.





Talks scheduled so far:

Chris Male (JTeam/ Amsterdam): Spatial Search with Solr

Abstract: The rise in popularity of Google Maps and of mobile devices with GPS has resulted in a new trend in the search field. People are no longer content with finding results that match a text query; they also want results that are near a location. So-called spatial search differs considerably from traditional free-text search in that it cannot be achieved through common techniques such as inverted indexes. Instead, new algorithms and data structures had to be developed to make spatial search efficient and accurate, and to allow it to play a role in determining a result's relevance. This technology has primarily been found in proprietary closed-source search applications; however, in the last 12-18 months considerable effort has been invested into bringing open source spatial search support to Apache Solr and Lucene. While much is still left to be done, this talk will introduce how spatial search is currently supported in Solr, what work is happening right now, and a roadmap for future developments.
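
To give a flavour of what this looks like from the client side - this is not from the talk, just a minimal SolrJ sketch using the geofilt point-radius filter that later shipped in Solr; the Solr URL and the location field name are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SpatialQueryExample {
        public static void main(String[] args) throws Exception {
            // URL of a local Solr instance - an assumption for this sketch.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Free-text query, narrowed by a point-radius filter: only return
            // documents whose location field ("store", a made-up field) lies
            // within 5 km of the given latitude/longitude.
            SolrQuery query = new SolrQuery("pizza");
            query.addFilterQuery("{!geofilt sfield=store pt=52.53,13.40 d=5}");

            QueryResponse response = solr.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
        }
    }

The interesting bit is that the free-text match and the spatial restriction are combined in one query - which is exactly where letting distance influence relevance gets tricky.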


Dragan Milosevic (zanox/ Berlin): Product Search and Reporting powered by Hadoop

Abstract:

To efficiently process and index 80 million products, as well as store and analyse 30 million clicks and 500 million views daily, Zanox AG is using Hadoop HDFS and MapReduce technologies. This talk will present the product-processing and reporting frameworks running on a 17-node Hadoop cluster, which are able to (1) robustly store products and tracking data in a distributed manner, (2) rapidly consolidate, normalise and categorise products, (3) merge and aggregate tracking data and (4) efficiently build indexes supporting distributed search and reporting across several search clusters.
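
As a rough illustration of the kind of aggregation step described above - this is not zanox's actual code, just a minimal Hadoop 0.20 Map/Reduce sketch that sums clicks per product, assuming tab-separated tracking logs with the product id in the first column:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emit (productId, 1) for every tracking record.
    public class ClickCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text productId = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: productId <tab> timestamp <tab> ...
            String[] fields = line.toString().split("\t");
            productId.set(fields[0]);
            context.write(productId, ONE);
        }
    }

    // Reducer: sum up all clicks for one product.
    class ClickCountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text productId, Iterable<LongWritable> counts,
                Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();
            }
            context.write(productId, new LongWritable(sum));
        }
    }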

Bob Schulze (eCircle/ Munich): Database and Table Design Tips with HBase

Abstract: Recurring design patterns for the BigTable/HBase storage model.
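
As a teaser for the kind of pattern such a talk typically covers, here is a sketch of one classic BigTable/HBase idiom: packing the query dimensions into the row key so that scans over one user's newest events are contiguous. This is not from the talk - table, column family and record layout are made up, written against the HBase 0.20 client API:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowKeyDesignExample {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "events");

            // Row key = userId + reversed timestamp: all rows for one user
            // stay adjacent, and the newest event sorts first. The reversed
            // stamp is always 19 digits here, so string order matches
            // numeric order.
            String userId = "user42";
            long reversedStamp = Long.MAX_VALUE - System.currentTimeMillis();
            Put put = new Put(Bytes.toBytes(userId + ":" + reversedStamp));

            // All event attributes live in one column family.
            put.add(Bytes.toBytes("data"), Bytes.toBytes("type"),
                    Bytes.toBytes("click"));
            table.put(put); // auto-flush is on by default
        }
    }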

A big Thanks goes to the newthinking store for providing a room in the center of Berlin for us. Another big thanks goes to Nokia Gate 5 for sponsoring videos of the talks. Links to the videos will be posted here.

Please do indicate on the Upcoming event if you are planning to attend, to make planning (and booking tables at Aufsturz) easier. Registration through Xing is possible as well.

Looking forward to seeing you in Berlin,
Isabel

Apache Dinner January 2010

2010-01-18 22:48
This evening in X-Berg several local committers met for the second "Apache Dinner" - an informal gathering of local Apache committers, friends and associates for food, beer and interesting discussions. The next one will probably be scheduled some time in February. Feel free to send a message to Torsten Curdt to be included on the next invitation mail. Thanks for organizing a nice evening, Torsten. Hope to see even more Apache friends at the next dinner ;)

How much of Scrum is implemented?

2010-01-06 22:36
I have started using Scrum for various purposes: It has inspired the way software is developed at my current employer. I use it to organize a students' project at university. In addition we are using it at home to get all our personal tasks (preparing breakfast, doing the laundry, meeting with friends...) lined up for each week.

Constantly looking for ways to evaluate, refine and improve the way we work, I am also looking for ideas on how to judge which aspects of a Scrum implementation can actually be improved. One pretty common way to do this evaluation is the so-called "Nokia Test": a set of questions on project management that makes it possible to judge your implementation of Scrum. As an example, let's have a closer look at our "Scrum Housework" implementation.


Question 1 - Iterations

  • No iterations - 0

  • Iterations > 6 weeks - 1

  • Variable length < 6 weeks - 2

  • Fixed iteration length 6 weeks - 3

  • Fixed iteration length 5 weeks - 4

  • Fixed iteration 4 weeks or less - 10




Currently we are doing one-week iterations - planning ahead for longer just seems impossible, except for events like going to conferences or regular birthdays. So that would be 10 points for iterations.


Question 2 - Testing within the Sprint

  • No dedicated QA - 0

  • Unit tested - 1

  • Feature tested - 5

  • Features tested as soon as completed - 7

  • Software passes acceptance testing - 8

  • Software is deployed - 10




Hmm. Admittedly there is no real testing in place, except for smoke testing for stuff like emptying the dishwasher. So that makes 0 points here.

Question 3 - Agile Specification

  • No requirements - 0
  • Big requirements documents - 1
  • Poor user stories - 4
  • Good requirements - 5
  • Good user stories - 7
  • Just enough, just in time specifications - 8
  • Good user stories tied to specifications as needed - 10


We do not have big documents describing how to set up the Christmas tree. But at the beginning of each sprint there is a set of user stories, with acceptance criteria specified where needed. So something like "Tidy up computer desk" would be augmented by the information: "To the extent that there are no items except for the laptop on the desk afterwards and the desk was dusted". That probably makes a 10.


Question 4 - Product Owner
  • No Product Owner - 0
  • Product Owner who doesn’t understand Scrum - 1
  • Product Owner who disrupts team - 2
  • Product Owner not involved with team - 2
  • Product owner with clear product backlog estimated by team before Sprint Planning meeting (READY) - 5
  • Product owner with release roadmap with dates based on team velocity - 8
  • Product owner who motivates team - 10



We are both familiar with Scrum. However, due to the nature of the tasks and the lack of people in the loop, we exchange the role of the product owner regularly. We are still missing a real product backlog - currently it is loosely defined as a pile of post-it notes with complexity estimates put beside each item. So I would give us some 3 points on that one.


Question 5 - Product Backlog

  • No Product Backlog - 0
  • Multiple Product Backlogs - 1
  • Single Product Backlog - 3
  • Product Backlog clearly specified and prioritized by ROI before Sprint Planning (READY) - 5
  • Product Owner has release burndown with release date based on velocity - 7
  • Product Owner can measure ROI based on real revenue, cost per story point, or other metrics - 10



We only have one product backlog, though it is very informal. So that would make 2 points.






Question 6 - Estimates

  • Product Backlog not estimated - 0
  • Estimates not produced by team - 1
  • Estimates not produced by planning poker - 5
  • Estimates produced by planning poker by team - 8
  • Estimate error < 10% - 10



Naturally those doing the tasks are those producing the estimates. Thanks to Agile42 in Berlin we now even have a set of planning poker cards. Yeah! That makes 8 points. Just as an example: getting returnable bottles back to the shop makes for 8 complexity points, going to the cinema is 16 - just like storing the Christmas decorations back in their boxes - and preparing breakfast is just about 3 points ;)


Question 7 - Sprint Burndown Chart
  • No burndown chart - 0
  • Burndown chart not updated by team - 1
  • Burndown chart in hours/days not accounting for work in progress (partial tasks burn down) - 2
  • Burndown chart only burns down when task is done (TrackDone pattern) - 4
  • Burndown only burns down when story is done - 5
  • Add 3 points if team knows velocity
  • Add two points if the Product Owner's release plan is based on known velocity



We do have a whiteboard with post-it notes on it that are checked out and moved to done as soon as they are finished - there is no arguing about the laundry being done before it is washed, dried, ironed and back in the closet ;) So that would make for 5 points. In addition we know our velocity, which would make another 3 points.





Naturally we are pretty capable of telling what can reasonably be expected to be done within the coming sprints. That might add another 2 points for the release plan being based on that velocity.


Question 8 - Team Disruption

  • Manager or Project Leader disrupts team - 0
  • Product Owner disrupts team - 1
  • Managers, Project Leaders or Team leaders telling people what to do - 3
  • Have Project Leader and Scrum roles - 5
  • No one disrupting team, only Scrum roles - 10



There are events and people interrupting running sprints: Say, NoSQL meetups that are planned spontaneously or new articles that get written and printed within less than a week. But usually these events are rather seldom and are kept to a minimum due to the short sprint length. So that might make for 3 points.


Question 9 - Team

  • Tasks assigned to individuals during Sprint Planning - 0
  • Team members do not have any overlap in their area of expertise - 0
  • No emergent leadership - one or more team members designated as a directive authority - 1
  • Team does not have the necessary competency - 2
  • Team commits collectively to Sprint goal and backlog - 7
  • Team members collectively fight impediments during the sprint - 9
  • Team is in hyperproductive state - 10



Currently we are in a state where we have identified impediments and started fighting them - declining tasks that cannot reasonably be done within the given timeframe, getting a real product backlog up, tracking even minor tasks like writing e-mails to organize the Apache Hadoop Get Together. So that makes for 9 points.

In total that makes for 55 points: 10 + 0 + 10 + 3 + 2 + 8 + 10 + 3 + 9 (it is 23:34, I am a little tired but cannot sleep due to caffeine - so do double-check my math). How does your team score on the Nokia test?

Third "December Hadoop Get Together" video online

2010-01-05 19:29
In the following video, taken at the last Hadoop Get Together in Berlin, Jörg Möllenkamp explains why Hadoop is interesting for Sun - and why Sun hardware might be a good fit for Hadoop applications:

Hadoop Jörg Möllenkamp from Isabel Drost on Vimeo.



In a blog post published after the event, Jörg gives more details on the idea of Parasitic Hadoop that he introduced at the meetup.

Second December Hadoop Get Together video

2010-01-03 14:57
Richard Hutton from nugg.ad explained how they scaled their ad recommendation system to an increasing number of users with the help of Hadoop. To learn more about their use case and the problems they solved with Hadoop, watch the video below:

Hadoop Richard Hutton from Isabel Drost on Vimeo.

With a little help from my friends

2009-12-31 23:55
The end of the year 2009 is quickly approaching. To me it feels a little like it ran away far too quickly. So instead of taking part in the annual review of past events, I would like to use it as an opportunity to say thank you: The past twelve months were a lot of fun, with lots of interesting, nice people from all over the world. I got the chance to meet quite a bit of the Mahout community, and I got lots and lots of new developers from all over Germany - or, more precisely, the EU - to attend the Apache Hadoop Get Together in Berlin. The interest in Mahout has grown tremendously over the past year.

All of this would not have been possible without the help of many people: First of all I'd like to thank Thilo Fromm - for making me happy whenever I was disappointed, for solacing me when I was sad, for patiently listening to me nervously whining before each and every talk, for kindly reviewing my slides and, last but not least, for helping me fix some of the problems that bugged me. Oh - and thanks for helping me fix, within minutes, the issue in the ZooKeeper C client that had puzzled me for days.

Another big Thanks goes to my family, first and foremost my mum, who kindly took care of organizing quite a bit of my paperwork and kept me on schedule with so many "unimportant" tasks like getting an appointment at some hospital to finally get the screws taken out of my knee ;)

A special thanks goes to the growing Mahout community as well as to the Lucene people - you know, who you are - keep up the great work: You rock!

Furthermore there are the students at TU Berlin who have shown that with Mahout it is "dead-simple" to write an application that, given a stream of documents, groups them by topic and makes the result searchable in Solr. Thanks to you for solving the minor and major problems, for communicating with the community and for transparently communicating problems. Looking forward to continuing to work with you next year.

Finally a big thank you to all of the speakers, sponsors and attendees of the Apache Hadoop Get Together, the NoSQL conference and the Apache Dinner Berlin - without you these events would never have been possible. Looking forward to seeing you again in January/March 2010!

I hope I didn't forget too many people - just in case: I am pretty grateful for all the input, help and feedback I got this year.

PS: Another thanks to the spaceboyz visiting Berlin for 26C3 for helping Thilo tidy up our apartment after Congress was over this year ;)

First December Apache Hadoop Berlin video online

2009-12-31 20:27
The video of Nikolaus Pohle's talk at the December Apache Hadoop Get Together Berlin is already online - more to come soon.

hadoop nikolaus pohle from Isabel Drost on Vimeo.



Thanks to Martin from newthinking for videotaping and uploading. Thanks to StudiVZ for sponsoring the video.

Winter arrived in Berlin

2009-12-18 20:10
Finally winter seems to have arrived in Berlin as well:





Looks a little like Christmas is drawing closer. The only disadvantage of the weather: one of the brakes on my bike froze after just a few minutes. Luckily for me, my bike has one of those old-fashioned back-pedal brakes ;)

Summary - December Get Together

2009-12-16 22:23
Today the seventh Apache Hadoop Get Together took place in Berlin. The room was again packed with more than 40 people from various companies, with and without practical experience with Hadoop: there were people from Nokia Gate 5, Sun, nurago, StudiVZ, Dawanda, Last.fm and nugg.ad, people from academia, e.g. HPI Potsdam, and a few freelancers interested in the topic or providing help with Hadoop.

We had three very interesting talks. The first one was given by Richard Hutton from nugg.ad on their usage of Hadoop. They provide targeted advertisement services to their clients, and naturally they need to process lots of user interactions to be able to draw reliable conclusions. nugg.ad started out with a traditional system setup: Erlang loggers in front, data fed into well-known data warehouse infrastructures, analysed, and the results pushed back to the frontends. However, this architecture would scale only so far, so in the beginning of 2009 they started migrating their systems over to Hadoop. (A thanks from the speaker to Tom White for publishing the Hadoop book with O'Reilly, which obviously helped the developers a lot.) Today, nugg.ad's analysis time is down from one to two days to one to two hours. I will link the slides of the talk as soon as I have the pdf version available.

The second talk was given by Jörg Möllenkamp on what Sun is doing with Hadoop. Sun does have "special hardware" - special in that they have systems with up to 512 virtual processors on one chip. With Solaris they have an operating system that scales to that architecture. But now they are looking for applications that can use such hardware efficiently as well. Hadoop is well suited for distributing computations, so it looked like a great fit for Sun. Slides are available online.



The last talk was given by Nikolaus Pohle from nurago. They switched to Hadoop only recently. Coming from online market analysis, they have to analyse lots of user interaction data. Currently they are moving away from a MySQL-based architecture to a distributed system based on HDFS and Map/Reduce. In order to make writing M/R jobs easier for their employees, they built their own abstract language on top of Hadoop that helps formulate recurring jobs. That does sound a lot like what Pig or Cascading already do - but it is specially targeted at the type of jobs they have; see the sketch below for the general flavour. Slides are available online. There is also a pdf version for users who prefer open formats.
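
To give a rough idea of what such an abstraction layer buys you, here is a minimal word count in Cascading 1.x - not nurago's language, just a sketch of the same flavour of job definition on top of Map/Reduce; the input and output paths are placeholders:

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCountFlow {
        public static void main(String[] args) {
            // Read lines from HDFS, write word counts back out.
            Tap source = new Hfs(new TextLine(new Fields("line")), "input/logs");
            Tap sink = new Hfs(new TextLine(), "output/wordcounts");

            // Split each line into words, group by word, count per group -
            // Cascading turns this into the necessary Map/Reduce jobs.
            Pipe pipe = new Pipe("wordcount");
            pipe = new Each(pipe, new Fields("line"),
                    new RegexSplitGenerator(new Fields("word"), "\\s+"));
            pipe = new GroupBy(pipe, new Fields("word"));
            pipe = new Every(pipe, new Count());

            Flow flow = new FlowConnector().connect(source, sink, pipe);
            flow.complete();
        }
    }

The appeal of a domain-specific layer like nurago's is the same as here: recurring jobs are described declaratively instead of as hand-written mapper and reducer classes.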

In case anyone is interested, I have also put my introductory slides online.

The next meetup will be in March 2010. It will feature a talk by Zanox on their Hadoop usage, one talk by eCircle from Munich, as well as one talk by Nokia. You are very welcome to join us. If you would like to give a presentation yourself, please do contact me. If you would like to sponsor the event, please send me an e-mail.

A big Thank You to all the speakers - Nikolaus Pohle from nurago, Jörg Möllenkamp from Sun and Richard Hutton from nugg.ad - without you, the event would not be possible. Another big Thank You to newthinking for providing the venue for free. And, last but not least, another big Thank You to StudiVZ for sponsoring the videos. They will be linked to from here as well as from the StudiVZ blog as soon as they are available.

On Wednesday: December Apache Hadoop @ Berlin

2009-12-14 20:15
This week on Wednesday at 5 p.m. the December Hadoop Get Together takes place in the newthinking store Berlin.

Talks scheduled so far:


  • Richard Hutton (nugg.ad): “Moving from five days to one hour.”
  • Jörg Möllenkamp (Sun): “Hadoop on Sun”
  • Nikolaus Pohle (nurago): “M/R for MR - Online Market Research powered by Apache Hadoop. Enable consultants to analyze online behavior for audience segmentation, advertising effects and usage patterns.”


There will be videos of the talks, linked to by StudiVZ (thanks for sponsoring), after the meetup is over.

As this is the last Meetup before Christmas there will be cookies waiting for you.

If you want to get notifications of future events on Apache Hadoop, NoSQL and Apache Lucene - be it trainings, meetups or conferences - feel free to subscribe to the mailing list or join the Xing group that accompanies the Berlin Get Together.