Open Street Map @ FSFE meetup

2009-06-21 20:59
At the last meeting of the local FSFE group here in Berlin, Sabine Stengel from cartogis gave a presentation on Open Street Map. Instead of focussing on the technical side, she described the legal issues and showed the broad variety of commercial projects that are possible with this type of mapping information.

It was interesting to learn how detailed and high-quality the information provided by volunteers really is. It will be interesting to see how the project keeps traction after "everything is mapped": how it remains attractive to stay involved and to keep the information up to date over a longer period of time.

Scrum Tisch

2009-06-04 11:27
Title: Scrum Tisch
Location: Divino FHain
Description: Philippe will present his talk from the Orlando Scrum Gathering. He will speak about backlog and time-box, about value versus cost, about visible versus invisible features (in particular software architecture), about defects and technical debt, and more generally about release planning and sprint planning for non-trivial, long-lived software development projects.
Start Time: 18:00
Date: 2009-06-16

Ken Schwaber in Berlin XBerg

2009-05-24 18:56
Last week I attended a discussion meetup with Ken Schwaber in Berlin/Kreuzberg. The event was scheduled on pretty short notice - still, quite a few developers and project managers from various companies in Berlin showed up.

Ken started with a brief summary of the history of Scrum: Before there was such a thing as an IT industry, programming actually was a lot of fun. But the creative job pretty quickly turned into something people tend to suffer from, as principles from manufacturing industries were applied to software "production". Suddenly there was a distinction between testers, programmers, architects... People tried to plan ahead for months or even years, noticing only very late in the process that the outcome was by no means what was needed when the product was finally ready.

In contrast to waterfall, Scrum comes with very short feedback loops. Developers work with a very strong focus on one task at a time. Change is not hated but embraced and built into development.

Some features of Scrum that are often forgotten but nevertheless essential were discussed that evening:

  • Scrum is all about transparency - it's about telling your customers what is going on. It is about giving your customer honest estimates. It is about telling development, to the best of your knowledge, everything that makes up a feature.
  • Scrum is neither easy nor a solution in itself. It is simply a way of quickly uncovering problems that are easier to hide in waterfall processes. You have one person who is an island of knowledge in your company? At every sprint planning this problem will become obvious until you find a way to solve it.
  • Scrum is about giving developers a box of time that is not to be interrupted. Developing software asks for a lot of concentration. Getting interrupted and resuming work on the task is so expensive that close to nothing can justify it.
  • A nice way of doing Scrum is to use Scrum for management and XP for development. Scrum does not provide any solutions on how to reach the goals set - it does not tell you exactly how to arrive at a stable release by the end of your sprint. It just sets the goal for you. XP, on the other hand, holds quite a few development best practices that can help achieve these goals.
  • It takes time to change how customers and developers work: Years of experience have trained them to think in certain ways. So at the beginning Scrum is all about teaching and training people. It takes time to learn a new way of getting things done.

There are ways to do fixed-price contracts with Scrum. You just have a few more freedoms to offer to your customer:

  • Tell your customer that your clients usually change their minds along the way. Give them the freedom to change anything not yet implemented. An item can be exchanged for an item of equal cost with no increase in price. An item can be exchanged for a cheaper item with a decrease in price, or for a more expensive item with a rise in price.
  • Tell your customer that you already have pre-prioritized items. The client is free to re-prioritize items as he wishes - as long as the item has not been implemented already.
  • Tell your customer that, as you implement the high-priority items first, you may reach a point where the items not yet done are no longer important for the release, so he could stop early and pay less.
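
The exchange rule from the first point boils down to simple arithmetic. Here is a minimal sketch of it, assuming a backlog kept as a plain dictionary. The function name `swap_item` and the backlog layout are made up for illustration and were not part of the discussion.

```python
def swap_item(price, backlog, old_name, new_name, new_cost):
    """Exchange a not-yet-implemented backlog item and adjust the contract price.

    backlog maps item names to {"cost": number, "done": bool}.
    Returns the new price and a new backlog; the inputs are left untouched.
    """
    old = backlog[old_name]
    if old["done"]:
        # Only items that are not yet implemented may be exchanged.
        raise ValueError(f"{old_name} is already implemented")
    updated = {name: item for name, item in backlog.items() if name != old_name}
    updated[new_name] = {"cost": new_cost, "done": False}
    # Equal cost: no change. Cheaper item: price drops. More expensive: price rises.
    return price + (new_cost - old["cost"]), updated
```

Swapping an item of cost 8 for one of cost 5 lowers the price by 3, while swapping it for one of cost 10 raises it by 2.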

In summary, the evening was very interesting and insightful for me. It helps to talk about Scrum implementation problems and to learn which problems others have and how they tackle them.

Back from Zürich

2009-05-05 16:58
I spent the last five days in Zurich. I wanted to visit the city again - and still owed one of my friends there a visit. I am really happy the weather was quite nice over the weekend. That way I could spend quite some time in town (got another one of those puzzles) and go for a hike on the Ütli mountain: I took the steep way up that had quite a lot of stairs. Interestingly though, despite being quite tired when I finally arrived at the top, my legs were not sore the next day. Seems going to work and back again by bike does indeed help a bit, even if we have no hills in Berlin.

Yesterday I was allowed to present the Apache project Mahout in a Google tech talk. Usually I am talking to people well familiar with the various Apache projects. During my talk I asked who was familiar with Lucene and with Hadoop. To me it was pretty unusual that very few engineers were aware of these. It almost seemed like it is unusual to have a look at what is going on outside the company. Or was it just the selection of people who were interested in my talk?

I tried to cover most of the basics and put Mahout into the context of the Lucene umbrella project. I tried to show some of the applications that can be built with Mahout and detailed some of the things that are on our agenda.

Some of the questions I received were on the scalability of Hadoop and on the general distribution of people being paid to work on Free Software projects vs. those working on them in their free time. Another question was whether the project is targeted at text-only applications (which of course it is not, as feature extraction so far has been left to the user). Last but not least, the relation to UIMA was brought up by a former IBM UIMA engineer.

To summarize: For me it was a pretty interesting experience to give this tech talk. I hope it helped me do away with some of my "Apache bias". It is always valuable to look into what is going on outside one's community.

DIMA @ TU Berlin

2009-05-03 07:26
On Friday, the 24th of April, Prof. Volker Markl organised a Welcome Workshop at TU Berlin. The day started with an introduction by the dean of the faculty. The first talk was given by Rudolf Bayer on the topic "From B-Trees to UB-Trees". The second presentation was by Guy Lohman on "LEO, DB2's Learning Optimizer".

After the coffee break, Volker Markl gave an introduction to his selected research field, the outstanding tasks and the way he is going to accomplish his goals. Scalability seems to play a major role in his tasks. Interestingly, Hadoop was chosen as the infrastructure basis.

In his talk Volker Markl announced the newly started BBI Colloquium. It is a regular meeting in Berlin dedicated to scientific discourse on topics relevant to the participating researchers. Participating researchers are Prof. Oliver Günther, Prof. Johann-Christoph Freytag, Prof. Ulf Leser from HU Berlin, Prof. Dr. Volker Markl from TU Berlin, Prof. Dr. Heinz Schweppe from FU Berlin and Prof. Dr. Felix Naumann from HPI Potsdam.

Scrum Table with Thoralf Klatt

2009-04-29 09:19
On Wednesday, the 22nd of April, about 20 people interested in Scrum gathered in the DiVino in Friedrichshain/Berlin. The event was split in two parts: In the first half we gathered topics participants were interested in, put priorities next to them and discussed the most highly ranked topic: "Scrum in large teams, splitting large tasks across teams."

The basic take-home messages of the discussion:
  • One way to cleanly split a task across teams is to first do a design sprint together, fix the API and then split up. The problem with that: integration and validation of what you designed up front only in theory.

  • Another way is to continuously integrate all parts; that way you get direct feedback. This might be impractical without some sort of fixed API though.

  • Do keep in mind that increasing the team exponentially increases management overhead.

  • Do track the progress and performance with well known values (delivered value per sprint, velocity, define KPIs etc.)

The second part of the meetup was covered by the talk of Thoralf from Nokia Siemens Networks on how they do Scrum across countries and continents. The main interesting points for me:
  • Face to face communication is necessary - good video equipment can help with that.
  • Integrating ready-made products into new solutions creates new challenges to solve.
  • Transparency and communication with developers become a challenge.

More information on the event can be found on the blog of the round table.

Feedback from the Hadoop User Group UK

2009-04-29 08:54
A few weeks after the Hadoop User Group UK, quite a few postings on the event are online. I will try to keep this page updated if there are any further reviews. The ones I found so far:

  • the wrap-up of the event itself.
  • a short summary by the organiser - thanks again for a great event.
  • a short summary on the Cloudera blog.
  • a quick overview with a Mahout focus by Adam Rae.

June 2009 Apache Hadoop Get Together @ Berlin

2009-04-23 19:30
Title: Apache Hadoop Get Together @ Berlin
Location: newthinking store Berlin Mitte
Description: I just announced the fifth Apache Hadoop Get Together in Berlin at the newthinking store. Torsten Curdt offered to give a talk on data serialization with Thrift and Protocol Buffers.

If you have a topic you would like to talk about: Feel free to just bring your slides - there will be a projector and lots of people interested in scalable information retrieval.
Start Time: 17:00
Date: 2009-06-25

Hadoop User Group UK

2009-04-21 20:34
On Tuesday the 14th the second Hadoop User Group UK took place in London. This time venue and pizza were sponsored by Sun. The room quickly filled with approximately 70 people.

Tom opened the session with a talk on 10 practical tips on how to get the most benefit from Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use. There is a choice between structured data processing languages (Pig or Hive), dynamic languages (Streaming or Dumbo), or using Java, which is closest to the system.

Tom's second hint dealt with the size of files to process with Hadoop: Both very large unsplittable files and very small files are bad for performance. In the first case, the workload cannot easily be distributed across the nodes; in the latter case, each unit of work is too small to make up for the startup and coordination overhead. There are ways to remedy these problems with sequence files and map files though. Another performance optimization would be to chain individual jobs - Pig and Hive do a pretty decent job of automatically generating such jobs. ChainMapper and ChainReducer can help with creating chained jobs.
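
To make the small-files remedy a bit more concrete, here is a rough sketch in plain Python of the idea behind packing many small files into larger containers. It does not use any Hadoop API. The function name `pack_files` and the greedy strategy are my own assumptions for illustration, not something shown in the talk.

```python
def pack_files(file_sizes, target_bytes):
    """Greedily group (name, size) pairs into batches of roughly target_bytes.

    A batch is closed as soon as adding the next file would exceed the target.
    A single file larger than the target ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be written into one container file, so that one unit of work covers many small inputs and the startup cost is paid once per batch instead of once per file.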

Another important task when implementing map reduce jobs is to tell Hadoop about the progress of your job. For one, this is important for long-running jobs so that they remain alive and are not killed by the framework due to timeouts. Second, it is convenient for the user, who can view the progress in the web UI of Hadoop.
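
For jobs written with Hadoop Streaming, status and counter updates are reported by writing specially formatted `reporter:` lines to stderr. Below is a minimal sketch of a word-count style mapper that does this. The function name `run_mapper`, the counter group `Sketch` and the reporting interval are assumptions for illustration. Whether such updates are enough to keep a given job from timing out depends on the cluster configuration, so treat this as a starting point only.

```python
import sys


def run_mapper(lines, out=sys.stdout, err=sys.stderr, report_every=10000):
    """Emit word<TAB>1 pairs and report status to Hadoop Streaming via stderr."""
    count = 0
    for count, line in enumerate(lines, start=1):
        for word in line.split():
            out.write(f"{word}\t1\n")
        if count % report_every == 0:
            # Hadoop Streaming picks up these lines from stderr.
            err.write(f"reporter:status:processed {count} lines\n")
            err.write(f"reporter:counter:Sketch,LinesProcessed,{report_every}\n")
    return count
```

In an actual streaming script the function would be called as `run_mapper(sys.stdin)`.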

Usual suspects for tuning a job: number of mappers and reducers, usage of combiners, compression, customised data serialisation, shuffling tweaks. Of course there is always the option to let someone else do the tuning: Cloudera does provide support as well as pre-built packages, init scripts and the like ;)

In the second talk I did a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning algorithm implementations in their daily work. Judging from the discussion after the talk and from questions I received after it the interest in the project seems pretty high. The slide I liked the most: The announcement of our first 0.1 release. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break Craig gave an introduction to Terrier, an extensible information retrieval platform developed at the University of Glasgow. He mentioned a few other open IR platforms, namely Tuple Flow, Zettair, Lemur/Indri, Xapian, as well as of course Nutch/Solr/Lucene.

What does Terrier have to do with the HugUK? Well, index creation in Terrier is now based on an implementation that makes use of Hadoop for parallelization. Craig did some very interesting analysis on the scalability of the solution: The team was able to achieve scaling near linear in the number of nodes added (at least as long as more than one reducer is used ;) ).

After the pizza Paolo described his experiences implementing the vanilla PageRank computation with Hadoop. One of his test datasets was the Citeseer citation graph. Interestingly enough, some of the nodes in this graph have self-references (maybe due to extraction problems), there are duplicate citations, and the data comes in an invalid XML format.
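
As a rough illustration of what such a computation involves, here is a small in-memory sketch in plain Python. It is not Paolo's implementation and does not use Hadoop. The names `clean_edges` and `pagerank`, the damping factor of 0.85 and the handling of nodes without outgoing links are my own assumptions. The cleaning step drops the self-references and duplicate citations mentioned above, and each iteration mirrors a map step (distribute rank along out-links) followed by a reduce step (sum contributions per node).

```python
from collections import defaultdict


def clean_edges(edges):
    """Drop self-references and duplicate citations from (source, target) pairs."""
    return sorted({(src, dst) for src, dst in edges if src != dst})


def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank on a small citation graph held in memory."""
    edges = clean_edges(edges)
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # "Map": every node sends its rank in equal shares along its out-links.
        contributions = defaultdict(float)
        dangling = 0.0
        for node in nodes:
            targets = out_links.get(node)
            if targets:
                share = rank[node] / len(targets)
                for target in targets:
                    contributions[target] += share
            else:
                # Rank of nodes without out-links is spread evenly over all nodes.
                dangling += rank[node]
        # "Reduce": sum the contributions per node and apply the damping factor.
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * (contributions[node] + dangling / len(nodes))
            for node in nodes
        }
    return rank
```

Note that a node which only ever appears in a self-reference disappears from the graph after cleaning. Whether that is acceptable depends on the dataset.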

The last talk was on HBase by Michael Stack. I am really happy I attended HugUK as I missed that talk at ApacheCon in Amsterdam. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: relations, joins, and of course JDBC being among the limitations. On the pro side, HBase offers a multi-node solution that has scale-out and replication built in.

HBase can be used as a source as well as a sink for map reduce jobs and thus integrates nicely with the Apache Hadoop stack. The framework provides a simple shell for administrative tasks (surgery on sick clusters, forced flushes, non-SQL get, scan and put methods). In addition, the master comes with a UI to monitor the cluster state.

Your typical DBA work differs with HBase though: Data locality and physical layout do matter and can be configured. Michael's recommendation was to start out testing with the XL instance on EC2 and scale down if you find out that it is too large.

The talk finished with an outlook on the features in the upcoming release, the issues on the todo list and an overview of companies already using HBase.

After the talks were finished, quite a few attendees went over to a pub close by: drinking beer, discussing new directions and sharing war stories.

I would like to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his laptop for the Apache Mahout presentation: the hard disk of mine broke exactly one day before.

Last but not least thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

GSoC: Student applications.

2009-04-05 09:57
Title: GSoC: Student applications closed
Description: After this date no more student applications are accepted. Internal ranking at Apache starts 7 days earlier. The ranking process finishes on the 16th of April.
Date: 2009-04-03