Apache Hadoop Get Together Berlin

2009-09-29 16:38
The Get Together started just a few minutes ago. The room is packed with more than 35 people this time. This is the first Hadoop Get Together in Berlin to be recorded on video. Thanks to Martin from newthinking for doing the recording and post-processing, and to Cloudera for sponsoring the videos.

The first talk was given by Thorsten Schuett on solving puzzles with map reduce. His disclaimer: Working at ZIB Berlin, he had a large cluster in the basement to put to good use. However, the cluster does not run Hadoop. It is based on Lustre FS and does not rely on commodity hardware. So he implemented a solver for 4x4 sliding puzzles in a map reduce framework targeted at "his" cluster.
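To make the idea a bit more concrete, here is a minimal sketch of how a sliding puzzle search can be phrased as repeated map and reduce rounds, one round per breadth-first level. This is not the code from the talk: the function names (`neighbors`, `map_phase`, `reduce_phase`, `solve`) are made up for illustration, everything runs in a single Python process, and the set of visited boards is kept in memory, which would not work for the full 4x4 state space.

```python
from collections import defaultdict


def neighbors(state, width):
    """Yield every board reachable by sliding one tile into the blank (0)."""
    blank = state.index(0)
    row, col = divmod(blank, width)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < width and 0 <= c < width:
            target = r * width + c
            board = list(state)
            board[blank], board[target] = board[target], board[blank]
            yield tuple(board)


def map_phase(frontier, width):
    """Map: emit (successor, depth + 1) for each board in the frontier."""
    for state, depth in frontier:
        for successor in neighbors(state, width):
            yield successor, depth + 1


def reduce_phase(pairs, seen):
    """Reduce: group by board, keep the smallest depth, drop known boards."""
    grouped = defaultdict(list)
    for state, depth in pairs:
        grouped[state].append(depth)
    return [(s, min(d)) for s, d in grouped.items() if s not in seen]


def solve(start, goal, width):
    """Run one map/reduce round per search level; return the move count."""
    seen = {start}
    frontier = [(start, 0)]
    while frontier:
        for state, depth in frontier:
            if state == goal:
                return depth
        frontier = reduce_phase(map_phase(frontier, width), seen)
        seen.update(state for state, _ in frontier)
    return None  # the goal cannot be reached from this start board
```

Boards are tuples read row by row, with 0 standing for the blank. On a real cluster the grouping in `reduce_phase` would be done by the framework's shuffle step, and the visited set would have to live on the distributed file system rather than in a Python set.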

The second talk was by Thilo Goetz on JAQL, a language for querying JSON documents that can run queries on top of a Hadoop cluster.

In the third and last talk, Uwe Schindler gave an overview of the new features and performance improvements of last week's Lucene 2.9 release.

After raffling the Hadoop books donated by O'Reilly, we will move to a nearby bar for some beer and further discussion. A summary with more details and links to the slides will be online soon.

Update: I had reserved a table at Cafe Aufsturz close to newthinking store for about 15 people - maybe less, maybe more. We ended up going there with more than 25 people - really glad there were still enough tables left for us :)

Update 2: The next meetup is on December 16th. I already have one definite and two tentative proposals for talks.

Upcoming: Apache Hadoop Get Together Berlin

2009-09-23 19:00
This is a friendly reminder that the next Apache Hadoop Get Together takes place next week on Tuesday, 29th of September* at newthinking store (Tucholskystr. 48, Berlin).


  • Thorsten Schuett, Solving Puzzles with MapReduce.
  • Thilo Götz, Text analytics on jaql.
  • Uwe Schindler, Lucene 2.9 Developments.

Big thanks go to newthinking store for providing the venue for free and to Cloudera for sponsoring videos of the talks. Links to the videos will be posted on , on the upcoming page linked above, as well as on the Cloudera Blog soon after the event. Further thanks go to O'Reilly for providing three "Hadoop: The Definitive Guide" books to be raffled at the event.

The 7th Get Together is scheduled for December 16th. If you would like to submit a talk or sponsor the event, please contact me.


Hope to see you in Berlin next week.


* The event is scheduled right before the UIMA workshop in Potsdam, which may be of interest to you if you are a UIMA user.

September Apache Hadoop Get Together @ Berlin

2009-08-23 20:48
The next Apache Hadoop Get Together Berlin will take place on September 29th at the newthinking store. Details are up on the event page at upcoming and will be sent out to the mailing list soon.

September 2009 Hadoop Get Together Berlin

2009-08-17 09:11
The newthinking store Berlin is hosting the Hadoop Get Together user group meeting. It features talks on Hadoop, Lucene, Solr, UIMA, katta, Mahout and various other projects that deal with making large amounts of data accessible and processable. The event brings together leaders from the developer and user communities. The speakers present projects that build on top of Hadoop as well as case studies of applications being built and deployed on Hadoop. After the talks there is plenty of time for discussion, some beer and food.

There is also a related Xing Group on the topic of building scalable information retrieval systems. Feel free to join and meet other developers dealing with the topic of building scalable solutions.


Agenda:

Please see upcoming page for updates.


  • Thilo Götz: JAQL
  • Uwe Schindler: Lucene 2.9
  • nugg.ad: Ad Recommendation with Hadoop
  • T. Schuett: Solving puzzles with Hadoop.


If you would like to give a presentation yourself: There are additional slots of 20 minutes each available. A projector is provided. Just bring your slides. To include your topic on this web site as well as in the upcoming.org entry, please send your proposal to Isabel.

After the talks there will be time for an open discussion. We will go to a nearby restaurant after the event, so there will be plenty of time for talking, discussing and coming up with new ideas.

Location

The Apache Hadoop Get Together takes place at the newthinking store Berlin:



newthinking store GmbH

Tucholskystr. 48

10117 Berlin




Accommodation

  • Homeli - not exactly in walking distance, but only a few S-Bahn stations away. Very nice Bed and Breakfast hotel. (The offer is only valid if you stay for at least three nights.)

  • Circus Berlin is a combination of hostel and hotel close by.

  • Zimmer in Berlin is yet another Bed and Breakfast hotel.

  • House boat near Friedrichshain



Announcements

If you would like to be notified of news, please subscribe to our mailing list. The meetings are usually also announced on the project mailing lists as well as on the newthinking store website.


Contact

In case you have any trouble reaching the location or finding accommodation, feel free to contact the organiser, Isabel.

Past events

Lucene slides online

2009-06-30 10:04
The slides of the Lucene talk at the last Apache Hadoop Get Together Berlin are available online: Lucene Slides. Especially interesting to me are the last few slides which detail both index size and machine setup:

The installation is running on two standard PCs with 2 dual-core processors (usual speed, bought in January 2008 for about 4000 Euro). They have 32 GB RAM, 24 GB of which are used as a ramdisk for the index. Without the ramdisk, initial queries, especially those accessing fields, are slower but still acceptable. The index contains about 19 million documents, that is 80 GB of indexed text + billions of annotated tags.

Data serialization

2009-06-26 08:39
XML, JSON and others are currently standard data exchange formats. Their main benefit is that they are human-readable but still structured enough to be easily parsed by programs. Their problems are overhead in size and parsing time. In addition, at least XML is not really as human-readable as it could be.

Binary formats are an alternative. Yet those are often not platform independent (they come with either C++ or Java or Python bindings) or are not upgradable (what if your boss comes along and wants you to add yet another field? Do you need to process all your data again?).

There are a few libraries that promise to solve at least some of these problems. Usually you specify your data format in an IDL, generate (byte) code from it and use mechanisms provided by the libraries to upgrade your format.
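To illustrate why such formats can be upgraded without reprocessing old data, here is a minimal sketch of a tagged binary encoding. It is a toy and not the wire format of Thrift, Protocol Buffers or any other library listed below: the schema dictionaries, the field names and the `encode`/`decode` helpers are all made up for this example, and every value is stored as a UTF-8 string to keep it short. The point it shows is that each field carries a numeric tag and a length, so a reader can skip fields it does not know.

```python
import struct

# Hypothetical schemas mapping field names to numeric tags.
# Version 2 adds a field; existing tags keep their numbers.
SCHEMA_V1 = {"user": 1, "title": 2}
SCHEMA_V2 = {"user": 1, "title": 2, "duration": 3}


def encode(record, schema):
    """Serialize the fields of `record` as (tag, length, value) triples."""
    out = bytearray()
    for name, tag in schema.items():
        if name in record:
            payload = str(record[name]).encode("utf-8")
            # 2-byte tag and 4-byte length, big-endian, then the raw bytes.
            out += struct.pack(">HI", tag, len(payload)) + payload
    return bytes(out)


def decode(data, schema):
    """Read the triples back, skipping tags this schema does not know."""
    names = {tag: name for name, tag in schema.items()}
    record, pos = {}, 0
    while pos < len(data):
        tag, length = struct.unpack_from(">HI", data, pos)
        pos += struct.calcsize(">HI")
        if tag in names:
            record[names[tag]] = data[pos:pos + length].decode("utf-8")
        pos += length  # unknown fields are skipped by their length
    return record
```

Data written with `SCHEMA_V1` can be read with `SCHEMA_V2` (the new field is simply absent), and data written with `SCHEMA_V2` can be read by code that only knows `SCHEMA_V1` (the unknown tag is skipped). The real libraries add typed values, more compact encodings and generated classes on top of this basic idea.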

Yesterday at the Berlin Apache Hadoop Get Together, Torsten Curdt gave a short introduction to two of these solutions: Thrift and Protocol Buffers. He explained why Joost decided to use one of those libraries and highlighted why they went with Thrift instead of Protocol Buffers.

This morning I gathered a list of data exchange libraries that are currently available:

  • Thrift ... developed at Facebook, now in the Apache Incubator, active community, bindings for C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
  • ProtoBuf ... developed at Google, mainly by a single developer, bindings for C++, Java and Python.
  • Avro ... started by Doug Cutting, skips code generation.
  • ETCH ... developed at Cisco, now in the Apache Incubator, bindings for Java, C#, JavaScript.

There are some performance benchmarks online. There is also another recent, extensive comparison of the serialization performance of various frameworks.

June 2009 Apache Hadoop Get Together @ Berlin

2009-06-21 21:33
Just a brief reminder: Next week on Thursday the next Apache Hadoop Get Together is scheduled to take place in Berlin. There are quite a few interesting talks scheduled:

  • Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
  • Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI): "SCAIView - Lucene for Life Science Knowledge Discovery".
  • Uri Boness from JTeam in Amsterdam: Solr - From Theory to Practice.


See http://upcoming.yahoo.com/event/2488959/ for more information.

For those interested in NOSQL Meetups, the discussion over at the NOSQL mailing list might be of interest to you: http://blog.oskarsson.nu/2009/06/nosql-debrief.html

June 2009 Apache Hadoop Get Together @ Berlin

2009-04-23 19:30
Title: Apache Hadoop Get Together @ Berlin
Location: newthinking store Berlin Mitte
Link out: Click here
Description: I just announced the fifth Apache Hadoop Get Together in Berlin at the newthinking store. Torsten Curdt offered to give a talk on data serialization with Thrift and Protocol Buffers.

If you have a topic you would like to talk about: Feel free to just bring your slides - there will be a projector and lots of people interested in scalable information retrieval.
Start Time: 17:00
Date: 2009-06-25

March 2009 Hadoop Get Together Berlin

2009-03-07 19:50
Since last summer, newthinking store Berlin has been hosting a Hadoop Meetup every quarter. These user group meetings are not limited to Hadoop projects; they cover the technologies needed for storing, processing and searching large amounts of data.

The meeting last Thursday featured a talk by Lars George on his experiences using HBase in customer projects as early as 2007. He first discussed his requirements for a distributed database, then explained the basics of HBase and described his experiences using the software in customer projects. The bottom line for me is that, although it is still at a very early stage, the project does provide a lot of value: Instead of implementing your own solution, it is possible to benefit from the efforts of others. One thing I consider especially remarkable is the effort the HBase community puts into helping users who run into problems.

The second talk was given by Jan Lehnardt on CouchDB. Jan explained the main design goals of the system and detailed the architecture of CouchDB. Then he explained how Erlang made it possible to reach these goals in a comparatively short time.

The slides of the talks are both available online:

Lars George: HBase

Jan Lehnardt: CouchDB

The talks were followed by several questions and interesting discussions (with some beer in the Keyser Soze close by).

The next Get Together will be held in June 2009. Looking forward to seeing you in Berlin then.