CFP - Berlin Buzzwords 2011 - search, store, scale

2011-01-26 08:00
This is to announce Berlin Buzzwords 2011, the second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords

Berlin Buzzwords 2011 - Search, Store, Scale

6/7 June 2011

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search - Lucene, Solr, katta or comparable solutions
  • NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important Dates (all dates in GMT +2)

  • Submission deadline: March 1st, 2011, 23:59 CET
  • Notification of accepted speakers: March 22nd, 2011, CET
  • Publication of final schedule: April 5th, 2011
  • Conference: June 6/7, 2011

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Proposals should be submitted no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk and a brief abstract in English. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation, and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users). If you'd like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us.

Follow @berlinbuzzwords on Twitter for updates. News on the conference will be published on our website.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Schedule and further updates on the event will be published on our website. Please re-distribute this CfP to people who might be interested.

Contact us at:

newthinking communications GmbH
Schönhauser Allee 6/7
10119 Berlin, Germany
Julia Gemählich
Isabel Drost
+49(0)30-9210 596

WiFi at the Apache Hadoop Get Together

2011-01-18 20:40
Just a brief reminder: The next Apache Hadoop Get Together is scheduled to take place on Thursday, January 27th at 6 p.m. at the Zanox Event Campus at Media Spree Berlin.

We have three very interesting talks. Though thirty guests have registered already, we still have a few free seats. Head over to the Xing event page to register if you have not done so yet.

If you would like to have access to the local WiFi, please let me know - I need to register your mail address with the venue two days before the event.

A huge thanks to Zanox for providing the location for free, another huge thanks to Cloudera for sponsoring video taping of the event.

Apache Hadoop Get Together Berlin - January 2011

2010-12-28 16:31
This is to announce the next Apache Hadoop Get Together sponsored by Cloudera and Zanox that will take place in the Zanox Event Campus in Berlin.

When: January 27th 2011, 6 p.m.

Where: zanox Event Campus (Please note the changed event location.)

As always there will be slots of 30min each for talks on your Hadoop topic. After each talk there will be a lot of time to discuss. We will head over to a bar after the event for some beer and something to eat.

Talks scheduled so far:

Simon Willnauer: "Lucene 4 - Revisiting problems for speed"

Abstract: This talk presents a brief case study of long-standing problems in Lucene and how they have been approached to gain sizable performance improvements. Each of the presented problems will come with a brief introduction, the implemented solution and the resulting performance improvements. This talk might be interesting even for non-Lucene folks.

Josh Devins: "Hadoop at Nokia"
Abstract: In this talk, Josh will outline some of the ways in which Nokia is using Hadoop. We will start by having a quick look at the practical side of getting started with Hadoop and outline cluster hardware and configuration and management with tools like Puppet. Next we'll dive head first into how Hadoop and its ecosystem are being utilized on a daily basis to perform business analytics, drive machine learning and help build data-driven products. We will also touch on how we go about collecting metrics from dozens of applications distributed in multiple data centers around the world. An open Q&A session will follow.

Paolo Negri: "The order of magnitude challenge: from 100K daily users to 1M"
Abstract: "Social games backends share many aspects of normal web applications, but exacerbate scaling problems. Follow this talk to see how we evolved and brought a plain Ruby on Rails app to sustain 5000 reqs/sec, moved part of our data from SQL to NoSQL to reach 5 million queries per minute, and what we learned from this experience."

Please do indicate on Upcoming or Xing if you are coming so we can more safely plan capacities.

A big Thank You goes to zanox for providing the venue for free for our event as well as to Cloudera for supporting videos being taped of the presentations.

Looking forward to seeing you in Berlin,

Apache Hadoop - Trainings by Cloudera in Berlin

2010-12-22 23:53
Cloudera is offering trainings for both administrators and developers early next year in Berlin. If you are getting started with Apache Hadoop, this might be a great option to get your developers and operations up to speed with the framework. If you are a regular of the local Apache Hadoop Get Together, a discount code should have been sent to you by mail.

Apache Mahout Hackathon Berlin

2010-12-14 20:50
Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please get in touch if you are planning to attend this event, or register with the Xing event, so we can plan for enough space for everyone. If you have not registered for the event there is no guarantee you will be admitted.

If you'd like to support the event: We are still looking for sponsors for drinks and pizza.

Devoxx – Day two – Hadoop and HBase

2010-12-08 21:24
In his session on the current state of Hadoop, Tom White not only went into a little more detail on the features of the latest release and on the roadmap for upcoming releases (including Kerberos-based security, append support, warm standby namenode and others).
He also gave a very interesting view on the current Hadoop ecosystem. More and more projects are being created that either extend Hadoop or are built on top of it. Several of these are run as projects at the Apache Software Foundation, while some are available outside of Apache only. Using graphviz he created a graph of projects depending on or extending Hadoop, and from that derived a rough classification of these projects.

As is to be expected, HDFS and Map/Reduce form the very basis of this ecosystem. Right next to them sits ZooKeeper, a distributed coordination and locking service.

Storage systems extending the capabilities of HDFS include HBase, which adds random read/write as well as realtime access to the otherwise batch-oriented distributed file system. With Pig, Hive and Cascading, three projects are making it easier to formulate complex queries for Hadoop. Among the three, Pig is mainly focussed on expressing data filtering and processing, with SQL support being added over time as well. Hive came from the need for SQL formulation on Hadoop clusters. Cascading goes a slightly different way, providing a Java API for easier query formulation. The new kid on the block is Plume, a project initiated by Ted Dunning that has the goal of coming up with a Map/Reduce abstraction layer inspired by Google's FlumeJava publication.

There are several projects for data import into HDFS. Sqoop can be used for interfacing with an RDBMS. Chukwa and Flume deal with feeding log data into the filesystem. For general co-ordination and workflow orchestration there is Oozie, originally developed at Yahoo!, as well as support for workflow definition in Cascading.

When storing data in Hadoop it is a common requirement to find a compact, structured representation of the data to store. Though human readable, XML files are not very compact. However, when using a binary format, schema evolution is commonly a problem: adding, renaming or deleting fields in most cases requires upgrading all code that interacts with the data as well as re-formatting already stored data. With Thrift, Avro and Protocol Buffers there are three options available for storing data in a compact, structured binary format. All three projects support schema evolution by allowing users not only to deal with missing data but also to map old fields to new ones and vice versa.
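
To make the idea of schema evolution a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use the Thrift, Avro or Protocol Buffers APIs: the schema layout, the field names and the `resolve` helper are all hypothetical and only illustrate how a reader could fill in defaults for missing fields and map renamed fields onto their new names.

```python
# Illustrative only: a toy reader-side schema resolution, not the API of
# Thrift, Avro or Protocol Buffers. All names here are hypothetical.

# Reader schema: field name -> (default value, names used by older writers)
READER_SCHEMA = {
    "user_id": (None, ["uid"]),   # renamed from "uid" in old records
    "country": ("unknown", []),   # added later, so old records lack it
}


def resolve(record, schema=READER_SCHEMA):
    """Map a record written with an older schema onto the reader schema."""
    resolved = {}
    for name, (default, aliases) in schema.items():
        # Try the current field name first, then any older names.
        for candidate in [name] + aliases:
            if candidate in record:
                resolved[name] = record[candidate]
                break
        else:
            # Field is missing in the stored data: fall back to the default.
            resolved[name] = default
    # Fields the reader no longer knows about are simply dropped.
    return resolved


old_record = {"uid": 42, "legacy_flag": True}
new_record = {"user_id": 7, "country": "DE"}
print(resolve(old_record))
print(resolve(new_record))
```

The point of the sketch is that old data never has to be rewritten: the mapping between old and new fields lives in the schema, and the reader applies it when the record is loaded.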

Apache Con – Hadoop, HBase, Httpd

2010-11-24 23:19
The first Apache Con day featured several presentations on NoSQL databases (track sponsored by Day software), a Hadoop track as well as presentations on Httpd and an Open source business track.

Since its inception Hadoop was always intended to be run in trusted environments, firewalled from hostile users or even attackers. As such it never really supported any security features. This is about to change with the new Hadoop release, which includes better Kerberos-based security.

A long-awaited feature when creating files in Hadoop was append support. Basically, up to now writing to Hadoop was a one-off job: open a file, write your data, close it and be done. Re-opening a file and appending data was not possible. This situation is especially bad for HBase, as its design relies on being able to append data to an existing file. There have been earlier efforts to add append support to HDFS, and third-party vendors have integrated such patches. However, only with a current Hadoop version is append officially supported by HDFS.

A very interesting use case of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This includes not only linking to free software under incompatible licenses but also developers copying pieces of free code, e.g. copying entire classes or functions that originally were available only under copyleft licenses into internal projects.

The speaker went into some detail explaining the initial problems they had run into. Obviously it is not a good idea to mix and match Hadoop and HBase versions freely; instead, it is best practice to use only versions the developers claim to be compatible. Another common mistake is to leave the parameters of both projects at their defaults. The default parameters are supposed to be fool-proof, but they are optimised for Hadoop newbies who want to try out the system on a single-node cluster, so they need more attention in a distributed setting. Other anti-patterns include storing only tiny files in the cluster and thus quickly running out of memory on the namenode (which stores all file information, including block mappings, in main memory for faster access).
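
A rough back-of-the-envelope calculation shows why tiny files hurt the namenode. The sketch below is not from the talk: it assumes a commonly quoted rule of thumb of about 150 bytes of namenode memory per file or block entry and a 64 MB block size, and the helper name is made up for illustration. Real numbers depend on the Hadoop version and configuration.

```python
import math

# Assumptions, not figures from the talk:
BYTES_PER_OBJECT = 150        # rule-of-thumb cost per file or block entry
BLOCK_SIZE = 64 * 1024 ** 2   # HDFS block size of 64 MB


def namenode_bytes(num_files, file_size):
    """Rough namenode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One entry for the file itself plus one per block it occupies.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


total = 1024 ** 4  # store 1 TB of data in total
small = namenode_bytes(total // (100 * 1024), 100 * 1024)  # 100 KB files
large = namenode_bytes(total // 1024 ** 3, 1024 ** 3)      # 1 GB files
print(f"100 KB files: {small / 1024 ** 2:.0f} MB of namenode memory")
print(f"1 GB files:   {large / 1024 ** 2:.2f} MB of namenode memory")
```

Under these assumptions the same amount of data costs the namenode more than a thousand times as much memory when it is stored as 100 KB files instead of 1 GB files, which is why packing small records into larger files is usually recommended.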

In the NoSQL track Jonathan Gray from Facebook gave a very interesting overview of the current state of HBase. As it turned out, only a few days later Facebook announced its internal use of HBase for the newly launched Facebook messaging feature.

HBase has adopted a release cycle with separate development and production releases to get new features into interested users' hands more quickly: users willing to try out new experimental features can use the development releases of HBase. Those who are not should go for the stable releases.

After focussing on improving performance in the past months, the project is currently focussing on stability: data loss is to be avoided by all means. ZooKeeper is to be integrated more tightly for storing configuration information, thus enabling live reconfiguration (at least to some extent). In addition, HBase is aiming to integrate stored-procedure-like behaviour: as explained in Google's Percolator paper $LINK, batch-oriented processing gets you only so far. If data gets added constantly, it makes sense to give up some of the throughput that batch-based systems provide and instead optimise for shorter processing cycles by implementing event-triggered processing.

On the recommendation of one of neofonie's sys-ops I visited some of the httpd talks. First, Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically rewriting HTTP response content to match your application. There is even a spell checker for request URLs: if marketing has given your flyer to the press with a typo in the URL, chances are that the spell-checking module can fix it automatically for each request. Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit is found: instead of returning a 404 right away, the server first tries to find the document by taking common misspellings into account.
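
The "correct only on a miss" behaviour can be sketched in a few lines of Python. This is not how httpd implements its module: the `KNOWN_PATHS` set, the `LOOKALIKES` table, the function names and the choice of answering with a redirect are assumptions made purely for illustration.

```python
# Illustrative sketch of "correct the URL only on a miss". This is not the
# httpd implementation; paths, tables and function names are hypothetical.

KNOWN_PATHS = {"/flyer2011.html", "/index.html"}

# Digits and letters that are commonly confused, in both directions.
LOOKALIKES = {"0": "o", "o": "0", "1": "l", "l": "1"}


def candidates(path):
    """Yield variants of path that undo one common typing mistake."""
    # Two neighbouring characters switched.
    for i in range(len(path) - 1):
        yield path[:i] + path[i + 1] + path[i] + path[i + 2:]
    # A digit typed as a similar-looking letter, or vice versa.
    for i, ch in enumerate(path):
        if ch in LOOKALIKES:
            yield path[:i] + LOOKALIKES[ch] + path[i + 1:]


def lookup(path):
    """Return (status, path); the correction cost is paid only on a miss."""
    if path in KNOWN_PATHS:          # exact hit: no extra work at all
        return 200, path
    for variant in candidates(path):
        if variant in KNOWN_PATHS:   # assumed: answer with a redirect
            return 301, variant
    return 404, path                 # nothing similar exists either
```

For example, `lookup("/flyre2011.html")` finds the flyer by undoing the switched letters, while a request for an existing path never enters the correction loop.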

Apache Mahout @ Devoxx Tools in Action Track

2010-11-01 09:32
This year's Devoxx will feature several presentations coming from the Apache Hadoop ecosystem including Tom White on the basics of Hadoop: HDFS, MapReduce, Hive and Pig as well as Michael Stack on HBase.

In addition there will be a brief Tools in Action presentation on Monday evening featuring Apache Mahout.

Please let me know if you are going to Devoxx - it would be great to meet some more Apache people there, maybe have dinner on one of the conference days.

CfP: Data Analysis Dev Room at Fosdem 2011

2010-10-27 06:56
Call for Presentations: Data Analysis Dev Room, FOSDEM
5 February 2011
1pm to 7pm
Brussels, Belgium

This is to announce the Data Analysis DevRoom co-located with FOSDEM, the first meetup on analysing and learning from data, taking place in Brussels, Belgium.

Important Dates (all dates in GMT +2):

  • Submission deadline: 2010-12-17
  • Notification of accepted speakers: 2010-12-20
  • Publication of final schedule: 2011-01-10
  • Meetup: 2011-02-05

Data analysis is an increasingly popular topic in the hacker community. This trend is illustrated by declarations such as:

"I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary scarce factor is the ability to understand that data and extract value from it."

-- Hal Varian, Google’s chief economist

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • Information retrieval / Search
  • Large Scale data processing
  • Machine Learning
  • Text Mining
  • Computer vision
  • Linked Open Data
  • Sample list of related open source / data projects (not exhaustive) :
  • (including MapReduce, Pig, Hive, ...)

Closely related topics not explicitly listed above are welcome.

High quality, technical submissions are called for, ranging from principles to practice.

We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Submissions should be based on free software solutions.

Proposals should be submitted no later than 2010-12-17. Acceptance notifications will be sent out on 2010-12-20.

Please include your name, bio and email, the title of the talk and a brief abstract in English. Please indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users).

The presentation format is short: 30 minutes including questions. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Note: "DataDevRoom sponsors" will not be endorsed as "FOSDEM sponsors" and hence not listed in the sponsors section on the website.

Follow @DataDevRoom on Twitter for updates. News on the conference will be published on our website.

Program Chairs:

  • Olivier Grisel - @ogrisel
  • Isabel Drost - @MaineC
  • Nicolas Maillot - @nmaillot

Please re-distribute this CFP to people who might be interested.

Video: Max Heimel on sequence tagging w/ Apache Mahout

2010-10-26 19:58
Some time ago Max Heimel from TU Berlin gave a presentation on the new HMM support in the Mahout 0.4 release at the Apache Hadoop Get Together in Berlin:

Mahout Max Heimel from Isabel Drost on Vimeo.

Thanks to JTeam for sponsoring video taping, thanks to newthinking for providing the location and thanks to Martin Schmidt from newthinking for producing the video.