Apache Mahout Hackathon Berlin

2010-12-14 20:50
Early next year - on February 19th/20th to be more precise - the first Apache Mahout Hackathon is scheduled to take place at c-base. The Hackathon will take one weekend. There will be plenty of time to hack on your favourite Mahout issue, to get in touch with two of the Mahout committers and get your machine learning project off the ground.

Please contact isabel@apache.org if you are planning to attend this event, or register with the Xing event, so we can plan for enough space for everyone. If you have not registered for the event there is no guarantee you will be admitted.

If you'd like to support the event: We are still looking for sponsors for drinks and pizza.

Apache Mahout Podcast

2010-12-13 21:21
During ApacheCon ATL, Michael Coté interviewed Grant Ingersoll on Apache Mahout. The interview is available online as a podcast. It covers the goals and current use cases of the project and goes into some detail on the reasons for starting it in the first place. If you are wondering what Mahout is all about, what you can do with it and in which direction development is heading, the interview is a great way to find out more.

Teddy in Antwerp

2010-12-12 21:30
While at Devoxx, Teddy went into the city and took a few pictures of the Grote Markt, the harbour, as well as the main train station.

Apache Lunch Devoxx

2010-12-11 21:30
On Twitter I suggested hosting an Apache dinner during Devoxx. Matthias Wessendorf of Apache MyFaces was so kind as to take up the discussion and carry it over to the Apache community mailing list. It quickly turned out that there was quite some interest, with several members and committers attending Devoxx. We scheduled the meetup for lunch time on the Friday after the conference.
I pinged a few Apache-related people I knew would attend the conference (being a speaker and a committer on some Apache project almost certainly resulted in getting a ping). Steven Noels kindly made a reservation at a restaurant close by and announced time and geo coordinates on party.apache.org. Although several speakers had already left that very morning, we turned out to be eleven people – including Stephen Colebourne, Matthias Wessendorf, Steven Noels and Martijn Dashorst of the Apache Wicket project. It was great meeting all of you – and being able to put some faces to names :)

Devoxx – Day three

2010-12-10 21:28
The panel discussion on the future of Java was driven by questions on the current state and future of Java that visitors had submitted and voted on. The general take-aways for me were the clear statement that the TCK will never be made available to the ASF, and Oracle's promise to continue supporting the Java community and to remain active in the JCP.

There was some discussion on whether coming Java versions should be allowed to break backwards compatibility. One advantage would be the removal of several Java puzzlers, making it easier for Joe Java to write code without knowing too much about potential inconsistencies. According to Joshua Bloch the language is no longer well suited to the average programmer who simply wants to get his tasks done in a consistent and easy-to-use language: it has become too complicated over the course of the years and is in dire need of simplification.

Having seen his presentation at Berlin Buzzwords and silently followed the project's progress online, I skipped parts of the Elasticsearch presentation. Instead I went to the presentation on the Ghost-^wBoilerplate Busters from Project Lombok. It has always struck me as odd that in a typical Java project there is so much code that can be generated automatically by Eclipse – such as getters/setters, equals/hashCode, delegation of methods and more. I never really understood why it is possible to generate all that code from Eclipse but not at compile time. Project Lombok, however, comes to the rescue here. As a compile-time dependency it provides several annotations that are automatically expanded into the corresponding code on the fly. It includes support for getter/setter generation, handling of closeable resources (even with the current stable version of Java), generation of thread-safe lazy initialisation of member variables, automatic implementation of the composition-over-inheritance pattern and much more.
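
To make this a bit more concrete, here is a minimal sketch of what a Lombok-annotated class can look like; the class, field and file names are made up, and the concrete annotations shown follow current Lombok releases rather than the talk itself.

    import lombok.Cleanup;
    import lombok.EqualsAndHashCode;
    import lombok.Getter;
    import lombok.Setter;
    import lombok.ToString;

    // Getters, setters, equals/hashCode and toString are generated at compile time,
    // so none of that boilerplate shows up in the source file.
    @Getter @Setter
    @EqualsAndHashCode
    @ToString
    public class Conference {
        private String name;
        private int attendees;

        // getSchedule() is generated with thread-safe lazy initialisation:
        // loadSchedule() only runs on the first call.
        @Getter(lazy = true)
        private final String schedule = loadSchedule();

        private static String loadSchedule() {
            return "expensive computation goes here";
        }

        public int firstByteOfNotes() throws java.io.IOException {
            // @Cleanup closes the stream automatically at the end of the enclosing scope.
            @Cleanup java.io.InputStream in = new java.io.FileInputStream("notes.txt");
            return in.read();
        }
    }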

The library can be used from within Eclipse, in Maven, Ant, Ivy and on Google App Engine. One of the IntelliJ developers in the audience announced that the library will be supported by the next version of IntelliJ as well.

Devoxx – Day 2 HBase

2010-12-09 21:25
Devoxx featured several interesting case studies of how HBase and Hadoop can be used to scale data analysis back ends as well as data serving front ends.

Twitter

Dmitriy Ryaboy from Twitter explained how to scale high-load, large-data systems using Cassandra. Looking at the sheer number of tweets generated each day, it becomes obvious that the site cannot be run on a system like MySQL alone.

Twitter has released several of their internal tools under a free software license for others to re-use – some of them rather straightforward, others more involved. At Twitter each tweet is annotated with a user_id, a time stamp (fine if skewed by a few minutes) as well as a unique tweet_id. To come up with a solution for generating the latter, they built a library called Snowflake. Though rather simple, the algorithm even works in a cross-data-centre set-up: the first bits are composed of the current time stamp, the following bits encode the data centre, and after that there is room for a counter. The resulting tweet_ids are globally ordered by time and distinct across data centres without the need for global synchronisation.
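
To illustrate the layout described above (this is not Snowflake's actual code; the bit widths and the custom epoch are assumptions), a generator along those lines could look roughly like this:

    // Sketch of a Snowflake-style id generator: timestamp in the high bits,
    // data-centre id in the middle, a per-millisecond counter in the low bits.
    public final class IdGeneratorSketch {
        private static final long EPOCH = 1288834974657L; // assumed custom epoch in ms
        private static final int DATACENTER_BITS = 10;    // assumed width
        private static final int SEQUENCE_BITS = 12;      // assumed width

        private final long datacenterId;
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public IdGeneratorSketch(long datacenterId) {
            this.datacenterId = datacenterId & ((1L << DATACENTER_BITS) - 1);
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                // Same millisecond: bump the counter (a real implementation would
                // wait for the next millisecond once the counter overflows).
                sequence = (sequence + 1) & ((1L << SEQUENCE_BITS) - 1);
            } else {
                sequence = 0L;
                lastTimestamp = now;
            }
            // Ids come out roughly time-ordered across data centres, yet distinct,
            // without any global coordination.
            return ((now - EPOCH) << (DATACENTER_BITS + SEQUENCE_BITS))
                    | (datacenterId << SEQUENCE_BITS)
                    | sequence;
        }
    }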

With Gizzard, Twitter released a rather general sharding implementation that is used internally to run distributed versions of Lucene, MySQL as well as Redis (the latter to be introduced for caching tweet timelines due to its explicit support for lists as value data structures, something not available in memcached).

There is FlockDB for large-scale social graph storage and analysis, Rainbird for time series analysis (though with OpenTSDB there is something comparable available for HBase), and Haplocheirus for message vector caching (currently based on memcached, soon to be migrated to Redis for its richer data structures). The queries available through the front-end are rather limited, which makes it easy to provide pre-computed, optimised versions in the back-end. As with the caching problem, a trade-off between the hit rate on the pool of pre-computed items and storage cost can be made based on the observed query distribution.

In Twitter's back-end, various statistical and data mining analyses are run on top of Hadoop and HBase to compute potentially interesting followers for users, to extract potentially interesting products, etc.
The final take-home message: go from requirements to final solution. In the space of storage systems there is no such thing as a silver bullet. Instead you have to carefully evaluate the features and properties of each solution as your data and load increase.

Facebook

When implementing Facebook Messaging (a new feature announced this week), Facebook had to choose between HBase and Cassandra. The requirements of the feature included massive scale, long-tail write access to the database (which more or less ruled out MySQL and comparable solutions) and a need for strict ordering of messages (which ruled out any eventually consistent system). The decision was made to use HBase.

A team of 15 developers (including operations and front-end) worked on the system for one year before it was finally released. The feature supports integration of Facebook messaging, IM, SMS and mail into one single system, making it possible to group all messages by conversation no matter which channel was used to send the message originally. That way each user's inbox turns into a social inbox.

Adobe

Cosmin Lehene presented four use cases of Hadoop at Adobe. The first one dealt with creating and evaluating profiles for the Adobe Media Player. Users would be associated with a vector giving more information on which genres the media they consumed belonged to. These vectors would then be used to generate recommendations for additional content to view, in order to increase the consumption rate. Adobe built a clustering system that interfaces Mahout's canopy and k-means implementations with their HBase back-end for user grouping. Thanks Cosmin for including that information in your presentation!

A second use case focussed on finding out more about the usage of Flash on the internet. Using Google to search for Flash content was no good, as only the first 2,000 results could be viewed, resulting in a highly skewed sample. Instead they used a mixture of Nutch and, for storage, HBase to retrieve the content. The analysis was done with respect to various features of Flash movies, such as frame rates, and revealed a large gap between the perceived typical usage and the actual usage of Flash on the internet.

The third use case involved the analysis of images and usage patterns of the Photoshop-in-a-browser edition on Photoshop.com. The fourth use case dealt with scaling the infrastructure that powers businesscatalyst – a turn-key online business platform including analytics, campaigning and more. When purchased by Adobe, the system was very successful business-wise; however, the infrastructure was by no means able to put up with the load it had to accommodate. Changing to a back-end based on HBase led to better performance and faster report generation.

Devoxx – Day two – Hadoop and HBase

2010-12-08 21:24
In his session on the current state of Hadoop, Tom went into a little more detail on the features in the latest release and on the roadmap for upcoming releases (including Kerberos-based security, append support, a warm standby namenode and others).
He also gave a very interesting view of the current Hadoop ecosystem. More and more projects are being created that either extend Hadoop or are built on top of it. Several of these are run as projects at the Apache Software Foundation; some, however, are available only outside of Apache. Using graphviz he created a graph of projects depending on or extending Hadoop and from that provided a rough classification of these projects.

As is to be expected, HDFS and Map/Reduce form the very basis of this ecosystem. Right next to them sits ZooKeeper, a distributed coordination and locking service.

Storage systems extending the capabilities of HDFS include HBase, which adds random read/write as well as realtime access to the otherwise batch-oriented distributed file system. With Pig, Hive and Cascading, three projects are making it easier to formulate complex queries for Hadoop. Among the three, Pig is mainly focussed on expressing data filtering and processing, with SQL support being added over time as well. Hive came from the need to formulate SQL queries against Hadoop clusters. Cascading goes a slightly different way, providing a Java API for easier query formulation (see the sketch below). The new kid on the block is, sort of, Plume, a project initiated by Ted Dunning with the goal of providing a Map/Reduce abstraction layer inspired by Google's FlumeJava publication.
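
To give a flavour of what such a Java API for query formulation looks like, here is a minimal word-count sketch assuming the Cascading 1.x API; paths and field names are made up:

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCount {
        public static void main(String[] args) {
            // Source and sink taps: plain text files on HDFS (paths are made up).
            Tap source = new Hfs(new TextLine(new Fields("line")), "input/docs");
            Tap sink = new Hfs(new TextLine(), "output/wordcounts", SinkMode.REPLACE);

            // Split each line into words, group by word, count the group sizes.
            Pipe assembly = new Pipe("wordcount");
            assembly = new Each(assembly, new Fields("line"),
                    new RegexSplitGenerator(new Fields("word"), "\\s+"));
            assembly = new GroupBy(assembly, new Fields("word"));
            assembly = new Every(assembly, new Count());

            // The connector plans the assembly into the necessary Map/Reduce jobs.
            Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
            flow.complete();
        }
    }

The pipe assembly is turned into Map/Reduce jobs by the FlowConnector, so no hand-written mapper or reducer classes are required.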

There are several projects for data import into HDFS: Sqoop can be used for interfacing with RDBMSs, while Chukwa and Flume deal with feeding log data into the file system. For general coordination and workflow orchestration there is Oozie, originally developed at Yahoo!, as well as support for workflow definitions in Cascading.

When storing data in Hadoop it is a common requirement to find a compact, structured representation for it. Though human-readable, XML files are not very compact. With any binary format, however, schema evolution commonly becomes a problem: adding, renaming or deleting fields in most cases forces an upgrade of all code interacting with the data as well as a re-formatting of data that has already been stored. With Thrift, Avro and Protocol Buffers there are three options for storing data in a compact, structured binary format. All three projects support schema evolution, not only allowing users to deal with missing data but also providing a means to map old fields to new ones and vice versa.
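
As a rough illustration of what schema evolution looks like in practice, here is a small sketch using Avro's generic API (record and field names are made up, and the code assumes a reasonably recent Avro version): a record written with an old schema is read back with a new schema that adds a field with a default value.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaEvolutionSketch {
        // "Old" schema the data was originally written with.
        static final String WRITER_SCHEMA = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"}]}";
        // "New" schema: adds a field with a default, so old data stays readable.
        static final String READER_SCHEMA = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"country\",\"type\":\"string\",\"default\":\"unknown\"}]}";

        public static void main(String[] args) throws Exception {
            Schema writerSchema = new Schema.Parser().parse(WRITER_SCHEMA);
            Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA);

            // Serialise a record using the old schema.
            GenericRecord record = new GenericData.Record(writerSchema);
            record.put("id", 42L);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
            encoder.flush();

            // Deserialise with the new schema: the missing field gets its default value.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord evolved = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                    .read(null, decoder);
            System.out.println(evolved.get("country")); // prints "unknown"
        }
    }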