O'Reilly Strata - Tutorial data analytics

2011-02-11 20:17

Acting based on data

It comes as no surprise to hear that also in the data analytics world engineers are unwilling to share details of how their analysis works with higher management - with on the other side not much interest on learning how analytics really works. This culture leads to a sort of black art, witch craft attitude to data analytics that hinders most innovation.

When starting to establish data analytics in your business there are a few steps to consider: Frist of all no matter how beautiful visualizations may look on the tool you just chose to work with and are considering to buy - keep in mind that shiny pebbles won't solve your problems. Instead focus on what kind of information you really want to extract and chose the tool that does that job best. Keep in mind that data never comes as clean as analysts would love it to be.

  • Ask yourself how complete your data really is (Are all fields you are looking at filled for all relevant records?).
  • Are those fields filled with accurate information (Ever asked yourself why everyone using your registration form seems to be working for a 1-100 engineers startup instead of one of the many other options down the list?)
  • For how long will that data remain accurate?
  • For how long will it be relevant for your business case?

Even the cleanest data set can get you only so far: You need to be able to link your data back to actual transactions to be able to segment your customers and add value from data analytics.

When introducing data analytics check whether people are actually willing to share their data. Check whether management is willing to act on potential results - that may be as easy as spending lots of money on data cleansing, or it may involve changing workflows to be able to provide better source data. As a result of data analytics there may be even more severe changes ahead of you: Are people willing to change the product based on pure data? Are they willing to adjust the marketing budget? ... job descriptions? ... development budget? How fast is the turnaround for these changes? When making changes yearly there is no value in having realtime analytics.

In the end it boils down to applying the OODA cycle: If you can be faster observing, orienting, deciding and acting than your competitor only then do you have a real business advantage.

Data analytics ethics

Today Apache Hadoop provides the means to give data analytics super powers to everyone: It brings together the use of commodity hardware with scaling to high data volumns. With great power there must come great great responsibility according to Stan Lee. In the realm of data science that involves solving problems that might be ethically at least questionable though technologically trivial:

  • Helping others adjust their weapons to increase death rates.
  • Making others turn into a monopoly.
  • Predict the likelihood of cheap food making you so sick that you are able and willing to go to court against the provider as a result.

On the other hand it can solve cases that are mutually sensible both for the provider and the customer: Predicting when visitors to a casino are about to become unhappy and willing to leave before the even know they are may give the casino employees a brief time window for counter actions (e.g. offering you a free meal).

In the end it boils down to avoiding to screw up other people's lifes. Deciding which action does least harm while achieving most benefit. Which treats people at least proportional if not equal, what serves the community as a whole - or more simply: What leads me to being the person I always wanted to be.


2011-01-23 15:46
It's already sort of a nice little tradition for me to spend the first weekend in February in Brussels for FOSDEM. This year I am particulary happy that there will be a Data Analytics Dev Room at FOSDEM. A huge Thanks to @ogrisel and @nmaillot who have done most of the heavy lifting of getting the schedule in place.

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

Looking forward to an interesting Cloud Track, to meeting Peter Hintjens who is going to give a talk on 0MQ, the DevOps presentation and lots of very interesting DevRooms. Looks like again it's going to be tough to decide on which presentations to go to at any one time.

Scrumtisch August Berlin

2010-08-14 16:26
Just seen it - the next Scrumtisch Berlin has been scheduled for 25th August 2010 at 18:30 Uhr. So far, no official talk has been scheduled, so please expect two topics on Scrum and its application to be selected for discussion according to Marion's agile topic selection algorithm.

Please talk to Marion Eickmann if you would like to attend the next meetup.

Apache Hadoop Event Blog

2009-08-24 20:38
As Apache Hadoop becomes ever more popular both in industry as well as in research, user groups, conferences and hacking days are being scheduled around the world. The goal of the event calendar blog hosted on wordpress.com is to provide a common space for organizers to announce their events and potential participants to look for new conferences.

September 2009 Hadoop Get Together Berlin

2009-08-17 09:11
The newthinking store Berlin is hosting the Hadoop Get Together user group meeting. It features talks on Hadoop, Lucene, Solr, UIMA, katta, Mahout and various other projects that deal with making large amounts of data accessible and processable. The event brings together leaders from the developer and user communities. The speakers present projects that build on top of Hadoop, case studies of applications being built and deployed on Hadoop. After the talks there is plenty of time for discussion, some beer and food.

There is also a related Xing Group on the topic of building scalable information retrieval systems. Feel free to join and meet other developers dealing with the topic of building scalable solutions.


Please see upcoming page for updates.

  • Thilo G├Âtz: JAQL
  • Uwe Schindler: Lucene 2.9
  • nugg.ad: Ad Recommendation with Hadoop
  • T. Schuett: Solving puzzles with Hadoop.

If you yourself would like to give a presentation: There are additional slots of 20 minutes each available. There is a beamer provided. Just bring your slides. To include your topic on this web site as well as the upcoming.org entry, please send your proposal to Isabel.

After the talks there will be time for an open discussion. We are going into a nearby restaurant after the event so there will be plenty of time for talking, discussing and new ideas.


The Apache Hadoop Get Together takes place at the newthinking store Berlin:

newthinking store GmbH

Tucholskystr. 48

10117 Berlin

View Larger Map


  • Homeli - not exactly in walking distance, but only a few S-Bahn stations away. Very nice Bed and Breakfast hotel. (The offer is only valid if you stay for at least three nights.)

  • Circus Berlin is a combination of hostel and hotel close by.

  • Zimmer in Berlin is yet another Bed and Breakfast hotel.

  • House boat near Friedrichshain


If you would like to be notified on news please subscribe to our mailinglist. The meetings usually are also announced on the project mailing lists as well as on the newthinking store website.


In case you have any trouble reaching the location or finding accomodation feel free to contact the organiser Isabel.

Past events

June 2009 Apache Hadoop Get Together @ Berlin

2009-06-21 21:33
Just a brief reminder: Next week on Thursday the next Apache Hadoop Get Together is scheduled to take place in Berlin. There are quite a few interesting talks scheduled:

  • Torsten Curdt: Data Legacy - the challenges of an evolving data warehouse
  • Christoph M. Friedrich, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI): "SCAIView - Lucene for Life Science Knowledge Discovery".
  • Uri Boness from JTeam in Amsterdam: Solr - From Theory to Practice.

See http://upcoming.yahoo.com/event/2488959/ for more information.

For those interested in NOSQL Meetups, the discussion over at the NOSQL mailing list might be of interest to you: http://blog.oskarsson.nu/2009/06/nosql-debrief.html

Scrum Table Berlin

2009-06-21 21:26
Last week I attended the scrum table Berlin. This time around Phillippe gave a presentation on "backlog colours", that is types of work items tracked in the backlog.

The easiest type to track are features - that is items that generate revenue and are on the wishlist of the customer. Second type of items he sees are infrastructure items - that is, things needed to implement several features but invisible to the customer. Third type are bugs. Basically these are diminishing the value of features one had already classified as done earlier in the process. Fourth and last type are technical debt items - that is shortcuts taken or bad design choices (either knowingly as intentional decision made to meet some deadline or unintentional due to lack of experience).

A very simple classification could be the following matrix:

Name Value Cost
Feature Visible Positive Positive
Infrastructure Invisible Positive Positive
Bug Visible Negative Positive
Technical Debt Invisible Negative Positive

All four types of items exist in the real world. The interesting part is making these visible, assigning costs to each of them and scheduling these items in the regular sprint intervals.

The full presentation can be downloaded: http://scrumorlando09.pbworks.com/f/kruchten_backlog_colours.pdf

Tomcat Tuesday talk

2009-05-21 09:07
Since several months at neofonie we have a talk given by external or internal developers on various subjects each Tuesday. Usually these presentations are a nice way to get an overview of new emerging technologies, to get an overview of current conference topics or to gain insight into interesting internal projects.

This week we had Apache Tomcat Committer and PMC Peter Rossbach here at neofonie to talk about the Tomcat architecture and Tomcat clustering solutions. He gave two pretty in-depth presentations on the Tomcat internals, Tomcat optimization and extension points.

Some points that were especially interesting to me: The project started out in the late nineties, initiated by a bunch of developers who just wanted to see what it takes to write a web application container and that fullfills the spec. The goal basically was a reference implementation. Soon enough however users defined the resulting code as production ready and used it.

There are a few caveats from this history that are still visible. One is the lack of tests in the codebase. Sure, each release is tested agains the Sun TCK - but these tests cannot be opened to the general public. So if you as a developer make extensions or modifications to the code base there is no easy way of knowing whether you broke something or not.

For me as a developer it was interesting to see really how complex it quickly gets to cluster tomcat deployments and make them failure resistant. Some tools mentioned that help automatic with easier deployment are Puppet and FAI. One issue however that is still on the developer's agenda is Tomcat monitoring.

To summarize: The conference room was packed with developers expecting two very interesting talks. Thanks to Peter Rossbach for coming to neofonie and explaining more on the internals of the Tomcat software, the project and the community behind.

Hadoop User Group UK

2009-04-21 20:34
On Tuesday the 14th the second Hadoop User Group UK took place in London. This time venue and pizza was sponsored by Sun. The room quickly filled http://www.thecepblog.com/2009/04/19/kmeans-clustering-now-running-on-elastic-mapreduce/with approximately 70 people.

Tom opened the session with a talk on 10 practical tips on how to get the most benefit from Apache Hadoop. The first question users should ask themselves is which type of programming language they want to use. There is a choice between structured data processing languages (PIG or Hive), dynamic languages (Streaming or Dumbo), or using Java which is closest to the system.

Tom's second hint dealt with the size of files to process with Hadoop: Both - too large unsplittable and too small ones are bad for performance. In the first case, the workload cannot be easily distributed across the nodes in the latter case each unit of work is to small to account for startup and coordination overhead. There are ways to remedy these problems with sequence files and map files though. Another performance optimization would be to chain individual jobs - PIG and Hive do a pretty decent job in automatically generating such jobs. ChainMapper and ChainReducer can help with creating chained jobs.

Another important task when implementing map reduce jobs is to tell Hadoop the progress of your job. For once, this is important for long running jobs in order for them to remain alive and not be killed by the framework due to timeouts. Second, it is convenient for the user as he can view the progress in the web UI of Hadoop.

Usual suspects for tuning a job: Number of mappers and reducers, usage of combiners, compression customised data serialisation, shuffling tweaks. Of course there is always the option to let someone else do the tuning: Cloudera does provide support as well as pre-built packages init scripts and the like ;)

In the second talk I did a brief Mahout intro. It was surprising to me that half of the attendees already employed machine learning algorithm implementations in their daily work. Judging from the discussion after the talk and from questions I received after it the interest in the project seems pretty high. The slide I liked the most: The announcement of our first 0.1 release. Thanks to all Mahout committers and contributors who made this possible.

After the coffee break Craig gave an introduction to Terrier an extensible information retrieval plattform developed at the university of Glasgow. He mentioned a few other open IR platforms namely Tuple Flow, Zettair, Lemur/Indri, Xapian, as well as of course nutch/Solr/Lucene.

What does Terrier have to do with the HugUK? Well index creation in Terrier is now based on an implementation that makes use of Hadoop for parallelization. Craig did some very interesting analysis on scalability of the solution: The team was able to achieve scaling near linear in the number of nodes added (at least as long as more than reducer is used ;) ).

After the pizza Paolo described his experiences implementing the vanilla pagerank computation with Hadoop. One of his test datasets was the Citeseer citation graph. Interestingly enough: Some of the nodes in this graph have self references (maybe due to extraction problems), duplicate citations, and the data comes in an invalid xml format.

The last talk was on HBase by Michael Stack. I am really happy I attended HugUK as I missed that talk in Amsterdam at the ApacheCon. First Michael gave an overview of which features of a typical RDBMS are not supported by HBase: Relations, joins, and of course JDBC being among the limitations. On the pro site HBase offers a multiple node solutions that has scale out and replication built in.

HBase can be used as source as well as as sink for map reduce jobs and thus integrates nicely with the Apache Hadoop stack. The framework provides a simple shell for administrative tasks (surgery on sick clusters forced flushes non sql get scan and put methods). In addition the master comes with a UI to monitor the cluster state.

Your typical DBA work though differs with HBase: Data locality and physical layout do matter and can be configured. Michaels recommendation was to start out testing with the XL instance on EC2 and decrease instances if you find out that it is too large.

The talk finished with an outlook of the features in the upcoming release the issues on the todo list and an overview of companies already using HBase.

After talks were finished quite a few attendees went over to a pub close by: Drinking beer, discussing new directions and sharing war stories.

I would to thank Johan Oskarsson for organising the event. And a special thanks to Tom for letting me use his Laptop for the Apache Mahout presentation: the hard disk of mine broke exactly one day before.

Last but not least thank you to Sylvio and Susi for letting me stay at their place - and thanks to Helene for crying only during daytime when I was out anyway ;)

Hope to see at least some of the attendees again at the next Hadoop Meetup in Berlin. Looking forward to the next Hadoop User Group UK.

GSoC: Student applications.

2009-04-05 09:57
Title: GSoC: Student applications closed
Link out: Click here
Description: After this date no more student applications are accepted. Internal ranking at Apache starts 7 days earlier. The ranking process finishes at 16th of April.
Date: 2009-04-03