Apache Mahout Meetup Amsterdam

2011-02-19 20:18

Last week I was honoured to be invited as one of the two speakers on Apache Mahout at the Mahout meetup in Amsterdam at JTeam's offices. After free beer, cola and pizza, Frank Scholten gave an overview of Mahout's clustering capabilities. After a brief introduction to Mahout itself he went into a little more detail on how clustering works in general. He then used a fun data set - a selection of Seinfeld scripts - to guide the audience through the process of choosing the right data preparation steps, coming up with good training parameters and finally evaluating clustering quality.

After that I gave a brief introduction to classification with Mahout - going into a little more detail when it comes to data preparation and quality evaluation. The audience seemed most interested in learning more about how data preparation works - after all, that step cannot really be covered by Mahout itself (though we do have some support) but instead needs a lot of domain knowledge from the user's side.
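To give a rough idea of what that "some support" can look like in code, below is a minimal, self-contained sketch that trains Mahout's SGD logistic regression on hand-encoded feature vectors. The naive token-hashing encoder and the toy spam/ham examples are made up purely for illustration and are not what was shown at the meetup:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Sketch only: the data preparation (feature encoding) is deliberately naive.
    public class TinyClassifierSketch {

        private static final int FEATURES = 100;

        // Hypothetical feature encoding: hash each token into a fixed-size vector.
        static Vector encode(String text) {
            Vector v = new RandomAccessSparseVector(FEATURES);
            v.set(0, 1.0); // bias term at index 0
            for (String token : text.toLowerCase().split("\\s+")) {
                int index = 1 + (token.hashCode() & Integer.MAX_VALUE) % (FEATURES - 1);
                v.set(index, v.get(index) + 1.0);
            }
            return v;
        }

        public static void main(String[] args) {
            // Two categories, FEATURES-dimensional input, L1 prior.
            OnlineLogisticRegression model =
                    new OnlineLogisticRegression(2, FEATURES, new L1())
                            .learningRate(1.0).lambda(1.0e-4);

            String[] spam = {"cheap pills buy now", "win money now"};
            String[] ham = {"meeting notes attached", "see you at the meetup"};

            for (int pass = 0; pass < 20; pass++) {
                for (String s : spam) {
                    model.train(1, encode(s)); // category 1 = "spam"
                }
                for (String s : ham) {
                    model.train(0, encode(s)); // category 0 = "ham"
                }
            }

            // classifyScalar returns the probability of category 1.
            System.out.println(model.classifyScalar(encode("buy cheap pills")));
            System.out.println(model.classifyScalar(encode("notes from the meetup")));
        }
    }

The point of the sketch is mainly that the feature encoding - not the learner - is where the domain knowledge goes; in a real setup one would rather use Mahout's vector encoders and a proper train/test split.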

Judging from the brief round of self-introductions the meetup was well attended by an interesting mixture of people coming from JTeam, Hippo, the Dutch police working on data analytics, developers working at RIPE and many more.

If you are interested in more data analysis, search and data storage - do not miss registration for Berlin Buzzwords on June 6/7th 2011.

FOSDEM - Sunday - smaller bits and pieces

2011-02-18 20:17

With WebODF the Office track featured a very interesting project that focusses on providing a means to open ODF documents in your favourite browser: Content and formatting are converted to a form that can easily be dealt with by using a combination of HTML and CSS. Advanced editing is then supported by using JavaScript.

With OpenStack the following talk focussed on an open cloud stack project that was started by NASA and Rackspace, as both simultaneously needed an open source, openly designed and developed cloud stack that strives for community inclusion. According to the speaker the goal is to be as ubiquitous a cloud project as Apache is for web servers - he probably was not quite aware of how close that development model is even to the foundation side of Apache.

The closing keynote dealt with the way kernel development takes place. There were a few very interesting pieces of information for contributors that are valid for any open source project really:

  • Out-of-tree code is invisible to kernel developers and users. As such, the longer code remains out of tree, the harder it becomes to actually get it out there and expose it to real-world use.
  • In contrast, open code means giving up control: Maintainership means responsibility, but it does not come with any power or control over the source code. Similarly, opening code up as a patch or a separate project at Apache means giving up control - it means working towards turning the project into a community that can live on its own.
  • For kernel patches the general rule is to not break things and not go backward in quality: What is working for users today must be working with the next release as well. To be able to spot any compatibility issues it is necessary to take part in the wider discussion lists - not only in your limited development community. Developers should focus on solving the problem at hand instead of getting their original code into the project.

Or in short: The kernel is no research project; as such it must not break existing applications. Visionary brilliance really is no excuse for a poor implementation. Conspiracy theories such as "hey, developer x declined my patch only because it is out of scope for his employer's goals" are not going to get you anywhere. Such things do happen, but in general kernel developers first think of themselves as kernel developers - being an employee somewhere only comes after that.

Keep in mind that the community remembers past actions. In the end you need not convince business people or users but the developers themselves, who might end up with the maintenance burden for your patch. To get your patch accepted it greatly helps to not express it in terms of implementation needs only but to clearly formulate your requirements - independent of the implementation. And as in any open source project, helping with cleanup (that is not only whitespace fixes, but real cleanup as in refactoring) does help build a positive attitude.

Why should you go for kernel development nevertheless? It's a whole lot of fun. It's a way to influence the kernel to support the features that you need. It's sort of like becoming part of an elite club - and which developer does not like the feeling of belonging to the elite changing the way the world looks tomorrow? In addition, as with any substantial open source involvement, being visible in the kernel community also most likely means being visible to your future employer.

FOSDEM - HBase at Facebook Messaging

2011-02-17 20:17

Nicolas Spiegelberg gave an awesome introduction not only to the architecture that powers Facebook messaging but also to the design decisions behind their use of Apache HBase as a storage backend. Disclaimer: HBase is used for message storage; attachments are stored in a different backend, Haystack.

The reasons to go for HBase include its strong consistency model, support for auto failover, load balancing of shards, support for compression, atomic read-modify-write support and the inherent Map/Reduce support.
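To make the atomic read-modify-write point a little more concrete, here is a minimal sketch against the plain HBase client API of that era - the "messages" table and the column names are invented for illustration and have nothing to do with Facebook's actual schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch only: "messages", "meta", "unread" and "archived" are invented names.
    public class AtomicMessageMetadata {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "messages");

            byte[] row = Bytes.toBytes("user42");
            byte[] meta = Bytes.toBytes("meta");

            // Atomic read-modify-write: bump an unread counter on the server side,
            // with no separate get/put round trip and no client-side locking.
            long unread = table.incrementColumnValue(row, meta, Bytes.toBytes("unread"), 1L);

            // Conditional update: only set the archived flag if it is not set yet
            // (passing null as the expected value means "column must not exist").
            Put put = new Put(row);
            put.add(meta, Bytes.toBytes("archived"), Bytes.toBytes("true"));
            boolean applied = table.checkAndPut(row, meta, Bytes.toBytes("archived"), null, put);

            System.out.println("unread=" + unread + ", archived flag newly set=" + applied);
            table.close();
        }
    }

Both calls are applied atomically on the region server, which is exactly the kind of guarantee that is hard to get from an eventually consistent store.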

When going from MySQL to HBase some technological problems had to be solved: Coming from MySQL basically all data was normalised - so in an ideal world, migration would have involved one large join to port all the data over to HBase. As this is not feasible in a production environment, all data was instead loaded into an intermediary HBase table, joined via Map/Reduce and imported into the target HBase instance. The whole setup was run as a dark launch - being fed with live traffic in parallel for performance optimisation and measurement.

The goal was zero data loss in HBase - which meant using the Apache Hadoop append branch of HDFS. They re-designed the HBase master in the process to avoid having a single point of failure; backup masters are handled by ZooKeeper. Lots of bug fixes went back from Facebook's engineers to the HBase code base. In addition, for stability reasons, rolling restarts for upgrades, performance improvements and consistency checks were added.

The Apache HBase community received lots of love from Facebook for its willingness to work together with the Facebook team on better stability and performance. Work on improvements was shared between the teams in an amazingly open and inclusive development model.


One additional hint: FOSDEM videos of all talks including this one have been put online in the meantime.

FOSDEM - Django

2011-02-16 20:17

The languages/ cloud computing track on Sunday started with the good, the bad and the ugly of Django's architecture. Without much ado the speaker started by giving a high level overview of the general package layout of Django - unfortunately not going into too much detail on the architecture itself.

What he loves about Django are the model layer abstractions that are not just an ORM - both relational and non-relational databases can be supported easily. Abstractions in Django are organised by the task being solved - there are multiple implementations available for caching, mailing, session handling etc. There is great geo support with options for defining geo objects and querying single points on a map for all geo objects overlaying them. Being a community of test-driven people, Django features awesome debugging and testing tools. To prevent cross-site request forgery Django comes with built-in protection mechanisms.

There is multi-database support for building applications. With the core implementation kept small, features can be turned on and off as needed. In addition the framework comes with great documentation: no feature addition is accepted unless it comes with decent documentation - which fits nicely with the common perception that anything that is untested and undocumented does not exist.

The bad things about Django according to the speaker? Well, the old CSRF protection implementation that might lead to token leakage. Schema changes and migrations currently really are hard to handle, though there is South to take away at least some of the migration pain. The templating implementation could use some improvement as well - being designed to make inclusion of logic in templates hard, some use cases are just too clumsy to implement.

As for the ugly things: there is quite a bit of magic at work, which generally makes applications harder to trace - that is about to get better. Too many parts of Django rely on unwieldy regular expressions. Anything that spans more than four lines on screen probably has to be considered unmanageable and unchangeable. Authentication cannot really be customised - the information that is stored per user is hard-coded and fixed.

Over time what was learned: refactoring cannot be avoided as requirements change. However, being consistent in what you do makes it so much easier for users to pick up the framework. What helps with creating a great open source project: people who have the time to invest - never underestimate the time needed to really go from prototype to production ready.

FOSDEM - Saturday

2011-02-15 20:17

Day one at FOSDEM started with a very interesting and timely keynote by Eben Moglen: Starting with the example of Egypt he argued for de-centralized, distributed and thus harder-to-take-over communication systems. In terms of tooling we are already almost there. Most use cases like micro-blogging, social networking and real-time communications can already be implemented in a distributed, fail-safe way. So instead of going for convenience it is time to think about digital independence from a very few central providers.

I spent most of the morning in the data dev room. The schedule was packed with interesting presentations ranging from introductory overview talks on Hadoop to a more in-depth treatment of the machine learning framework Apache Mahout. With an analysis of the Wikileaks cables the schedule also included case studies on what use cases can be implemented by thorough data analysis. The afternoon featured presentations on the background of using more data analytics for better usability at Wikimedia as well as talks on building search applications.

In the lightning talks room a wide variety of projects was presented - in only ten minutes Pieter Hintjens explained the gist of using 0MQ for messaging. That talk included "Hintjens' law of concurrency": e = m * c^2, where e is the effort needed to implement and maintain, m is mass - that is, the amount of code written - and c is complexity.

For me the day ended with a very interesting presentation by Matthias Kirschner/FSFE on one of their campaigns: pdfreaders.org has the very narrow and well-scoped goal of getting links to unfree software off of governmental web pages. Using a really intuitive example they were able to convince officials to link to their vendor-neutral list of PDF readers: "Just imagine a road in your city. At this road drivers will find a sign that tells them the road is well suited to be used by VW cars. Those cars can be obtained for a test drive at the following address. Your government." As unthinkable as such a sign may be, essentially that same text is included on nearly all governmental web pages linking to Acrobat Reader.

What made pdfreaders successful is the combined effort of volunteers, its very narrow and clear scope, and its scalability by nature: People were asked to submit "broken" web pages to a bug tracker, campaign participants would then go and send out paper letters to these institutions and mark the bugs fixed as soon as the links were changed. Letters were pre-written and well prepared. So all that was needed was money for toner, paper and stamps.

One final cute example of how that worked out can be seen at hamburg.de/adobe.

O'Reilly Strata - day one afternoon lectures

2011-02-13 22:18

Big data at startups - Infochimps

As a startup, to get good people there is no option other than to grow your own: offer the chance to gain a lot of experience in return for a not-so-great wage. Start out with really great hires:

  • People who have the "get shit done gene": They discover new projects, are proud to contribute to team efforts, are confident in making changes to a code base they probably have not seen beforehand. To find these you should ask open-ended questions in interviews.
  • People who are passionate learners, that use the tools out there, use open code and are willing to be proven wrong.
  • People who are generally fun to work with.

Put these people on small, non-mission-critical initial projects - make them fail on parallel tasks (and tell them they will fail) to teach them to ask for help. What is really hard for new hires is learning to deal with git, ssh keys, command line stuff, what to do and when to ask for help, knowing what to do when something breaks.

Infochimps uses Kanban for organisation: Each developer has a task he has chosen at any given point in time. He is responsible for getting that task done - which may well involve getting help from others. Being responsible for a complete feature is one big performance boost once the feature truly goes online. Code review is being used for teachable moments - and in cases where something really goes wrong.

Development itself is organised to optimise for developers' joy - which usually means taking Java out of the loop.

Machine learning at Orbitz

They use Hadoop mostly for log analysis. Here, too, the problem of fields or whole entries missing from the original log format was encountered. To be able to dynamically add new attributes and deal with growing data volumes they went from a data warehouse solution to Apache Hadoop. Hadoop is used for data preparation before training, for training recommender models and for cross-validation setups. Hive has been added for ad-hoc queries, usually issued by business users.

Data scaling patterns at LinkedIn

When scaling to growing data LinkedIn developers started gathering a few patterns that helped make dealing with data easier:

  • When building applications constantly monitor your invariants (see the counter sketch after this list): It can be so frustrating to run an hour-long job just to find out at the very end that you made a mistake during data import.
  • Have a QA cluster, and have versioning on your releases to allow for easy rollback should anything go bad. Unit tests go without saying.
  • Profile your jobs to avoid bottlenecks: Do not read from the distributed cache in a combiner - do not reuse code that was intended for a different component without thorough review.
  • Dealing with real-world data means dealing with irregular, dirty data: When generating pairs of users for connection recommendations, Obama caused problems as he is friends with seemingly every American.
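The counter sketch referenced in the first item: one way to monitor such invariants is to have the mapper count obviously broken records via Hadoop counters, so data quality problems show up in the job status while the job is still running. The class name, field layout and counter names below are made up for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a log-parsing mapper that tracks data-quality invariants via counters.
    public class LogLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");

            // Invariant: every record has at least three fields and a non-empty user id.
            if (fields.length < 3 || fields[1].isEmpty()) {
                context.getCounter("DataQuality", "MALFORMED_RECORDS").increment(1);
                return; // skip instead of discovering the problem at the very end
            }

            context.getCounter("DataQuality", "GOOD_RECORDS").increment(1);
            userId.set(fields[1]);
            context.write(userId, ONE);
        }
    }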

However, the biggest bottleneck is IO during shuffling, as every mapper talks to every reducer. As a rule of thumb, do most work on the map side and minimise the data sent to reducers. This also applies to many of the machine learning M/R formulations. One idea for reducing shuffling load is to pre-filter on the map side with Bloom filters.
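One possible shape of such a map-side pre-filter, sketched below under the assumption that a Bloom filter over the keys of the smaller join side has already been built and shipped alongside the job (the file name and record layout are made up):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    // Sketch: drop records whose join key is almost certainly not on the other
    // side of the join, so they never reach the shuffle phase at all.
    public class BloomFilterJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final BloomFilter filter = new BloomFilter();

        @Override
        protected void setup(Context context) throws IOException {
            // Assumes a pre-built filter has been shipped with the job,
            // e.g. via the distributed cache, under this (hypothetical) name.
            try (DataInputStream in = new DataInputStream(
                    Files.newInputStream(Paths.get("user-ids.bloom")))) {
                filter.readFields(in);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String joinKey = line.split("\t", 2)[0];

            // membershipTest may return false positives but never false negatives,
            // so dropping non-members here is safe and cuts shuffle volume.
            if (filter.membershipTest(new Key(joinKey.getBytes("UTF-8")))) {
                context.write(new Text(joinKey), value);
            }
        }
    }

Since Bloom filters never produce false negatives, filtering on the map side is safe; the occasional false positive merely means a few extra records still reach the reducers.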

To serve at scale:

  • Run stuff multiple times.
  • Iterate quickly to get fast feedback.
  • Do AB testing to measure performance.
  • Push out quickly for feedback.
  • Try out what you would like to see.

See also sna-projects.com/blog for more information.

O'Reilly Strata - Day two - keynotes

2011-02-12 20:17

Day two of Strata started with a very inspiring insight from the host himself that extended the vision discussed earlier in the tutorials: It's not at all about the tools - the current data analytics value lies in the data itself and in the conclusions and actions drawn from analysing it.

Bit.ly keynote

The first keynote was presented by bit.ly - for them there are four dimensions to data analytics:

  • Timeliness: There must be realtime access, or at least streaming access to incoming data.
  • Storage must provide the means to efficiently store, access, query and operate on data.
  • Education as there is no clear path to becoming a data scientist today.
  • Imagination to come up with new interesting ways to look at existing data.

Storing shortened urls, bit.ly really has three views on its data: The very personal, intrinsic preferences expressed in your participation in the network. The neighborhood view taking into account your friends and acquaintances. Finally there is the global view that allows for drawing conclusions on a very large, global scale - a way to find out what's happening worldwide just by looking at log data.

Thomson Reuters

In contrast to all-digital bit.ly, Thomson Reuters comes with a very different background - though they act on a global scale distributing news worldwide, a lot of manual intervention is still asked for to come up with high quality, clean, curated data. In addition their clients focus on very low latency to be able to act on incoming news at the stock market.

For traditional media providers it is very important to bring news together with context and users: Knowing who users are and where they live may result in delivering better service with more focussed information. However he sees a huge gap between what is possible with today's web2.0 applications and what is still in common practice in large corporate environments: Social networking sites tend to gather data implicitly without clearly telling users what is collected and for which purpose. In corporate environments though it was (and still is) common practice to come up with general compliance rules that target protecting data privacy and insulating corporate networks from public ones.

Focussing on cautious and explicit data mining might help these environments to benefit from cost savings and targeted information publishing to the corporate environment as well.

Mythology of big data

Each technology carries in itself the seeds of its own destruction - the same is true for Hadoop and friends: The code is about to start turning into a commodity itself. As a result the real value lies in the data it processes and in the knowledge of how to combine existing tools to solve 80% of your data analytics problems.

The myth really is that of the lonely hacker sitting in front of his laptop solving the world's data analysis problems. Instead analytics is all about communication and learning from those who stored and generated the data. Only they are able to tell you more about the business cases as well as the context of the data. Only domain knowledge can help solve real problems.

Data evolved from being the product, to being a by-product, to - in the past decade - being an asset. Nowadays it is turning into a substrate for developing better applications. There is no need for huge data sets to turn data into a basis for better applications. In the end it boils down to using data to re-vamp your organisation's decision making from horse-trading, gut-check based decisions to scientific, data-backed, informed decisions.

Amazon - Werner Vogels

For Amazon, big data means that storing, collecting, analyzing and processing the data are hard to do. Being able to do so currently is a competitive advantage. In contrast to BI, where questions drove the way data was stored and collected, today infrastructure is cheap enough to creatively come up with new analytics questions based on the available data.

  • Collecting data goes from a streaming model to daily imports and even to batch imports - never underestimate the bandwidth of FedEx. There even is a FedEx import at Amazon.
  • Never underestimate the need for increased storage capacity. Storage on AWS can be increased dynamically.
  • When organizing data keep data quality and manual cleansing in mind - there is a Mechanical Turk offering for that at AWS.
  • For analysis Map Reduce currently is the obvious choice - AWS offers Elastic MapReduce for that.
  • The trend goes more and more to sharing analysis results via public APIs to enable customers downstream to reuse data and provide added value on top of it.

Microsoft Azure data market place

Microsoft used their keynote to announce the Azure Data Marketplace - a place to make data available for easy use and trading. To deal with data today you have to find it, license it from its original owner - which incurs overhead negotiating licensing terms.

The goal of Microsoft is to provide a one-stop shop for data with a unified and discoverable interface. They work with providers to ensure cleanup and curation. In turn providers get a marketplace for trading data. It will be possible to visualize data before purchase to avoid buying what you do not know. There is a subscription model that allows for constant updates and has licensing issues cleared. There are consistent APIs to the data that can be incorporated by solution partners to provide better integration and analysis support.

At the very end the Heritage Health Prize was announced - a $3 million data mining competition open for participation starting next April.

O'Reilly Strata - Tutorial data analytics

2011-02-11 20:17

Acting based on data

It comes as no surprise to hear that also in the data analytics world engineers are unwilling to share the details of how their analysis works with higher management - while on the other side there is not much interest in learning how analytics really works. This culture leads to a sort of black art, witchcraft attitude towards data analytics that hinders most innovation.

When starting to establish data analytics in your business there are a few steps to consider: First of all, no matter how beautiful the visualizations may look in the tool you just chose to work with and are considering buying - keep in mind that shiny pebbles won't solve your problems. Instead focus on what kind of information you really want to extract and choose the tool that does that job best. Keep in mind that data never comes as clean as analysts would love it to be.

  • Ask yourself how complete your data really is (Are all fields you are looking at filled for all relevant records?).
  • Are those fields filled with accurate information (Ever asked yourself why everyone using your registration form seems to be working for a 1-100 engineers startup instead of one of the many other options down the list?)
  • For how long will that data remain accurate?
  • For how long will it be relevant for your business case?

Even the cleanest data set can get you only so far: You need to be able to link your data back to actual transactions to be able to segment your customers and add value from data analytics.

When introducing data analytics check whether people are actually willing to share their data. Check whether management is willing to act on potential results - that may be as easy as spending lots of money on data cleansing, or it may involve changing workflows to be able to provide better source data. As a result of data analytics there may be even more severe changes ahead of you: Are people willing to change the product based on pure data? Are they willing to adjust the marketing budget? ... job descriptions? ... development budget? How fast is the turnaround for these changes? When making changes yearly there is no value in having realtime analytics.

In the end it boils down to applying the OODA cycle: Only if you can observe, orient, decide and act faster than your competitor do you have a real business advantage.

Data analytics ethics

Today Apache Hadoop provides the means to give data analytics super powers to everyone: It brings together the use of commodity hardware with scaling to high data volumes. With great power there must also come great responsibility, according to Stan Lee. In the realm of data science that involves solving problems that might be ethically at least questionable though technologically trivial:

  • Helping others adjust their weapons to increase death rates.
  • Making others turn into a monopoly.
  • Predicting the likelihood of cheap food making you so sick that you are able and willing to go to court against the provider as a result.

On the other hand it can solve cases that are mutually sensible both for the provider and the customer: Predicting when visitors to a casino are about to become unhappy and willing to leave before they even know it themselves may give the casino employees a brief time window for counter-actions (e.g. offering a free meal).

In the end it boils down to avoiding screwing up other people's lives. Deciding which action does the least harm while achieving the most benefit. Which treats people at least proportionally if not equally, what serves the community as a whole - or more simply: What leads me to being the person I always wanted to be.

Teddy in San Francisco

2011-02-10 20:13

Before attending O'Reilly Strata there were a few days left to adjust to the different time zone, meet up with friends and generally spend some days in the Greater San Francisco area. As was to be expected, those were way too few days. The weekend was a bit rainy, still packed with visiting Chinatown right after the plane had landed and spending some time at ... Finally I was taken out to Bucks - the restaurant generally known among software engineers as the place where VC deals are being made.

Sunday was reserved for visiting some redwood trees - it's so great driving just a few minutes out of the city and arriving in an area that looks like the set of a fairy tale movie. With all the mist and with the sun coming out here and there the area looked even more bewitched.

On Monday the sun finally arrived in the bay - as a result a ferry trip to Sausalito seemed like the optimal thing to do. Unfortunately there was not enough time to rent a bike and do the "ride the bridge" tour - or get a kayak to go out into the bay. Maybe next time though.

After returning back home, Teddy showed me some pieces of chocolate someone in the US made him addicted to - now it's not just the tasty Swiss one but also the Berkeley one I have to find a shop for in Berlin ;)

Apache Mahout in Amsterdam

2011-01-25 20:00
On February 7th there will be an Apache Mahout meetup in Amsterdam kindly organised by JTeam. There will be two presentations - one by myself on classification with Apache Mahout as well as a second one by Frank Scholten on clustering with Apache Mahout.


  • Time: 18:00
  • Location: Frederiksplein 1, 1017XK Amsterdam, The Netherlands


Looking forward to a few days in Amsterdam.