Thanks for all the help

2012-12-31 11:24
This year was a blast: It started with the ever-great FOSDEM in Brussels (see you there in 2013?), followed by an invitation to GeeCon in Poznan (if you ever get an invitation to speak there - do accept, the organisers do an amazing job at that event). In summer we had Berlin Buzzwords for the third time, with 700 attendees (to retain the community feel of the conference we decided to limit tickets in 2013, so make sure you get yours early). In autumn I changed my name and afterwards spent two amazing weeks in Sydney, followed by Strata EU. Finally, in December I was invited to go through the most amazing submissions for Hadoop Summit Amsterdam 2013 (it was incredibly hard to pick and choose - thanks to Sean and Torsten for assisting me with that choice for the Hadoop Applied track).

I think I would have gone mad if it hadn't been for all the help from friends and family: A big hug to my husband for keeping me sane whenever times got a bit rough in terms of what was in my calendar. Thanks for all your support throughout the year. Another huge hug to my family - in particular to my mom, who in early 2012 volunteered to take care of most of the local organisation of our wedding (we got married close to where I grew up) and put in several surprises that she "kept mum" about up to the very last second. Thanks also to everyone who helped fill our wedding magazine with content (and trained my ma in dealing with all sorts of document formats containing that content in her mailbox - I personally was forbidden to even touch her machine during the first nine months of 2012 ;) ).

Another thanks to David Obermann for a series of interesting talks at this year's Apache Hadoop Get Together Berlin. It's amazing to see the event continue to grow even after I essentially stepped down as its main organiser.

Speaking of events: Another Thank You to Julia Gemählich, Claudia Brückner and the whole Berlin Buzzwords team. This was the first year I considerably reduced the time I put into the event - and the first year I could attend the conference without being too tired to enjoy at least some of the presentations. You did a great job! Also thanks to my colleagues over at Nokia who provided me with a day a week to get things done for Buzzwords. In the same context: A very big thank you to everyone who helped turn Berlin Buzzwords into a welcoming event for everyone: Thanks to all speakers, sponsors and attendees. Looking forward to seeing you again next year.

Finally a really big Thanks to all the people who helped turn our wedding day and vacation afterwards into the great time it was: Thanks to our families, our friends (including but not limited to the best photographer and friend I've met so far, those who hosted us in Sydney, and the many people who provided us with information on where to go and what to do.)

There's one thing though that bugged me by the end of this year:





So I decided that my New Year's resolution for 2013 would be to ramp up the time I spend on Apache one way or another: At least as committer for Apache Mahout, Mentor for Apache Drill and as a Member of the foundation.

Wishing all of you a Happy New Year - and looking forward to another successful Berlin Buzzwords in 2013.

RecSys Stammtisch Berlin - December 2012

2012-12-30 12:40
Earlier this month I attended the fourth Recommender Stammtisch in Berlin. The event was kindly hosted by Soundcloud - who on top of organising the speakers provided a really yummy buffet by Kochzeichen D.

The evening started with Paul Lamere giving a very entertaining but also very packed talk on why music recommendation is special - or, put more generally, why all recommender systems are special:


  • Traditionally recommender systems found their way into the wild to drive sales. In music however the main goal is to help users discover new content.
  • Listeners are very different, ranging from those indifferent to what is being played (imagine someone sitting in a coffee bar enjoying their espresso - it's unlikely that they would want to influence the playlist of the shop's entertainment system unless they are really annoyed with its content), to casual listeners who from time to time skip a piece, to more engaged people who train their own recommender through services like last.fm, to fanatics that are really into certain kinds of music. Building just one system to fit them all won't do. Also, relying on just one signal won't do - instead you will have to deal with both content signals (like loudness plots) and community signals.
  • Music applications tend to be highly interactive. So even if there is little to no reliable explicit feedback, people tell you how much they like the music by skipping, turning pieces up, or interacting with the content behind the song being played.
  • In contrast to many other domains music deals with a vast item space and a huge long tail of songs that almost never get interacted with.
  • In contrast to shopping recommenders, making mistakes in music is comparatively cheap: In most situations music isn't purchased on a song-by-song basis but through some subscription model, so the actual cost of playing the wrong song is low. Also, songs tend to be no longer than about five minutes, so users are less annoyed when confronted with a slightly wrong piece of music.
  • When implementing recommenders for a shopping site you want to avoid re-recommending stuff a user has already purchased. This is not the case in music recommendation. Quite the contrary: Re-recommending known music is one indicator for playlists people will like.
  • When it comes to building playlists care must be taken to organise songs in a coherent way, mixing new and familiar songs in a pleasing order - essentially the goal should be to take the listener on a journey.
  • Fast updates are crucial: Music business itself is fast paced with new releases coming out regularly and being taken up very quickly by major broadcasting stations.
  • Music is highly contextual: It pays to know whether the user is in the mood for calm or for faster music.
  • There are highly passionate users that are very easy to scare away - those tend to be the loudest ones influencing your user community most.
  • Though metadata is key in music as well, never expect it to be correct. There are all sorts of weird band and song names that you never thought would be possible - an observation that Ticketmaster also made when building their ticket search engine.
  • Music is highly social and irrational - so just knowing your users' friends and their tastes won't make your recommendations perfect.


Overall I guess the conclusion is that no matter which domain you deal with you will always need to know the exact properties of that domain to build a successful system.

In the second talk Brian McFee explained one way of modeling playlists with context. He concentrated on passive music discovery - that is, based on one query, returning a list of music to listen to sequentially - as opposed to active retrieval, where users issue a query to search for a specific piece of music.

Historically it turned out to be difficult to come up with any playlist generator that is better than randomly selecting songs to play. His model is based on a random walk notion where the vertices are songs and edges represent learnt group similarities. Groups were represented by features like familiarity, social tags, specific audio features, metadata, release dates etc. Depending on the playlist category, in most cases he was able to show that his model actually does perform better than random.
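
This is not McFee's actual model, just a minimal sketch of the random-walk idea under the assumption of a toy in-memory graph: songs are vertices, weighted edges stand in for the learnt group similarities, and the next song is drawn proportionally to the outgoing edge weights.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class RandomWalkPlaylist {

        // Adjacency list: for each song the outgoing edges to related songs,
        // weighted by a learnt group similarity.
        private final Map<String, Map<String, Double>> edges = new HashMap<String, Map<String, Double>>();
        private final Random random = new Random();

        void addEdge(String from, String to, double weight) {
            Map<String, Double> out = edges.get(from);
            if (out == null) {
                out = new HashMap<String, Double>();
                edges.put(from, out);
            }
            out.put(to, weight);
        }

        // Draw the next song proportionally to the edge weights of the current one.
        String step(String current) {
            Map<String, Double> out = edges.get(current);
            if (out == null || out.isEmpty()) {
                return current; // dead end: stay put
            }
            double total = 0;
            for (double w : out.values()) {
                total += w;
            }
            double pick = random.nextDouble() * total;
            for (Map.Entry<String, Double> e : out.entrySet()) {
                pick -= e.getValue();
                if (pick <= 0) {
                    return e.getKey();
                }
            }
            return current;
        }

        List<String> playlist(String seed, int length) {
            List<String> songs = new ArrayList<String>();
            String current = seed;
            for (int i = 0; i < length; i++) {
                songs.add(current);
                current = step(current);
            }
            return songs;
        }
    }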

In the third talk Oscar Celma showed some techniques to also benefit from some of the more complicated signals for music recommendation. Essentially his take was that by relying on usage signals alone you will be stuck with the head of the usage distribution. What you want though is to be able to provide recommendations for the long tail as well.

Some signals he mentioned included content-based features (rhythm, BPM, timbre, harmony), usage signals, social signals (beware of people trying to game the system or make fun of it though) and a mix of all those. His recommendation was to put content signals at the end of the processing pipeline, for re-ranking and refining playlists.

When providing recommendations it is essential to be able to answer why something was recommended. Even just in the space of novelty vs. relevance to the user there are four possible strategies: a) recommend old stuff that is only marginally relevant to the specific user - this will end up pulling up mostly popular songs; b) recommend what is new but not relevant to the user - this will end up pulling out songs that turn your users away; c) recommend what is relevant to the user but old - this will mostly surface stuff the user knows already, but it is a safe bet to play; d) recommend what is both relevant and new to the user - here the really interesting work starts, as this deals with recommending genuinely new songs to users.

To balance discovery with safe fallbacks, rely on skips, bans, likes and dislikes, and take the user's context and attention into account.

The final point the speaker made was the need to take into account the whole picture: Your shiny new recommendation algorithm will just be a tiny piece of the puzzle. Much more work will need to go into data collection and ingestion, and into API design.

The last talk finally went into some detail on the history of playlist creation - back from music creators' choices, via radio station mixes and mix tapes, finally ending up at Spotify and fully automatic playlist creation.

There is a vast body of knowledge on how to create successful playlists, e.g. among DJs, who speak about warm-up phases, chillout times, and alternating types of music in order to take the audience on a journey. Even just shuffling music the user already knows can be very powerful, provided the pool of songs the shuffle is based on is neither too large (containing too broad a mix of music) nor too small (leading to frequent repetitions). According to Ben Fields the science and art of playlist generation - and in particular evaluation - is still pretty much in its infancy, with much to come.

Elastic Search meetup Berlin

2012-11-28 00:39
Today Retresco hosted the (to my knowledge fourth) Elastic Search User Group Berlin - a group dedicated to using Lucene as part of Elastic Search. With roughly fifteen attendees the meetup attracted a decent crowd - most interestingly many of the people there were already using the software either in production or for closed beta projects.

The first talk was given by people from ferret-go - a company doing media monitoring for brands, focused on the German market. They are pretty new to the search topic; on top of that they aren't fluent Java developers but do most of their work in Python. Essentially their whole application is built on top of Elastic Search - most features are implemented as more or less complex search queries. In recent weeks they had to deal with typical problems related to growing data set sizes: nodes getting hot, in particular when load balancing isn't configured quite right, and shards being balanced per index instead of globally across all shards (particularly bad in their case as they added one index with smaller shards and one with significantly larger shards into ES).

The second talk gave a really nice overview on things to keep in mind before putting ES to production - as is usually the case, the default configuration makes it easy to get started but most likely is not what you want in your production environment.

Really nice, technically focused event. Thanks to Retresco for hosting the meetup including beer and Club Mate and to ElasticSearch.com for paying for the pizza. Check their meetup page for the next event - most likely to be scheduled in January.

Also if you'd like to learn more about Lucene 4 (talk by Simon Willnauer) - make sure to attend the Apache Hadoop Get Together December 2012 (if you need another ticket, it might make sense to politely ask the organiser; make sure you are registered, otherwise most likely you won't get through security). The other two talks scheduled are Pere Urbon Bayes on NoSQL and Graph, a love story! as well as myself on How to Fail Your Big Data Project Quick and Rapidly.

ApacheConEU - part 11 (last part)

2012-11-20 20:35
One of the last sessions covered logging frameworks for Java. Christian Grobmeier started by detailing the common requirements for all logging frameworks:


  • Speed - developers do not want to pay a disproportionate penalty for using a logging framework.
  • Fail-safety and reliability - under no circumstances should your logging framework kill your application. In addition it would be most annoying to find that the one log message that would help you decipher the problem your application ran into is missing. As obvious as those requirements sound - there are counterexamples to both: there is a memory leak in log4j 1, and logback may well lose messages when being reconfigured on the fly.
  • Log frameworks should be backwards compatible: Both API changes and incompatible configuration file formats aren't particularly great when you want to upgrade the logging framework version you use.
  • On top of all that, the way you do logging ultimately is a matter of taste - by now there are several implementations for Java alone, catering to needs ranging from pure simplicity to huge flexibility: log4j, logback, java.util.logging, AVSL, tinylog. In addition there are abstraction layers like SLF4J and commons-logging - even though the speaker himself contributes to the latter, at the current time he still recommends using SLF4J as it is actively developed, more modern and better supported.


Biggest news shared in the talk was the release of log4j2, which comes with commons-logging and SLF4J integration. This version finally gets rid of the if (debug.enabled()) { log.debug(...)} idiom by introducing placeholders and variable-length argument lists, essentially enabling format strings for logging. Markers can be added to the logging code to allow for later filtering. Writing plugins has been made a whole lot simpler. There is support for an easier-to-read XML configuration format as well as JSON configurations. In addition, configurations can be set to be reloaded on change at a pre-defined interval. It is slightly slower than logback and log4j on average, though we are still talking about a few milliseconds for a large amount of log messages. Unfortunately those averages did not come with error bars, which would have made interpreting the comparison a bit easier.
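
A minimal sketch of the difference, assuming the log4j2 API jars are on the classpath (class and variable names here are made up):

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class ParameterizedLogging {
        private static final Logger LOG = LogManager.getLogger(ParameterizedLogging.class);

        public static void main(String[] args) {
            String user = "jane";
            long tookMillis = 42;

            // Old idiom: guard the call so the string concatenation is not paid
            // for when debug logging is disabled.
            if (LOG.isDebugEnabled()) {
                LOG.debug("request for " + user + " took " + tookMillis + "ms");
            }

            // log4j2 style: the placeholders are only substituted if the level is
            // enabled, so neither the guard nor the concatenation is needed.
            LOG.debug("request for {} took {}ms", user, tookMillis);
        }
    }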

However the log4j project is not just about Java logging. It has sub-projects for viewing logs from httpd, log4j and others (Chainsaw), as well as logging for PHP and .NET.

Log4j has a rather complex history: As soon as the leader of the project left, activity died away. By now activity has picked up again quite a bit, with four contributors. They created six releases in the last year alone, on top of 600 mailing list messages and a huge number of commits. Still, log4j too is hiring - if you want to work for free on a fun project that affects nearly every Java developer worldwide and work together with awesome coders, this project is seeking new contributors.

If you consider logging boring, just be reminded that logs are a valuable source of user activity, leading to features like being able to recommend new products to customers, localise content offerings or even just adjust the default settings of your web page to increase click-through. Related to log4j there's Apache Flume dealing with distributed logging, there are challenges in the cloud and mobile space, and there's Apache Mayhem for log ingestion.

The last technical session dealt with Apache Buildr - a build system for Java (though not only Java) written in Ruby. The advantage is that it delivers artifact resolution and download from Maven repositories through Ivy plugins, provides greater flexibility through Ruby integration and can fall back to Ant tasks if needed.

The final session was the closing plenary given by Nick Burch. Most notably he invited attendees to ApacheCon 2013 Europe. Looking forward to meeting all of you there. The community edition of ApacheCon was an awesome setup for people to meet and not only pitch their projects but also provide deep technical detail and show off more of what the Apache community is all about. Looking forward to the audio recordings as well as to the videos taken during the conference. CU all next year!

ApacheConEU - part 10

2012-11-19 20:34
In the next session Jukka introduced Tika - a toolkit for parsing content from files including a heuristics based component for guessing the file type: Based on file extension, magic and certain patterns in the file the file type can be guessed rather reliably. Some anecdotes:

  • not all mime types are registered with IANA, there are of course conflicting file extensions,
  • Microsoft Word not only localises its interface but also the magic in the file,
  • html detection is particularly hard as there is quite some overlap with other file formats (e.g. there are such things as html mails...)
  • xhtml parsing is quite reliable by using an actual xml parser for the first few bytes to identify the namespace of the document
  • identifying odf documents is easy - though zipped the magic is preserved uncompressed at a pre-defined location in the file
  • for ooxml the file has to be unpacked to identify it
  • plain text content is hardest - there are a few heuristics based on UTF BOMs, ASCII stats, line ending stats and byte histograms, but it's still not foolproof.


In addition Tika can extract metadata: For text that can be as simple as encoding, length, content type and comments. For images that is extended by image size and potentially EXIF data. For PDF data it gets even more comprehensive, including information on the file creator, save date and more (the same applies for MS Office documents). Most document metadata is based on the Dublin Core standard, for images there's EXIF and IPTC - soon there'll also be XMP-related data that can be discovered.
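
A minimal sketch of type detection and metadata extraction with Tika - the file name is made up, and the exact set of metadata keys naturally depends on which parser kicks in:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.Tika;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaSketch {
        public static void main(String[] args) throws Exception {
            File file = new File("sample.pdf"); // hypothetical input file

            // Detect the content type via Tika's heuristics (extension, magic, patterns).
            Tika tika = new Tika();
            System.out.println("Detected type: " + tika.detect(file));

            // Parse the file and collect whatever metadata the chosen parser can extract.
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream(file)) {
                new AutoDetectParser().parse(in, new BodyContentHandler(), metadata);
            }
            for (String name : metadata.names()) {
                System.out.println(name + " = " + metadata.get(name));
            }
        }
    }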

ApacheCon EU - part 09

2012-11-18 20:54
In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges, as the speaker compared the current ES version based on Lucene 3.x with Solr Cloud based on Lucene 4.0. The comparison still showed that both solutions are more or less comparable - so the choice again depends on your application. However I did like the conclusion: The speaker did not pick a clear winner in terms of projects. He did name another clear winner though: The user community will benefit from there being two projects, as this kind of perceived competition has already sped up development considerably.

The day finished with hoss' Stump the Chump session: The audience was asked to submit questions before the session, and the jury was then asked to pick the winning question that stumped Hoss the most.

Some interesting bits from those questions: One guy had the problem of having to provide somewhat diverse results, e.g. in terms of manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don't have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also c) using a secondary sort value can help - one hint: Solr ships with a random field out of the box for such cases.
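
A minimal SolrJ sketch of the grouping plus random tie-breaker approach - the core name and the manufacturer and random_1234 fields are assumptions for illustration, not taken from the talk:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DiverseResultsSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/shop");

            SolrQuery query = new SolrQuery("laptop");
            // b) collapse items from the same manufacturer into one group
            query.set("group", true);
            query.set("group.field", "manufacturer");
            query.set("group.limit", 3);
            // c) a random secondary sort breaks ties between equally scored items
            query.set("sort", "score desc, random_1234 asc");

            QueryResponse response = solr.query(query);
            System.out.println(response.getGroupResponse().getValues());
        }
    }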

For me the last day started with hossman's session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents).


Though TF-IDF is standard IR scoring it probably is not enough for your application. There's a lot of domain knowledge that you can encode in your ranking:

  • novelty factors - e.g. the number of ratings or the standard deviation of ratings - rank controversial items on top, which might be more interesting than just having the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually by an external factor - e.g. popularity by association like categories that are more popular than others or items that are more popular depending on the time of day or year


There are a few sledgehammers people usually think of that can turn against you really badly: Say you rank by novelty only - that is, you sort by date. The counterexample given was the case of the AOL-Time Warner merger - being a big story, newspapers would post essays on it, do evaluations etc. However, articles only remotely related to it would also mention the case. So by the end of the week, when searching for it, you would find all those small, only remotely relevant articles and have to wade through all of them to find the really big and important essay.

There are cases where it seems like recency is all that matters: Filter for only the most recent items and retry only in case of no results. The counterexample is the case where products are released just on a yearly basis but you set the filter to, say, a month. This way, up until May 31 your users will run into the retry branch and get a whole lot of results. However, when a new product comes out on June 1st, from that day onward the old results won't be reachable anymore - leading to a very weird experience for those of your users who saw yesterday's results.

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don't do that anymore; Solr does support sophisticated per-field and per-document boosting that makes such hacks superfluous.

Instead use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFreqAndPositions if the term frequency in a document does not really tell you much, e.g. in the case of very small documents.

With current versions of Solr you can use your own custom scoring per field. In addition a few implementations are shipped that come with options for tweaking - like for instance the SweetSpotSimilarity, where you can tell the scorer that up to a certain length no length penalisation should happen.

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There's even an external file field option that allows you to load your scoring value per document or category from an external file which can be updated on a much more frequent basis than you would want to re-index all documents in your Solr. For those suffering from the "for business reasons this document must come first no matter what the algorithm says" syndrome - there's a query elevation component for very fine-grained tuning of rankings per query. Keep in mind though that this can easily turn into a maintenance nightmare. However it can be handy when quickly fixing a highly valuable business use case: With that component it is possible to explicitly exclude documents from matching and to set precisely where to rank individual documents.
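
A small SolrJ sketch of field weighting and a boost function with edismax - the core name and the title, description and popularity fields are assumptions for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class EdismaxBoostSketch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/shop");

            SolrQuery query = new SolrQuery("noise cancelling headphones");
            query.set("defType", "edismax");
            // Weight matches in the title three times higher than in the description.
            query.set("qf", "title^3 description");
            // Mix a domain signal into the score - here a popularity count, which could
            // just as well come from an external file field updated out of band.
            query.set("bf", "log(sum(popularity,1))");

            System.out.println(solr.query(query).getResults().getNumFound());
        }
    }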

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes, Mahout can help you with personalisation and recommendation - but there is some low-hanging fruit to grab first:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the os and browser they use can help to identify preferences.


All of this information is relatively cheap to come by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on category facet used before do boosting in the next search assuming that the two queries are related.


Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: If someone clicked on a price range facet, then on the next related query do not hide other prices but boost those that fall into the previously chosen range.

One side effect to keep in mind: Your cache hit rate will go down once you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

ApacheConEU - part 08

2012-11-17 20:53
Jan Lehnardt's talk covered the history of CouchDB - including lessons learnt along the way. The first issue he went into: Shipping 1.0 is hard! They spent a lot of effort and time on having a stable database that won't lose your data - only to have a poor patch slip into 1.0 that resulted in data loss. The flurry of action happening afterwards was truly amazing - people working rolling shifts all over the planet to not only fix the issue but also provide recovery tooling for those affected by the bug. The lessons learnt from that are as obvious as they are often neglected: Both test coverage and code review are crucial for any software project.

The second topic Jan went into was the distraction and tension that come from having a company built around your favourite open source project. When going down this road keep in mind that the whole VC setup usually is very time consuming - the world starts revolving around the need to either gather more VC funding or come up with a successful business case to support your company. All of this results in less time spent coding and friction around the fact that the corporate interests may not always be what is best for your open source project. In CouchDB the result was the blow-up of the project founder, who eventually left the project. This hit CouchDB particularly badly as the project essentially was built around the idea of the one brilliant coder and relied on his information channels for marketing. The lesson learnt was that having communications centralised that way can easily turn against you - don't trust your benevolent dictator.

Usually it is quite ok for users to move on - in particular if the project no longer fits their needs. However, having multiple key people leave at the same time can be detrimental, in particular if they are the vocal ones. In terms of lessons learnt: Embrace the fact that your software will fail some people. Use the resulting knowledge about your application's boundaries - or fix what failed them.

In terms of general advice: The world moved on after each of these incidents. What does help is to ship what users need instead of running after the next big hype. Also, good ideas will stick - using JSON as a format and JS for query formulation did make it into many other applications, with the former also making it into the next SQL standard to be released in 2015. The goal should be to build stuff that is easy (and fun) to use.

In the meantime CouchDB grew up. Not only does it have another release and a new web site - it has turned into a project that is no longer pushed forward by a single person but moves on its own. The secret behind that development is to acknowledge that having just a few people in the leading position will burn them out - make sure to enable others so that your strong leaders still get to lead. Oh, and like any Apache project, CouchDB is happy about every new contributor joining the project.

When it comes to communication the Apache incubation process made sure to burn the "everything happens on the mailing list" mantra into their minds. Still, IRC was a valuable communication channel for non-decision stuff like user support and community building. IRC is fun - in particular when you can train IRC bots based on earlier communication to automatically answer incoming user questions.

Another option CouchDB used to fix the community issues was to meet with people face-to-face - for three days in Boston, later in Dublin, then in Vienna. In addition they added a roadmap for the next two to three years including points like:

  • faster releases - they switched to time based instead of feature based releases except for security patches
  • they are the first to use git@apache to make branching and merging easier
  • they are github lovers with pull requests ending up on their dev list
  • they enabled an Erlang beginners' questions list in order to be able to recruit new contributors in a world lacking Erlang developers. A very specific result of that was that people are much more comfortable asking even simple questions - and on a more practical note, one question about the bird's-eye view of CouchDB resulted in Jan spending an hour and a half drawing up that particular picture: Spending an hour on docs to reach really new people is time well spent.


In terms of PMC chair lessons learnt: The goal should be to get the right people to care about the right thing. Having people finish stuff helps - and is infectious.

In the end, as an open source project your biggest asset is your community. Motivating more people to join is key. If JIRA is one step too many for your target audience, talk to infra to figure out how to make things better (and help them with the solutions).

What is fascinating about CouchDB is the whole ecosystem around the project. CouchDB is not just a database project hosted at Apache. It comes with a really well working replication API. There are implementations in JS running in browsers, there's BigCouch (Dynamo in Erlang on top of CouchDB), there is an iOS app, there is PouchDB (the couch for your pocket), and TouchDB (iOS and Android implementations on top of SQLite). The fun part to watch is that the idea is bigger than the project at Apache. The bigger the ecosystem the better for the community - there's no need to fold everything into the original project.

And of course also CouchDB is hiring.

ApacheConEU - part 07

2012-11-16 20:51
Julien Nioche shared some details on the nutch crawler. Being the mother of all Hadoop projects (as in: Hadoop was born out of developments inside of nutch), the project has become rather quiet, though with a steady stream of development in the recent past. Julien himself uses nutch for gathering crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth that glues Mahout, UIMA and Gate together.

The basic crawling steps, including building the web graph, computing a link-based ranking and indexing, are still the same as when I last looked at the project - just that for indexing the project now uses Solr instead of its own Lucene-based solution.

The main advantage of nutch is its pluggability: the protocol parser, html filter, url filter and url normaliser can all be exchanged for your own implementations.

In their 2.0 version they moved away from using their own plain HDFS storage to a table schema - mapped to the real database through Gora, an abstraction layer to connect to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adapted to your needs. The advantages are obvious: Though still distributed, this approach is much easier and simpler in terms of logic kept in nutch itself. Also it is easier for third parties to connect to the data - all you need is the schema as well as Gora. The current disadvantage lies in its configuration overhead and instability compared to the old solution. Most likely at least the latter will go away as version 2.0 stabilises.

In terms of future work the project focuses on stabilisation and synchronising features of version 1.x and 2.x (link ranking is only available in version 1.x while support for Elastic Search is only available in version 2.x). In terms of functionality the goal is to move to Solr Cloud, support sitemaps (as implemented by crawler-commons) and more (pluggable?) indexers.

The goal is to delegate implementations to other projects - as was already done for Tika and Solr. Most likely this will also happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, graph processing code and others.


The next talk in the Solr/Lucene track dealt with scaling Solr to big data. The goal of the speaker was to index 100 million documents - with the number of documents expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project, they started building lots of their glue code and components in bash scripts: There were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. In the long run they slowly replaced their custom software with standard components (Tika for content extraction, Puppet for deployment etc.).

They quickly learnt to value property files in order to easily reconfigure the system even in production (relying heavily on bash, XML was of course not an option). One problem this came in handy for was adjusting the sharding configuration - going from simple random sharding, to old vs. new, to monthly shards, they could optimise the configuration to their load. What worked well for them was to separate JVM startup from Solr core startup - they would start with empty Solrs and only point them to the data directories after verifying that the JVMs had booted up correctly.

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their solrs to multiple separate machines.

In terms of ingestion pipelines: Theirs is currently based on an indexing directory convention, moving the index as soon as all data is ingested. The advantage here is the atomicity of mv that can be used - the disadvantage is having to move around lots of data. Their goal is to go for HDFS for indexing soon and ZooKeeper for configuration handling.
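
A minimal sketch of the underlying trick with java.nio - the directory names are made up, and the move only behaves like a cheap rename when staging and live directory sit on the same filesystem:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class AtomicIndexSwap {
        public static void main(String[] args) throws Exception {
            // Staging directory that gets filled during indexing and the live
            // directory the search cores point at (hypothetical paths).
            Path staging = Paths.get("/data/solr/index-staging");
            Path live = Paths.get("/data/solr/index-live");

            // In practice the old live directory would be moved aside first;
            // the move itself is then a rename, so readers either see the old
            // index or the complete new one - never a half-written directory.
            Files.move(staging, live, StandardCopyOption.ATOMIC_MOVE);
        }
    }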

In terms of testing: In contrast to having a dev-test-production environment, their goal is to have an elastic compute cloud that can be adjusted as needed. Though EC2 itself is very cost-intensive and poses additional problems with firewalling and moving data, cloud computing could still be a solution for them - in particular given projects like CloudStack or open cloud offerings. The goal would be to do cycle scavenging in their own datacenter: do heavy computations when there is not a lot of load on the system and turn those analyses off in case of incoming traffic.

When it comes to testing and monitoring changes they made good experiences with using JConsole (connecting it to several solrs at once through a simple ip discovery script) and solr meter as a performance debugging tool.

Some implementation details: They used Solr as some sort of NoSQL cache as well (for thousands of queries/s), push the schema definition to Solr from the app, and have common fields plus the option for developers to add custom fields in the app. Their experience is to not do expensive stuff in Solr but to move that outside - this applies in particular to content extraction. For storage they used an Avro-based binary file format (mainly in order to save storage space and have a versioned schema as well as fast compression and decompression). They are starting to use Tika as their pipeline and for automatic content detection, scaling up with Behemoth. They learnt to upgrade indexes without re-indexing using the Lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart them later.

ApacheConEU - part 06

2012-11-15 20:48
For the next session I joined the Tomcat crowd in Mark Thomas' talk to learn more about Tomcat reverse proxy configurations. One rather common setup is to have Tomcat connected to an httpd instance. One common issue encountered with this setup, in particular when running httpd with the event mpm, is thread exhaustion on Tomcat's side. Fixes include always having more active Tomcat threads than there can be httpd threads at any one time, and disabling persistent connections. Keep in mind that Tomcat performance does not degrade gracefully here - in case of thread exhaustion it just goes downhill very quickly. One famous example here was an issue with the ASF JIRA: After weeks of debugging bad performance - blaming the hardware, the OS, the JVM, the Java implementation in general - finally the number of threads was increased, resulting in a smoothly running system...

Another common configuration problem is to rename the deployed web application war - for instance in order to keep the version number in the war name itself - and change the path on httpd's side. This is bad for at least four reasons:

  • redirects will fail - you can configure ProxyPassReverse which will fix some issues but does not affect all http headers
  • cookie paths break - you can configure ProxyPassReverseCookiePath here
  • links that are generated in the web app will fail - you can use mod_sed, mod_substitute or mod_proxy_html to fix that - however those configurations tend to become messy and are error prone
  • custom headers usually also break


If the only reason for doing such a thing is to keep the version number in the file name, it might be an option to use "foo##bar.1.2.3" as the filename - Tomcat will ignore everything after the hash signs when deriving the context path.

When proxying traffic make sure to inform Tomcat about https termination in order to correctly handle secure cookies and sessions. This is done automatically with mod_jk and mod_proxy_ajp; plain mod_proxy needs some more manual work. When dealing with virtual hosting make sure to use ProxyPreserveHost in order to be able to switch hosts on Tomcat's side.


ApacheConEU - part 05

2012-11-14 20:47
The afternoon featured several talks on HBase - both its implementation and schema optimisation. One major issue in schema design is the choice of row key. The simplest recommendation is to make sure that keys are designed such that on reads the load will be evenly distributed across all nodes to prevent region-server hot-spotting. General advice here is to hash keys or to reverse URLs.
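
A small sketch of both ideas - plain Java without the HBase client API, and the key layouts are just illustrative assumptions:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RowKeySketch {

        // Prefix the natural key with a short hash so that otherwise sequential keys
        // spread evenly across regions instead of hammering a single region server.
        static String saltedKey(String naturalKey) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x-%s", digest[0] & 0xff, naturalKey);
        }

        // Reverse the host part of a URL: pages of one site stay close together
        // while keys no longer all start with the same "www." prefix.
        static String reversedDomainKey(String host, String path) {
            String[] parts = host.split("\\.");
            StringBuilder reversed = new StringBuilder();
            for (int i = parts.length - 1; i >= 0; i--) {
                reversed.append(parts[i]);
                if (i > 0) {
                    reversed.append('.');
                }
            }
            return reversed + path;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(saltedKey("http://www.example.com/page/1"));
            System.out.println(reversedDomainKey("www.example.com", "/page/1"));
        }
    }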

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • HBase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.


When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale nothing that was tested on small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it's crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It's important to correlate the metrics.
  • key- and schema design are key.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it just will take time to recover: Even if there is a backup, even just transferring the data back can take hours and days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.



In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it's best to think of HBase as a large distributed hash table - all data except for names are stored as byte arrays (with helpers to transform that data back into well known data types provided). The tables themselves are stored in a sparse format which means that null values essentially come for free.

On a more technical level the project uses ZooKeeper for master election, split tracking and state tracking. The architecture itself is based on log-structured merge trees. All data initially ends up in the write-ahead log - with data always being appended to the end, this log can be written and read very efficiently without any seek penalty. The data is then inserted in memory at the respective region server (the memstore, typically 64 MB in size) and flushed to disk at regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really deleting it. On a minor compaction the few most recent files are merged. On a major compaction all files are merged and deletes are actually handled. Triggering your own major compactions is possible as well.

In terms of performance, lookup by key is the best you can do. If you do lookup by value, this will result in a full-table scan. There is an option to give HBase a hint as to where to find the key when it is updated only infrequently - you can provide a timestamp of roughly where to look for it. Also there are options to use Bloom filters for better read performance. Another option is to move more data into the row key itself if that is the data you will be searching for very often. Make sure to de-normalise your data, as HBase does not do any sophisticated joins; there will be duplication in your data as everything should be pre-joined to get better performance. Have intelligent keys that match your read/write patterns. Also make sure to keep your keys reasonably short as they are repeated for each entry - so moving all the data into the key isn't going to get you anything.
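
A minimal sketch of the difference using the HBase client API of that era - the table, column family and qualifier names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LookupSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "pages"); // hypothetical table

            // Fast path: lookup by row key goes straight to the responsible region.
            Result row = table.get(new Get(Bytes.toBytes("com.example/page/1")));
            System.out.println(row);

            // Slow path: finding rows by value means scanning; without a row key
            // range this touches every region - a full-table scan.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                // filter on the value client side (or use a server-side filter)
            }
            scanner.close();
            table.close();
        }
    }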

Speaking of read write patterns - as a general rule of thumb: to optimise for better write performance tune the memstore size. For better read performance tune the block cache size. One final hint: Anything below 8 servers really is just a test setup as it doesn't get you any real advantages.