ApacheCon EU - part 09

2012-11-18 20:54
In the Solr track Elastic Search and Solr Cloud went into competition. The comparison itself was slightly apples-and-oranges like as the speaker compared the current ES version based on Lucene 3.x and Solr Cloud based on Lucene 4.0. During the comparison it still turned out that both solutions are more or less comparable - so choice again depends on your application. However I did like the conclusion: The speaker did not pick a clear winner in terms of projects. However he did have another clear winner: The user community will benefit from there being two projects as this kind of felt competition did speed up development already considerably.

The day finished with hoss' Stump the Chump session: The audience was asked to submit questions before the session, the jury was than asked to pick the winning question that stumped Hoss the most.

Some interesting bits from that question: One guy had the problem of having to provide somewhat diverse results in terms e.g. manufacturers in his online shop. There are a few tricks to deal with this problem: a) clean your data - don't have items that use keyword spamming side by side with regular entries. Assuming this is done you could b) use grouping to collapse items from the same manufacturer and let the user drill deeper from there. Also using c) a secondary sorting value can help - one hint: Solr ships with a random value out of the box for such cases.

For me the last day started with hossman's session on boosting and scoring tricks with Solr - including a cute reference for explaining TF-IDF ranking to people (see also a message tweeted earlier for an explanation of what a picture taken during my wedding has to do with ranking documents):

Share photos on twitter with Twitpic

Though TF-IDF is standard IR scoring it probably is not enough for your application. There's a lot of domain knowledge that you can encode in your ranking:

  • novelty factors - number of ratings/ standard deviation of ratings - ranks controversial items on top that might be more interesting than just having the stuff that everyone loves
  • scarcity - people like buying what is nearly sold out
  • profit margin
  • create your score manually by an external factor - e.g. popularity by association like categories that are more popular than others or items that are more popular depending on the time of day or year

There are a few sledge hammers people usually think of that can turn against you really badly: Say you rank by novelty only - that is you sort by date. The counter example given was the case of the AOL-Time Warner merger - being a big story news papers would post essays on it, do evaluations etc. However also articles only remotely related to it would mention the case. So be the end of the week when searching for it you would find all those little only remotely relevant articles and have to search through all of them to find the really big and important essay.

There are cases where it seems like only recency is all that matters: Filter only for the most recent items and re-try only in case of no results. The counter example is the case where products are released just on a yearly basis but you set the filter to say a month. This way up until May 31 your users will run into the retry branch and get a whole lot of results. However when a new product comes out on June first from that day onward the old results won't be reachable anymore - leading to a very weird experience for those of your users who saw yesterday's results.

There were times when scoring was influenced by keyword stuffing to emulate higher scores - don't do that anymore, Solr does support sophisticated per field and document boosting that make such hacks superfluous.

Instead rather use edismax for weighting fields. Some hints on that one: configure omitNorms in order to avoid having keyword stuffing influence your ranking. Configure omitTermFrequencyAndPosition if the term frequency in any document does not really tell you much e.g. in case of small documents only.

With current versions of Solr you can use your custom scoring per field. In addition a few ones are shipped that come with options for tweaking - like for instance the sweetSpotSimilarity wher you can tell the scorer that up to a certain length no length penalisation should happen.

Create your own boost functions that in addition to TF-IDF rely on rating values, click rates, prices or category influences. There's even an external file field option to allow you to load your scoring value per document or category from an external file that can be updated on a much more frequent basis than you would otherwise want to re-index all documents in your solr. For those suffering from the "For business reasons this document must come first no matter what the algorithm says" syndrom - there's a query elevation component for very fine grained tuning of rankings per query. Keep in mind so that this can easily turn into a maintanance nightmare. However it can be handy when quickly fixing a highly valuable business based use case: With that component it is possible to explicitly exclude documents from matching and precisely setting where to rank individual documents.

When it comes to user analytics and personalisation many people think of highly sophisticated algorithms that need lots of data to be trained. Yes Mahout can help you with personalisation and recommendation - but there are a few low hanging fruits to grab before:

  • Use the history of registered users or those you can identify through cookies - track the keywords they are looking for, the sort and filter functions commonly used.
  • Bucket people by explicit or implicit demographics.
  • Even just grouping people by the os and browser they use can help to identify preferences.

All of this information is relatively cheap to get by and can be used in many creative ways:

  • Provide default sort and filter functions for returning users.
  • Filter on the current query but when scoring take the older query into account.
  • Based on category facet used before do boosting in the next search assuming that the two queries are related.

Essentially the goal is to identify three factors for your users: What is their preference, what is the differentiator and what is your confidence in your estimation.

Another option could be to use the SweetSpotPlateau: If someone clicked on a price range facet on the next related query do not hide other prices but boost those that are in the previous facet.

One side effect to keep in mind: Your cache hit rate will go down now you are tailoring your results to individual users or user groups.

Biggest news for Lucene and Solr was the release of Lucene 4 - find more details online in an article published recently.

ApacheConEU - part 08

2012-11-17 20:53
Jan Lehnardt's talk covered the history of CouchDB - including lessons learnt along the way. The first issue he went into: Shipping 1.0 is hard! They spent a lot of effort and time in order to have a stable database that won't loose your data - only to have a poorly patch slip in for 1.0 that resulted in data loss. The fury of action happening afterwards was truely amazing - people working on rolling shifts all over the planet to not only fix the issue but also provide recovery tooling for those affected by the bug. The lessons learnt form that are as obvious as they are often neglected: Both test coverage as well as code review are crucial for any software project.

The second topic Jan went into was the disctraction and tension that comes from having a company built around your favourite open source project. When going down this road keep in mind that the whole VC setup usually is very time consuming - the world starts revolving around the need to either gather more VC funding or make up a successful business case to support your company. All of this results in less time spent coding, friction around the fact that the corporate interests may not always be what is best for your open source project. In CouchDB the result was the explosion of the project founder who eventually left the project. This hit CouchDB particularly badly as the project essentially was built around the idea of the one brilliant coder, relied on his information channels for marketing. The lesson learnt was that having communications centralised that way can easily turn against you - don't trust your benevolent dictator.

Usually it is quite ok for users to move on - in particular if the project does no longer fit their needs. However having multiple key people leave at the same time can be detrimental, in particular if they are the vocal ones. In terms of lessons learnt: Embrace the fact that people will fail your software. Use the resulting knowledge about your application boundaries - or fix what failed them.

In terms of general advise: The world moved on after any of these cases. What does help is to ship what users need instead of running after the next big hype. Also good ideas will stick - using json as format and js for query formulation did make it into many other applications with the former also making it into the next SQL standard to be released in 2015. The goal should be to build stuff that is easy (and fun) to use.

In the mean time CouchDB grew up. Not only does it have another release and a new web site. It has turned into a project that is no longer a thing pushed forward by a single person but that moves on its own. The secret behind that development is to acknowledge that having just few people in the leading position will burn them out - make sure to enable others and that your strong leaders to get to lead. Oh and as any Apache project also CouchDB is happy about any new contributor joining the project.

When it comes to communication the Apache incubation process made sure to burn the "everything happens on the mailing list" mantra into their mind. Still IRC was a valuable way of communication for non-decision stuff like user support and community building. IRC is fun - in particular when you can train irc bots based on earlier communication to automatically answer incoming user questions.

Another option CouchDB used to fix the community issues was to meet with people face-2-face - for three days in Boston, later in Dublin, later in Vienna. In addition they added a roadmap for the next 2 to 3 years including points like:

  • faster releases - they switched to time based instead of feature based releases except for security patches
  • they are the first to use git@apache to make branching and merging easier
  • they are github lovers with pull requests ending up on their dev list
  • they enabled a Erlang beginners question list in order to be able to recruit new contributors in a world of lacking Erlang developers. A very specific result of that was that people are much more comfortable even asking simple question - and on a more practical note one question for the birds eye view of couchdb resulted in Jan spending an hour and a half drawing up that particular picture: Spending an hour on docs to get to really new people is time well spend.

In terms of PMC chair lessons learnt: The goal should be to get the right people to care about the right thing. Having people finish stuff helps - and is infectious.

In the end as an open source project your biggest asset is your community. Motivating more people to join is key. If for your target audience JIRA is one step too much talk to infra to figure out how to make things better (and help them with the solutions).

What is fascinating about CouchDB is the whole ecosystem around the project. CouchDB is not just a database project hosted at Apache. It comes with a really well working replication API. There are implementations in js running in Browsers, there's BigCouch (dynamo in Erlang on top of CouchDB), there is an iOS app, there is PouchDB (the couch for your pocket), TouchDB (iOS and android implementations on top of sqlLight). The fun part to watch is that the idea is bigger than the project at Apache. The bigger the ecosystem the better for the community - there's no need to fold everything into the original project.

And of course also CouchDB is hiring.