Systemd - FOSDEM 06

2013-02-18 20:45
As sort of a “go out of your comfort zone and discover new stuff” exercise I went to the “systemd – two years later” talk next. It's just plain amazing to see a machine boot in roughly one second (not counting the 7s that the BIOS needs for initialization). The whole project started as an init-only project but has since grown to serve a much larger purpose: an init platform for everything from mobile, embedded and desktop devices to servers. Many of its features were simply overdue across the board.

Essentially the event-based system brings together what was split and duplicated before in things like console-kit, sysVinit, initscripts, inetd, pm-utils, acpid, syslog, watchdog services, cgrulesd, cron and atd. It supports event-based container spawning, suspending and shutdown, which opens up whole new opportunities for optimisation. In addition, for the first time in the history of Linux there is the possibility of grouped resource management: Instead of having nice levels bound to processes you can now group services into cgroups and give them guaranteed resources (which makes resource management of e.g. multiple Apache processes plus some MySQL instances all running on the same machine so much easier).

(Post kindly proof-read and corrected by Thilo Fromm)

Notes on storage options - FOSDEM 05

2013-02-17 20:43

Second day at FOSDEM for me started with the MySQL dev room. One thing that made me smile was in the MySQL new features talk: The speaker announced support for “NoSQL interfaces” to MySQL. That is kind of fun in two dimensions: A) What he really means is support for the memcached interface. Given the vast number of different interfaces to databases today, announcing anything as “supports NoSQL interfaces” sounds kind of silly. B) Many databases refrain from supporting SQL not because they think SQL is inferior to their own interface but because they sacrifice SQL compliance for better performance, Hadoop integration, scaling properties or other benefits. Given that, this really seems to turn the world upside-down.

As for new features – the new MySQL release improved the query optimiser and subquery support. When it comes to replication there were improvements along the lines of performance (multi-threaded slaves etc.), data integrity (replication checksums being computed, propagated and checked), agility (support for time-delayed replication), failover and recovery.

There were also improvements along the lines of performance schemata, security and workbench features. The goal is to be the go-to database for small and growing businesses on the web.

After that I joined the systemd in Debian talk. Looking forward to systemd support in my next Debian version.

HBase optimisation notes

Lars George's talk on HBase performance was pretty much packed, like all the other NoSQL talks (and really also those in the community/marketing and legal dev rooms).

Lars started by explaining that by default HBase is configured to reserve 40% of the JVM heap for the in-memory stores (memstores) that buffer writes and 20% for the block cache that speeds up reads, leaving the rest as breathing room.

On read HBase will first locate the correct region server and route the request accordingly – this information is cached on the client side for faster access. Prefetching on boot-up is possible to save a few milliseconds on first requests. To touch as few files as possible when fetching, Bloom filters and time ranges are used. In addition the block cache is queried to avoid going to disk entirely. A hint: Leave as much space as possible for the OS file cache for faster access. When monitoring reads make sure to check the metrics exported by HBase, e.g. by tracking them over time in Ganglia.

The cluster size will determine your write performance: HBase files are so-called log-structured merge trees. Writes are first stored in memory and in the so-called Write-Ahead-Log (WAL, stored and as a result replicated on HDFS). This information is flushed to disk periodically, either when there are too many log files around or when the system comes under memory pressure. WAL files without pending edits are discarded.

HBase files are written in an append-only fashion. Regular compactions make sure that deleted records are actually removed.
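The write path described above can be sketched as a toy in-memory model. This is only an illustration of the log-structured idea, not HBase's implementation: the class name, the entry-count flush threshold and the tombstone marker are all made up for this example.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a delete record


class ToyStore:
    """Toy log-structured store: WAL + memstore + append-only sorted files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumption: counted in entries, not bytes
        self.wal = []        # pending edits, kept for durability until flushed
        self.memstore = {}   # in-memory buffer of recent writes
        self.files = []      # flushed files, oldest first, each a sorted list of (key, value)

    def put(self, key, value):
        self.wal.append((key, value))   # log the edit first
        self.memstore[key] = value      # then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        # Files are append-only, so a delete is just another write.
        self.put(key, TOMBSTONE)

    def flush(self):
        if not self.memstore:
            return
        self.files.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []   # no pending edits left, the log can be discarded

    def get(self, key):
        # Newest data wins: memstore first, then files from newest to oldest.
        layers = [self.memstore] + [dict(f) for f in reversed(self.files)]
        for layer in layers:
            if key in layer:
                value = layer[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all files into one and drop deleted records for good.
        merged = {}
        for f in self.files:
            merged.update(f)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.files = [live] if live else []


store = ToyStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)      # second entry triggers a flush
store.delete("a")
store.put("c", 3)      # triggers another flush
store.compact()
print(store.files)
```

Note how the deleted record only disappears from the files once the compaction has run.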

In general the WAL file size is configured to be 64 to 128 MB. In addition only 32 log files are permitted before a flush is forced. In periods of high write load this file size or number of log files can be too small. That is detrimental in particular as writes sync across all stores, so large cells in one family will cause a lot of writes.
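To get a feeling for these limits, here is a small back-of-the-envelope calculation. The function names and the 20 MB/s write rate are assumptions chosen purely for illustration; only the 64 to 128 MB file size and the limit of 32 files come from the talk.

```python
def wal_budget_mb(log_size_mb, max_logs):
    """Amount of edits (in MB) the WAL can hold before a flush is forced."""
    return log_size_mb * max_logs


def seconds_until_forced_flush(log_size_mb, max_logs, write_rate_mb_s):
    """Time it takes a steady write stream to exhaust the WAL budget."""
    return wal_budget_mb(log_size_mb, max_logs) / write_rate_mb_s


for size_mb in (64, 128):
    budget = wal_budget_mb(size_mb, 32)
    # 20 MB/s is a hypothetical sustained write rate per region server
    seconds = seconds_until_forced_flush(size_mb, 32, 20)
    print(f"{size_mb} MB x 32 logs = {budget} MB, full after {seconds:.0f}s at 20 MB/s")
```

Under these assumptions a region server has somewhere between 2 and 4 GB of log space, which a busy writer can fill within a couple of minutes.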

Bypassing the WAL is possible though not recommended as it is the only source for durability there is. It may make sense on derived columns that can easily be re-created in a co-processor on crash.

Too small WAL sizes can lead to compaction storms happening on your cluster: Many small files then have to be merged sequentially into one large file. Keep in mind that flushes happen across column families even if just one family triggers.

Some handy numbers to have when computing the write performance of your cluster and sizing the HBase configuration for your use case: HDFS has an expected 35 to 50 MB/s throughput. Given different cell sizes, this is how that number translates to HBase write performance:

Cell size   OPS
0.5 MB      70-100
100 kB      250-500
10 kB       800 (less than expected, as HBase is not optimised for these sizes)
1 kB        6000 (see above)
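The upper rows of this table follow from simple division, which a few lines of Python make explicit. This is only a sketch under the assumption that throughput is the sole limit and that 1 MB equals 1000 kB; the function name is made up.

```python
def expected_ops(throughput_mb_s, cell_size_kb):
    """Writes per second if HDFS throughput were the only limiting factor."""
    return throughput_mb_s * 1000 / cell_size_kb


for cell_size_kb in (500, 100, 10, 1):
    low = expected_ops(35, cell_size_kb)
    high = expected_ops(50, cell_size_kb)
    print(f"{cell_size_kb} kB cells: {low:.0f} to {high:.0f} ops/s in theory")
```

For 0.5 MB cells this reproduces the 70 to 100 operations per second from the table. For 100 kB the same division gives 350 as the lower bound, whereas my notes say 250. For 10 kB and 1 kB cells the theoretical numbers are several times higher than the 800 and 6000 that were reported, which matches the remark that HBase is not optimised for such small cells.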

As a general rule of thumb: Have your memstore size be driven by the number of regions and the flush size. Have the number of allowed WAL logs before flush be driven by fill and flush rates. The capacity of your cluster is driven by the JVM heap, region count and size, and key distribution (check the talks on HBase schema design). There might be ways to get rid of the Java heap restriction through off-heap memory, however that is not yet implemented.

Keep enough and large enough WAL logs, do not oversubscribe the memstore space, keep the flush size in the right boundaries, check WAL usage on your cluster. Use Ganglia for cluster monitoring. Enable compression, tweak the compaction algorithm to peg background I/O, keep uneven families in separate tables, watch the metrics for blockcache and memstore.

Affero GPL Panel discussion - FOSDEM 04

2013-02-16 20:41
The panel started with a bit of history of the AGPL: Born in the age of growing ASP (application service provider) businesses, the AGPL tried to fix the hosting loophole in the GPL in the early 2000s. More than ten years later it turns out the license hasn't quite gained traction: On the one hand the license does have a few wording issues. In addition it is still rather young and used by few, so there is less trust than with the GPL or ASL that it will hold up when put on trial. However there's another reason for low adoption:

Those that are being targeted with the license – people developing web services – tend to prefer permissive licenses over copyleft ones (see Django, Rails for example). People are still in the position of trying to gain strong positions when opening up their infrastructure. As a result there is a general preference for permissive licenses. Also there are many more people working on open source not as their hobby project but as their general day job. As a result the number of people backing projects that are infrastructure only, company driven and trying to establish de-facto standards through the availability of free software is growing.

Depressing for the founders of AGPL are businesses using the AGPL to try and trick corporations into using their software as open source and later go after them with additional clauses in their terms and conditions to enforce subscription based services.

Mozilla legal issues - FOSDEM 03

2013-02-15 22:39
In the next talk Gervase Markham talked about his experience working for Mozilla on legal and license questions. First the speaker summarized what kind of requests he gets most:

  • There are lots of technical support requests.
  • Next on the list is the question of whether or not shipping Mozilla with a set of modifications is ok.
  • Next is an internal question, namely: Can I use this code?
  • Related to that is the “We have a release in two weeks, can we ship with this code?”
  • Another task is finding code that was used but is not ok.
  • Yet another one is getting code licensed or re-licensed.
  • Maintaining the about:license page is another task.
  • Dealing with ECCN/CCATS requests is another issue that comes up often.

However there are also bigger tasks: There was the goal of tri-licensing Mozilla. The only issue was the fact that they had accumulated enough individually copyrighted contributions to make that task really tricky. In the end they wrote a tool to pull out all contributor names and sent them mails asking for permission to tri-license. After a little over three years they had responses from all but 30 contributors. As a result the “find this hacker” campaign was launched on /. and other news sites. In the end most could be found.

As another step towards easier licensing the MPL 2 was introduced to replace 1.1 – it fixes GPL/ASL license incompatibilities, notification and distribution requirements, the difference for initial developers and the use of conditional/Jacobsen language.

There are still a few issues with source files lacking license headers. The general advice, which has never been tested in court, is the concept of license bleeding: If there are files with and without license headers in one folder, most likely those w/o have the same license as those with. “aehem” ;)

There are lots of questions on license interpretation. This includes questions from people wanting to use Mozilla licensed software that wasn't even developed within the Mozilla foundation. Also there are lots of people who do not understand the concept of “free does not mean non-commercial use only”.

Sometimes there is a license archeology task where people ask “hey, is that old code yours and is it under the Mozilla license?”

Another interesting case was a big, completely unknown blue company asking whether the hunspell module, having changed licenses so often (from BSD forked to GPL, changed to LGPL, to CC-Attr, to the tri license of Mozilla, including changed GPL stuff with the author's permission) really can be distributed by Mozilla under the MPL. After lots of digging through commit logs and change logs they could indeed verify that the code is completely clean.

Then there was the case of Firefox OS which was a fast development effort, involving copying lots of stuff from all over the internet just to get things running. A custom license scanner was finally written to verify all bits and pieces and used to give clearance on release. It found dozens of distinct versions of the Mozilla and BSD licenses (mainly due to the fact that people are invited to add their own name to it when releasing code). As a result there now is a discussion at OSI on discouraging that behaviour in order to keep the number of individual license files to ship with the software to a minimum.
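To illustrate why personalised licenses multiply so quickly, here is a minimal sketch of counting distinct license texts by hash. It is not Mozilla's scanner, whose internals were not shown. The function names are made up, and the sketch assumes that personalisation only happens on lines starting with "Copyright".

```python
import hashlib
import re


def normalise(license_text):
    """Drop copyright lines, collapse whitespace and lowercase the rest."""
    lines = [
        line for line in license_text.splitlines()
        if not re.match(r"\s*copyright\b", line, re.IGNORECASE)
    ]
    return " ".join(" ".join(lines).split()).lower()


def distinct_variants(texts, normalised=False):
    """Count how many different license texts there are, judged by hash."""
    digests = set()
    for text in texts:
        if normalised:
            text = normalise(text)
        digests.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
    return len(digests)


licenses = [
    "Copyright (c) 2012 Alice\nPermission is hereby granted, free of charge...",
    "Copyright (c) 2013 Bob\nPermission  is hereby granted, free of charge...",
]
print(distinct_variants(licenses))                   # every name creates a new variant
print(distinct_variants(licenses, normalised=True))  # same license underneath
```

Compared byte by byte every contributor name yields a new license file to ship, even though the license itself is the same.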

The speaker's general recommendation on releasing small software projects under a non-copyleft license was to use the CC-0 license, for larger stuff his recommendation was to go for the ASL due to its patent grant clauses. Even at Mozilla quite a few projects have switched over to Apache.
There also were a few license puzzlers:

  • OpenJDK asked for permission to use their root-store certificates. Unfortunately at the time of receiving them they had not been given any sort of contract under which they may use them. *ahem*
  • The case with search engine icons … really isn't … so much different.

There also tend to be some questions on the Firefox/Mozilla trademarks ranging from

  • “can I use your logo for purpose such'n'such”?
  • “Do you have a 'best viewed in...' button”? - Nope, as we generally appreciate developers writing web sites that comply with web standards instead of optimizing for one single browser only.
  • They did run into the subscription on download scam trap and could stop those sites due to trademark infringement.
  • Most of this falls under fair use – especially cases like Pearson asking for permission (with a two-page mail + pdf letter) to link to the mozilla web site...

In general when people ask for permission although they do not need to: Tell them so but give them permission anyway. This is in order to avoid an “always-ask-for-permission” culture, and really to keep the number of requests down to those that are really necessary. One thing that does need prior permission though is shipping Firefox with a bunch of plugins pre-installed as a complete package.

On Patents – Mozilla does not really have any and spends time (e.g. on OPUS) avoiding them. On a related note there sometimes even are IPO requests.

Trademarks and OSS - FOSDEM 02

2013-02-14 20:38
So the first talk I went to ended up being in the legal dev room on trademarks in open source projects. The speaker had a background mainly in US American trademark law and quite some background when it comes to open source licenses.

To start Pamela first showed a graphic detailing the various types of trademarks: In the pool of generic names there is a large group of trademarks that are in use but not registered. The number of registered trademarks actually is rather small. The main goal of trademarks is to avoid confusing customers. This is best seen when thinking about scammers trying to trick users into downloading pre-built and packaged software from third party servers that demand a credit card number, which is later charged based on a subscription service the user signed up for by clicking away the fine print on the download page. The canonical example seems to be the Firefox/Mozilla project, which was affected by this kind of scam. But other end-user software (think Libre/Open Office, Gimp) could well be a target, too. Such deceptive web pages can usually be taken down way faster with a cease and desist letter due to trademark infringement than due to the fraud they commit.

So when selecting trademarks – what should a project look out for? One is the name should not be too generic as that would lead to a name that is not enforceable. It should not be too theme-y as the names that are themed usually are already taken. The time to research should be contrasted with the pain it will cost to rename the project in case of any difficulties.

There are few actual court decisions that relate to trademarks and OSS: In Germany it was decided that forking the ENIGMA project and putting it on set-top boxes but keeping the name was ok for as long as the core function would be kept and third party plugins would still work.

In the US there was a decision that keeping the name of re-furbished SparkPlugs is ok for as long as it is clearly marked what to expect when buying them (in this case re-furbished instead of newly made).

Another thing to keep in mind are naked trademarks – those that were not enforced and have become too ubiquitous. In the US that would be the naked license trademarks, in Greece the recycling mark “Der Grüne Punkt” has become too ubiquitous to be treated as a trademark any more.

Trademark law already fails in multinational corporation setups with world wide subsidiaries. It gets even worse with world wide distributed open source projects. The question of who owns the mark, who is allowed to enforce it and who exercises control gets harder the more development is distributed. When new people take over trademarks there should be some clear paper transferral document to avoid confusion.

Trademarks only deal with avoiding usage confusion: Using the mark when talking about it is completely fine. Phrases like “I'm $mark compatible”, “I'm running on top of $mark” are completely ok. However make sure to use as little as possible – there is no right to also just use the logos, icons or design forms of the project you are talking about – unless you are talking about said logo of course.

So to conclude: respect referential use, you can't exercise full control but should avoid exercising too little control.

A consistent understanding of and behaviour towards trademarks is missing in the open source community. Now is the time to shape the law according to what open source developers think they need.

FOSDEM 2013 - 01

2013-02-13 22:38
On Friday morning our train left for this year's FOSDEM. Though a bit longish I have a strong preference for going by train as this gives more time and opportunity for hacking (in my case trying out Elastic Search), reading (in my case the book “Team Geek”) and chatting with other FOSDEM visitors.

Monday morning was mostly busy with meeting people - at the FSFE, Debian, Apache Open Office booths, generally in the hallways. And with getting some coffee to the Beaglebone booth where my husband helped out. For really fun videos on the hardware they had there see:

If you want to get the hardware underneath, talk to circuitco.

Unfortunately I didn't make it to the community and marketing room – too full during the talks that I wanted to see (as a general shout-out to people attending conferences: If you do not find a seat, move into the room instead of standing right next to the door, if you do have a seat and a free one just next to you, move to the seat next to you).

If you missed some of the talks you might want to try your luck with the FOSDEM video archive - it's really extensive featuring videos taken at previous editions as well and is a great resource to find talks of the most important tracks.

Elastic Search meetup Berlin – January 2013

2013-02-01 18:34
The first meetup this year I went to started with a large bag of good news for Elastic Search users. In the offices of Sys Eleven (thanks for hosting) the meetup started at 7 p.m. last Tuesday. Simon Willnauer gave an overview of what to expect of the upcoming major release of Elastic Search:

In all 0.20.x versions ES features a shard allocator that is ignorant of which index a shard belongs to, of machine properties and of usage patterns. Especially ignoring index information can be detrimental and lead to all shards of one index ending up on one machine, causing hot spots in your cluster. Today this is solved by lots of manual intervention or even by using custom shard allocator implementations.

With the new release there will be an EvenShardCountAllocator that allows for balancing shards of indexes on machines – by default it will behave like the old allocator but can be configured to take weighted factors into account. The implementation will start with basic properties like “which index does this shard belong to” but the goal is to also make variables like remaining disk space available. To avoid constant re-allocation there is a threshold on the delta that has to be passed for re-allocation to kick in.
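The idea of weighted balancing with a threshold can be sketched in a few lines. This is not the Elastic Search implementation. The function names, the weight formula and the default factor of 0.5 are assumptions that only illustrate the principle described in the talk.

```python
from collections import Counter


def node_weight(shards, index, avg_total, avg_index, index_factor):
    """How far a node is above (positive) or below (negative) the cluster average."""
    total_term = len(shards) - avg_total
    index_term = Counter(shards)[index] - avg_index
    return (1 - index_factor) * total_term + index_factor * index_term


def plan_move(cluster, index, threshold=1.0, index_factor=0.5):
    """Return (source, target) for one shard of `index`, or None if balanced enough.

    `cluster` maps a node name to a list with one index name per shard it holds.
    """
    node_count = len(cluster)
    avg_total = sum(len(s) for s in cluster.values()) / node_count
    avg_index = sum(Counter(s)[index] for s in cluster.values()) / node_count
    weights = {
        node: node_weight(s, index, avg_total, avg_index, index_factor)
        for node, s in cluster.items()
    }
    # Only nodes that actually hold a shard of this index can give one away.
    sources = [node for node, s in cluster.items() if index in s]
    if not sources:
        return None
    source = max(sources, key=weights.get)
    target = min(weights, key=weights.get)
    # The threshold keeps small imbalances from causing constant re-allocation.
    if weights[source] - weights[target] <= threshold:
        return None
    return source, target


hot = {"node1": ["logs"] * 4, "node2": ["users"], "node3": ["users"]}
print(plan_move(hot, "logs"))

nearly_even = {"node1": ["logs", "logs"], "node2": ["logs"]}
print(plan_move(nearly_even, "logs"))
```

In the first cluster all shards of the logs index sit on one node, so a move is suggested. In the second the difference stays within the threshold and nothing is moved.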

0.21 will be released when Lucene 4.1 is integrated. That will bring new codecs, concurrent flushing (to avoid the stop-the-world flush during indexing that is used in anything below Lucene 4 – hint: Give less memory to your JVM in order to cause more frequent flushes), there will be compressed sort fields, spellchecking and suggest built into the search request (though unigram only). There will be one similarity configurable per field – that means you can switch from TF-IDF to alternative built-in scoring models or even build your own.

Speaking of rolling your own: There is a new interface for FieldData (used for faceting, scoring and sorting) to allow for specialised data structures and implementations per field. Also the default implementation will be much more memory efficient for most scenarios by using UTF-8 instead of UTF-16 characters.
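Where the memory saving comes from is easy to see when comparing the encoded size of a few terms. This only shows the raw encoding sizes for some example terms I picked, the actual FieldData structures are of course more involved.

```python
def encoded_sizes(term):
    """Return the size of a term in bytes as (UTF-8, UTF-16)."""
    utf8 = len(term.encode("utf-8"))
    utf16 = len(term.encode("utf-16-le"))  # the -le variant adds no byte order mark
    return utf8, utf16


for term in ("lucene", "elasticsearch", "Grüne"):
    utf8, utf16 = encoded_sizes(term)
    print(f"{term}: {utf8} bytes as UTF-8, {utf16} bytes as UTF-16")
```

For plain ASCII terms UTF-8 needs half the space. Characters like the umlaut take two bytes in both encodings, so the saving shrinks the further the text moves away from ASCII.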

As for GeoSpatial: The code came to Lucene as a code dump that the contributor wasn't willing to support or maintain. It was replaced by an implementation that wasn't that much better. However the community is about to take up the mess and turn it into something better.

After the talk the session essentially changed to an “interactive mailing list” setup where people would ask questions live and get answers both from other users as well as the developers. One example was the question of whether pyes can be recommended as a library. Most people had used it, many ran into issues when trying to run an upgrade with features being taken away or behaviour being changed without much notice. There are plans to release Perl, Ruby and Python clients. However also using JRuby, Groovy, Scala or Clojure to communicate with ES works well.

On the benefit of joining the cluster for requests: That saves one hop for routing and result merging, is an option to have a master w/o data and helps with indexing as the data doesn't go through an additional node.

As for plugins the next thing needed is an upgrade and versioning scheme. Concerning plugin reloading without restarting the cluster there was not much ambition to get that into the project from the ES side of things – there is just too much hassle when it comes to loading and unloading classes with references still hanging around to make that worthwhile.

Speaking of clients: When writing your own don't rely on the binary protocol. This is a private interface that can be subject to change at any time.

When dealing with AWS: The S3 gateway is not recommended as it is way too slow (and as a result very expensive). Rather back up with replicas, keep the data around for backup or use rsync. Backing up across regions is nothing that ES will help you with directly – rather send your data to both sites and index locally. One recommendation that came from the audience was to not try and use EBS as the IO optimised versions are just too expensive – it's much more cost effective to rely on ephemeral storage. Another thing to check out is the support for ES being zone aware to avoid having all shards in one availability zone. Also the node discovery timeout should be increased to at least one minute to work in AWS. When it comes to hosted solutions like Heroku you usually are too limited in what you can do with these offers compared to the low maintenance overhead of running your own cluster. Oh, and don't even think about index encryption if you want to have a fast index without spending hours and hours of development time on speeding your solution up with custom codecs and the like :)

Looking forward to the next Elastic Search meetup end of February – location still to be announced. It's always interesting to see such meetup groups grow (this time from roughly 15 in November to over 30 in January).

PS: A final shout-out to Hossman - that psychological trick you played on me at your boosting and biasing talk at Apache Con EU is slightly annoying: Every time someone mentions TF-IDF in a talk (and that isn't too unlikely in any Lucene, Solr, Elastic Search talks) I double check in a panic whether there are funny pictures on the slide shown! ;)

Linux vs. Hadoop - some inspiration?

2013-01-16 20:22
This (even for my blog’s standards) long-ish blog post was inspired by a talk given late last year at Apache Con EU as well as by discussions around what constitutes “Apache Hadoop compatibility” and how to make extending Hadoop easier. The post is based on conversations with at least one guy close to the Linux kernel community and another developer working on Hadoop. Both were extremely helpful in answering my questions and sanity checking the post below. After all I’m neither an expert on Linux kernel development and design, nor am I an expert on the detailed design and implementation of features coming up in the next few Hadoop releases. Thanks for your input.

Posting this here as I thought the result of my attempts to better understand the exact design commonalities and differences might be interesting for others as well. Disclaimer: This is by no means an attempt to influence current development, it just summarizes some recent thoughts and analysis. As a result I’m happy about comments pointing out additions or corrections - preferably as trackback or maybe on Google Plus as I had to turn off comments on this very blog for spamming reasons.

In his slides on “Insides Hadoop dev” during Apache Con EU:

Steve Loughran included a comparison that popped up rather often already in recent past but still made me think:

“Apache Hadoop is an OS for the datacenter”

It does make a very good point, even though being slightly misleading in my opinion:

  • There are lots of applications that need to run in a datacenter that do not imply having to use Hadoop at all - think mobile application backends, content management systems of publishers, encyclopedia hosting. Growing you may still run into the need for central log processing, scheduling and storing data.
  • Even if your application benefits from a Hadoop cluster you will need a zoo of other projects not necessarily related to the project to successfully run your cluster - think configuration management, monitoring, alerting. Actually many of these topics are on the radar of Hadoop developers - with an intent to avoid the NIH principle and rather integrate better with existing proven standard tools.

However if you do want to do large scale data analysis on largely unstructured data today you will most likely end up using Apache Hadoop.

When talking about operating systems in the free software world inevitably the topic will drift towards the Linux kernel. As it is one of the most successful free software projects out there, from time to time it’s interesting and valuable to look at its history and present in terms of development process, architecture, stakeholders in the development cycle and the way conflicting interests are being dealt with.

Although interesting in many dimensions this blog post focuses just on two related aspects:

  • How to balance innovation with stability in critical parts of the system.
  • How to deal with modularity and API stability from an architectural point of view taking project-external (read: non-mainline) module contributions into account.

The post is not going to deal with just “everything map/reduce” but focus solely on software written specifically to work with Apache Hadoop. In particular Map/Reduce layers plugged on top of existing distributed file systems that ignore data locality guarantees as well as layers on top of existing relational database management systems that ignore easy distribution and fail over are intentionally being ignored.

Balancing innovation with stability

One pain point mentioned during Steve’s talk was the perceived need for a very stable and reliable HDFS that prevents changes and improvements from making it into Hadoop. The rationale is very simple: Many customers have entrusted lots (as in not easy to re-create in any reasonable time frame) of critical (as in the service offered degrades substantially when no longer based on that data) data to Hadoop. Even with a backup in place, Hadoop going down due to a file system failure would still be catastrophic as it would take ages to get all systems back to a working state - time that means losing lots of customer interaction with the service provided.

When glancing over to Linux-land (or Windows, or MacOS really) the situation isn’t much different: Though both backup and recovery are much cheaper there, having to restore a user’s hard-disk just due to some weird programming mistake still is not acceptable. Where does innovation happen there? Well, if you want durability and stability all you do is to use one of the well proven file system implementations - everyone knows names like ext2, xfs and friends. A simple “man mount” will reveal many more. If on the contrary you need some more cutting edge features or want to implement a whole new idea of how a file system should work, you are free to implement your own module or contribute to those marked as EXPERIMENTAL.

If Hadoop really is the OS of the datacenter then maybe it’s time to think about ways that enable users to swap in their preferred file system implementation, maybe it’s time for developers to move the implementation of new features that could break existing deployed systems into separate modules. Maybe it’s time to announce an end-of-support-date for older implementations (unless there are users that not only need support but are willing to put time and implementation effort into maintaining these old versions that is.)

Dealing with modularity and API stability

With the vision of being able to completely replace whole sub-systems comes the question of how to guarantee some sort of interoperability. The market for Hadoop and surrounding projects is already split, and it’s hard for outsiders and newcomers to grasp which components work with which version of Hadoop. Is there a better way to do things?

Looking at the Linux kernel I see some parallels here: There are components built on top of kernel system calls (tools like ls, mkdir etc. all rely on a fixed set of system calls being available). On the other hand there’s a wide variety of vendors offering kernel drivers for their hardware. Those come in three versions:

  • Some are distributed as part of the mainline kernel (e.g. those for Intel graphics cards).
  • Some are distributed separately but including all source code (e.g. ….)
  • Some are distributed as binary blob with some generic GPLed glue logic (e.g. those provided by NVIDIA for their graphics cards).

Essentially there are two kinds of programming interfaces: ABIs (Application Binary Interfaces) that are being developed against from user space applications like “ls” and friends. APIs (Application Programming Interfaces) that are being developed against by kernel modules like the one by NVIDIA.

Coming back to Hadoop I see some parallelism here: There are ABIs that are being used by user space applications like “hadoop fs -ls” or your average map/reduce application. There are also some sort of APIs that strictly only allow for communication between HDFS, Map/Reduce and applications on top.

The Java ecosystem has a history of having APIs defined and standardised through the JCP and implemented by multiple vendors afterwards. With Apache projects people coming from a plain Java world often wonder why there is no standard that defines the APIs of valuable projects like Lucene or even Hadoop. Even log4j, commons logging and build tooling follow the “defacto standardisation” approach where development defines the API as opposed to a standardisation committee.

Going one step back the natural question to ask is why there is demand for standardisation. What are the benefits of having APIs standardised? Going through a lengthy standardisation process obviously can’t be the benefit.

Advantages that come to my mind:

  • When having multiple vendors involved that do not want to or cannot communicate otherwise a standardisation committee can provide a neutral ground for communication in particular for the engineers involved.
  • For users there is some higher level document they can refer to in order to compare solutions and see how painful it might be to migrate.

Having been to a DIN/ISO SQL meetup lately there’s also a few pitfalls that I can think of:

  • You really have to make sure that your standard isn’t going to be polluted with things that never get implemented just because someone thought a particular feature could be interesting.
  • Standardisation usually takes a long time (read: multiple years) until something valuable is created that can then be adopted and implemented in the industry.

More concerns include but are not limited to the problem of testing the standard - when putting the standard into main focus instead of the implementation there is a risk of including features in the standard that are hard or even impossible to implement. There is the risk of running into competing organisations gaming the system, making deals with each other - all leading to compromises that are anything but technologically sensible. There clearly is a barrier to entry when standardisation happens in a professional standards body. (On a related note: At least the German group working on the DIN/ISO standard defining the standard query language in particular in big data environments. Let me know if you would like to get involved.)

Concerning the first advantage (having some neutral ground for vendors to meet): Looking at your average standardisation effort, those committees may be neutral ground. However, communication isn’t necessarily available to the public, for whatever reasons. Compared to the situation a little over a decade ago, there has also been one major shift in how development is done on successful projects: Software is no longer developed in-house only. Many successful components that enable productivity are developed in the open, in a collaborative way that is open to any participant. Httpd, Linux, PHP, Lucene, Hadoop, Perl, Python, Django, Debian and others are all developed by teams spanning continents, cultures and, most importantly, corporations. Those projects provide a neutral ground for developers to meet and discuss their idea of what an implementation should look like.

Pondering a bit more on where successful projects I know of came from reveals something particularly interesting: ODF was first implemented as part of Open Office and then turned into a standardised format. XMPP was first implemented and then turned into an IETF standardised protocol. Lucene never went for any storage format or even search API standardisation but defined very rigid backwards compatibility guidelines that users learnt to trust. Linux itself never went for ABI standardisation - instead they opted for very strict ABI backwards compatibility guidelines that developers of user space tools could rely on.

Looking at the Linux kernel in particular, the rule is that user facing ABIs are supposed to be backwards compatible: You will always be able to run yesterday’s ls against a newer kernel. One advantage for me as a user is that this way I can easily upgrade the kernel in my system without having to worry about any of the installed user space software.

The picture looks rather different with Linux’ APIs: Those are intentionally not considered holy and are subject to change if need be. As a result, vendors providing proprietary kernel drivers, like NVIDIA, have the burden of providing updated versions in case they want to support more than one kernel version.

I could imagine a world similar to that for Hadoop: A world in which clients run older versions of Hadoop but are still able to talk to their upgraded clusters. A world in which older MapReduce programs still run when deployed on newer clusters. The only people who would need to worry about API upgrades would be those providing plugins to Hadoop itself or replacing components of the system. According to Steve this is what YARN promises: Turn MR into user layer code, have the lower level resource manager for requesting machines near the data.

ABC - die Katze lief im Schnee

2013-01-11 20:42
Seen this morning in Berlin:

A little impression from what the city looked like the weeks before it turned green on Christmas:

For winter images of other years see also previous posts. Title taken from a children's song:

On Taming Text

2013-01-01 20:21
This time of the year I would usually post pictures of my bicycle standing in the snow somewhere in Tierpark. This year however I was tricked into using public transport instead: a) After my husband found a new job, we now share some of the route to work - and he isn't crazy going by bike when it's snowing. b) I got myself a Nexus 7 earlier this month, which obsoleted having to take paper books with me when using public transport. c) Early in December Grant Ingersoll asked me for feedback on the by now nearly finished "Taming Text" (currently available as MEAP at Manning). So I even had a really interesting book to read on my way home.

Up to mid-December "Taming Text" was one of those books that always were very high on my to-read list: At least from the TOC it looked like the book to read if ever you wanted to write a search application. So I was really curious which topics it would cover and how deep explanations would go when I got the offer to read and review the book.


Short version: If you are building search applications - that is, anything that makes a search box available on a web site, be it an online store or a news article archive - this is the book to read. It covers all the gory details of how to implement features we have come to take for granted when using search: Type ahead, spelling correction, faceting, automatic tagging and more. The book motivates the value of these features from the user side, explains how to implement them with proven technologies like Apache Lucene, OpenNLP, and Mahout, and shows how those projects work internally to provide you with the functionality you need.

Longer summary

Search can be as easy as providing one box in some corner of your web site that users can type into to find relevant pages. However, when thinking about the topic just a little more, some handy features that users have come to expect come to mind:

  • Type ahead to avoid superfluous typing - it also comes in handy to avoid spelling errors and to know exactly which query actually will return a decent number of documents.
  • Spelling correction is pretty much standard - and avoids user frustration with hard to spell query terms.
  • Faceting is a great way to discover and explore more content, in particular when there are a few structured attributes attached to your items (prices to books, colors to cars etc).
  • Named Entity Recognition is well known among publishers who use automatic tagging services to support their staff.
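
To make the first three features a bit more tangible, here is a minimal sketch in plain Python. It is not taken from the book, which builds on Lucene, OpenNLP and Mahout. The term list and the function names (type_ahead, correct_spelling, facet_counts) are made up for illustration, and the spelling correction simply leans on the standard library's difflib rather than on anything a real search engine would use. Named Entity Recognition is left out as it needs a trained model.

```python
import bisect
from collections import Counter
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real system would read terms from its index.
TERMS = sorted(["hadoop", "lucene", "lucid", "mahout", "opennlp"])


def type_ahead(prefix, terms=TERMS, limit=5):
    """Return up to `limit` terms starting with `prefix` (terms must be sorted)."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        # In a sorted list all terms sharing a prefix sit next to each other.
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


def correct_spelling(word, terms=TERMS):
    """Return the closest known term, or the word itself if nothing is close."""
    candidates = get_close_matches(word, terms, n=1, cutoff=0.6)
    return candidates[0] if candidates else word


def facet_counts(items, attribute):
    """Count how many items carry each value of a structured attribute."""
    return Counter(item[attribute] for item in items if attribute in item)


if __name__ == "__main__":
    cars = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
    print(type_ahead("luc"))
    print(correct_spelling("lucnee"))
    print(facet_counts(cars, "color"))
```

The sorted list plus binary search keeps the type ahead lookup cheap for a small vocabulary; for anything bigger the dedicated data structures covered in the book are the better choice.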

The authors of Taming Text decided to structure the book around the task of building an automatic Question Answering system. Throughout the book they present technologies that need to be orchestrated to build such an application but are each valuable in their own right.

In contrast to Search Patterns (which is focused mainly on the product manager perspective and contains much less technical detail), Taming Text is the book to read for any engineer working on search applications. In contrast to books like Programming Collective Intelligence, Taming Text takes you one level further by not only showing the tools to use but also explaining their inner workings so that you can adapt them exactly to your use case. To me, Taming Text is the ideal complementary book to Mahout in Action (for the machine learning part) and Lucene in Action (for the search part).

Back in 1998 it was estimated that 80% of all information is unstructured data. In order to make sense of that wealth of data we need technologies that can deal with unstructured data. Search is one of the most basic but also most powerful ways to analyse texts. With a good mixture of theoretical background and hands-on examples, Taming Text guides you through the process of building a successful search application, no matter if you are dealing with a vast product database that you want to make more accessible to your users, with an ever growing news archive or with several blog posts and twitter messages that you want to extract data from.