ApacheConEU - part 07

2012-11-16 20:51
Julien Nioche shared some details on the nutch crawler. Being the mother of all Hadoop projects (as in Hadoop was born out of developments inside of nutch), the project had become rather quiet, but has seen a steady stream of development in the recent past. Julien himself uses nutch to gather crawled data for several customer projects - feeding this data into an NLP pipeline based on Behemoth that glues Mahout, UIMA and Gate together.

The basic crawling steps - building the web graph, computing a link based ranking and indexing - are still the same as when I last looked at the project; the main difference is that for indexing the project now uses solr instead of its own lucene based solution.

The main advantage of nutch is its pluggability: the protocol parser, html filter, url filter and url normaliser can all be swapped out for your own implementations.
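
Plugin selection happens through the plugin.includes property in nutch-site.xml - a minimal sketch, with a purely illustrative plugin list:

    <!-- nutch-site.xml: plugins are selected by a regex over plugin ids;
         swap in your own implementations here -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
    </property>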

In their 2.0 version they moved away from their own plain hdfs storage to a table schema - mapped to the actual database through Gora, an abstraction layer that connects to e.g. Cassandra or HBase. The schema itself is based on Avro but can be adapted to your needs. The advantages are obvious: though still distributed, this approach keeps much less logic in nutch itself and is simpler overall. It also makes it easier for third parties to connect to the data - all you need is the schema as well as Gora. The current disadvantage lies in its configuration overhead and instability compared to the old solution. Most likely at least the latter will go away as version 2.0 stabilises.
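
As a heavily trimmed sketch, an Avro record for a crawled page might look roughly like this - the field selection here is illustrative, not the authoritative nutch schema:

    {"type": "record", "name": "WebPage", "namespace": "org.apache.nutch.storage",
     "fields": [
        {"name": "baseUrl",   "type": "string"},
        {"name": "status",    "type": "int"},
        {"name": "fetchTime", "type": "long"},
        {"name": "content",   "type": "bytes"},
        {"name": "outlinks",  "type": {"type": "map", "values": "string"}}
     ]}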

In terms of future work the project focuses on stabilisation and on synchronising the features of versions 1.x and 2.x (link ranking is only available in version 1.x while support for ElasticSearch is only available in version 2.x). In terms of functionality the goals are to move to SolrCloud, to support sitemaps (as implemented by crawler-commons) and to add more (pluggable?) indexers.

The goal is to delegate implementation work to other projects - as was already done for Tika and Solr. Most likely the same will happen for the fetcher, protocol handling, robots.txt handling, url normalisation and filtering, the graph processing code and others.


The next talk in the Solr/Lucene track dealt with scaling Solr to big data. The speaker's goal was to index 100 million documents - with the number of documents expected to grow in the future. Being under heavy time pressure and having a bash wizard on the project, they started out building lots of their glue code and components as bash scripts: there were scripts for starting/stopping services, remote indexing, performance monitoring, content extraction, ingestion and deployment. Short term this was a very good idea - it allowed for fast iterations and learning. In the long run they slowly replaced their custom software with standard components (tika for content extraction, puppet for deployment etc.).

They quickly learnt to value property files as a way to easily reconfigure the system even in production (relying heavily on bash, xml configuration was of course not an option). One problem where this came in handy was adjusting the sharding configuration - going from simple random sharding to old-vs-new to monthly shards they could optimise the configuration to their load; see the sketch below. What also worked well for them was to separate JVM startup from Solr core startup - they would start with empty Solr instances and only point them to the data directories after verifying that the JVMs had booted up correctly.
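
A hypothetical example of such a property file - all property names and values here are made up for illustration:

    # sharding layout, adjustable without touching any of the scripts
    shard.strategy=monthly
    shards=solr01:8983/solr/2012-10,solr02:8983/solr/2012-11,solr03:8983/solr/current
    # boot the JVMs empty first, attach data directories in a second step
    core.autostart=false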

In terms of performance they learnt to go wide quickly - instead of spending hours on optimising their one huge box they ended up distributing their solrs across multiple separate machines.

In terms of ingestion pipelines: theirs is currently based on an indexing directory convention, moving the index over as soon as all data is ingested. The advantage here is the atomicity of mv that can be relied upon - the disadvantage is having to move around lots of data. Their goal is to move indexing to hdfs soon and to use zookeeper for configuration handling.
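
A minimal bash sketch of that convention, assuming staging and inbox directories live on the same filesystem (all paths and the ingestion step are illustrative):

    #!/usr/bin/env bash
    STAGING=/data/staging/batch-$(date +%Y%m%d)
    INBOX=/data/indexing-inbox

    ingest_documents "$STAGING"   # hypothetical ingestion step
    # rename is atomic within one filesystem - the indexer never sees a half-written batch
    mv "$STAGING" "$INBOX/"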

In terms of testing: in contrast to a fixed dev-test-production environment their goal is an elastic compute cloud that can be adjusted as needed. Though EC2 itself is very cost intensive and poses additional problems with firewalling and moving data, cloud computing could still be a solution - in particular given projects like cloud stack or open cloud. The goal would be to do cycle scavenging in their own datacenter: run the heavy computations when there is not a lot of load on the system and turn those analyses off in case of incoming traffic.

When it comes to testing and monitoring changes they had good experiences with JConsole (connecting it to several solrs at once through a simple ip discovery script) and with SolrMeter as a performance debugging tool.

Some implementation details: they used Solr as some sort of NoSQL cache as well (for thousands of queries/s), push the schema definition to solr from the app, and have common fields plus the option for developers to add custom fields in the app. Their experience is to not do expensive work inside solr but to move it outside - this applies in particular to content extraction. For storage they used an avro based binary file format (mainly in order to save storage space and get a versioned schema plus fast compression and de-compression). They are starting to use tika as their pipeline and for automatic content detection, scaling out with behemoth. They learnt to upgrade indexes without re-indexing using the lucene upgrade tooling. In addition they use Grimreaper to kill servers if anything goes wrong and restart them later.
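
Assuming the upgrade tooling referred to is the IndexUpgrader that ships with lucene-core (jar version and index path are illustrative), an in-place upgrade looks roughly like this:

    java -cp lucene-core-3.6.1.jar org.apache.lucene.index.IndexUpgrader /data/solr/core0/index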

ApacheConEU - part 06

2012-11-15 20:48
For the next session I joined the Tomcat crowd in Mark Thomas' talk to learn more about Tomcat reverse proxy configurations. One rather common setup is to have Tomcat connected to an httpd instance. A common issue with this setup - in particular when running httpd with the event mpm - is thread exhaustion on Tomcat's side. Fixes include always having more active Tomcat threads than there can be httpd threads at any one time, and disabling persistent connections (see the sketch below). Keep in mind that Tomcat performance does not degrade gracefully here - in case of thread exhaustion it just goes downhill very quickly. One famous example was an issue with the ASF jira: after weeks of debugging bad performance - after blaming the hardware, the os, the JVM, the java implementation in general - finally the number of threads was increased, resulting in a smoothly running system...
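
Both fixes in a minimal server.xml sketch - the attribute values are illustrative, what matters is keeping maxThreads above whatever httpd can open concurrently:

    <!-- more Tomcat threads than httpd workers, keep-alive disabled -->
    <Connector port="8080" protocol="HTTP/1.1"
               maxThreads="400"
               maxKeepAliveRequests="1" />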

Another common configuration problem is renaming the deployed web application war - for instance in order to keep the version number in the war name itself - and changing the path on httpd's side. This is bad for at least four reasons:

  • redirects will fail - you can configure ProxyPassReverse which will fix some issues but does not affect all http headers
  • cookie paths break - you can configure ProxyPassReverseCookiePath here
  • links that are generated in the web app will fail - you can use mod_sed / mod_substitute / mod_proxy_html to fix that - however those configurations tend to become messy and error prone
  • custom headers usually also break


If the only reason for doing such a thing is to keep the version number in the file name, an option is to use a name like "foo##1.2.3.war" - tomcat ignores everything after the hash signs when deriving the context path.

When dealing with proxied traffic make sure to inform Tomcat about https termination in order to correctly handle secure cookies and sessions. This happens automatically with mod_jk and mod_proxy_ajp; plain mod_proxy needs some more manual work. When dealing with virtual hosting make sure to use ProxyPreserveHost in order to be able to switch hosts on Tomcat's side.
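
Putting the pieces together, a minimal httpd sketch - host names and paths are illustrative, and the X-Forwarded-Proto header is just one common way to signal https termination to the backend:

    ProxyPreserveHost On
    ProxyPass        /app http://tomcat.internal:8080/app
    ProxyPassReverse /app http://tomcat.internal:8080/app
    # hint to the backend that TLS was already terminated here (needs mod_headers)
    RequestHeader set X-Forwarded-Proto "https"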



ApacheConEU - part 05

2012-11-14 20:47
The afternoon featured several talks on HBase - both its implementation as well as schema optimisation. One major issue in schema design is the choice of row key. The simplest recommendation is to design keys such that on reading the data load is evenly distributed across all nodes, preventing region-server hot-spotting. The general advice here is to hash keys or to reverse urls.
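
Both recommendations in a minimal Java sketch - the class and method names are mine, not from the talks:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class RowKeys {
        // prefix the key with a few hash bytes so sequential writes spread across regions
        static byte[] hashedKey(String url) throws NoSuchAlgorithmException {
            byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
            byte[] digest = MessageDigest.getInstance("MD5").digest(urlBytes);
            byte[] key = new byte[4 + urlBytes.length];
            System.arraycopy(digest, 0, key, 0, 4);
            System.arraycopy(urlBytes, 0, key, 4, urlBytes.length);
            return key;
        }

        // reverse the host name so all pages of one domain sort next to each other
        static String reversedUrl(String host, String path) {
            String[] parts = host.split("\\.");
            StringBuilder sb = new StringBuilder();
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            return sb.append(path).toString();
        }
    }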

When it comes to running your own HBase cluster make sure you know what is going on in the cluster at any point in time:

  • HBase comes with tools for checking and fixing tables,
  • tools for inspecting hfiles that HBase stores data in,
  • commands for inspecting the binary write ahead log,
  • web interfaces for master and region servers,
  • offline meta data repair tooling.


When it comes to system monitoring make sure to track cluster behaviour over time e.g. by using Ganglia or OpenTSDB and configure your alerts accordingly.

One tip for high traffic sites - it might make sense to disable automatic splitting to avoid splits during peaks and rather postpone them to low traffic times. One rather new project presented to monitor region sizes was Hannibal.

At the end of his talk the speaker went into some more detail on problems encountered when rolling out HBase and lessons learnt:

  • the topic itself was new so both engineering and ops were learning.
  • at scale, nothing that was tested at small scale works as advertised.
  • hardware issues will occur, tuning the configuration to your workload is crucial.
  • they used early versions - inevitably leading to stability issues.
  • it's crucial to know that something bad is building up before all systems start catching fire - monitoring and alerting the right thing is important. With Hadoop there are multiple levels to monitor: the hardware, os, jvm, Hadoop itself, HBase on top. It's important to correlate the metrics.
  • key- and schema design are key.
  • monitoring and knowledgable operations are important.
  • there are no emergency actions - in case of an emergency it just will take time to recover: even if there is a backup, merely transferring the data back can take hours or days.
  • HBase (and Hadoop) is DevOps technology.
  • there is a huge learning curve to get to a state that makes using these systems easy.



In his talk on HBase schema design Lars George started out with an introduction to the HBase architecture. On the logical level it's best to think of HBase as a large distributed hash table - all data except for names is stored as byte arrays (with helpers provided to transform that data back into well known data types). The tables themselves are stored in a sparse format which means that null values essentially come for free.

On a more technical level the project uses zookeeper for master election, split tracking and state tracking. The architecture itself is based on log structured merge trees. All data initially ends up in the write ahead log - with data always being appended to the end, this log can be written and read very efficiently without any seek penalty. The data is inserted at the respective region server in memory (the mem store, typically 64 MB in size) and synced to disk at regular intervals. In HBase all files are immutable - modifications are done only by writing new data and merging it in later. Deletes also happen by marking data as deleted instead of really removing it. On a minor compaction the most recent few files are merged; on a major compaction all files are merged and deletes are actually applied. Scheduling your own major compactions is possible as well.

In terms of performance, lookup by key is the best you can do - lookup by value results in a full-table scan (see the sketch below). If a row is updated only infrequently there is an option to give HBase a hint as to where to find the key, by providing a timestamp of roughly where to look for it. There are also options to use Bloom filters for better read performance. Another option is to move more data into the row key itself if that is what you will be searching for most often. Make sure to de-normalise your data as HBase does not do any sophisticated joins - there will be duplication in your data as everything should be pre-joined to get good performance. Use intelligent keys that match your read/write patterns, but keep them reasonably short as the key is repeated for each entry - so moving the whole data into the key isn't going to gain you anything.
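
A sketch of the key-versus-value difference against the HBase client API of that era - table, family and qualifier names are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LookupSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "webpages");

            // lookup by key: a cheap, direct random read
            Result row = table.get(new Get(Bytes.toBytes("org.example/index.html")));

            // lookup by value: the filter runs server side, but every row is touched
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("f"),
                    Bytes.toBytes("status"), CompareOp.EQUAL, Bytes.toBytes(200)));
            ResultScanner scanner = table.getScanner(scan);
            scanner.close();
            table.close();
        }
    }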

Speaking of read/write patterns, a general rule of thumb: to optimise for better write performance tune the memstore size; for better read performance tune the block cache size. One final hint: anything below 8 servers really is just a test setup as it doesn't get you any real advantages.
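
In configuration terms this boils down to two heap fractions that compete with each other - property names as of the HBase versions current at the time, values purely illustrative:

    <!-- hbase-site.xml -->
    <property>
      <!-- heap fraction available to memstores: raise for write-heavy loads -->
      <name>hbase.regionserver.global.memstore.upperLimit</name>
      <value>0.4</value>
    </property>
    <property>
      <!-- heap fraction for the read block cache: raise for read-heavy loads -->
      <name>hfile.block.cache.size</name>
      <value>0.25</value>
    </property>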

ApacheConEU - part 04

2012-11-13 20:46
The second talk I went to was the one on the dev@hadoop.a.o insights given by Steve Loughran. According to Steve, Hadoop has turned into what he calls an operating system for the data center - similar to Linux in that its development is not driven by a vendor but by its users: even though Hortonworks, Cloudera and MapR each have full time people working on Hadoop (and related projects), this work usually is driven by customer requirements, which ultimately means that someone is running a Hadoop cluster he has trouble with and wants to have fixed. In that sense the community at large has benefitted a lot from the financial crisis that Y! slipped into: most of the Hadoop knowledge that is now spread across companies like LinkedIn, Facebook and others comes from engineers leaving Y! and joining one of those companies. With that the development cycle of Hadoop has changed as well: while it was mostly driven by Y!'s schedule in the beginning - crunching out new releases nearly on a monthly basis, with a dip around Christmas time - it got more irregular later, only to pick up a more regular schedule again just recently.



Image taken a few talks earlier in the same session.


In terms of version numbers: 1.x is to be considered stable and production ready, with fixes and low risk patches applied only. For 2.x the picture is a bit different - it currently is in alpha stage; features and fixes go there first, new code is developed there first. However Hadoop is not just http://hadoop.apache.org - the ecosystem is much larger, including projects like Mahout, Hama, HBase and Accumulo built on top of Hadoop, Hive, Pig, Sqoop and Flume for integration and analysis, as well as oozie for coordination and zookeeper for distributed configuration. There's even more in incubation (or just recently graduated): Kafka for logging, whirr for cloud deployment, giraph for graph processing, ambari for cluster management, s4 for distributed stream processing, hcatalog and templeton for schema management, chukwa for logging purposes. All of the latter ones love helping hands. If you want to help out and want to "play Apache", those are the places to go.

With such a large ecosystem one of the major pain points when rolling out your own Hadoop cluster is integrating all the components you need: all projects release on separate schedules, usually documenting against which version of Hadoop they were built. Finding a working combination, however, is not always trivial. The first place to go that springs to mind are the standard Linux distributions - but for their release and support cycles (e.g. Debian's guarantees for stable releases) the pace inside of Hadoop, and the speed with which old versions are declared "no longer supported", still is too fast. So what alternatives are there? You can go with Cloudera, who ship Apache Hadoop extended with their patches and additional proprietary software. You can opt for Hortonworks, who ship the ASF Hadoop only. Or you can opt for Apache itself and either tame the zoo yourself or rely on BigTop, which aims to integrate the latest versions.

Even though there are people working full time on the project, there's still way more work to do than hands to help out. Some issues do fall through the cracks, in particular if you are the only person affected. Ultimately this may mean that if a bug only affects you, you will have to be the one to fix it. Before going out and hacking the source itself it may make sense to search for the stack trace or error message in front of you - look in JIRA and the mailing list archives to see if someone else has solved the problem already - in the ideal case even providing a patch that just didn't make it into the distribution so far.

Contributing to Hadoop also is a great way to get to work on hard CS problems: there's distributed computing involved (clearly), there are consensus implementations like Paxos, there's work to optimise Hadoop for specific CPU architectures, there are scheduling and data placement problems to solve, machine learning and graph algorithm implementations and more. If you are into these topics - come and join, those are highly needed skills. People on the list tend to be friendly - they are "just" overloaded.

Of course there are barriers to entry - just like the Linux kernel, Hadoop has become business critical for many of its users. Steve's recommendation for dealing with this circumstance was to not compete with existing work that is being done - instead concentrate on areas that aren't covered yet, or better yet, collaborate with others. A single lonely developer just cannot reasonably compete.

In terms of commit process Hadoop runs a review-then-commit protocol. Though it makes things seemingly more secure, it also means that the development process is a lot slower, that things get neglected, and that there is a lot of frustration on the committers' side as well when a patch has to be updated over and over again. In addition, tests that run for a long time don't make contributing substantially easier. With lots of valuable and critical data stored in HDFS, making radical changes there isn't something that happens easily either. The best way to really get started is to become trusted and build a track record: the maintenance cost of abandoned patches that are large, hard to grasp and no longer supported by the original author is just too high.

Get a track record by getting known on dev@ and at meetups. Show your knowledge by helping out others, help with patches that are not your own, don't rewrite the core initially, help with building plug-in points (like for shuffling implementations), and help with testing across the whole configuration space and in unusual settings. Delegate testing at scale to those who have the huge clusters - there will always be issues revealed in their setup that do not cause any problems on a single machine or in a tiny cluster. Also make sure to download the package during the beta phase and test that it works in your problem space. Another way to get involved is by writing better documentation - either online or as a book. Share your experience.

One major challenge is work on major refactorings - there is the option to branch out and switch to a commit-then-review model, but that only postpones the work to merge time. For independent work it's even more complicated to get the changes in. Also, integrating post-graduate work that can be very valuable isn't particularly simple - especially if there already is a lack of helping hands: who's going to spend the time to mentor those students?

Some ideas to improve the situation would be to help with spreading the knowledge - through local university partnerships, google hangouts and local dev workshops, and by using git and gerrit for better distributed development and merging.

ApacheConEU - part 03

2012-11-12 20:27
Tuesday started early with a plenary - run by the sponsor. Not much news there, except for the very last slide, which raised a question that is often discussed within the ASF as well - namely how to define oneself compared to non-ASF projects. What is the real benefit for our users - and what is the benefit for people to go with the ASF? The speaker concentrated on pointing out the difference to github. Yes, tooling changes are turning into a real game changer - but that is nothing the foundation could not adopt over time. What I personally find interesting is not so much what makes us different from others but rather what can be learnt from other projects - not only on github but in a broader scope from the KDE, Python, Gnome, Debian and Open/Libre-Office communities, from people working on smaller but nonetheless successful projects as well as from the larger foundations, maybe even the corporate driven projects. Over time lots of wisdom has accumulated within and outside of the foundation on how to run successful open source projects - the question now is how to transfer that knowledge to relatively young projects, and whether that can help given the huge amount of commercial interest in open source - not only in using it but also in driving individual projects, including all the benefits (more people) and friction around it.

The first talk I went to was Rainer Jung's presentation on the new Apache httpd release. Most remarkably, the event mpm - previously available as an "experimental" feature - is now marked as default for the Apache distribution, though it is not being used for ssl connections. In addition there is support for async write completion and better support for sizing and monitoring. In particular when sizing the event mpm the new scoreboard comes in handy. When using it, keep in mind to adjust the number of allowed open file handles as well - see the rough sizing sketch below.
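
A rough sizing sketch - the numbers are purely illustrative, what matters is the relation between the directives:

    <IfModule mpm_event_module>
        # MaxRequestWorkers must not exceed ServerLimit * ThreadsPerChild
        ServerLimit               8
        ThreadsPerChild          64
        MaxRequestWorkers       512
        # additional async connections allowed per idle worker
        AsyncRequestWorkerFactor  2
    </IfModule>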

In order to better support server-to-client communication, html5 web socket standardisation is on its way. If you are interested in that, check out the hybi standardisation list. Taking a look at Google's SPDY could be interesting as well.

Since 2.4, dynamically loadable modules are supported and easy to switch. When it comes to logging there is now support for sub-second timestamp precision and per-module log levels. Process and thread ids are kept in order to be able to untangle concurrent connection handling. There are unique message tokens in the error log to track requests. The error log format is configurable as well - including trace levels one to eight, configurable per directory, location and module. They've added lots of trace messages to core, plus correlation ids between error and access log entries (format entry %L). In addition there is a mod_log_debug module to help you log exactly what you want, when you want.
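
A sketch of those logging features in configuration form - the formats themselves are illustrative, the directives are standard 2.4:

    # per-module log levels, with trace1 .. trace8 available
    LogLevel info rewrite:trace3 ssl:warn
    # sub-second timestamps, module and level, pid/tid, and the log id %L
    ErrorLogFormat "[%{u}t] [%-m:%l] [pid %P:tid %T] [%L] %M"
    # the same id in the access log lets you correlate entries across both logs
    CustomLog logs/access_log "%h %l %u %t \"%r\" %>s %b %L"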

Speaking of modules - in order to upgrade them from 2.2 to 2.4 it is in general sufficient to re-compile. With the new version, though, not all modules are loaded by default anymore. New features include dynamic configuration based on mod_lua. AAA was changed again, and there are filters to rewrite content before it is sent out to clients (mod_substitute, mod_sed, mod_proxy_html). mod_remoteip helps to keep the original client ip in your logs instead of the proxy ip.

When it comes to documentation, better check the English documentation - or better yet, provide patches to it. The mod_rewrite and mod_proxy documentation improved a lot. In addition the project itself now has a new service for its users: via comments.apache.org you can send documentation comments to the project without the need to register for a bugzilla account and provide documentation patches. There is now syntax highlighting in the documentation as well. One final hint: the project is very open and actively looking for new contributors - though they may be slow to respond on the user and dev lists, they definitely are not unfriendly ;)

ApacheCon EU - part 02

2012-11-11 20:26
For me the week started with the Monday Hackathon. Even though I was there early, the room quickly filled up and was packed at lunch time. I really liked the idea of having people interested in a topic register in advance - it gave the organisers a chance to assign tables to topics and put up signs advertising the topic worked on. I'm not too new to the community anymore and can relate several faces to names of people I know are working on projects I'm interested in - but I would hope that this little bit of extra transparency made it easier for newcomers to figure out who is working on what. Originally I wanted to spend the day continuing to work on an example showing what sort of pre-processing is involved in getting from raw html files to a prediction of which Berlin Buzzwords submission is going to be accepted. (Un-?)fortunately I quickly got distracted and drawn into discussions on what kind of hardware works best for running an Apache Hadoop cluster, how the whole Hadoop community works and where the problem areas are (e.g. constantly missing more helping hands to get all the things on the todo list done).





The evening featured a really neat event: Committers and Hackathon participants were invited to the committer reception in the Sinsheim technical and traffic museum. One interesting observation: There's an easy way to stop geeks from rushing over to the beer, drinks and food: Just put some cars, motor cycles and planes in between them and the food ;)

ApacheConEU - part 01

2012-11-10 14:30
Apache Con EU in Germany - in November, in Sinsheim (in the middle of nowhere): I have to admit that I was more than skeptical whether that would actually work out very well. A day after the closing session it's clear that the event was a huge success: Days before all tickets were sold out, there were six sessions packed with great talks on all things related to Apache Software Foundation projects - httpd, tomcat, lucene, open office, hadoop, apache commons, james, felix, cloud stack and tons of other projects were well covered. In addition the conference featured a separate track on how the Apache community works.





The venue (the Hoffenheim soccer team's home stadium) worked out amazingly well: the conference had four levels rented, with talks hosted in the press room, a lounge and two talks each on the first and second floor in an open space setup. That way entering a talk late or leaving early was way less of a hassle than when having to get out the door - and sneaking into interesting talks on the second floor was particularly easy: from the third floor, which was reserved for catering, one could easily follow the talks downstairs. Speaking of catering: yummy and available all the time - and that goes not only for water but also for snacks (e.g. cake between breaks), coffee, soft-drinks, tea etc. On top of that a tasty lunch buffet with all sorts of more or less typical regional food. You've set high standards for upcoming conferences ;)

Apache Con Wrap Up

2011-11-16 20:45
First things first - slides, audio and abstracts of Apache Con are all online now on their Lanyrd page. So if you missed the conference or could not attend a session due to conflicting with another interesting session - that's your chance to catch up.

For those of you who are speaking German, there's also a summary of Apache Con available on heise Open. (If you don't speak German, I have been told that the Google Translate version of the site captures the gist of the article reasonably well.)

Talking people into submitting patches

2011-09-21 20:22
In November I am going to attend Apache Con NA. This year I decided to do a little experiment: I submitted a talk on talking people into contributing to free software projects. The format of the talk is a bit unusual: drawing from my - admittedly limited and very biased - experience explaining free software to others and talking people into contributing patches, this talk tries to initiate a discussion on methods to get awesome developers to consider contributing their work back to free software projects.

As a precursor to the talk I have created a public Google docs document - it already contains the title of each slide I will use in my presentation. Of course the content does not get disclosed ahead of time.

If any of the readers of this blog post has experience with either explaining why they contribute, how to contribute, or the issues and questions new users have - please feel free to add it to the above document. I'll try to integrate as much feedback as possible into my final slides.

Apache Con – Wrap up

2010-11-29 23:27
After one week of lots of interesting input, the ASF's user conference was over. With its focus on users of Apache software, quite a few talks were not so much suited for conference regulars but targeted more or less at newbies who want to know how to successfully apply the software. As a developer of Mahout with close ties to the Lucene and Hadoop communities, what is of course most interesting to me are stories of users putting the software into production.
The conference was well organised: the foundation features way more projects at the moment than can reasonably be covered in just one conference. As a result Apache Con covers only a subset of what is being developed at Apache, and more and more smaller community organised events are being run by individuals as well as corporations. Still, Apache Con is a nice place to get in touch with other developers – and to get a glimpse of what is going on in projects outside one's own regular community.