ApacheConEU - part 04

2012-11-13 20:46
The second talk I went to was the one on the dev@hadoop.a.o insights given by Steve Loughran. According to Steve, Hadoop has turned into what he calls an operating system for the data center - similar to Linux in that its development is not driven by a vendor but by its users: Even though Hortonworks, Cloudera and MapR each have full time people working on Hadoop (and related projects), this work usually is driven by customer requirements, which ultimately means that someone is running a Hadoop cluster that they have trouble with and want to have fixed. In that sense the community at large has benefitted a lot from the financial crisis that Y! has slipped into: Most of the Hadoop knowledge that is now spread across companies like LinkedIn, Facebook and others comes from engineers leaving Y! and joining one of those companies. With that the development cycle of Hadoop has changed as well: While it was mostly driven by Y!'s schedule in the beginning - cranking out new releases nearly on a monthly basis with a dip around Christmas time - it got more irregular later, only to pick up a more regular schedule just recently.

Image taken a few talks earlier in the same session.

In terms of version numbers: 1.x is to be considered stable and production ready, with only fixes and low risk patches applied. For 2.x the picture is a bit different - currently that is in alpha stage, and features, fixes and new code go there first. However Hadoop is not just http://hadoop.apache.org - the ecosystem is much larger, including projects like Mahout, Hama, HBase and Accumulo built on top of Hadoop, Hive, Pig, Sqoop and Flume for integration and analysis, as well as Oozie for coordination and ZooKeeper for distributed configuration. There's even more in incubation (or just recently graduated): Kafka for logging, Whirr for cloud deployment, Giraph for graph processing, Ambari for cluster management, S4 for distributed stream processing, HCatalog and Templeton for schema management, Chukwa for logging purposes. All of the latter ones love helping hands. If you want to help out and want to "play Apache", those are the places to go to.

With such a large ecosystem one of the major pain points when rolling out your own Hadoop cluster is integrating all components you need: All projects release on separate release schedules, usually documenting against which version of Hadoop they were built. However finding a working combination is not always trivial. The first place that springs to mind is the standard Linux distributions - however for their release and support cycles (e.g. Debian's guarantees for stable releases) the pace inside of Hadoop and the speed with which old versions are declared "no longer supported" still is too fast. So what alternatives are there? You can go with Cloudera, who ship Apache Hadoop extended with their patches and additional proprietary software. You can opt for Hortonworks, who ship the ASF Hadoop only. Or you can opt for Apache itself and either tame the zoo yourself or rely on BigTop, which aims to integrate the latest versions.

Even though there are people working full time on the project there's still way more work to do than hands to help out. Some issues do fall through the cracks, in particular if you are the only person affected by the issue. Ultimately this may mean that if the bug only affects you, you will have to be the one to fix it. Before going out and hacking the source itself it may make sense to search for the stack trace you are looking at or the error message in front of you - look in JIRA and in the mailing list archives to see if there's someone else who has solved the problem already - and in an ideal case even provided a patch that just didn't make it into the distribution so far.

Also contributing to Hadoop is a great way to get to work on CS hard problems: There's distributed computing involved (clearly), there's consensus implementations like Paxos, there's work to optimise Hadoop for specific CPU architectures, there's scheduling and data placement problems to solve, machine learning and graph algorithm implementations and more. If you are into these topics - come and join, those are highly needed skills. People on the list tend to be friendly - they are "just" overloaded.

Of course there are barriers to entry - just like the Linux kernel Hadoop has become business critical for many of its users. Steve's recommendation for dealing with this circumstance was to not compete with existing work that is being done - instead either concentrate on areas that aren't covered yet or even better yet collaborate with others. A single lonely developer just cannot reasonably compete.

In terms of commit process Hadoop is running a review-then-commit protocol. Although it makes things seemingly more secure, it also means that the development process is a lot slower, that things get neglected, and that there is a lot of frustration, also on the committers' side, when a patch has to be updated repeatedly. In addition, tests that run for a long time don't make contributing substantially easier. With lots of valuable and critical data being stored in HDFS, making radical changes there also isn't something that happens easily. The best way to really get started is to get trusted and have a track record: The maintenance cost for abandoned patches that are large, hard to grasp and no longer supported by the original author is just too high.

Get a track record by getting known on dev@ and meetups. Show your knowledge by helping out others, help with patches that are not your own, don't rewrite core initially, help with building plug-in points (like for shuffling implementations), help with testing across the whole configuration space, testing in unusual settings. Delegate the test at scale to those that have the huge clusters - there will always be issues revealed in their setup that do not cause any problems on a single machine or in a tiny cluster. Also make sure to download the package in the beta phase and test that it works in your problem space. Another way to get involved is by writing better documentation - either online or as a book. Share your experience.

One major challenge is work on major refactorings - there is the option to branch out and switch to a commit-then-review model, but that only postpones work to merge time. For independent works it's even more complicated to get the changes in. Also integrating post graduate work that can be very valuable isn't particularly simple - especially if there already is a lack of helping hands - who's going to spend the time to mentor those students?

Some ideas to improve the situation would be to help with spreading the knowledge - in local university partnerships, Google hangouts and local dev workshops - and to use git and gerrit for better distributed development and merging.

ApacheConEU - part 03

2012-11-12 20:27
Tuesday started early with a plenary - run by the sponsor, not much news there, except for the very last slide that raised a question that is being discussed often also within the ASF - namely how to define oneself compared to non-ASF projects. What is the real benefit for our users - and what is the benefit for people to go with the ASF? The speaker concentrated on pointing out the difference to github. Yes, tooling changes are turning into a real game changer - but that is nothing that the foundation could not adopt over time. What I personally find interesting is not so much what makes us different from others but more what can be learnt from other projects - not only on github but also in a broader scope from the KDE, Python, Gnome, Debian, Open/Libre-Office communities, from people working on smaller but nonetheless successful projects as well as the larger foundations, maybe even the corporate driven projects. Over time lots of wisdom has accumulated within and outside of the foundation on how to run successful open source projects - now the question is how to transfer that knowledge to relatively young projects and whether that can help given the huge amount of commercial interest in open source - not only in using it but also in driving individual projects including all the benefits (more people) and friction around it.

The first talk I went to was Rainer Jung’s presentation on the new Apache httpd release. Most remarkably the event mpm, previously available as an “experimental” feature, is now marked as default for the Apache distribution - though it is not being used for ssl connections. In addition there is support for async write completion and better support for sizing and monitoring. In particular when sizing the event mpm the new scoreboard comes in handy. When using it, keep in mind to adjust the number of allowed open file handles as well.

In order to better support server-to-client communication there is html5 web socket standardisation on its way. If you are interested in that check out the hybi standardisation list. Also taking a look at Google's SPDY could be interesting.

Since 2.4 dynamic loadable modules are supported and easy to switch. When it comes to logging there is now support for sub second timestamp precision, per module log levels. Process and thread ids are kept in order to be able to untwist concurrent connection handling. There are unique message tokens in the error log to track requests. Also the error log format is configurable - including trace levels one to eight, configurable per directory, location and module.
They’ve added lots of trace messages to core, correlation ids between error and access log entries (format entry %L). In addition there is a mod_log_debug module to help you log exactly what you want when you want.

Speaking of modules - in order to upgrade them from 2.2 to 2.4 it’s in general sufficient to re-compile. With the new version though not all modules are going to be loaded by default anymore. New features include dynamic configurations based on mod_lua. AAA was changed again, and there are filters to rewrite content before it’s sent out to clients (mod_substitute, mod_sed, mod_proxy_html). mod_remoteip helps to keep the original ip in your logs instead of the proxy ip.

When it comes to documentation better check the English documentation - or better yet provide patches to it. The documentation for mod_rewrite and mod_proxy improved a lot. In addition the project itself now has a new service for its users: via comments.apache.org you can send documentation comments to the project without the need to register for a bugzilla account and provide documentation patches. In addition there is now syntax highlighting in the documentation. One final hint: the project is very open and actively looking for new contributors - though they may be slow to respond on the user and dev list - they definitely are not unfriendly ;)

ApacheCon EU - part 02

2012-11-11 20:26
For me the week started with the Monday Hackathon. Even though I was there early the room quickly filled up and was packed at lunch time. I really liked the idea of having people interested in a topic register in advance - it gave the organisers a chance to assign tables to topics and put signs on the tables to advertise the topic worked on. I'm not too new to the community anymore and can relate several faces to names of people I know are working on projects I'm interested in - however I would hope that this little bit of extra transparency made it easier for newcomers to figure out who is working on what. Originally I wanted to spend the day continuing to work on an example showing what sort of pre-processing is involved in order to get from raw html files to a prediction of which Berlin Buzzwords submission is going to be accepted. (Un-?)fortunately I quickly got distracted and drawn into discussions on what kind of hardware works best for running an Apache Hadoop cluster, how the whole Hadoop community works and where the problem areas are (e.g. constantly missing more helping hands to get all things on the todo list done).

The evening featured a really neat event: Committers and Hackathon participants were invited to the committer reception in the Sinsheim technical and traffic museum. One interesting observation: There's an easy way to stop geeks from rushing over to the beer, drinks and food: Just put some cars, motor cycles and planes in between them and the food ;)

ApacheConEU - part 01

2012-11-10 14:30
Apache Con EU in Germany - in November, in Sinsheim (in the middle of nowhere): I have to admit that I was more than skeptical whether that would actually work out very well. A day after the closing session it's clear that the event was a huge success: Days before all tickets were sold out, there were six sessions packed with great talks on all things related to Apache Software Foundation projects - httpd, tomcat, lucene, open office, hadoop, apache commons, james, felix, cloud stack and tons of other projects were well covered. In addition the conference featured a separate track on how the Apache community works.

The venue (the Hoffenheim soccer team's home stadium) worked out amazingly well: The conference had four levels rented with talks hosted in the press room, a lounge and two talks on each of the first and second floor in an open space setup. That way entering a talk late or leaving early was way less of a hassle than when having to get out the door - sneaking into interesting talks on the second floor was particularly easy: From the third floor that was reserved for catering one could easily follow the talks downstairs. Speaking of catering: Yummy and available all the time - and that not only counts for water but for snacks (e.g. cake between breaks), coffee, soft-drinks, tea etc. On top of that there was a tasty lunch buffet with all sorts of more or less typical regional food. You've set high standards for upcoming conferences ;)

Teddy in London

2012-10-29 20:04
While I was at the conference – Teddy spent some time exploring the surroundings of the conference hotel. Looks like in particular Hyde park was attractive:

Strata EU - part 4

2012-10-28 20:17
The rest of the day was mainly reserved for more technical talks: Tom White introducing the merits of MR2, also known as YARN. Steve Loughran gave a very insightful talk on the various failure modes of Hadoop – though the Namenode is likely the most obvious single point of failure there are a few more traps waiting for those depending on their Hadoop clusters: Hadoop does just fine with single harddisks failing. Failing single machines usually do not create a huge issue either. However what if the switch one of your racks is connected to fails? Suddenly not just one machine has to be re-replicated but a whole rack of machines. Even if you have enough space left in your cluster, can your network deal with the replication traffic? What if your cluster is split in half as a result? Steve gave an introduction to the various HA configurations available for Hadoop. There's one insight I really liked though: If you are looking for SPOFs in your system – just carry a pager … and wait.
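To get a feeling for the rack failure question, here is a minimal back-of-the-envelope sketch. It is not from Steve's talk: the function name, the rack size, the data per node and the bandwidth figure are all made-up assumptions, and it ignores replication throttling and competing job traffic.

```python
def rereplication_hours(nodes_per_rack, tb_per_node, usable_gbit_per_s):
    """Rough time to re-replicate all blocks that lived on one lost rack.

    Assumes every block on the rack must be copied once and that
    usable_gbit_per_s is the aggregate bandwidth the rest of the
    cluster can spare for replication (a hypothetical figure).
    """
    bits_to_copy = nodes_per_rack * tb_per_node * 1e12 * 8
    seconds = bits_to_copy / (usable_gbit_per_s * 1e9)
    return seconds / 3600


# Hypothetical rack: 20 nodes with 12 TB of block data each,
# 10 Gbit/s of spare aggregate bandwidth.
hours = rereplication_hours(20, 12, 10)
print(f"about {hours:.1f} hours of replication traffic")
```

Even with these rather generous assumptions the cluster is busy copying blocks for more than two days, which is why the network question matters at least as much as free disk space.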

In the afternoon I joined Ted Dunning's talk on fast kNN soon to be available in Mahout – the speedups gained really do look impressive – just like the fact that the algorithm is all online and single pass.

It was good to meet with so many big data people in two days – including Sean Owen who joined the Data Science Meetup in the evening.

Thanks to the O'Reilly Strata team – you really did an awesome job making Strata EU an interesting and very well organised event. If you yourself are still wondering what this big data thing is and in what respect it might be relevant to your company, Strata is the place to be to find out: Though being a tad too high-level for people with a technical interest, the selection of talks is really great when it comes to showing the wide impact of big data applications from IT and the medical sector right up to data journalism.

If you are interested in anything big data, in particular how to turn the technology into value, make sure you check out the conferences in New York and Santa Clara. Also all keynotes of London were video taped and are available on YouTube by now.

Strata EU - part 3

2012-10-27 20:16
The first Tuesday morning keynote put the hype around big data into historical context: According to wikipedia big data apps are defined by their capability of coping with data set sizes that are larger than can be handled with commonly available machines and algorithms. Going from that definition we can look back in history and will realize that the issue of big data actually isn't that new: Even back in the 1950s people had to deal with big data problems. One example the speaker went through was a trading company that back in the old days had a very capable computer at their disposal. To ensure optimal utilisation they would rent out computing power whenever they did not need it for their own computations. One of the tasks they had to accomplish was a government contract: Freight charges on rails had been changed to be distance based. As a result the British government needed information on the pairwise distances between all train stations in GB. The developers had to deal with the fact that they did not have enough memory to fit the whole computation into it – as a result they had to partition the task. Also Dijkstra's algorithm for finding shortest paths in graphs wasn't invented until 4 years later – so they had to figure something out themselves to get the job done (note: Compared to what Dijkstra published later it actually was very similar – only that they never published it). The conclusion is quite obvious: The problems we face today with Petabytes of data aren't particularly new – we are again pushing frontiers, inventing new algorithms as we go, partitioning our data to suit the compute power that we have.
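To make the idea of partitioning a pairwise distance computation a bit more tangible, here is a small sketch. It is not a reconstruction of what the developers did in the 1950s: it simply runs Dijkstra's algorithm once per source station and hands out the results one batch of sources at a time, so only one batch needs to sit in memory. The toy network and its distances are made up.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: distance from source to every reachable station."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, station = heapq.heappop(queue)
        if d > dist.get(station, float("inf")):
            continue  # stale queue entry, a shorter route was found already
        for neighbour, length in graph[station].items():
            candidate = d + length
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist


def pairwise_in_batches(graph, batch_size):
    """Yield {source: distances} for one batch of source stations at a time."""
    sources = sorted(graph)
    for start in range(0, len(sources), batch_size):
        batch = sources[start:start + batch_size]
        yield {s: shortest_distances(graph, s) for s in batch}


# Hypothetical toy rail network, track lengths are invented.
RAIL = {
    "A": {"B": 5, "C": 9},
    "B": {"A": 5, "C": 3, "D": 7},
    "C": {"A": 9, "B": 3, "D": 2},
    "D": {"B": 7, "C": 2},
}

for batch in pairwise_in_batches(RAIL, batch_size=2):
    for source, distances in batch.items():
        print(source, sorted(distances.items()))
```

Each batch could be written out (to tape back then, to disk today) before the next one is computed, which is the same trade of memory against passes over the data that we still make with large datasets.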

With everyday examples and a bit of hackery the second keynote went into detail on what it means to live in a world that increasingly depends on sensors around us. The first example the speaker gave was on a hotel that featured RFID cards for room access. On the card it was noted that every entry and exit to the room is being tracked – how scary is that? In particular when taking into account how simple it is to trick the system behind it into revealing some of the gathered information, as shown a few slides later by the speaker. A second example he gave was a leaked dataset of mobile device types, names and usernames. Looking at the statistics of that dataset answered a few questions: What is the distribution of device types? It was mainly iPads as opposed to iPhones or Android phones. What is the distribution of device names? Right after manufacturer names those contained mainly male names. When correlating these with a statistic on the most common baby name per year they managed to find that the owners were mainly in their mid thirties. So the group of people whose data had leaked used the app mainly on an iPad, was mainly male and in their thirties. With a bit more digging it was possible to deduce who exactly had leaked the data – and do that well enough for the responsible party (an American publisher) to not be able to deny that. The last example showed how to use geographical self tracking correlated with credit card transactions to identify fraudulent transactions – in some cases faster than the bank would discover them.

The last keynote provided some insight into the presentation bias prevalent in academic publishing – but in particular in medical publications: There the preference to publish positive results is particularly detrimental as it has a direct effect on patient treatment.

Strata EU - part 2

2012-10-26 20:15
The second keynote touched upon the topic of data literacy: In an age in which growing amounts of data are being generated being able to make sense of these becomes a crucial skill for citizens just like reading, writing and computing. The speaker's message was two-fold: a) People currently are not being taught how to deal with that data but are being taught that all that growing data is evil. Like an enemy hiding under their bed just waiting to jump at them. b) When it comes to getting the people around you literate the common wisdom is to simplify, simplify, simplify. However her approach is a little different: Don't simplify. Instead give people the option to learn and improve. As a trivial comparison: Just because her own little baby does not yet talk doesn't mean she shouldn't talk to it. Over time the little human will learn and adapt and have great fun communicating with others. Similarly we shouldn't over-simplify but give others a chance to learn.

The last keynote gave a really nice perspective on information overload and the history of information creation. Starting back in the age of clay tablets where writing was to 90% used for accounting only – tablets being tagged for easier findability. Continuing with the invention of paper – back then still as scrolls as opposed to books, which facilitated easy sequential reading but made random access hard. The obvious next step being books that allow for random access reads. Going on to initial printing efforts in an age where books were still a scarce resource. Continuing to the age of the printing press with movable type when books became ubiquitous – introducing the need for more metadata attached to books like title pages, TOCs and indexes for better findability. As book production became simpler and cheaper people soon had to think of new ways to cope with the ever growing amount of information available to them. Compared to that the current big data revolution does not look too unfamiliar anymore: Much like the printing press allowed for more and more books to become available, Hadoop allows for more and more data to be stored in clusters. As a result we will have to think about new ways to cope with the increasing amount of data at our disposal; it is time to start going beyond the mere production processes and deal with the implications for society. Each past data revolution left both winners and losers – mainly unintended by those who invented the production processes. The same will happen with today's data revolution.

After the keynotes I joined some of the nerdcore track talks on Clojure for data science and Cascalog for distributed data analysis, briefly joined the talk on data literacy for those playing with self tracking tools to finally join some friends heading out for an Apache Dinner. Always great to meet with people you know in cities abroad. Thanks to the cloud of people who facilitated the event!

O'Reilly Strata London - part 1

2012-10-25 20:13
A few weeks ago I attended O'Reilly Strata EU. As I had the honour of being on the program committee I remember how hard it was to decide on which talks to accept and which ones to decline. It's great to see that potential turned into an awesome conference on all things Big Data.

I arrived a bit late as I flew in only Monday morning. So I didn't get to see all of the keynotes and plunged right into Dyson's talk on the history of computing from Alan Turing to now including the everlasting goal of making computers more like humans, making them what is generally called intelligent.

The next keynote was co-presented by the Guardian and Google on the Guardian big data blog. Guardian is very well known for their innovative approach to journalism that more and more relies on being able to make sense of ever growing datasets – both public and not-yet-published. It was quite interesting to see them use technologies like Google Refine for cleaning up data, see them mention common tools like Google spreadsheets or Tableau for data presentation and learn more on how they enrich data by joining it with publicly available datasets.

Teddy in Down Under

2012-10-24 20:24
The last two September weeks Teddy was in Down Under. He spent the first few days exploring Sydney: Taking the ferry from Manly to the city each morning, followed by beautiful sunny weather, warm enough to already go swimming.

The following days took him to the Blue Mountains and into Kangaroo Valley for some hiking, animal watching and kayaking:

Of course Teddy also made some new friends:

A huge thanks to Tatjana, Steve and Ash for hosting us in Sydney. Thanks also to Brett, Laura, Samantha and Tobi for hosting us in the Blue Mountains. And thanks to the folks joining us on our very last evening for an Apache Dinner. Was great meeting you – looking forward to seeing you again soon.

Also thanks to Thoralf, Anja, Astro, Douwe, Stefan, Nick, Brett and everyone else who provided us with lots of hints and recommendations on what to do in and near Sydney. As usual it was too little time for too much to do and see.