FrOSCon - understanding Linux with strace

2012-09-06 20:29
Being a Java child I had only dealt with strace once before: trying to figure out whether any part of the Mahout tests happens to use /dev/random for initialisation, in a late-night debugging session with my favourite embedded Linux developer. strace itself is a great tool for seeing what your program is doing in terms of system calls, letting you follow what is going on at a very detailed level.

In his talk Harald König gave a very easy to follow overview on how to understand Linux with strace. Starting with the basic use cases (trace file access, program calls, replay data, analyse time stamps, do some statistics) he quickly moved on to showing some more advanced tricks you can do with the little tool: finding sys-calls that take surprisingly long vs. times when user code is doing long-running computations; capturing and replaying e.g. networking related calls to simulate problems that the application runs into; figuring out bottlenecks (or just plain weird stuff) in the application by finding the most frequent syscall; figuring out which configuration files an application really touches. Sorting those by last modified date with a bit of shell magic might answer the common question of whether it was the last update or the last time the user tinkered with the configuration that made his favourite editor appear green instead of white. On the other hand it can also reveal when configurations have been deleted: in the presentation he moved the user's emacs configuration away, and as a result emacs tried more than 30 times to find it for various configuration options during startup: Is it there? No. ... Is it now there? No. ... Maybe now? Nope. ... ;)
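
As a concrete illustration of the "most frequent syscall" trick: strace -c already prints such a summary for you, but if you are sitting on a raw log (say one produced with strace -f -o app.log ./app - the file name is just a made-up example), a few lines of Python are enough to get a similar ranking:

import re
import sys
from collections import Counter

# Matches the syscall name at the start of a log line; optional pid prefixes
# (as produced by strace -f) are skipped, and lines that don't start with a
# syscall (signals, exits, resumptions) simply don't match.
SYSCALL = re.compile(r'^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(')

def count_syscalls(path):
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = SYSCALL.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for name, count in count_syscalls(sys.argv[1]).most_common(10):
        print("%8d  %s" % (count, name))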

When looking at strace, you might also want to take a look at ltrace, which traces library calls - its output can be a bit more readable in that it shows library calls rather than just system calls. Remember though that tracing everything can not only make your app pretty slow but also quickly generate several gigabytes of information.

FrOSCon - Git Goodies

2012-09-05 20:34
In his talk on Git Goodies Sebastian Harl introduced not only some of the lesser known git tooling but also gave a brief introduction as to how git organises its database. He started with an explanation of how file contents are essentially stored as blobs identified by SHA1 hashes (thus avoiding duplication not only in the local database but all over the git universe), pointed to by trees, which are in turn generated and extended by commits, which are in turn referenced by branches (which move along with new commits) and tags (which don't). With that concept in mind it suddenly becomes trivial to understand that HEAD simply is a reference to wherever your next commit is going to end up. It also becomes natural that a HEAD pointing only to a commit id but not to a branch is called a detached HEAD.
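
To make the blob part a bit more tangible, here is a minimal Python sketch (my own illustration, not from the talk) of how git derives a blob's id - the hash covers a short header plus the raw content, which is why identical content is stored only once no matter where it shows up:

import hashlib

def git_blob_id(content: bytes) -> str:
    # git hashes "blob <size>\0" followed by the raw file content;
    # the result is the object id you also get from `git hash-object`.
    header = ("blob %d\0" % len(content)).encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad - the same id in every repository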

Commits in git are tracked in three places: the repository (this is where stuff goes after a commit), the index (this is where stuff goes after an add or rm) and the working directory. Undoing works in the opposite direction: git checkout takes stuff from the index or repository and puts it into the current working copy, git reset --mixed only resets the index, and git reset --hard additionally resets the working directory.

When you start working more with git, start reading the man and help pages. They contain lots of goodies that make daily work easier: there are options for colored diffs, for setting external merge tools (e.g. vimdiff) and for setting the push default (just the current branch or all matching branches). There are options to define aliases for commands (diff has a large variety of options that can be handy, like coloring only differing words instead of whole lines). There are options to set the git-dir (where .git lies) as well as the working directory, which makes it easy to track your website in git without having the git directory lie in your public_html folder.

There is git archive to export your tree as a tar.gz. When browsing the git history tig can come in handy - it lets you browse your repository with an ncurses interface, showing logs, diffs and the tree of all commits. You can ask it to only show logs that match a certain pattern.

Make sure to also look at the documentation of rev-parse, which explains how to reference commits in an even more flexible manner (e.g. master@{yesterday}). Also check out the git reflog to take a look at the version history of your versioning - really handy if you ever mess up your repository and need to get back to a sane state, and a good way to recover commits left behind by a detached HEAD. Take a look at git bisect to learn how to binary-search for the commit that broke your build. Use git add -p for a fine-grained way to add changes to your repository - and do not forget to take a look at git stash as well as cherry-pick.

FrOSCon - Robust Linux embedded platform

2012-09-04 20:05
The second talk I went to at FrOSCon was given by Thilo Fromm on Building a robust embedded Linux platform. For more information on the underlying project see also project HidaV on github. Slides of the talk Building a robust Linux embedded platform are already online.

Inspired by a presentation on safe upgrade procedures in embedded devices given by Arnout Vandecappelle in the Embedded Dev Room at FOSDEM earlier this year, Thilo extended the scope of the presentation a bit to cover safe kernel upgrades as well as package updates in embedded systems.

The main goal of the design he presented was to allow for developing embedded systems that are robust - both in normal operation and when upgrading to a new firmware version or a set of new packages. The design therefore includes support for upgrading and rolling back to a known working state in an atomic way. Having systems deployed somewhere in the wild to power a wind turbine, inside buses and trains or even within satellites pretty much forbids relying on an admin to press the "reset button".



Original image xkcd.com/705

The reason for putting that much energy into making these systems robust also lies in the ways they are deployed. Failure vectors include not only the usual software bugs, power failures or configuration incompatibilities: transmission errors, storage corruption, temperature and humidity add their share to increase the probability of failure.

Achieving these goals by building a custom system isn't trivial. Building a platform that is versatile enough to be used by others building embedded systems adds to the challenges: suddenly easy-to-use build and debug tools, support for software life-cycle management and extensibility are no longer just nice-to-have features.

Thilo presented two main points to address the requirements. The first is to avoid trying to cater to every use case: set requirements for the platform in terms of performance and un-brickability (see also urban dictionary, third entry as of this writing), and even set requirements for dual boot support or the internal storage technology used. As a result designing the platform can become a lot less painful.

The second step is to harden the platform itself. Here that means that upgrading the system (both firmware and packages) is atomic, can be rolled back atomically and thus no longer carries the danger of taking the device down for longer than intended: a device that no longer performs its task in the embedded world usually is considered broken and shipped back to the producer. As a result upgrading may be necessary but should never render the device useless.

One way to deal with that is to store boot configurations in a round-robin manner - for each configuration two flags are needed: "was booted" (set by the bootloader on boot) and "is healthy" (set by the system either after a certain time of stable operation or after running self tests). This way at each boot it is clear what the last healthy configuration was.
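
A rough sketch of that selection logic (the flag and field names are made up, and a real implementation would live in the bootloader rather than in Python):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BootConfig:
    kernel: str
    rootfs: str
    was_booted: bool = False   # set by the bootloader when it tries this entry
    is_healthy: bool = False   # set by the system after self tests / stable uptime

def pick_boot_config(configs: List[BootConfig]) -> Optional[BootConfig]:
    # Newest entries come last. Give an untried configuration one chance...
    for cfg in reversed(configs):
        if not cfg.was_booted:
            return cfg
    # ...otherwise fall back to the newest configuration known to be healthy.
    for cfg in reversed(configs):
        if cfg.is_healthy:
            return cfg
    return None   # nothing bootable left - the device needs service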

To do the same with your favourite package management system is slightly more complicated: imagine running something like apt-get upgrade with the option to switch back to the previous state in an atomic way if anything goes wrong. One option presented to deal with that is to work with transparent overlay filesystems that allow for a read-only base layer - and a "transparent" r/w layer on top. If a file does not exist in the transparent layer, the filesystem will return the original r/o version. If it does exist, it will return the version in the transparent overlay. In addition there is also an option to mark files as deleted in the overlay.

With that, upgrading becomes as easy as installing the upgraded versions into some directory in your filesystem and mounting said directory as a transparent overlay. Roll-backs as well as snapshots then become easy to do.
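
The lookup rule itself is simple enough to sketch in a few lines - this is just a toy illustration of the semantics, not how any particular overlay filesystem (or the HidaV build) implements it, and the whiteout marker used here is made up:

import os

WHITEOUT_PREFIX = ".wh."   # made-up marker for "deleted in the overlay"

def resolve(base: str, overlay: str, relpath: str):
    # A whiteout entry in the r/w layer hides the file completely.
    parent, name = os.path.split(relpath)
    if os.path.exists(os.path.join(overlay, parent, WHITEOUT_PREFIX + name)):
        return None
    # Otherwise the r/w overlay wins over the r/o base layer.
    upper = os.path.join(overlay, relpath)
    if os.path.exists(upper):
        return upper
    lower = os.path.join(base, relpath)
    return lower if os.path.exists(lower) else None

# Upgrading means populating the overlay directory and mounting it on top;
# rolling back means unmounting it again - the r/o base was never touched.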

The third ingredient to achieving a re-usable platform presented was to use Open Embedded. With an easy-to-extend layer-based concept, support for mostly recent software versions, versioning and dependency modelling, and some BSP layers officially supported by hardware manufacturers, building a platform on top of Open Embedded is one option to make it easily re-usable by others.

If you want to know more about the concepts described, join the HidaV platform project - many of the concepts described are already implemented or soon will be.

FrOSCon 2012 - REST

2012-08-29 19:33
Together with Thilo I went to FrOSCon last weekend. Despite a few minor glitches and the "traditional" long BBQ line the conference was very well organised and again brought together a very diverse crowd of people including but not limited to Debian developers, OpenOffice people, FSFE representatives, KDE and Gnome developers, and people with a background in Lisp, Clojure, PHP, Java, C and HTML5.

The first talk we went to was given by JThijssen on REST in practice. After briefly introducing REST and going a bit into myths and false beliefs about REST he explained how REST principles can be applied in your average software development project.

To set a common understanding of the topic he first introduced the four levels of the REST Maturity Model: level zero means using plain old XML over HTTP for RPC, or SOAP - nothing particularly fancy here, even to some extent breaking common standards related to HTTP. Going one level up means modelling your entities as resources. Level two is as simple as using the HTTP verbs for what they are intended for - don't delete anything on the other side just by using a GET request. Level three finally means using hypermedia controls, HATEOAS, and providing navigational means to decide on what to do next.

Myths and legends

REST is always HTTP - well, it is transport agnostic. However, it mostly uses HTTP for transport.

REST equals CRUD - though not designed for that, it is often used for that task in practice.

REST scales - as a protocol, yes; however that of course does not mean that the backend you are talking to does. All REST does for you is give you a means to scale horizontally without having to worry too much about server state.

Common mistakes

Using HTTP verbs - if you've ever dealt with web crawling you probably know those stories of some server's content being deleted just by crawling a public facing web site, because there was a "delete" button somewhere that would trigger a delete action through an innocent looking GET request. The lesson learnt from those: use the verbs for what they are intended to be used. One commonly confused pair is PUT vs. POST. A common rule of thumb, which also applies to the CouchDB REST API: use PUT if you know what the resulting URL should be (e.g. when storing an entry to the database and you know the key that you want to use); use POST if you do not care about which URL should result from the operation (e.g. if the database should automatically generate a unique key for you). Also make sure to use the error codes as intended - never return a 2xx status code only to add an xml snippet to the payload that explains to the surprised user that an error occurred, including an error code. If you really need an explanation of why this is considered bad practice if not plain evil, think about caching policies and related issues.
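
A quick sketch of the PUT vs. POST rule of thumb using Python's requests library (the endpoint URL is made up, the pattern is what matters):

import requests

BASE = "http://localhost:5984/articles"   # made-up CouchDB-style endpoint

doc = {"title": "FrOSCon wrap-up", "tags": ["rest", "froscon"]}

# PUT: the client chooses the key, so the resulting URL is known up front.
requests.put(BASE + "/froscon-2012-rest", json=doc)

# POST: the server picks the key and tells us where the resource ended up.
resp = requests.post(BASE, json=doc)
print(resp.status_code, resp.headers.get("Location"))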

When dealing with resources a common mistake is to stuff as much information as possible into one single resource for one particular use case. This means transferring a lot of additional information that may not be needed for other use cases. A better approach could be to allow clients to request custom views and joins of the data instead of pre-generating them.
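
One way such custom views are often exposed is a fields parameter on the resource URL (e.g. GET /users/42?fields=id,name - the parameter name is just an example); filtering server-side is then trivial:

def render(resource, fields_param=None):
    # No fields requested: return the full representation.
    if not fields_param:
        return resource
    wanted = {field.strip() for field in fields_param.split(",")}
    return {key: value for key, value in resource.items() if key in wanted}

user = {"id": 42, "name": "Ada", "email": "ada@example.org", "friends": [7, 23]}
print(render(user, "id,name"))   # {'id': 42, 'name': 'Ada'}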

When it comes to logging in to your API - don't design around HTTP, use it. Sure, you can hand the user a session id in a cookie. However, then you are left with the problem of handling client state on the server - which was supposed to be stateless so clients can talk to any server. You could store the logged-in information in the client cookie - signing and encrypting it might even make that slightly less weird. However, the cleaner approach is to authenticate individual requests and avoid state altogether.
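
A minimal sketch of authenticating individual requests instead of keeping a session, for example by signing each request with a per-user secret (the header names and signing scheme here are made up for illustration):

import hashlib
import hmac
import time

SECRET = b"per-user-secret-shared-with-the-server"   # example value

def sign_request(method, path, body, user):
    # Everything needed to verify the request travels with the request itself,
    # so any server holding the user's secret can check it - no session state.
    timestamp = str(int(time.time()))
    message = ("%s\n%s\n%s\n" % (method, path, timestamp)).encode() + body
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return {"X-User": user, "X-Timestamp": timestamp, "X-Signature": signature}

headers = sign_request("GET", "/games/42", b"", "isabel")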

When it comes to URL design, keep in mind to keep URLs in a format that is easy to handle for caches. An easy check is to try and bookmark the page you are looking at. Also think about ways to increase the number of cache hits if results are even slightly expensive to generate. Think about an interface to retrieve the distance from Amsterdam to Brussels. The URL could be /distance/to/from - but given no major road issues the distance from Amsterdam to Brussels should be the same as from Brussels to Amsterdam. One easy way to deal with that is to allow both requests but send a redirect to the first version in case a user requests the second. The semantics would be slightly different when asking for driving directions - there the returned answers would indeed differ.
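
Sketching the distance example: pick one ordering as canonical and redirect the other one to it, so both phrasings of the same question end up as a single cache entry (the numbers here are made-up toy data):

DISTANCES = {("amsterdam", "brussels"): 209}   # made-up toy data, km

def distance_resource(frm, to):
    a, b = sorted((frm.lower(), to.lower()))
    if (frm.lower(), to.lower()) != (a, b):
        # Non-canonical ordering: point caches and clients at the canonical URL.
        return 301, {"Location": "/distance/%s/%s" % (a, b)}, None
    return 200, {"Cache-Control": "max-age=86400"}, DISTANCES.get((a, b))

print(distance_resource("Brussels", "Amsterdam"))   # 301 redirect
print(distance_resource("Amsterdam", "Brussels"))   # 200 with the distance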

The speaker also introduced a concept for handling asynchronous updates that I found interesting: when creating a resource, hand out a 202 Accepted response including a queue ticket that can be used to query for progress. For as long as the ticket is not yet being actively processed it may even offer a way to cancel the request. As soon as the resource is created, requesting the ticket URL will return a redirect to the newly created resource.
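
A sketch of that ticket pattern (an in-memory toy with made-up URLs, using a 303 redirect once the resource exists):

import uuid

TICKETS = {}   # ticket id -> {"state": ..., "resource": ...}

def create_resource(payload):
    # Don't block the client: accept the work and hand out a ticket URL.
    ticket = str(uuid.uuid4())
    TICKETS[ticket] = {"state": "queued", "resource": None, "payload": payload}
    return 202, {"Location": "/tickets/" + ticket}, {"status": "accepted"}

def get_ticket(ticket):
    entry = TICKETS[ticket]
    if entry["state"] == "done":
        # The resource exists now - redirect the client to it.
        return 303, {"Location": entry["resource"]}, None
    # Still queued or running; while queued, DELETE /tickets/<id> could cancel it.
    return 200, {}, {"status": entry["state"]}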

The gist of the talk for me was to not break the REST constraints unless you really have to - stay realistic and pragmatic about the whole topic. After all, most likely you are not going to build the next Twitter API ;)


Video: Stefan Hübner on Cascalog

2012-08-28 20:49

Video: "Accessing Hadoop data with HCatalog and PostgreSQL"

2012-08-20 20:53

Open Source Meetup Berlin

2012-08-19 20:22
This evening the (to my knowledge first) Berlin Open Source Meetup took place at the Prater (Bier-)garten in Berlin. There are lots of project specific meetings, a monthly Free Software meeting and quite some events on project management. However this was one of the rare occasions where you get Linux kernel hackers, Wikidata project members, Debian developers, security people, mobile developers as well as people writing about free software or making movies related to the topic around one table.

Despite the heat (over 30 degrees Celsius in Berlin today) more than 30 people gathered for some food, cold beer, some drinks and lots of interesting discussions. It would be great to see another edition of this kind of event.

Spotted this morning...

2012-08-16 22:03
in front of my office:



Ever wondered how accurate navigable map data for your Garmin, your in-car navigation system (most likely), or maps.nokia.com is created? One piece of the puzzle is the car above, collecting data for Navteq, a subsidiary of Nokia.

Apache Hadoop Get Together Berlin - August 2012

2012-08-15 23:30
Despite beautiful summer weather roughly 50 people gathered at ImmobilienScout24 for the August 2012 edition of the Apache Hadoop Get Together (thanks again to ImmoScout for hosting the event and sponsoring drinks and pizza, as well as to David Obermann for organising the meetup).



Today there were three talks: in the first presentation Dragan Milosevic (also known from his talk at the Hadoop Get Together and his presentation at Berlin Buzzwords) provided more insight into how Zanox is managing their internal RPC protocols, in particular when it comes to versioning and upgrading protocol versions. Though in principle very simple to do, this sort of problem still is very common when starting to roll out distributed systems and scaling them over time. The concepts he described were not unlike what is available today in projects like Avro, Thrift or protocol buffers. However, by the time they needed versioning support for their client-server applications neither of these projects was a really good fit. This also highlights one important constraint: with communication being a very central component in distributed systems, changing libraries after an implementation went to production can be too painful to follow through on.

In the second presentation Stefanie Huber, Manuel Messner and Stephan Friese showed how Gameduell is using Hadoop to provide better data analytics for marketing, BI, developers, product managers et al. Founded in 2003, they have accumulated quite a bit of data consisting of micro transactions (related to payment operations), user activities and gaming results that are needed for balancing games. Their team turned a hairy, complex system into a pretty clean, Hadoop based solution: by now all actions end up in a Hadoop cluster (with an option to subscribe to a feed for realtime events). Typically from there people would start analysis jobs either in plain map reduce or in Pig and export the data to external databases for further analysis by BI people, who preferred Hive as a query language as it is much closer to SQL than any of the alternatives. As of late they introduced HCatalog to provide a common view on the data for all three analysis options - in addition to allowing for a more abstract view of the data available that does not require knowing the exact filesystem structure to access the data.

After a short break, in the last talk of the evening Stefan Hübner introduced Cascalog to the otherwise pretty Java-savvy crowd. Being based on Cascading, Cascalog provides a concise way of formulating queries to a Hadoop cluster (compared to plain map reduce). Also, when contrasted with Pig or Hive, what stands out is the option to easily and seamlessly integrate additional functions (both map- and reduce-side) into Cascalog scripts without switching languages or abstractions. Note: when testing Cascalog scripts, one project to look at is Midje.

Overall a really interesting evening with lots of interesting discussions and new input. It is always amazing to see what big data applications people in Berlin are developing - and awesome to see so many development teams adopt relatively new technologies (some even still in the Apache Incubator) for production systems. Looking forward to the next edition - as well as to the slides and videos of today's edition.

Data Scientists - researchers' perspectives

2012-08-03 20:35
"Data scientist" as a term has caught quite some attention as of late (together with all the big data, scalability and cloud hype). Instead of re-hashing arguments seen in other sources I thought it might make more sense to link to a few of the thought provoking posts I came across recently.