One day later

2012-01-05 23:57

Fun little new toy

2012-01-03 23:48
Yesterday Thilo invited me to attend an "Electronics 101" workshop including an introduction to soldering that was scheduled to start at 7 p.m. this evening at the offices of IN-Berlin e.V. As part of my studies back at university I do have a little bit of background in electronics, but I had never tried any serious soldering before (apart from fixing one of our audio cables), so I thought, why not.

The workshop turned out to be a lot of fun: The organisers Mitch Altman and Jimmie Rodgers had brought several pre-packaged kits for people to work on. Quite a few of them are based on Arduino, so after putting them together you can continue the fun by writing little programs. After giving a brief but very well done, easy-to-understand introduction to digital electronics, Mitch showed attendees how to use a soldering iron (make sure to check out his comic "Soldering is Easy" if you want to know more) and got everyone started. Both Jimmie and Mitch did a great job answering questions as they came up, fixing issues and generally helping out with any problems. Even those who had never used a soldering iron before quickly got up to speed and in the end went home with that nice experience of having built something that you can not only program but also touch and hold in your hands.

I got myself a LoL shield (still to be done) and a Diavolino. Still missing is the FTDI TTL-232R cable for hooking the device up to our laptops to be able to re-program it (though that will most likely be easier to find than the >1 GOhm resistor Thilo is looking for to calibrate his Geiger counter).

Results of my first session are below:

[Images: the board - first pins attached - last pins attached]

Also thanks to Sven Guckes for organising and announcing this workshop on short notice. And thanks to Thilo for talking me into it.

Update: Images of the event are available online.

Talking people into submitting patches - results

2012-01-01 18:42
Back in November I gave a talk at ApacheCon NA in Vancouver on talking friends and colleagues into contributing patches to open source projects. The intended audience for this talk was experienced committers to Apache projects; the goal was to learn more about their tricks for talking people into patching. First of all, thanks for an interesting discussion on the topic - it was great to get into the room with barely enough slides to fill 10 minutes and still have a lively discussion 45 minutes later.

For the impatient - the written feedback is available as a Google Doc. The most common advice I heard involved patience, teaching, explaining, fast feedback and reward.

One warning before going into more detail on the talk: All assumptions and observations stated are highly subjective, influenced by my personal experience or by whatever the experience of the audience was. Do not expect an objective, balanced, well-researched analysis of the problem in general. That said, let's start with the talk itself. Before the talk I decided to limit the scope to getting people in who have limited experience with open source. That intentionally excluded downstream projects depending on one's code. Though interaction with common Linux distributions and their package maintainers in particular is vital, that issue warrants a separate talk and discussion.

I divided those inexperienced with open source into three groups to keep discussion somewhat focused:

  • Students learning about open source projects during their education, who have neither a background in software engineering nor in open source, but are generally very eager to learn and open to new ideas.
  • Researchers learning about the concept as part of a research grant, who have some software engineering experience and some experience with open source - in particular with using it - but who in general do not have writing open source software as their main objective and participate because their research grant requires it.
  • Software engineers, who have experience with software engineering, some experience with using open source in particular, and in general both strong opinions on the right way of doing things and a strong position in their team - which helps them in no way when starting to contribute.

One very common way

To understand some of the issues below, let me first highlight what seems to be the most common way to become involved with any Apache project: Usually it starts with using one of their software packages. After some time, what is shipped no longer fits your needs, reveals bugs that stop you from reaching your goals, or is missing one particular feature - even if that is just one particular method being protected instead of private.

People fix those issues. As the best software developers are utterly lazy, they contribute the changes back to the project to avoid having to maintain a private fork just for some simple modification. The more features of a project are being used, the more likely it becomes that larger contributions are possible as well. Overall this way of selecting issues to fix has a lot to do with scratching your own itch. In the end this kind of issue prioritisation also influences the general direction of a project: Whatever is most important to those actively contributing drives the design and development. So the only way to change a project's direction to better fit your needs is to get active yourself: Those that do are the ones that decide.


Students

Let's take a closer look at students aspiring to work on an open source project. They are very keen on contributing new stuff, learning the process and open to new ways of doing things. However, for the most part they are not active users of the projects they selected, so they do not directly see what is important to fix. In addition they have only limited software development experience - at least when looking at German universities: bug trackers, source version control, build systems, release management, maintaining backwards compatibility and unit test frameworks are on no curriculum - and most likely shouldn't be either. So your average student has to learn to deal with checking out code, compiling it, getting it into their favourite editor, adding tests and making them pass.

Apart from teaching and giving even simple feedback, it helps to provide the right links to literature at the right times and to generally mentor students actively. In addition it can be helpful to leave non-critical, easy-to-fix issues open and mark them as "beginner level" to make it easier for newcomers to get started. One last piece of advice: Get students to publish what they do as early and as often as possible. Back in the day I used to run projects at TU Berlin with the goal of getting students to contribute to Mahout. In the first semester I left the decision on when to open up the code to the students - they never went public. In the second semester I forced them to publish progress on a weekly basis (and made that part of their final evaluation) - suddenly what was developed turned into a patch committed to the code base.


Researchers

A second group of people with an increasing interest in open source projects are researchers. In particular for EU research grants, the promise of providing results and software developed with the help of European tax-payers' money under an open source license has become an important plus when applying for project grants.

However, before becoming all too optimistic it might make sense to take a closer look: Even though there is an open source check box on your average research grant, that by no means leads to highly motivated, well educated new contributors for your project. With software development being only a means to reach the ultimate goal of influential publications, researchers usually do not have the time and motivation to polish software to the level needed for a successful and useful contribution. In addition, the concept of maintaining your contribution over a longer time usually does not fit the timeline and timeframe of a research project.

Apart from teaching and mentoring, projects themselves should start asking about the motivation behind a contribution. There are a few popular arguments for contributing patches back. However, not all of them really work for the research use case: The cost of maintaining a fork is close to zero if you intend to never upgrade to a new version and do not need security fixes. Another common argument is improved visibility of your work and an improved reputation for yourself as a software developer. If software development for you is just a means to reach a much higher goal, those arguments may not mean much to you. A third common argument is that of improving code quality by having more than one pair of eyes review it - and where would you get a better review than in the project that brings together the original code authors? However, if ultimate stability, security and flexibility are not your goal, then that, too, may not mean much to you.

The key is to find out where the interest in working on open source comes from and to build up arguments from there.

Software engineers

The third group I identified was professional software developers - as clarified after a question from the audience: Yes, I do consider people who are unable to create, read or apply patches professional software developers. If I excluded these people, there would be no one left who earns their living with software development and does not already work on open source projects.

In contrast to the above groups, these people have extensive software development experience. However, that also means that after having seen a lot of stuff that works and that does not work, they have a strong position in their teams. Usually those fixing issues in libraries they use are the ones that have established workflows that work very well for them and who are used to being pretty influential. When going into an open source community, however, no one knows them. In general they are judged only on their patch. They get open feedback - in the context of that project. Projects tend to have established coding guidelines, best practices and build systems - which may differ from what you are used to in your corporate environment.

Getting up to speed in such an environment can be intimidating, in particular since everything you do is public, searchable and findable by definition. All the more important it is to get involved and get feedback early, even by putting early sketches of your plan online.

However, with everything being open there is also one major positive side to motivating contributors: Give credit where credit is due - add praise to the issue tracker by assigning issues to the one providing the patch, and add the name of the contributor to your release notes. When substantial, mention the contribution by name in talks, presentations and publications.

Another important issue here is the influence of deadlines: If it takes half a year to get feedback on your particular improvement, the reason you made it may no longer exist - the project may have been cancelled, the developer moved to a different team, or the patch applied internally as-is, fixing the existing issues. Fast feedback on new patches, in particular if they are clean and come with tests, is vital. One positive example of providing feedback on formal issues quickly is the automated review bot at Apache Hadoop: It checks things like style, addition of tests and regressions against existing tests, automatically and shortly after the patch is submitted. Just one nitpick from the audience: The output of that bot could either be marked more clearly as "this is automated" or the text formulated a bit more friendly - if a human had done the review, the positive things would have been mentioned first before criticising what is wrong.

Last but not least (this applies to researchers as well), there may be legal issues lurking: Most if not all contracts entail that at least what you do during working hours belongs to your employer - so it's up to them what gets open sourced and what doesn't. Suddenly your very technical new contributor has to convince management, deal with legal departments and work their way through the employer's processes - most likely without deep prior knowledge of open source licenses, let alone contributor agreements (or did you know what the Apache CCLA entails, let alone being able to explain it to others, before really getting active?).

General advice

To briefly summarise the most important points:

  • Give feedback fast - projects only run for so long, interest only lasts for so long. The faster a contributor is told what is not so great about his patch, the more likely those issues get fixed as part of the contribution. (Inspired by Avro and Zookeeper, who were amazingly fast in providing feedback, committing and, in the case of Avro, even releasing a fixed version.)
  • When it comes to new contributors, be patient and remain friendly, even when faced with seemingly stupid mistakes.
  • Give credit where credit is due - or could be due. Mention contributors in publications, press releases, release notes and the bug tracker. Let them know that you do. (Inspired by Drools, Tomcat, Zookeeper, Avro.) Pro tip: Make sure there is no typo in people's names, even if checking takes one extra minute. (Learned from Otis.)
  • Use any chance you get to teach the uninitiated about the whole patch process. I know this seems trivial to those who work with open source on a daily basis. However, when getting dependencies through Maven it may already be cumbersome to figure out where to get the source from. When used to git in your daily workflow, it may be a hurdle to remember how to check out stuff from svn ;) Back in June we had a Hadoop Hackathon in Berlin that was well attended - mostly by non-committers. Jakob Homan proposed a rather unusual but very well received format: In the Hadoop bug tracker there are several issues marked as trivial (typos in documentation and the like). Attendees were asked to choose one of these issues, check out the source, create a patch and contribute it back to the project. Optionally, they were shown how the process continues from there on the committer side of things. It may seem trivial to mechanically go through the patch process, however it helps lower the bar to first get accustomed to just how it works before you have a real issue to fix. If instead of contributing to Apache you are more into working on the Linux kernel, I'd advise you to watch Greg Kroah-Hartman on writing and submitting your first Linux kernel patch (FOSDEM).
  • Last but not least, make sure to lower the bar for contribution - do not require people to jump through numerous hoops; in general, even just getting a patch ready is complicated enough. Provide a how-to-contribute page (e.g. see the how to contribute and how to become a committer pages in the Apache Mahout wiki).
  • In particular when your project is still very young, lower the bar by turning contributors into committers quickly - even if they are "just" contributing documentation fixes. In my view that is one of the most important contributions there is, as only users spot areas for documentation improvement.

In case you are thinking about contributing yourself and need some additional advice as to why and for what purposes: Dr. Dobb's has more information on reasons why developers tend to start contributing to Apache software, Shalin explains why he contributes to open source, on the Mahout mailing list we had a discussion on why students, too, should consider contributing, and on the Apache community mailing list there was an interesting discussion on whether developers working on open source are happier than those who don't.


2011-12-30 02:07

Restate my assumptions.

One: Mathematics is the language of nature.

Two: Everything around us can be represented and understood through numbers.

Three: If you graph the numbers of any system, patterns emerge. Therefore, there are patterns everywhere in nature.

The above is a quote from today's "Hackers in movies" talk at 28c3 - which, amongst others, also showed a brief snippet of the movie Pi. For several years I stayed well away from that famous hackers' conference in Berlin that takes place annually between Christmas and New Year. 23C3 was the last congress I attended until today. Though there were several fantastic talks and mean presentation quality was pretty good, the standard deviation of talk quality was just too high for my taste. In addition, due to limited room sizes with four tracks there were quite a few space issues.

In recent years much of that has changed: The maximum number of tickets is strictly enforced, there is an additional lounge area in a large tent next to the entrance, and for the sake of larger rooms the number of tracks was reduced to three. Streaming works for the most part, making it possible for those who did not get one of the 3000 full conference tickets to follow the program from their preferred hacker space. In addition, fem does an amazing job of recording, mastering, encoding and pushing videos online: Hacker Jeopardy - a show that wasn't over until early Thursday morning (about 3 a.m.?) - was up on Youtube by Thursday at 7 a.m., if not earlier.

Several nice geeks talked me into joining the crowd briefly this evening for the last three talks in "Saal 1" depicted above: You cannot be in Berlin during 28c3 and not see the so-called "fnord Jahresrückblick" by Fefe and Frank Rieger, creators of the Alternativlos podcast.

Overall it is amazing to watch the BCC being invaded by a large group of hackers. It's fun to see quite a few of them on Alexanderplatz and to watch people have fun with a TV-B-Gone in front of large electronics stores. It's great to get to watch highly technical but also several political talks four days in a row, from 11 a.m. until at least 2 a.m. the following day, given by people who are very passionate about what they do and the projects they spend their time on.

If you are into tinkering, hacking, trying out sorting algorithms and generally having fun with technology, make sure you check out the 28c3 Youtube channel. If you want to learn more about hacker culture, mark the days between Christmas and New Year next year and attend 29c3 - don't worry if you do not speak German - the majority of talks are in English, and most of the ones that aren't are translated on the fly by volunteers. If you are good at translations, feel free to volunteer for that task yourself. Speaking of volunteering: My respect to all angels (helping hands), heralds (those introducing speakers), the NOC (network operation center), the POC (phone operation center), the organisation team and everyone who helps make that event as enjoyable to attendees as it is.

Update: Thank you to the geeks who after staying in our apartment for #28c3 helped get it back to a clean state - actually cleaner than it was before. You rock!

Learning Machine Learning with Apache Mahout

2011-12-13 22:20
Once in a while I get questions like "Where do I start learning more about machine learning?". Other than the official sources, I think there is quite good coverage in the Mahout community as well: Since it was founded, several presentations have been given that give an overview of Apache Mahout, introduce special features or even go into more detail on particular implementations. Below is an attempt to collect the talks given so far, without any claim to contain links to all videos or lectures. Feel free to add your favourite in the comments section. In addition I have linked to some online courses with further material to get you started.

When looking for books, of course check out Mahout in Action. Taming Text and the data mining book that comes with Weka are also good starting points for practitioners.

Introductory, overview videos

Technical details

Further course material

Berlin Tech Meetups

2011-12-09 22:32
Berlin is currently of growing interest to software engineers, has a very active startup scene and, as a result, several community-organised meetups. Listed below is a short, "highly objective" selection of local user groups - showing just the breadth of topics discussed.

If you want to discover new meetups: It helps to attend the one closest to your interests, as people usually follow several user groups. In addition, watching the scheduled events at co-working and hacker spaces like co-up Berlin, betahaus or c-base can help.

Video up: Douwe Osinga

2011-12-09 22:01

Video: Max Jacob on Pig for NLP

2011-12-09 21:26

Slides online

2011-12-09 06:55
Slides of this week's Apache Hadoop Get Together Berlin are online at:

Overall a great event, well organised - looking forward to seeing you at the next Get Together. If you want to get in touch with other participants, learn about new events or simply chat between meetups, join our Apache Hadoop Get Together LinkedIn group.

Apache Hadoop Get Together Berlin December 2011

2011-12-08 01:50
First of all a huge thank you to David Obermann for organising today's Apache Hadoop Get Together Berlin: After a successful Berlin Buzzwords and a rather long pause following it, a Christmas meetup finally took place today at Smarthouse, kindly sponsored by Axel Springer and organised by David Obermann from idealo. About 40 guests from Neofonie, Nokia, Amen, StudiVZ, Gameduell, TU Berlin, nurago, Soundcloud and many others made it to the event.

In the first presentation, Douwe Osinga from Triposo went into some detail on what Triposo is all about, how development there differs in terms of scope and focus from larger corporations, and what patterns they use for getting the data crawled, cleaned and served to users.

The goal of Triposo is to be able to build travel guides in a fully automated way. In contrast to simply creating a catalog of places to go to, the goal is to have an application that is competitive with Lonely Planet books: have tours, detailed background information, recommend places to visit based on weather and seasonal signals, and allow users to create travel books.

Having joined Triposo from Google, Douwe gave a rather interesting perspective on what makes a startup attractive for innovative ideas. According to his talk, four aspects of application development matter for Google projects: First is embracing failure. Not only can single hard disks fail; servers might be switched off automatically for maintenance, and even entire datacenters going offline must not affect your application. Second is a strong focus on speed: Developers working with dynamic languages like Python, which allow for rapid prototyping at the expense of slower runtime, are generally frowned upon. Third is the focus on search that is ingrained in every piece of architecture and thinking. Fourth and last is a strong build-your-own mentality, which may lead to great software but leaves software developers on an isolated island of proprietary software that shapes - and can limit - your way of thinking.

He gave Youtube as an example: Though built on top of MySQL, implemented in Python and certainly not failure-proof in every aspect, they succeeded by concentrating on users' needs, time to market, and iteratively improving their software with a frequent (as in one week) develop-release-deploy cycle. When entering new markets and providing innovative applications, it is often crucial to be able to move quickly, at the expense of runtime speed and stability. It certainly is important to consider different architectures and choose the one that is appropriate for the problem at hand. The same reasoning applies to Apache Hadoop as well: Do not try to solve problems with it that it was not made to solve. Instead, first think about the right tool for your job.

Triposo itself is built on top of 12 data sources, most of them freely available, integrated to build a usable and valuable travel guide application for iOS and Android. The features available in Triposo can be phrased as a search and information retrieval problem and as such lend themselves well to integrating sources. With offers from Amazon, Google itself, Dropbox and the like, it has become easy to deploy applications in an elastic way and scale with your user base and the demand for more country coverage. For them it proved advantageous to go for an implementation based on dynamic languages for pure development speed.

When it comes to QA, they take a semi-manual approach: There are scripts checking recall (the Brandenburger Tor must be found in the Berlin guide) as well as precision (there must be only one Brandenburger Tor in Berlin). Those rules need to be tuned manually.
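In spirit, such checks can be sketched in a few lines - a minimal illustration, assuming a made-up data layout and landmark list; this is not Triposo's actual code:

```python
from collections import Counter

def missing_places(guide, must_contain):
    """Recall check: every landmark on a hand-curated list must be present."""
    names = {place["name"] for place in guide}
    return [name for name in must_contain if name not in names]

def duplicated_places(guide):
    """Precision check: no landmark may appear more than once in a guide."""
    counts = Counter(place["name"] for place in guide)
    return [name for name, n in counts.items() if n > 1]

# Invented toy data for illustration.
berlin_guide = [{"name": "Brandenburger Tor"}, {"name": "Fernsehturm"}]
```

The manual part is then curating the `must_contain` lists per city and deciding how strict each rule should be.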

When integrating different sources, you quickly run into a duplicate discovery problem. Their approach is pretty pragmatic: Merge anything that you are confident enough to call a duplicate. Kill everything that is likely a duplicate but that you are not confident enough to merge. The general guideline is to rather miss a place than have it twice.
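A minimal sketch of that merge-or-kill heuristic, with invented thresholds and an assumed similarity score standing in for whatever confidence measure they actually use:

```python
# Hypothetical thresholds - the real values and similarity function are unknown.
MERGE_THRESHOLD = 0.9  # confident it is a duplicate: merge into one place
KILL_THRESHOLD = 0.5   # likely a duplicate, but not sure: drop the candidate

def resolve_duplicate(similarity):
    """Decide what to do with a candidate pair of places from two sources."""
    if similarity >= MERGE_THRESHOLD:
        return "merge"
    if similarity >= KILL_THRESHOLD:
        return "kill"  # rather miss a place than list it twice
    return "keep"      # probably two distinct places: keep both
```

The asymmetry between the two thresholds is the whole point: the middle band of uncertain pairs is sacrificed to keep precision high.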

For the Wikipedia source they are so far only parsing the English version. There are plans to support other languages as well - in particular for parsing, to increase data quality, as for some places geo coordinates may be available in the German article but not in the English one.

Though it did not go into too many technical details, the talk gave some nice insights into the strengths and weaknesses of different company sizes and mindsets when it comes to innovation as well as stabilisation. Certainly a startup to watch - and I am glad to hear that, though incorporated in the US, most of its developers actually live in Berlin now.

The second talk was given by Max Jakob from Neofonie GmbH (working on the EU-funded research project Dicode). He gave an overview of their pipeline for named entity extraction and disambiguation, based on a language model extracted from the raw German Wikipedia dump. Using Pig, they brought the runtime of the pipeline down from about a week to 1.5 hours without much development overhead: Quite some logic could be re-used from the open source project pignlproc, initiated by Olivier Grisel. This project already features a Wikipedia loader, a UDF for extracting information from Wikipedia documents and additional scripts for training and building corpora.

Based on that they defined the maximum likelihood (ML) probability of a surface form being a named entity. The script itself is not very magical: The whole process can be expressed in a few steps involving grouping and counting tuples. The effect in terms of runtime vs. development time, however, is impressive.
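The counting step boils down to something like the following toy sketch - the real pipeline does this at scale as a Pig script over the full Wikipedia dump; the corpus layout and numbers here are invented for illustration:

```python
from collections import Counter

# Each occurrence is (surface_form, was_link): was_link is True when the
# surface form appeared as the anchor text of a Wikipedia link.
occurrences = [
    ("Berlin", True), ("Berlin", True), ("Berlin", False),
    ("house", False), ("house", False),
]

# Group by surface form and count - once over link anchors, once over all.
linked = Counter(s for s, was_link in occurrences if was_link)
total = Counter(s for s, _ in occurrences)

# Maximum likelihood estimate of P(named entity | surface form).
p_entity = {s: linked[s] / total[s] for s in total}
```

Expressed in Pig, each of these steps maps to a GROUP BY followed by a COUNT, which is why so little custom code was needed.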

Check out their DICODE github project for further details on the code itself.

After the meetup, about 20 attendees followed David to a bar nearby. It is always great to get a chance to talk to the speakers after the event, exchange experiences with others and learn more about what people are actually working on with Hadoop.

Slides of all talks are going to be posted soon; videos go online as soon as they are post-processed, so stay tuned for further information.

Looking forward to seeing you again at the next meetup. If you could not make it this time, there is a very easy way to make sure that does not happen again: The first speaker to submit a talk proposal to David sets the date and time of the meetup (taking into account any constraints with venue and video taping, of course).