Don't dream it, be it

2013-12-24 12:07
After two years in a row of receiving 120 submissions for Berlin Buzzwords from the usual crowd - young, white, male - this year we decided we needed to work towards increasing diversity. One piece of the puzzle was to get in touch with several local Berlin "tech for non-tech people" groups. In a content-exchange kind of setting I was asked to do an interview as some kind of role model.

In addition to a serious lack of time back then, I felt the typical way these interviews go would do no good: even people already in IT tend to shy away when they learn that I co-founded Apache Mahout, am a member of the Apache Software Foundation, co-founded Berlin Buzzwords (after running quite a few successful meetups around related topics in Berlin), and am married to a Linux kernel developer - unless the person I'm talking to happens to be into OSS development themselves and thus knows that, despite quite some work, all of this also means having lots of fun.

However, the invitation did get me thinking about what kind of advice I would share with the next generation of hackers. Over time I realised that what was most helpful for me doesn't only apply to those who want to become successful in IT. At first sight it sounds like extremely easy advice to follow:

Once upon a time, after coming back from the kindergarten provided by my mom's employer, I spent part of an afternoon in front of a computer in an office close to hers. The game was trivial: direct a little snake through a maze, collect items, avoid biting yourself or the walls.

During my years in primary school I got to play with the computer of one of my relatives. Ever since beating their high score I wanted a computer of my own. When I finally got my first computer I used to play lots of games together with a good friend of mine - until the game supply for my Amiga 500 dried up. Back then I made a decision: to work towards coding my own games. I have followed this tiny little dream ever since - by now for almost twenty years.

Even though I got all the support I could wish for from parents, teachers and university professors, seen from the outside it may not always have looked easy as pie: more often than not it meant being different. Instead of being part of the "I don't know what I want to do after school" crowd it meant being part of that small group of people who know what they are working for. Instead of being part of the large "I hate technology and I'm utterly bad at math" crowd it meant being part of that tiny group of people who love math and have fun dealing with any new technology.

Instead of being at one of those great parties for New Year's Eve it meant filling in the details of a project proposal a few hours before midnight. Instead of being home at 6 p.m. it meant going to meetups more often than not. Instead of being home during weekends it meant flying to California for a conference on my private budget. Despite getting 2.5 days a week from March to June from my employer to work on Berlin Buzzwords for the first two years, and having lots of help and knowledge over at newthinking, who did the heavy lifting of taking on the financial risk, managing registrations, booking the venue and handling speaker travel support, it still meant lots of additional mornings, evenings and weekends spent on making the event fly - and an inbox that never went silent, neither at noon nor at midnight. Hint: any mail you send to info@berlinbuzzwords doesn't just end up in some anonymous mailing list - every organiser, including Simon Willnauer, Daniela Bentrup and myself, will receive your mail and make sure it gets dealt with.

There are a couple of reasons I kept doing this kind of stuff. But I guess the most influential reason is simply that it also is a whole lot of fun for me.

There were several fellow students at school who didn't have the courage to follow their dreams from the very start: the girl who was teased into studying biology by her parents - only years later did she find the courage to go for professional gardening. The guy who didn't know exactly what to do and followed many of his friends into mechanical engineering - months later he pivoted towards social sciences and politics, and today he is working on a PhD thesis on the economics of German hospitals. The girl who successfully passed her math degree but really also wanted to follow her musical passion - in the end she went on to study music as well and became a math and music teacher.

It takes determination and courage to follow your dreams - especially if that means following a different path than what your parents had on their mind for you or following a path that doesn't quite fit into the cliché of what society has in mind for you*. However in my statistically absolutely non-significant, completely biased and personal opinion it's worth every effort.

* Sorry - for as long as it is considered special for a girl to love repairing cars and for a boy to love working in child day care, I don't believe that society doesn't influence career decisions.

On geeks growing up

2013-12-12 05:49
I'm a regular visitor of the Chemnitzer Linuxtage in March - at first going to talks and learning about lots of interesting stuff I didn't know about, like aspect-oriented programming, strace, Squeak, or which open source licenses are best for which strategies. Lately I have been there mostly to help out at the FSFE booth.

For context: the conference itself is hosted by the technical university in Chemnitz, it takes place on a weekend, and they charge the tiny amount of 5 Euros for admission. In turn visitors get two full days of mostly well prepared, diverse talks and workshops. Speakers and exhibitors get access to the backstage catering area, including free food and drinks all day and an after-show dinner on Saturday evening. In general the organisation is highly professional - the WiFi just works, there are no super-long queues for meals (which attendees can purchase during the breaks), and the equipment in the rooms usually just works.

One thing I always found special about the Linuxtage in Chemnitz is how family friendly they are: standing at the FSFE booth, I've seen it happen more often than not that parents who are not into IT at all take their young kids who are "into computers" to the event. However, quite a few geeks also tend to bring their offspring: it all started with a toy corner years ago. By now the offer has been extended to a separate quiet room stuffed with lots of toys, visited not only by parents and kids but also by clowns and magicians engaged for entertainment.

Since then, other conferences seem to have followed the example:

FrOSCon doesn't only offer a nursing room and play area - there's a bouncy castle in the backyard for smaller children. For little hackers there is a special track stuffed with coding topics suitable for children - often even taught by youngsters themselves.

EuRuCamp went a step further: not only do they sell children's tickets that are a lot cheaper than those offered for adults - for the very young ones the ticket includes babysitting, organised in collaboration with a local Berlin babysitting service.

I hadn't been there for a while, but last time I visited, Chaos Communication Congress and Camp also drew several small hackers - in general there were tinkering workshops well suited for slightly older little people.

Even FOSDEM, which to my knowledge doesn't yet offer any special tracks or separate rooms for the smaller ones, was still able to draw a few families - most likely due in part to the "we are one big family" nature of the event (despite attendee numbers as high as 5k each year).

At least for Berlin this trend seems to have been acknowledged - as a tech conference you can get MaKey MaKey packages for free from a local IT foundation.

On a more personal note: In contrast to all of the above the conference I'm involved in personally - Berlin Buzzwords - is pretty much business driven and profit oriented. However for good reason it has the reputation of still being very community oriented. For several editions I have tried finding ways to turn the event into something that is slightly more family friendly:

  • There once was an offer to bring your non-tech spouse or relatives, with us organising a city tour for them. This was initially trialled with speakers - there was some response, but overall too few people made use of the offer to run it again.
  • There usually were play areas featuring foosball tables, table tennis and the like - but those mostly catered to the geeks themselves.
  • We ran at least one blog post asking people in need of child care to get in touch with us - though there is the occasional request on Twitter, nothing substantial came out of these initiatives.
  • I asked parents who I knew were visiting the conference themselves what would make them bring their children - the ones I asked mostly came back with a need for child care for very little people or a conference date during school holidays to bring older kids.
This year the approach we are trying is slightly different: we again host the event in the Kulturbrauerei - a venue that is very well suited to experimenting with different formats: several rooms from large to small, a nice backyard, a cinema and a few shops, well located in Prenzlauer Berg, which itself is known for being almost too family friendly. We got in touch with the organisers of EuRuCamp to learn how they got babysitting services sponsored - Dajana, thanks a ton for your input. In addition we put the invitation to bring kids and the babysitting offer up online where every attendee will inevitably see it: there is a special ticket for kids (with limited availability though, as this is the first trial run) that includes catering and day care, which you can book.

In addition there's also a catering-only ticket that is way cheaper than the full conference access pass - so in case the conference pass is too expensive for you to pay privately, but you'd still like to be at the event during your lunch break or in the evening, this is the ideal option for you.

I have to admit I'm highly curious how this will play out. For me Berlin Buzzwords always was a great excuse to hand to friends in order to get them to visit the city at the best time of the year. As a result it meant that I could go to the conference by bicycle and have everyone else I would love to meet in town. It would be great if these two changes enable more people to be with us. It would be even better if these two changes did actually support the community flavour that I have been told Buzzwords has. Looking forward to seeing you in June!

Hello elasticsearch

2013-12-02 20:17
First of all a disclaimer: I had a little bit of time left during the last few weeks. As a result my blog migrated from dynamic WordPress content to statically hosted pages. If anything looks odd, in case you find any encoding issues, or if you miss specific functionality - please do let me know. I'll switch from this beta URL back to the old sub-domain in a week or so unless there are major complaints.

Today was my first day in a new office. Some of you may have heard it already: As of today I'm working for Elasticsearch. Apparently the majority of devs here are using Apple hardware so with a little help from Random Tutor and my husband I got my new machine equipped to boot Linux in parallel yesterday.

As a result what was left for today was reading lots of documentation, getting accustomed to the internal tools, attending my first daily standup, forking the repository, starting to take a closer look at the source code, issue tracker and design docs. Now looking forward to both - the elasticsearch training in December and meeting other elasticsearch people over at FOSDEM next year: Find me at their booth there. Thanks for the warm welcome everyone!

Building online communities - from the 0MQ trenches

2013-11-13 21:38
Over the past couple of years I saw several talks on how open source communities are organised at FOSDEM, on how to license open source software strategically at Chemnitzer Linuxtage, and on how to nurture open source communities at Berlin Buzzwords. During the past year or so this led me to read quite a few articles and books on the art of building online communities. It all started with the now famous video on poisonous people, a talk given by Brian Fitzpatrick and Ben Collins-Sussman. From there I went on to read their book "Team Geek" - a great read not only if you are working in the open source space but also if you have to deal with tech geeks and managers on a daily basis as part of your job.

I continued the journey by reading "Producing Open Source Software" - a book commonly recommended to those trying to understand how to run open source projects. Even though I started Apache Mahout back in 2008, first got in touch with the Nutch/Lucene community in 2004, and wrote my first mails to realtime Linux mailing lists asking for help with some university assignment as far back as, I guess, 2001, the book still contained many ideas that were new and valuable to me. Most importantly, it presented the important aspects of running an open source project in a concise, nicely structured format.

After going to a talk on engineering a collaborative culture in the midst of flame wars (including a side note on how to even turn trolls into valuable community members that substantially help newcomers), given by Kristian Koehntopp at FrOSCon earlier this year, I started reading a book he recommended: "Building Successful Online Communities", published by MIT Press.

Many of these texts come from people who have an Apache background one way or another - or are of a more general nature. Yesterday I was happy to take the ZeroMQ guide (also available on dead trees, and as a GitHub project you can contribute to) that Pieter Hintjens had kindly given to my husband during FOSDEM earlier this year, and find a whole chapter on how he manages ZeroMQ.

The text is unique in that iMatix got into a very influential position in the project very early on. However, based on decades of open source experience, Pieter managed to avoid many of the mistakes beginners make from the very outset. Having built several online communities before (ranging from open source projects to the NGO FFII), he deliberately designed the ZeroMQ development process in a way that would encourage a healthy community.

There are several essential aspects that I find interesting:

The ZeroMQ development model is explicitly codified - they call this C4: after the painful experience of discussing seemingly obvious but unspoken rules, the development team came up with a protocol for developing ZeroMQ - the protocol definition itself formulated along the rules by which IETF RFCs are written. Many rules at Apache are not written down - especially when explaining how the Apache Way works to new projects in the Incubator this becomes obvious again and again. Granted, apart from a handful of core values, Apache projects are essentially free to define their own way of working. However, even within one project your mileage may vary depending on who you ask how things are done. This makes it hard for newcomers to understand what's going on - but it can also become an issue when problems arise.

A concept that I find interesting about the way ZeroMQ works is the separation between maintainers and contributors: maintainers are people who pull code into mainline - contributors are those doing the actual coding. Essentially this means that getting a patch in requires at least two people to look at it. This isn't too different from a review-then-commit policy - just enforced and written down as good practice. It helps avoid the panic errors of people committing code in a hurry. But it also makes sure that those writing code actually get the positive feedback they deserve - which in turn might help avoid fast contributor burnout.

This kind of split in roles also makes sure that there are no people with special privileges - just because someone has commit access to the main repository doesn't mean they can take any shortcuts process-wise: they still have to come up with a decent problem description, file a ticket, create a patch, submit the patch through a pull request and have it reviewed like anyone else. I found it interesting that even though ZeroMQ is backed and was initiated by iMatix, Pieter considers it very important to keep a balance of power and to delegate both coding and design decisions to non-iMatix contributors.

With iMatix being a small company, the stance on making ZeroMQ an LGPL-licensed project is a very deliberate decision. It's the only way to ensure that downstream users cannot just take the project, make modifications, and re-package and ship it to users without the accompanying source code under the same license. In turn this makes it much more likely that even capable users contribute upstream. Of course, taking the idea itself and turning it into some proprietary project would still be very possible. However, the one thing that sets ZeroMQ apart from other efforts is not the source code or the architecture alone - it's the way the community works and blossoms.

One place where this choice of license is particularly handy is the deliberate decision not to go through any copyright assignment process. Instead each patch gets licensed to the project under the regular LGPL terms. This means that even should iMatix one day be sold or change their minds, re-licensing the whole project would be utterly hard. The impact on the community is clear: it makes sure that contributors' patches remain their own - including all the merit and praise that comes with them. This approach prevents easy re-licensing and encourages a sense of shared ownership. Essentially this model of copyright handling is not unlike the way the Linux kernel works.

The last point that I found important is the way the project itself is structured: instead of having everyone work on one single project, ZeroMQ makes it easy to write extensions to the core library. There is a whole guide on how to write language bindings. Those writing these bindings aren't regulated at all - the bindings are hosted in their own repositories with their own governance if they want - in the end it's up to the user to decide which ones are good and which ones will never become popular. In turn this has led to many people contributing indirectly to the value of ZeroMQ in significant ways. This is not unlike other projects: Apache HTTPd provides APIs to write modules against. Elasticsearch provides a clean REST API that encourages people working in other languages to develop plugins that translate the REST API into whatever their preferred language is. Open/LibreOffice deliberately encourages writing extensions and plugins - even providing hosting facilities where users can search for and download extensions from third parties.

I leave it as an exercise to the reader to check out the whole book. Even within the community chapter there are several other interesting concepts: the experience ZeroMQ gained from actively encouraging even developers with commit access to the main repository to work with forks instead of feature branches for experimental development, the trouble they went through after making backwards-incompatible changes to user-facing APIs way too often, and the exact definition of the C4 development process.

Overall a really interesting perspective on open source development from the trenches, with lots of experience to back the advice given. If you are interested in learning more about how open source projects work, do read it - and if you are using any open source you definitely should be interested, otherwise you are betting part of your business on something you do not understand, which generally isn't the best of ideas.

Wonder if you should switch from your RDBMS to Apache Hadoop: Don't!

2013-08-26 17:10
Last weekend I spent a lot of fun time at FrOSCon* in Sankt Augustin - it's always great to catch up with friends in the open source space. As always there were quite a few talks on NoSQL and Hadoop, but also really solid advice on tuning your system for stuff like MySQL (including a side note on PostgreSQL and Oracle) from Kristian Köhntopp. Following some of the discussions in the audience before and after the talk, I could not help but shake my head at some of the advice given about HDFS and friends.

This is to give a really short rule of thumb on which project to use for which occasion. Maybe it helps clear up some false assumptions. Note: all of the below are most likely gross oversimplifications. Don't use them as hard-and-fast advice but as a first step towards finding more information with your preferred search engine.

Use Case 1 - relational data

  • I have usual relational data: use a relational database - think MySQL and friends.
  • I have relational data but my database doesn't perform: tune your system, go back to step 1.
  • I have relational data but way more reads than one machine can accommodate: turn on master-slave replication and configure enough slaves to accommodate your traffic.
  • I have relational data, but way too much data for a single machine: start sharding your database.
  • I have a lot of reads (or writes) and too much data for a single machine: if the sharding+replication pain gets unbearable but you still need strong consistency guarantees, start playing with HBase. You might lose the option of SQL but win the ability to scale beyond traditional solutions. Hint: unless your online product is hugely successful, switching to HBase usually means you've missed some tuning option.
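Sharding in its simplest form means deterministically routing each row to one of several database servers based on its key. A rough sketch of just the routing logic (the host names are made up for illustration; real setups typically use a proper driver and often consistent hashing):

```python
import hashlib

# Hypothetical set of database servers the data is split across.
SHARDS = ["db-shard-0.example.com", "db-shard-1.example.com", "db-shard-2.example.com"]

def shard_for(user_id: str) -> str:
    """Pick a shard deterministically from the sharding key."""
    # Use a stable hash - the builtin hash() is randomised per process.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always lands on the same shard.
assert shard_for("alice") == shard_for("alice")
```

This also illustrates the pain hinted at above: since the mapping depends on the number of shards, adding a shard later changes where most existing keys live and forces data to be moved around.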
Use Case 2 - crawling

  • I want to store and process a crawl of the internet: store it as flat files; if you like, encode metadata together with the data in Protocol Buffers, Thrift or Avro.
  • My internet crawl no longer fits on a single disk: put multiple disks in your machine, RAID them if you like.
  • Processing my crawl takes too long: optimise your pipeline. Make sure you utilise all processors in your machine.
  • Processing the crawl still takes too long: if your data doesn't fit on a single machine and takes way too long to process, but there is no bigger machine that you can reasonably pay for, you are probably willing to take some pain. Get yourself more than one machine, hook them together, install Hadoop and use plain MapReduce, Pig, Hive or Cascading to process the data. Distribution-wise Apache, Cloudera, MapR and Hortonworks are all good choices.
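Before reaching for a cluster, the "utilise all processors" step can already buy a lot. A minimal sketch using Python's standard multiprocessing module (the word-count step is just a hypothetical stand-in for whatever your pipeline does per document):

```python
from multiprocessing import Pool

def process_document(doc: str) -> int:
    # Stand-in for real per-document work, e.g. parsing and extracting links.
    return len(doc.split())

if __name__ == "__main__":
    crawl = ["the quick brown fox", "jumps over", "the lazy dog"] * 1000
    with Pool() as pool:  # defaults to one worker per CPU core
        word_counts = pool.map(process_document, crawl)
    print(sum(word_counts))  # prints 9000 for this sample
```

Only when this saturates all cores and is still too slow does the "get more than one machine" step start to make sense.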
Use Case 3 - BI

  • I have structured data and want my business analysts to find great new insights: use the data warehouse your analysts are most familiar with.
  • I want to draw conclusions from one year's worth of traffic on a busy web site (hint: the resulting log files no longer fit on the hard disk of my biggest machine): stream your logs into HDFS. From there it depends: if your developers want to get their hands dirty, Cascading and dependent packages might be a decent idea. There are plenty of UDFs in Pig that will help you as well. If the work is to be done by data analysts who only speak SQL, use Hive.
  • I want to correlate user transactions with social media activity around my brand: see above.

A really short three bullet point summary

  • Use HBase to scale the backend your end users interact with. If you want to trade strong consistency for being able to span multiple datacenters on multiple continents take a look at Cassandra.
  • Use plain HDFS with Hive/Pig/Cascading for batch analysis. This could be business intelligence queries against user transactions, log file analysis for statistics, data extraction steps for internet crawls, social media data or other sensor data.
  • Use Drill or Impala for low latency business intelligence queries.

Good advice from ApacheConEU 2008/9

Back at one of the first ApacheCons I ever attended there was an Apache Hadoop BoF. One of the attendees asked for good reasons to switch from his currently working infrastructure to Hadoop. In my opinion the advice he got from Christophe Bisciglia is still valid today. Paraphrased:
For as long as you wonder why you should be switching to Hadoop, don't.

A parting note: I've left CouchDB, MongoDB, JackRabbit and friends out of the equation. The reason for this is my own lack of first-hand experience with those projects. Maybe someone else can add to the list here.

* A note to the organisers: Thilo and myself married last year in September. So seeing "Fromm" in a speaker's surname doesn't automatically mean that the speaker's hotel room should be booked under the name "Thilo Fromm" - the speaker on your list could just as well be called "Isabel Drost-Fromm". It was hilarious to have the speaker reimbursement package signed by my husband, even though this year I was the one giving a talk at your conference ;)

JAX: Project Nashorn

2013-05-24 20:41
The last talk I went to was on project Nashorn - demonstrating the capability to run dynamic languages on the JVM by writing, as a proof of concept, a fully ECMA-compliant JavaScript implementation that still performs better than Mozilla's project Rhino.

It was nice to see Lisp, created in 1962, referenced as being the first
language that featured a JIT compiler as well as garbage collection. It was
also good to see Smalltalk referenced as pioneering class libraries, visual GUI
driven IDEs and bytecode.

As such Java essentially stands on the shoulders of giants. Now dynamic
language writers can themselves use the JVM to boost their productivity by
profiting from the VM's memory management, JIT optimisations, native threading.
The result could be a smaller code base and more time to concentrate on
interesting language features (of course another result would be that the JVM
becomes interesting not only for Java developers but also to people who want to
use dynamic languages instead).

The invokedynamic instruction as well as the Da Vinci Machine project are both interesting areas to follow for people interested in running dynamic languages on the JVM.

JAX: Tales from production

2013-05-23 20:38
In a second presentation Peter Roßbach, together with Andreas Schmidt, provided some more detail on what the topic of logging entails in real-world projects. Development messages turn into valuable information needed to uncover issues and downtime of systems, do capacity planning, measure the effect of software changes, and analyse resource usage under real-world load. In addition to these technical use cases there is a need to provide business metrics.

When dealing with multiple systems you have to correlate values across machines and systems and provide meaningful visualisations to draw the correct conclusions.

When thinking about your log architecture you might want to consider storing not only log messages. Facts like release numbers should also be tracked somewhere - ready to be joined in when needed to correlate behaviour with a release version. To do that, also track events like rolling out a release to production; launching in a new market or switching traffic to a new system could be other such events. Introduce not only pure log messages but also aggregated metrics and counters. All of these pieces should be stored and tracked automatically to free operations up for more important work.

Have you ever thought about documenting not only your software, its interfaces and input/output formats? What about documenting the logged information as well? What about the fields contained in each log message - are they documented, or do people have to infer their meaning from the content? What about valid ranges for values - are they noted down somewhere? Did you record whether a specific field can only contain integers or whether some day it could also contain letters? What about the number format - is it decimal or hexadecimal?
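One lightweight way to answer those questions is to keep a machine-checkable description of each log event's fields next to the code that emits them. A sketch assuming JSON-like events (the field names and ranges are invented for illustration):

```python
# Hypothetical schema for one log event type: documents each field's type
# and valid value range instead of leaving readers to guess.
SCHEMA = {
    "status":      {"type": int, "valid": range(100, 600)},   # HTTP status code
    "duration_ms": {"type": int, "valid": range(0, 60_000)},  # request duration
    "release":     {"type": str, "valid": None},              # e.g. "2013.11.2"
}

def validate(event: dict) -> list:
    """Return a list of human-readable problems; an empty list means the event conforms."""
    problems = []
    for field, spec in SCHEMA.items():
        if field not in event:
            problems.append("missing field: %s" % field)
        elif not isinstance(event[field], spec["type"]):
            problems.append("wrong type for %s" % field)
        elif spec["valid"] is not None and event[field] not in spec["valid"]:
            problems.append("value out of range for %s" % field)
    return problems

assert validate({"status": 200, "duration_ms": 17, "release": "2013.11.2"}) == []
assert validate({"status": "200", "duration_ms": 17, "release": "x"}) == ["wrong type for status"]
```

The schema doubles as documentation for humans and as a check that can run in tests or on a sample of production events.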

For a nice example of architecture documentation, check out "Winning the metrics battle" on the BBC dev blog.

There's an abundance of tools out there to help you with all sorts of logging
related topics:

  • For visualisation and transport: Datadog, kibana, logstash, statsd,
    graphite, syslog-ng

  • For providing the values: JMX, metrics, Jolokia

  • For collection: collectd, statsd, graphite, newrelic, datadog

  • For storage: typical RRD tools including RRD4j, MongoDB, OpenTSDB based
    on HBase, Hadoop

  • For charting: Munin, Cacti, Nagios, Graphite, Ganglia, New Relic, Datadog

  • For Profiling: Dynatrace, New Relic, Boundary

  • For events: Zabbix, Icinga, OMD, OpenNMS, HypericHQ, Nagios, JBoss RHQ

  • For logging: splunk, Graylog2, Kibana, logstash

Make sure to provide metrics consistently and to be able to add them with minimal effort. Self-adaptation and automation are useful for this. Make sure developers, operations and product owners are able to use the same system so there is no information gap on either side. Your logging pipeline should be tailored to provide easy and fast feedback on the implementation and features of the system.
To reach a decent level of automation a set of tools is needed for:

  • Configuration management (where to store passwords, URLs or IPs, log levels etc.). Typical names here include ZooKeeper, but also CFEngine, Puppet and Chef.

  • Deployment management. Typical names here are UC4, udeploy, glu, etsy

  • Server orchestration (e.g. what is started when during boot). Typical
    names include UC4, Nolio, Marionette Collective, rundeck.

  • Automated provisioning (think ``how long does it take from server failure
    to bringing that service back up online?''). Typical names include kickstart,
    vagrant, or typical cloud environments.

  • Test-driven / behaviour-driven environments (think about adjusting not only your application but also firewall configurations). Typical tools that come to mind here include Serverspec, rspec, cucumber, c-puppet, chef.

  • When it comes to defining the points of communication for the whole
    pipeline there is no tool you can use that is better than traditional pen and
    paper, socially getting both development and operations into one room.

The tooling to support this process ranges from simple self-written bash scripts in the startup model, to frameworks that support the flow partially, up to process-based suites. No matter which path you choose, the goal should always be to end up with a well documented, reproducible step into production. When introducing such systems, problems in your organisation may become apparent. Sometimes it helps to just create facts: it's easier to ask for forgiveness than permission.

JAX: Logging best practices

2013-05-22 20:37
The ideal outcome of Peter Roßbach's talk on logging best practices was to have attendees leave the room thinking ``we know all this already and are applying it successfully'' - most likely though, the majority left thinking about how to implement even the most basic advice discussed.

From his consultancy and firefighting background he has a good overview of what logging in the average corporate environment looks like: no logging plan, no rules, dozens of logging frameworks in active use, output in many different languages, and no structured log events but a myriad of different quoting, formatting and bracketing standards instead.

So what should the ideal log line contain? First of all it should really be a single log line instead of a multi-line something that cannot be reconstructed when interleaved with other messages. The line should not only contain the name of the class that logged the information (actually that is the least important piece of information), it should contain the thread id, the server name, and a standardised, consistently formatted timestamp at a decent resolution (hint: one new timestamp per second is not helpful when facing several hundred requests per second). Make sure timing is aligned across machines if timestamps are needed for correlating logs. Ideally there should be context in the form of a request id, flow id, or session id.

When thinking about logs, do not think too much about human readability - think
more in terms of machine readability and parsability. Treat your logging system
as the database in your data center that has to deal with the most traffic. It
is what holds user interactions and system metrics that can be used as business
metrics, for debugging performance problems and for digging up functional
issues. Most likely you will want to turn free text - which provides lots of
flexibility for screwing up - into a more structured format like JSON, or even
some binary format that is storage efficient (think Protocol Buffers, Thrift,
Avro).
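As an illustration of what a structured log event might look like, here is a
minimal sketch that renders one event as a single JSON line - the field names
(ts, level, thread, requestId, msg) are purely illustrative assumptions, not
any framework's standard:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: render one log event as a single JSON line.
// Field names are illustrative, not mandated by any logging framework.
public class JsonLogLine {

    // Escape the few characters that would break a JSON string.
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    public static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("ts", Instant.now().toString());   // ISO-8601 timestamp
        event.put("level", "INFO");
        event.put("thread", Thread.currentThread().getName());
        event.put("requestId", "req-4711");          // hypothetical correlation id
        event.put("msg", "order accepted");
        System.out.println(format(event));
    }
}
```

One JSON object per line keeps events greppable and trivially parsable even
when several threads log concurrently.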

In terms of log levels, log development traces on trace, provide detailed
problem analysis information on debug, and put normal behaviour onto info. In
case of degraded functionality, log to warn; things you cannot easily recover
from go to error. When it comes to logging hierarchies - do not only think in
class hierarchies but also in terms of use cases: Just because your http
connector is used in two modules doesn't mean that there should be no way to
turn logging on for just one of the modules alone.
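A minimal sketch of that use case idea using only java.util.logging - the
logger names checkout.http and search.http are hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: name loggers by use case, not only by class, so logging for the
// shared http connector can be raised for one module without the other.
// The logger names below are made up for illustration.
public class UseCaseLogging {
    public static final Logger CHECKOUT_HTTP = Logger.getLogger("checkout.http");
    public static final Logger SEARCH_HTTP   = Logger.getLogger("search.http");

    public static void main(String[] args) {
        SEARCH_HTTP.setLevel(Level.FINE);      // debug only the search module
        CHECKOUT_HTTP.setLevel(Level.WARNING); // keep checkout quiet
        System.out.println(CHECKOUT_HTTP.isLoggable(Level.FINE)); // false
        System.out.println(SEARCH_HTTP.isLoggable(Level.FINE));   // true
    }
}
```

The same http connector class can then look its logger up by the module it is
serving instead of by its own class name.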

When designing your logging make sure to talk to all stakeholders to get clear
requirements. Make sure you can find out how the system is being used in the
wild and that you are able to quantify the number of exceptions, the max, min
and average duration of a request, and similar metrics.

Tools you could look at for help include but are not limited to Splunk, JMX,
JConsole, syslog, Logstash, statsd, and Redis for log collection and queuing.

As a parting exercise: Look at all of your own logfiles and count the different
formats used for storing time.

JAX: Java performance myths

2013-05-22 20:37
This was one of Arno Haase's famous talks on Java performance myths. His main
point - supported with dozens of illustrative examples - was for software
developers to stop trusting the word of mouth, cargo cult like myths that are
abundant among engineers. The goal should be to write readable code above all:
for one, the Java compiler and JIT are great at optimising; in addition, many
of the myths spread in the Java community that are claimed to lead to better
performance are simply not true.

It was interesting to learn how many different aspects of both software and
hardware contribute to code performance. Micro benchmarks are considered
dangerous for a reason - creating a well controlled environment that matches
what the code will encounter in production is hard, as results are influenced
by things like just in time compilation, CPU throttling, etc.
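To illustrate two of those pitfalls, here is a sketch of a hand-rolled micro
benchmark - without the warm-up loop you would mostly time interpreted code,
and without consuming the result the JIT might eliminate the workload entirely
(for real measurements a harness like JMH is the safer choice; no timing
numbers are claimed here since they are machine dependent):

```java
// Sketch of a naive micro benchmark and its classic pitfalls: no warm-up
// means timing the interpreter rather than JIT-compiled code, and an unused
// result invites the JIT to remove the loop as dead code.
public class NaiveBenchmark {

    public static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up: give the JIT a chance to compile workload() first.
        for (int i = 0; i < 10_000; i++) sink += workload(1_000);

        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) sink += workload(1_000);
        long nanos = System.nanoTime() - start;

        // Print the sink so the computation cannot be proven dead and removed.
        System.out.println("elapsed ns: " + nanos + " (sink " + sink + ")");
    }
}
```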

Some myths that Arno proved wrong include final making code faster (in the case
of method parameters it doesn't make a difference - the bytecode is identical
with and without) and inheritance always being expensive (even with an abstract
class between the interface and the implementation, Java 6 and 7 can still
inline the method in question). Another one concerned often wrongly scoped Java
vs. C comparisons. One more myth revolved around the creation of temporary
objects - since Java 6 and 7, in simple cases even these can be optimised away.

When it comes to (un-)boxing and reflection there is a performance penalty. For
the latter it lies mostly in the method lookup, not so much in calling the
method. What we are talking about, however, are penalties in the range of about
1000 compute cycles - still dwarfed by any remote call. Reflection on fields is
even cheaper.
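A small sketch of what that asymmetry suggests in practice: pay the lookup cost
once and cache the Method object - the class and method names here are made up
for illustration:

```java
import java.lang.reflect.Method;

// Sketch: most of reflection's cost is in the Method lookup, not the call -
// so look the Method up once and cache it for reuse.
public class ReflectionCost {

    public static int square(int x) { return x * x; }

    // Lookup cost is paid once, at class initialisation time.
    private static final Method SQUARE;
    static {
        try {
            SQUARE = ReflectionCost.class.getMethod("square", int.class);
        } catch (NoSuchMethodException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static int squareViaReflection(int x) {
        try {
            return (Integer) SQUARE.invoke(null, x); // comparatively cheap
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(square(7) + " == " + squareViaReflection(7));
    }
}
```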

One of the more wide spread myths revolved around string concatenation being
expensive - doing a ``A'' + ``B'' in code will be turned into ``AB'' in
bytecode. Even doing the same with a variable will be turned into the use of
StringBuilder ever since -XX:+OptimizeStringConcat was turned on by default.
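A minimal sketch of both cases - the compile time constant fold, and the
StringBuilder shape that concatenation with a variable roughly compiles down
to:

```java
// Sketch: constant folding and StringBuilder-based concatenation.
public class ConcatDemo {

    // "A" + "B" is folded to the constant "AB" at compile time.
    public static final String CONSTANT = "A" + "B";

    // Concatenation with a variable compiles to StringBuilder calls,
    // roughly equivalent to the explicit form below.
    public static String withVariable(String b) {
        return "A" + b;
    }

    public static String explicitBuilder(String b) {
        return new StringBuilder().append("A").append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(CONSTANT);                                     // AB
        System.out.println(withVariable("B").equals(explicitBuilder("B"))); // true
    }
}
```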

The main message here is to stop trusting your intuition when reasoning about a
system's performance and performance bottlenecks. Instead, go and measure what
is really going on. The above are simple examples where your average Java
intuition goes wrong. Make sure to stay on top of what the JVM turns your code
into and how that is then executed on the hardware you have rolled out if you
really want to get the last bit of speed out of your system.

JAX: Does parallel equal performant?

2013-05-21 20:34
In general there is a tendency to equate parallel implementations with
performant implementations. Except in the really naive case there is always
going to be some overhead due to scheduling work, managing memory sharing and
network communication. Essentially that knowledge is reflected in Amdahl's law
(the amount of serial work limits the benefit from running parts of your
implementation in parallel, http://en.wikipedia.org/wiki/Amdahl's_law) and
Little's law (http://en.wikipedia.org/wiki/Little's_law) in case of queuing
systems.
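Amdahl's law is easy to put into code; a tiny sketch computing the theoretical
speedup for a parallelisable fraction p and n workers:

```java
// Sketch: Amdahl's law. With parallelisable fraction p and n workers the
// best possible speedup is 1 / ((1 - p) + p / n) - the serial fraction
// (1 - p) caps the benefit no matter how many workers are added.
public class Amdahl {

    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 1000 workers, a 5% serial part keeps speedup below 20x.
        System.out.println(speedup(0.95, 1000));
    }
}
```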

When looking at current Java optimisations there is quite a bit going on to
support better parallelisation: Work is being done on improving lock contention
situations, the GC adaptive sizing policy has been improved to a usable state,
and there is added support for parallel arrays and the Spliterator interface
that comes with lambdas.

When it comes to better locking optimisations, what is most notable is work
towards coarsening locks at compile and JIT time (essentially moving locks from
the inside of a loop to the outside); eliminating locks if objects are being
used in a local, non-threaded context anyway; and support for biased locking
(that is, enforcing locks only when a second thread tries to access an object).
All three taken together can lead to performance improvements that let
StringBuffer exhibit almost the same performance as StringBuilder in a single
threaded context.
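A sketch of that single threaded case: both classes produce identical results,
and since the StringBuffer's locks are never contended, biased locking and lock
elision can narrow the gap (actual timings depend on your JVM, so none are
claimed here):

```java
// Sketch: StringBuffer synchronises every call, StringBuilder does not.
// In a single threaded run the StringBuffer locks are never contended, so
// biased locking and lock elision can bring the two close in performance.
public class BufferVsBuilder {

    public static String withBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; i++) sb.append(i);
        return sb.toString();
    }

    public static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(i);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBuffer(5).equals(withBuilder(5))); // true
    }
}
```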

For pieces of code that suffer from false sharing (two variables used
independently in separate threads that end up in the same CPU cache line and as
a result are both flushed on update) there is a new annotation: Adding
"@Contended" tells the JVM where to add cache line padding (or re-arrange
fields entirely) to keep that false sharing from happening. One other way to
avoid false sharing is to aim for class cohesion - coherent classes whose
methods and variables are closely related tend to suffer less from false
sharing. If you would like to inspect the resulting layout, use the
"-XX:+PrintFieldLayout" option.
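Before @Contended, the same effect was commonly hand-rolled with padding
fields; a sketch of that technique - field layout is ultimately up to the JVM,
so this is an illustration rather than a guarantee:

```java
// Sketch: manual cache line padding, the hand-rolled precursor to @Contended.
// The padding longs aim to push each hot counter onto its own cache line so
// writer threads do not invalidate each other's line on every update.
public class PaddedCounters {

    public static class Padded {
        public volatile long value;
        // 7 * 8 bytes of padding towards filling a 64 byte cache line.
        public long p1, p2, p3, p4, p5, p6, p7;
    }

    public final Padded hits = new Padded();
    public final Padded misses = new Padded();

    public static void main(String[] args) {
        PaddedCounters c = new PaddedCounters();
        c.hits.value++;       // imagine thread A bumping this counter
        c.misses.value += 2;  // while thread B bumps this one
        System.out.println(c.hits.value + " " + c.misses.value); // 1 2
    }
}
```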

Java 8 will bring a few more notable improvements, including changes to the
adaptive sizing GC policy, the introduction of parallel array operations that
allow for parallel execution over array entries, changes to the concurrency
libraries, and internalised iterators.