Apache Mahout Podcast

2010-12-13 21:21
During Apache Con ATL Michael Coté interviewed Grant Ingersoll on Apache Mahout. The interview is available online as a podcast. It covers the goals and current use cases of the project and goes into some detail on the reasons for starting it in the first place. If you are wondering what Mahout is all about, what you can do with it and which direction development is heading, the interview is a great way to find out more.

Apache Con – Wrap up

2010-11-29 23:27
After one week of lots of interesting input the ASF's user conference was over. With its focus on Apache software users, quite a few talks were not so much suited for conference regulars but rather targeted at newcomers who want to know how to successfully apply the software. As a developer of Mahout with close ties to the Lucene and Hadoop communities, what is of course most interesting to me are stories of users putting the software into production.
The conference was well organised. The foundation hosts way more projects at the moment than can reasonably be covered in just one conference, so Apache Con only covers a subset of what is being developed at Apache. As a result more and more smaller community-organised events are being run by individuals as well as corporations. Still Apache Con is a nice place to get in touch with other developers – and to get a glimpse of what is going on in projects outside one's own regular community.

Teddy in Atlanta

2010-11-28 23:24
While I was happily attending Apache Con US in Atlanta/GA my teddy had a closer look at the city: He first went to Centennial Olympic Park and took a picture of the World of Coca-Cola (wondering what strange kinds of museums there are in the US).



After that he headed over to Midtown, having a quiet time in Piedmont Park, and finally had a closer look at the private houses still decorated for Halloween. Seems like it was squirrel day that day: He told me he met more than ten squirrels.



I found quite a few impressive pictures of the art museum on my camera after his trip out – as well as several images taken on the campus of Georgia Tech. It's amazing to see what facilities are available to students there – especially compared to the equipment of German universities.

Apache Con – last day

2010-11-27 23:23
Day three of Apache Con started with interesting talks on Tomcat 7, including an introduction to the new features of that release. Those include better memory leak prevention and detection capabilities – implementing them has led to the discovery of various leaks that appear under more or less weird circumstances in famous open source libraries and the JVM itself. Better management and reporting facilities are also part of the new release.

As I started the third day over at the Tomcat track, I unfortunately missed the Tika and Nutch presentations by Chris Mattmann – so I am happy that at least the slides were published online: $LINK. The development of Nutch was especially interesting for me as that was the first Apache project I got involved with back in 2004. Nutch started out as a project with the goal of providing an open source alternative for internet-scale search. Based on Lucene as its indexing core, it also provides crawling, content extraction and link analysis.

With the focus on building an internet-scale search engine, the need for a distributed processing environment quickly became apparent. Initial implementations of the Nutch distributed file system and a map reduce engine led to the creation of the Apache Hadoop project.

In recent years it was comparably quiet around Nutch. Besides Hadoop, content extraction was also factored out of the project, into Apache Tika. At the moment development is gaining momentum again. Future development is supposed to focus on building an efficient crawling engine: as storage backend the project wants to leverage Apache HBase, Tika is to be used for content extraction, and Solr as the indexing backend.

I loved the presentation by Geoffrey Young on how Ticketmaster used Solr to replace their old MySQL-based search system for better performance and more features. Indexing documents representing music CDs presents some special challenges when it comes to domain modeling: There are bands with names like “!!!”. In addition users are very likely to misspell certain artists' names. In contrast to large search providers like Google, these businesses usually have neither the human resources nor enough log data to provide comparable coverage, e.g. when implementing spell-checking. A very promising and agile approach taken instead was to parse log files for the most common failing queries and from that learn more about the features users need: There were many queries including geo information, coming from users looking for an event at one specific location. As a result geo information was added to the index, leading to happier users.
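
Mining query logs for failing searches is something you can prototype in very few lines. The sketch below is of course not Ticketmaster's actual setup – it is a minimal illustration that assumes a hypothetical log format in which each request line carries a q= parameter and a hits= count:

```java
// Minimal sketch: count queries that returned zero hits from a hypothetical
// search log where each line contains "q=<query>" and "hits=<n>" fields.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailingQueryReport {
  private static final Pattern LINE = Pattern.compile("q=([^&\\s]+).*hits=(\\d+)");

  public static void main(String[] args) throws Exception {
    Map<String, Integer> zeroHitCounts = new HashMap<String, Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = reader.readLine()) != null) {
      Matcher m = LINE.matcher(line);
      if (m.find() && Integer.parseInt(m.group(2)) == 0) {
        String query = m.group(1);
        Integer count = zeroHitCounts.get(query);
        zeroHitCounts.put(query, count == null ? 1 : count + 1);
      }
    }
    reader.close();
    // Sorting this map by count (omitted here) yields the report of the most
    // frequent failing queries – exactly the features users are missing.
    for (Map.Entry<String, Integer> e : zeroHitCounts.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}
```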

Apache Con – Mahout, commons and Lucene

2010-11-26 23:21
On the second day the track most interesting to me provided an overview of some of the Apache Commons projects. Seemingly small in scope and light-weight in implementation and dependencies, these projects provide vital features not yet well supported by the Sun JVM. There is Commons Math, featuring a fair amount of algebraic, numeric and trigonometric functions (among others), and the Commons Exec framework for executing processes external to the JVM without running into the danger of creating dead-locks or wasting resources.
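
To give an idea of the boilerplate Commons Exec takes care of, here is a minimal sketch (my own example, not from the talk) that runs an external command with a watchdog so a hanging process cannot block the caller indefinitely:

```java
// Minimal Commons Exec sketch: run an external command with a watchdog.
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.exec.ExecuteWatchdog;

public class ExternalProcess {
  public static void main(String[] args) throws Exception {
    CommandLine cmd = new CommandLine("ls");
    cmd.addArgument("-l");

    DefaultExecutor executor = new DefaultExecutor();
    // Kill the child process if it runs for longer than 60 seconds.
    executor.setWatchdog(new ExecuteWatchdog(60 * 1000));
    // Treat any exit value other than 0 as an error.
    executor.setExitValue(0);

    int exitValue = executor.execute(cmd);
    System.out.println("Process finished with exit value " + exitValue);
  }
}
```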

After that the Mahout and Lucene presentations were up. Grant gave a great overview of various use cases of machine learning in the wild, rightly claiming that anyone using the internet today makes use of some machine learning powered application every day – be it e-mail spam filtering, the Gmail priority inbox, recommended articles on news sites, recommended items to buy at shopping sites or targeted advertisements shown when browsing. The talk was concluded by a more detailed presentation of how to successfully combine the features of Mahout and Lucene/Solr to build next generation web services that integrate user feedback into their user experience.
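
As an illustration of how little code a basic recommendation feature needs on the Mahout side, here is a minimal sketch (my own example, not taken from the talk) of a user-based recommender built with Mahout's Taste API on top of a CSV file with userID,itemID,preference lines:

```java
// Minimal Mahout (Taste) sketch: a user-based recommender over a CSV file.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleRecommender {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("preferences.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Recommend three items for user 42.
    List<RecommendedItem> recommendations = recommender.recommend(42, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

In a search-driven application the recommended item IDs could then, for example, be used to boost or filter documents in Lucene/Solr queries – one way of feeding user feedback back into the user experience.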

ApacheCon - Keynotes

2010-11-25 23:20
The first keynote was given by Dana Blankenhorn – a journalist and blogger regularly publishing tech articles with a clear focus on open source projects. His talk focussed on the evolution of open source projects, with a special focus on Apache.

Coming from a research background myself, I found the keynote given by Daniel Crichton from NASA very interesting: According to the speaker, scientists are facing challenges that are all too well known to large and distributed corporations. Most areas in science are currently becoming more and more dependent on data intensive experiments. Examples include but are not limited to:

  • The field of biology, where huge numbers of experiments are needed to decipher the internal workings of proteins, or to understand the fundamental concepts underlying the data encoded in DNA.
  • In physics, hadron collider experiments generate huge amounts of data with each run. With the facilities for running such experiments being expensive to build and the amount of data generated far too large to be analysed by just one team, groups of scientists suddenly face the issue of exchanging data with remote research groups. They run into the requirement of integrating their systems with those of other groups, and all of a sudden data formats and interfaces have to somehow be standardised.
  • Running space missions used to be limited to a very small number of research institutions in a tiny number of countries. However this is about to change as more countries gain the knowledge and facilities to run space missions. Again this leads to the need to be able to collaborate towards one common goal.

Not only are the software systems used so far distinct and incompatible, even the data formats usually are. The result is scientists spending most of their time re-formatting, converting and importing datasets before being able to get any real work done. At the moment research groups are not used to working collaboratively in distributed teams. Usually experiments are run on specially crafted, one-off software that cannot easily be re-used, that does not adhere to any standards and that is being re-written over and over again by every research group. Re-using existing libraries is oftentimes a huge cultural shift, as researchers seemingly are afraid of external dependencies and of giving up control over part of their system.

One step in the right direction was taken by NASA earlier this year: They released their decision making support system OODT under a free software license (namely the Apache Software License) and put the project into incubation at Apache. The project is currently about to graduate into its own top level Apache project. This step is especially remarkable as successfully going through the Incubator also means having established a healthy community that is not only diverse but also open to accepting incoming patches and changes to the software. This means not only giving up control over your external dependencies but also having the project run in a meritocratic, community driven model. For the contributing organisation this boils down to no longer having total control over the future roadmap of the project. In return this usually leads to higher community participation and higher adoption in the wild.

Apache Con – Hadoop, HBase, Httpd

2010-11-24 23:19
The first Apache Con day featured several presentations on NoSQL databases (a track sponsored by Day Software), a Hadoop track, as well as presentations on httpd and an open source business track.

Since its inception Hadoop was always intended to be run in trusted environments, firewalled from hostile users or even attackers. As such it never really supported any security features. This is about to change with the new Hadoop release, which includes better Kerberos based security.

A long awaited feature when creating files in Hadoop was append support. Basically, up to now writing to Hadoop was a one-off job: Open a file, write your data, close it and be done. Re-opening a file and appending data was not possible. This situation is especially bad for HBase, as its design relies on being able to append data to an existing file. There have been earlier efforts to add append support to HDFS, as well as integrations of such patches by third party vendors. However only with a current Hadoop version is append officially supported by HDFS.
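
For client code the feature is fairly unspectacular to use. A minimal sketch, assuming a Hadoop release and configuration in which append support is actually enabled (the dfs.support.append flag) – the path below is just an example:

```java
// Minimal sketch of appending to an existing HDFS file, assuming a Hadoop
// release/configuration with append support enabled (dfs.support.append).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Re-open an existing file and add more data at its end.
    Path file = new Path("/logs/events.log");
    FSDataOutputStream out = fs.append(file);
    out.write("another line\n".getBytes("UTF-8"));
    out.close();
  }
}
```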

A very interesting use case of the Hadoop stack was presented by $name-here from $name. They are using a Hadoop cluster to provide a repository of code released under free software licenses. The business case is to enable corporations to check their source code against existing code and spot license infringements. This includes not only linking to free software under incompatible licenses but also developers copying pieces of free code, e.g. copying entire classes or functions that were originally available only under copyleft licenses into internal projects.

The speaker went into some detail explaining the initial problems they had run into: Obviously it is not a good idea to mix and match Hadoop and HBase versions freely. Instead it is best practice to use only versions claimed to be compatible by the developers. Another common mistake is to leave the parameters of both projects at their defaults. The default parameters are supposed to be fool-proof, but they are optimised to work well for Hadoop newbies who want to try out the system on a single node cluster; a distributed setting obviously needs more attention. Other anti-patterns include storing only tiny little files in the cluster, thus quickly running out of memory on the namenode (which stores all file information, including block mappings, in main memory for faster access).
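
One common remedy for the small-files anti-pattern is to pack many small files into one large container file, so the namenode only has to track a single entry. A minimal sketch (paths and layout made up for illustration) that packs a local directory into one SequenceFile keyed by file name:

```java
// Minimal sketch: pack many small local files into a single SequenceFile,
// using the file name as key and the raw bytes as value.
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/packed.seq"), Text.class, BytesWritable.class);
    try {
      for (File file : new File(args[0]).listFiles()) {
        byte[] content = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
          IOUtils.readFully(in, content, 0, content.length);
        } finally {
          in.close();
        }
        // One record per original file.
        writer.append(new Text(file.getName()), new BytesWritable(content));
      }
    } finally {
      writer.close();
    }
  }
}
```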

In the NoSQL track Jonathan Gray from Facebook gave a very interesting overview of the current state of HBase. As it turns out, only a few days later Facebook would announce their internal use of HBase for the newly launched Facebook messaging feature.

HBase has adopted a release cycle with separate development and production releases to get new features into interested users' hands more quickly: Users willing to try out experimental features can use the development releases of HBase; those who are not should go for the stable releases.

After focussing on improving performance in the past months, the project is currently focussing on stability: Data loss is to be avoided by all means. ZooKeeper is to be integrated more tightly for storing configuration information, thus enabling live reconfiguration (at least to some extent). In addition HBase is aiming to integrate stored-procedure-like behaviour: As explained in Google's Percolator paper $LINK, batch oriented processing gets you only so far. If data gets added constantly, it makes sense to give up some of the throughput batch-based systems provide and instead optimise for shorter processing cycles by implementing event triggered processing.

On the recommendation of one of neofonie's sys-ops I visited some of the httpd talks: First Rich Bowen gave an overview of unusual tasks one can solve with httpd. The list included things like automatically re-writing http response content to match your application. There is even a spell checker for request URLs: If marketing has given your flyer to the press with a typo in the URL, chances are that the spellchecking module can fix it automatically for each request. Common mistakes covered are switched letters, numbers replaced by letters etc. The performance cost has to be paid only in case no hit could be found – so instead of returning a 404 right away the server first tries to find the document by taking into account common misspellings.

Apache Con – Hackathon days

2010-11-23 23:17
This year on Halloween I left for a trip to Atlanta/GA. Apache Con US took place there, featuring two presentations on Apache Mahout – one by Grant Ingersoll explaining how to use Mahout to provide better search features in Solr, and one by myself giving a general introduction to the features Mahout provides, with a bit more detailed information on how to use Mahout for classification.

I spent most of Monday in Sally Khudairi's media training. In the morning session she explained the ins and outs of successfully marketing your open source project: One of the most important skills is being able to provide a dense but still accessible explanation of what your project is all about and how it differs from other projects potentially in the same space. As a first exercise attendees met in pairs, interviewing each other about their respective projects. When I summarised the information I had gotten, Sally quickly pointed out additional pieces of valuable information I had totally forgotten to ask about:


  • First of all the full name of the interviewee, including the surname.
  • Second the background of the person with respect to the project. It seemed all too natural that someone you meet at Apache Con in a media training is almost certainly either a founder of or a core committer to the project. Still it is interesting to know how long he has been contributing and whether he maybe even co-founded the project.


After that first exercise we went into detail on various publication formats. When releasing project information, the first format that comes to mind is the press release. For software projects at the ASF these are created in a semi-standardised format containing:

  • Background on the foundation itself.
  • Some general background on the project.
  • A few paragraphs on the news to be published on the project in an easily digestible format.
  • Contact information for more details.


Some of these parts can be re-used across different publications and occasions. It does make sense to keep these building blocks as a set of boilerplates ready to use when needed.

After lunch Michael Coté from RedMonk visited us. Michael has a development background; currently he works as a business analyst for RedMonk. It is fairly simple to explain technical projects to fellow developers. To get some experience in explaining our projects to non-technical people as well, Sally invited Michael to interview us. At the end of the interview Michael asked each of us whether we had any questions for him. As understanding what machine learning can do for your average Joe programmer is not at all trivial, I simply asked him for strategies for better explaining or show-casing our project. One option that came to his mind was to come up with one – or a few – example show cases where Mahout is applied to freely available datasets. Currently most data analysis systems are rather simple or based only on a very limited set of data. Showing on a few selected use cases what can be done with Mahout should be a good way to get quite some media attention for the project.

During the remaining time of the afternoon I started working on a short explanation of Mahout and our latest release. The text was reviewed by the Mahout community and published by Sally on the blog of the Apache Software Foundation. I also used it as the basis for an article on heise open that got published the same day.

The second day was reserved for a mixture of attending the Barcamp sessions and hacking away at the Hackathon. Ross had talked me into giving an overview of various Hadoop use cases, as that had been requested by one of the attendees. However it turned out the guy wasn't really interested in specific use cases: The discussion quickly turned into the more fundamental question of how far the ASF should go in promoting its projects. Should there be a budget for case studies? Should there even be some sort of marketing department? Well, clearly that is out of scope for the foundation. In addition it would run contrary to the ASF being neutral ground for vendors to collaborate towards common goals while still separately making money providing consulting services, selling case studies etc.

During the Hackathon I became a mentor for Stanbol, a new project just entering incubation. In addition I spent some time finally catching up with the Mahout mailing list.

Travelling

2010-11-22 23:17
I am currently on my way back from a series of conferences over the past three weeks, sitting in the IC from Schiphol. After three weeks of conferences, lots of new input and lots of interesting projects I learned about, it is finally time to head home and put what I have learned to good use.


As seems normal with open source conferences, I got far more input on interesting projects than I can ever expect to apply in my daily work. Still it is always inspiring to meet other developers in the same field – or even quite different fields – and learn more about the projects they are working on and how they solve various problems.

A huge thank you goes to the DICODE EU research project for sponsoring the Apache Con and Devoxx trips, another thank you to Sapo.pt for inviting me to Lisbon and covering travel expenses. A special thank you to the assistant at neofonie who made the travel arrangements for Atlanta and Antwerp: It all worked without problems, even up to me having a power outlet in the train that is finally taking me back.

Apache Mahout at Apache Con NA

2010-10-15 20:39
The upcoming Apache Con NA, taking place in Atlanta, will feature several tracks relevant to users of Apache Mahout, Lucene and Hadoop: There will be a full track on Hadoop as well as one on NoSQL on Wednesday, featuring talks on the framework itself, Pig and Hive, as well as presentations from users on special use cases and on their way of getting the systems into production.

The track on Mahout, Lucene and friends starts on Thursday afternoon, followed by another series of Lucene presentations on Friday.



Also don't miss the community and business tracks for a glimpse behind the scenes. Of course there will also be tracks on the well-known Apache Tomcat, httpd, OSGi and many, many more.

Looking forward to meeting you in Atlanta!