Memoirs of a Roadie

Even Flow

Posted on 8th December 2013

The following is part of an occasional series highlighting CPAN modules/distributions and why I use them. This article looks at Data::FlexSerializer.

Many years ago the most popular module for persistent data storage in Perl was Storable. While still used, its limitations have often caused problems. Its most significant problem was that each version was incompatible with another. Upgrading had to be done carefully. The data store was often unportable, which made backups problematic. In more recent years JSON has grown to be more acceptable as a data storage format. It benefits from being a compact, human-readable data structure format, and was specifically a reaction to XML, which requires lots of boilerplate and data tags to form simple data elements. It's one reason why most modern websites use JSON for AJAX calls rather than XML.

Booking.com had a desire to move away from Storable and initially looked at moving to JSON. However, since then they have designed their own data format, Sereal. But more of that later. Firstly they needed some formatting code to read their old Storable data, and translate it into JSON. The next stage was to compress the JSON. Although JSON is already a compact data format, it is still plain text. Compressing a single data structure can reduce the storage by as much as half the original data size, which can be a considerable saving when you're dealing with millions of data items. In Booking.com's case they needed to do this with zero downtime, running the conversion on live data as it was being used. The resulting code was to later become the basis for Data::FlexSerializer.

However, Booking.com found JSON to be unsuitable for their needs, as they were unable to store Perl data structures the way they wanted to. As such they created a new storage format, which they called Sereal. You can read more about the thoughts behind the creation of Sereal on the Booking.com blog. That blog post also looks at the performance and sizes of the different formats, and if you're looking for a suitable serialisation format, Sereal is very definitely worth investigating.

Moving back to my needs, I had become interested in the work Booking.com had done, as within the world of CPAN Testers, we store the reports in JSON format. With over 32 million reports at the time (now over 37 million), the database table had grown to over 500GB. The old server was fast running out of disk space, and before exploring options for increasing storage capacity, I wanted to try and see whether there was an option to reduce the size of the JSON data structures. Data::FlexSerializer was an obvious choice. It could read uncompressed JSON and return compressed JSON in milliseconds.
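To illustrate the general idea of accepting either plain or compressed JSON and always writing compressed JSON, here is a minimal sketch using only core Perl modules (JSON::PP and Compress::Zlib). It is not how Data::FlexSerializer is implemented. The report structure and the thaw() helper are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP ();
use Compress::Zlib qw(compress uncompress);

# utf8 makes encode() return bytes, which is what compress() expects
my $json = JSON::PP->new->utf8->canonical(1);

# Hypothetical report, for illustration only
my $report = {
    distribution => 'Example-Dist',
    version      => '0.01',
    grade        => 'pass',
    output       => join( "\n", map {"t/$_.t .. ok"} 1 .. 200 ),
};

my $plain  = $json->encode($report);    # uncompressed JSON text
my $packed = compress($plain);          # zlib-compressed JSON

# Hypothetical helper: accept either form of input.
# uncompress() returns undef when the input is not a valid zlib stream,
# in which case the input is treated as plain JSON.
sub thaw {
    my ($blob) = @_;
    my $text = uncompress($blob);
    $text = $blob unless defined $text;
    return $json->decode($text);
}

printf "plain: %d bytes, compressed: %d bytes\n",
    length $plain, length $packed;
```

Calling thaw() on either $plain or $packed gives back the same data structure, which is the property that allows a table to be converted row by row while it is still in use.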

So how easy was it to convert all 32 million reports? Below is essentially the code that did the work:

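    # detect_compression lets deserialize() accept both plain and compressed JSON
    my $serializer = Data::FlexSerializer->new( detect_compression => 1 );

    # the loop is assumed to run inside a subroutine, hence the return
    for my $next ( $options{from} .. $options{to} ) {
        my @rows = $dbx->GetQuery('hash','GetReport',$next);
        return unless(@rows);

        my ($data,$json);
        eval {
            # decode the stored report, then re-serialise it in compressed form
            $json = $serializer->deserialize($rows[0]->{report});
            $data = $serializer->serialize($json);
        };

        # skip any report that could not be converted
        next if($@ || !$data);

        $dbx->DoQuery('UpdateReport',$data,$rows[0]->{id});
    }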

Simple, straightforward and got the job done very efficiently. The only downside was the database calls. As the old server was maxed out on I/O, I could only run the script to convert during quiet periods as the CPAN Testers server would become unresponsive. This wasn't a fault of Data::FlexSerializer, but very much a problem with our old server.

Before the conversion script completed, the next step was to add functionality to permanently store reports in a compressed format. This only required 3 extra lines being added to CPAN::Testers::Data::Generator.

    use Data::FlexSerializer;

    $self->{serializer} = Data::FlexSerializer->new( detect_compression => 1 );

    my $data = $self->{serializer}->serialize($json);

The difference has been well worth the move. The compressed version of the table has reclaimed around 250GB. Because MySQL doesn't automatically free the data back to the system, you need to run the optimize command on a table. Unfortunately, for CPAN Testers this wouldn't be practical as it would mean locking the database for far too long. Also with the rapid growth of CPAN Testers (we now receive over 1 million reports a month) it is likely we'll be back up to 500GB in a couple of years anyway. Now that we've moved to a new server, our backend hard disk is 3TB, so has plenty of storage capacity for several years to come.

But I've only scratched the surface of why I think Data::FlexSerializer is so good. Aside from its ability to compress and uncompress, as well as encode and decode, at speed, its ability to switch between formats is what makes it such a versatile tool to have around. As well as Storable, JSON and Sereal, you can also create your own serialisation interface, using the add_format method. Below is an example, from the module's own documentation, which implements Data::Dumper as a serialisation format:

    Data::FlexSerializer->add_format(
        data_dumper => {
            serialize   => sub { shift; goto \&Data::Dumper::Dumper },
            deserialize => sub { shift; my $VAR1; eval "$_[0]" },
            detect      => sub { $_[1] =~ /\$[\w]+\s*=/ },
        }
    );
     
    my $flex_to_dd = Data::FlexSerializer->new(
      detect_data_dumper => 1,
      output_format => 'data_dumper',
    );

It's unlikely CPAN Testers will move from JSON to Sereal (or any other format), but if we did, Data::FlexSerializer would be the only tool I would need to look to. My thanks to Booking.com for releasing the code, and thanks to the authors: Steffen Mueller, Ævar Arnfjörð Bjarmason, Burak Gürsoy, Elizabeth Matthijsen, Caio Romão Costa Nascimento and Jonas Galhordas Duarte Alves, for creating the code behind the module in the first place.

File Under: database / modules / opensource / perl

The Great Gates of Kiev

Posted on 27th October 2013

I've now uploaded the survey results for YAPC::Europe 2013 and The Pittsburgh Perl Workshop 2013. Both had only a third of attendees respond, which for PPW is still 20 out of 54, and 122 out of 333 for YAPC::Europe.

YAPC::Europe

In previous years we have had higher percentages of response at YAPC::Europe, but that is possibly because I was in attendance and promoted the surveys during lightning talks, and encouraged other speakers to remind people about them. It may also be that there is a newer crowd coming to YAPCs who have never experienced the surveys, as 44 out of the 122 respondents said that this was their first YAPC. While definitely encouraging to see newer attendees, it would be great to see more of their feedback to help improve the conferences each year. Like YAPC::NA 2013, we have reintroduced the gender question. This time around I didn't get the negative reaction, but this may also be due to the fact I've had more feedback about approaching the subject this time around. Perhaps unsurprisingly, there were rather more male respondents, but I am also very encouraged to see that 8 respondents were female. While it's difficult to know the exact numbers at the event, I'd like to think that we have been able to welcome more women to the event, and hopefully will see this number increase in the future.

Looking at the locations where attendees were travelling from to attend YAPC::Europe in Kiev, it is interesting to see a much more diverse spread. Once upon a time the UK was often the highest number, even eclipsing the host country. This year, it seems many more from across the whole of Europe took advantage of the conference. Again I think this is very encouraging. If Perl is to grow and reach newer (and younger) audiences, it needs to be of interest to a large number of people, particularly from many different locations. While the UK (particularly London, thanks to Dave Cross) was perhaps the start of the European Perl community, YAPC::Europe is now capable of being hosted in just about any major European city and seeing several hundred people attend. It will be interesting to see whether Sofia next year has a similarly even spread of locations.

Of those that responded, it does seem that we had more people in the advanced realm, particularly seeing as we had 56 people respond with more than 10 years' experience of Perl. Back when we started the surveys, it would likely have been only a handful of people who attended who could have said that they had been programming Perl for more than 10 years. Thankfully though, it isn't just us old hands, as those who have only been programming in Perl for a few years or less are still making it worthwhile for speakers to come back each year and promote their projects big and small to a new audience.

One comment in the feedback however, described the Perl community as hermetic. I'm not entirely convinced that's true, but it is quite likely that some find it difficult to introduce themselves and get involved with projects. Having said that, there are plenty of attendees who have only been coming to YAPCs, or been involved with the Perl community, for a short while, who have made an impact, and are now valued contributors. So I guess it may just be down to having the right personality to just get stuck in and introduce yourself. This is one area of the Perl community that Yaakov Sloman is keen to break down barriers for, even perceived ones. We do need more Yaakovs at these events to not just break the ice, but shatter it, so we all see the benefit of getting to know each other better.

And talking of getting to know others better, it was a shame I didn't get to meet the 15 CPAN Testers who responded. We have had group photos in the past, and I'd like to do more when I next attend a YAPC, but I think it would also be very worthwhile if the Catalyst, Dancer, Padre and many other projects could find the time to do some group shots while at YAPCs. At YAPC::NA it is a bit of a tradition for all those who contribute to #perl on IRC to have a large group photo, but it's never encouraged others to do the same. Perhaps this is also a way for people to get to know project contributors better, as new attendees will have a better idea of who to look out for, rather than trying to figure out who fits an IRC nick or PAUSEID.

The suggested topics for future talks were quite diverse, and "Web Development Web Frameworks Testing" is definitely an interesting suggestion, particularly as we are seeing more and more web frameworks written in Perl now, and we are after all very well known for our testing culture. One question I'm planning to include in next year's surveys also looks at some of these topics and attempts to find out what primary interests people have. Again, this might help guide future speakers towards subjects that are of interest to their target audience.

Pittsburgh Perl Workshop

Workshops, by their very nature, are much smaller events, but with Pittsburgh being the home of the very first YAPC::NA, it is well established to host a workshop, and it would seem it attracted some high profile speakers too. Possibly as a consequence, at least one attendee felt some of the talks were a little too advanced for them. At a smaller technical event it is much harder to try and please everyone, and with fewer tracks there often is less diversity. Having said that, I hope that the attendee didn't feel too overwhelmed, and got something out of the event in other talks.

From the feedback it would seem that more knowledgeable Perl developers were in attendance, so it is understandable that more talks might lean towards more advanced subjects, but as mentioned for YAPCs, speakers shouldn't feel afraid of beginner style introductions or howtos for their project, which could appeal to all levels of interest.

Overall I think the Pittsburgh Perl Workshop went down very well.

What's Next?

I now have to compile the more detailed personal feedback for these and the YAPC::NA organisers, so expect to see some further documentation updates in the near future. In addition, I want to work more on the raw data downloads. While it's interesting to see the data as currently presented, others may have other ideas to interrogate the raw data for further interesting analysis. I also still need to put the current code base on CPAN/GitHub and add the features to integrate with Act better.

The next survey will be for the London Perl Workshop at the end of November. If you are planning a workshop, YAPC or other technical event that you'd like to have a survey for, please let me know and I'll set you up. It typically takes me a weekend to set up an instance, so please provide as much advance warning as possible.

File Under: community / conference / perl / survey / workshop / yapc

Of All The Things We've Made

Posted on 26th August 2013

Several years ago, we frequently updated the Birmingham.pm website with book reviews. To begin with, updating all the book information was rather laborious. Thankfully, on CPAN there was a set of modules, written by Andrew Schamp, that provided the framework to search online resources. I then wrote drivers for Amazon, O'Reilly & Associates, Pearson Education and Yahoo!. As the books we were reviewing were technical books, these four sources were able to cover all the books we reviewed.

A few years ago, I started working for a book company. In one project, we needed to evaluate book data, particularly for books where we had no data or very little. Often these were imports or out of stock titles that we could still order, but we were lacking information about. As such I created a number of further drivers, particularly for non-UK online catalogues, to help retrieve this information. I managed to create a collection of 17 drivers, and 1 bundle, all available on CPAN.

Via my CPAN Testers work, I've been promoting the CPAN::Changes Kwalitee Service website. Neil Bowers read one of the posts, and thought it would be good to improve the Changes files in his distributions, by way of QuestHub. I'd not heard of this site before, but after reading Neil's post I joined up, as I had been looking for a suitable way to keep a TODO list of my Perl work for a while. Neil had created a stencil to standardise the Changes file in 5 distributions, but unfortunately, I only had a few distributions of my own to complete. Another stencil emerged to add License and Repository information to 5 CPAN distributions. Again, I'd completed this for most of my distributions, apart from my 18 ISBN distributions, which I'd never got around to creating repositories for.

Then Neil had the idea to look at some of the quality aspects of all the CPAN distributions, and highlight those that might need adoption. As part of his reviews of similar modules over the past few years, he's adopted several modules, and was looking at what others he could help with. The results included 2 of the modules written by Andrew Schamp, which formed part of the ISBN searching framework I used for my ISBN distributions. Seeing as they hadn't been touched in eight years, I suspected that Andrew had moved on to other languages or work. So I contacted him to see whether he was interested in letting me take the modules on and update them.

It turns out that Andrew had written the modules for a college project, and since moving to C and with his programming interests now nothing to do with books, he was happy to hand over the keys to the modules. Over the past week, I have now taken ownership of Andrew's 5 modules, added these and my own 18 ISBN distributions to my local git repository, added all 23 to GitHub, updated the Changes file, and License & Repository info to the 5 new modules and released them all to CPAN. My next task is to update the Repository info in my 18 ISBN distributions and release these to CPAN.

Although I don't work in the book industry anymore, writing these search drivers has been fun. The distributions are perhaps my most frequent releases to CPAN, due to the various websites updating their sites. Now that I have access to the core modules in the framework, I plan to move some of the code repeated across many of the drivers into the core modules. I also plan to merge the three main modules into one distribution. When Andrew originally wrote the modules, it wasn't uncommon to have 1 module per distribution. However, as all three are tightly bound together, it doesn't make much sense to keep them separate. The two drivers Andrew wrote have not worked for several years, as unsurprisingly the websites have changed in the last 8 years. I've already updated one, and will be working on the other soon.

It's nice to realise that a few of my CPAN Testers summary posts inspired Neil, who in turn has inspired me, and has ended up with me helping to keep a small corner of CPAN relevant and up to date again.

If you're a new Perl developer, who wants to take a more active role in CPAN and the Perl community, a great way to start is to look at the stencils on QuestHub, and help to patch and submit pull/RT requests to update distributions. If you feel adventurous, take a look at the possible adoption list, and see whether anything there is something you'd like to fix and bring up to date. You can also look at the failing distributions lists, and see whether the authors would like help with the test suites in their distributions. You can then create your tasks as quests in QuestHub and earn points for your endeavours. Be warned though, it can become addictive :)

There is one more ISBN distribution on the adoption list, and I have now emailed the author. Depending on the response, I may be going through the adoption process all over again :) [Late update, the author came back to me and he's happy for me to take on his distribution too]

File Under: isbn / opensource / perl

Who Knows Where The Time Goes?

Posted on 24th July 2013

YAPC::NA 2013 - The Results Are Out

The YAPC::NA 2013 Conference Survey results are now online.

The number of responses was much lower than in previous years, which is a shame, but may in part be due to the survey's length, as one comment I received said it was too long. Reviewing the survey, I'd have to agree, and I'll be removing some of the questions for future surveys. Some of the questions had good intentions originally, and did provide an insight into what people got out of the conference. However, there is now a degree of predictability about them that doesn't warrant their inclusion. Questions such as those about holidays and speakers you missed really don't add anything any more. The latter has generated some interesting comments over the years, but typically the same names appear each year.

This year was also slightly different, as the organisers asked for a lot of additional questions, particularly related to the Code of Conduct. I will be forwarding the results of these questions to the TPF in the next day or two. They may choose to make the results public, but for now they won't appear on the YAPC Survey site. Of the other questions they asked, most related specifically to YAPC::NA, and wouldn't be applicable to other conferences and workshops. These too will be reviewed for next year.

Interestingly, VM Brasseur has done some analysis of the survey data, particularly around the age of attendees, and the length of time people have been a Perl programmer. Although the survey includes the former, it doesn't really include the latter. We do ask what level people feel they are at, but it'll be an area I'll be reviewing for future surveys.

As both the surveys and VM's analysis show, the Perl community (at least those answering the survey) is getting older. I've noticed this too when attending. There are new and younger people attending, but generally the audience has been getting older. In the UK, this was identified in a technical article I read a few years ago (sadly I don't have a link to the source), which highlighted a shift in the late 80s/early 90s away from writing computer games on Spectrums, Dragons and Beebs to just playing consoles. I suspect the age of attendees at other technical conferences is also seeing a shift.

As noted in a previous post, I'm going to be looking at the Conference Survey software over the summer, and hopefully integrate it more with the Act software. I'm hoping this may encourage more to respond. I'll also be reviewing the survey itself, and looking at better and more relevant questions to include. If you have ideas of how to improve the survey, please feel free to drop me an email.

Enjoy :)

File Under: conference / perl / survey / yapc

The Time of the Turning

Posted on 7th May 2013

A few weeks ago I had the pleasure of attending the 6th annual QA Hackathon. The event has become THE event for developers of test modules, projects and toolchain applications to come together to discuss ideas and plan for the future, as well as release some great work while they are there too.

This year Shadowcat, the primary sponsors, took on the organisational duties. The event was originally to be in London, but due to personal circumstances the decision was made to move the location to Lancaster in the North West of England. Personally, I think they made the right choice. The venue itself was the new InfoLab building at Lancaster University. The attendees came from far and wide once again, and it was great to catch up with friends old and new, and even be introduced to some newer friends.

My plan for the weekend was mainly to look at CPAN Testers. With the servers for the Metabase coming soon, David Golden and I had hoped to be able to set them up, and start looking at changing the backend code to work with the new Metabase database. Unfortunately, the servers weren't ready for us just yet, so I started looking at other things. For me, one area of CPAN Testers, particularly the cpanstats database side of things, needed attention: the speed of processing reports.

My first task once settled in, was to look at the way that the reports are consumed from the Metabase. Due to the way SimpleDB has become very unreliable with the results it sends, in order to avoid missing reports the criteria for the date search has been altered slightly to be a little more thorough, and a smaller range is now used to retrieve a set of GUIDs. The results now appear to be a little more complete, although we still appear to be missing some every so often. There is also a tail of log.txt which also helps to catch up with the reports. This work saw a new release of CPAN-Testers-Data-Generator.

A big factor in the slowness of the CPAN Testers server is that it requires a lot of disk I/O, with the database updates being a key factor. The most intensive updates are those surrounding the SQLite database that could be downloaded. This also includes creating the Gzip and Bzip2 archives. As only web crawlers seem to be downloading the files, I've suspended the update. This has now freed up a lot of resources and consequently some of the other tasks, particularly the builder, have improved.

Next, the builder was the focus of my attention. Previously the builder has been building pages for both authors and distros all at once. Although the author pages are viewed slightly less, they were getting built more frequently, due to the way the requests are pushed into the queue for each report. Initially the logic for building pages was altered, which improved some of the higher requested pages, but the more optimal solution was to split the builder into two, one for authors and one for distros. With the reduction in processing elsewhere, this improved the builder performance considerably. Monitoring the way the author pages are built since the hackathon, has also allowed me to alter when the builder for authors runs. This has then allowed the builder for distros to take a higher priority. With more distro pages than authors, this now gives distro pages more opportunity to be built quicker. Currently reports are being built in less than 24 hours of being submitted. These updates saw a new release of CPAN-Testers-WWW-Reports.
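As a rough sketch of the split described above, the following shows one mixed request queue being divided into one list per page type, so that each builder can be run and scheduled on its own. This is not the code from CPAN-Testers-WWW-Reports. The request structure, the names in it and the split_queue() helper are hypothetical, and collapsing duplicate requests is an assumption of this sketch rather than something taken from the real builder.

```perl
use strict;
use warnings;

# Hypothetical queue: each incoming report pushes one request per page affected
my @requests = (
    { type => 'author', name => 'AUTHORA' },
    { type => 'distro', name => 'Example-Dist' },
    { type => 'author', name => 'AUTHORA' },
    { type => 'distro', name => 'Another-Dist' },
    { type => 'distro', name => 'Example-Dist' },
);

# Divide the mixed queue into one list per page type, keeping the
# first occurrence of each page and dropping repeated requests
sub split_queue {
    my @queue = @_;
    my ( %queues, %seen );
    for my $req (@queue) {
        next if $seen{ $req->{type} }{ $req->{name} }++;
        push @{ $queues{ $req->{type} } }, $req->{name};
    }
    return \%queues;
}

my $queues = split_queue(@requests);

# Each builder only works through its own list; listing distros first
# mirrors giving the distro builder the higher priority
for my $type (qw(distro author)) {
    printf "%s builder: %s\n", $type, join ', ', @{ $queues->{$type} || [] };
}
```

With separate lists, when and how often each builder runs becomes a scheduling decision, rather than something dictated by the order in which requests arrived.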

Another release while at the event, related to the QA Hackathon itself, was the main QA Hackathon website. Before the event, BooK had asked if the files that make up the website that the main QA Hackathon uses could be added to GitHub. As such, I packaged up the site into a git repository and released it. If you wish to help contribute to the site, please do.

Although there was a lot of coding work involved in the weekend, one of the bigger uses of time was the Lancaster Consensus organised by David Golden. For a few hours each afternoon, a large group of key toolchain developers, secondary project developers and various interested parties gathered to discuss various aspects associated with configuration, installation, testing and specification of Perl and CPAN. With so many developers in one room, it wasn't too surprising to have a few opposing views, but with a guiding hand from David, we did achieve a consensus. If you wish to read the outcome, please read David's write-ups of the discussion points. The Consensus meetings were perhaps the greatest achievement of the event. While there might not have been too much immediate coding output from them, the potential to improve Perl and CPAN is considerable. From a CPAN Testers perspective, Post-installation testing, Case insensitive package permissions and Rules for distribution naming were perhaps of most interest. Although it may be some time before Post-installation testing could be hooked into a CPAN Testers smoker, it will be a valuable addition to the testing reports against pre-installed environments.

During the event, I had several discussions with Garu regarding his work on the cpanminus smoker client, and the common smoker client. In the last minutes of the hackathon we were able to push through a very notable report submission. It is exactly this sort of collaborative effort that makes these hackathons worthwhile. I look forward to seeing everyone again in Lyon.

The QA Hackathons could not be the success they are without the support of all the sponsors. My personal thanks to them for helping to provide accommodation, food and a venue for us all to hack. A big thank you to cPanel, Dijkmat, Dyn, Eligo, Evozon, $foo, Shadowcat Systems Limited, Enlightened Perl Organisation and Mongueurs de Perl.