Even Flow
Posted on 8th December 2013
The following is part of an occasional series highlighting CPAN modules/distributions and why I use them. This article looks at Data::FlexSerializer.
Many years ago the most popular module for persistent data storage in Perl was Storable. While still used, its limitations have often caused problems. Its most significant problem was that each version was incompatible with the next, so upgrading had to be done carefully. The data store was often unportable, which made taking backups problematic. In more recent years JSON has grown to be more acceptable as a data storage format. It benefits from being a compact, human-readable data structure format, and was specifically a reaction to XML, which requires lots of boilerplate and data tags to form simple data elements. It's one reason why most modern websites use JSON for AJAX calls rather than XML.
Booking.com had a desire to move away from Storable and initially looked at moving to JSON. However, they have since designed their own data format, Sereal, but more on that later. Firstly they needed some code to read their old Storable data and translate it into JSON. The next stage was to compress the JSON. Although JSON is already a compact data format, it is still plain text. Compressing a single data structure can reduce the storage by as much as half the original data size, which when you're dealing with millions of data items can be considerable. In Booking.com's case they needed to do this with zero downtime, running the conversion on live data as it was being used. The resulting code would later become the basis of Data::FlexSerializer.
However, Booking.com found JSON to be unsuitable for their needs, as they were unable to store Perl data structures the way they wanted to. As such they created a new storage format, which they called Sereal. You can read more about the thinking behind the creation of Sereal on the Booking.com blog. That blog post also looks at the performance and sizes of the different formats, and if you're looking for a suitable serialisation format, Sereal is very definitely worth investigating.
Moving back to my needs: I had become interested in the work Booking.com had done, as within the world of CPAN Testers we store the reports in JSON format. With over 32 million reports at the time (now over 37 million), the database table had grown to over 500GB. The old server was fast running out of disk space, so before exploring options for increasing storage capacity, I wanted to see whether there was a way to reduce the size of the JSON data structures. Data::FlexSerializer was an obvious choice. It could read uncompressed JSON and return compressed JSON in milliseconds.
So how easy was it to convert all 32 million reports? Below is essentially the code that did the work:
my $serializer = Data::FlexSerializer->new( detect_compression => 1 );

for my $next ( $options{from} .. $options{to} ) {
    # fetch the next report from the database
    my @rows = $dbx->GetQuery('hash','GetReport',$next);
    return  unless(@rows);

    # decode the stored report (compressed or not), then
    # re-encode it, compressing the output
    my ($data,$json);
    eval {
        $json = $serializer->deserialize($rows[0]->{report});
        $data = $serializer->serialize($json);
    };
    next    if($@ || !$data);

    # write the compressed report back to the database
    $dbx->DoQuery('UpdateReport',$data,$rows[0]->{id});
}
Simple, straightforward and it got the job done very efficiently. The only downside was the database calls. As the old server was maxed out on I/O, I could only run the conversion script during quiet periods, as otherwise the CPAN Testers server would become unresponsive. This wasn't a fault of Data::FlexSerializer, but very much a problem with our old server.
Before the conversion script completed, the next step was to add functionality to permanently store reports in a compressed format. This only required three extra lines to be added to CPAN::Testers::Data::Generator.
use Data::FlexSerializer;
# accept both compressed and uncompressed input
$self->{serializer} = Data::FlexSerializer->new( detect_compression => 1 );
# serialize the report for compressed storage
my $data = $self->{serializer}->serialize($json);
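The read path needed no changes at all. As a minimal sketch (the row variable here is illustrative), with detect_compression enabled the same serializer transparently handles both the old uncompressed reports and the new compressed ones:

# works for both old (plain) and new (compressed) rows,
# as the serializer inspects the data before decoding
my $json = $self->{serializer}->deserialize($row->{report});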
The difference has been well worth the move. The compressed version of the table has reclaimed around 250GB. Because MySQL doesn't automatically free the space back to the system, you need to run OPTIMIZE TABLE on the table. Unfortunately, for CPAN Testers this wouldn't be practical, as it would mean locking the database for far too long. Also, with the rapid growth of CPAN Testers (we now receive over 1 million reports a month) it is likely we'll be back up to 500GB in a couple of years anyway. Now that we've moved to a new server, our backend hard disk is 3TB, so we have plenty of storage capacity for several years to come.
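For reference, the reclaim itself is a single statement for sites that can afford the table lock. Here is a minimal sketch using DBI, where the connection details and table name are purely illustrative:

use DBI;

# reclaims the freed space, but locks the table while it runs
my $dbh = DBI->connect('dbi:mysql:database=cpanstats', 'user', 'password',
    { RaiseError => 1 });
$dbh->do('OPTIMIZE TABLE reports');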
But I've only scratched the surface of why I think Data::FlexSerializer is so good. Aside from its ability to compress and uncompress, as well as encode and decode, at speed, its ability to switch between formats is what makes it such a versatile tool to have around. Aside from Storable, JSON and Sereal, you can also create your own serialisation interface using the add_format method. Below is an example, from the module's own documentation, which implements Data::Dumper as a serialisation format:
Data::FlexSerializer->add_format(
    data_dumper => {
        serialize   => sub { shift; goto \&Data::Dumper::Dumper },
        deserialize => sub { shift; my $VAR1; eval "$_[0]" },
        detect      => sub { $_[1] =~ /\$[\w]+\s*=/ },
    }
);

my $flex_to_dd = Data::FlexSerializer->new(
    detect_data_dumper => 1,
    output_format      => 'data_dumper',
);
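Once registered, the custom format behaves just like the built-in ones. A quick sketch of the round trip (the data is made up):

# serialize a Perl structure to a Data::Dumper string...
my $dd = $flex_to_dd->serialize( { foo => 'bar' } );

# ...and turn a Data::Dumper string back into a Perl structure
my $data = $flex_to_dd->deserialize($dd);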
It's unlikely CPAN Testers will move from JSON to Sereal (or any other format), but if we did, Data::FlexSerializer would be the only tool I would need to look to. My thanks to Booking.com for releasing the code, and thanks to the authors: Steffen Mueller, Ævar Arnfjörð Bjarmason, Burak Gürsoy, Elizabeth Matthijsen, Caio Romão Costa Nascimento and Jonas Galhordas Duarte Alves, for creating the code behind the module in the first place.
File Under:
database
/ modules
/ opensource
/ perl
|
Points of Authority
Posted on 27th May 2011
Back in February I did a presentation for the Birmingham Perl Mongers about a chunk of code I had been using to test websites. The code was originally based on simple XHTML validation, using the DTD headers found on each page. I then expanded the code to include pattern matching, so I could verify that key phrases existed in the pages being tested. After the presentation I received several hints and suggestions, which I've now implemented, and have set up a GitHub repository.
Since the talk, I have started to add some WAI compliance testing. I had grown frustrated trying to find online sites that claimed to validate full websites, but which either didn't work or charged for the service. There are some downloadable applications, but most require you to have Microsoft Windows installed, or again charge for the service. As I already had the bulk of the DTD validation code, it seemed a reasonable step to add the WAI compliance code. There is a considerable way to go before all the compliance tests that can be automated are written into the distribution, but some of the more immediate tests are now there.
As mentioned in my presentation to Birmingham.pm, I still have not decided on a name. Part of the problem is that the front-end wrapper, Test::XHTML, is written using Test::Builder, so you can use it within a standard Perl test suite, while the underlying package, Test::XHTML::Valid, takes a rather different approach and provides a wider API than just validating single pages against a DTD specification. Originally I had considered releasing these two packages separately, but now that I've added the WAI test package, I plan to expose more of the functionality of Test::XHTML::Valid within Test::XHTML. If you have namespace suggestions, please let me know, as I'm not sure Test-XHTML is necessarily suitable.
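To give a flavour of what the Test::Builder integration means in practice, a test script would sit alongside the rest of a distribution's test suite. In this rough sketch the function names are purely illustrative, as the API is still settling:

use Test::More tests => 2;
use Test::XHTML;

# hypothetical functions: validate a page against its declared DTD,
# then check that a key phrase appears in the page content
check_xhtml('http://example.com/', 'home page validates');
check_pattern('http://example.com/', qr/Welcome/, 'welcome text found');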
Ultimately I'm hoping this distribution can provide a more complete validation utility for web developers, which will be free to use and will work cross-platform. Those familiar with the Perl test suite structure can use it as such, but as it already has a basic stand-alone script to perform the DTD validation checks, it should be usable from the command line too.
If this sounds interesting to you, please feel free to fork the GitHub repo and try it out. If you have suggestions for fixes and more tests, you are very welcome to send me pull requests. I'd be most interested in anyone who has the time to add more WAI compliance tests and can provide a better reporting structure, particularly when testing complete websites.
File Under:
modules
/ opensource
/ perl
/ technology
/ testing
/ usability
/ web
|
Loose Change
Posted on 1st April 2011
Many years ago I wrote a set of scripts and modules that together formed a way for me to access eBay internationally. I frequently bought records from the UK, US, Germany and Australia, so those were the plugins I focused on, but the intention was to allow interfaces to other eBay sites too. I even did a presentation at YAPC::Europe in 2004, called The Perl Auctioneer, which explained my progress.
As part of the currency calculations and conversion, I used the same site that eBay themselves were using: XE.com. As I became more involved in other projects, and my international eBay buying declined, my efforts to finish and release the Perl Auctioneer waned. However, I was still using the currency conversion module, so I released it as a stand-alone package. In time this became Finance::Currency::Convert::XE.
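For anyone unfamiliar with the module, the interface is small. This is a minimal sketch of its convert method, where the currencies and value are just examples:

use Finance::Currency::Convert::XE;

my $converter = Finance::Currency::Convert::XE->new()
    or die "Failed to create object\n";

# convert 123.45 GBP into USD, returning a plain number
my $value = $converter->convert(
    source => 'GBP',
    target => 'USD',
    value  => '123.45',
    format => 'number',
) or die "Could not convert: " . $converter->error . "\n";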
Although I have occasionally updated the module, I no longer use it. However, others still do. XE.com themselves are, understandably, very protective of their data, and are very resistant to screen scrapers. Even though their own terms of use allow for personal use, and do not explicitly say screen scrapers are prohibited, they do make accessing the data from the command line very difficult. They have very recently upgraded their website with further measures to prevent automated tools scraping their data.
As I no longer use the module, I feel I have two choices: pass the distribution on to someone else who does want to invest time and effort in the module, or abandon it and remove the distribution from CPAN. As the module does not currently work with the latest XE.com site, unless someone does come forward I plan to remove the distribution from CPAN by the end of the month.
If you would like to take over the module, please email me (barbie@cpan.org) and let me know your PAUSE ID. I'll then put the wheels in motion to give you maintainer/author permissions.
File Under:
modules
/ opensource
/ perl
|
Some Heads Are Gonna Roll
Posted on 11th February 2011
Some time ago I wrote Test-YAML-Meta. At the time the name was chosen as a complement to Test-YAML-Valid, which validates YAML files in terms of their formatting rather than their data. Test-YAML-Meta took that a step further and validated the content of META.yml files included with CPAN distributions against the evolving CPAN META Specification.
With the release of Parse-CPAN-Meta I wrote Test-CPAN-Meta, which dropped the sometimes complex dependency on the more verbose YAML parsers in favour of one specifically aimed at CPAN META.yml files. With the emergence of JSON, there was a move to encourage authors to release META.json files too. Although considered a subset of the full YAML specification, JSON has a much better defined structure and more complete parser support. Coinciding with this move was the desire by David Golden to properly define a specification for the CPAN Meta files. It was agreed that v2.0 of the CPAN Meta Specification should use JSON as the default implementation. As a consequence I then released Test-JSON-Meta.
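For those who haven't used them, the test scripts involved are tiny. A minimal sketch using Test-CPAN-Meta's meta_yaml_ok function, run from within a distribution's own test suite:

use Test::More;
use Test::CPAN::Meta;

# validates the distribution's META.yml against the CPAN Meta Specification
meta_yaml_ok();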
Although the initial naming structure seemed the right thing at the time, it is becoming clearer that the names really need to be revised. As such I am looking to change two of the distributions to better fit the implementations, so in the coming weeks expect to see some updates. The name changes I'm planning are:
- Test-CPAN-Meta => Test-CPAN-Meta (no change)
- Test-YAML-Meta => Test-CPAN-Meta-YAML
- Test-JSON-Meta => Test-CPAN-Meta-JSON
Underneath these current namespaces is the Version module that describes the data structures of the various specifications. In the short term these will also move, but they will be replaced by a dependency on the main CPAN-Meta distribution in the future. There will be final releases of Test-YAML-Meta and Test-JSON-Meta, which will act as wrapper distributions to re-point the respective modules to their new identities.
File Under:
modules
/ perl
/ qa
/ testing
|