Counting Out Time

Posted on 20th March 2014

I had an SQL query I wanted to translate into a DBIx::Class statement. I knew there must be a way, but trying to find the answer took some time. As a result I thought it worth sharing, in case somebody else is trying to find a similar answer.

The SQL I was trying to convert was:

SELECT status,count(*) AS mailboxes,
count(distinct username) AS customers
FROM mailbox_password_email GROUP BY status

Running this by hand gave me:

+-----------+-----------+-----------+
| status    | mailboxes | customers |
+-----------+-----------+-----------+
| active    |     92508 |     48791 |
| completed |       201 |       174 |
| inactive  |    116501 |     56843 |
| locked    |    129344 |     61220 |
| pending   |      1004 |       633 |
+-----------+-----------+-----------+

My first attempt was:

my @rows = $schema->resultset('Mailboxes')->search({},
    {
        group_by => 'status',
        distinct => 1,
        '+select' => [
            { count => 'id', -as => 'mailboxes' },
            { count => 'username', -as => 'customers' } ]
    })->all;

Unfortunately this gave me the following error:

DBIx::Class::ResultSet::all(): Useless use of distinct on a grouped 
resultset ('distinct' is ignored when a 'group_by' is present) at
myscript.pl line 469

So I took the 'distinct => 1' out and got the following results:

+-----------+-----------+-----------+
| status    | mailboxes | customers |
+-----------+-----------+-----------+
| active    |     92508 |     92508 |
| completed |       201 |       201 |
| inactive  |    116501 |    116501 |
| locked    |    129344 |    129344 |
| pending   |      1004 |      1004 |
+-----------+-----------+-----------+

That might be distinct for the mailboxes, but sadly it is not distinct for the customers. So I tried:

my @rows = $schema->resultset('Mailboxes')->search({},
    {       
        group_by  => 'status',
        '+select' => [
            { count => 'id', -as => 'mailboxes' },
            { count => 'username', -as => 'customers', distinct => 1 } ]
    })->all;

and get:

Failed to retrieve mailbox password email totals: 
DBIx::Class::ResultSet::all(): Malformed select argument - too many keys
 in hash: -as,count,distinct at myscript.pl line 469\n

After several attempts at Googling, and a read through DBIx::Class::Manual, I finally stumbled on the way to express SELECT COUNT(DISTINCT colname).

My query now looks like:

my @rows = $schema->resultset('Mailboxes')->search({},
    {
        group_by  => 'status',
        '+select' => [
            { count => 'id', -as => 'mailboxes' },
            { count => { distinct => 'username' }, -as => 'customers' } ]
    })->all;

And provides the following results:

+-----------+-----------+-----------+
| status    | mailboxes | customers |
+-----------+-----------+-----------+
| active    |     92508 |     48791 |
| completed |       201 |       174 |
| inactive  |    116501 |     56843 |
| locked    |    129344 |     61220 |
| pending   |      1004 |       633 |
+-----------+-----------+-----------+

Exactly what I was after.
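
For completeness, here is a sketch of how I would read the counts back out of the rows. Note the '+as' attribute, which wasn't needed for the query itself, but gives the aliases DBIC-side names so the values can be fetched with get_column():

    my @rows = $schema->resultset('Mailboxes')->search({},
        {
            group_by  => 'status',
            '+select' => [
                { count => 'id', -as => 'mailboxes' },
                { count => { distinct => 'username' }, -as => 'customers' } ],
            '+as'     => [ qw(mailboxes customers) ],
        })->all;

    for my $row (@rows) {
        printf "%-10s %9d %9d\n",
            $row->status,                      # a real column, so the normal accessor works
            $row->get_column('mailboxes'),     # COUNT(*) AS mailboxes
            $row->get_column('customers');     # COUNT(DISTINCT username) AS customers
    }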

DBIx::Class does require some head-scratching at times, but looking at the final statement it now seems obvious, and pretty much maps directly to my original SQL!

Hopefully, this provides a lesson others can find and learn from.

File Under: database / perl


Even Flow

Posted on 8th December 2013

The following is part of an occasional series highlighting CPAN modules/distributions and why I use them. This article looks at Data::FlexSerializer.

Many years ago the most popular module for persistent data storage in Perl was Storable. While still used, its limitations have often caused problems. Its most significant problem was that each version was incompatible with the last, so upgrading had to be done carefully. The data store was often unportable, which made taking backups problematic. In more recent years JSON has grown to be more widely accepted as a data storage format. It benefits from being a compact, human-readable data structure format, and was specifically a reaction to XML, which requires lots of boilerplate and data tags to represent simple data elements. It's one reason why most modern websites use JSON for AJAX calls rather than XML.

Booking.com had a desire to move away from Storable and initially looked at moving to JSON. However, since then they have designed their own data format, Sereal. But more of that later. Firstly they needed some formatting code to read their old Storable data and translate it into JSON. The next stage was to compress the JSON. Although JSON is already a compact data format, it is still plain text. Compressing a single data structure can reduce the storage by as much as half the original data size, which when you're dealing with millions of data items can be considerable. In Booking.com's case they needed to do this with zero downtime, running the conversion on live data as it was being used. The resulting code was later to become the basis for Data::FlexSerializer.
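
To give a flavour of the kind of conversion involved, something along these lines would do it. This is my own illustration rather than Booking.com's code; $old_blob simply stands in for whatever the old record held:

    use Data::FlexSerializer;

    # accept Storable or JSON input, compressed or not, and emit compressed JSON
    my $converter = Data::FlexSerializer->new(
        detect_compression => 1,
        detect_storable    => 1,
        output_format      => 'json',
        compress_output    => 1,
    );

    my $new_blob = $converter->serialize( $converter->deserialize($old_blob) );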

However, Booking.com found JSON to be unsuitable for their needs, as they were unable to store Perl data structures the way they wanted to. As such they created a new storage format, which they called Sereal. You can read more about the thoughts behind the creation of Sereal on the Booking.com blog. That blog post also looks at the performance and sizes of the different formats, and if you're looking for a suitable serialisation format, Sereal is very definitely worth investigating.

Moving back to my needs, I had become interested in the work Booking.com had done, as within the world of CPAN Testers, we store the reports in JSON format. With over 32 million reports at the time (now over 37 million), the database table had grown to over 500GB. The old server was fast running out of disk space, and before exploring options for increasing storage capacity, I wanted to try and see whether there was an option to reduce the size of the JSON data structures. Data::FlexSerializer was an obvious choice. It could read uncompressed JSON and return compressed JSON in milliseconds.

So how easy was it to convert all 32 million reports? Below is essentially the code that did the work:

    # detect_compression means deserialize() copes with both compressed and
    # uncompressed input, so the script can safely be re-run over any range
    my $serializer = Data::FlexSerializer->new( detect_compression => 1 );

    for my $next ( $options{from} .. $options{to} ) {
        my @rows = $dbx->GetQuery('hash','GetReport',$next);
        return  unless(@rows);

        my ($data,$json);
        eval {
            # inflate the stored report, then re-serialize it
            # (now compressed) ready to be written back
            $json = $serializer->deserialize($rows[0]->{report});
            $data = $serializer->serialize($json);
        };

        next  if($@ || !$data);

        $dbx->DoQuery('UpdateReport',$data,$rows[0]->{id});
    }

Simple, straightforward, and it got the job done very efficiently. The only downside was the database calls. As the old server was maxed out on I/O, I could only run the conversion script during quiet periods, otherwise the CPAN Testers server would become unresponsive. This wasn't a fault of Data::FlexSerializer, but very much a problem with our old server.

Before the conversion script completed, the next step was to add functionality to permanently store reports in a compressed format. This only required 3 extra lines being added to CPAN::Testers::Data::Generator.

    use Data::FlexSerializer;

    # create the serializer once, accepting compressed or uncompressed input
    $self->{serializer} = Data::FlexSerializer->new( detect_compression => 1 );

    # serialize (and compress) each report before it is stored
    my $data = $self->{serializer}->serialize($json);

The difference has been well worth the move. The compressed version of the table has reclaimed around 250GB. Because MySQL doesn't automatically free the space back to the system, you need to run OPTIMIZE TABLE on the table. Unfortunately, for CPAN Testers this wouldn't be practical, as it would mean locking the database for far too long. Also, with the rapid growth of CPAN Testers (we now receive over 1 million reports a month) it is likely we'll be back up to 500GB in a couple of years anyway. Now that we've moved to a new server, our backend hard disk is 3TB, so we have plenty of storage capacity for several years to come.

But I've only scratched the surface of why I think Data::FlexSerializer is so good. Aside from its ability to compress and uncompress, as well as encode and decode, at speed, its ability to switch between formats is what makes it such a versatile tool to have around. Aside from Storable, JSON and Sereal, you can also create your own serialisation interface, using the add_format method. Below is an example, from the module's own documentation, which implements Data::Dumper as a serialisation format:

    Data::FlexSerializer->add_format(
        data_dumper => {
            serialize   => sub { shift; goto \&Data::Dumper::Dumper },
            deserialize => sub { shift; my $VAR1; eval "$_[0]" },
            detect      => sub { $_[1] =~ /\$[\w]+\s*=/ },
        }
    );
    
    my $flex_to_dd = Data::FlexSerializer->new(
      detect_data_dumper => 1,
      output_format => 'data_dumper',
    );
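
For completeness, a quick usage sketch of that custom format. This is my own illustration; it assumes Data::Dumper has been loaded, and with the module's default settings the serialized output will also be compressed:

    # round-trip a structure through the 'data_dumper' format defined above
    my $dumped = $flex_to_dd->serialize( { answer => 42 } );
    my $back   = $flex_to_dd->deserialize($dumped);    # back to a hashref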

It's unlikely CPAN Testers will move from JSON to Sereal (or any other format), but if we did, Data::FlexSerializer would be the only tool I would need to look to. My thanks to Booking.com for releasing the code, and thanks to the authors: Steffen Mueller, Ævar Arnfjörð Bjarmason, Burak Gürsoy, Elizabeth Matthijsen, Caio Romão Costa Nascimento and Jonas Galhordas Duarte Alves, for creating the code behind the module in the first place.

File Under: database / modules / opensource / perl


Dreams Never End

Posted on 8th August 2007

After promising a while ago to upload some of my code, I've created a new section on the site. Click the Code tab on the menu at the top and you'll see what I've done.

My first launch is the latest version of my dbdump.pl utility. I use it to backup my databases to remote servers. It supports MySQL and PostgreSQL at the moment, but potentially it could support others. At some point I'll get around to packaging other utilities too. If you find the code useful, please let me know.

File Under: database / opensource / perl


History Never Repeats

Posted on 5th June 2007

It's always handy doing backups, but getting into the habit of doing them is another thing. I've mostly just used a simple bash script, run manually when I need to, which isn't what I should be doing. Today I decided it was about time I organised something to officially back up the databases. At the moment I'm only backing up the MySQL databases, but I will add the one PostgreSQL DB once it's been rehosted.

Originally I was going to use automysqlbackup, which may well have worked, but it looked like a bit too much work for a relatively simple job, only supported MySQL, and didn't enable me to back up via ssh to multiple servers. As the data potentially contains private information, sending it via email is not an option. So I wrote my own. The result is a program a tenth of the size of automysqlbackup: it's written in Perl, backs up multiple sites to multiple servers, and the only non-core module is Net::SCP :)
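
Until I put the real thing online, here's a rough sketch of the approach. The database names, servers, remote user and paths are purely illustrative, not the actual script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::SCP;
    use POSIX qw(strftime);

    my @databases = qw(mysite blog);                 # illustrative database names
    my @servers   = qw(backup1.example.com backup2.example.com);
    my $stamp     = strftime '%Y%m%d', localtime;

    for my $db (@databases) {
        # dump and compress the database locally
        my $dump = "/tmp/$db-$stamp.sql.gz";
        system("mysqldump $db | gzip > $dump") == 0
            or die "mysqldump failed for $db: $?";

        # copy the dump to each backup server over ssh
        for my $server (@servers) {
            my $scp = Net::SCP->new($server, 'backup');
            $scp->put($dump, "backups/$db-$stamp.sql.gz")
                or warn "scp to $server failed: $scp->{errstr}";
        }

        unlink $dump;
    }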

Once I've added the necessary PostgreSQL support and written some documentation, I'll put it online somewhere.

File Under: backups / database


Some Rights Reserved. Unless otherwise expressly stated, all original material of whatever nature created by Barbie and included in the Memories Of A Roadie website and any related pages, including the website's archives, is licensed under a Creative Commons Attribution Non-Commercial License. If you wish to use material for commercial purposes, please contact me for further assistance regarding commercial licensing.