Caching Functions In Perl

Synopsis

There are occasions, where you write a function that takes some parameters and outputs consistentish data. Example:

sub add {
    my($first,$second)=@_;
    return $first + $second;
}

If you call add(1,1) you get 2: always. Consistent input yields consistent output. Now let’s say you have something much more complex, possibly slow, that will possibly be called over and over again in the same program. Yes, we could create a hash table of value and before calling the function, we check the hash table… But that’s not why we have machines to work for us, and is certainly not why we have Perl.

Lesson 1 – Basics

By using three neat tricks of Perl- The symbol table, anonymous functions, and variable scope – we can do that work behind the scenes, and let our calling code be oblivious to whether they’re getting “live” data, or “cached”.

Symbol Table

The symbol table lets us mess with things. Take the below two functions:

sub a { return "a"; }
sub b { return "b"; }

For whatever reason, we’ve decided that everything in our program that calls function “a”, should actually be calling function “b”. Perhaps we don’t have access to the programs making the calls. Perhaps we’re just awesome. Either way, the solution is simple:

sub a { return "a"; }
sub b { return "b"; }
*a = &b;

So you call &b and get “b”. You call &a and get “b”. Magic. We’ll be back to this.

Anonymous Functions

Perl allows you to do some pretty powerful things. One of those things is the very simple idea of an anonymous function. An anonymous function isn’t called traditionally, but rather crammed into a variable and possibly passed around to other functions or whatever. Example:

my $afunc = sub { return "a" }

In this case $afunc, a variable, now contains a reference to the code declared by the sub, so we can pass it around as any other variable, and invoke it ala:

$return = &$afunc; # $return eq "a"
$return = $afunc->(); # $return eq "a"
$return = $afunc; # $return now contains a copy of the function
                  # that $afunc contains

Variable Scope

To many people, Perl’s scoping logic is maddening. I’m not going to talk about it at length here, but rather show a very simply example that is important for this exercise.

my $var = "x";
{
  print "$var"; # x
  $var = "y";
  print "$var"; # y
}
print "$var"; # y

If you’ve programmed in any programming language that has blocks, this makes total sense, and you’re wondering why I even bring it up.  Let’s change this slightly:

my $var = "x";
$xfunc = sub {
  print "$var"; # x
  $var = "y";
  print "$var"; # y
}
print "$var"; # x

We’ve now turned that block into an anonymous function. The first time we call that function, we get the expected output of “xy”. Thereafter, we get “yy” because the function doesn’t have a copy of $var, rather it is using the parent’s $var. So what happens if we pass $xfunc as a variable to some other function miles away where $var doesn’t exist? It still references the existing $var which will exist in memory until nothing else in memory is pointing to it. Consider $var our cache: This is critical to how I will implement this cache.

Lesson 2 – A Simple Cache

So we know how to mess with the symbol table, compose anonymous functions, and how variable scope can be perverted to our needs. So how do combine those into a function that creates caching functions?

sub cache_it {
  my $func = shift;
  my %cache;

  $afunc = sub {
    my $key = join ',', @_;

    unless(exists $cache{$key}) {
      # We don't have a cached value, make one
      my $val = $func->(@_);

      # ... and cache it
      $cache{$key} = $val;
    }

    # return the cached value
    return $cache{$key};
  };
  return $afunc;
}

So this function takes a function reference as a value, and returns a caching anonymous function.  It’s cache is a hash that is declared in its parent’s scope, and thus will outlive individual invocations of the anonymous function itself.

We key the cache by joining the arguments with a character, in this case a simple comma. We check the cache to see if there is a matching entry, if so we return it. If not we call the real function that we’ve stored anonymously, and store its return values in the cache. In all cases, we return the cached version of the results.

So, when we have a function we want to make a caching version of, it’s as easy as:

sub myfunc {
  my $thing1=shift;
  # Do something complicated with $thing1, and generate
  # consistent result $thing2
  return $thing2;
}

*myfunc = cache_it(\&myfunc); # Magic

That’s it.  Anywhere in your code you call myfunc, you’ll now be calling the caching version. If you’re having problems and need to debug, just comment out that one little line starting with the asterisk, and your code will call the real myfunc “as usual”. It’s that easy.

Lesson 3 – A Better Cache

I don’t use the cache I wrote above. I tend to write long-living infrastructure code that caches or micro-caches result sets that need to expire or that I need to control. Take the above function and adding item expiration and some simple commands is pretty easy, however.

sub cache_it {
  my ($func,$life) = @_;
  my %cache;

  $afunc = sub {
    my $key = join ',', @_;
    my $now=time();

    if($key eq 'CLEARCACHE') {
      %cache=();
      return;
    }

    # Should the cache be cleaned?
    if($key eq 'CLEANCACHE') {
      foreach my $ckey (keys %cache) {
        delete $cache{$ckey} if $cache{$ckey}->{expires} < $now;
      }
      return;
    }

    unless(exists $cache{$key} and $cache{$key}->{expires} >= $now) {
      # We don't have a cached value, make one
      my $val = $func->(@_);

      # ... and cache it
      $cache{$key} = {
           value => $val,
           expires => $now+$life,
      };
    }

    # return the cached value
    return $cache{$key}->{value};
  };
  return $afunc;
}

And when we call it:

*myfunc = cache_it(\&myfunc,300); # Magic

Where the second parameter is the number of seconds we want a cache item to stay valid. The anonymous function our new cache_it function returns now does three things:

  1. If you call the function with the parameter “CLEARCACHE”, it will empty the entire cache
  2. If you call the function with the parameter “CLEANCACHE”, it will purge cache items that have expired.
  3. If you call the function normally, it will save both the results and the current timestamp into the cache entry, and call the real function if the results in the cache are longer than the life you specified when you created it.

Use

I’ve used very similar caching technology in lots of places:

  1. Anagram generator (or any recursive, consistent function, like Fibonacci or gcd)
  2. DNS resolver
  3. Database queries
  4. Surge-handling in web apps (you can “turn on” that caching when you start surging, and “turn off” when request load goes down)
  5. Smart logging functions (inverse cache – don’t bother logging something if you’ve logged it already)
  6. Distributed caching infrastructure

Prove It

Fibonacci numbers are fun:

sub fib {
    my $n = shift;
    return $n if $n < 2;
    return fib($n - 1) + fib($n - 2);
}

So this will make 2 calls to itself for every one call where n>=2, but by remembering what fib(2),fib(3),fib(4) etc are, it won’t have to recompute them

[root@mlap ~]# perl ./x 40
    102334155 : 146
[root@mlap ~]# perl ./x 40 cache
    102334155 : 0

The first is a run of the above fib function on a value of 40. The ‘answer’ is 102334155, and it took 146 seconds. The second is a run of the above fib function that has been cached using a cache method very similar to the “simple cache” in Lesson 1. The ‘answer’ is 102334155, and it took <1 second.

Gotchas

This is not a magic bullet. If your function doesn’t return consistent data given a consistent parameter list, or if your function affects something else that will be missed by returning cached results, this isn’t your solution.

The function above assumes that the function it is capturing returns a single scalar (which could be a reference). If the cached function returns an array or a hash, the cache function will not work as expected. This is very easy to fix (4 lines of code), but I wanted to point that out before you get overly frustrated.

 

Posted in Architecture, Coding, Linuxy | Tagged | Leave a comment

Sorting Strings Ending In Numbers In Perl

Synopsis

I deal with a lot of names that look like “somedumbserver2” and “somedumbserver15”. Using Perl’s default sort, “somedumbserver2” comes before “somedumbserver15” because the character “2” is greater than the character “1”, and that’s where the sort stops.  This sort, “snsort” or “string-number sort”, is a little wiser, and all else being equal compares the number “2” to the number “15”, and properly orders it. It’s also safe to use if some strings don’t conform to this pattern.

I wrote this to be a Vmethod to list objects in Template Toolkit, so I kept the surrounding code intact in case anyone wants to use it as-is, but “sub sns” can be taken out and put in any Perl program, and used in conjunction with the Perl “sort” function.

Code

$Template::Stash::LIST_OPS->{ snsort } = sub {
        my $list = shift;
        return sort sns @$list;
        
        sub sns {
                my @aparts=$a =~ /^(\D+)(\d+)$/;
                my @bparts=$b =~ /^(\D+)(\d+)$/;
                
                return $a cmp $b unless @aparts and @bparts; # Safety
                return $aparts[0] cmp $bparts[0] unless $aparts[0] eq $bparts[0];
                return $aparts[1] <=> $bparts[1];
        }
};
Posted in Coding, Linuxy | Tagged , , | 1 Comment

Fixing Mis-cased URIs Under Apache

Synopsis

This is rather old code, but saved my bacon more than once.Runs under Apache with Mod_Perl, and corrects the URI requested when it is giving lazily. Thus a request for “/INDEX.HTML” is rewritten to “/index.html” as appropriate.

Code

=head1 NAME

M::Apache::fixcase - Want to fix case, showing the user the error of their ways?

=head1 SYNOPSIS

PerlModule M::Apache::fixcase

<VirtualHost>

PerlTransHandler M::Apache::fixcase

</VirtualHost>

=cut

package M::Apache::fixcase;

use Apache2::Const qw(DECLINED);
use Apache2::RequestUtil ();

sub handler {
        my $r = shift;
        my $file=$r->document_root() . $r->uri();
        unless(-e $file) {
                # File doesn't exist, let's try to find it!
                my @uribits=split(/\//,$r->uri());
                shift @uribits; # shift off the beginning ''
                my $newuri=$r->document_root(); # $newuri is the uri we're building
                my $sofar=0;
                foreach my $bit (@uribits) {
                        $bit =~ s/\(|\)|\`//g; # stupid url tricks
                        $sofar=0; # reset sofar, so we know if we're still on track
                        opendir(FD,$newuri) or last;
                        foreach my $dthing (readdir(FD)) {
                                if($dthing =~ /^\./) { next; } # safety first;
                                if($dthing =~ /^$bit$/i) { # case-insenstive pattern match
                                        # We have a match!
                                        $sofar=1;
                                        $newuri .= "/" . $dthing;
                                        last;
                                }
                        }
                        closedir(FD);
                        unless($sofar) { last; } # we missed this bit, don't bother recursing further
                }
                if($sofar) {
                        # We made it!
                        my $dr=$r->document_root();
                        $newuri =~ s/^$dr//; # strip off the document_root from the new uri
                        $r->uri($newuri); # Set the uri to the new uri
                } # else we can't do anything...
        }
        return DECLINED; # Always pass it on
}

1;
Posted in Coding, Linuxy | Tagged , , | Leave a comment

Merging Filesystems Virtually Under Apache

Synopsis

This is rather old code, but saved my bacon more than once. Runs under Apache with Mod_Perl, and “merges” any number of filesystems. You’ll see the @docroots array that takes a list of “document roots”. These all need correct permissions in Apache, or else the end request will fail. This module doesn’t trump Apache’s security model. In short, it tries the request against each of the entries in @docroots successively, setting the URI to the first found file, or declining altogether if none is found.

Strategy

Most people who understand their web systems, know where their requests go. For performance, you really want to make sure the file system most likely to serve the request is listed first, the next most likely second, et cetera.

The exception to this is if you’re using this as a migration strategy. I have, twice, listed a mostly empty file system first, making the second file system “read-only”. The webby people then had a way to migrate their content to the new file system systematically and deliberately, with immediate results. This module is very efficient, so failed lookups are invisible to the user experience.

Code

=head1 NAME

M::Apache::fsmerge - Want to merge multiple filesystems into one Apache filesystem?

=head1 SYNOPSIS

PerlModule M::Apache::fsmerge

<VirtualHost>

PerlTransHandler M::Apache::fsmerge

</VirtualHost>

=cut

package M::Apache::fsmerge;

use Apache2::Const qw(DECLINED);
use Apache2::RequestUtil ();

=head1 VARIABLES

=over 4

=item @docroots

An array of filesystem paths to merge. They are processed in array order, and the first match wins.

=cut

my @docroots = (
                        "/var/www/html/web/managed/groups",
                        "/var/www/html/webusers",
                        );
sub handler {
        my $r = shift;
        my $sofar=0;
        foreach my $dr (@docroots) {
                my $file=$dr . $r->uri();
                $sofar=0;
                my $newuri=$dr; # $newuri is the uri we're building
                unless(-e $file) {
                        # File doesn't exist, let's try to find it!
                        my @uribits=split(/\//,$r->uri());
                        shift @uribits; # shift off the beginning ''
                        foreach my $bit (@uribits) {
                                $bit =~ s/\(|\)|\`//g; # stupid url tricks
                                $sofar=0; # reset sofar, so we know if we're still on track
                                opendir(FD,$newuri) or last;
                                foreach my $dthing (readdir(FD)) {
                                        if($dthing =~ /^\./) { next; } # safety first;
                                        if($dthing =~ /^$bit$/i) { # case-insenstive pattern match
                                                # We have a match!
                                                $sofar=1;
                                                $newuri .= "/" . $dthing;
                                                last;
                                        }
                                }
                                closedir(FD);
                                unless($sofar) { last; } # we missed this bit, don't bother recursing further
                        }
                } else {
                        # Woo hoo!
                        $sofar=1;
                        $newuri=$file;
                }

                if($sofar) {
                        # We made it!
                        $newuri =~ s/^$dr//; # strip off the document_root from the new uri
                        $r->document_root($dr); # Set the new docroot
                        $r->uri($newuri); # Set the uri to the new uri
                        last;
                } # else we can't do anything...
        }

        return DECLINED; # Pass it on, regardless of the outcome

}

1;
Posted in Architecture, Coding, Linuxy | Tagged , , | Leave a comment

Reliance On GPS Is A Liability

I’ve written a couple times previously about how we shouldn’t rely on GPS solely. It appears the US government might have just come to that conclusion as well.

GPS is awesome, but it cannot be the only mechanism used for navigation. CANNOT.

Posted in Architecture, Life | Leave a comment

Use MongoDB

So an Anonymous Coward’s pastebin rant against MongoDB has an awful lot of legs. I circulated a few thoughts yesterday morning to head-off the inevitable concerns of “um, we’re doing a lot with Mongo, and now I’m nervous”:

[It] really smacks of oops-I-didn’t-plan-and-got-bit, which is entirely too easy to do on the razor’s edge. 10s millions users without a decent pre-launch beta, load-testing etc? Most of the arguments here are things well-known in the mongo community: GWL, sharding problems under load, eventual consistency, etc.

So to everyone out there who, after reading that pastebin:

  • might have thrown up a little
  • questioning why/if they’re using MongoDB
  • thinking perhaps last week you made the worse career decision of your life
  • might want to re-evaluate the technology decisions of co-workers who advocated it

GOOD! These feelings are totally valid and in-general will do everything to help MongoDB and its community. Really, I mean that. Even if everything in that pastebin is a lie (most things are half-truths, at best) and the negative reactions because of it are “baseless”, they serve to raise questions, they serve to validate, they serve to do what we should do with every technology decision we make: be skeptical and prove.

A lot of the MongoDB adoption I’ve seen amounts to “well I think it’s cool because I can use JavaScript” from the UI crowd or “I suck at writing SQL queries” from the app crowd or “I hate tuning my systems” from the systems crowd… or, worse “I’ve heard it’s awesome”.

I know in the Age of iEverything, it’s a knee-jerk reaction to just buy/do things because some salesman in a turtle-neck tells you he just changed everything, but if you ever take anything home from my writing:

Don’t Ever Make An Infrastructure Technology Decision Because You Heard It Was Awesome

But, in case you were curious, MongoDB is awesome, and you should use it: just don’t take my word for it. Be skeptical and prove.

 

Posted in Architecture, Linuxy | Tagged , | Leave a comment

SSL CAs Are An Unnecessary Evil

I’ve talked about this to numerous groups, going back to 1999, but seems I’ve never done so publicly.

Certificate Authorities are completely unnecessary.

“OH MY GOD HOW DO WE MAINTAIN THE WEB OF TRUST?!” you scream. Easy, the same way we do In Real Life, with a little twist.

Posit 1: Your browser starts with a completely empty CA database. Yup. No CAs, no Certs, no CRLs. It has a concept of “levels of trust”, just like PGP/GPG: you explicitly trust certain people/companies/etc. different than others.

Posit 2: CAs only sign for their domain, and all Certs in a domain must be signed by the domain CA.

Posit 3: Domain CAs can also sign other domain CAs that they sufficiently trust. This signature has no functional impact on the system.

Scenario 1: High-Trust, offline relationship. You are a Bank of M@ customer. You visit a Bank of M@ branch and open an account, signing up for online banking. You’re given a read-only USB stick that has the Bank of M@ Certificate Authority Public Key on it. You go home, plug it in, and your browser and other software now knows that all ‘bom@.com’ addresses (email addresses, web RLs, etc.) signed by that key are legit.

Scenario 2: Opportunistic Trust, online relationship. A friend sends you a link to a pair of gloves you’ve been looking for. The link is to a site you’ve never been to, Glamazon.com (that I just made up). You add the item to your shopping cart, click the ‘checkout’ button, and the site redirects you to their SSL site to complete the transaction. Your browser has no data for Glamazon.com. How does it get it? Well, how did it get to Glamazon.com to begin with? A DNS request. It makes a similar DNS request asking for the KEY, which is either a URL pointing to the CA Public Key, or the Public Key block itself. In either case, the browser can load the CA for Glamazon.com, implicitly trust other Glamazon.com Certs and keys signed by it, and you have a new pair of gloves on the way.

Scenario 2a: Treason Uncloaked! You receive an e-mail, purportedly from Glamazon.com, but signed with a different CA, or links to a site purporting to be Glamazon.com but is signed with a different CA. Your e-mail client, which has the CA Key for Glamazon.com loaded warns you visibly of this discrepancy. The content of the e-mail, however, says “We needed to change our CA key, please use this one instead”.

  1. The CA Key enclosed was not signed with the previous CA Key for Glamazon.com
  2. The previous CA Key for Glamazon.com isn’t on the published CRL for Glamazon.com
  3. The CA Key enclosed is not the CA Key obtainable via DNS.

The software, and thus the user, can be pretty certain this is bogus. This is not the Glamazon.com you’re looking for.

Scenario 2b: Subversive Treason Uncloaked! Someone hijacks your DNS. Perhaps the government, perhaps malware, perhaps your jealous life partner. They know your penchant for Glamazon.com and decide to forge their own CA key, and make it so that your browser will download that key, but login through a Man-In-The-Middler server they have set up. Nefarious!

  1. Your browser indeed downloads the forged key.
  2. The forged key is not signed by the CA Key you have.
  3. The CA Key you have is not on a CRL for Glamazon.com
  4. Or the CA Key you have is on a forged CRL for Glamazon.com, that is not signed with the CA Key you have.

The software, and thus the user, can be pretty certain this is bogus. This is not the Glamazon.com you’re looking for.

Scenario 2c: Real certificate/key/CA change. Glamazon.com’s CA is going to expire soon, so they generate a new one. They sign it with the old CA Key. They add the fingerprint of the old CA Key to their public CRL. They sign the CRL with both the old and new CA Keys.

  1. New/first-time Glamazon.com customers are oblivious to the switch.
  2. Existing customers download the new CA Key and the CRL.
  3. They authenticate the CRL as it is signed by both the old (that they trust) and new (that they’re skeptical of) CA Keys.
  4. They authenticate the new CA Key as it is signed by the old CA Key.

The software, and thus the user, can be pretty certain this is legitimate. This is the Glamazon.com you’re looking for.

Conclusion: This is a very simple, inexhaustive exercise. More questions remain, but they’re all solvable. There is no reason why every domain can’t be a CA for itself. There is no reason why being a Certificate Authority is a billion-dollar business. There is no reason why we have to put up with an anachronistic monopoly on “trust”. There is no reason why this cannot be free and open, and a domain can generate infinite certificates/keys/etc. for their own domain.

In Verisign We Trust, My Ass.

Posted in Architecture, Opinions, Work | Leave a comment

Redundant Array of Independent Datacenters

I used a phrase last night/this morning that I use to refer to distributed datacenter architecture, and afterwards decided to google it. It seems that while Cisco mentioned it a lot in 2010, I beat them by a few years in describing and naming the concept of a Redundant Array of Independent Datacenters – and other than that, it has received almost no searchable results. Unfortunately, the vast majority of my communications about this were in-person, but a zgrep of my mail archives did return  a few hits, the below being the oldest. I thought that was cool. A more informative post forthcoming as time allows.

To: [REDACT]
Subject: Quick Reply
From: Matthew Keller <[REDACT]>
Date: 06 Dec 2002 12:21:30 -0500

[REDACT],

I got your vmail and think we're on the same page. Basically thinking
about it like storage, but a Redundant Array of Independent 
Datacenters, if you will. Between the 3 or 4 of us, there is no 
reason we can't make the JNOC idea work, and really save all of 
us a lot of time and energy. Let me know when you get back from 
vacation and we'll chat more. 

Thanks again for your call,

-- 
Matthew Keller
Enterprise Systems Analyst
[REDACT]

 

Posted in Architecture, General, Work | Leave a comment

Open Response to Open Letter to the Netatalk Community

I can’t help but notice that you seem to think yourself above some pretty important facts with the coup you appear to be attempting towards Netatalk.

First, and most importantly, while there are currently only a small handful of devs, you’re standing on the shoulders of giants. I haven’t done any source analysis as to how many klocs you have contributed vs not, but I’ll bet its a tiny percentage- especially if you factor in total klocs contributed over the 15(?) years of the project versus your relatively recent involvement.

Secondly, your act of “big companies should pay me or they can’t have Netatalk” shows a gross misunderstanding as to how Open Source works, and the rules the GPL binds you by. Please don’t feed the trolls with claims that there are lots of BSD headers running around source files, as some of your strawmen have been doing in various forums. That’s bunk and well-explained in historical discussions. Every line of Netatalk code is under the GPL unless you want to base “your” version of Netatalk off of versions prior to 1.5, when it was BSD-licensed.

Lastly, I’m disappointed in you. I won’t betray confidence implied in discussions we’ve had, but this isn’t something I would have expected from you. Your assertion that “it’s better to have an actively developed Netatalk, than a non-free, non-open Netatalk” is very incorrect. It is the freedom and openness that got it to the point where you could derive value from locking it up.

I’m sorry your business plan is obviously not working the way you’d like. This isn’t the solution. The solution is to change or replace your business plan, not commit at worst legal license violations, and at best very disrespectful and dishonorable acts with that which is not yours.

Sincerely,
Matthew Keller

Posted in Coding, Linuxy, Opinions | 6 Comments

Apple Touchscreen Patent – Invalid

You may be surprised to learn that Apple was granted, essentially, a patent on touchscreen computing. (and yes, I know it’s N-finger, so technically it’s multi-touch.. regardless) There is TONS of prior art to invalidate this, and I have in my museum-of-geeky-products-i-bought-that-never-caught-on, several pre-2007 devices that can prove that.

Really. Shush.

Posted in Uncategorized | Leave a comment