Caching Functions In Perl

Synopsis

There are occasions, where you write a function that takes some parameters and outputs consistentish data. Example:

sub add {
    my($first,$second)=@_;
    return $first + $second;
}

If you call add(1,1) you get 2: always. Consistent input yields consistent output. Now let’s say you have something much more complex, possibly slow, that will possibly be called over and over again in the same program. Yes, we could create a hash table of value and before calling the function, we check the hash table… But that’s not why we have machines to work for us, and is certainly not why we have Perl.

Lesson 1 – Basics

By using three neat tricks of Perl- The symbol table, anonymous functions, and variable scope – we can do that work behind the scenes, and let our calling code be oblivious to whether they’re getting “live” data, or “cached”.

Symbol Table

The symbol table lets us mess with things. Take the below two functions:

sub a { return "a"; }
sub b { return "b"; }

For whatever reason, we’ve decided that everything in our program that calls function “a”, should actually be calling function “b”. Perhaps we don’t have access to the programs making the calls. Perhaps we’re just awesome. Either way, the solution is simple:

sub a { return "a"; }
sub b { return "b"; }
*a = &b;

So you call &b and get “b”. You call &a and get “b”. Magic. We’ll be back to this.

Anonymous Functions

Perl allows you to do some pretty powerful things. One of those things is the very simple idea of an anonymous function. An anonymous function isn’t called traditionally, but rather crammed into a variable and possibly passed around to other functions or whatever. Example:

my $afunc = sub { return "a" }

In this case $afunc, a variable, now contains a reference to the code declared by the sub, so we can pass it around as any other variable, and invoke it ala:

$return = &$afunc; # $return eq "a"
$return = $afunc->(); # $return eq "a"
$return = $afunc; # $return now contains a copy of the function
                  # that $afunc contains

Variable Scope

To many people, Perl’s scoping logic is maddening. I’m not going to talk about it at length here, but rather show a very simply example that is important for this exercise.

my $var = "x";
{
  print "$var"; # x
  $var = "y";
  print "$var"; # y
}
print "$var"; # y

If you’ve programmed in any programming language that has blocks, this makes total sense, and you’re wondering why I even bring it up.  Let’s change this slightly:

my $var = "x";
$xfunc = sub {
  print "$var"; # x
  $var = "y";
  print "$var"; # y
}
print "$var"; # x

We’ve now turned that block into an anonymous function. The first time we call that function, we get the expected output of “xy”. Thereafter, we get “yy” because the function doesn’t have a copy of $var, rather it is using the parent’s $var. So what happens if we pass $xfunc as a variable to some other function miles away where $var doesn’t exist? It still references the existing $var which will exist in memory until nothing else in memory is pointing to it. Consider $var our cache: This is critical to how I will implement this cache.

Lesson 2 – A Simple Cache

So we know how to mess with the symbol table, compose anonymous functions, and how variable scope can be perverted to our needs. So how do combine those into a function that creates caching functions?

sub cache_it {
  my $func = shift;
  my %cache;

  $afunc = sub {
    my $key = join ',', @_;

    unless(exists $cache{$key}) {
      # We don't have a cached value, make one
      my $val = $func->(@_);

      # ... and cache it
      $cache{$key} = $val;
    }

    # return the cached value
    return $cache{$key};
  };
  return $afunc;
}

So this function takes a function reference as a value, and returns a caching anonymous function.  It’s cache is a hash that is declared in its parent’s scope, and thus will outlive individual invocations of the anonymous function itself.

We key the cache by joining the arguments with a character, in this case a simple comma. We check the cache to see if there is a matching entry, if so we return it. If not we call the real function that we’ve stored anonymously, and store its return values in the cache. In all cases, we return the cached version of the results.

So, when we have a function we want to make a caching version of, it’s as easy as:

sub myfunc {
  my $thing1=shift;
  # Do something complicated with $thing1, and generate
  # consistent result $thing2
  return $thing2;
}

*myfunc = cache_it(\&myfunc); # Magic

That’s it.  Anywhere in your code you call myfunc, you’ll now be calling the caching version. If you’re having problems and need to debug, just comment out that one little line starting with the asterisk, and your code will call the real myfunc “as usual”. It’s that easy.

Lesson 3 – A Better Cache

I don’t use the cache I wrote above. I tend to write long-living infrastructure code that caches or micro-caches result sets that need to expire or that I need to control. Take the above function and adding item expiration and some simple commands is pretty easy, however.

sub cache_it {
  my ($func,$life) = @_;
  my %cache;

  $afunc = sub {
    my $key = join ',', @_;
    my $now=time();

    if($key eq 'CLEARCACHE') {
      %cache=();
      return;
    }

    # Should the cache be cleaned?
    if($key eq 'CLEANCACHE') {
      foreach my $ckey (keys %cache) {
        delete $cache{$ckey} if $cache{$ckey}->{expires} < $now;
      }
      return;
    }

    unless(exists $cache{$key} and $cache{$key}->{expires} >= $now) {
      # We don't have a cached value, make one
      my $val = $func->(@_);

      # ... and cache it
      $cache{$key} = {
           value => $val,
           expires => $now+$life,
      };
    }

    # return the cached value
    return $cache{$key}->{value};
  };
  return $afunc;
}

And when we call it:

*myfunc = cache_it(\&myfunc,300); # Magic

Where the second parameter is the number of seconds we want a cache item to stay valid. The anonymous function our new cache_it function returns now does three things:

  1. If you call the function with the parameter “CLEARCACHE”, it will empty the entire cache
  2. If you call the function with the parameter “CLEANCACHE”, it will purge cache items that have expired.
  3. If you call the function normally, it will save both the results and the current timestamp into the cache entry, and call the real function if the results in the cache are longer than the life you specified when you created it.

Use

I’ve used very similar caching technology in lots of places:

  1. Anagram generator (or any recursive, consistent function, like Fibonacci or gcd)
  2. DNS resolver
  3. Database queries
  4. Surge-handling in web apps (you can “turn on” that caching when you start surging, and “turn off” when request load goes down)
  5. Smart logging functions (inverse cache – don’t bother logging something if you’ve logged it already)
  6. Distributed caching infrastructure

Prove It

Fibonacci numbers are fun:

sub fib {
    my $n = shift;
    return $n if $n < 2;
    return fib($n - 1) + fib($n - 2);
}

So this will make 2 calls to itself for every one call where n>=2, but by remembering what fib(2),fib(3),fib(4) etc are, it won’t have to recompute them

[root@mlap ~]# perl ./x 40
    102334155 : 146
[root@mlap ~]# perl ./x 40 cache
    102334155 : 0

The first is a run of the above fib function on a value of 40. The ‘answer’ is 102334155, and it took 146 seconds. The second is a run of the above fib function that has been cached using a cache method very similar to the “simple cache” in Lesson 1. The ‘answer’ is 102334155, and it took <1 second.

Gotchas

This is not a magic bullet. If your function doesn’t return consistent data given a consistent parameter list, or if your function affects something else that will be missed by returning cached results, this isn’t your solution.

The function above assumes that the function it is capturing returns a single scalar (which could be a reference). If the cached function returns an array or a hash, the cache function will not work as expected. This is very easy to fix (4 lines of code), but I wanted to point that out before you get overly frustrated.

 

This entry was posted in Architecture, Coding, Linuxy and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published.