Monday, September 24, 2007

Bloom filter stuff

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. The Bloom filter was conceived by Burton H. Bloom in 1970.

"Is X in Y?"-style membership tests can be quite resource intensive, and code full of such lookups can run for days when executed against large datasets. A Bloom filter can be a big relief here and save you a lot of resources and time.

In Perl, for example, this is a lookup hash, a handy idiom for doing existence tests:

my %lookup;
foreach my $e ( @things ) { $lookup{$e}++ }   # build the lookup hash from @things

sub check {
    my ( $key ) = @_;
    print "Found $key!" if exists $lookup{ $key };
}

When run against a small set of data, and in a situation where time is not a big issue, this works fine. However, if one or both of those are against you, you might want to use a Bloom filter. In Perl this would look something like this:

use Bloom::Filter;

# $SONG_COUNT holds the expected number of titles to be stored
my $filter = Bloom::Filter->new( error_rate => 0.01, capacity => $SONG_COUNT );

open my $fh, '<', "enormous_list_of_titles.txt" or die "Failed to open: $!";
while (<$fh>) {
    chomp;
    $filter->add( $_ );
}
close $fh;

sub lookup_song {
    my ( $title ) = @_;
    return unless $filter->check( $title );   # a negative answer is definitive
    return expensive_db_query( $title );      # a positive answer still needs confirming
}


An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps a key value to one of the m array positions.
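
As a rough sketch in Perl, the empty filter is nothing more than a zero-filled string used as a bit vector (the size $m below is just an assumed value):

my $m    = 8 * 1024;              # number of bits in the filter (assumed value)
my $bits = "\0" x ( $m / 8 );     # m bits, all set to 0
# vec( $bits, $i, 1 ) reads or sets bit $i of the array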

For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k-1) to a hash function that takes an initial value; or add (or append) these values to the key.
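
Here is a hedged sketch of the slicing approach, using Digest::SHA's 256-bit output; the helper name positions_from_one_hash and its parameters are my own invention, not part of any module:

use Digest::SHA qw(sha256);

# Derive $k array positions from a single wide hash by slicing its
# 256-bit digest into 32-bit fields and reducing each one modulo $m.
# Works for $k up to 8 with sha256; use a wider digest for larger $k.
sub positions_from_one_hash {
    my ( $key, $k, $m ) = @_;
    my @words = unpack "N*", sha256( $key );      # eight 32-bit slices
    return map { $words[$_] % $m } 0 .. $k - 1;   # first $k slices as indices
}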

For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of using enhanced double hashing or triple hashing, variants of double hashing, to derive the k indices using simple arithmetic on two or three indices computed with independent hash functions.
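
As a similarly hedged sketch of the double-hashing idea: all k indices are derived from just two independent hash values h1 and h2 as (h1 + i*h2) mod m, so only two real hash computations are needed per key. This is plain double hashing, not the enhanced variants from the papers:

# Given two independent hash values $h1 and $h2 for a key, derive
# $k bit positions with simple arithmetic (Kirsch & Mitzenmacher style).
sub positions_by_double_hashing {
    my ( $h1, $h2, $k, $m ) = @_;
    return map { ( $h1 + $_ * $h2 ) % $m } 0 .. $k - 1;
}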

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements.
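
Putting the pieces together, a toy add/check pair might look like the sketch below. It reuses the $bits/$m bit array and the positions_from_one_hash() helper from the earlier sketches, and $k is again an assumed value; a real implementation would size m and k from the expected number of elements and the desired false positive rate.

my $k = 4;   # number of hash functions (assumed value)

sub bloom_add {
    my ( $key ) = @_;
    # set the bit at each of the k positions to 1
    vec( $bits, $_, 1 ) = 1 for positions_from_one_hash( $key, $k, $m );
}

sub bloom_check {
    my ( $key ) = @_;
    for my $pos ( positions_from_one_hash( $key, $k, $m ) ) {
        return 0 unless vec( $bits, $pos, 1 );   # any 0 bit means "definitely not in the set"
    }
    return 1;                                    # all bits set: probably in the set
}

Calling bloom_add() for every key and then bloom_check() before the expensive lookup gives the same behaviour as the Bloom::Filter example above.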

Unfortunately, removing an element from this simple Bloom filter is impossible. The element maps to k bits, and although setting any one of these k bits to zero suffices to remove it, this has the side effect of removing any other elements that map onto that bit, and we have no way of determining whether any such elements have been added. The result is a possibility of false negatives, which are not allowed.

Removal of an element from a Bloom filter can be simulated by having a second Bloom filter that contains items that have been removed. However, false positives in the second filter become false negatives in the composite filter, which are not permitted. This approach also limits the semantics of removal since adding a previously removed item is not possible.
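
A hedged sketch of that simulation, again using Bloom::Filter (the capacities here are assumed values): keep a second filter of removed keys and treat an item as present only when the main filter says yes and the removal filter says no.

use Bloom::Filter;

my $main    = Bloom::Filter->new( error_rate => 0.01, capacity => 100_000 );
my $removed = Bloom::Filter->new( error_rate => 0.01, capacity => 10_000 );

sub remove_item { $removed->add( $_[0] ) }

sub item_present {
    my ( $key ) = @_;
    # a false positive in $removed becomes a false negative here,
    # which is exactly the limitation described above
    return $main->check( $key ) && !$removed->check( $key );
}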

Often, though, all the keys are still available but are expensive to enumerate (for example, requiring many disk reads). In that case, when the false positive rate gets too high, the filter can simply be regenerated from scratch; this should be a relatively rare event.


Bloom filter implementations in:
Perl
C/C++
Ruby
Java
