Jeff Gould

    How to write a rolling hash string search algorithm in JavaScript

    December 17, 2019

    Some time ago I watched an interview on interviewing.io where the task that the interviewer posed to the interviewee was to create a function called grep() that should accept two strings, haystack and needle, and return an array of indexes where needle exists in haystack.

    The naive approach

    The naive approach is pretty simple in javascript, just loop through each letter in the haystack, grab a slice starting at the current letter that is the length of the needle, and check to see if that slice is equal to the needle.

    Here's what that might look like:

    // Naive approach
    function grepNaive(haystack, needle) {
      const matches = [];
      for(let i = 0; i < haystack.length - (needle.length-1); i++){
        const substr = haystack.slice(i, i + needle.length);
        if( substr === needle ) matches.push(i);
      } 
      return matches; 
    }

    And here it is in action:

    Haystack: Needle:

    React is the entry point to the React library. If you load React from a script tag, these top-level APIs are available on the React global. If you use ES6 with npm, you can write import React from 'react'. If you use ES5 with npm, you can write var React = require('react'). React

    This works great but it's not super efficient. The loop only runs once for each possible match, but the string comparison is arguably a heavy operation that further requires that each letter in the slice be compared to each letter in the needle. JavaScript hides this bit of complexity from us by allowing us to check string equality, but under the hood each individual letter still needs to be evaluated and since we're making this comparison once for each letter in the haystack, most letters actually get compared multiple times.

    A more efficient solution: rolling-hash search

    If you were to get asked this question in an interview, this is the point where the interviewer would say something a long the lines of "this is great, but is there a more efficient solution?" Any maybe you've memorized Rabin-Karp because you're a 10x-rockstar-ninja-guru. Awesome. If you're like me and you have no idea what fish have to do with pattern matching algorithms - then you've come to the right place!

    First thing's first: What even is a rolling hash? Well, basically it's the idea that you can find hash of a substring within a string and then shift the first letter out of that hash and push another letter onto the end of the hash to find the hash of the substring next to the original without recalculating the entire hash. To do that we need to come up with a way to hash strings that will allow us to do this shift and push operation while still producing hashes that don't produce too many collisions (eg hash('ba') ≠ hash('ab')).

    To do that, let's start simple and hash just a single character with a hash1() function. We can do this by finding the character code of a character and just multiplying that by a large number - multiplying by a large number will help us to avoid hash collisions when we're finding hashes of whole strings.

    function hash1(str) {
      return str.charCodeAt(0) * 256;
    }

    We could use this function to create hashes of stings by adding the hashes together, but right now this will still be very prone to collision because hash1('a') + hash1('b') === hash1('b') + hash1('a') - we need to add a way to create hashes that respect the order of the characters in a string, and we can do that by first adding a vector argument to our hash1() function.

    function hash1(char, vector = 0) {
      return char.charCodeAt(0) * 256 ** vector;
    }

    All that's changed here is that we're accepting that vector argument and using it as an exponent on our hash - so now, since vector defaults to 0, hash1('d') will actually return the same value as 'd'.charCodeAt(0) which is 100 but we can specify a vector to signify different positions of the letter, so the hash1 of the letter d within the string mad would be hash1('d', 2) which would be 6553600.

    Now we just need to create a string hashing function that will use our hash1() function to create hashes of whole strings. We can do that by just passing each letter to hash1(), using its position in the string as the vector and adding all of those values up. I'll do that by splitting the string up into an array of characters and then utilize JavaScript's Array.map and Array.reduce methods to find the hash value and then get the sum of all hash values in the string:

    function hash(str) {
      return str
        .split('') // transform the string into an array of characters
        .map(hash1) // hash each character, using its index in the array as vector
        .reduce(add, 0) // add all of the values up (assume add function defined as `(a,b) => a+b`)
    }

    What's nice about this hash method is that while there is a low chance of hash collisions, the math is pretty straightforward which allows us to modify the hash without having to jump through too many hoops. For example, if you find the hash for a string like foo (7303014) and want to find the hash for just oo, all you have to do is subtract the hash for f which is just its character code 7303014 - 102 = 7302912 and then shift the exponents on the rest of the hashed characters down by one which can be done by dividing the the remaining hash number by 256 7302912/256 = 28527.

    Why divide by 256? Because (a*256^n)/256 is equal to (a*256^(n-1)). So assuming the character code for "f" is 102 and "o" is 111, we can find the original hash using the following equation hash(foo) = (102*256^0) + (111*256^1) + (111*256^2) so when we subtract 102 that equation looks like hash(_oo) = (111*256^1) + (111*256^2) and then dividing by 256 gets us hash(oo) = (111*256^0)+(111*256^1). Let's create a function to express that:

    function shiftHash(originalHash, firstChar) {
      return (originalHash - hash1(firstChar)) / 256;
    }

    Next, we can create a function to pop a character on to the end of the existing hash. To add a character onto an existing hash, we just need to add the hash1() of the character that we'd like to add, with the vector appropriate for the position that we'll add it. So if we have the hash for oo (28527) and we'd like to find the hash for oof, we just need to find hash1('f', 2) //6684672 and add that to the existing hash 28527 + 6684672 = 6713199. Here's how that would work as a function

    function pushHash(originalHash, strLen, nextChar) {
      return originalHash + hash1(nextChar, strLen-1); // Subtract 1 from strLen because indexes start at 0
    }

    And from that we can compose a rollHash function that will remove the first character from an existing hash and add a new character onto the end:

    function rollHash(originalHash, firstChar, nextChar, strLen) {
      const shiftedHash = shiftHash(originalHash, firstChar);
      return pushHash(shiftedHash, strLen, nextChar);
    }

    Now we can find the hash of a string hash('foo') //7303014 and then easily find the hash of oof by using rollHash(7303014, 'f', 'f', 3); //6713199.

    To put it all together, we can now build our grep function by looping through our haystack to find occurrences of our needle

    // Rolling Hash Approach
    function grepHash(haystack, needle) {
      const matches = []; // This will store all indexes where a match occurs
      const needleHash = hash(needle);
      const needleLength = needle.length;
      let currentHash = hash(haystack.substr(0, needleLength)); 
      for (let i = 0; i < haystack.length - (needleLength - 1); i++) {
        if (needleHash === currentHash) matches.push(i); // If hashes match, push the current index into matches
        currentHash = rollHash(
          currentHash,
          haystack[i], // first character in current search
          haystack.substr(i + needleLength, 1), // next letter in haystack
          needleLength
        );
      }
    
      return matches;
    }

    The code here is definitely more complicated, but this is often the case with optimized solutions - we're doing more work up front to prevent the computer from having to do it at runtime. Here it is in a CodeSandbox for you to play around with.

    Conclusion

    To be completely honest, modern hardware and languages would probably still support the naive solution just fine, even at scale. So this may only be something that you use in the context of interviewing or if you enjoy trying to level up on platforms like HackerRank. As a self-taught developer, I've come to enjoy learning about algorithms because even if I don't end up using them in my day to day, it helps me to expand the way that I think about code and problem solving.

    Categories:
    Next 5 Great Git CLI Shortcuts