Home

Short comparison of a neat Count-Distinct algorithm in JavaScript and Haskell

22 February 2023

Recently, I read a fun article called "A Charming Algorithm for Count-Distinct" from Justin Jaffray. If you haven't read it already, I suggest you do as it's much more entertaining than this one 🙂. It explains a concise algorithm to estimate the number of distinct elements in a set in bounded memory. Below is the JavaScript implementation from the article, which is very straightforward once you know the algorithm (see the linked article for a neat explanation):

function countDistinct(list, thresh) {
  let p = 1;
  let seen = new Set();
  for (let value of list) {
    seen.delete(value);
    if (Math.random() < p) {
      seen.add(value);
    }
    if (seen.size === thresh) {
      // Objects now need to win an extra coin flip to be included
      // in the set. Every element in `seen` already won n-1 coin
      // flips, so they now have to win one more.
      seen = new Set([...seen].filter(() => Math.random() < 0.5));
      p *= 1 / 2;
    }
  }
  return seen.size / p;
}

I indeed found the algorithm very charming. As a Haskeller, seeing a neat algorithm forces me to try to write it in Haskell to see how it'd look like. My initial instinct was that it would look cleaner and more concise. However, once I wrote it, it did not feel "unequivocally better". After staring at it for some time, I found parts that I appreciated in the Haskell implementation, and other parts that felt like noise.

Before I ramble on further, here is my Haskell implementation:

import Control.Monad.Random
import Data.Set (Set)
import qualified Data.Set as Set

countDistinct :: (MonadRandom m, Ord a) => Int -> [a] -> m Double
countDistinct threshold xs = do
  (seen, p) <- foldM go (Set.empty, 1) xs
  return $ fromIntegral (Set.size seen) / p
 where
  go (seen, p) value = do
    -- Add the element if lucky, ignoring the existing value
    seen' <- Set.alterF (\_ -> isLucky p) value seen

    -- If the set reached the threshold, make it twice as hard
    if Set.size seen' < threshold
      then return (seen', p)
      else (,) <$> thanos seen' <*> return (p / 2)

-- Remove half of the elements in a set, randomly
thanos :: MonadRandom m => Set a -> m (Set a)
thanos =
  fmap Set.fromDistinctAscList
    . filterM (\_ -> isLucky 0.5)
    . Set.toAscList

-- Toss a coin with given probability of 'True'
isLucky :: MonadRandom m => Double -> m Bool
isLucky p = (< p) <$> getRandom

Just stare at them for a bit. I'll wait.

. . .

They are so similar, yet so different, right? Even as a seasoned Haskell programmer, I find the Haskell implementation very wordy. There are types, constraints, monads, operators and so on.

This is just a fun exercise, so I try not to go over them in detail, but just share some of my thoughts.

First, some disclaimers:

And now to the specific points.

Use of MonadRandom

Math.random() < p
-- Toss a coin with given probability of 'True'
isLucky :: MonadRandom m => Double -> m Bool
isLucky p = (< p) <$> getRandom

I believe this is the biggest difference between the two implementations. In most languages, no one would bat an eye for an expression like Math.random() < p. However, as Haskell is a pure language, an expression can't just change the state of the world (in this case, state of the random number generator). So, we use MonadRandom to carry around the state of the random number generator, but this forces us to lift our expressions into the monad often, often with weird-looking operators like <$> and <*>.

This use of monads does introduce some verbosity, but it also brings some benefits. Most importantly, the random number generation is now explicit, and we can use it to test our code with a fixed seed to get deterministic output; and use a "real" random seed in production.

Again, surely there must be hundreds of libraries on npm that provides a deterministic random number generator in JavaScript, but I don't think people would reach out to them; and their implementation would either allow setting the seed globally which makes it less composable, or require passing the RNG (random number generator) around which would make the code more verbose.

In the end, I do think the "default" Haskell implementation has a lot of benefits over the "default" JavaScript implementation. However, those benefits do come with a syntactical overhead.

Set interface

seen.delete(value);
if (Math.random() < p) {
  seen.add(value);
}
Set.alterF (\_ -> isLucky p) value seen

alterF is a function that can "modify" the presence of an element in the set while possibly having access to a context (the RNG in this case). Here, the function we pass to alterF ignores its parameter, making it obvious that the existence of the element in the set does not matter.

This seems like a minor point, but I think it's a good example of how the two languages differ in their approach to data structures. While the Set in the JS spec comes with a couple of useful functions, Haskell's Data.Set module comes with a much more powerful interface. I see this pattern in many other data structures, both in the API of a specific data structure, and also the variety of them readily available.

I understand the reason - JavaScript interpreters natively implement the core data structures according to the spec, and anything on top of them that is implemented on the JavaScript side is likely to be a lot slower. I believe that is one of the main reason libraries like immutable.js is not used more often.

There is another detail worth mentioning here is that the JavaScript Set is usually (always?) a hash table. In Haskell-land, we have two major packages that provide a Set data type: containers and unordered-containers, and they provide tree-based and hash-based containers respectively. Both are super widely used, but I usually go with containers because tree-based structures tend to provide an interface better suited for functional programming. In this case, alterF is a good example of that. But it does mean that if I were to go with unordered-containers, my implementation here would look more similar to the JavaScript version.

Loops vs folds

let p = 1;
let seen = new Set();
for (let value of list) {
  ...
}
return seen.size / p;
 (seen, p) <- foldM go (Set.empty, 1) xs
 return $ fromIntegral (Set.size seen) / p
where
  go (seen, p) value = do
    ...

This is what most people think of when they think of functional programming, using higher-order functions insteads of loops.

I guess it's a choice of style, I like that the fold makes the structure of the iteration explicit. However, after working with colleagues more experienced with JavaScript, I can now see that the fold makes the "execution order" of the code less obvious. I think it's a Haskeller habit that we do not usually think of the code as a sequence of instructions, so the execution order being different from the order of the code is less of a problem for us.


In the end, I guess there's not much to say other than "Haskell code is fancier". In this example the fanciness actually has many benefits - whether it is worth it or not depends on the situation.

This is pretty much how much time I'm willing to spend writing about this. If you made it this far, thanks!

Home
© Utku Demir, 2015-2024. Source: github.com/utdemir/utdemir.com.
This work is licensed under a Creative Commons Attribution 4.0 International License.