I haven't messed around with the count-min augmentation for HLLs, but I've had a lot of success with plain HLLs in practice.
I've experimented with using HLL sketches for set operations in genome comparisons. The great thing about these operations is that their cost scales with the size of the sketch, not the size of the dataset being sketched. Add SIMD acceleration, and you have an extremely fast, compact, accurate way to perform approximate set operations.
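To make the "cost scales with the sketch" point concrete, here's a minimal sketch of the standard HLL union, assuming registers are stored as plain Python lists (a simplification of any real implementation's layout). Union is just an element-wise max over the m registers, so it's O(m) regardless of how many items were streamed into either sketch:

```python
def hll_union(regs_a, regs_b):
    """Union of two HLL sketches (element-wise max of registers).

    Both sketches must use the same register count m; the result is
    exactly the sketch you would have gotten by streaming the union
    of the two underlying datasets.
    """
    assert len(regs_a) == len(regs_b), "sketches must have the same register count"
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```

Because max is idempotent and commutative, merging is lossless: folding any number of sketches together yields the same registers as sketching the combined stream directly.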
Unfortunately, the naive intersection operation (a sketch built from the element-wise 'min' of the registers of both sketches) performs poorly for cardinality estimation.
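For contrast with the union above, here's what that naive operation looks like under the same list-of-registers simplification. The resulting register array is generally *not* the sketch you would get by sketching the true intersection, which is why estimates derived from it are biased:

```python
def hll_min_intersection(regs_a, regs_b):
    """Naive 'intersection' of two HLL sketches (element-wise min).

    Note: unlike the max-based union, this does NOT equal the sketch
    of the true set intersection -- an element present in both sets may
    have its register dominated by different elements in each sketch --
    so cardinality estimates from this array are unreliable.
    """
    assert len(regs_a) == len(regs_b), "sketches must have the same register count"
    return [min(a, b) for a, b in zip(regs_a, regs_b)]
```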
The recent Ertl paper and associated code put a lot of effort into more accurate estimation and set operations. In my experiments, I've found that his modified estimation formula is always more accurate than the standard method and remains accurate to much higher cardinalities for a given sketch size.
My SIMD-accelerated, threadsafe HyperLogLog implementation in C++ (with python bindings) using this improved estimation method is available at . Using this structure, I was able to perform over a billion full-genome Jaccard index calculations overnight using `bonsai dist` from .
This is great! Probabilistic data structures are amazing.
I implemented straight-up HLL++ a while ago but had to do intersection with the inclusion–exclusion principle. It would be cool to try to fold this in somehow! (https://github.com/mynameisfiber/gohll).
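For readers unfamiliar with the inclusion–exclusion approach mentioned here: you estimate the intersection from three cardinality estimates rather than from a merged sketch. A hedged sketch (the function names are illustrative, not from gohll or any library):

```python
def intersection_estimate(card_a, card_b, card_union):
    """Inclusion-exclusion: |A ∩ B| ~= |A| + |B| - |A ∪ B|.

    Each input is itself an HLL estimate, so the errors of three
    estimates compound; this gets unreliable when the true
    intersection is small relative to the sets.
    """
    return card_a + card_b - card_union

def jaccard_estimate(card_a, card_b, card_union):
    """Jaccard index J(A, B) = |A ∩ B| / |A ∪ B|, clamped at zero
    since the inclusion-exclusion estimate can go slightly negative."""
    inter = max(intersection_estimate(card_a, card_b, card_union), 0.0)
    return inter / card_union
```

The union estimate comes "for free" from merging the two sketches, so a Jaccard calculation needs only one merge plus three cardinality estimations.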
Forgive me if this is a dumb question, but is this library a way of merging multiple HyperLogLogs together while still getting a reasonable cardinality estimate?
What sort of loss do you see as you merge more HyperLogLogs? As in, if you had Set 1, 2, 3, 4, 5, ..., n?
What is this supposed to do?
Useful library. I hope someone out there is writing one for Java as well.