A (Python) script to approximate the number of distinct values in a stream of elements using the (simple) Chakraborty/Vinodchandran/Meel algorithm (https://arxiv.org/pdf/2301.10191#section.2).
tldr:
Compared to sort/uniq:
– sort/uniq always uses less memory (about 30-50%).
– sort/uniq is about 5 times slower.
Compared to 'the awk construct':
– awk uses about the same amount of time (0.5x-2x).
– awk uses much more memory for large files: its memory use grows roughly linearly with the file size, while ApproxiCount has an upper bound. For typical multi-GiB files this can mean a factor of 20x-150x, e.g. 5GiB (awk) vs. 40MiB (aprxc).
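
For readers curious where the memory upper bound comes from, here is a minimal sketch of the simple algorithm from the linked paper. It is an illustration, not the code of this script: the function name `cvm_estimate` and the `thresh` and `rng` parameters are made up for this example, and `thresh` is passed in directly, whereas the paper derives it from the desired accuracy, failure probability and stream length.

```python
import random


def cvm_estimate(stream, thresh, rng=None):
    """Estimate the number of distinct items in `stream`.

    Hypothetical helper for illustration. `thresh` caps how many items
    are held in memory at once; `rng` allows reproducible runs.
    """
    rng = rng or random.Random()
    p = 1.0        # current sampling probability
    kept = set()   # sampled items, always fewer than `thresh` after each step

    for item in stream:
        # Forget any earlier decision about this item, then re-sample it
        # with the current probability p.
        kept.discard(item)
        if rng.random() < p:
            kept.add(item)

        if len(kept) >= thresh:
            # Buffer is full: drop each kept item with probability 1/2
            # and halve the sampling probability to match.
            kept = {x for x in kept if rng.random() < 0.5}
            p /= 2
            if len(kept) >= thresh:
                # Nothing was dropped; the paper reports failure here.
                raise RuntimeError("estimation failed, try again")

    # Each distinct item survives with probability p, so scale back up.
    return len(kept) / p


if __name__ == "__main__":
    words = ["a", "b", "a", "c", "b", "a"]
    print(cvm_estimate(words, thresh=100))
```

As long as the stream has fewer distinct values than `thresh`, nothing is ever dropped and the result is exact. Beyond that, the set never grows past `thresh` items, which is why memory stays bounded regardless of input size.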