A (Python) script to approximate the number of distinct values in a stream of elements using the (simple) Chakraborty/Vinodchandran/Meel algorithm (https://arxiv.org/pdf/2301.10191#section.2).
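
As a rough illustration of the idea, here is a minimal sketch of a CVM-style estimator. It is not the code of this script: the function name `cvm_estimate` and the parameters `max_size` and `rng` are hypothetical names chosen for this example. It also deviates from the paper in one respect: where the paper's algorithm fails if a thinning round removes nothing, this sketch simply repeats the round.

```python
import random


def cvm_estimate(stream, max_size=1000, rng=None):
    """Estimate the number of distinct items in an iterable.

    Hypothetical sketch of a CVM-style estimator; `max_size` caps
    how many items are kept in memory at any time.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    rng = rng or random.Random()
    p = 1.0          # current sampling probability
    sample = set()   # items currently retained
    for item in stream:
        # Forget any earlier decision about this item, then
        # re-admit it with the current probability p.
        sample.discard(item)
        if rng.random() < p:
            sample.add(item)
        # When the sample is full, drop each item with probability
        # 1/2 and halve p. Assumption: repeat until there is room,
        # instead of failing as the paper's version does.
        while len(sample) >= max_size:
            sample = {x for x in sample if rng.random() < 0.5}
            p /= 2
    # Each retained item stands for roughly 1/p distinct items.
    return len(sample) / p
```

As long as the number of distinct items stays below `max_size`, `p` remains 1 and the result is exact; beyond that, the result is an estimate whose accuracy improves with a larger `max_size`. The set never holds more than `max_size` items, which is why memory use has an upper bound regardless of input size.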

tl;dr:

Compared to sort/uniq:

– sort/uniq always uses less memory (about 30-50% less).

– sort/uniq is about 5 times slower.

Compared to 'the awk construct':

– awk uses about the same amount of time (0.5x-2x).

– awk uses much more memory for large files. Its memory use grows roughly linearly with the file size, while ApproxiCount (aprxc) has an upper bound. For typical multi-GiB files this can mean a difference of 20x-150x, e.g. 5GiB (awk) vs. 40MiB (aprxc).
