Skip to content

n3slami/Sublime

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

94 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sublime

Sublime is the first framework for generalizing a frequency estimation sketch (such as the Count-Min Sketch, Count Sketch, or Misra-Gries) to improve their accuracy and memory footprint by adapting to the workload's skew and the stream's length. To save memory under skew, Sublime uses short counters upfront and elongates them as they overflow with extensions stored within the same cache line. It leverages specialized bit manipulation routines to quickly access a counter’s extension. To maintain accurate estimates in the face of stream growth, Sublime expands its number of counters. In doing so, it sublinearly bounds both the error and the memory footprint and establishes a Pareto frontier of tradeoffs to choose from.

Our paper describes Sublime in detail and provides an in-depth theoretical analysis for its memory footprint and accuracy guarantees. It also proves a memory footprint lower bound for sketches that adapt to the stream's length, showing that Sublime gives rise to sketches that use the optimal amount of space for their accuracy guarantee.

Getting Started

To use Sublime in developing your own project, simply add the files in the include directory and include the header file corresponding to the desired version of Sublime.

The file SublimeCMS.hpp implements Sublime when applied to the Count-Min Sketch, complete with VALE's auto-tuning features. The file SublimeCMSNoTuning.hpp.in fixes VALE's tuning and allows the compiler to apply intrusive optimizations to significantly improve insertion, deletion, and query performance. Moreover, SublimeCMSNoTuningMorris.hpp.in probabilistically increments each counter during insertions to enable a fair comparison with Tailored Sketch.

Similarly, the file SublimeCS.hpp implements Sublime applied to the Count Sketch, and the file SublimeCSNoTuning.hpp.in fixes VALE's tuning in this case to enable higher performance.

All versions of Sublime provide the following APIs:

  • Insert: Inserts the provided key into the sketch.
  • Delete: Deletes the provided key from the sketch.
  • Query: Returns an estimate of the provided key's frequency.
  • Size: Returns the size of the sketch.
  • FlushPrefetchQueue: Flushes all updates whose requests for chunks are in-flight. This ensures queries take into account all updates previously made to the filter. The file example.cpp in the examples directory illustrates the usage of these APIs. All other operations such as auto-tuning VALE are transparently handled by Sublime and therefore do not have APIs.

Quick Reprodicibility

This repository contains an evaluate.sh script. This script downloads the relevant datasets, compiles the different versions of Sublime, sets the benchmarks up, runs them, and produces the figures and tables. To ease the reproducibility process, we provide a Dockerfile that spins up an Ubuntu Docker container with the requirements installed and runs evaluate.sh to gather the results. One may choose to run the experiments in Docker, or to run the experiments natively. Both methods take ~1 hour to complete and we provide instructions on how to do either. We advise using a machine with at least 16GB of RAM.

Note: We exclude the full CAIDA 2018 dataset used in our evaluation and only use a 10% slide of the dataset. This exclusion is due to the CAIDA dataset requiring approval for use in research projects. We repeat this slice of the dataset 10 times to scale the create a workload of the same size as when using the full CAIDA dataset. Nevertheless, it is still possible to reproduce the exact results in our paper by adding the remaining 90% of the dataset, i.e., the files 1.dat to 10.dat, to the paper_results/real_datasets directory before running the evalute.sh script in either the dockerized version or the native version of our benchmarks.

Executing with Docker

Running the following commands clones Sublime and evaluates it:

git clone https://github.com/---/Sublime.git
cd Sublime
docker build -t sublime_eval .
mkdir -p ../paper_results/figures/ && docker run -v ../paper_results/figures:/usr/local/paper_results/figures -it sublime_eval

The above creates a directory next to the cloned repository named paper_results and puts the generated figures and tables in the paper_results/figures subdirectory within. The figures and tables will will be numbered to match the paper.

Executing Natively

To use the evaluate.sh script, ensure that the following dependencies are satisfied:

  • Make >=3.30.8 (Used to compile the baselines)
  • python3-env >=3.13.5 (Used to plot the experimental results only)
  • Python 3 >=3.7 (Used to plot the experimental results only)
  • Latex full (Used to plot the experimental results only)

Then, running

git clone https://github.com/---/Sublime.git
cd Sublime
bash evaluate.sh

reproduces the results in the paper, creating a paper_results directory next to the cloned repository in the process. It places the generated figures and tables in the paper_results/figures subdirectory, numbering them to match the paper.

Building

The following lists the dependencies required for building Sublime and its evaluation suite:

  • cmake 3.5 (or later)
  • gcc-11 (or later)
  • Git 2.13 (or later)
  • Python 3.8 (or later)
  • Bash 4.4 (or later)
    • realpath 8.28 (or later)
    • wget 1.19.4 (or later)

To build the different versions of Sublime along with their examples, unit tests, and benchmarks, navigate to the project's root directory and execute the commands

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j8

You can control which parts are configured and compiled with the following CMake options:

  • BUILD_TESTS: Builds the tests.
  • BUILD_EXAMPLES: Builds the examples.
  • BUILD_BENCHMARKS: Builds the benchmark suite while cloning the repositories of the baselines. All these options are set by default. To turn one off, for example, BUILD_BENCHMARKS, substitute the second line in the build script above with:
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCHMARKS=0

In addition to the above options, you can fix the tuning of Sublime's (i.e., VALE's) parameters by specifying the following CMake options before compiling the code:

  • FIXED_TUNING_C={c}: Fixes the number of counters per chunk to c.
  • FIXED_TUNING_S={s}: Fixes the stub length to s. These parameters impact the files SublimeCMSNoTuning.hpp.in, SublimeCMSNoTuningMorris.hpp.in, and SublimeCSNoTuning.hpp.in. They do not change the tuning of the core versions of Sublime, i.e., those in SublimeCMS.hpp and SublimeCS.hpp.

For SublimeCMSNoTuningMorris.hpp.in, you can also set the increment probability of the counters in Sublime by supplying the following CMake option:

  • LOG_REC_INC_PROB={p}: Set the increment probability to $2^{-p}$.

Running Unit Tests

After building Sublime, run the following command from the project's root directory to execute the unit tests:

ctest -VV

Citation

Please use the following BibTeX entry to cite our work in your own publications:

@article{10.1145/3802116,
author = {Eslami, Navid and Bercea, Ioana and Pagh, Rasmus and Dayan, Niv},
title = {Sublime: Sublinear Error \& Space for Unbounded Skewed Streams},
year = {2026},
issue_date = {June 2026},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {4},
number = {3},
url = {https://doi.org/10.1145/3802116},
doi = {10.1145/3802116},
abstract = {Modern stream processing systems often need to track the frequency of distinct keys in a data stream in real-time. Since maintaining exact counts can require a prohibitive amount of memory, many applications rely on compact, probabilistic data structures known as frequency estimation sketches to approximate them. However, mainstream frequency estimation sketches fall short in two critical aspects. First, they are memory-inefficient under skewed workloads because they use uniformly-sized counters to count the keys, thus wasting memory on storing the leading zeros of many small counts. Second, their estimation error deteriorates at least linearly with the length of the stream --- which may grow indefinitely --- because they rely on a fixed number of counters. We present Sublime, a framework that generalizes frequency estimation sketches to address these challenges. To reduce memory footprint under skew, Sublime begins with short counters and dynamically elongates them as they overflow, storing their extensions within the same cache line. It employs efficient bit manipulation routines to quickly locate and access a counter's extensions. To maintain accuracy as the stream grows, Sublime also expands its number of counters at a configurable rate, exposing a new spectrum of accuracy-memory tradeoffs that applications can tune to their needs. We apply Sublime to both Count-Min Sketch and Count Sketch. Through theoretical analysis and empirical evaluation, we show that Sublime significantly improves accuracy and memory over the state of the art while maintaining competitive or superior performance.},
journal = {Proc. ACM Manag. Data},
month = may,
articleno = {239},
numpages = {29},
keywords = {frequency estimation sketch, data growth, scalability}
}

About

The first framework for generalizing a frequency estimation sketch to improve accuracy and memory consumption by adapting to the stream's skew and its length.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors