PERRY: A Flexible and Scalable Data Preprocessing System for "ML for Networks" Pipelines
eScholarship
Open Access Publications from the University of California

UC Santa Barbara Electronic Theses and Dissertations


Abstract

The integration of machine learning (ML) techniques into networking research has catalyzed significant advances in areas such as traffic classification, intrusion detection, and quality of experience (QoE) estimation. This progress has been fueled by remarkable developments in deep learning, which have produced state-of-the-art models across domains built on powerful architectures such as deep neural networks, encoders, transformers, and language models.

Developing these complex ML-based models relies heavily on the data pre-processing module to extract features from raw network data (e.g., packet traces) and to label the resulting data points. Different model specifications require extracting disparate sets of features. Currently, there is a tight coupling between the data pre-processing and model training modules in the ML pipelines used to develop ML artifacts for networking. Specifically, the pre-processing modules are only suited to extracting a limited set of features (e.g., time series features) that fit specific downstream model specifications (e.g., LSTMs). Consequently, researchers exploring new learning models for different networking problems end up spending a significant amount of their time developing custom data pre-processing modules, impeding the pace of innovation.

This thesis focuses on decoupling data pre-processing from model training in ML pipelines for networking. Specifically, we present the design and implementation of PERRY, a flexible data pre-processing module for networking that extracts a wide range of high-quality features at scale, which disparate model specifications can then consume for training. PERRY offers an intuitive user interface that allows developers to express their data pre-processing intents. More concretely, PERRY supports three distinct classes of features: packet content, time series, and aggregate statistics. For each class, it lets the user specify different parameters. For instance, the user can express which set of fields (e.g., timestamp, number of bytes, etc.) to use for time series features and at what granularity (e.g., per packet, burst, or flow). Similarly, it lets the user select which set of aggregate features to extract and at what granularity.
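To make the notion of a pre-processing intent concrete, the sketch below shows one way such an interface could be expressed in Python. All class, field, and statistic names here are illustrative assumptions for exposition, not PERRY's actual API; only the three feature classes and the granularity options come from the description above.

```python
# Hypothetical sketch of a PERRY-style pre-processing intent.
# Names (TimeSeriesSpec, AggregateSpec, PreprocessingIntent, the
# statistic strings) are assumptions, not PERRY's real interface.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimeSeriesSpec:
    fields: List[str]            # packet fields to sample, e.g., "timestamp"
    granularity: str = "packet"  # "packet", "burst", or "flow"

@dataclass
class AggregateSpec:
    statistics: List[str]        # e.g., "mean_pkt_size", "pkt_count"
    granularity: str = "flow"

@dataclass
class PreprocessingIntent:
    packet_content: bool = False  # whether to keep raw packet bytes
    time_series: List[TimeSeriesSpec] = field(default_factory=list)
    aggregates: List[AggregateSpec] = field(default_factory=list)

# Example intent: per-burst time series over two fields,
# plus flow-level aggregate statistics.
intent = PreprocessingIntent(
    time_series=[TimeSeriesSpec(fields=["timestamp", "num_bytes"],
                                granularity="burst")],
    aggregates=[AggregateSpec(statistics=["mean_pkt_size", "pkt_count"])],
)
```

A declarative specification like this is what allows the same pre-processing backend to serve very different downstream models: an LSTM might consume the time-series output, while a gradient-boosted tree consumes the aggregates.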

To scale pre-processing tasks, PERRY leverages state-of-the-art data analytics and storage tools, making the best use of limited computing and storage resources. Specifically, it decomposes the pre-processing task at flow-level granularity. This decomposition achieves the horizontal scalability offered by existing tools without compromising the semantic integrity of the extracted features. Further, to minimize wasteful data processing, PERRY offers a hybrid schema that strikes a balance between expressiveness and scale: the schema exposes only a subset of popular features to the user while retaining pointers to the raw data. This approach ensures that only a subset of features is extracted for all network traffic, while more complex features are dynamically extracted on demand from a subset of the traffic. By decoupling data pre-processing and model training in ML pipelines for networking, PERRY lowers the threshold for developing new ML models in networking. PERRY represents a step forward in simplifying and enhancing data processing in networking research and opens new possibilities for future innovation in the field.
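The key property behind flow-level decomposition is that packets sharing a flow key can be processed independently of all other flows, so flows can be distributed across workers without breaking per-flow feature semantics. The sketch below illustrates this idea with a plain-Python partitioning step; the packet representation and field names are assumptions for illustration, not PERRY's internal data model.

```python
# Illustrative sketch of flow-level decomposition (the dict-based packet
# representation is an assumption): group packets by their 5-tuple flow
# key so each flow can be pre-processed independently on any worker.
from collections import defaultdict

def flow_key(pkt):
    # The 5-tuple uniquely identifies the flow a packet belongs to.
    return (pkt["src_ip"], pkt["dst_ip"],
            pkt["src_port"], pkt["dst_port"], pkt["proto"])

def partition_by_flow(packets):
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return flows

packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 1234,
     "dst_port": 443, "proto": "tcp", "num_bytes": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 1234,
     "dst_port": 443, "proto": "tcp", "num_bytes": 40},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 5555,
     "dst_port": 443, "proto": "tcp", "num_bytes": 900},
]

flows = partition_by_flow(packets)
# Once partitioned, per-flow features (here, total bytes per flow) can be
# computed in parallel, one flow per task.
byte_counts = {k: sum(p["num_bytes"] for p in v) for k, v in flows.items()}
```

Because each partition is self-contained, the same grouping maps directly onto the partition-by-key primitives of existing analytics engines, which is what gives the horizontal scalability described above.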
