Evaluation of Unsupervised Feature Selection and Clustering Algorithms for Network Traffic Analysis

Open Access
Author:
Coulter, Steven Scott
Area of Honors:
Computer Science
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
  • George Kesidis, Thesis Supervisor
  • David Jonathan Miller, Thesis Supervisor
  • Lee David Coraor, Honors Advisor
Keywords:
  • hierarchical clustering
  • unsupervised clustering
  • feature selection
  • network traffic analysis
Abstract:
Proper identification of network traffic is an essential component of network administration and must be performed efficiently and accurately. Because of cost, complexity, inaccuracy, and time, it is often infeasible to separate network traffic data into different labels and classifications. Instead, methods can be devised to create meaningful groupings of network traffic with no prior knowledge of the traffic beyond purely statistical data about the communication sessions themselves. Clustering is one such approach: it develops a measure of similarity over several of the statistics, or features, of each communication session. For the same reasons of cost and complexity, and for the sake of accuracy, it is desirable to have a way to extract the statistics that are most meaningful. This paper discusses several means of extracting the most discriminating features from an unlabeled set of data, and measures how well those features perform on a set of labeled data, where accuracy can be measured. The variance that a feature exhibits over a set of data can be exploited to extract meaningful features. These features can be found either with an iterative method, in which each step removes one feature, or by developing a more advanced measure of feature similarity. Most importantly, this paper shows that certain features, such as the ratio of maximum to minimum payload size, are rated as highly discriminating both across different selection methods and across data gathered from very different locations.
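
The variance idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual procedure, only one simple way to rank features of unlabeled sessions by how much they vary. The function name, the feature names, the min-max scaling step, and the sample values are all assumptions made for illustration.

```python
from statistics import pvariance


def rank_features_by_variance(rows, names):
    """Rank features by variance after min-max scaling each column to [0, 1].

    Illustrative only: scaling is an assumption, chosen so that features
    measured in different units can be compared on one scale.
    """
    columns = list(zip(*rows))
    scores = {}
    for name, col in zip(names, columns):
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant column carries no discriminating information.
        scaled = [(v - lo) / span for v in col] if span else [0.0] * len(col)
        scores[name] = pvariance(scaled)
    # Highest-variance features first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-session statistics (made-up values, not thesis data).
names = ["max_min_payload_ratio", "mean_packet_size", "constant_flag"]
sessions = [
    (1.0, 500.0, 1.0),
    (40.0, 510.0, 1.0),
    (1.5, 520.0, 1.0),
    (38.0, 530.0, 1.0),
]
ranking = rank_features_by_variance(sessions, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data the payload ratio splits the sessions into two well-separated groups, so it ranks first, while the constant column scores zero and would be the first candidate for removal in an iterative scheme.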