Universiti Teknologi Malaysia Institutional Repository

Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification

Zarei, Roozbeh (2012) Practical training dataset generation and retraining mechanism for on-line peer-to-peer traffic classification. Masters thesis, Universiti Teknologi Malaysia, Faculty of Electrical Engineering.

Official URL: http://dms.library.utm.my:8080/vital/access/manage...

Abstract

Peer-to-Peer (P2P) detection by Machine Learning (ML) classification is affected by the quality and recency of the training dataset. Hence, classifying P2P traffic on-line requires the removal of these limitations. In this research work, a novel practical training dataset generation and an automatic retraining mechanism for on-line P2P traffic classification are proposed. These two proposals are integrated in a system that removes the limitations of ML classification and makes it suitable for on-line P2P traffic classification. For the first part, a novel two-stage training dataset generation is proposed that combines a 3-class heuristic classification and a 3-class statistical classification to accurately generate the training dataset. In the heuristic stage, traffic is classified as P2P, nonP2P or unknown. In the statistical stage, a dual Decision Tree (DT) is built from the dataset generated in the heuristic stage and classifies the unknown traffic into the same three classes in order to reduce the amount of traffic left classified as unknown. The final training dataset is generated from all flows classified in these two stages. In the second part of the system, an automatic retraining mechanism is proposed to satisfy the need to retrain the ML classifier by detecting changes in traffic behavior and updating the on-line ML classifier with a recent, accurate training dataset. This mechanism evaluates the accuracy of the on-line ML classifier against flows labeled by the two-stage training dataset generation. The on-line ML classifier is retrained if its accuracy falls below a predefined threshold. The proposed system has been evaluated on traces captured from the Universiti Teknologi Malaysia (UTM) campus network between October and November 2011. The overall results show that the two-stage training dataset generation can generate an accurate training dataset by classifying more than 95% of total flows with high accuracy (98.59%) and a low false positive rate (0.91%). The on-line ML classifier, which is built with the J48 algorithm and the training dataset generated by the two-stage training dataset generation, classifies traffic with high accuracy (99%) using 25 features extracted from the first 5 packets of each flow. The results also show that the automatic retraining mechanism allows the on-line ML classifier to maintain its accuracy above a set threshold over time.
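The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the flow fields (`port`, `mean_size`), the toy heuristic and statistical rules, and the function names are all hypothetical, and the statistical stage is passed in as a ready-made function rather than as a dual Decision Tree trained on the heuristic-stage output. Only the overall shape (label in two stages, drop flows that stay unknown, retrain when accuracy falls below a threshold) follows the abstract.

```python
from typing import Callable, Dict, List, Tuple

Flow = Dict[str, int]           # hypothetical flow record
Labeler = Callable[[Flow], str]  # returns "P2P", "nonP2P" or "unknown"


def two_stage_label(flows: List[Flow], heuristic: Labeler,
                    statistical: Labeler) -> List[Tuple[Flow, str]]:
    """Label flows with the heuristic stage, then retry unknowns statistically."""
    labeled = []
    for flow in flows:
        label = heuristic(flow)
        if label == "unknown":
            label = statistical(flow)
        if label != "unknown":  # flows unknown after both stages are left out
            labeled.append((flow, label))
    return labeled


def accuracy(classifier: Labeler, labeled: List[Tuple[Flow, str]]) -> float:
    """Fraction of labeled flows on which the on-line classifier agrees."""
    if not labeled:
        return 1.0  # assumption: no evidence means no reason to retrain
    correct = sum(1 for flow, label in labeled if classifier(flow) == label)
    return correct / len(labeled)


def needs_retraining(classifier: Labeler, labeled: List[Tuple[Flow, str]],
                     threshold: float) -> bool:
    """True when the measured accuracy falls below the predefined threshold."""
    return accuracy(classifier, labeled) < threshold


# Toy rules, invented purely for this example.
def toy_heuristic(flow: Flow) -> str:
    if flow["port"] == 6881:
        return "P2P"
    if flow["port"] in (80, 443):
        return "nonP2P"
    return "unknown"


def toy_statistical(flow: Flow) -> str:
    if flow["mean_size"] >= 1000:
        return "P2P"
    if flow["mean_size"] <= 200:
        return "nonP2P"
    return "unknown"


flows = [
    {"port": 6881, "mean_size": 900},
    {"port": 80, "mean_size": 400},
    {"port": 50000, "mean_size": 1200},
    {"port": 50001, "mean_size": 100},
    {"port": 50002, "mean_size": 600},  # stays unknown in both stages
]

labeled = two_stage_label(flows, toy_heuristic, toy_statistical)


# A stale on-line classifier that only recognises the well-known P2P port.
def stale_classifier(flow: Flow) -> str:
    return "P2P" if flow["port"] == 6881 else "nonP2P"


print(len(labeled), accuracy(stale_classifier, labeled))
print(needs_retraining(stale_classifier, labeled, threshold=0.9))
```

In this toy run the stale classifier misses the P2P flow on an unusual port, so its measured accuracy drops below the 0.9 threshold and the check signals that retraining is due. In a real deployment the labeled flows would also serve as the fresh training dataset for the replacement classifier.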

Item Type: Thesis (Masters)
Additional Information: Thesis (Sarjana Kejuruteraan (Elektrik - Elektronik dan Telekomunikasi)) - Universiti Teknologi Malaysia, 2012; Supervisor: Dr. Muhammad Nadzir Marsono
Uncontrolled Keywords: Peer-to-Peer (P2P), machine learning, retraining mechanism
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering
Divisions: Electrical Engineering
ID Code: 33398
Deposited By: Narimah Nawil
Deposited On: 25 Oct 2013 00:24
Last Modified: 27 May 2018 08:07
