Conference Paper

IM Session Identification by Outlier Detection in Cross-correlation Functions

Conference: 49th Annual Conference on Information Sciences and Systems (CISS), at Johns Hopkins University, Baltimore, MD, USA

ABSTRACT The identification of encrypted Instant Messaging (IM) channels between users is made difficult by high and variable levels of uncorrelated background traffic. In this paper, we propose a novel Cross-correlation Outlier Detector (CCOD) to identify communicating end-users within a large group of users. Our technique uses traffic flow traces between individual users and the IM service provider's data center. We evaluate the CCOD on a data set of Yahoo! IM traffic traces with an average SNR of −6.11 dB (the data set includes ground truth). Results show that our technique achieves an 88% true positive (TP) rate, a 3% false positive (FP) rate, and a 96% ROC area. Previous correlation-based schemes on the same data set were limited to a 63% TP rate, a 4% FP rate, and an 85% ROC area.
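The abstract does not spell out how CCOD works, so the following is only a minimal sketch of the general idea it names: compute a cross-correlation function (CCF) between two users' traffic traces and ask whether its peak is an outlier relative to the rest of the function. The toy traces, the 2-bin relay delay, the burst size, and the median/MAD scoring rule are all assumptions made for illustration and are not taken from the paper.

```python
import random
import statistics


def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of two equal-length series
    for lags -max_lag..max_lag (positive lag: y trails x)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    norm = (sum(v * v for v in dx) * sum(v * v for v in dy)) ** 0.5
    if norm == 0:
        return [0.0] * (2 * max_lag + 1)
    ccf = []
    for lag in range(-max_lag, max_lag + 1):
        lo, hi = max(0, -lag), min(n, n - lag)
        ccf.append(sum(dx[t] * dy[t + lag] for t in range(lo, hi)) / norm)
    return ccf


def outlier_score(ccf):
    """Distance of the CCF peak from the median, in MAD units
    (an assumed scoring rule, not the paper's)."""
    med = statistics.median(ccf)
    mad = statistics.median([abs(v - med) for v in ccf])
    return (max(ccf) - med) / (mad or 1e-12)


# Hypothetical toy traces: packet counts per time bin.
rng = random.Random(7)
n, max_lag, delay = 600, 20, 2


def background():
    # Uncorrelated background traffic: 0 or 1 packet per bin.
    return 1 if rng.random() < 0.3 else 0


msgs_ab = [1 if rng.random() < 0.1 else 0 for _ in range(n)]
msgs_c = [1 if rng.random() < 0.1 else 0 for _ in range(n)]

# User A sends a 3-packet burst per message; user B sees it `delay` bins later.
user_a = [3 * msgs_ab[t] + background() for t in range(n)]
user_b = [(3 * msgs_ab[t - delay] if t >= delay else 0) + background()
          for t in range(n)]
# User C chats with someone else entirely.
user_c = [3 * msgs_c[t] + background() for t in range(n)]

ccf_ab = cross_correlation(user_a, user_b, max_lag)
ccf_ac = cross_correlation(user_a, user_c, max_lag)
lag_ab = ccf_ab.index(max(ccf_ab)) - max_lag
score_ab = outlier_score(ccf_ab)
score_ac = outlier_score(ccf_ac)

print("A-B peak lag:", lag_ab, "score:", round(score_ab, 1))
print("A-C score:", round(score_ac, 1))
```

In this sketch a pair would be flagged as communicating when its score exceeds a chosen threshold. The toy traces are far cleaner than the −6.11 dB traces described above, so the example only illustrates the mechanics, not the reported performance.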

DOI: 10.13140/RG.2.1.3524.5602 · Available from: Saad Saleh, Apr 07, 2015
  • ABSTRACT: We present a novel attack on relayed instant messaging (IM) traffic that allows an attacker to infer who is talking to whom with high accuracy. This attack only requires collecting packet header traces between users and IM servers for a short time period, where each packet in the trace goes from a user to an IM server or vice versa. The specific goal of the attack is to accurately identify a candidate set of the top-k users with whom a given user may have talked, using only the information available in packet header traces (packet payloads cannot be used because they are mostly encrypted). To this end, we propose a wavelet-based scheme, called COmmunication Link De-anonymization (COLD), and evaluate its effectiveness using a real-world Yahoo! Messenger data set. The results of our experiments show that COLD achieves a hit rate of more than 90% for a candidate set size of 10. For a slightly larger candidate set size of 20, COLD achieves an almost 100% hit rate. In contrast, a baseline method using time series correlation achieved a hit rate of less than 5% for similar candidate set sizes.
    2013 21st IEEE International Conference on Network Protocols (ICNP); 10/2013
  • ABSTRACT: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, has evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4), released in 2003.
    ACM SIGKDD Explorations Newsletter 11/2009; 11(1-1):10-18. DOI:10.1145/1656274.1656278