Characterizing Files in the Modern Gnutella Network: A Measurement Study
Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
Multimedia & Internetworking Research Group (Mirage)
Computer & Information Science Department, University of Oregon
http://mirage.cs.uoregon.edu
Multimedia Computing & Networking 2006

Introduction
• P2P applications are very popular on the Internet
  • File sharing: Gnutella, Kazaa, eDonkey
  • Content distribution: BitTorrent
  • IP telephony: Skype
• P2P applications remain popular because of
  • Ease of deployment, self-scaling, no need for dedicated infrastructure
• Significant impact on the Internet
• Characterizing P2P applications is essential for
  • Evaluating their performance and improving their designs
  • Conducting meaningful simulations and analytical studies
  • Examining their impact on the network
• Characteristics of large-scale P2P applications are not well understood!

P2P Systems: An Overview (I)
• Theme: enabling a group of peers (computers) to share their resources (e.g. files, bandwidth, storage, CPU)
• As participating peers arbitrarily join and leave, they form an (application-level) overlay topology.
  • The overlay is inherently dynamic
  • No special support from the network (e.g. multicast)
• The overlay is used for resource discovery and management

P2P Systems: An Overview (II)
• Inherent properties:
  • Scalability: available resources grow organically with the number of peers
  • Churn: peers voluntarily join/leave
  • Heterogeneity: peers have different capabilities
• Two basic architectures:
  1) Unstructured: peers form a randomly connected overlay
  2) Structured: peers form an overlay with certain properties (ring, tree)

Effect on the Internet
• P2P accounts for 60% of all Internet traffic [CacheLogic Research 2005]
• Some P2P apps have millions of simultaneous users.
• Users are geographically distributed.
[Figure: Gnutella overlay in 2002]
[Figure: Gnutella population (Oct 04 – Jan 06)]

Research on P2P Networking
• Active area of research since 2001
  • Mostly focusing on new architectures and new resource discovery/management techniques
  • Evaluation is only feasible through simulation or small-scale experiments with synthetic workloads.
• Few empirical studies on P2P systems
• Characteristics of widely deployed P2P systems are not well understood.
  • Peer dynamics: e.g. distribution of peer uptime
  • Overlay properties: e.g. distribution of peer degree
  • Resource properties: e.g. popularity distribution of files

Methodology
• Characterizing P2P applications requires capturing system “snapshots”.
• A snapshot is a graph that represents the state of the system at a given point in time (peers = nodes, connections = edges).
  • Individual snapshots reveal instantaneous properties.
  • Consecutive snapshots reveal dynamics.
• Ideally, a snapshot is captured instantaneously.
• In practice, a snapshot is iteratively discovered by a P2P crawler.
• P2P apps should provide support for crawlers, e.g. querying a peer for its list of neighbors and files.
  • It is difficult to characterize proprietary P2P applications.

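The iterative discovery described on this slide can be sketched as a breadth-first crawl. This is a minimal illustration, not Cruiser's implementation: `crawl_snapshot` and the `get_neighbors` callback are hypothetical names, and the overlay here is a toy in-memory dictionary rather than a live network.

```python
from collections import deque

def crawl_snapshot(seed_peers, get_neighbors):
    """Discover a snapshot (nodes, edges) by crawling outward from seed peers.

    get_neighbors(peer) is a hypothetical callback: it returns the peer's
    neighbor list, or None when the peer cannot be reached.
    """
    nodes = set(seed_peers)
    edges = set()
    unreachable = set()
    queue = deque(seed_peers)
    while queue:
        peer = queue.popleft()
        neighbors = get_neighbors(peer)
        if neighbors is None:
            unreachable.add(peer)  # known to exist, but could not be queried
            continue
        for neighbor in neighbors:
            edges.add(frozenset((peer, neighbor)))  # undirected connection
            if neighbor not in nodes:
                nodes.add(neighbor)
                queue.append(neighbor)
    return nodes, edges, unreachable

# Toy overlay: peer "d" is listed by "b" but does not answer the crawler.
overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": None}
nodes, edges, unreachable = crawl_snapshot(["a"], overlay.get)
```

Because peers join and leave while the crawl is running, a slower crawl sees a more distorted graph, which is why crawl speed matters for snapshot accuracy.
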
Cruiser: a Fast P2P Crawler
• We developed a parallel crawler called Cruiser.
• Features:
  • Master-slave architecture: the master coordinates the slaves, and each slave crawls hundreds of peers simultaneously
  • Dynamic adaptation to bandwidth and CPU constraints
  • Generic crawler that accommodates plug-ins
• Orders of magnitude faster than other P2P crawlers:
  • Captures one million Gnutella nodes in around 7 minutes
  • 140K peers/min (visiting 22K peers/min) >> 2.5 peers/min
• Many important implementation issues:
  • Setting timeouts, the number of file descriptors per process, dealing with a local NAT box

Cruiser/ Evaluating Snapshot Accuracy
• No reference snapshot to compare against
• Completeness of captured snapshots: edges, nodes
• Tradeoff between granularity and completeness of snapshots
  • Node distortion > 4%
  • Edge distortion > 15%
• 30% of peers are unreachable
  • 3% departed peers
  • 17% behind a firewall (NAT)
  • 10% overloaded !!
[Figure: axis label "Peers discovered (*10,000)"]

Characterizing Files/ Previous Studies
• Captured a small population of peers
  • Partial snapshot through a short crawl
  • Periodic probe of a fixed group of peers
• Have not verified whether the captured population is representative
• Conducted more than 3 years ago (outdated)
  • Population of these apps has grown significantly
  • New features and a two-tier architecture were incorporated

Characterizing Files/ Measurement Methodology
• Characterizing files requires file snapshots.
  • Obtaining the list of shared files and neighbor information from individual peers
  • A content crawl + a topology crawl
• Individual snapshots enable static and topological analysis.
• Consecutive snapshots enable dynamic analysis.
• A topology crawl is much faster than a content crawl (minutes vs. hours)
• Other challenges: NAT, DHCP, fileID, … (see paper).
• Minimizing the distortion in file snapshots by
  • Capturing a complete snapshot with a high-speed crawler
  • Decoupling the topology crawl from the content crawl
[Figure: two-tier overlay (ultrapeers in the top-level overlay, leaves attached to them) with a crawl timeline showing two topology crawls (15 min each) and a content crawl (5.5 hours)]

Characterizing Files/ Dataset
• Captured around 50 snapshots
• Average log size per snapshot: 10 GB
• Each snapshot represents
  • 800 terabytes of content
  • 100 million unique files
  • 0.5 million reachable peers, 20% of identified peers
• Available content in Gnutella = 4,000 terabytes
• Reported results were consistent across multiple snapshots
• Post-processing
  • e.g. removed duplicate files reported by individual peers (9% of all captured files)

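The figures on this slide are mutually consistent under a simple extrapolation, sketched below. The sketch assumes that unreachable peers share as much content on average as reachable ones, which the slide does not state. The variable names are illustrative.

```python
# Figures taken from the slide.
reachable_peers = 0.5e6        # peers whose file lists were captured
reachable_fraction = 0.20      # reachable peers as a share of identified peers
captured_content_tb = 800      # content seen in one snapshot, in terabytes

# Assumption: unreachable peers share as much, on average, as reachable ones.
identified_peers = reachable_peers / reachable_fraction
estimated_total_tb = captured_content_tb / reachable_fraction

print(f"identified peers: about {identified_peers / 1e6:.1f} million")
print(f"estimated available content: about {estimated_total_tb:,.0f} TB")
```
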
Characterizing Files/ Summary of Characterizations
1) Static analysis: characteristics of files at a given point in time
2) Topological analysis: correlation between file distribution and overlay topology
3) Dynamic analysis: changes in file characteristics over time

Characterizing Files/Static Analysis: Free Riding
• % of free riders reported in previous studies
  • 66% in 2000 [Adar]
  • 25% in 2002 [Saroiu]
• The % of free riders has dropped

Free Riders, June 13, 2005 [rounded numbers]
                     Peers   None   Files
Ultra                159K    12%    352
Leaf                 235K    15%    332
Long-lived Ultra     125K    12%    349
Short-lived Ultra     34K    12%    363
Long-lived Leaf      156K    16%    350
Short-lived Leaf      79K    14%    297
Total                394K    14%    340

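A per-group free-riding percentage like the "None" column can be computed from per-peer file counts as sketched below. The function name, the record layout, and the toy data are assumptions for illustration. They are not the study's data or tooling.

```python
from collections import defaultdict

def free_riding_by_group(records):
    """Return {group: (peer_count, fraction_sharing_no_files)}.

    records: iterable of (group, shared_file_count) pairs, one per peer.
    """
    peers = defaultdict(int)
    sharing_none = defaultdict(int)
    for group, shared_file_count in records:
        peers[group] += 1
        if shared_file_count == 0:
            sharing_none[group] += 1  # a free rider shares no files at all
    return {g: (peers[g], sharing_none[g] / peers[g]) for g in peers}

# Toy input (made up): two ultrapeers and four leaves.
toy = [("ultra", 0), ("ultra", 120),
       ("leaf", 0), ("leaf", 0), ("leaf", 40), ("leaf", 7)]
summary = free_riding_by_group(toy)
```
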
Characterizing Files/Static Analysis: Resource Sharing
• How many resources (files, storage) do peers contribute?
• Distribution of peers contributing:
  • x files follows a power law
  • x MBytes follows a power law
• Most peers contribute little, but a few contribute a lot
• Correlation between shared files and storage
  • Not as strong as reported by Saroiu et al. 2002

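One rough way to check a power-law claim is to fit a straight line to the histogram on a log-log scale, as sketched below. This is a visual-style sanity check under stated assumptions, not a rigorous statistical test and not necessarily the method used in the study. `loglog_slope` is a hypothetical helper and the data is synthetic.

```python
import math
from collections import Counter

def loglog_slope(values):
    """Least-squares slope of log(number of peers) against log(value).

    values: one positive number per peer (e.g. files shared).
    A power law count(x) ~ x**(-a) appears as a line with slope -a.
    """
    hist = Counter(v for v in values if v > 0)
    if len(hist) < 2:
        raise ValueError("need at least two distinct positive values")
    xs = [math.log(v) for v in hist]
    ys = [math.log(c) for c in hist.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic data: 4096 / x**2 peers share x files, for x in 1, 2, 4, 8.
toy = [x for x in (1, 2, 4, 8) for _ in range(4096 // (x * x))]
slope = loglog_slope(toy)  # close to -2 for this synthetic input
```
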
Characterizing Files/Static Analysis: File Popularity
• Popularity represents the availability of individual files.
• Follows a Zipf distribution
• The popularity distribution remains stable over time

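File popularity can be derived from a file snapshot by counting how many peers hold each file and sorting by rank, as sketched below. Under a Zipf distribution, the count at rank r falls off roughly as 1/r^s for some exponent s. The function name and the toy snapshot are illustrative assumptions.

```python
from collections import Counter

def popularity_by_rank(peer_files):
    """Return copy counts sorted from most to least popular file.

    peer_files: dict mapping a peer id to an iterable of file ids.
    """
    copies = Counter()
    for files in peer_files.values():
        copies.update(set(files))  # count each file at most once per peer
    return [count for _, count in copies.most_common()]

# Toy snapshot (made up): file "a" is held by three peers, "c" by one.
toy_snapshot = {
    "p1": ["a", "b", "c"],
    "p2": ["a", "b", "b"],   # duplicate report of "b" is ignored
    "p3": ["a"],
}
ranked = popularity_by_rank(toy_snapshot)
```

Plotting `ranked` against rank on a log-log scale is the usual way to inspect whether the curve is close to a straight line.
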
Characterizing Files/Static Analysis: File Types
• In 2001, Chu et al. reported
  • Audio: 67% of files, 79% of bytes
  • Video: 2% of files, 19% of bytes
• mp3 files are very popular!
• Multimedia (mm) files make up 73% of files and 93% of bytes
• Non-mm: jpg, gif, htm, exe, txt
• Video files have become more popular

Major Audio Types
Type    File%   Byte%
mp3     61%     37%
wma     2.7%    1.3%
wave    1.9%    0.7%
m4a     1.4%    0.7%
total   67%     40%

Major Video Types
Type    File%   Byte%
wmv     2.3%    3.4%
mpg     2.4%    23.3%
avi     0.8%    24.5%
asf     0.14%   0.64%
total   5.6%    52%

Characterizing Files/ Topological Analysis
• Is there any correlation between the locations of a file and the overlay topology?
  • i.e. are copies of a file topologically clustered?
• File locations are affected by two factors:
  1) Scoped search => topological clustering
  2) Churn => random distribution
• Which factor is dominant?
• Examining from two angles:
  • Per-file perspective
  • Per-peer perspective

Characterizing Files/ Topological Analysis
• Simulate flood-based queries from 100 random peers
  • Number of messages needed to find 5 copies
  • Files with different popularity
  • Random vs. realistic file distribution
• Average similarity of content between 100 random peers and their one-, two-, and three-hop neighbors
• No topological clustering exists
  • Churn is the dominant factor
• Use a random file distribution for simulations
• Select random peers to characterize files (non-trivial)

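The flood-based query experiment can be sketched as a hop-by-hop flood that counts messages until enough copies are found. The slide does not specify the message-counting rules, so the ones below are assumptions: every forwarded query counts as one message, the flood stops at the end of the hop in which the target is met, and `max_ttl=7` is an assumed default. All names are hypothetical.

```python
import random

def flood_messages(adjacency, holders, source, copies_needed=5, max_ttl=7):
    """Return (messages_sent, copies_found) for a flood starting at source.

    adjacency: dict peer -> list of neighbor peers (the overlay snapshot).
    holders: set of peers that hold a copy of the requested file.
    """
    visited = {source}
    frontier = [source]
    found = 0          # copies at the source itself are not counted
    messages = 0
    for _ in range(max_ttl):
        if found >= copies_needed or not frontier:
            break
        next_frontier = []
        for peer in frontier:
            for neighbor in adjacency[peer]:
                messages += 1  # assumption: duplicate deliveries also count
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
                    if neighbor in holders:
                        found += 1
        frontier = next_frontier
    return messages, found

def random_holders(peers, n_copies, seed=0):
    """Place n_copies of a file on peers chosen uniformly at random."""
    return set(random.Random(seed).sample(sorted(peers), n_copies))

# Toy overlay (made up): a small tree rooted at peer 0.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
result = flood_messages(adjacency, holders={3, 4}, source=0, copies_needed=2)
```

Running the same query once with the observed holders and once with `random_holders` of the same size gives the random vs. realistic comparison. Similar message counts in both cases would be consistent with the absence of topological clustering.
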
Characterizing Files/ Dynamic Analysis
• How do various characteristics of available files change over different timescales?
  • Peers add/download or remove files
  • Peers join/leave the system
1) Variations in the files shared by individual peers
  • Dynamic IP addresses introduce error
2) Variations in the popularity of individual files
  • Trends in popularity changes

Characterizing Files/Dynamic Analysis: Variations of Files at Individual Peers
• Ratio of added/removed files to total files (degree of change)
  • 3000 random peers
  • Timescales: 2 hours, 6 hours, 1 day, 1 week
• More change over longer timescales, which seems intuitive
• Change in popularity of 50K files over a one-day interval
  • More changes for more popular files

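The degree of change for one peer can be computed from its file lists in two snapshots, as sketched below. The slide does not say what "total files" means, so this sketch assumes it is the number of distinct files seen in either snapshot. The function name and the toy data are illustrative.

```python
def degree_of_change(files_before, files_after):
    """Return (added_ratio, removed_ratio) for one peer between two snapshots.

    Assumption: the denominator is the number of distinct files seen in
    either snapshot.
    """
    before, after = set(files_before), set(files_after)
    total = len(before | after)
    if total == 0:
        return 0.0, 0.0  # the peer shared nothing in both snapshots
    added = after - before
    removed = before - after
    return len(added) / total, len(removed) / total

# Toy peer (made up): one file added and one removed out of five distinct files.
added_ratio, removed_ratio = degree_of_change(
    ["a", "b", "c", "d"], ["b", "c", "d", "e"]
)
```

The comparison is only meaningful when both file lists belong to the same peer. Peers whose IP address changed between snapshots need to be matched or excluded first.
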
Characterizing Files/Dynamic Analysis: Change in File Popularity
• Change in popularity
  • For the top 100 and top 1000 files
  • Over different timescales
• For any timescale, more popular files
  • Exhibit larger changes
  • Changes occur more rapidly
• Caching references is useful
• These all seem intuitive, but one needs to quantify the rate of change
[Figures: top 100 files; top 1000 files]

Characterizing Files/Dynamic Analysis: Trends in Popularity Changes
• Goal: to predict the popularity of a file in the future
• No major change in popularity over several days
• Larger changes over a few months
• The key is to quantify the rate and pattern of changes.
  • Significantly more snapshots are required to derive any reliable conclusion