
A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring



  1. A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring. Patrick P. C. Lee (The Chinese University of Hong Kong), Tian Bu and Girish Chandranmenon (Bell Labs, Alcatel-Lucent). April 2010

  2. Outline • Motivation • MCRingBuffer, a multi-core ring buffer • Parallel network monitoring prototype • Conclusions

  3. Network Traffic Monitoring • Monitoring data streams in today's networks is essential for network management: • accounting • resource provisioning • failure diagnosis • intrusion detection/prevention • Goal: achieve line-rate monitoring • Monitoring speed must keep up with link bandwidth (i.e., prepare for the worst case) • Challenges: • Data volume keeps increasing (e.g., to gigabit scales) • Single-CPU systems may no longer support line-rate monitoring

  4. Can Multi-Core Help? • Can multi-core architectures help line-rate monitoring? • Parallelize packet processing across cores [Figure: raw packets feed a single core in the single-core case vs. all four cores of a quad-core CPU in the multi-core case] • The answer should be "yes"… yet exploiting the full potential of multi-core is still challenging • Inter-core communication has overhead: • Upper layer: protocol messages • Lower layer: thread synchronization in shared data structures

  5. Can Multi-Core Help? • Multi-core helps only if we minimize inter-core communication overhead • Let's focus on minimizing thread synchronization • This benefits a broad class of multi-threaded network monitoring applications

  6. Our Contribution • Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring • Why lock-free? • Allows concurrent thread accesses • Why cache-efficient? • Saves expensive memory accesses • We embed the mechanism in MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures

  7. Producer/Consumer Problem • Classical OS problem • Ring buffer: a bounded buffer with a fixed number of slots • Thread synchronization: • Producer inserts elements when the buffer is not full • Consumer extracts elements when the buffer is not empty • First-in-first-out (FIFO): elements are extracted in the same order they were inserted [Figure: producer inserts elements into the ring buffer; consumer extracts them]

  8. Producer/Consumer Problem • Ring buffer in a multi-core context: [Figure: producer and consumer threads run on two cores with private L1 caches, sharing an L2 cache and, over the system bus, the memory that holds the control variables and the ring buffer] • Thread synchronization operates on control variables; make these operations as cache-friendly as possible

  9. Lamport's Lock-Free Ring Buffer [Lamport, Comm. of ACM, 1977] • Operates on two control variables, read and write, which point to the next read and write slots, respectively [Figure: circular buffer of slots 0 to N-1 with read and write pointers]

    NEXT(x) = (x + 1) % N

    Insert(T element)
    1: wait until NEXT(write) != read
    2: buffer[write] = element
    3: write = NEXT(write)

    Extract(T* element)
    1: wait until read != write
    2: *element = buffer[read]
    3: read = NEXT(read)
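For concreteness, a minimal C rendering of Lamport's algorithm might look as follows. This is a sketch, not the paper's code: the capacity N and element type elem_t are assumptions, and the code leans on the assumptions restated on slide 12 (atomic accesses to read/write, sequential consistency); plain volatile ints do not guarantee those properties on modern compilers and CPUs.

    #define N 2048                     /* assumed capacity for this sketch */
    #define NEXT(x) (((x) + 1) % N)

    typedef int elem_t;                /* placeholder element type */

    static elem_t buffer[N];
    static volatile int read_idx  = 0; /* next slot to read  */
    static volatile int write_idx = 0; /* next slot to write */

    /* Producer: waits while the buffer is full, then inserts. */
    static void insert(elem_t element) {
        while (NEXT(write_idx) == read_idx)
            ;                          /* buffer full: busy-wait */
        buffer[write_idx] = element;
        write_idx = NEXT(write_idx);
    }

    /* Consumer: waits while the buffer is empty, then extracts. */
    static void extract(elem_t *element) {
        while (read_idx == write_idx)
            ;                          /* buffer empty: busy-wait */
        *element = buffer[read_idx];
        read_idx = NEXT(read_idx);
    }

Note the single-producer/single-consumer structure: each index has exactly one writer, which is why no lock or atomic read-modify-write is needed.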

  10. Previous Work • FastForward [Giacomoni et al., PPoPP, 2008]: • couples data and control operations • needs a special NULL data element defined by applications • Hardware-primitive ring buffers: • support multiple producers/multiple consumers • use hardware synchronization primitives (e.g., compare-and-swap) • Hardware primitives are expensive in general (see the sketch below)
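To make the cost argument concrete, a hypothetical compare-and-swap-based slot reservation for a multi-producer buffer could look like the following sketch (C11 atomics; reserve_slot and its parameters are illustrative, not taken from any cited system). Each attempt is an atomic read-modify-write that locks the cache line and retries under contention, which is what makes hardware primitives expensive relative to Lamport's plain loads and stores:

    #include <stdatomic.h>

    static _Atomic int write_idx;

    /* Returns the index of the slot this producer now exclusively owns. */
    static int reserve_slot(int capacity) {
        int old = atomic_load(&write_idx);
        int next;
        do {
            next = (old + 1) % capacity;
            /* On failure, `old` is refreshed with the observed value
             * and the loop retries with a recomputed successor. */
        } while (!atomic_compare_exchange_weak(&write_idx, &old, next));
        return old;
    }

A full multi-producer buffer would also need to publish the element itself safely; that part is omitted here.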

  11. MCRingBuffer Overview • Goal: use Lamport's ring buffer as a building block and further minimize the cost of thread synchronization • Properties: • Lock-free: allows concurrent accesses by the producer and consumer • Cache-efficient: improves the cache locality of synchronization • Generic: no assumptions on data types or insert/extract patterns • Deployable: works on general-purpose multi-core CPUs • Components: • Cache-line protection • Batch updates of control variables

  12. MCRingBuffer Assumptions • Assumptions inherited from Lamport's ring buffer: • single producer / single consumer • reads and writes of the control variables read/write are atomic • memory accesses follow sequential consistency

  13. Cache-line Protection • Caches operate in units of cache lines • False sharing occurs when two threads access different variables that reside on the same cache line • The cache line is invalidated when one thread modifies any variable on it • The line is then reloaded from memory when the other thread reads a different variable on it, even an unchanged one [Figure: read/write, modified frequently for thread synchronization, share a cache line with N (the ring buffer size), so N is reloaded from memory even though it is constant]

  14. Cache-line Protection • Add padding bytes to avoid false sharing (CL = cache line size):

    int read;
    int write;
    char cachePad1[CL - 2*sizeof(int)];
    int N;
    char cachePad2[CL - sizeof(int)];

[Figure: with the padding, read/write and N occupy separate cache lines]
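A compilable version of this layout, assuming a 64-byte cache line, might look like the sketch below; the _Static_assert simply checks that the padding puts N on its own line:

    #include <stddef.h>

    #define CACHE_LINE 64   /* assumed cache-line size in bytes */

    struct ControlVars {
        volatile int read;                             /* shared, hot   */
        volatile int write;                            /* shared, hot   */
        char cachePad1[CACHE_LINE - 2 * sizeof(int)];  /* isolate them  */
        int  N;                                        /* constant      */
        char cachePad2[CACHE_LINE - sizeof(int)];      /* isolate N too */
    };

    /* With the padding, N starts exactly one cache line after read/write,
     * so writes to read/write never invalidate the line holding N. */
    _Static_assert(offsetof(struct ControlVars, N) == CACHE_LINE,
                   "N must start on its own cache line");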

  15. Cache-line Protection • Use cache-line protection to minimize memory accesses: place each group of variables on its own cache line

    Shared variables: read, write (+ cachePad1)
    Consumer's local variables: localWrite, nextRead (+ cachePad2)
    Producer's local variables: localRead, nextWrite (+ cachePad3)
    Constants: N (+ cachePad4)

• Shared variables are the main controls of synchronization • Use local variables to "guess" the shared variables • Goal: minimize the frequency of reading shared control variables (see the sketch below)
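Putting slides 14 and 15 together, a sketch of the full layout and of the producer-side insert could read as follows. Field names follow the slide; CACHE_LINE, CAPACITY, elem_t, and the function body are assumptions of this sketch rather than the paper's verbatim code:

    #define CACHE_LINE 64          /* assumed cache-line size in bytes */
    #define CAPACITY   2048        /* assumed compile-time capacity */
    #define NEXT(x)    (((x) + 1) % CAPACITY)

    typedef int elem_t;            /* placeholder element type */

    struct MCRingBuffer {
        volatile int read, write;                 /* shared variables   */
        char pad1[CACHE_LINE - 2 * sizeof(int)];
        int localWrite, nextRead;                 /* consumer-local     */
        char pad2[CACHE_LINE - 2 * sizeof(int)];
        int localRead, nextWrite;                 /* producer-local     */
        char pad3[CACHE_LINE - 2 * sizeof(int)];
        int N;                                    /* constant capacity  */
        char pad4[CACHE_LINE - sizeof(int)];
        elem_t buffer[CAPACITY];
    };

    /* Producer: check the private guess localRead first; read the shared
     * variable `read` only when the guess says the buffer is full. A stale
     * guess only makes the producer more conservative, never unsafe. */
    static void mc_insert(struct MCRingBuffer *rb, elem_t element) {
        int after = NEXT(rb->nextWrite);
        if (after == rb->localRead) {
            while (after == rb->read)
                ;                         /* truly full: busy-wait */
            rb->localRead = rb->read;     /* refresh the guess */
        }
        rb->buffer[rb->nextWrite] = element;
        rb->nextWrite = after;
        /* `write` itself is published lazily, in batches (next slide). */
    }

The design point: in the common case the producer touches only its own cache line; the shared line holding read/write is read only when the local guess says the buffer is full.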

  16. Batch Updates of Control Variables • Intuition: • nextRead/nextWrite are the positions to read/write next • Update the shared read/write only after every batchSize reads/writes

    Producer:
      buffer[nextWrite] = element
      nextWrite = NEXT(nextWrite)
      wBatch++
      if (wBatch >= batchSize) {
        write = nextWrite
        wBatch = 0
      }

    Consumer:
      *element = buffer[nextRead]
      nextRead = NEXT(nextRead)
      rBatch++
      if (rBatch >= batchSize) {
        read = nextRead
        rBatch = 0
      }

• Goal: minimize the frequency of writing shared control variables
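Combining the local-variable guess with batched publication, a consumer-side sketch might look like this, assuming two extra consumer-local fields, rBatch and batchSize, added to the struct sketched after slide 15:

    /* Consumer: uses localWrite as a guess for the shared `write`, and
     * publishes `read` only once per batchSize extractions. rBatch and
     * batchSize live on the consumer-local cache line. */
    static void mc_extract(struct MCRingBuffer *rb, elem_t *element) {
        if (rb->nextRead == rb->localWrite) {
            while (rb->nextRead == rb->write)
                ;                          /* truly empty: busy-wait */
            rb->localWrite = rb->write;    /* refresh the guess */
        }
        *element = rb->buffer[rb->nextRead];
        rb->nextRead = NEXT(rb->nextRead);
        if (++rb->rBatch >= rb->batchSize) {
            rb->read = rb->nextRead;       /* publish a whole batch */
            rb->rBatch = 0;
        }
    }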

  17. Batch Updates of Control Variables • Limitation: • read/write advance on a per-batch basis, so elements may not be extracted even when the buffer is not empty • However, if the elements are raw packets in high-speed networks, read/write will be updated regularly

  18. Correctness of MCRingBuffer • Correctness based on Lamport's ring buffer: • Lamport's: • Insert only if write – read < N • Extract only if read < write • We prove for MCRingBuffer: • Insert only if nextWrite – nextRead < N • Extract only if nextRead < nextWrite • Intuition: a stale localRead/localWrite can only underestimate the available slots/elements, so batching may delay progress but never violates safety • Details in the paper.

  19. Evaluation • Hardware: Intel Xeon 5355 Quad-core • sibling cores: pair of cores sharing L2 cache • non-sibling cores: pair of cores not sharing L2 cache • Ring buffers: • LockRingBuffer: lock-based ring buffer • BasicRingBuffer: Lamport’s ring buffer • MCRingBuffer: • batchSize = 1: cache-line protection • batchSize > 1: cache-line protection + batch control updates • Metrics: • Throughput: number of insert/extract pairs per second • Number of L2 cache misses: number of cache-line reload operations

  20. Experiment 1 • Throughput vs. element size (buffer capacity = 2K elements) [Figure: throughput plots for sibling cores and non-sibling cores] • MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes

  21. Experiment 2 • Throughput vs. buffer capacity (element size = 128 bytes) [Figure: throughput plots for sibling cores and non-sibling cores] • MCRingBuffer's throughput stays invariant once the buffer capacity is large enough

  22. Experiment 3 • Code profiling with the Intel VTune Performance Analyzer [Table: metric numbers for 10M insert/extract pairs; element size = 8 bytes, capacity = 2K elements] • MCRingBuffer improves cache locality

  23. Recap of Evaluation • MCRingBuffer improves throughput in various scenarios: • different element sizes • different buffer capacities • sibling/non-sibling cores • MCRingBuffer achieves its higher throughput gain via: • careful organization of control variables • careful accesses to control variables • MCRingBuffer's gain does not require any special insert/extract patterns

  24. Parallel Traffic Monitoring • Applying MCRingBuffer to parallel traffic monitoring [Figure: a Dispatcher decodes raw packets and pushes decoded packets through ring buffers to multiple SubAnalyzers; the SubAnalyzers send state reports to a MainAnalyzer]

  25. Parallel Traffic Monitoring [Figure: three-stage pipeline: Dispatch → SubAnalysis (multiple instances) → MainAnalysis] • Dispatch stage: • Decode raw packets • Distribute decoded packets by (srcIP, dstIP); a possible hash-based rule is sketched below • SubAnalysis stage: • Local analysis on address pairs • e.g., 5-tuple flow stats, vertical portscans • MainAnalysis stage: • Global analysis: aggregate the results of all SubAnalyzers • e.g., per-source traffic volume, horizontal portscans • Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in the paper)
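As an illustration of the dispatch rule, hashing the (srcIP, dstIP) pair to choose a SubAnalyzer queue could look like the sketch below; the hash function, constants, and the decode()/mc_insert() calls in the comment are assumptions of this sketch, not the paper's code:

    #include <stdint.h>

    #define NUM_SUBANALYZERS 4   /* assumed: one per worker core */

    /* Pick the SubAnalyzer responsible for a decoded packet. Hashing on
     * (srcIP, dstIP) keeps each address pair on one core, so per-pair
     * state needs no cross-core locking. */
    static unsigned pick_subanalyzer(uint32_t src_ip, uint32_t dst_ip) {
        uint32_t h = src_ip ^ dst_ip;        /* combine the pair */
        h = (h ^ (h >> 16)) * 0x45d9f3b;     /* cheap integer mixing */
        return (h ^ (h >> 16)) % NUM_SUBANALYZERS;
    }

    /* Dispatcher loop (decode() and the per-queue buffers are assumed):
     *     pkt = decode(raw);
     *     mc_insert(&queues[pick_subanalyzer(pkt.src, pkt.dst)], pkt);
     */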

  26. Take-away Messages • Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism • Next question: • How do we apply MCRingBuffer to different network monitoring problems?
