Detection of Metamorphic Malware Variants Using Global Control Flow Analysis

Hira Agrawal, Lisa Bahler, Mike Little, Josephine Micallef (Systems & Security Research, Applied Communication Sciences, Basking Ridge, NJ); Shane R Snyder (CERDEC, US Army, Aberdeen Proving Ground, MD).

Presentation Transcript


  1. Detection of Metamorphic Malware Variants Using Global Control Flow Analysis • Hira Agrawal, Lisa Bahler, Mike Little, Josephine Micallef: Systems & Security Research, Applied Communication Sciences, Basking Ridge, NJ • Shane R Snyder: CERDEC, US Army, Aberdeen Proving Ground, MD

  2. The Problem • Detect zero-day vulnerabilities arising from new variants of old malware • A vast majority of existing malware is made up of variations of old malware • There is a substantial lag between the time a new variant is discovered and the time its signature is added to the local AV signature database • Machines remain vulnerable during this time window, which may often be long

  3. Current Solutions are Inadequate • Current techniques derive syntactic malware signatures, based on control structures such as specific byte sequences, flow graphs, and call graphs • These signatures are easily defeated using automated program diversification techniques such as those employed by new metamorphic transformation engines: • Adding spurious subgraphs within flow graphs • Adding spurious functions and calls to those functions • Inlining and outlining code into and out of existing functions

  4. Approach [Figure: for each malware family (A, B, ..., X) with variants A1, A2, ..., Aa, current approaches derive one signature per variant (Signature A1, Signature A2, ..., Signature Aa), whereas the proposed approach derives a single abstract signature per family (Abstract Signature A, Abstract Signature B, ..., Abstract Signature X).] Even though the variants differ in their control and data flow, they share a common set of core elements!

  5. Abstract, Semantic Signatures [Figure: the Original Version (if x S1; S2; else S3; S4; S5;) and A Sample Variant in which the same statements are outlined into functions (main(): if x f1(); else f3(); S5; f1(): S1; f2(); f2(): S2; f3(): S3; f4(); f4(): S4;), each shown with its flow graph (Flow Graph of the Original Version, Flow Graph of the Sample Variant) and its signature (Signature of the Original Version, Signature of the Sample Variant). Both signatures contain the same nodes: s1; s2 and s3; s4.]

  6. Construction of Local Signatures [Figure, for the code if x S1; S2; else S3; S4; S5; with panels: (i) Malware (ii) Flow Graph (iii) Pre-Dominator Tree (iv) Post-Dominator Tree (v) Merged-Dominator Graph (vi) Super-block Dominator Graph and Tree (vii) Malware Signature (Super-block Dominator Tree Projected Over Non-Predicate Nodes), containing s1; s2 and s3; s4. Key: Predicate Node, Non-Predicate Node.]

  7. Construction of Global Signatures [Figure, for the variant main(): if x f1(); else f3(); S5; f1(): S1; f2(); f2(): S2; f3(): S3; f4(); f4(): S4; with panels: (i) Malware Variant (ii) Flow Graphs (iv) Call Graph (v) Intermediate Dominator Graph (vi) Mega Block Dominator Graph and Tree (vii) Variant Signature (Projected Mega Block Dominator Tree), containing s1; s2 and s3; s4.]

  8. Flow Graph Projection [Figure: an Original Flow Graph and the corresponding Projected Flow Graph. Key: System or library call node; Entry or exit node; Other nodes.]

  9. Flow Graph Projection Steps [Figure: the flow graph after each step.] • Assign the label of the first instruction of each node as the label of that node • Identify and mark all nodes that represent system or library calls • Mark the entry and exit nodes, if not already marked, as relevant • Merge all cycles made up entirely of irrelevant nodes • Merge any irrelevant node that has a single predecessor node with the latter node • Merge any irrelevant node that has a single successor node with that node
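One of the merge steps above can be sketched in Python. This is a minimal illustration of the single-predecessor merge only, not the presenters' implementation: the graph encoding (a dict mapping each node to its successor set), the function name, the choice to drop self-loops created by a merge, and the example graph are all assumptions made for the sketch.

```python
def merge_into_single_predecessor(succ, relevant):
    """Repeatedly fold each irrelevant node that has exactly one
    predecessor into that predecessor (hypothetical helper)."""
    succ = {n: set(s) for n, s in succ.items()}   # work on a copy
    members = {n: [n] for n in succ}              # nodes folded into each node
    changed = True
    while changed:
        changed = False
        for n in list(succ):
            if n in relevant:
                continue
            preds = [p for p in succ if n in succ[p] and p != n]
            if len(preds) != 1:
                continue
            p = preds[0]
            # p inherits n's outgoing edges; self-loops are dropped (assumption)
            succ[p] = (succ[p] | succ[n]) - {n, p}
            members[p] += members.pop(n)
            del succ[n]
            changed = True
            break  # the graph changed, so rescan from the start
    return succ, members


# Hypothetical graph: r is the entry, a makes a system call, j is the exit.
graph = {"r": {"a"}, "a": {"b"}, "b": {"c"}, "c": {"j"}, "j": set()}
projected, members = merge_into_single_predecessor(graph, relevant={"r", "a", "j"})
```

In this example the irrelevant nodes b and c are folded into a, which then points directly at the exit node j. The cycle merge and the single-successor merge would need separate passes.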

  10. Call Graph Projection [Figure: an Original Call Graph (f1 to f6) and the corresponding Projected Call Graph. Key: Relevant function, which contains a system/library call; Function that directly or indirectly calls a relevant function; Other functions.]

  11. Call Graph Projection Steps [Figure: the call graph after each step.] • Identify and mark any node that represents a function containing a relevant node as relevant • Mark the root node of the call graph, if not already marked, as relevant • Mark all predecessors of all marked nodes, as well as all their call sites, as relevant • Merge any irrelevant node that has a single predecessor node with that node • Remove all irrelevant nodes from which no relevant nodes may be reached • Make the label of the first instruction in each node the label of that node
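The marking steps above amount to a backward reachability search over the call graph. The sketch below covers only the marking and a simple pruning of unmarked functions; it omits the merge and relabel steps and call-site marking. The encoding (a dict from caller to callees), the function name, and the example graph are assumptions for illustration, not taken from the slides.

```python
from collections import deque


def mark_relevant(call_graph, root, has_syscall):
    """Return the functions that contain a system/library call, the root,
    and every function that directly or indirectly calls a marked one."""
    preds = {f: set() for f in call_graph}
    for caller, callees in call_graph.items():
        for callee in callees:
            preds.setdefault(callee, set()).add(caller)
    relevant = {f for f in preds if f in has_syscall} | {root}
    queue = deque(relevant)
    while queue:                       # walk call edges backwards
        f = queue.popleft()
        for caller in preds.get(f, ()):
            if caller not in relevant:
                relevant.add(caller)
                queue.append(caller)
    return relevant


# Hypothetical call graph in which only f5 makes a system/library call.
calls = {"f1": ["f2", "f4"], "f2": ["f3"], "f3": [], "f4": ["f5"],
         "f5": ["f6"], "f6": []}
relevant = mark_relevant(calls, root="f1", has_syscall={"f5"})
pruned = {f: [g for g in calls[f] if g in relevant] for f in calls if f in relevant}
```

Here f2, f3, and f6 are left unmarked because none of them leads to a system/library call, so the pruned graph keeps only the chain f1, f4, f5.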

  12. Concept of Operations [Flowchart.] 1. Offline Analysis: instance of known malware binary → (a) Analyze and transform → graph-based representation → (b) Generate abstract signatures → abstract signatures (graph-based signatures in the malware library). 2. Online Analysis: new, unknown binary instance → graph-based representation → abstract signature of the new binary → (c) Compare signatures: find the nearest neighbor of the new binary in the malware library and compute the distance between them → (d) distance < neighbor’s threshold? yes: Malware instance! no: Benign instance! For efficiency, the chosen “distance” measure must satisfy the triangle inequality!
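The online decision in steps (c) and (d) can be sketched as follows. This is an illustration under stated assumptions: signatures are stood in for by plain numbers, the distance is the absolute difference, and the library layout (a list of signature/threshold pairs) and the name `classify` are hypothetical. The brute-force scan is used only for clarity.

```python
def classify(query_sig, library, distance):
    """library: list of (signature, threshold) pairs (assumed layout).
    Flags the query as malware if it lies within the threshold of its
    nearest neighbor in the library."""
    scored = [(distance(query_sig, sig), threshold) for sig, threshold in library]
    d, threshold = min(scored, key=lambda pair: pair[0])
    return "malware" if d < threshold else "benign"


# Toy stand-ins: numbers as signatures, absolute difference as the distance.
toy_library = [(10.0, 3.0), (50.0, 5.0)]
toy_distance = lambda a, b: abs(a - b)

verdict_near = classify(11.5, toy_library, toy_distance)  # 1.5 away from 10.0
verdict_far = classify(28.0, toy_library, toy_distance)   # 18.0 away from 10.0
```

A real system would replace the scan with a search that exploits the triangle inequality, which is why the slide requires a metric distance.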

  13. Tree Edit Distance [Figure: trees T1, T2, and T3, where T2 results from T1 by insert(z,1,w) and T3 results from T2 by delete(v,1,u).] Edit Script(T1,T3): (i) insert z as the first child of w (ii) delete v, the 1st child of u; or, (i) delete v, the 1st child of u (ii) insert z as the first child of w; or, (i) relabel v to w (ii) relabel w to z. Edit Distance, De(Ti,Tj) = the length of the shortest Edit Script(Ti,Tj) ⇒ De(T1,T3) = 2

  14. Tree Edit Distance (cont’d) [Figure: trees T1, T2, and T3.] De(T1,T2) = 1, De(T2,T3) = 1, De(T1,T3) = 2. De(T1,T3) ≤ De(T1,T2) + De(T2,T3). De satisfies the triangle inequality!
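The three distances above can be reproduced with a short sketch of ordered tree edit distance using unit-cost insert, delete, and relabel operations. It uses the textbook forest recursion with memoization, which is not necessarily the algorithm the presenters used. The tree shapes are an assumption inferred from the edit scripts on the previous slide: T1 = u(v(w), x(y)), T2 = u(v(w(z)), x(y)), T3 = u(w(z), x(y)).

```python
from functools import lru_cache


def tree(label, *children):
    """A tree is a (label, children) pair; children is a tuple of trees."""
    return (label, tuple(children))


def size(forest):
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return size(f2)              # insert everything in f2
    if not f2:
        return size(f1)              # delete everything in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + c1, f2) + 1,   # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + c2) + 1,   # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])        # match the two rightmost roots
        + forest_distance(c1, c2) + (l1 != l2),  # relabel if labels differ
    )


def edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Tree shapes inferred from the slide's edit scripts (assumption).
T1 = tree("u", tree("v", tree("w")), tree("x", tree("y")))
T2 = tree("u", tree("v", tree("w", tree("z"))), tree("x", tree("y")))
T3 = tree("u", tree("w", tree("z")), tree("x", tree("y")))
```

With these shapes, `edit_distance(T1, T2)` and `edit_distance(T2, T3)` are each 1 and `edit_distance(T1, T3)` is 2, matching the slide.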

  15. Tree Edit Distance (cont’d) [Figure: trees T1 (5 nodes) and T2 (6 nodes), and trees T3 (105 nodes) and T4 (106 nodes), each with 100 additional nodes.] De(T1,T2) = 1, and De(T3,T4) = 1!!!

  16. Normalized Tree Edit Distance [Figure: trees T1 (5 nodes), T2 (6 nodes), T3 (105 nodes), and T4 (106 nodes).] Dn(Ti,Tj) = De(Ti,Tj) / (|Ti|+|Tj|). Dn(T1,T2) = 1/11. Dn(T3,T4) = 1/211 « 1/11

  17. Normalized Tree Edit Distance (cont’d) [Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes).] Dn(T1,T2) = 1/11 = 10/110. Dn(T2,T3) = 1/11 = 10/110. Dn(T1,T3) = 2/10 = 22/110 > 20/110. Dn(T1,T3) > Dn(T1,T2) + Dn(T2,T3). Dn does NOT satisfy the triangle inequality!

  18. Normalized, Metric Tree Edit Distance [Figure: trees T1 (5 nodes), T2 (6 nodes), and T3 (5 nodes).] Dnm(T1,T2) = 2*De(T1,T2) / (|T1|+|T2| + De(T1,T2)). Dnm(T1,T2) = 2*1/(11+1) = 2/12. Dnm(T2,T3) = 2*1/(11+1) = 2/12. Dnm(T1,T3) = 2*2/(10+2) = 4/12. Dnm(T1,T3) ≤ Dnm(T1,T2) + Dnm(T2,T3). Dnm satisfies the triangle inequality!
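The arithmetic on the two normalization slides can be checked with exact fractions. The function names `dn` and `dnm` are chosen here to mirror the slide notation. The inputs are the edit distances and node counts stated on the slides.

```python
from fractions import Fraction


def dn(de, n1, n2):
    """Normalized distance: De / (|Ti| + |Tj|)."""
    return Fraction(de, n1 + n2)


def dnm(de, n1, n2):
    """Normalized metric distance: 2*De / (|Ti| + |Tj| + De)."""
    return Fraction(2 * de, n1 + n2 + de)


# Node counts |T1| = 5, |T2| = 6, |T3| = 5; De(T1,T2) = De(T2,T3) = 1, De(T1,T3) = 2.
dn_12, dn_23, dn_13 = dn(1, 5, 6), dn(1, 6, 5), dn(2, 5, 5)
dnm_12, dnm_23, dnm_13 = dnm(1, 5, 6), dnm(1, 6, 5), dnm(2, 5, 5)

dn_violates_triangle = dn_13 > dn_12 + dn_23        # 1/5 > 2/11
dnm_obeys_triangle = dnm_13 <= dnm_12 + dnm_23      # 1/3 <= 1/6 + 1/6
```

For this triple Dnm meets the triangle inequality with equality. This checks only the slide's example and is not a proof that Dnm is a metric in general.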

  19. Detecting Variants Missed by AV • We found a malware family with five variants • Virus.Win32.Thorin • Virus.Win32.Thorin.b • Virus.Win32.Thorin.c • Virus.Win32.Thorin.d • Virus.Win32.Thorin.e • A major AV product failed to detect the last variant, Virus.Win32.Thorin.e • MAA correctly flags it as malware, based on the abstract signature it derives from the first variant • It detects the other three variants as well, without requiring a separate, dedicated signature for each of them, as AV products often do

  20. Checking for False Positives & Negatives We chose ~500 Win32 samples from families of size 10 or more.

  21. Fivefold Cross Validation Test

  22. Checking Against Benign Samples • We randomly picked a Windows 7 system folder containing over 400 executables, ranging in size from 8 KB to over 70 KB. • The false-positive rate decreases predictably as the distance threshold is reduced.

  23. The Need for Identifying Sub Families [Figure: Malware Library Distance Matrix covering the families Bagle, Klez, Mydoom, Mimail, Roron, and Netsky.]

  24. Request Latency

  25. Request Latency (cont’d)

  26. System Overhead

  27. Next Steps • Linear AESA for Faster Signature Library Generation • Identification of Malware Sub Families • Automated Removal of “Redundant” Variants • Family Specific Distance Thresholds • Malware Fragment Matching • Smart Label Matching

  28. Powerhouse Research. Practical Solutions. Applied Communication Sciences and design logo is a registered trademark of TT Government Solutions, Inc.

  29. Backup Slides

  30. Incorporate Data Flow in Signatures • Some malware obfuscations inject redundant, unrelated instructions, which do not affect the malware's external behavior but change its abstract signature • Data flow analysis can help detect and eliminate such instructions • It can help determine direct and indirect data dependencies of relevant instructions, i.e., system/library calls • Programs make such calls to affect their environment: files, network, registry, other processes, etc. • Malware must make such calls to accomplish its goals • Nodes that do not have an influence on any system/library call can then be removed from abstract signatures

  31. The Need for Data Flow Analysis [Figure: Original Malware (if x u = ...; v = ...; else u = -u; v = -v; write(u*v);) and Obfuscated Code (if x u = ...; v = ...; y = 1024; if !x u = -u; v = -v; while (y > 0) y -= u*v; write(u*v);), each with its flow graph, together with the Control Flow Based Abstract Signature of Original Malware and the Control Flow Based Abstract Signature of Malware Variant.]

  32. The Need for Data Flow Analysis (cont’d) [Figure: Obfuscated Code (if x u = ...; v = ...; y = 1024; if !x u = -u; v = -v; while (y > 0) y -= u*v; write(u*v);) with the Control Flow Based Abstract Signature of Malware Variant, the Control & Data Flow Based Abstract Signature of Malware Variant, and the Control & Data Flow Based Abstract Signature of Original Malware.]
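The idea on the data flow slides, keeping only statements that influence a system/library call, can be sketched as a backward closure over data dependencies. This is a deliberately simplified illustration and not the presenters' analysis: it ignores statement order and control dependence, and the statement encoding (name, defined variables, used variables) and statement names are assumptions. The statements model the obfuscated code from the slides.

```python
def relevant_statements(stmts, syscalls):
    """stmts: list of (name, defs, uses). Returns the system/library calls
    plus every statement that defines a variable they depend on."""
    relevant = set(syscalls)
    needed = set()
    for name, defs, uses in stmts:
        if name in syscalls:
            needed |= uses
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for name, defs, uses in stmts:
            if name not in relevant and defs & needed:
                relevant.add(name)
                needed |= uses
                changed = True
    return relevant


# The obfuscated code from the slides, one entry per statement.
obfuscated = [
    ("u1", {"u"}, set()),               # u = ...
    ("v1", {"v"}, set()),               # v = ...
    ("y0", {"y"}, set()),               # y = 1024
    ("u2", {"u"}, {"u"}),               # u = -u
    ("v2", {"v"}, {"v"}),               # v = -v
    ("loop", {"y"}, {"y", "u", "v"}),   # while (y > 0) y -= u*v
    ("write", set(), {"u", "v"}),       # write(u*v)
]
kept = relevant_statements(obfuscated, syscalls={"write"})
```

The injected statements that only touch y fall outside the result, since write(u*v) never reads y.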

  33. Determining the Nearest Neighbor • Goal: Given (1) a database, D, of malware binaries, whose distances from one another are known, and (2) a new, query binary, q, which is not in D, find the nearest neighbor, n, of q in D. • Naïve Method: (1) Compute q’s distance from every malware in D. (2) The one with the shortest distance is the nearest neighbor of q. • Problem: Computing edit distances between graphical structures, including trees, is an expensive operation. • Solution: Exploit the triangle inequality to avoid many distance computations.

  34. Determining the Nearest Neighbor (cont’d) [Figure: the query binary among the malware binaries in D. Key: New, query binary; Malware binary whose distance from the query binary has been computed; Malware binary whose distance from the query binary is, currently, unknown; Malware binary whose distance from the query binary is to be computed next; Currently known nearest malware neighbor of the query binary; Malware binary that has been removed from further consideration as it cannot possibly be the nearest neighbor of the query binary in D.]

  35. Exploiting Triangle Inequality [Figure: a triangle with sides a, b, and c; a query q, a candidate r, known points p1, p2, ..., pi, and the current nearest neighbor n.] Each side of a triangle is ≤ the sum of the other two sides! Pick any two sides, say a & b. Suppose b ≥ a: b ≤ a + c ⇒ b – a ≤ c ⇒ c ≥ b – a ⇒ c ≥ |b – a|. Otherwise (b < a): a ≤ b + c ⇒ a – b ≤ c ⇒ c ≥ a – b ⇒ c ≥ |b – a|. Each side of a triangle is ≥ the difference between the other two sides! qr ≥ |qpi – pir| for all pi ⇒ qr ≥ max |qpi – pir| over all pi = lowerbound(qr). If lowerbound(qr) ≥ qn, then qr ≥ qn. There is no need to compute qr!
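The bound above is a one-line computation once the distances from q to some points pi and from those points to r are known. The numbers below are made up for illustration, and the function name is hypothetical.

```python
def lower_bound(q_to_p, p_to_r):
    """lowerbound(qr) = max over i of |q pi - pi r|, given matching lists
    of distances from q to each pi and from each pi to r."""
    return max(abs(qp - pr) for qp, pr in zip(q_to_p, p_to_r))


# Made-up distances: q is 5 and 9 away from p1 and p2; r is 2 and 3 away from them.
bound = lower_bound([5, 9], [2, 3])    # max(|5 - 2|, |9 - 3|)
qn = 4                                 # distance to the current nearest neighbor
can_skip_r = bound >= qn               # r cannot be closer than n
```

Since the bound of 6 is at least qn = 4, the expensive distance between q and r never has to be computed.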

  36. Exploiting Triangle Inequality (cont’d) [Figure: the query q, the current nearest neighbor n at distance d, and candidates pi, pj, pk over successive steps.] • Associate a LowerBound with each pi, and initialize it to zero. • Every time piq is computed: • Update nearest neighbor, n, and the corresponding d • Update LowerBound(pjq) for all j as: LowerBound(pjq) = max(LowerBound(pjq), |pjn – nq|) • if (LowerBound(pjq) > d) then remove pj from further consideration!
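The procedure above can be sketched as an elimination search over a library with precomputed pairwise distances. This is an illustration rather than the presenters' code. It tightens the bounds with each newly computed point pi, not only with the current nearest neighbor n, which is equally valid under the triangle inequality. The selection rule (smallest lower bound first), the names, and the one-dimensional toy data are assumptions.

```python
def nearest_neighbor(query, library, pairwise, distance):
    """library: list of items; pairwise[i][j]: known distance between
    library[i] and library[j]. Returns (index, distance, evaluations)."""
    lower = [0.0] * len(library)
    alive = set(range(len(library)))
    best, best_d, computed = None, float("inf"), 0
    while alive:
        # Pick the most promising candidate; ties are broken by index.
        i = min(alive, key=lambda j: (lower[j], j))
        alive.remove(i)
        d_i = distance(query, library[i])
        computed += 1
        if d_i < best_d:
            best, best_d = i, d_i
        for j in list(alive):
            lower[j] = max(lower[j], abs(pairwise[j][i] - d_i))
            if lower[j] > best_d:
                alive.remove(j)        # pj cannot be the nearest neighbor
    return best, best_d, computed


# Toy metric space: points on a line with absolute difference as the distance.
points = [0, 10, 20, 30, 40]
dist = lambda a, b: abs(a - b)
table = [[dist(a, b) for b in points] for a in points]
index, d, evaluations = nearest_neighbor(12, points, table, dist)
naive_index = min(range(len(points)), key=lambda k: dist(12, points[k]))
```

On this toy input the search returns the same neighbor as the naive scan while evaluating fewer distances than the five a full scan would need.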

