
A Statistical Approach for Efficient Crawling of Rich Internet Applications


Presentation Transcript


  1. A Statistical Approach for Efficient Crawling of Rich Internet Applications M. E. Dincturk, S. Choudhary, G. v. Bochmann, G.-V. Jourdan, I. V. Onut, University of Ottawa, Canada. Presentation given at the International Conference on Web Engineering (ICWE), Berlin, 2012

  2. Abstract • Modern web technologies like AJAX result in more responsive and usable web applications, sometimes called Rich Internet Applications (RIAs). Traditional crawling techniques are not sufficient for crawling RIAs. We present a new strategy for crawling RIAs. This new strategy is based on the concept of “Model-Based Crawling” introduced in [3] and uses statistics accumulated during the crawl to select what to explore next with a high probability of uncovering new information. The performance of our strategy is compared with our previous strategy, as well as with the classical Breadth-First and Depth-First strategies, on two real RIAs and two test RIAs. The results show that this new strategy is significantly better than the Breadth-First and Depth-First strategies (which are widely used to crawl RIAs), and outperforms our previous strategy while being much simpler to implement.

  3. SSRG Members University of Ottawa • Prof. Guy-Vincent Jourdan • Prof. Gregor v. Bochmann • Suryakant Choudhary (Master's student) • Emre Dincturk (PhD student) • Khaled Ben Hafaiedh (PhD student) • Seyed M. Mir Taheri (PhD student) • Ali Moosavi (Master's student) In collaboration with Research and Development, IBM® Rational® AppScan® Enterprise • Iosif Viorel Onut (PhD)

  4. Overview • Background • The evolving Web - Why crawling • RIA crawling • State identification - State equivalence • Crawling strategies • Crawling objectives - Breadth-first and Depth-first - Model-based strategies • Statistical approach • Probabilistic crawling strategy - Experimental results • On-going work and conclusions

  5. The evolving Web • Traditional Web • static HTML pages identified by a URL • HTML pages dynamically created by the server, identified by a URL with parameters • Rich Internet Applications (Web 2.0) • pages contain executable code (e.g. JavaScript, Silverlight, Adobe Flex...); executed in response to user interactions or time-outs (so-called events); a script may change the displayed page (the “state” of the application changes) – with the same URL. • AJAX: a script may interact asynchronously with the server to update the page.

  6. Why crawling • Objective A: find all (or all “important”) pages • for content indexing • for security testing • for accessibility testing • Objective B: find all links between pages • for ranking pages, e.g. Google ranking in search queries • for building a graph model of the application • pages (or application states) are nodes • links (or events) are edges between nodes

  7. IBM Security Solutions IBM Rational AppScan Enterprise Edition Product overview

  8. IBM Rational AppScan Suite – Comprehensive Application Vulnerability Management [Diagram: the AppScan products (AppScan Source, AppScan Build, AppScan Tester, AppScan Standard, AppScan Enterprise, AppScan onDemand, AppScan Reporting Console) mapped onto the development lifecycle phases Requirements, Code, Build, QA, Pre-Prod and Production: security requirements defined before design & implementation; security testing built into the IDE; automated security / compliance testing in the build process; security / compliance testing incorporated into testing & remediation workflows; security & compliance testing, oversight, control, policy, audits; outsourced testing for security audits & production site monitoring. Application Security Best Practices – Secure Engineering Framework.]

  9. View detailed security issues reports [Screenshot: security issues identified with static analysis and with dynamic analysis, aggregated and correlated results, remediation tasks, security risk assessment.]

  10. RIA Crawling • Difference from the traditional web • The HTML DOM structure returned by the server in response to a URL may contain scripts. • When an event triggers the execution of a script, the script may change the DOM structure – which may lead to a new display and a new set of enabled events, that is, a new state of the application. • Crawling means: • finding all URLs that are part of the application, plus • for each URL, finding all states reached (from this “seed” URL) by the execution of any sequence of events • Important note: only the “seed” states are directly accessible by a URL

  11. Difficulties for crawling RIAs • State identification • A state cannot be identified by a URL. • Instead, we consider that the state is identified by the current DOM in the browser. • Most links (events) do not contain a URL • An event included in the DOM may not explicitly identify the next state reached when this event is executed. • To determine the state reached by such an event, we have to execute that event. • In traditional crawling, the event contains the URL (the identification) of the next state reached; this is not the case for RIA crawling.

  12. Important consequence • For a complete crawl (a crawl that ensures that all states of the application are found), the crawler has to execute all events in all states of the application • since for any of these events, we do not know, a priori, whether its execution in the current state will lead to a new state or not. • Note: in the case of traditional web crawling, it is not necessary to execute all events on all pages; it is sufficient to extract the URLs from these events, and get the page for each URL only once.
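
To make the requirement concrete, here is a minimal sketch of a crawler that executes every event in every state. The browser interface (reset, events, execute, state_id) is a hypothetical stand-in, not the authors' actual crawler; the naive termination rule is flagged in a comment.

```python
# Minimal sketch: a complete crawl must execute every event in every state,
# since we cannot know in advance whether an event leads to a new state.
# The `browser` object and its methods are illustrative assumptions.

def complete_crawl(browser):
    executed = {}  # state id -> set of events already executed there

    while True:
        browser.reset()                    # reload the "seed" URL
        state = browser.state_id()
        progressed = False
        while True:
            pending = [e for e in browser.events()
                       if e not in executed.setdefault(state, set())]
            if not pending:
                break                      # no unexecuted event in this state
            event = pending[0]
            executed[state].add(event)
            browser.execute(event)         # may or may not reach a new state
            state = browser.state_id()
            progressed = True
        if not progressed:
            # Naive termination: a real crawler would instead navigate
            # through known transitions to reach remaining unexecuted events.
            return executed
```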

  13. RIA: Need for DOM equivalence • A given page often contains information that changes frequently, e.g. advertising or time-of-day information. This information is usually of no importance for the purpose of crawling. • In the traditional web, the page identification (i.e. the URL) does not change when this information changes. • In RIAs, states are identified by their DOM. Therefore similar states with different advertising would be identified as different states (which leads to a too-large state space). • We would like to have a state identifier that is independent of the unimportant changing information. • We introduce a DOM equivalence, and all states with equivalent DOMs have the same identifier.

  14. DOM equivalence • The DOM equivalence depends on the purpose of the crawl. • In the case of security testing, we are not interested in the textual content of the DOM; • however, this is important for content indexing. • The DOM equivalence relation is realized by a DOM reduction algorithm which produces (from a given DOM) a reduced canonical representation of the information that is considered relevant for the crawl. • If the reduced DOMs obtained from two given DOMs are the same, then the given DOMs are considered equivalent, that is, they represent the same application state (for this purpose of the crawl).

  15. Form of the state identifiers • The reduced DOM could be used as the state identifier; • however, it is quite voluminous • we have to store the application model in memory during its exploration; each edge in the graph contains the identifiers of the current and next states. • Condensed state identifier: • a hash of the reduced DOM • used to check whether a state obtained after the execution of some event is a new state or a known one • The crawler also stores for each state the list of events included in the DOM, and whether they have been executed or not • used to select the next event to be executed during the crawl
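
As an illustration of DOM reduction and hashed state identifiers, here is a minimal sketch. The reduction rule (dropping text nodes, as one might for security testing) and the helper names are assumptions for the example, not the authors' actual algorithm.

```python
import hashlib
import re

def reduced_dom(dom_html: str) -> str:
    # Illustrative reduction for a security-testing crawl: drop text between
    # tags (e.g. ads, clocks) and normalize whitespace. A content-indexing
    # crawl would instead keep the text.
    no_text = re.sub(r">[^<]+<", "><", dom_html)
    return re.sub(r"\s+", " ", no_text).strip()

def state_id(dom_html: str) -> str:
    # Hash of the reduced DOM: a compact identifier, cheap to store on
    # every edge of the application model.
    return hashlib.sha1(reduced_dom(dom_html).encode("utf-8")).hexdigest()

# Two DOMs that differ only in text (e.g. a clock) map to the same state:
a = "<div id='menu'><span>10:41 am</span></div>"
b = "<div id='menu'><span>10:42 am</span></div>"
assert state_id(a) == state_id(b)
```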

  16. Crawling Strategies for RIAs • Most work on crawling RIAs does not aim to build a complete model of the application • Some approaches use standard strategies, such as Depth-First and Breadth-First, for building complete models • We have developed more efficient strategies based on the assumed structure of the application (“model-based strategies”, see below)

  17. Disadvantages of standard strategies • Breadth-First: • No long sequences of event executions • Very many resets • Depth-First: • Advantage: long sequences of event executions • Disadvantage: when reaching a known state, the strategy takes a path back to a specific previous state for further event exploration. This path through known edges is often long and may involve a reset (overhead) – going back to another state with non-executed events may be much more efficient.

  18. Comparing crawling strategies • Objectives • Complete crawl: given enough time, the strategy terminates the crawl when all states of the application have been found. • Efficiency of finding states – “finding states fast”: if the crawl is terminated by the user before a complete crawl is attained, the number of discovered states should be as large as possible. • For many applications, a complete crawl cannot be obtained within a reasonable length of time. • Therefore the second objective is very important.

  19. Comparing efficiency of finding states [Chart: number of states discovered (129 in total) vs. cost (number of event executions + reset cost), on a logarithmic scale; the compared strategies complete at costs of 11233, 11803, 19128 and 28710.] Note: this is for a specific application; such comparisons should be done for many different types of applications.

  20. Model-based Crawling Idea: • Meta-model: an assumed structure of the application • The crawling strategy is optimized for the case that the application follows these assumptions • The crawling strategy must be able to deal with applications that do not satisfy these assumptions

  21. State and transition exploration phases • State exploration phase • finding all states, assuming that the application follows the assumptions of the meta-model • Transition exploration phase • executing, in all known states, all remaining events (those not executed during the state exploration phase) • Order of execution • Start with state exploration; then transition exploration • If new states are discovered during the transition phase, go back to the state exploration phase, etc.
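
The alternation between the two phases can be summarized by a short control loop. This is a sketch in which the two phase functions are placeholders for the strategy-specific logic, not the authors' actual code.

```python
# Sketch of the two-phase control loop. The phase functions are supplied
# by the concrete strategy (e.g. the probability strategy below).

def crawl(model, state_exploration_phase, transition_exploration_phase):
    while True:
        # Phase 1: look for new states, guided by the meta-model assumptions.
        state_exploration_phase(model)
        # Phase 2: execute every event not yet executed in a known state;
        # assumed here to return True if any execution uncovered a new state.
        if not transition_exploration_phase(model):
            return model   # all states found and all transitions executed
```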

  22. Three meta-models • Hypercube • The state reached by a sequence of events from the initial state is independent of the order of the events. • The enabled events at a state are those at the initial state minus those executed to reach that state. • Menu model • Probability model [Figure: example of a 4-dimensional hypercube]

  23. Probability strategy • Like in the menu model, we use event priorities. • The priority of an event is based on statistical observations (during the crawl of the application) about the number of new states discovered when executing the given event. • The strategy is based on the belief that an event which was often observed to lead to new states in the past will be more likely to lead to new states in the future.

  24. Probability strategy: state exploration • The probability that a given event e finds a new state when executed from the current state is estimated as P(e) = (S(e) + p_S) / (N(e) + p_N) • N(e) = number of executions of e • S(e) = number of new states found by e • A Bayesian formula; with p_S = 1 and p_N = 2 the initial probability is 0.5 • From the current state s, find a locally non-executed event e at a state s’ such that P(e) is high and the path from s to s’ is short • Note: the path from s to s’ goes through events already executed • Question: how to find e and s’?
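
In code, the estimate is a one-liner; this is a direct transcription of the formula above with the stated prior.

```python
# Prior from the slide: p_S = 1, p_N = 2, so an event that has never been
# executed gets the initial probability 0.5.
P_S, P_N = 1, 2

def probability(new_states_found: int, executions: int) -> float:
    """P(e) = (S(e) + p_S) / (N(e) + p_N)."""
    return (new_states_found + P_S) / (executions + P_N)

assert probability(0, 0) == 0.5   # initial probability for an unseen event
print(probability(3, 4))          # an often-productive event: ~0.67
```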

  25. Choosing an event to explore • Def: P(s) = max P(e) over all non-executed events at state s • We want an event that has a high probability of discovering a state and a small distance (in terms of event executions) from the current state [Figure: we are currently at state s; solid edges are already-explored events, dashed ones are yet to be explored; e1 and e2 are the unexplored events with maximum probability at states s1 and s2, respectively, with P(e1) = P(s1) = 0.8 and P(e2) = P(s2) = 0.5. Should we explore e1 or e2 next? Note that although P(e1) is higher, the distance to reach e1 from the current state is also higher.]

  26. Choosing an event to explore (2) • It is not sufficient to compare P(e1) and P(e2) directly, since the distances (the numbers of event executions required to explore e1 and e2) are different. • The number of events executed is the same in the following two cases: • option 1: explore e1 • option 2: explore e2 and, if a known state is reached, explore one more event • The probabilities of finding new states with these options are: • option 1: P = P(e1) • option 2: P = P(e2) + (1 – P(e2)) Pavg • where Pavg is the probability of discovering a new state averaged over all known states (we do not know which state would be reached). [Figure: as before, P(e1) = 0.8 at s1 and P(e2) = 0.5 at s2, seen from the current state s.]
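
A quick worked example with the figure's values; the value of Pavg is an assumption for illustration, not a number from the slides.

```python
# Worked example of the two options; Pavg = 0.7 is assumed for illustration.
p_e1, p_e2, p_avg = 0.8, 0.5, 0.7

option1 = p_e1                           # explore e1 directly
option2 = p_e2 + (1 - p_e2) * p_avg      # explore e2; if a known state is
                                         # reached, explore one average event
print(option1, option2)  # 0.8 vs 0.85: here exploring e2 first is preferable
```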

  27. Choosing an event to explore (3) • In general, two events e1 and e2 (where e2 requires k more steps to be reached) are compared by looking at • P(e1) • 1 – (1 – P(e2))(1 – Pavg)^k • this value is 1 – (probability of not discovering a state by exploring e2 and k more events) • Using this comparison, we decide on a state s_chosen where the next event should be explored • We use an iterative search to find s_chosen • Initialize s_chosen to be the current state • At iteration i: • Let s be the state with maximum probability at distance i from the current state • If s is preferable to s_chosen, update s_chosen to be s

  28. Choosing an event to explore (4) • When do we stop the iteration? • When it is not possible to find a better state than the current s_chosen • How do we know that it is not possible to find a better state? • We know the maximum probability, Pbest, among all unexplored events. • We can stop at a distance d from s_chosen if we have 1 – (1 – P(s_chosen))(1 – Pavg)^d ≥ Pbest • That is, if we cannot find a better state within d steps after the last value of s_chosen, then no other state can be better (since even the best event would not be preferable).
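
Putting the comparison and the stopping criterion together, the iterative search might look as follows. This is a sketch: the model object and its methods (current_state, states_at_distance, P) are hypothetical stand-ins for the crawler's application graph. Comparing 1 – (1 – P(s))(1 – Pavg)^distance(s) between two states is algebraically equivalent to the pairwise comparison on the previous slide.

```python
def utility(p_state: float, p_avg: float, k: int) -> float:
    # Probability of discovering a new state by exploring the best event
    # at the state, plus k further events with average probability.
    return 1 - (1 - p_state) * (1 - p_avg) ** k

def choose_state(model, p_avg: float, p_best: float):
    chosen, chosen_dist = model.current_state, 0
    dist = 0
    while True:
        dist += 1
        candidates = model.states_at_distance(dist)  # states with unexplored events
        if not candidates:
            return chosen
        s = max(candidates, key=model.P)   # state with max P at this distance
        if utility(model.P(s), p_avg, dist) > \
           utility(model.P(chosen), p_avg, chosen_dist):
            chosen, chosen_dist = s, dist
        # Stopping criterion from this slide: stop at distance d beyond
        # s_chosen once even an event with probability P_best could not
        # be preferable to s_chosen.
        if utility(model.P(chosen), p_avg, dist - chosen_dist) >= p_best:
            return chosen
```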

  29. Experiments • We did experiments with the different crawling strategies using the following web sites: • Periodic table (local version: http://ssrg.eecs.uottawa.ca/periodic/) • Clipmarks (local version: http://ssrg.eecs.uottawa.ca/clipmarks/) • TestRIA (http://ssrg.eecs.uottawa.ca/TestRIA/) • Altoro Mutual (http://www.altoromutual.com/)

  30. State Discovery Efficiency – Periodic Table • Cost = number of event executions + R * number of resets [Chart: number of states discovered vs. cost for each strategy, logarithmic scale. The cost of a reset for this application is 8.]

  31. State Discovery Efficiency – Clipmarks [Chart: number of states discovered vs. cost for each strategy, logarithmic scale. The cost of a reset for this application is 18.]

  32. State Discovery Efficiency – TestRIA [Chart: number of states discovered vs. cost for each strategy, logarithmic scale. The cost of a reset for this application is 2.]

  33. State Discovery Efficiency – Altoro Mutual [Chart: number of states discovered vs. cost for each strategy, logarithmic scale. The cost of a reset for this application is 2.]

  34. Results: transition exploration (cost of exploring all transitions) and cost for a complete crawl • Cost = number of event executions + R * number of resets • R = 18 for the Clipmarks web site [Table of costs omitted.]

  35. On-going work • Exploring regular page structures with widgets • reducing the exponential blow-up of combinations • Exploring the structure of mobile applications • Applying similar crawling principles to the exploration of the behavior of mobile applets • Concurrent crawling • To increase the performance of crawling, consider coordinated crawling by many crawlers running on different computers, e.g. in the cloud

  36. Conclusions • RIA crawling is quite different from traditional web crawling • Model-based strategies can improve the efficiency of crawling • We have developed prototypes of these crawling strategies, integrated with the IBM AppScan product

  37. References Background: • Mesbah, A., van Deursen, A. and Lenselink, S., 2011. Crawling Ajax-based Web Applications through Dynamic Analysis of User Interface State Changes. ACM Transactions on the Web (TWEB), 6(1), a23. Our papers: • Dincturk, M.E., Choudhary, S., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., A Statistical Approach for Efficient Crawling of Rich Internet Applications, in Proceedings of the 12th International Conference on Web Engineering (ICWE 2012), Berlin, Germany, July 2012. 8 pages. A longer version of the paper is also available (15 pages). • Choudhary, S., Dincturk, M.E., Bochmann, G.v., Jourdan, G.-V., Onut, I.V. and Ionescu, P., Solving Some Modeling Challenges when Testing Rich Internet Applications for Security, in The Third International Workshop on Security Testing (SECTEST 2012), Montreal, Canada, April 2012. 8 pages. • Benjamin, K., Bochmann, G.v., Dincturk, M.E., Jourdan, G.-V. and Onut, I.V., A Strategy for Efficient Crawling of Rich Internet Applications, in Proceedings of the 11th International Conference on Web Engineering (ICWE 2011), Paphos, Cyprus, July 2011. 15 pages. • Benjamin, K., Bochmann, G.v., Jourdan, G.-V. and Onut, I.V., Some Modeling Challenges when Testing Rich Internet Applications for Security, in First International Workshop on Modeling and Detection of Vulnerabilities (MDV 2010), Paris, France, April 2010. 8 pages. • Dincturk, M.E., Jourdan, G.-V., Bochmann, G.v. and Onut, I.V., A Model-Based Approach for Crawling Rich Internet Applications, submitted to a journal.

  38. Questions?? Comments?? These slides can be downloaded from http://????/RIAcrawlingProb.pptx (*** ssrg server was down)
