Crawling Rich Internet Applications: The State of the Art

Software Security Research Group (SSRG), University of Ottawa, in collaboration with IBM • Suryakant Choudhary, M. Emre Dincturk, Seyed M. Mirtaheri, Ali Moosavi, Gregor von Bochmann, Guy-Vincent Jourdan, Iosif Viorel Onut • CASCON 2012, November 5, 2012

Overview • Introduction • The Evolution of Web Applications and Crawling • Crawling RIAs • Challenges and Common Assumptions • Research on Crawling RIAs • Crawling for Indexing • Crawling for Testing • Research on Crawling Strategies • Greedy Strategy • Model-based Crawling • Hypercube Strategy • Probability Strategy • Menu Strategy • Experimental Results • Future of RIA Crawling

Introduction - Traditional Web Applications • HTML pages identified by a URL • Synchronous communication

Introduction - Rich Internet Applications (1) • Client-side code (JavaScript) execution • The page can be modified by the client-side code. • Document Object Model (DOM): A tree data structure that represents the page in the client. • Events: Occurrences that cause code execution (mouse click, timeout, etc.)

Introduction - Rich Internet Applications (2) • Asynchronous Communication (AJAX)

Introduction - Crawling (1) • Crawling: Exploring an application automatically • Motivations • Content indexing (by search engines) • Testing (for security, accessibility, functionality) • Objectives • Find all (or ‘important’) pages • Find the connections between the pages (obtaining a complete model of the application, for example for page ranking)

Introduction - Crawling (2) • Crawling extracts “a model” of the application • States are the “distinct” pages • Transitions are the connections between the states
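The model described on this slide can be pictured as a directed graph. The following is a minimal sketch, not the authors' implementation; the class name `CrawlModel`, the state labels, and the event names are hypothetical.

```python
from collections import defaultdict


class CrawlModel:
    """Directed graph: nodes are states, edges are event executions."""

    def __init__(self, initial_state):
        self.initial_state = initial_state
        self.states = {initial_state}
        # transitions[source][event] = target
        self.transitions = defaultdict(dict)

    def add_transition(self, source, event, target):
        """Record that executing `event` in `source` led to `target`.

        Returns True if `target` had not been seen before.
        """
        is_new = target not in self.states
        self.states.add(source)
        self.states.add(target)
        self.transitions[source][event] = target
        return is_new


# Hypothetical two-state application.
model = CrawlModel("s0")
found_new = model.add_transition("s0", "click_next", "s1")
found_again = model.add_transition("s1", "click_back", "s0")
```

Here `found_new` is true because `s1` is a newly discovered state, while `found_again` is false because the second event only leads back to a known state.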

Crawling RIAs • RIAs have events that change the page without changing the URL. • One URL -> many states • The aim is to find all the states reachable from a given URL by executing events. • The Initial State: The state reached by loading the URL. • Reset: Loading the URL to go back to the initial state. • An event’s behaviour may depend on the state in which it is executed, so in each state we have to execute all the events enabled in that state.
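The exhaustive exploration described here can be sketched against a simulated application. This is an illustration under strong assumptions: a dictionary stands in for the real application (a real crawler would drive a browser), the function name `crawl` and the states `s0`, `s1` are hypothetical, and the crawler naively resets before every event, which is deliberately wasteful.

```python
def crawl(app, initial):
    """Explore every enabled event in every reachable state.

    `app[state]` maps each enabled event to the resulting state
    (a simulated RIA, not a real browser session).
    """
    model = {}               # model[state][event] = target
    paths = {initial: []}    # one known event sequence from the initial state
    pending = [initial]      # discovered states whose events are unexplored
    events_executed = 0
    resets = 0
    while pending:
        state = pending.pop(0)
        model[state] = {}
        for event in app[state]:
            # Reset (reload the URL), then replay the path back to `state`.
            resets += 1
            events_executed += len(paths[state])
            # Execute the event under exploration.
            target = app[state][event]
            events_executed += 1
            model[state][event] = target
            if target not in paths:
                paths[target] = paths[state] + [event]
                pending.append(target)
    return model, events_executed, resets


# Hypothetical application with two states and three events.
app = {"s0": {"a": "s1", "b": "s0"}, "s1": {"c": "s0"}}
model, events_executed, resets = crawl(app, "s0")
```

Even on this tiny example the crawler needs three resets and four event executions to explore three events, which is the kind of overhead a better strategy tries to reduce.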

Crawling RIAs – Challenges and Assumptions • State Identification • A state needs to be identified by its DOM. • A DOM Equivalence Relation is needed. • Event Identification • Assumption: No Server-side States • Assumption: Finite Representative User Inputs • Intermediate States • Efficiency of Crawling Strategies
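One simple way to picture a DOM equivalence relation is to normalize the serialized DOM and hash it, so that two DOMs are equivalent when their hashes match. This sketch is an assumption for illustration only: the slides do not say which relation is used, and the function name `state_id` is hypothetical. A practical relation would usually also ignore irrelevant content such as timestamps or advertisements.

```python
import hashlib
import re


def state_id(dom_html: str) -> str:
    """Return an identifier that is equal for whitespace-equivalent DOMs."""
    # Drop whitespace between adjacent tags, then collapse remaining runs.
    normalized = re.sub(r">\s+<", "><", dom_html)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


same = state_id("<div>\n  <p>Hi</p>\n</div>") == state_id("<div><p>Hi</p></div>")
different = state_id("<div><p>Hi</p></div>") != state_id("<div><p>Bye</p></div>")
```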

Crawling RIAs for Indexing • Duda et al. [1][2][3] • Use a Breadth-First crawling strategy • Introduced AjaxRank [3]: An adaptation of PageRank to RIAs, used to sort the results of a search query • Mesbah et al. [4][5] introduced “Crawljax” • Uses a strategy similar to Depth-First • Outputs static HTML snapshots of the discovered DOMs, which can be indexed by search engines

Crawling RIAs for Testing • Crawljax is also used for testing RIAs • Regression Testing of Ajax Applications [6] • Security Testing of web widget interactions [7] • Invariant-based Testing of Ajax Applications [8] • Marchetto et al. [9] testing to reveal faulty behaviour • Combines analysis of user traces, static analysis of the code and human validation to produce a model of an application • Amalfitano et al. [10] [11] [12] [13] • Modeling and testing based on user execution traces obtained by • User sessions and/or • Automated trace generation using a Depth-First strategy

Crawling Strategies for RIAs • Crawling Strategy: An algorithm that decides which event should be explored next. • An efficient strategy discovers the states as soon as possible (our definition). • Time to find all the states ~ the number of events executed and the resets used during crawling • The standard strategies used in the research mentioned above, Breadth-First and Depth-First, are not efficient for RIAs. • They make no predictions about event outcomes. • They impose a strict order of state exploration, which increases the number of event executions and resets needed to transfer from the current state to the state being explored.

Research on Crawling Strategies for RIAs • Greedy Strategy [14] • A simple strategy that gives priority to the event closest to the current state • Tries to minimize the transfer sequences but still no prediction of event outcomes • Model-Based Crawling Strategies • Hypercube Strategy [15] • Probability Strategy [16] • Menu Strategy [17]
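The Greedy Strategy's choice of "the event closest to the current state" can be sketched as a breadth-first search over the transitions discovered so far. This is a simplified sketch rather than the published algorithm [14]: it ignores resets as a way of transferring, and the function name `greedy_next` and the data layout are assumptions.

```python
from collections import deque


def greedy_next(current, transitions, unexecuted):
    """Find the closest state that still has an unexecuted event.

    `transitions[state][event] = target` holds the known model;
    `unexecuted[state]` is the set of events not yet executed there.
    Returns (transfer_sequence, state, event), or None if nothing is left.
    """
    queue = deque([(current, [])])
    visited = {current}
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            # Pick deterministically among the unexecuted events.
            return path, state, sorted(unexecuted[state])[0]
        for event, target in transitions.get(state, {}).items():
            if target not in visited:
                visited.add(target)
                queue.append((target, path + [event]))
    return None


# Hypothetical partial model: s1 is one event away, s2 is two events away.
transitions = {"s0": {"a": "s1"}, "s1": {"b": "s2"}}
unexecuted = {"s1": {"d"}, "s2": {"c"}}
choice = greedy_next("s0", transitions, unexecuted)
```

In this example the search returns the transfer sequence `["a"]` and the event `d` in `s1`, because `s1` is closer to the current state than `s2`.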

Model-Based Crawling • Meta-model: assumed structure of the application • Crawling strategy is optimized for the case that the application follows these assumptions • Adaptation of the strategy: the crawling strategy must be able to deal with applications that do not satisfy these assumptions

The Hypercube Strategy • The Hypercube Meta-Model assumes that the application has a hypercube model. • The Hypercube strategy is an “optimal” strategy for this meta-model. • [Figure: example of a 4-dimensional hypercube]
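To make the 4-dimensional example concrete, the sketch below builds a hypercube model under one assumed reading of the meta-model: each state corresponds to the set of events executed so far, and the order of execution does not matter. The function name `hypercube_model` and the event names are hypothetical.

```python
from itertools import combinations


def hypercube_model(events):
    """Build the states and transitions of a hypercube over `events`."""
    # One state per subset of events (the events executed so far).
    states = [
        frozenset(subset)
        for size in range(len(events) + 1)
        for subset in combinations(events, size)
    ]
    # Executing a not-yet-executed event adds it to the subset.
    transitions = [
        (state, event, state | {event})
        for state in states
        for event in events
        if event not in state
    ]
    return states, transitions


states, transitions = hypercube_model(["e1", "e2", "e3", "e4"])
print(len(states), len(transitions))  # prints "16 32"
```

With four events this gives 2^4 = 16 states and 4 * 2^3 = 32 transitions, which shows how quickly the state space grows with the number of independent events.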

The Probability Strategy • Prioritizes events based on their probability of discovering a new state • N(e) = number of executions of event e • S(e) = number of new states found by e • P(e) is given by a Bayesian formula with parameters pS and pN that set the initial probability • pS = 1 and pN = 2 -> initial probability = 0.5 • Aim: Choose an event e to explore such that • P(e) is high • The transfer sequence from the current state to a state where e is unexecuted is short
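The formula itself did not survive in this transcript. The sketch below assumes the form P(e) = (S(e) + pS) / (N(e) + pN), which is consistent with the stated values (pS = 1, pN = 2, initial probability 0.5); treat it as a reconstruction, and the function name `event_probability` as hypothetical.

```python
def event_probability(new_states, executions, p_s=1.0, p_n=2.0):
    """Estimate the probability that an event discovers a new state.

    Assumed form: (S(e) + pS) / (N(e) + pN).
    """
    return (new_states + p_s) / (executions + p_n)


initial = event_probability(0, 0)      # never executed
productive = event_probability(3, 4)   # found new states 3 times out of 4
barren = event_probability(0, 4)       # never found a new state in 4 tries
```

Under this assumption an event that keeps finding new states moves above 0.5, while one that never does drifts towards 0 without ever reaching it.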

The Menu Strategy • The Menu Meta-Model defines three categories of events: • 1. Menu Event: Leads to the same state regardless of the state in which it is executed. (e1 and e2) • 2. Self-Loop Event: Does not cause any state change. (e3) • 3. Other Event: An event that is neither of the above. • [Figure: simple example with events e1, e2 and e3]
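The three categories can be illustrated by classifying an event from the executions observed so far. This is a sketch of the categories only, not the Menu strategy itself [17]; the function name `classify_event` is hypothetical, and a classification based on few observations is a working hypothesis that later executions may contradict.

```python
def classify_event(observations):
    """Classify one event from its observed executions.

    `observations` is a list of (source_state, target_state) pairs.
    """
    if not observations:
        return "unknown"
    if all(source == target for source, target in observations):
        return "self-loop"   # never changed the state
    targets = {target for _, target in observations}
    if len(targets) == 1:
        return "menu"        # same target from every source seen so far
    return "other"


menu = classify_event([("s0", "s1"), ("s2", "s1")])
loop = classify_event([("s0", "s0"), ("s1", "s1")])
other = classify_event([("s0", "s1"), ("s1", "s2")])
```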

Experimental Results - Strategies • We compare the performance of the model-based strategies with • The Optimized Breadth-First Strategy • The Optimized Depth-First Strategy • The Greedy Strategy (explore the event closest to the current state) • Optimal (calculated when the model is known)

Experimental Results - Applications • Real Applications • Periodic Table (Local version: http://ssrg.eecs.uottawa.ca/periodic/) • Clipmarks (Local version: http://ssrg.eecs.uottawa.ca/clipmarks/) • Test Applications • TestRIA ( http://ssrg.eecs.uottawa.ca/TestRIA/) • Altoro Mutual (http://www.altoromutual.com/)

Experimental Results – Measuring Efficiency • Efficiency of a strategy is measured by the cost of discovering the states, which is based on the number of events executed and the resets used. • Before crawling an application we measure the average event execution time and the average reset time for the application. • For simplicity, we assume each event has the same unit cost (which is the average event execution time). • The cost of a reset is defined in terms of the event execution cost. • The cost of a strategy is calculated as (#events executed) + (#resets used) * (cost of reset)
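The cost formula on this slide translates directly into code. The function names and the numbers below are hypothetical and are not measurements from the experiments.

```python
def reset_cost(avg_reset_time, avg_event_time):
    """Express the cost of a reset in units of one event execution."""
    return avg_reset_time / avg_event_time


def crawl_cost(events_executed, resets_used, cost_of_reset):
    """Cost = (#events executed) + (#resets used) * (cost of reset)."""
    return events_executed + resets_used * cost_of_reset


# Hypothetical timings: a reset takes 900 ms, an average event takes 50 ms.
unit = reset_cost(900, 50)
total = crawl_cost(1000, 10, unit)
```

With these illustrative numbers a reset costs as much as 18 events, so 1000 events and 10 resets give a total cost of 1180 event units.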

Experimental Results – Crawling Efficiency • [Figure: four plots of crawling cost, with panels labelled Cost of Reset 18, Cost of Reset 2, Cost of Reset 8 and Cost of Reset 2] • Plots are in logarithmic scale.

Future of RIA Crawling • Avoid New States Without New Information • Automatically identify the parts of a page that can be crawled independently to reduce the state space explosion • Adaptive Crawling • Decide the meta-model for the application during the crawl • Greater Diversity • Try to get a bird's-eye view of the application model as soon as possible • Distributed Crawling • Crawl applications using multiple processes running concurrently to reduce crawling time

References [1] C. Duda, G. Frey, D. Kossmann, R. Matter, and C. Zhou, “Ajax crawl: Making ajax applications searchable,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89, IEEE Computer Society, 2009. [2] C. Duda, G. Frey, D. Kossmann, and C. Zhou, “Ajax search: crawling, indexing and searching web 2.0 applications,” Proc. VLDB Endow., vol. 1, pp. 1440–1443, Aug. 2008. [3] G. Frey, “Indexing ajax web applications,” Master’s thesis, ETH Zurich, 2007. [4] A. Mesbah, E. Bozdag, and A. v. Deursen, “Crawling ajax by inferring user interface state changes,” in Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134, IEEE Computer Society, 2008. [5] A. Mesbah, A. van Deursen, and S. Lenselink, “Crawling ajax-based web applications through dynamic analysis of user interface state changes,” TWEB, vol. 6, no. 1, p. 3, 2012. [6] D. Roest, A. Mesbah, and A. van Deursen, “Regression testing ajax applications: Coping with dynamism,” in ICST, pp. 127–136, IEEE Computer Society, 2010. [7] C.-P. Bezemer, A. Mesbah, and A. van Deursen, “Automated security testing of web widget interactions,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, ESEC/FSE ’09, 2009. [8] A. Mesbah and A. van Deursen, “Invariant-based automatic testing of ajax user interfaces,” in Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pp. 210–220, May 2009. [9] A. Marchetto, P. Tonella, and F. Ricca, “State-based testing of ajax web applications,” in Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130, IEEE Computer Society, 2008.

References [10] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Reverse engineering finite state machines from rich internet applications,” in Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73, IEEE Computer Society, 2008. [11] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “Rich internet application testing using execution trace data,” in Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283, IEEE Computer Society, 2010. [12] D. Amalfitano, A. R. Fasolino, and P. Tramontana, “An iterative approach for the reverse engineering of rich internet application user interfaces,” in Proceedings of the 2010 Fifth International Conference on Internet and Web Applications and Services, ICIW ’10, pp. 401–410, IEEE Computer Society, 2010. [13] D. Amalfitano, A. R. Fasolino, A. Polcaro, and P. Tramontana, “Dynaria: A tool for ajax web application comprehension,” in ICPC, pp. 46–47, IEEE Computer Society, 2010. [14] Z. Peng, N. He, C. Jiang, Z. Li, L. Xu, Y. Li, and Y. Ren, “Graph-based ajax crawl: Mining data from rich internet applications,” in Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on, vol. 3, pp. 590–594, March 2012. [15] K. Benjamin, G. v. Bochmann, M. E. Dincturk, G.-V. Jourdan, and I. V. Onut, “A strategy for efficient crawling of rich internet applications,” in Proceedings of the 11th international conference on Web engineering, ICWE ’11, 2011. [16] M. E. Dincturk, S. Choudhary, G. v. Bochmann, G.-V. Jourdan, and I. V. Onut, “A statistical approach for efficient crawling of rich internet applications,” in Proceedings of the 12th international conference on Web engineering, ICWE ’12, 2012. [17] S. Choudhary, “M-crawler: Crawling rich internet applications using menu meta-model,” Master’s thesis, EECS, University of Ottawa, 2012.

Thank You
