120 likes | 334 Views
The Data Mining Visual Environment. Motivation Major problems with existing DM systems They are based on n on-extensible frameworks .
E N D
The Data Mining Visual Environment Motivation Major problems with existing DM systems They are based on non-extensible frameworks. They provide a non-uniform mining environment - the user is presented withtotally different interface(s) across implementations of different DM techniques. Major needs An overall framework that can support the entire Knowledge Discovery (KD) process (accommodate and integrate all KD phases seamlessly). Placing the user at the center of the entire KD process/in the framework. In fact the corresponding system should provide a consistent, uniform and flexible visual interaction environment that supports the user throughout the entire discovery process.
The Data Mining Visual Environment • Primary layers • User layer • Engine layer • Data layer • Main features • Open • Modular with • well defined • modification/extension points • Possible integration of • different tasks • eg output reuse by another task • User flexibility and enablement to: • process data and knowledge, • drive andguide the entire KD • process System Architecture At present, there is a partial prototype, complete implementation is underway.
The Data Mining Visual Environment • Primary layers • User Interface/GUIContainer • (interacts with specific DM • visual environments) • Developed using Java • Abstract DM Engine/Wrapper of • DM algorithms (interacts with • specific DM algorithms) • Developed using Java • Note • The DM algorithms may be • implemented by third parties • in possibly any language. • DM methods (but not limited to): • MQs, ARs, and clustering System Architecture: The Prototype Stephen Stefano
The Data Mining Visual Environment Visual Environment A consistent, uniform, flexible and intuitiveGUI, with support throughout the whole DM process. The principal focus is to support the user in: Visual construction of the task relevant dataset: The user directly interacts with data. For this task, there are two intuitive interaction spaces. Visual construction of the mining query: The user directly interacts with dataand other parameters (e.g. threshold values) in making queries e.g., in the Metaquery Environment, the user can suggest patterns by linking attributes, while the Association Rule Environment offers ‘visual baskets’. Visual output presentation and interaction: Exploiting relevant effective visualizations and where necessary, we have designed novel visualizations. Planning: E.g.,‘advertising’ relevant prior knowledge. Handling the non-static nature of user’s quest:E.g., enabling user to adjust.
The Data Mining Visual Environment Visual Environment: Overall
The Data Mining Visual Environment Visual Environment: Tree View (‘Progress Companion’) Before user settings After user settings After DM results
The Data Mining Visual Environment Visual Environment: Clustering -Input
The Data Mining Visual Environment Visual Environment: Clustering -Output ... for more on the prototype, demo
The Data Mining Visual Environment Usability Usability heuristics: Done, butregular reference to the same will go on. Mock-up tests: Done with DM experts. The experts gave an encouraging feedback and even suggestions on how to improve the interface. (These tests were done at the end of 2001.) Questionnaire experiments: The experiments involved: the application simulation, a case study, data schema and user tasks corresponding to the case study, and a questionnaire. Positive interface features: consistency, layout/organization, visual exploration. Negative interface features: size of some visual elements small/big (These tests were done in July 2002.) Formal usability tests: In the pipeline.
The Data Mining Visual Environment • The Clustering Engine • Clustering method: Generalizations of three techniques:homogeneity, separation, density. • Clustering based on homogeneity/separation: Homogeneity (separation) is a global measure of the similarity between points belonging to the same cluster (to different clusters) • Clustering based on density: Clusters are regions of the object space where objects are located “most frequently” • Clustering based: The system selects the “best” clustering according to a cost function • For homogeneity/separation-based clustering the cost function is computed by evaluating pointwise, clusterwise, and partitionwise similarity/dissimilarity • For density-based clustering, the cost function is derived from an estimated density function
The Data Mining Visual Environment • Formal Semantics of the Input Environment • Visual language: abstract syntax + semantics • Abstract syntax: defined in terms of multi-graphs • Visual components are vertices of the multi-graph • Spatial relations between visual components are edges of the multi-graph • Semantics: • Clustering: defined by a mapping between multi-graphs and cost functions and predicates expressing optimality • Metaqueries/association rules: defined by a mapping between multi-graphs and rules
The Data Mining Visual Environment • Operational Specification • Concrete, high-level syntax of the tasks proposed in the usability tests • Describes “legal” click-streams allowed to occur during operation • Standard grammar notation • Communication protocol between the abstract clustering engine and the data mining engines • XML DTDs based on PMML 2.0 • Extension of PMML 2.0 to: • Specification of input • Broader spectrum of clustering methods • Concrete semantics of the clustering task by mapping on symbols in the tasks grammar to structures of the communication protocol • Interpretation function recursively defined on the grammar rules of the high-level syntax of tasks • The interpretation of a legal click-stream is an XML document satisfying the DTD of the input specification