Crime Data Mining and Visualization • Nemallapudi Chaitanya, Sunkara Anish, ELGammal Mahmoud
Agenda • Overview • Objective • Motivation • Approach • Design • Mining • Visualization
Overview • Implementation of a Crime Data Mining and Visualization application • Use of mining and visualization techniques • PostGIS • Google Maps API • WEKA • Client-server approach using Java, JavaScript, and XML
Objective • Implement mining methodologies on crime data. • Provide visualization for a better understanding of the data. • The project is based on publicly available dispatch reports from the City of Falls Church, Fairfax County, and Arlington County.
Motivation • Data mining has proven to be a useful methodology for providing analytical insights that traditional methods normally miss. • Because it can draw conclusions from many perspectives, it can be used to: • Identify crime trends and patterns/series. • Assist law-enforcement agencies in resource planning. • Aid the investigation process by offering a different perspective.
Approach • Collect publicly available crime data • Parse the useful data and load it into a database • Use a spatial database to get the coordinates of crime locations, criminal locations, etc. • Use mining algorithms (DBScan, K-Nearest Neighbor and EM) to analyze the trends in the data • Use Google Maps to show the crime data based on location • Use prefuse visualization to show graphs based on the data collected
Input (Data sets) • Quality is an important characteristic of any data • Challenges: • Some attributes are difficult to extract (e.g. "… juvenile BMs 1517yo wearing dark clothing …") • Missing values (the criminal's age is not specified in all descriptions) • In some cases, latitude and longitude values are swapped • Develop a crime ontology
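The swapped latitude/longitude problem can be caught with a simple range check. The following is a minimal sketch, not the project's parser: the class name `CoordinateFixer` and the bounding box for Northern Virginia are assumptions made for illustration.

```java
/** Hypothetical helper: repairs latitude/longitude pairs that were stored in swapped order. */
final class CoordinateFixer {
    // Assumed rough bounding box around Northern Virginia; not taken from the project.
    static final double MIN_LAT = 38.5, MAX_LAT = 39.2;
    static final double MIN_LON = -77.6, MAX_LON = -76.9;

    static boolean inRegion(double lat, double lon) {
        return lat >= MIN_LAT && lat <= MAX_LAT && lon >= MIN_LON && lon <= MAX_LON;
    }

    /**
     * Returns {lat, lon}. The inputs are swapped only when the pair as given
     * falls outside the region and the swapped pair falls inside it.
     */
    static double[] fix(double lat, double lon) {
        if (!inRegion(lat, lon) && inRegion(lon, lat)) {
            return new double[] {lon, lat};
        }
        return new double[] {lat, lon};
    }
}
```

A plain validity check (latitude within ±90) would not be enough here, because a longitude near -77 is also a valid latitude; that is why the sketch compares against a regional box.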
Data Preprocessing • Need to implement dimensionality reduction • Reduce amount of time and memory required by data mining algorithms • Allow data to be more easily visualized • May help to eliminate irrelevant features or reduce noise • Implemented aggregation • 863 crime types were reduced to 45 crime types • Classification of crimes (e.g. Burglary Commercial and Burglary Residential are classified as Burglary) • Crime Information Extraction (XML Parsing)
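The aggregation of raw crime types into broader classes could look roughly like the sketch below. The keyword table is hypothetical (the deck does not show the real 863-to-45 mapping); only the Burglary example comes from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical grouping of raw crime-type labels into broader classes. */
final class CrimeTypeGrouper {
    // Assumed keyword table for illustration; the project's real mapping is not shown.
    private static final Map<String, String> KEYWORD_TO_GROUP = new LinkedHashMap<>();
    static {
        KEYWORD_TO_GROUP.put("BURGLARY", "Burglary");
        KEYWORD_TO_GROUP.put("AUTO THEFT", "Auto Theft");
        KEYWORD_TO_GROUP.put("LARCENY", "Larceny");
    }

    /** Returns the first group whose keyword occurs in the raw label, or "Other". */
    static String group(String rawType) {
        String normalized = rawType.trim().toUpperCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : KEYWORD_TO_GROUP.entrySet()) {
            if (normalized.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "Other";
    }
}
```

With this table, both `Burglary Commercial` and `Burglary Residential` map to `Burglary`, matching the example on the slide.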
System Architecture • [Diagram: Home Page, Servlet, Database, Mining, Map Server, Visualization (Charts, Maps); flow labels: Input (Request), Data, Output]
Design • Data Cleansing • Automated grouping of the 863 crime types in the raw data into 45 final crime types • A parser reads the data from the file, cleans up some missing information, handles nulls, assigns data types, and loads the result into the database • Data Model • Indexes are added to the fields most frequently used in queries to improve performance.
Data Mining • Several APIs are available for data mining. • Ex: WEKA, Java Data Mining Package (JDMP), RapidMiner (YALE) • WEKA • WEKA is a machine learning and data mining software tool written in Java • Open source, well documented, with support for visualization
Data Mining Functions Implemented • WEKA works on the Attribute-Relation File Format (ARFF). • Filters: • Supervised: Interface for filters that make use of a class attribute. • Ex: Discretize, NominalToBinary, Resample • Unsupervised: Interface for filters that do not need a class attribute. • Ex: Standardize, StringToNominal, StringToWordVector
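For readers unfamiliar with ARFF, the sketch below writes a tiny ARFF document using only the Java standard library (it does not call WEKA). The relation name `crimes` and the attributes `crime_type` and `hour` are made up for illustration and are not the project's schema.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical writer that emits a minimal ARFF document as text. */
final class ArffSketch {
    /** Quotes a nominal value if it contains a space (commas and quotes are not handled here). */
    static String quote(String value) {
        return value.contains(" ") ? "'" + value + "'" : value;
    }

    /** Each row is {crimeType, hour}; crimeTypes lists the allowed nominal values. */
    static String toArff(List<String> crimeTypes, List<String[]> rows) {
        StringJoiner nominalValues = new StringJoiner(",", "{", "}");
        for (String type : crimeTypes) {
            nominalValues.add(quote(type));
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation crimes\n\n");                       // header: relation name
        sb.append("@attribute crime_type ").append(nominalValues).append("\n");
        sb.append("@attribute hour numeric\n\n");                // numeric attribute
        sb.append("@data\n");                                    // one comma-separated row per instance
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(",").append(row[1]).append("\n");
        }
        return sb.toString();
    }
}
```

The output has the three parts every ARFF file needs: an `@relation` line, one `@attribute` line per column, and an `@data` section.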
Functions Available • Supported Classification • Ex: NaiveBayes, RandomizableClassifier • Supported Association Functions • Ex: Apriori, Associator. • Supported Clustering Algorithms • Ex: Simple K-means, DBScan, EM • Outlier Detection based on location is facilitated by PostGIS
Implementation • The data mining part is implemented separately on the 3 datasets • Due to variations in the attributes across the datasets. • Results do not reflect anomalies in the datasets. • Ex: 725 records in Falls Church, compared to 11507 records in Fairfax (1705 records dealing with Auto-Theft) • Fetching data • Data is fetched from the database or from CSV files using WEKA functions.
Implementation Continued… • Filtering • Unwanted attributes are filtered (removed) from the working data set. Ex: Criminal Description etc. • Clustering • The DBScan, K-means, and EM algorithms are run through the WEKA API. • Simple K-means • Takes a range of values of K (say 1 to 45), since 45 is the number of distinct crime types in the database. • Calculates the sum of squared errors (SSE) of the clustering for each value of K and picks a K for which the SSE is low.
Implementation Continued… • Visualization • The mined data is then sent to the visualization explorer in WEKA, where different attributes can be graphed and represented. • Examples of some visualizations:
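The K-selection step can be sketched in plain Java as follows. This is not the WEKA implementation: it is a small Lloyd-style k-means with deterministic farthest-first seeding. Because the SSE never increases for an optimal clustering as K grows, "where SSE is low" needs a concrete rule; the rule in `chooseK` (smallest K whose SSE falls below a fraction of the K = 1 SSE) is an assumption for illustration, not necessarily what the project does.

```java
/** Hypothetical sketch: choose K for k-means by comparing the SSE across a range of K. */
final class KSelectionSketch {
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (sqDist(p, centers[c]) < sqDist(p, centers[best])) best = c;
        }
        return best;
    }

    /** Runs k-means for a fixed number of iterations and returns the within-cluster SSE. */
    static double kMeansSse(double[][] pts, int k, int iterations) {
        int dim = pts[0].length;
        double[][] centers = new double[k][];
        centers[0] = pts[0].clone();
        for (int c = 1; c < k; c++) {              // farthest-first seeding (deterministic)
            int far = 0;
            double farDist = -1;
            for (int i = 0; i < pts.length; i++) {
                double d = Double.MAX_VALUE;
                for (int j = 0; j < c; j++) d = Math.min(d, sqDist(pts[i], centers[j]));
                if (d > farDist) { farDist = d; far = i; }
            }
            centers[c] = pts[far].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (double[] p : pts) {               // assignment step
                int c = nearest(p, centers);
                count[c]++;
                for (int d = 0; d < dim; d++) sum[c][d] += p[d];
            }
            for (int c = 0; c < k; c++) {          // update step; empty clusters keep their center
                if (count[c] == 0) continue;
                for (int d = 0; d < dim; d++) centers[c][d] = sum[c][d] / count[c];
            }
        }
        double sse = 0;
        for (double[] p : pts) sse += sqDist(p, centers[nearest(p, centers)]);
        return sse;
    }

    /** Assumed rule: smallest K in 1..maxK whose SSE is at most fraction * SSE(K = 1). */
    static int chooseK(double[][] pts, int maxK, double fraction) {
        double baseline = kMeansSse(pts, 1, 20);
        for (int k = 1; k <= maxK; k++) {
            if (kMeansSse(pts, k, 20) <= fraction * baseline) return k;
        }
        return maxK;
    }
}
```

On three well-separated groups of points, `chooseK` with a small fraction such as 0.05 settles on K = 3; on real crime data the threshold would need tuning.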
Example 1 • Arlington, day of week (DOW) vs. clusters: Auto Theft (C3) is low on weekends.
Example 2 • Arlington data set: the data is skewed towards Wednesday.
Example 3 • Fairfax, month vs. clusters: shows that the data for June is very sparse.
Advantages of this Implementation • Does not depend on a single algorithm such as K-means. • Modules can be added seamlessly to the existing code to implement other algorithms, either directly or through the WEKA API. • Open design: the algorithm implementation can be switched with simple parameter changes.
Visualization • Google Maps • Visualization of data is implemented using Google Maps API • WEKA used for histograms and cluster visualization
PostGIS • PostGIS is a spatial database extender for the PostgreSQL DBMS. • Adds spatial functions such as distance and area, and specialized geometry data types, to the database. • Relies on GiST (Generalized Search Tree) for indexing geometric data.
PostGIS (continued) • Examples of geometry data types: • POINT (2572292.2 5631150.7) • LINESTRING (2566006.4 5633207.9, 2566028.6 5633215.1, 2566062.3 5633227.1) • POLYGON ((2568262.1 5635344.1, 2568298.5 5635387.6, 2568261.04 5635276.15, 2568262.1 5635344.1)) • Examples of PostGIS functions/operators: • Distance(), Intersects(), Within(), Contains(), Length(), Area(), ConvexHull(), Extent(), ... • A ~ B (A contains B?) • A @ B (B contains A?) • A && B (Do A and B overlap?) • ...
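The three operators listed above compare the bounding boxes of geometries rather than their exact shapes. The sketch below mirrors that behaviour in plain Java; it is an illustration of the semantics, not PostGIS code, and the class `BoundingBox` and its method names are hypothetical.

```java
/** Hypothetical axis-aligned bounding box mirroring the semantics of the ~, @ and && operators. */
final class BoundingBox {
    final double minX, minY, maxX, maxY;

    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    /** Builds the box enclosing a list of {x, y} coordinates. */
    static BoundingBox of(double[][] coords) {
        double minX = coords[0][0], maxX = coords[0][0];
        double minY = coords[0][1], maxY = coords[0][1];
        for (double[] c : coords) {
            minX = Math.min(minX, c[0]); maxX = Math.max(maxX, c[0]);
            minY = Math.min(minY, c[1]); maxY = Math.max(maxY, c[1]);
        }
        return new BoundingBox(minX, minY, maxX, maxY);
    }

    /** Like A && B: true when the two boxes share at least one point. */
    boolean overlaps(BoundingBox o) {
        return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
    }

    /** Like A ~ B: true when this box fully contains the other. */
    boolean contains(BoundingBox o) {
        return minX <= o.minX && minY <= o.minY && maxX >= o.maxX && maxY >= o.maxY;
    }

    /** Like A @ B: true when this box lies fully inside the other. */
    boolean within(BoundingBox o) {
        return o.contains(this);
    }
}
```

Box comparisons like these are cheap, which is why they pair well with a GiST index as a first filter before any exact geometry test.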
Map Visualization • The client UI is implemented in JavaScript. • The Google Maps API was used to view the map and draw all necessary illustrations. • The UI uses asynchronous (AJAX) requests to communicate with the server. • The server replies in JSON (a data-interchange format native to JavaScript). • Request/Reply batching is used to improve performance.
Map Visualization (continued) • The UI can generate map-based visualizations showing the following: • Crime rate in different regions filtered by: dataset, crime status (attempted/committed), year, month, day of week. • As well as: crime type, year, month, day of week with the highest frequency in different regions.
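The "highest frequency in different regions" view boils down to a group-and-count. The sketch below shows one way the server side might compute the most frequent crime type per region; the `Incident` class, its fields, and the region labels are hypothetical, and the deck does not say whether this is done in Java or in SQL.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregation: most frequent crime type per region. */
final class RegionStats {
    static final class Incident {
        final String region;
        final String crimeType;
        Incident(String region, String crimeType) {
            this.region = region;
            this.crimeType = crimeType;
        }
    }

    static Map<String, String> mostFrequentTypeByRegion(List<Incident> incidents) {
        // region -> (crime type -> count); TreeMap keeps the iteration order deterministic
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Incident i : incidents) {
            counts.computeIfAbsent(i.region, r -> new TreeMap<>())
                  .merge(i.crimeType, 1, Integer::sum);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> region : counts.entrySet()) {
            String top = null;
            int topCount = 0;
            for (Map.Entry<String, Integer> type : region.getValue().entrySet()) {
                if (type.getValue() > topCount) {   // ties go to the alphabetically first type
                    top = type.getKey();
                    topCount = type.getValue();
                }
            }
            result.put(region.getKey(), top);
        }
        return result;
    }
}
```

The same pattern applies to year, month, or day of week by swapping the field that is counted.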
References • PostGIS (http://postgis.refractions.net/) • Google Maps API (http://code.google.com/apis/maps/) • JDMP (http://www.jdmp.org) • WEKA (www.cs.waikato.ac.nz/ml/weka/) • WEKA API (http://weka.sourceforge.net/doc/) • RapidMiner (YALE) (www.rapidminer.com)