Using MongoDB in a Java Enterprise Application Ellen Kraffmiller - Technical Lead Robert Treacy - Senior Software Architect Institute for Quantitative Social Science Harvard University iq.harvard.edu TS-4656
Agenda What is Consilience? Consilience Architecture Why MongoDB? Data Storage and Access Details Using JPA vs. Mongo API Summary Q & A
Consilience Intro • Research tool for discovery • Grimmer, Justin, and Gary King. 2011. "General Purpose Computer-Assisted Clustering and Conceptualization." Proceedings of the National Academy of Sciences. http://j.mp/j4xyav
Consilience History • 2010 - brainstorming, first mockups, prototypes, proof of concept, experimentation • Initially small document sets • Data could be loaded into memory from files • LCE calculations could be done on the fly • Derby used for user accounts, permissions, user history
Workflow • Analyze document terms • Calculate clustering solutions for known methods • Map solutions to two-dimensional space • Generate LCE from user clicks on the map • User discovers patterns in data, annotates and labels favorites
Consilience Pre-processing (pipeline diagram) • Input: txt files and stem–words file, stored in MongoDB • Stemming and word frequency per document → term–document matrix • Calculate similarity matrix → projection to 2D clustering space • Run clustering methods → method points, grid point membership • Run K-means → cluster membership matrix • Labeling → labels
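The word-frequency step of the pipeline can be sketched as below. The regex tokenizer and the crude plural-stripping "stemmer" are placeholders for illustration only, not the actual Consilience pre-processing:

```java
import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    // Per-document term counts: one map of stem -> count per document.
    // These per-document maps are the raw material for a term-document matrix.
    static Map<String, Integer> termCounts(String documentText) {
        Map<String, Integer> counts = new HashMap<>();
        // split on anything that is not a letter (placeholder tokenizer)
        for (String token : documentText.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            // toy "stemmer": strip a trailing "s" from longer words
            String stem = token.endsWith("s") && token.length() > 3
                    ? token.substring(0, token.length() - 1) : token;
            counts.merge(stem, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                termCounts("Clusters cluster documents; documents form clusters.");
        System.out.println(counts);  // e.g. {cluster=3, document=2, form=1}
    }
}
```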
Three stages of Java processing • Document set ingest • Local Cluster Ensemble (LCE) calculations • K-means • Labeling (mutual information) • Cluster analysis (Clustering Space page)
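The labeling stage scores how strongly each word is associated with a cluster via mutual information. A minimal, self-contained sketch follows; the 2x2 contingency-table formulation shown here is the standard textbook approach, not necessarily the exact computation Consilience performs:

```java
public class MiSketch {
    // Mutual information (in nats) between a word's presence and cluster
    // membership, computed from four co-occurrence counts:
    //   nWC = docs in the cluster containing the word
    //   nW  = docs containing the word (anywhere in the set)
    //   nC  = docs in the cluster
    //   n   = docs in the whole set
    static double mutualInformation(int nWC, int nW, int nC, int n) {
        // the four cells of the 2x2 table: {joint count, word marginal, cluster marginal}
        int[][] cells = {
            {nWC, nW, nC},                      // word present, in cluster
            {nW - nWC, nW, n - nC},             // word present, not in cluster
            {nC - nWC, n - nW, nC},             // word absent, in cluster
            {n - nW - nC + nWC, n - nW, n - nC} // word absent, not in cluster
        };
        double mi = 0.0;
        for (int[] cell : cells) {
            double joint = (double) cell[0] / n;
            if (joint == 0) continue;           // 0 * log(0) contributes 0
            double pWord = (double) cell[1] / n;
            double pCluster = (double) cell[2] / n;
            mi += joint * Math.log(joint / (pWord * pCluster));
        }
        return mi;
    }

    public static void main(String[] args) {
        // word concentrated in the cluster: 40 of 50 cluster docs, 50 of 200 overall
        double informative = mutualInformation(40, 50, 50, 200);
        // word spread evenly: ~26% of cluster docs, 26% of all docs
        double uninformative = mutualInformation(13, 52, 50, 200);
        System.out.printf("informative=%.4f uninformative=%.4f%n",
                informative, uninformative);
    }
}
```

Words with the highest MI for a cluster become its keywords (the `MiWord` entities shown later in the deck).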
Data Requirements • Goal – manage 10 million documents/set • Each document set has • Document text files and original files • Text Analysis Data (Term Document Matrix, Stem Data) • Method Clustering solutions – cluster assignments for each method • Local Cluster Ensemble - Clustering solutions
Why MongoDB • Why not only SQL • Needed to consider persisting larger document sets • On the fly LCE calculations became impractical – they needed to be pre-calculated and persisted • Document set metadata could be any type • Need to efficiently handle potentially very large amounts of data • 10 Million documents with associated metadata, cluster memberships, 10,000 pre-calculated clustering solutions (grid points)
Why SQL • Already had working code written • Take advantage of transaction management for frequently updated data • SQL data works well with Web Server Security Realm • Data in SQL database is relatively small & manageable
Data Storage Details • Use Derby for read/write user-related data • user accounts • document set permissions • user workspace history • Use MongoDB for read-mostly document set data and clustering solutions • Document text, metadata, original files • Clustering analysis data • Word counts • Pre-calculated cluster memberships and keywords
Data Storage Details • Combination of MongoDB collections and GridFS • Moving flat files into GridFS makes it easier to have multiple servers working on the data • MapReduce (future)
Data access from Java • Derby – JPA entities • MongoDB – JPA entities and the MongoDB API
@NoSQL JPA Entities GridPoint and Cluster Example
Parent Entity - GridPoint.java
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint implements Serializable {
    @Id
    @GeneratedValue
    @Field(name = "_id")
    protected String id;
    private double x;
    private double y;
    private double distance;
    private int numberOfMethods;
    private Long rangeId;
    private double[][] prototypes;
    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;
    // ... rest of class methods
}
Child Entity - Cluster.java
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster implements Serializable {
    @Id
    @GeneratedValue
    @Field(name = "_id")
    private String id;
    private String labels;
    @ManyToOne
    private GridPoint gridPoint;
    private Long rangeId;
    @ElementCollection
    private List<MiWord> miWords;
    // ...
}
Embedded Entity - MiWord.java
@NoSql(dataFormat = DataFormatType.MAPPED)
@Embeddable
public class MiWord implements Serializable {
    private String label; // most common variation of the stem
    public double mi;     // mutual information value
    public int wordIndex; // index of the word in wordDocMatrix
    // getters and setters ...
}
@NoSQL Mapping to MongoDB • Id field maps to String, not ObjectId • Bidirectional relationships not allowed (no "mappedBy") – use two unidirectional relationships instead • @OneToMany – IDs of the child entities are stored in the parent • No @OrderBy – MongoDB preserves insertion order • Full description of mapping support: http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Advanced_JPA_Development/NoSQL/Configuring
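To make the "no mappedBy" rule concrete, a bidirectional parent/child link is written as two independent unidirectional mappings, each side owning its own reference. A skeleton (fields elided; annotations as in the earlier entity slides):

```java
// Sketch only: two unidirectional relationships replacing one
// bidirectional mapping. Neither side uses mappedBy.
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint {
    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;  // parent stores the child IDs
}

@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster {
    @ManyToOne
    private GridPoint gridPoint;     // child independently references the parent
}
```

Because each side is mapped on its own, the application code is responsible for keeping the two references consistent when persisting.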
Saving GridPoint and related Clusters
public void savePoint(EntityManager em, List<int[]> clusterDocIds,
                      Long rangeId, GridCoordinate coord) {
    GridPoint gridPoint = new GridPoint();
    gridPoint.setRangeId(rangeId);
    // ... set more fields
    List<Cluster> clusters = new ArrayList<>();
    for (int i = 0; i < clusterDocIds.size(); i++) {
        Cluster cluster = new Cluster();
        cluster.setClusterSize(clusterDocIds.get(i).length);
        cluster.setRangeId(rangeId);
        cluster.setMiWords(miWordList);
        clusters.add(cluster);
    }
    gridPoint.setClusters(clusters);
    em.persist(gridPoint);
    em.flush();
    saveClusterFiles(clusterDocIds, gridPoint.getId());
}
Calling savePoint() within a Transaction
public void savePoints(Long rangeId, GridCoordinate[] gridCoordinates) {
    EntityManagerFactory emf =
        Persistence.createEntityManagerFactory("mongo-" + getMongoHost());
    try {
        for (GridCoordinate coord : gridCoordinates) {
            List<int[]> clusterDocIds = calcClustering(coord.x, coord.y);
            EntityManager em = emf.createEntityManager();
            em.getTransaction().begin();
            savePoint(em, clusterDocIds, rangeId, coord);
            em.getTransaction().commit();
            em.close();
        }
    } catch (Exception e) {
        rollback(rangeId);
    }
}
Persistence.xml: Defining Persistence Units
<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
  <persistence-unit name="text" transaction-type="JTA">
    <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
    <jta-data-source>jdbc/text</jta-data-source>
    <properties>
      <property name="eclipselink.ddl-generation" value="create-tables"/>
      <property name="eclipselink.cache.shared.default" value="false"/>
    </properties>
  </persistence-unit>
Persistence.xml, continued
  <persistence-unit name="mongo-localhost" transaction-type="RESOURCE_LOCAL">
    <class>edu.harvard.iq.text.model.GridPoint</class>
    <class>edu.harvard.iq.text.model.Cluster</class>
    <properties>
      <property name="eclipselink.target-database" value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
      <property name="eclipselink.nosql.connection-spec" value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
      <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
      <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
      <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
      <property name="eclipselink.logging.level" value="SEVERE"/>
    </properties>
  </persistence-unit>
  <persistence-unit name="mongo-ip-10-205-17-163.ec2.internal" transaction-type="RESOURCE_LOCAL">
    . . .
    <property name="eclipselink.nosql.property.mongo.host" value="ip-10-205-17-163.ec2.internal"/>
    . . .
  </persistence-unit>
</persistence>
Composite Persistence Unit
We use separate persistence units for MongoDB and Derby, but "create-tables" no longer works.
Another option – a composite persistence unit:
<persistence-unit name="composite-pu" transaction-type="RESOURCE_LOCAL">
  <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
  <jar-file>\lib\polyglot-persistence-rational-pu-1.0-SNAPSHOT.jar</jar-file>
  <jar-file>\lib\polyglot-persistence-nosql-pu-1.0-SNAPSHOT.jar</jar-file>
  <properties>
    <property name="eclipselink.composite-unit" value="true"/>
  </properties>
</persistence-unit>
</persistence>
With a composite persistence unit, you can map JPA relationships between SQL and NoSQL entities.
More info: http://java.dzone.com/articles/polyglot-persistence-0
Minor Issues with JPA @NoSQL • Have to restart Glassfish after modifying an Entity • Need to update persistence.xml with class names • JPA query language not fully supported – sometimes have to revert to native queries • Overhead of using EntityManagerFactory, EntityManager & transactions
Create a Set in Derby and MongoDB
@Stateless
@Named
public class DocSetService {
    @PersistenceContext(unitName = "text")
    protected EntityManager em;

    public void create(File parentDir, DocSet docSet) {
        try {
            MongoSetWrapper.ingestData(parentDir, docSet.getSetId());
            em.persist(docSet);
        } catch (Exception e) {
            MongoSetWrapper.rollbackData(docSet.getSetId());
            // other exception handling
        }
    }
}
Getting a connection to MongoDB
public class MongoDB {
    private static MongoClient mongo;
    private static String dbName = "mydb";

    public static void init() {
        if (mongo == null) {
            try {
                mongo = new MongoClient(new ServerAddress(getHostName()));
            } catch (UnknownHostException e) {
                // ... do exception handling
            }
        }
    }

    public static DB getMyDB() {
        init();
        return mongo.getDB(dbName);
    }
}
Create MongoSet with the MongoDB API
private ObjectId createMongoSet(File setDir, String setId) {
    DBCollection coll = MongoDB.getMyDB().getCollection("MongoSet");
    BasicDBObject doc = new BasicDBObject("setId", setId);
    ArrayList<String> summaryFields = readSummaryFieldsFromFile(setDir);
    BasicDBList list = new BasicDBList();
    list.addAll(summaryFields);
    BasicDBList list2 = new BasicDBList();
    list2.addAll(readAllFieldsFromFile(setDir));
    doc.append("summaryFields", list);
    doc.append("allFields", list2);
    coll.insert(doc);
    return (ObjectId) doc.get("_id");
}
Create Document with the MongoDB API
private void createDocument(ObjectId mongoSetId, String doc_id, String filename,
                            File setDir, HashMap metadata, File origFile) {
    DBCollection coll = MongoDB.getMyDB().getCollection("Document");
    BasicDBObject doc = new BasicDBObject("mongoSet_id", mongoSetId);
    doc.append("doc_id", doc_id);
    doc.append("filename", filename);
    doc.append("text", getDocumentText(setDir, filename));
    // HashMap can contain different types – difficult to do with JPA
    doc.put("metadata", new BasicDBObject(metadata));
    coll.insert(doc);
    createOrigDocument(origFile, doc_id, mongoSetId);
}
Creating a GridFS File
private void createOrigDocument(File origFile, String doc_id, ObjectId mongoSetId) {
    try {
        // put the original file into GridFS,
        // with the document id and MongoSet ObjectId as attributes
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        GridFSInputFile gif = gfs.createFile(origFile);
        gif.setFilename(origFile.getName());
        gif.put("mongoSet_id", mongoSetId);
        gif.put("doc_id", doc_id);
        gif.save();
    } catch (IOException e) {
        throw new ClusterException("Error saving orig doc " + origFile.getName(), e);
    }
}
Read a Document with the MongoDB API
public Document getDocumentByIndex(ObjectId mongoSetId, int index) {
    BasicDBObject query = new BasicDBObject();
    query.put("mongoSet_id", mongoSetId);
    query.put("doc_id", Integer.toString(index));
    DBObject obj = MongoDB.getMyDB().getCollection("Document").findOne(query);
    Document doc = new Document(mongoSetId, (String) obj.get("doc_id"),
            (String) obj.get("filename"), (String) obj.get("text"));
    Object metadata = obj.get("metadata");
    if (metadata != null) {
        doc.setMetadata((HashMap) metadata);
    } else {
        // If there is no metadata for this document,
        // just create a single metadata field, "filename"
        HashMap filename = new HashMap();
        filename.put("filename", doc.getFilename());
        doc.setMetadata(filename);
    }
    return doc;
}
Reading a GridFS File
class Document {
    private ObjectId mongoSetId;
    private String docId;
    // ... other fields and methods

    public GridFSDBFile getOrigDocument() {
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        BasicDBObject query = new BasicDBObject()
                .append("mongoSet_id", mongoSetId)
                .append("doc_id", docId);
        return gfs.findOne(query);
    }
}

// ... Servlet method for displaying the original document
protected void processRequest(HttpServletRequest request, HttpServletResponse response) {
    // ... get Document from request parameters
    GridFSDBFile origFile = doc.getOrigDocument();
    BufferedInputStream bis = new BufferedInputStream(origFile.getInputStream());
    // ... read from the stream and write to the response output
}
Accessing data with MongoDB API • “Wordier” than JPA access – have to write more code • Finer level of control over how data is stored and retrieved • Other ways to manage marshalling/unmarshalling – JSON.parse(), GSON • Other Mongo Persistence libraries: Morphia, Jongo
Summary • Polyglot storage is useful when you have different types of data – you can take advantage of different database features • Using @NoSQL JPA is easier when you are already familiar with JPA but new to MongoDB • There are details of @NoSQL mapping to be aware of – you still need to understand MongoDB to use JPA effectively • The MongoDB API is useful when you have fewer entities and need more control over storage and access, or you are using GridFS
Thanks for coming! • Questions, Comments?