Using MongoDB in a Java Enterprise Application Ellen Kraffmiller - Technical Lead Robert Treacy - Senior Software Architect Institute for Quantitative Social Science Harvard University iq.harvard.edu TS-4656
Agenda What is Consilience? Consilience Architecture Why MongoDB? Data Storage and Access Details Using JPA vs. Mongo API Summary Q & A
Consilience Intro • Research tool for discovery • Grimmer, Justin, and Gary King. 2011. "General Purpose Computer-Assisted Clustering and Conceptualization." Proceedings of the National Academy of Sciences. http://j.mp/j4xyav
Consilience History • 2010 - brainstorming, first mockups, prototypes, proof of concept, experimentation • Initially small document sets • Data could be loaded into memory from files • LCE calculations could be done on the fly • Derby used for user accounts, permissions, user history
Workflow • Analyze document terms • Calculate clustering solutions for known methods • Map solutions to two-dimensional space • Generate LCE from user clicks on the map • User discovers patterns in data, annotates and labels favorites
Consilience Pre-processing (pipeline diagram) • Input: txt files and stem–words file, stored in MongoDB • Stemming and word frequency per document → term–document matrix • Calculate similarity matrix → projection to 2D clustering space • Run clustering methods → method points, grid point membership • Run K-means → cluster membership matrix • Labeling → labels
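The word-frequency step of the pipeline can be sketched as below. The regex tokenizer and the crude plural-stripping "stemmer" are placeholders for illustration only, not the actual Consilience pre-processing:

```java
import java.util.HashMap;
import java.util.Map;

public class WordFrequency {
    // Per-document term counts: one map of stem -> count per document.
    // These per-document maps are the raw material for a term-document matrix.
    static Map<String, Integer> termCounts(String documentText) {
        Map<String, Integer> counts = new HashMap<>();
        // split on anything that is not a letter (placeholder tokenizer)
        for (String token : documentText.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            // toy "stemmer": strip a trailing "s" from longer words
            String stem = token.endsWith("s") && token.length() > 3
                    ? token.substring(0, token.length() - 1) : token;
            counts.merge(stem, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                termCounts("Clusters cluster documents; documents form clusters.");
        System.out.println(counts);  // e.g. {cluster=3, document=2, form=1}
    }
}
```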
Three stages of Java processing • Document set ingest • Local Cluster Ensemble (LCE) calculations • K-means • Labeling (mutual information) • Cluster analysis (Clustering Space page)
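The labeling stage scores how strongly each word is associated with a cluster via mutual information. A minimal, self-contained sketch follows; the 2x2 contingency-table formulation shown here is the standard textbook approach, not necessarily the exact computation Consilience performs:

```java
public class MiSketch {
    // Mutual information (in nats) between a word's presence and cluster
    // membership, computed from four co-occurrence counts:
    //   nWC = docs in the cluster containing the word
    //   nW  = docs containing the word (anywhere in the set)
    //   nC  = docs in the cluster
    //   n   = docs in the whole set
    static double mutualInformation(int nWC, int nW, int nC, int n) {
        // the four cells of the 2x2 table: {joint count, word marginal, cluster marginal}
        int[][] cells = {
            {nWC, nW, nC},                      // word present, in cluster
            {nW - nWC, nW, n - nC},             // word present, not in cluster
            {nC - nWC, n - nW, nC},             // word absent, in cluster
            {n - nW - nC + nWC, n - nW, n - nC} // word absent, not in cluster
        };
        double mi = 0.0;
        for (int[] cell : cells) {
            double joint = (double) cell[0] / n;
            if (joint == 0) continue;           // 0 * log(0) contributes 0
            double pWord = (double) cell[1] / n;
            double pCluster = (double) cell[2] / n;
            mi += joint * Math.log(joint / (pWord * pCluster));
        }
        return mi;
    }

    public static void main(String[] args) {
        // word concentrated in the cluster: 40 of 50 cluster docs, 50 of 200 overall
        double informative = mutualInformation(40, 50, 50, 200);
        // word spread evenly: ~26% of cluster docs, 26% of all docs
        double uninformative = mutualInformation(13, 52, 50, 200);
        System.out.printf("informative=%.4f uninformative=%.4f%n",
                informative, uninformative);
    }
}
```

Words with the highest MI for a cluster become its keywords (the `MiWord` entities shown later in the deck).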
Data Requirements • Goal – manage 10 million documents/set • Each document set has • Document text files and original files • Text Analysis Data (Term Document Matrix, Stem Data) • Method Clustering solutions – cluster assignments for each method • Local Cluster Ensemble - Clustering solutions
Why MongoDB • Why not only SQL • Needed to consider persisting larger document sets • On the fly LCE calculations became impractical – they needed to be pre-calculated and persisted • Document set metadata could be any type • Need to efficiently handle potentially very large amounts of data • 10 Million documents with associated metadata, cluster memberships, 10,000 pre-calculated clustering solutions (grid points)
Why SQL • Already had working code written • Take advantage of transaction management for frequently updated data • SQL data works well with Web Server Security Realm • Data in SQL database is relatively small & manageable
Data Storage Details • Use Derby for read/write user-related data • user accounts • document set permissions • user workspace history • Use MongoDB for read-mostly document set data and clustering solutions • Document text, metadata, original files • Clustering analysis data • Word counts • Pre-calculated cluster memberships and keywords
Data Storage Details • Combination of MongoDB collections and GridFS • Moving flat files into GridFS makes it easier to have multiple servers working on the data • MapReduce (future)
Data access from Java • Derby – JPA entities • MongoDB – JPA entities and the MongoDB API
@NoSQL JPA Entities GridPoint and Cluster Example
Parent Entity - GridPoint.java
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint implements Serializable {
    @Id
    @GeneratedValue
    @Field(name = "_id")
    protected String id;
    private double x;
    private double y;
    private double distance;
    private int numberOfMethods;
    private Long rangeId;
    private double[][] prototypes;
    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;
    // ... rest of class methods
}
Child Entity - Cluster.java
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster implements Serializable {
    @Id
    @GeneratedValue
    @Field(name = "_id")
    private String id;
    private String labels;
    @ManyToOne
    private GridPoint gridPoint;
    private Long rangeId;
    @ElementCollection
    private List<MiWord> miWords;
    // ...
}
Embedded Entity - MiWord.java
@NoSql(dataFormat = DataFormatType.MAPPED)
@Embeddable
public class MiWord implements Serializable {
    private String label; // most common variation of the stem
    public double mi;     // mutual information value
    public int wordIndex; // index of the word in wordDocMatrix
    // getters and setters ...
}
@NoSQL Mapping to MongoDB • Id field maps to String, not ObjectId • Bidirectional relationships not allowed (no "mappedBy") – use two unidirectional relationships instead • @OneToMany – IDs of the child entities are stored in the parent • No @OrderBy – MongoDB preserves insertion order • Full description of mapping support: http://wiki.eclipse.org/EclipseLink/UserGuide/JPA/Advanced_JPA_Development/NoSQL/Configuring
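To make the "no mappedBy" rule concrete, a bidirectional parent/child link is written as two independent unidirectional mappings, each side owning its own reference. A skeleton (fields elided; annotations as in the earlier entity slides):

```java
// Sketch only: two unidirectional relationships replacing one
// bidirectional mapping. Neither side uses mappedBy.
@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class GridPoint {
    @OneToMany(cascade = {CascadeType.PERSIST, CascadeType.REMOVE})
    private List<Cluster> clusters;  // parent stores the child IDs
}

@Entity
@NoSql(dataFormat = DataFormatType.MAPPED)
public class Cluster {
    @ManyToOne
    private GridPoint gridPoint;     // child independently references the parent
}
```

Because each side is mapped on its own, the application code is responsible for keeping the two references consistent when persisting.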
Saving GridPoint and related Clusters
public void savePoint(EntityManager em, List<int[]> clusterDocIds,
                      Long rangeId, GridCoordinate coord) {
    GridPoint gridPoint = new GridPoint();
    gridPoint.setRangeId(rangeId);
    // ... set more fields
    List<Cluster> clusters = new ArrayList<>();
    for (int i = 0; i < clusterDocIds.size(); i++) {
        Cluster cluster = new Cluster();
        cluster.setClusterSize(clusterDocIds.get(i).length);
        cluster.setRangeId(rangeId);
        cluster.setMiWords(miWordList);
        clusters.add(cluster);
    }
    gridPoint.setClusters(clusters);
    em.persist(gridPoint);
    em.flush();
    saveClusterFiles(clusterDocIds, gridPoint.getId());
}
Calling savePoint() within a Transaction
public void savePoints(Long rangeId, GridCoordinate[] gridCoordinates) {
    EntityManagerFactory emf =
        Persistence.createEntityManagerFactory("mongo-" + getMongoHost());
    try {
        for (GridCoordinate coord : gridCoordinates) {
            List<int[]> clusterDocIds = calcClustering(coord.x, coord.y);
            EntityManager em = emf.createEntityManager();
            em.getTransaction().begin();
            savePoint(em, clusterDocIds, rangeId, coord);
            em.getTransaction().commit();
            em.close();
        }
    } catch (Exception e) {
        rollback(rangeId);
    }
}
Persistence.xml: Defining Persistence Units
<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0" xmlns="http://java.sun.com/xml/ns/persistence"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
  <persistence-unit name="text" transaction-type="JTA">
    <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
    <jta-data-source>jdbc/text</jta-data-source>
    <properties>
      <property name="eclipselink.ddl-generation" value="create-tables"/>
      <property name="eclipselink.cache.shared.default" value="false"/>
    </properties>
  </persistence-unit>
Persistence.xml, continued
  <persistence-unit name="mongo-localhost" transaction-type="RESOURCE_LOCAL">
    <class>edu.harvard.iq.text.model.GridPoint</class>
    <class>edu.harvard.iq.text.model.Cluster</class>
    <properties>
      <property name="eclipselink.target-database" value="org.eclipse.persistence.nosql.adapters.mongo.MongoPlatform"/>
      <property name="eclipselink.nosql.connection-spec" value="org.eclipse.persistence.nosql.adapters.mongo.MongoConnectionSpec"/>
      <property name="eclipselink.nosql.property.mongo.port" value="27017"/>
      <property name="eclipselink.nosql.property.mongo.host" value="localhost"/>
      <property name="eclipselink.nosql.property.mongo.db" value="mydb"/>
      <property name="eclipselink.logging.level" value="SEVERE"/>
    </properties>
  </persistence-unit>
  <persistence-unit name="mongo-ip-10-205-17-163.ec2.internal" transaction-type="RESOURCE_LOCAL">
    . . .
    <property name="eclipselink.nosql.property.mongo.host" value="ip-10-205-17-163.ec2.internal"/>
    . . .
  </persistence-unit>
</persistence>
Composite Persistence Unit
We use separate persistence units for MongoDB and Derby, but "create-tables" no longer works.
Another option – a composite persistence unit:
<persistence-unit name="composite-pu" transaction-type="RESOURCE_LOCAL">
  <provider>org.eclipse.persistence.jpa.PersistenceProvider</provider>
  <jar-file>\lib\polyglot-persistence-rational-pu-1.0-SNAPSHOT.jar</jar-file>
  <jar-file>\lib\polyglot-persistence-nosql-pu-1.0-SNAPSHOT.jar</jar-file>
  <properties>
    <property name="eclipselink.composite-unit" value="true"/>
  </properties>
</persistence-unit>
</persistence>
With a composite persistence unit, you can map JPA relationships between SQL and NoSQL entities.
More info: http://java.dzone.com/articles/polyglot-persistence-0
Minor Issues with JPA @NoSQL • Have to restart Glassfish after modifying an Entity • Need to update persistence.xml with class names • JPA query language not fully supported – sometimes have to revert to native queries • Overhead of using EntityManagerFactory, EntityManager & transactions
Create a Set in Derby and MongoDB
@Stateless
@Named
public class DocSetService {
    @PersistenceContext(unitName = "text")
    protected EntityManager em;

    public void create(File parentDir, DocSet docSet) {
        try {
            MongoSetWrapper.ingestData(parentDir, docSet.getSetId());
            em.persist(docSet);
        } catch (Exception e) {
            MongoSetWrapper.rollbackData(docSet.getSetId());
            // other exception handling
        }
    }
}
Getting a connection to MongoDB
public class MongoDB {
    private static MongoClient mongo;
    private static String dbName = "mydb";

    public static void init() {
        if (mongo == null) {
            try {
                mongo = new MongoClient(new ServerAddress(getHostName()));
            } catch (UnknownHostException e) {
                // ... do exception handling
            }
        }
    }

    public static DB getMyDB() {
        init();
        return mongo.getDB(dbName);
    }
}
Create MongoSet with the MongoDB API
private ObjectId createMongoSet(File setDir, String setId) {
    DBCollection coll = MongoDB.getMyDB().getCollection("MongoSet");
    BasicDBObject doc = new BasicDBObject("setId", setId);
    ArrayList<String> summaryFields = readSummaryFieldsFromFile(setDir);
    BasicDBList list = new BasicDBList();
    list.addAll(summaryFields);
    BasicDBList list2 = new BasicDBList();
    list2.addAll(readAllFieldsFromFile(setDir));
    doc.append("summaryFields", list);
    doc.append("allFields", list2);
    coll.insert(doc);
    return (ObjectId) doc.get("_id");
}
Create Document with the MongoDB API
private void createDocument(ObjectId mongoSetId, String doc_id, String filename,
                            File setDir, HashMap metadata, File origFile) {
    DBCollection coll = MongoDB.getMyDB().getCollection("Document");
    BasicDBObject doc = new BasicDBObject("mongoSet_id", mongoSetId);
    doc.append("doc_id", doc_id);
    doc.append("filename", filename);
    doc.append("text", getDocumentText(setDir, filename));
    // HashMap can contain different types – difficult to do with JPA
    doc.put("metadata", new BasicDBObject(metadata));
    coll.insert(doc);
    createOrigDocument(origFile, doc_id, mongoSetId);
}
Creating a GridFS File
private void createOrigDocument(File origFile, String doc_id, ObjectId mongoSetId) {
    try {
        // put the original file into GridFS,
        // with the document id and MongoSet ObjectId as attributes
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        GridFSInputFile gif = gfs.createFile(origFile);
        gif.setFilename(origFile.getName());
        gif.put("mongoSet_id", mongoSetId);
        gif.put("doc_id", doc_id);
        gif.save();
    } catch (IOException e) {
        throw new ClusterException("Error saving orig doc " + origFile.getName(), e);
    }
}
Read a Document with the MongoDB API
public Document getDocumentByIndex(ObjectId mongoSetId, int index) {
    BasicDBObject query = new BasicDBObject();
    query.put("mongoSet_id", mongoSetId);
    query.put("doc_id", Integer.toString(index));
    DBObject obj = MongoDB.getMyDB().getCollection("Document").findOne(query);
    Document doc = new Document(mongoSetId, (String) obj.get("doc_id"),
            (String) obj.get("filename"), (String) obj.get("text"));
    Object metadata = obj.get("metadata");
    if (metadata != null) {
        doc.setMetadata((HashMap) metadata);
    } else {
        // If there is no metadata for this document,
        // just create a single metadata field, "filename"
        HashMap filename = new HashMap();
        filename.put("filename", doc.getFilename());
        doc.setMetadata(filename);
    }
    return doc;
}
Reading a GridFS File
class Document {
    private ObjectId mongoSetId;
    private String docId;
    // ... other fields and methods

    public GridFSDBFile getOrigDocument() {
        GridFS gfs = new GridFS(MongoDB.getMyDB());
        BasicDBObject query = new BasicDBObject()
                .append("mongoSet_id", mongoSetId)
                .append("doc_id", docId);
        return gfs.findOne(query);
    }
}

// ... Servlet method for displaying the original document
protected void processRequest(HttpServletRequest request, HttpServletResponse response) {
    // ... get Document from request parameters
    GridFSDBFile origFile = doc.getOrigDocument();
    BufferedInputStream bis = new BufferedInputStream(origFile.getInputStream());
    // ... read from the stream and write to the response output
}
Accessing data with MongoDB API • “Wordier” than JPA access – have to write more code • Finer level of control over how data is stored and retrieved • Other ways to manage marshalling/unmarshalling – JSON.parse(), GSON • Other Mongo Persistence libraries: Morphia, Jongo
Summary • Polyglot storage is useful when you have different types of data – you can take advantage of different database features • Using @NoSQL JPA is easier when you are already familiar with JPA but new to MongoDB • There are details of @NoSQL mapping to be aware of – you still need to understand MongoDB to use JPA effectively • The MongoDB API is useful when you have fewer entities and need more control over storage and access, or you are using GridFS
Thanks for coming! • Questions, Comments?