Arctos at the University of Alaska Museum Insect Collection
Derek Sikes (1), Gordon Jarrell (2), Dusty McDonald (1)
(1) University of Alaska Museum, Fairbanks, AK; (2) Museum of Southwestern Biology, NM
Alaska Entomological Society 5th Annual Meeting, Anchorage, AK, 27-28 Jan 2012
Major repositories using the Arctos database (43 collections of specimens or observations, 1.4M records)
in partnership with [partner logo not preserved], which is a member of TeraGrid, a nationwide network of 11 supercomputing facilities sponsored by the U.S. National Science Foundation’s Office of Cyberinfrastructure
Arctos: A 15-year history
• MVZ: 1995 - Hired Stan Blum to develop relational data model (following modeling by Assoc. Systematic Collections).
• MVZ: 1997 - Hired John Wieczorek to implement model (desktop application) using Sybase and Versata. Partial implementation (e.g., no loans).
• UAM: 1998-2000 - John W. migrated mammal data to Oracle, set up Versata.
• UAM: 2002 - Dusty McDonald replaced Versata with ColdFusion, implemented full model (first web-based instance, aka Arctos).
• MSB: 2003 - Joined Arctos at UAM (first multi-hosting instance).
• MVZ and MCZ: 2005-2007 - Implemented separate instances of Arctos at Berkeley and Harvard (MVZ: first Postgres, then Oracle).
• MVZ: 2009 - Moved hosting of data to Alaska (Virtual Private Database version).
Arctos
• Specimens (objects) - body parts, tissues, containers, etc.
• Images, media (stored at TACC)
• Projects, permits, publications
• Accessions, loans, usage
• Labels, as PDF files
• Agents, agent activity
[Diagram: Specimen Catalog (label data and more); Accessions; Loans, usage; Projects contribute and/or use specimens; Citations (publications cite specimens); the rest of cyberspace: GenBank, federated portals, BerkeleyMapper, “Media” in TeraGrid]
Breadth of Data in Arctos
• Fish, amphibians, reptiles, mammals, birds and bird eggs/nests, plants, arthropods, fossils, molluscs AND their parasites
• Specimens and observations
• Media (images, audio, video)
• Publications, fieldnotes
Arctos is constantly evolving to incorporate new kinds of data, e.g.:
• Better representation of non-publication documents (fieldnotes, correspondence)
• Cultural collections (art, anthropology...)
Nearly all that is known about an object (or observation) can be included in Arctos.
ECN Session – Arthropod Collections Databases
• What is the primary user audience? Large / small museum management? Taxonomic research? Is a dedicated IT / programmer required? Single vs multi-user? (Annual cost?)
• GBIF - does the database provide data to GBIF?
• Barcoding - does the database handle batch processing of specimens using barcodes? ('speed / ease of use')
• Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF?
• What is the ease / difficulty of web setup?
• Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data? Is/are the database server(s) protected from disaster (e.g., floods, fires)?
• Likes / dislikes & pros / cons
ECN Session – Arthropod Collections Databases
1a) What is the primary user audience? Museums / collections data management (also: observations, Federal collections [USFWS], large private collections associated with a public institution).
1b) Is a dedicated IT / programmer required? Yes, but the IT staff are shared among all participants.
1c) Single vs multi-user? Multi-user without practical limits.
1d) Annual cost? Negotiated per institution based on size and maintenance needs, currently ranging from $1,300 to $27,000.
ECN Session – Arthropod Collections Databases
2) GBIF - does the database provide data to GBIF? Arctos does this automagically every minute.
3) Barcoding - does the database handle batch processing of specimens using barcodes? ('speed / ease of use') Arctos attaches barcodes to “parts.” This lets you track things like tissues, extractions, slides and pinned bodies of each cataloged specimen separately.
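The part-level barcoding described above can be pictured with a small Python sketch. This is not Arctos code: the `Part` and `Specimen` classes, the catalog number format, and the barcode values are all made up for illustration. It only shows the idea that a barcode identifies a part, so each part of one cataloged specimen can be tracked on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    kind: str      # e.g. "tissue", "slide", "pinned body"
    barcode: str   # the barcode identifies the part, not the specimen


@dataclass
class Specimen:
    catalog_number: str                      # hypothetical identifier format
    parts: list = field(default_factory=list)


def index_by_barcode(specimens):
    """Map each part barcode to (catalog number, part kind)."""
    index = {}
    for specimen in specimens:
        for part in specimen.parts:
            if part.barcode in index:
                raise ValueError(f"duplicate barcode: {part.barcode}")
            index[part.barcode] = (specimen.catalog_number, part.kind)
    return index


# One cataloged specimen, two separately tracked parts (hypothetical values).
specimen = Specimen("EXAMPLE:0001", [Part("pinned body", "BC1001"),
                                     Part("tissue", "BC1002")])
index = index_by_barcode([specimen])
```

Scanning `BC1002` then resolves to the tissue of `EXAMPLE:0001`, independently of where the pinned body is stored or loaned.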
ECN Session – Arthropod Collections Databases
4) Georeferencing - does it conform to the recommended 'best practices' guide published by GBIF? Arctos fully supports georeferencing "best practices," in part because the authors of that document and of Arctos' spatial data structure are one and the same (John Wieczorek).
5) What is the ease / difficulty of web setup? Acquire password. Enter data. (Arctos is only available via the web.)
ECN Session – Arthropod Collections Databases
Preservation of specimens and their associated data for perpetuity
NSF will help us get our data online, but ensuring they stay online forever is a problem that hasn’t been solved.
33,090 specimens 28 institutions / private collections 736 images 4,516 bibliographic images 428 users
DMNS Arachnology Data In-house -> NSD -> Crash -> K EMu
Cabinets: antiquated, wooden, damaged = unsafe
Arctos
Database: home-made, weak security, mine alone, not online = unsafe
[Diagram: Specimen Catalog (label data and more); Accessions; Loans, usage; Projects contribute and/or use specimens; Citations (publications cite specimens); the rest of cyberspace: GenBank, federated portals, BerkeleyMapper, “Media” in TeraGrid]
ECN Session – Arthropod Collections Databases
6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data?
No.
• Data entry technicians enter data into a staging area.
• Data must be vetted before being loaded by someone with more access privileges.
• All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.
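A minimal Python sketch of the staging-plus-audit idea described above. Arctos does this inside Oracle; the class, method names, and sample values here are hypothetical and only illustrate the principle: staged rows do not touch real data until someone with load rights accepts them, and every accepted change records the previous value so it can be undone.

```python
_MISSING = object()  # marks "this key did not exist before the change"


class AuditedStore:
    """Toy model: staged rows become real data only after review; every write is logged."""

    def __init__(self):
        self.records = {}   # the "real" data
        self.staging = []   # rows entered by technicians, awaiting review
        self.audit = []     # one (user, key, previous value) entry per change

    def stage(self, user, key, value):
        self.staging.append((user, key, value))

    def load_staged(self, reviewer_can_load):
        if not reviewer_can_load:
            raise PermissionError("only a reviewer with load rights may load staged rows")
        for user, key, value in self.staging:
            self.audit.append((user, key, self.records.get(key, _MISSING)))
            self.records[key] = value
        self.staging.clear()

    def rollback_to(self, n_changes):
        """Undo changes, newest first, until only the first n_changes remain applied."""
        while len(self.audit) > n_changes:
            _user, key, previous = self.audit.pop()
            if previous is _MISSING:
                del self.records[key]
            else:
                self.records[key] = previous


store = AuditedStore()
store.stage("technician", "specimen-1", "draft label")
before_review = dict(store.records)   # staging has not changed the real data
store.load_staged(reviewer_can_load=True)
store.stage("technician", "specimen-1", "bad edit")
store.load_staged(reviewer_can_load=True)
store.rollback_to(1)                  # undo only the second change
```

Rolling back one user's updates only, as the slide mentions, would need the undo loop to filter on the `user` field; that is left out to keep the sketch short.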
ECN Session – Arthropod Collections Databases
6) Security - Is/are the database server(s) protected from disaster (e.g., floods, fires)?
Yes. The server runs a RAID array.
Backups:
• continuous logs to a remote NAS
• local drives
• Texas Advanced Computing Center
• San Diego Supercomputer Center
“If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.” - D. McDonald
ECN Session – Arthropod Collections Databases
7) Likes / dislikes & pros / cons
DISLIKES:
- Learning curve fairly steep -> back to kindergarten
- Can’t customize to my heart’s content; each change must be voted on & prioritized by other users
- Web access generally slower than I like (we are all more critical of others than ourselves)
- Only available when networked. Field work in remote areas requires special solutions if data are to be accessed.
- User interface is somewhat garish, clunky, industrial (but works)
ECN Session – Arthropod Collections Databases
7) Likes / dislikes & pros / cons
LIKES:
- Rock-solid security; the data will outlive me
- Web-published
- Cutting-edge web integration (mapping, GenBank, etc.)
- No responsibility on my part to maintain backups, software updates, etc. Need only a networked computer.
- Arctos programmers & designers are biologists / users who really care about “doing it right”
ECN Session – Arthropod Collections Databases
6) Security - can a data entry technician accidentally delete or change (corrupt) large amounts of data?

There are multiple roles and partitions at various levels. A data entry technician has write access to exactly one table, the bulkloader. Additionally, one VPD limits his access to his own collection, another limits access to his own rows, and yet another prevents him from marking records to load. In short, he can only undo anything he's done, and then only in a "staging area" separate from "real" data.

A similar model is used throughout Arctos. We control access at the table and row level, and can easily implement finer-grained control if such becomes necessary. Users (theoretically) get only the rights that they need and have demonstrated an understanding of, to the data they need, all the while having full access to shared data (like agents).

Data like agents and taxonomy - things where character strings rather than data concepts matter to collections - are trigger-protected based on usage. You can't update an agent name after it's been used as an author, for example. This is pretty basic referential integrity, and Arctos is the only thing that has it.

Data and user rules are all handled by the RDBMS, so we can plug in forms written by other people/projects, offer SQL command-line access, webservices, etc., without worrying too much about security or referential integrity. (Specify, for example, cannot safely support such access as all data and access rules live in the application layer.)

All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates back to any point in history. We can re-create all actions by all users.

In addition to ColdFusion's Application Security, we take full advantage of Oracle security - a breach of one just leads to another layer. Oracle handles things like secondary user access and brute-force password crack attempts.

An independent semi-intelligent (and slightly paranoid) security wrapper watches for malicious behavior and blocks IP access if it detects anything anomalous.
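The "trigger-protected based on usage" rule can be illustrated with a small runnable sketch. It uses SQLite through Python's standard `sqlite3` module as a stand-in for Oracle, and the table names, column names, and sample agents are assumptions made for the example, not the Arctos schema. The point it demonstrates is that the rule lives in the database itself, so any client (web form, SQL command line, web service) hits the same check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE publication_author (
    publication_id INTEGER NOT NULL,
    agent_id INTEGER NOT NULL REFERENCES agent(agent_id)
);
-- Refuse to rename an agent once the name has been used as an author.
CREATE TRIGGER protect_used_agent_name
BEFORE UPDATE OF name ON agent
FOR EACH ROW
BEGIN
    SELECT RAISE(ABORT, 'agent name is in use as an author')
    WHERE EXISTS (SELECT 1 FROM publication_author WHERE agent_id = OLD.agent_id);
END;
""")

conn.execute("INSERT INTO agent VALUES (1, 'A. Collector')")
conn.execute("INSERT INTO agent VALUES (2, 'B. Author')")
conn.execute("INSERT INTO publication_author VALUES (100, 2)")

# Agent 1 has never been used as an author, so the rename goes through.
conn.execute("UPDATE agent SET name = 'A. Q. Collector' WHERE agent_id = 1")

# Agent 2 is an author of publication 100, so the trigger aborts the update.
try:
    conn.execute("UPDATE agent SET name = 'B. R. Author' WHERE agent_id = 2")
    blocked = False
except sqlite3.DatabaseError:
    blocked = True

names = dict(conn.execute("SELECT agent_id, name FROM agent"))
```

Because the check is a trigger rather than application code, a second front end written by another project would not need to re-implement it.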
ECN Session – Arthropod Collections Databases
6) Security - Is/are the database server(s) protected from disaster (e.g., floods, fires)?

The server is running a RAID array - we can lose a disk or two and not lose any data (or stop working).

Rollback logs are continuously written to a remote NAS (Network Attached Storage) system.

Daily backups are stored on the local drives, on the NAS, and on tape in GVEA's "bunker." (They won't tell us what or where that is, but your electric bill and medical records are in there and it makes the Department of Homeland Security happy.) Daily backups are also copied to the Texas Advanced Computing Center at Austin (one copy on disk and another on tape) and to tape at the San Diego Supercomputer Center. We may have another copy going to massively redundant disk at the National Center for Supercomputing Applications (University of Illinois at Urbana-Champaign) by the time you get to Reno.

We can recover to the point of failure, or at least to within a couple minutes of it, with one copy of the most recent daily backup and one copy of the rollback logs. (Depending on recent activity, we can usually actually recover from a week-or-so old daily + the rollbacks.) We'll lose <24H of data if we lose all the rollbacks - the server and the NAS. Those are in two buildings, both with serious security, separated by about a hundred yards of gravel parking lot. If we lose all the nightly backups (3 tectonic plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.

There are a couple dozen probes per day - I think it's fairly safe to say that Arctos security has been tested. (Actual attacks are now kind of hard to detect due to the aforementioned paranoid IP killer, which generally shuts them off at the first probe, but we used to get one per week or so.) A big DDoS attack would easily take us down, but (1) we're too boring to attract such a thing, and (2) so what? - those things just eat servers, not data.
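The recovery logic above (one daily backup plus the logs gets you to the point of failure; without the logs you fall back to the last daily) can be sketched in a few lines of Python. This is a toy model, not how Oracle recovery is implemented: it assumes the logs can be treated as an ordered list of `(timestamp, key, new value)` changes, and the function name and sample data are hypothetical.

```python
def recover(snapshot, snapshot_time, change_log, until):
    """Rebuild state from a daily backup, then replay logged changes up to `until`."""
    state = dict(snapshot)
    for timestamp, key, value in sorted(change_log, key=lambda entry: entry[0]):
        # Only changes made after the backup and up to the recovery target are replayed.
        if snapshot_time < timestamp <= until:
            state[key] = value
    return state


daily_backup = {"specimen-1": "v1"}   # taken at hour 0
log = [
    (5, "specimen-2", "v1"),
    (9, "specimen-1", "v2"),
    (14, "specimen-2", "v2"),
]

state_at_10 = recover(daily_backup, 0, log, until=10)    # failure at hour 10
state_without_logs = recover(daily_backup, 0, [], until=10)
```

With the log available, everything up to hour 10 is recovered; with the log lost, the result is just the daily backup, which is the "<24H of data" loss case.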
ECN Session – Arthropod Collections Databases
6) Security - Is/are the database server(s) protected from disaster (e.g., floods, fires)?

We've lost a few disks over the years, but never lost data or had a server go down due to it. (We've had lots of downtime, just not equipment-related.)

Our biggest threat is probably a disgruntled employee with too much access and a long-term plan, but we could probably (with expensive consultant help) even recover from that, and there's no lack of tools to detect such behavior.

That might all be a little overkill - I'd settle for daily backups on 2 major tectonic plates if absolutely necessary - but I certainly think that you have an obligation to do more than install [database X] on some junker computer and maybe buy a tape drive when you take public money to create or curate digital data. [database X] may be free, but supporting it takes a real commitment in hardware, infrastructure, and expertise that most universities are poorly equipped to make. I don't know of a single large project that hasn't at some point lost digital data.
- Dusty McDonald, Arctos programmer
Lessons Learned
1) Proprietary software is generally a bad idea unless you have a guaranteed, sustained budget for staff and upgrades.
2) Back-ups cannot merely be performed/scripted with the assumption that the job is done.
3) Back-ups should NOT be incremental, MUST be stored offsite, and MUST include separate images of operating system and databases.
4) Restoration from bare metal must be fully documented and periodically performed to verify that the process DOES work.
5) Source code must be in a distributed public repository like GitHub.
- D. Shorthouse
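Lessons 2 and 4 come down to testing the restore, not just the backup. A toy Python sketch of that habit follows, under the loud assumption that a "backup" is a single file copy; a real bare-metal restore involves operating system and database images and is far more than this. The function names and the sample file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir, restore_dir):
    """Copy a file to the backup location, restore it elsewhere, and compare checksums."""
    backup = Path(shutil.copy2(source, backup_dir))
    restored = Path(shutil.copy2(backup, restore_dir))
    return sha256(source) == sha256(restored)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "backup").mkdir()
    (tmp / "restore").mkdir()
    source = tmp / "specimens.csv"   # made-up sample data
    source.write_text("catalog_number,taxon\n1,example\n")
    verified = backup_and_verify(source, tmp / "backup", tmp / "restore")
```

The job only counts as done when `verified` is true for a restore that was actually performed, which is the point of lesson 4.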
University of Connecticut Bird Collection data were found... on a single floppy
2031 records in a flat file
University of Connecticut Bird Collection data were found... and made available on-line