
Scientific data cloud infrastructure and services in Chinese Academy of Sciences

Jianhui LI (lijh@cnic.cn), Yuanke Wei (weiyuanke@cnic.cn), Yuanchun Zhou (zyc@cnic.cn). Computer Network Information Center, Chinese Academy of Sciences.


Presentation Transcript


  1. Scientific data cloud infrastructure and services in Chinese Academy of Sciences • Jianhui LI (lijh@cnic.cn), Yuanke Wei (weiyuanke@cnic.cn), Yuanchun Zhou (zyc@cnic.cn) • Computer Network Information Center, Chinese Academy of Sciences

  2. Outline • About us • CAS (Chinese Academy of Sciences) • CNIC(Computer Network Information Center), CAS • SDC(Scientific Data Center), CNIC, CAS • About Scientific Data Cloud of CAS • Data Challenge • Architecture • Infrastructure Service • Middleware Service • Data Service • Conclusion

  3. CAS is a leading academic institution and comprehensive research and development center in natural science, technological science and high-tech innovation in China. • It was founded in Beijing on 1st November 1949 on the basis of the former Academia Sinica (Central Academy of Sciences) and Peiping Academy of Sciences.

  4. CNIC • A public support institution for the consistent construction, operation and services of the information infrastructure of CAS. • A pioneer, promoter and participant in the informatization of domestic scientific research and scientific research management

  5. Operation and Services in CNIC: provided by 7 business departments

  6. Scientific Data Center • Scientific Data Center (SDC) is the support facility in charge of the construction, management, operation and maintenance of the CAS Informatization Data Application Environment, and has been taking the lead in implementing the CAS Scientific Database Project for more than 20 years. • SDC provides storage services, data services and related application technology services for the entire CAS. • SDC hosts the Secretariat of the Committee on Data for Science and Technology (CODATA) and the CAS Secretariat for the World Wide Web Consortium (W3C). • The vision of SDC is to become an important facilitator of the exchange and application of scientific data resources, a key technology supplier throughout the scientific data lifecycle, and a leader in transforming scientific data into knowledge services.

  7. Outline • About us • CAS (Chinese Academy of Sciences) • CNIC(Computer Network Information Center), CAS • SDC(Scientific Data Center), CNIC, CAS • About Scientific Data Cloud of CAS • Data Challenge • Architecture • Infrastructure Service • Middleware Service • Data Service • Conclusion

  8. Data research is getting hotter and hotter • Mar. 29, 2012: the Obama Administration announced the “Big Data Research and Development Initiative” ($200 million): improving our ability to extract knowledge and insights from large and complex collections of digital data • Feb. 11, 2011: Science issued a special online collection, “Dealing with Data” • Sep. 2009: Nature published “Data’s shameful neglect”: research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly. • The Second International Symposium on Dataology & Data Science was held 3 days ago in China • Difficult to discover • Difficult to access • Being lost

  9. Data Driven Scientific Discovery • Data is regarded as the most valuable thing. Scientific discovery based on data-intensive computing is now considered the “fourth paradigm” after theoretical, experimental, and computational science. • “The impact of Jim Gray’s thinking is continuing to get people to think in a new way about how data and software are redefining what it means to do science.” (Bill Gates)

  10. Data Growth Outpaces Moore’s Law • IDC: data doubles in less than 18 months • Huge volume • Rapid increase • Various types and formats
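
To make the doubling claim on slide 10 concrete, the following is a minimal arithmetic sketch. It assumes a doubling period of exactly 18 months, which the slide gives only as an upper bound, so the printed factors are lower bounds on growth. The function name `growth_factor` is illustrative and not from the presentation.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


# Growth over several horizons, assuming an 18-month doubling period.
for years in (1.5, 3, 6, 15):
    factor = growth_factor(years * 12)
    print(f"{years:>4} years: data volume grows by a factor of {factor:,.0f}")
```

Under this assumption, a collection that holds 1 PB today would need roughly 1,000 times that capacity in 15 years, which is why the later slides stress scalable storage.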

  11. Data Challenge • Scientists are being overwhelmed by exploding volumes of scientific data. • Much scientific research needs data distributed across different locations. • There is a growing gap between the capabilities of modern scientific instruments and those of scientists. • It has been a great challenge to view, manipulate, store, move, share, and interpret the massive data.

  12. Scientific Data Deluge in CAS • Large scientific facilities produce huge volumes of data: 20+ in operation, 20+ under construction • Long-term field observation stations: 100+ stations covering ecology, environment, space, etc. • Long-term research data need to be archived and shared: 100+ institutes • [Figure: field observation stations; large scientific facilities]

  13. Advanced CI for Data Lifecycle in CAS • [Figure: data lifecycle stages linked by information streams: generation & collection, transmission, storage & curation, computing & analysis, application] • Data sources: field observation stations, large scientific facilities, others • High speed network: CSTNET, CSTNET-CNGI, GLORIAD • Supercomputing grid: computing, analysis, mining, visualization • Data centers: storage & preservation, curation, sharing and service • Supports data-intensive e-Science activities and applications

  14. Cloud Computing • It is a mixed evolution of grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, etc. • Its characteristics (large scale, virtualization, high reliability, generality, expandability, on-demand service, and very low cost) make it a popular computing paradigm. • It can bridge scientists and massive data. • The Chinese Academy of Sciences Scientific Data Cloud (CASSDC) uses cloud technology to give scientists easier access to powerful information infrastructure, massive scientific data and rich scientific software.

  15. Services of CASSDC

  16. Scientific Data Infrastructure • [Figure: layers of the infrastructure] • Scientific databases • Application-enabled environments and typical e-Science practice • Software and toolkits (scientific data collection, curation and publishing, data analysis and visualization…) • Middleware (scientific data grid middleware, internet-based storage service middleware…) • Massive storage system • Data-intensive computing facility • High speed network

  17. Data Centers Distribution of CASSDC • Scientific data: ~1 PB, from more than 60 institutions, multiple disciplines • Storage capacity: ~22 PB (50 PB), in 1 major center, 1 archive center and 12 middle-size centers • Computing capacity: ~5000 (10000) CPU cores, with a dedicated design for DIC

  18. System Arch. of Major Center

  19. Enabling Technology: Infrastructure • Global File System of Cloud Storage

  20. Enabling Technology: Infrastructure • On-the-fly provisioning of a computing cluster

  21. Scientific Databases (SDB) • A long-term mission started in 1986 and funded by CAS • Many institutes involved • Long-term, large-scale collaboration • Data from research, for research • Collecting multi-discipline research data and promoting data sharing • More than 350 research databases and 500 datasets from 61 institutes • Over 200 TB of data available for open access and download http://www.csdb.cn

  22. Scientific Databases (cont.) • Focus: data integration, and upgrading research databases into resource databases and even reference databases • [Figure: database tiers: research databases, resource database, reference database, application-oriented database]

  23. Scientific Databases (cont.) • 8 resource databases • Geo-Science • Biodiversity • Chemistry • Astronomy • Space Science • Microbiology and Virus • Material Science • Environment • 2 reference databases • China Species • Compound • 4 application-oriented databases • High Energy (ITER) • Western Environment Research • Ecology Research • Qinghai Lake Research

  24. Scientific Databases (cont.) • 37 research databases • Physics & Chemistry • Geosciences • Biosciences • Atmospheric & Ocean Science • Energy Science • Material Science • Astronomy & Space Science

  25. CAS Scientific Data Grid • SDG is built upon the Scientific Databases and supports finding and accessing large-scale, distributed and heterogeneous scientific data uniformly and conveniently, in a secure and proper way • Builds scientific data application grids according to domain requirements • Integrates distributed data, analysis tools, and storage and computing facilities behind a uniform data service interface • 4 pilot grids: bioscience grid, geoscience grid, chemistry grid, astronomy and space science grid

  26. Scientific Data Grid: Architecture • [Figure: organization architecture of SDG]

  27. SDG: Platform & Middleware • Platform • SDGIM: Information Management • SDGOM: Operation Management • SDGSA: Storage Service • SDGMS: Monitor & Statistic • Middleware • SDGDD: Data Publish • SDGDT: Data Transfer Toolkit • SDGDC: Data Compress Toolkit • SDGMM: Metadata Management • SDGJS: Job Scheduler

  28. Tools for data management and service

  29. An Integrated Case on Geography Supported by CASSDC • Project: High Precision Display of Earth Surface • Data and computing resources are both distributed • The model comes from a CAS scientist • Adopted middleware: • Data search • Data transport • On-the-fly computing provision • Job scheduler • It handles massive-data computation that some commercial geometric software cannot
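
The four middleware steps listed on slide 29 can be pictured as a simple pipeline. The sketch below is purely illustrative: every name in it (`Dataset`, `Cluster`, `search_data`, `transport`, `provision_cluster`, `schedule_job`, the catalog entries and site names) is hypothetical, uses in-memory stand-ins, and does not reflect the actual CASSDC or SDG middleware interfaces, which the slides do not show.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Dataset:
    name: str
    site: str  # hypothetical identifier of the site holding the data


@dataclass(frozen=True)
class Cluster:
    site: str
    nodes: int


# Hypothetical catalog of distributed datasets (illustrative entries only).
CATALOG = [
    Dataset("dem_tiles_north", "site_a"),
    Dataset("dem_tiles_south", "site_b"),
    Dataset("soil_samples", "site_a"),
]


def search_data(keyword: str) -> List[Dataset]:
    """Step 1, data search: find datasets whose name contains the keyword."""
    return [d for d in CATALOG if keyword in d.name]


def transport(datasets: List[Dataset], target_site: str) -> List[Dataset]:
    """Step 2, data transport: model a copy of each dataset at the target site."""
    return [Dataset(d.name, target_site) for d in datasets]


def provision_cluster(site: str, nodes: int) -> Cluster:
    """Step 3, on-the-fly provision: model allocating a cluster on demand."""
    return Cluster(site=site, nodes=nodes)


def schedule_job(cluster: Cluster, datasets: List[Dataset],
                 model: Callable[[Dataset], str]) -> List[str]:
    """Step 4, job scheduling: run the model on data local to the cluster."""
    return [model(d) for d in datasets if d.site == cluster.site]


def run_case(keyword: str, model: Callable[[Dataset], str],
             site: str = "compute_site", nodes: int = 4) -> List[str]:
    found = search_data(keyword)
    staged = transport(found, site)
    cluster = provision_cluster(site, nodes)
    return schedule_job(cluster, staged, model)


results = run_case("dem_tiles", lambda d: f"rendered {d.name} at {d.site}")
print(results)
```

The point of the sketch is only the ordering: distributed data is located first, moved next to the compute resources, and a cluster is provisioned on demand before the scientist's model is scheduled against the staged data.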

  30. An Integrated Case on Biology Supported by CASSDC • Gene Alignment Project • Data: • Microbiology Institute • World Data Center for Microorganisms • Wuhan Virus Institute • Computing: • CNIC • Microbiology Institute • Adopted middleware: • Data search • Data transport • Job scheduler • User authentication

  31. An Integrated Case on Biology Supported by CASSDC (cont.)

  32. Cooperation • International Organization Membership

  33. Cooperation with Europe • CSTNET provides network support for data transmission between Europe and China • ITER • CERN LHC: ATLAS & CMS • ARGO-Yangbajing • Global Earth Observation System of Systems

  34. Challenges • On-demand linking of multi-disciplinary data based on semantics • Big data processing • Highly scalable, low cost, high throughput • On-demand flexible data processing • Integrating data, storage, computing, analysis models, etc. as a whole system driven by one specific scientific problem • Making infrastructure invisible to scientists

  35. Conclusion • Scientific discovery has become increasingly data intensive, and it calls for reliable and easily accessible scientific data infrastructure • CAS continues to promote the building of scientific data infrastructure and data-intensive e-Science practices • We are seeking potential cooperation in data-intensive e-Science and data cloud

  36. Thank you!
