
Big Data Management and Data Quality - Strategies in the US Financial Industry



Presentation Transcript


  1. Big Data Management and Data Quality - Strategies in the US Financial Industry • 汪时奇 (Ph.D.) • Processing speed • Capacity limits • Data quality

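  2. Overview • The Chinese term 数据 (literally "numeric data") <= Data = information (not a collection of numbers) • Data science ≈ information science • Why study big data? Prices of related products (e.g. hard disks, memory, CPUs) are falling exponentially; information is exploding; big data creates many new problems • Big data research is a synthesis of many disciplines (IT, DM, BI, BA, …) • Industry strategies for big data problems (see below)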

  3. 1. Database strategies • 1.1 Database (DB) performance • 1.2 DB space

  4. 1.1 DB performance • Auditing: 2 tables, a small active one and a huge passive one • Partition • Index (good/bad; Cluster; Global/Local) • Lock type (when to apply row locks) • Transaction: 1-phase or 2-phase • Normalization • Internal optimization (e.g. execution plan => hints in Oracle) • Use constraints (e.g. CHECK) in place of triggers • Tricks (e.g. date functions; search the small table first; …)

  5. 1.2 DB space • Space arrangement for even distribution (e.g. 1 huge table uses a few data files) • Cleaning procedure with defragmentation • Partition design with a cleaning plan

  6. 2. Applications (software) (Java example) • Using an advanced language (e.g. Java or C#) • 2.1 Memory • 2.2 Disk/network space • 2.3 Performance • 2.4 Maintainability

  7. 2.1 Memory • Minimize the creation and coexistence of big objects • Let GC (garbage collection) reclaim big objects, or null their references, once they are out of scope • Choose an appropriate GC type • gc() • Try to split one big object into small objects • Use a mutable class for frequently changed big objects (e.g. StringBuilder instead of String)
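
The StringBuilder bullet can be illustrated with a minimal sketch. The class `ReportBuilder` and both method names are hypothetical and do not come from the slides; the two methods return the same text but allocate very differently.

```java
import java.util.List;

// Hypothetical illustration; the class and method names are not from the slides.
final class ReportBuilder {

    // Immutable String: each += copies everything built so far into a new object,
    // so many large temporary strings coexist until the GC reclaims them.
    static String joinWithString(List<String> lines) {
        String out = "";
        for (String line : lines) {
            out += line + "\n";
        }
        return out;
    }

    // Mutable StringBuilder: appends go into one growing buffer.
    static String joinWithBuilder(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(line).append('\n');
        }
        return out.toString();
    }
}
```

For a handful of lines the difference is negligible; it matters when the text is large and changes often, which is the case the slide singles out.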

  8. 2.2 Disk/network space • Smart cleanup and archive processes, e.g. archive zipped old or unused files to low-speed network space and delete very old files from that space • Smart logging settings, e.g. log4j size-based rolling; avoid duplicated or trivial logging info • Monitor space usage
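
One way to read the archive-then-delete bullet is as an age-based retention rule. The sketch below covers only the decision step, not the actual zipping or file moves; the class `RetentionPolicy`, its `Action` values, and the two thresholds are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical retention rule; names and thresholds are not from the slides.
final class RetentionPolicy {

    enum Action { KEEP, ARCHIVE, DELETE }

    private final Duration archiveAfter; // older than this: zip and move to slow storage
    private final Duration deleteAfter;  // older than this: remove entirely

    RetentionPolicy(Duration archiveAfter, Duration deleteAfter) {
        if (archiveAfter.compareTo(deleteAfter) >= 0) {
            throw new IllegalArgumentException("archiveAfter must be shorter than deleteAfter");
        }
        this.archiveAfter = archiveAfter;
        this.deleteAfter = deleteAfter;
    }

    // Decide what to do with a file based on its last-modified time.
    Action decide(Instant lastModified, Instant now) {
        Duration age = Duration.between(lastModified, now);
        if (age.compareTo(deleteAfter) >= 0) {
            return Action.DELETE;
        }
        if (age.compareTo(archiveAfter) >= 0) {
            return Action.ARCHIVE;
        }
        return Action.KEEP;
    }
}
```

A scheduled job could call `decide` for each file and then compress, move, or delete it accordingly.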

  9. 2.3 Performance • Avoid redundant processing (in big loops) • Maximize reuse • Multi-threading • DB access • Logging: avoid slow options (e.g. line #)
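
As a small sketch of the first two bullets, the example below moves loop-invariant work out of a big loop and reuses the result. The class `TickerFilter` and the ticker-symbol pattern are hypothetical, chosen only for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; the class name and the regex are not from the slides.
final class TickerFilter {

    // Redundant work: String.matches compiles the regex again on every iteration.
    static int countSlow(List<String> symbols) {
        int count = 0;
        for (String symbol : symbols) {
            if (symbol.matches("[A-Z]{1,5}")) {
                count++;
            }
        }
        return count;
    }

    // Reuse: the pattern is compiled once and shared by every call.
    private static final Pattern TICKER = Pattern.compile("[A-Z]{1,5}");

    static int countFast(List<String> symbols) {
        int count = 0;
        for (String symbol : symbols) {
            if (TICKER.matcher(symbol).matches()) {
                count++;
            }
        }
        return count;
    }
}
```

Both methods return the same count; only the amount of repeated work inside the loop differs.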

  10. 2.4 Maintainability • SOA principles: loose coupling, reusability, granularity, modularity, composability, componentization, interoperability, … • JEE patterns (DAO, DTO, Business Delegate, …) • Design patterns (23) and MVC: creational, structural, behavioral (e.g. Visitor) • OOP principles: abstraction, encapsulation, polymorphism, …; Open/Closed
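
The slide names Visitor as an example of a behavioral pattern. A minimal sketch follows, assuming a hypothetical two-class instrument hierarchy (`Bond`, `Equity`) and a simplified valuation rule; none of these names or formulas come from the slides.

```java
// Hypothetical Visitor sketch; all type names and formulas are illustrative.
interface InstrumentVisitor<R> {
    R visitBond(Bond bond);
    R visitEquity(Equity equity);
}

interface Instrument {
    <R> R accept(InstrumentVisitor<R> visitor);
}

final class Bond implements Instrument {
    final double faceValue;
    final double pricePercent; // quoted as a percentage of face value

    Bond(double faceValue, double pricePercent) {
        this.faceValue = faceValue;
        this.pricePercent = pricePercent;
    }

    @Override
    public <R> R accept(InstrumentVisitor<R> visitor) {
        return visitor.visitBond(this);
    }
}

final class Equity implements Instrument {
    final long shares;
    final double price;

    Equity(long shares, double price) {
        this.shares = shares;
        this.price = price;
    }

    @Override
    public <R> R accept(InstrumentVisitor<R> visitor) {
        return visitor.visitEquity(this);
    }
}

// A new operation is added as a new visitor, without touching Bond or Equity.
final class MarketValueVisitor implements InstrumentVisitor<Double> {
    @Override
    public Double visitBond(Bond bond) {
        return bond.faceValue * bond.pricePercent / 100.0;
    }

    @Override
    public Double visitEquity(Equity equity) {
        return equity.shares * equity.price;
    }
}
```

This also touches the Open/Closed bullet: the instrument classes stay closed to modification while new operations can be added as further visitors.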

  11. 3. Data quality control • 3.1 Business • 3.2 Process: A. Failover & DR (Disaster Recovery); B. QA (Quality Assurance) (see <软件质量管理点滴>, "Notes on Software Quality Management", for details); C. UAT (User Acceptance Test) • 3.3 Technology

  12. 3.1 Business • Reduce manual work; increase automation • Complete approval system for manual work, e.g. 1-level => 2- or 3-level approval • Extend viewpoints to confirm data quality • Reduce redundant systems (e.g. due to mergers or vendors) • Schedule Cleansing (see 3.1.E) • Enhance Reconciliation (see 3.1.F) • Build Trust level (see 3.1.G) • Try to cover all rare cases

  13. 3.1.E Cleansing • When: at system merge; at major change • How: develop detection applications; deliver mismatch reports to IT & business; find solutions on both the IT and business sides

  14. 3.1.F Reconciliation • Where: more than one subsystem holds data for the same content; more than one subsystem has independent data change functionality • What: run and improve the reconciliation application routinely; categorize reports by urgency; analyze reports; debug, adjust the business rule, or apply Cleansing
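
The core of a reconciliation run can be sketched as a key-by-key comparison of two subsystems. The class `Reconciler`, the labels "A" and "B", and the report wording are assumptions for illustration; categorizing by urgency is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical reconciliation sketch; names and report format are not from the slides.
final class Reconciler {

    // Compare two key -> value snapshots (assumed to contain no null values)
    // and return one report line per mismatch, sorted by key.
    static List<String> reconcile(Map<String, String> systemA, Map<String, String> systemB) {
        TreeSet<String> keys = new TreeSet<>(systemA.keySet());
        keys.addAll(systemB.keySet());

        List<String> report = new ArrayList<>();
        for (String key : keys) {
            String a = systemA.get(key);
            String b = systemB.get(key);
            if (a == null) {
                report.add(key + ": missing in A");
            } else if (b == null) {
                report.add(key + ": missing in B");
            } else if (!a.equals(b)) {
                report.add(key + ": A=" + a + " B=" + b);
            }
        }
        return report;
    }
}
```

An empty report means the two snapshots agree; each non-empty line is a candidate for analysis, a rule adjustment, or cleansing.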

  15. 3.1.G Trust level • When: there is more than one fixed data input; the inputs are independent; the final details must be decided from the inputs • How (based on): provider level (for a detailed data group); data history • Example providers: Bloomberg, Reuters, Telekurs, DTCC, …; Moody's, S&P, Fitch.
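
A provider-level rule can be sketched as "take the value from the most trusted provider that supplied one". The class `TrustResolver`, the `Quote` type, the placeholder provider names, and the integer trust scale are all assumptions; the slides do not specify how trust levels are represented, and data history is not modeled here.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical trust-level sketch; names and the integer scale are illustrative.
final class TrustResolver {

    static final class Quote {
        final String provider;
        final String value;

        Quote(String provider, String value) {
            this.provider = provider;
            this.value = value;
        }
    }

    // Return the value from the provider with the highest trust level.
    // Providers without a configured level count as 0; on ties, the earlier quote wins.
    static Optional<String> resolve(List<Quote> quotes, Map<String, Integer> trustLevels) {
        Quote best = null;
        int bestLevel = Integer.MIN_VALUE;
        for (Quote quote : quotes) {
            int level = trustLevels.getOrDefault(quote.provider, 0);
            if (level > bestLevel) {
                best = quote;
                bestLevel = level;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.value);
    }
}
```

In practice the trust map would likely be kept per data group, as the slide suggests, rather than as one global ranking.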

  16. 3.2.A Failover & DR • Failover: DB (2+ at different locations; real-time replication); App (Active-Active: cluster with load balancing; Active-Passive: auto via SAN, or manual + auto) • DR: DB (e.g. daily, hourly, or real-time replication); App (manual switch)

  17. 3.3 Technology • DB design: CHECK constraints (for sensitive table values); normalization (to reduce duplication); validation processes (to find conflicting data) • Application design: data integrity check (e.g. cryptographic signature, CRC check); data display (e.g. Excel dropping leading zeros or converting dates to numbers)
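
The CRC check mentioned above can be sketched with the JDK's `java.util.zip.CRC32`. The wrapper class `CrcCheck` and its method names are hypothetical; a CRC detects accidental corruption only and does not replace the cryptographic signature listed next to it.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical wrapper around the JDK CRC32 class; method names are illustrative.
final class CrcCheck {

    // Sender side: compute the CRC-32 of the payload's UTF-8 bytes.
    static long checksum(String payload) {
        CRC32 crc = new CRC32();
        crc.update(payload.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Receiver side: recompute and compare with the checksum that was sent along.
    static boolean isIntact(String payload, long expectedChecksum) {
        return checksum(payload) == expectedChecksum;
    }
}
```

The sender would transmit `checksum(payload)` alongside the data, and the receiver would reject any record for which `isIntact` returns false.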
