Effective design and analysis of bioinformation Unit 3

Effective design and analysis of bioinformation Unit 3 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD igabashvili@yahoo.com

Course availability • Lectures & Lab: every Wednesday, Duncan Hall, Room 550, 6:00 pm to 9:45 pm • Office hours: Wednesday, 4 pm to 6 pm (Room 554, phone: 92404831) and by appointment • Lecture notes will be posted at: http://home.comcast.net/~igabashvili/221T.htm Or the SJSU page -- • The user name is “ewok\biostudents” (don’t enter the quotation marks) • And the password is “4biolecture” (don’t enter the quotation marks).

In the News: Consumer genomics gets crowded • http://www.seqwright.com/ (SOLiD, ABI) • http://www.decodeme.com/ (Illumina) • https://www.23andme.com/ (Illumina) • http://www.navigenics.com/ (Affymetrix) • http://www.knome.com/ (ABI, Amersham, Illumina)

https://www.23andme.com/experts/letters/science/

List from DeCODE genetics Our current list of diseases includes: Age-related Macular Degeneration, Asthma, Alzheimer's Disease, Atrial Fibrillation, Breast Cancer, Celiac Disease, Colorectal Cancer, Exfoliation Glaucoma XFG, Crohn's Disease, Multiple Sclerosis, Myocardial Infarction, Obesity, Prostate Cancer, Psoriasis, Restless Legs, Rheumatoid Arthritis, Type 1 Diabetes and Type 2 Diabetes.

Three important sub-disciplines within bioinformatics • the development of new algorithms and statistics with which to assess relationships among members of large data sets • the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures • the development and implementation of tools that enable efficient access and management of different types of biological information.

Main tasks of biomedical informatics • Storage, Analysis, Visualization and Management of biomedical data • Mining for new knowledge, hypothesis formulation and testing • Development of tools and resources for the above

Brief History of Bioinformatics • 1920 - The term "genome" is introduced by H. Winkler to denote the complete set of chromosomal and extrachromosomal genes • 1933 - A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution • 1951 - Pauling and Corey propose the structures of the alpha-helix and beta-sheet

Brief History of Bioinformatics • 1953 - Watson & Crick propose the double helix model for DNA (data by Franklin & Wilkins) • 1954 - Perutz's group develops methods to solve the phase problem in protein crystallography • 1955 - The sequence of the first protein to be analyzed, bovine insulin, is announced by F. Sanger • 1956 - The first protein sequence reported is that of bovine insulin, consisting of 51 residues

Brief History of Bioinformatics • 1962 - Pauling's theory of molecular evolution • 1965 - M. Dayhoff's Atlas of Protein Sequences • 1970 - Needleman-Wunsch algorithm • 1972 - The Protein Data Bank • 1977 - The first complete genome sequence for an organism (bacteriophage ΦX174): 5,386 bp, nine proteins • 1981 - The Smith-Waterman algorithm; IBM introduces its PC to the market; the concept of a sequence motif (Doolittle)

Brief History of Bioinformatics • 1983 - Sequence database searching (Wilbur-Lipman) • 1986 - Human Genome Initiative announcement • 1987 - SWISS-PROT protein sequence database • 1988 - NCBI created at NIH/NLM (databases) • 1988 - FASTA by Pearson and Lipman; EMBL establishes a sequence database network • 1990 - BLAST by Altschul et al. • 2003 - Human Genome Project completion

The data of biomedical informatics: Public & private databases store biological data in various formats • Sequences: DNA, RNA, proteins • Structures: X-ray, NMR, microscopy • Expression: microarrays, gels • Interaction: two-hybrid, mass spec • Metabolism: GC-MS, NMR • Physiology: medical images, PK/PD

Search Engines • Boolean operators: AND, OR, NOT • Specifying database fields (Organism, Author) • Order of words (proximity): neonatal pre/3 screening ("neonatal" appears within 3 words before "screening") • Wildcards: wom?n (one character), cat*s (any number of characters)
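The `wom?n` and `cat*s` patterns above can be tried out with Python's standard `fnmatch` module, which happens to use the same conventions (`?` for exactly one character, `*` for any run of characters). Real search engines differ in the details, so this is only an illustrative sketch; the word list and the `matching` helper are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary to search; not taken from any real index.
words = ["woman", "women", "womn", "cats", "catalogs", "cat"]

def matching(pattern, candidates):
    """Return the candidates that match a ?/* wildcard pattern (case-sensitive)."""
    return [w for w in candidates if fnmatchcase(w, pattern)]

print(matching("wom?n", words))  # ['woman', 'women']
print(matching("cat*s", words))  # ['cats', 'catalogs']
```

Note that `?` must consume exactly one character, so "womn" is not matched, while `*` may match nothing at all, so "cats" is.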

Search & Download • Entrez: integrated, text-based search and retrieval system for PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, etc + batch download http://www.ncbi.nlm.nih.gov/sites/batchentrez term [field] OPERATOR term [field] 1:10[ESTC] AND Homo sapiens[ORGN] AND deafness[dis] (BSND: Bartter syndrome, infantile, with sensorineural deafness) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=unigene More on the course’s website
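The `term [field] OPERATOR term [field]` pattern above can be assembled programmatically before pasting it into Entrez. The sketch below only builds the query string and does not contact NCBI; the helper names `entrez_term` and `entrez_query` are hypothetical, and the field tags are the ones shown on the slide.

```python
def entrez_term(value, field=None):
    """Format one search term, optionally restricted to a field tag."""
    return f"{value}[{field}]" if field else value

def entrez_query(terms, operator="AND"):
    """Join (value, field) pairs with a single Boolean operator."""
    return f" {operator} ".join(entrez_term(value, field) for value, field in terms)

query = entrez_query([
    ("1:10", "ESTC"),          # range on the field tagged ESTC in the slide
    ("Homo sapiens", "ORGN"),  # organism field
    ("deafness", "dis"),       # field tag copied from the slide as-is
])
print(query)  # 1:10[ESTC] AND Homo sapiens[ORGN] AND deafness[dis]
```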

DATA FORMATS AND DATA INTEGRATION • It is widely recognized that successful data integration is one of the keys to improved productivity in biopharmaceutical R&D • Success in most bioinformatics-related activities, from functional characterization of genomic sequences to prioritization of drug targets, requires an integrated view of all relevant data in a drug discovery R&D program • Bioinformatics data sources often have large, complex data structures, reflecting the richness of the scientific concepts they model. Many bioinformatics data sources cover similar domains, such as genes, proteins, sequence annotations or microarray results.

Database design links http://www.devx.com/ibm/Article/20702 http://www.campus.ncl.ac.uk/databases/design/ http://www.dbazine.com/mullins_datamodel.shtml http://www.extropia.com/tutorials/sql/toc.html http://www.surfermall.com/relational/lesson_1.htm

Database: Definition • A collection of data that: • is organized • usually computer-based • represents repetitive information implicitly • supports retrieval • A set of rules to manipulate data • A method to mold information into knowledge

Database: applications • Who uses computerized databases: • Stores – to keep track of inventory • Hospitals – to keep track of patient info • Travel agents – to keep up with their customers and reservations • Biologists – to efficiently manage and manipulate their data DATA → INFORMATION → KNOWLEDGE

Paper Database as Expert System

HISTORY • 1960's: Two main data models are developed: the network model (CODASYL) and the hierarchical model (IMS). A user would need to know the physical structure of the database in order to query for information. Example: SABRE (IBM/American Airlines). • 1970-72: E.F. Codd proposes the relational model, which disconnects the schema (logical organization) of a database from the physical storage methods.

HISTORY • 1970's: • Ingres: UCB → Ingres Corp., Sybase, MS SQL Server, Britton-Lee, Wang's PACE. This system used QUEL as its query language. • System R: IBM → IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem's Non-Stop SQL. This system used SEQUEL as its query language. • The term Relational Database Management System (RDBMS) is coined

HISTORY • 1976: P. Chen proposed the Entity-Relationship (ER) model for database design • Early 1980's: Commercialization of relational systems begins as a boom… • Mid-1980's: SQL (Structured Query Language) becomes "intergalactic standard". DB2 becomes IBM's flagship product. Network and hierarchical models fade into the background

HISTORY • Early 1990's: Application and personal productivity tool development: PowerBuilder (Sybase), Oracle Developer, VB (Microsoft), Excel/Access (MS) and ODBC. First Object Database Management Systems (ODBMS) prototypes. • Mid-1990's: Internet/WWW. Web/DB grows exponentially, usable for average users

HISTORY • Late-1990's: Boom for Web/Internet/DB connectors. Open source solutions with widespread use of gcc, cgi, Apache, MySQL, etc. Online transaction processing (OLTP) and online analytic processing (OLAP) come of age • Early 21st century: The dot-com bubble bursts, but DB applications keep growing solidly. PDAs, POS transactions, IBM, Microsoft, Oracle.

FUTURE • Terabyte and petabyte databases of everything • Mobile databases • Semantic Web • Object-oriented everything, including databases • Object Database Management Group (ODMG) standards are proposed and accepted • Security issues

Database: advantages A database program: • Can find a specific file quickly • Can easily add records • Can alphabetize and sort data faster than most people • Is as accurate as the data that is entered • Can make many different types of reports • Is invaluable for large amounts of data

Database: Parts Parts of a relational database: • Field = category of information (column) • Entry = data in a field • Record = all of the information about one item (row) • File = document of all of the records • To sort: choose a field, then ascending or descending order (Excel, Works)
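The vocabulary above (field, entry, record, sort) maps directly onto a list of dictionaries. The following is a minimal sketch with made-up employee records, not data from the course.

```python
from operator import itemgetter

# Hypothetical records: each dict is one record (row), each key is a field.
records = [
    {"employeeID": 7513, "name": "Nora Edwards", "job": "Programmer"},
    {"employeeID": 9842, "name": "Ben Smith", "job": "DBA"},
    {"employeeID": 6651, "name": "Ajay Patel", "job": "Programmer"},
]

# An entry is the data held in one field of one record.
entry = records[0]["job"]

# Sorting means choosing a field and a direction.
by_name_ascending = sorted(records, key=itemgetter("name"))
by_id_descending = sorted(records, key=itemgetter("employeeID"), reverse=True)

print([r["name"] for r in by_name_ascending])
print([r["employeeID"] for r in by_id_descending])
```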

Database types • Flat (spreadsheet) • Hierarchical • Network (two fundamental constructs, called records and sets) • Relational

Relational Databases • Relational databases became a big deal in the 1970's, and they are still a big deal today, which is a little peculiar given that the underlying model dates from around 1970. • A relational database is a collection of rectangular tables. Each row of a table is a record about one person or thing; the record contains several pieces of information called fields.

Entities and Relationships Entities – things we store information about Relationships – links between the entities Many-to-many One-to-one One-to-many …

A Table is a Relation • Columns = fields = attributes • Rows = records = tuples (entities) • Records of data, composed of fields, are stored in tables

Keys and Functional Dependencies • Key field (superkey, key): a field that uniquely identifies a record • If there is a functional dependency between column A and column B in a given table (A → B), then the value of column A determines the value of column B. Example: employeeID → name
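A functional dependency can be checked mechanically against a set of rows: it holds if no two rows agree on the left-hand columns while disagreeing on the right-hand columns. The sketch below assumes rows are plain dictionaries; the function name `holds` and the sample data are illustrative only.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        # setdefault stores the first value seen for this key; any later
        # row with the same key must carry the same value.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical employee rows.
employees = [
    {"employeeID": 1, "name": "Ann", "job": "Programmer"},
    {"employeeID": 2, "name": "Bob", "job": "Programmer"},
    {"employeeID": 3, "name": "Cy", "job": "DBA"},
]

print(holds(employees, ["employeeID"], ["name"]))  # True
print(holds(employees, ["job"], ["name"]))         # False
```

A check like this can only refute a dependency on the data at hand. Whether a dependency is meant to hold for all future data is a design decision.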

Schema • Database schema is the structure or design of the database, a blueprint for the data in the database. employee(employeeID, name, job, cube, departmentID) • What information needs to be stored? (things or entities) • What questions will we ask of the database? (queries.)
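As a rough illustration, the `employee` schema above could be declared and queried with Python's built-in `sqlite3` module. The column types, the sample row, and the query are assumptions made for this sketch; the slide specifies only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Column names come from the slide; the types are assumed.
conn.execute("""
    CREATE TABLE employee (
        employeeID   INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        job          TEXT,
        cube         TEXT,
        departmentID INTEGER
    )
""")

# Hypothetical sample row.
conn.execute(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    (7513, "Nora Edwards", "Programmer", "C7", 128),
)

# One "question we will ask of the database": who works in department 128?
rows = conn.execute(
    "SELECT name, job FROM employee WHERE departmentID = ?", (128,)
).fetchall()
print(rows)  # [('Nora Edwards', 'Programmer')]
```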

Flawed schemas This schema design leads to redundancies: Employee(employee ID, name, job, department ID), Department(department ID, department name)

Flawed schemas • Insertion anomaly • Deletion anomaly • Update anomaly
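Update and deletion anomalies are easiest to see on a concrete table. The sketch below assumes a hypothetical single-table design, `employeeDepartment`, that repeats the department name on every employee row; the table name and data are invented for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flawed design: department data repeated on every employee row.
conn.execute("""
    CREATE TABLE employeeDepartment (
        employeeID     INTEGER PRIMARY KEY,
        name           TEXT,
        job            TEXT,
        departmentID   INTEGER,
        departmentName TEXT
    )
""")
conn.executemany(
    "INSERT INTO employeeDepartment VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "Programmer", 128, "Research"),
        (2, "Bob", "DBA", 128, "Research"),
        (3, "Cy", "Clerk", 42, "Finance"),
    ],
)

# Update anomaly: renaming the department on one row only leaves
# department 128 with two different names.
conn.execute(
    "UPDATE employeeDepartment SET departmentName = 'R&D' WHERE employeeID = 1"
)
names = conn.execute(
    "SELECT DISTINCT departmentName FROM employeeDepartment "
    "WHERE departmentID = 128 ORDER BY departmentName"
).fetchall()
print(names)  # two conflicting names for the same department

# Deletion anomaly: removing the last employee of department 42
# also removes every trace of the department itself.
conn.execute("DELETE FROM employeeDepartment WHERE employeeID = 3")
finance_rows = conn.execute(
    "SELECT COUNT(*) FROM employeeDepartment WHERE departmentID = 42"
).fetchone()[0]
print(finance_rows)  # 0
```

The insertion anomaly is the mirror image: in this design a new department cannot be recorded until it has at least one employee, unless the employee columns are left null.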

Avoid Null Values

Normalization Unnormalized table: lists instead of atomic values. This violates the rules of first normal form.

Normalization This schema is in first normal form, 1NF

Second Normal Form, 2NF 2NF: Attributes must depend on the whole key

3NF and BCNF (Boyce-Codd) 3NF: Attributes must depend on nothing but the key BCNF: all the functional dependencies must have a superkey on the left side

Concepts Entities are things, and relationships are the links between them. Relations or tables hold a set of data in tabular form. Columns belonging to tables describe the attributes that each data item possesses. Rows in tables hold data items with values for each column in a table. Keys are used to identify a single row. Functional dependencies identify which attributes determine the values of other attributes. Schemas are the blueprints for a database.

Design Principles Minimize redundancy without losing data. Insertion, deletion, and update anomalies are problems that occur when trying to insert, delete, or update data in a table with a flawed structure. Avoid designs that will lead to large quantities of null values.

Normalization Normalization is a formal process for improving database design. First normal form (1NF) means atomic column or attribute values. Second normal form (2NF) means that all attributes outside the key must depend on the whole key. Third normal form (3NF) means no transitive dependencies. Boyce-Codd normal form (BCNF) means that all attributes must be functionally determined by a superkey.
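Two of the normalization steps summarized above can be sketched on small in-memory tables: splitting a list-valued column into atomic rows (1NF), and moving a transitively dependent attribute into its own table (3NF). All table contents and column names here are hypothetical.

```python
# Unnormalized: the "skills" field holds a list, not an atomic value.
unnormalized = [
    {"employeeID": 1, "name": "Ann", "skills": ["C", "Perl", "Java"]},
    {"employeeID": 2, "name": "Bob", "skills": ["DB2"]},
]

# 1NF: one atomic value per field. The list becomes rows of a second
# table whose key is the pair (employeeID, skill).
employee = [{"employeeID": r["employeeID"], "name": r["name"]} for r in unnormalized]
employee_skill = [(r["employeeID"], s) for r in unnormalized for s in r["skills"]]

# Transitive dependency: employeeID -> departmentID -> departmentName.
flat = [
    {"employeeID": 1, "name": "Ann", "departmentID": 128, "departmentName": "Research"},
    {"employeeID": 2, "name": "Bob", "departmentID": 128, "departmentName": "Research"},
]

# 3NF: departmentName moves to a table keyed by departmentID, so it is
# stored once per department instead of once per employee.
employee_3nf = [
    {k: r[k] for k in ("employeeID", "name", "departmentID")} for r in flat
]
department = {r["departmentID"]: r["departmentName"] for r in flat}

print(employee_skill)
print(department)
```

No information is lost by the decomposition: joining `employee_3nf` with `department` on `departmentID` reproduces the original flat rows.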

Hierarchical Databases (tree diagram) 1234567 • Sandiego, Carmen • 123 Main Street • Labs: Chem7 (K 3.9, Na 142); Chem7 (K 4.3, Na 136)

Hierarchical Databases • Easy to use • Efficient storage • “Tree walking” is fast • Queries across trees are slow • Flexible • Too flexible: chaos is allowed • Too easy to modify • Difficult to document complex structures

Hierarchical Databases • ^MR(1234567)=“Sandiego, Carmen” • ^MR(1234567, “Address”)=“123 Main Street” • ^MR(1234567, “Chem7”, “2/2/02”, “Na”)=136 • ^MR(1234567, “Chem7”, “2/2/02”, “K”)=4.3 • ^MR(1234567, “Chem7”, “2/3/02”, “Na”)=142 • ^MR(1234567, “Chem7”, “2/3/02”, “K”)=3.9
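The subscripted `MR(...)` entries above behave like a tree addressed by a path of keys. One way to mimic that in Python, as a sketch rather than a model of any particular database product, is a dictionary keyed by tuples of subscripts; the `subtree` helper is hypothetical.

```python
# Each key is the full path of subscripts; values are copied from the slide.
MR = {
    (1234567,): "Sandiego, Carmen",
    (1234567, "Address"): "123 Main Street",
    (1234567, "Chem7", "2/2/02", "Na"): 136,
    (1234567, "Chem7", "2/2/02", "K"): 4.3,
    (1234567, "Chem7", "2/3/02", "Na"): 142,
    (1234567, "Chem7", "2/3/02", "K"): 3.9,
}

def subtree(store, prefix):
    """Return every node whose path starts with the given prefix."""
    return {path: value for path, value in store.items()
            if path[:len(prefix)] == prefix}

# Walking down one patient's tree is a simple prefix lookup.
print(subtree(MR, (1234567, "Chem7", "2/2/02")))

# A query across trees (every Na result for every patient) has no useful
# prefix, so this sketch must scan all nodes.
sodium = [value for path, value in MR.items() if path[-1] == "Na"]
print(sodium)  # [136, 142]
```

A real hierarchical store keeps nodes ordered so that a prefix lookup does not scan everything. The linear scan here is a simplification of the sketch.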

Hierarchical Chaos 1234567 Admissions Admission 1 Admit Date: 2/2/02 Primary DX: CHF Other DX AODM A Fib Flag: S Flag: P

Network Databases 1234567 Gyn Clinic 2 Main St. Sandiego 305-2500 Secretary Gyn Clinic 8AM-5PM Ms Smith 305-1000 Service Pap Gyn Visit Dr. Jones Beeper 34

Extensible Markup Language (XML) Databases • SGML is a metalanguage • SGML is used to write Document Type Definitions (DTDs) that define languages • HTML is a language with an SGML DTD • Tags are for formatting/presentation syntax • XML is a proper subset of SGML • XML defines tags that convey semantics • We could write “Health Markup Language” (“HML”) in XML (if we could agree on the semantics and tags) • Tags may or may not be stored with data

<document>
<document.id>CXR001</document.id>
<doc.date>19991101</doc.date>
<document.type> <identifier>P5-00010</identifier> <text>Chest X-Ray</text> </document.type>
<document.body> <findings>No infiltrate, cardiac shadow not enlarged...</findings> <impression>Normal X-ray</impression> </document.body>
</document>
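Because the tags carry semantics, a program can pull individual items out of a report like the chest X-ray document above. The sketch below uses Python's standard `xml.etree.ElementTree`; the exact nesting of the sample record is an assumption, since the slide layout does not make it fully clear, and the `first_text` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed nesting: identifier/text under document.type,
# findings/impression under document.body.
report = """
<document>
  <document.id>CXR001</document.id>
  <doc.date>19991101</doc.date>
  <document.type>
    <identifier>P5-00010</identifier>
    <text>Chest X-Ray</text>
  </document.type>
  <document.body>
    <findings>No infiltrate, cardiac shadow not enlarged...</findings>
    <impression>Normal X-ray</impression>
  </document.body>
</document>
"""

root = ET.fromstring(report.strip())

def first_text(element, tag):
    """Return the text of the first descendant with this tag, or None."""
    found = next(element.iter(tag), None)
    return found.text if found is not None else None

print(first_text(root, "document.id"))  # CXR001
print(first_text(root, "impression"))   # Normal X-ray
```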
