Data Basics

Data Basics

Data Matrix • Many datasets can be represented as a data matrix. • Rows corresponding to entities • Columns represents attributes. • N: size of the data • D: dimensionality of the data • Univariate analysis: the analysis of a single attribute. • Bivariate analysis: simultaneous analysis of two attributes. • Multivariate analysis: simultaneous analysis of multiple attributes.

Example for Data Matrix

Attributes • Categorical Attributes • composed of a set of symbols • has a set-valued domain • E.g., Sex with domain(Sex) = {M, F}, Education with domain(Education) = { High School, BS, MS, PhD}. • Two types of categorical attributes • Nominal • values in the domain are unordered • Only equality comparisons are allowed • E.g. Sex • Ordinal • Values are ordered • Both equality and inequality comparisons are allowed • E.g. Education

Attributes Cont. • Numeric Attributes • Has real-valued or integer-valued domain • E.g. Age with domain (Age) = N, where N denotes the set of natural numbers (non-negative integers). • Two types of numeric attributes • Discrete: values take on finite or countably infinite set. • Continuous: values take on any real value • Another Classification • Interval-scaled • for attributes only differences make sense • E.g. temperature. • Ratio-scaled • Both difference and ratios are meaningful • E.g. Age

Algebraic View of Data • If the d attributes in the data matrix D are all numeric • each row can be considered as a d-dimensional point • or equivalently, each row may be considered a d-dimensional column vector • Linear combination of the standard basis vectors

Example of Algebraic View of Data

Geometric View of Data

Distance of Angle

Example of Distance and Angle

Mean and Total Variance

Centered Data Matrix • The centered data matrix is obtained by subtracting the mean from all the points

Orthogonality • Two vectors a and b are said to be orthogonal if and only if • It implies that the angle between them is 90◦ or π/2 radians.

Orthogonal Projection P: orthogonal projection of b on the vector a; R: error vector between points b and p

Example of Projection

Linear Independence and Dimensionality • : the set of all possible linear combinations of the vectors. • If then we say that v1, · · · , vk is a spanning set for .

Row and Column Space • The column space of D, denoted col(D) is the set of all linear combinations of the d column vectors or attributes • The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors or points • Note also that the row space of D is the column space of

Linear Independence

Dimension and Rank • Let S be a subspace of Rm. • A basis for S: a set of linearly independent vectors v1, · · · , vk , and span(v1, · · · , vk) = S. • orthogonal basisfor S: If the vectors in the basis are pair-wise orthogonal • If in addition they are also normalized to be unit vectors, then they make up an orthonormal basis for S. • For instance, the standard basis for Rm is an orthonormal basis consisting of the vectors

Any two bases for S must have the same number of vectors. • Dimension: The number of vectors in a basis for S, denoted as dim(S). • For any matrix, the dimension of its row and column space are the same, and this dimension is also called as the rank of the matrix.

Data: Probabilistic View • Assumes that each numeric attribute Xj is a random variable, defined as a function that assigns a real number to each outcome of an experiment. • Given as Xj : O → R, where O, the domain of Xj , called as the sample space • R, the range of Xj , is the set of real numbers. • If the outcomes are numeric, and represent the observed values of the random variable, then Xj : O → O is simply the identity function: Xj (v) = v for all v ∈ O.

Data: Probabilistic View • A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range. • X is called a continuous random variable if it can take on any value in its range.

Example • Be default, consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, since the outcomes are all numeric. • On the other hand, if we want to distinguish between iris flowers with short and long sepal lengths, we define a discrete random variable A as follows • In this case the domain of A is [4.3, 7.9]. The range of A is {0, 1}, and thus A assumes non-zero probability only at the discrete values 0 and 1.

Example: Bernoulli and Binomial Distribution • only 13 irises have sepal length of at least 7cm • In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1]. p denotes the probability of a success, whereas 1− p represents the probability of a failure

Example: Bernoulli and Binomial Distribution • Let us consider another discrete random variable B, denoting the number of irises with long sepal lengths in m independent Bernoulli trials with probability of success p. • B takes on the discrete values [0,m], and its probability mass function is given by the Binomial distribution • For example, taking p = 0.087 from above, the probability of observing exactly k = 2 long sepal length irises in m = 10 trials is given as

full probability mass function for different values of k

Probability Density Function • If X is continuous, its range is the entire set of real numbers R. • probability density function: specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R

Cumulative Distribution Function • For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0, 1], that gives the probability of observing a value at most some given value x

The following examples are from Andrew Moore

Probability Density Function f(x) • What is P(X=x) when x is on a real domain f(x) >=0 and

Normal Distribution • Let us assume that these values follow a Gaussian or normal density function, given as

Data Basics

Data Basics

Presentation Transcript

Data Communication Basics

Data Communication Basics

Data Guard Basics

DATA COLLECTION BASICS

Data Compression Basics

Data Modeling Basics

Data Model Basics

Data Access Basics

Data Mining Basics: Data

Data Converter Basics

ACNielsen Data Basics

Aerosol Data Assimilation Basics

Basics of Data Compression

AQS Basics Retrieving data

Data Converter Basics

AQS Basics Loading data

Basics of Data Transmission

Data Modeling Basics

PPS BASICS: DATA