In [1]:
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
  Activating environment at `~/Documents/github.com/ucla-biostat216-2021fall.github.io/slides/02-vector/Project.toml`
In [2]:
using LinearAlgebra, Random, Statistics
Random.seed!(216)
Out[2]:
MersenneTwister(216)

Vectors (BV Chapters 1, 3, 5)

Notation

  • A column vector $\mathbf{x} \in \mathbb{R}^n$ $$ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}. $$ $n$ is the dimension or length or size of the vector. $x_i$ is the $i$-th entry or element or component of $\mathbf{x}$.

  • A row vector $\mathbf{x}' = (x_1 \ldots x_n)$.

In [3]:
# a (column) vector in Julia
x = [1, -1, 2, 0, 3]
Out[3]:
5-element Vector{Int64}:
  1
 -1
  2
  0
  3
In [4]:
# row vector
x'
Out[4]:
1×5 adjoint(::Vector{Int64}) with eltype Int64:
 1  -1  2  0  3
  • Subvector: $\mathbf{x}_{2:4} = (x_2, x_3, x_4)'$.
In [5]:
x[2:4]
Out[5]:
3-element Vector{Int64}:
 -1
  2
  0
  • Stacked vector: $$ \mathbf{x} = \begin{pmatrix} \mathbf{y} \\ \mathbf{z} \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_m \\ z_1 \\ \vdots \\ z_n \end{pmatrix}. $$
In [6]:
y = [1, 2, 3]
z = [4, 5]
x = [y; z]
Out[6]:
5-element Vector{Int64}:
 1
 2
 3
 4
 5

Some special vectors

  • Zero vector: $$ \mathbf{0}_n = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}. $$ The subscript denotes the length of the vector; it is sometimes omitted when the length is clear from context.
In [7]:
zeros(5)
Out[7]:
5-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
  • One vector: $$ \mathbf{1}_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}. $$
In [8]:
ones(5)
Out[8]:
5-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0
  • Unit vector, $\mathbf{e}_i$. It has all zero entries except the $i$-th entry being 1. Note the length of $\mathbf{e}_i$ is often implied by context. $$ \mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad \mathbf{e}_n = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}. $$
In [9]:
# define a unit vector by comprehension
e3 = [i == 3 ? 1 : 0 for i in 1:5]
Out[9]:
5-element Vector{Int64}:
 0
 0
 1
 0
 0

Vector operations

  • Vector addition and vector subtraction. For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ (of same length) $$ \mathbf{x} + \mathbf{y} = \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{pmatrix}, \quad \mathbf{x} - \mathbf{y} = \begin{pmatrix} x_1 - y_1 \\ \vdots \\ x_n - y_n \end{pmatrix}. $$
In [10]:
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
x + y
Out[10]:
5-element Vector{Int64}:
  7
  9
 11
 13
 15
In [11]:
x - y
Out[11]:
5-element Vector{Int64}:
 -5
 -5
 -5
 -5
 -5
  • Scalar-vector multiplication. For a scalar $\alpha \in \mathbb{R}$ and a vector $\mathbf{x} \in \mathbb{R}^n$, $$ \alpha \mathbf{x} = \begin{pmatrix} \alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_n \end{pmatrix}. $$
In [12]:
α = 0.5
x = [1, 2, 3, 4, 5]
α * x
Out[12]:
5-element Vector{Float64}:
 0.5
 1.0
 1.5
 2.0
 2.5
  • Elementwise multiplication or Hadamard product. For two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ (of same length) $$ \mathbf{x} \circ \mathbf{y} = \begin{pmatrix} x_1 y_1 \\ \vdots \\ x_n y_n \end{pmatrix}. $$
In [13]:
# in Julia, the dot syntax performs elementwise operations
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
x .* y
Out[13]:
5-element Vector{Int64}:
  6
 14
 24
 36
 50
  • For scalars $\alpha_1, \ldots, \alpha_k \in \mathbb{R}$ and vectors $\mathbf{x}_1, \ldots, \mathbf{x}_k \in \mathbb{R}^n$, the linear combination $$ \alpha_1 \mathbf{x}_1 + \cdots + \alpha_k \mathbf{x}_k $$ is a sum of scalar-vector products.
In [14]:
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
1 * x + 0.5 * y
Out[14]:
5-element Vector{Float64}:
  4.0
  5.5
  7.0
  8.5
 10.0
  • The inner product or dot product between two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is $$ \mathbf{x}' \mathbf{y} = x_1 y_1 + \cdots + x_n y_n. $$ Another common notation for the inner product is $\langle \mathbf{x}, \mathbf{y} \rangle$.
In [15]:
x = [1, 2, 3, 4, 5]
y = [6, 7, 8, 9, 10]
x'y
Out[15]:
130
In [16]:
# the dot function is available from the standard library LinearAlgebra
dot(x, y)
Out[16]:
130

Properties of inner product:

  1. $\mathbf{x}' \mathbf{x} = x_1^2 + \cdots + x_n^2 \ge 0$.

  2. $\mathbf{x}' \mathbf{x} = 0$ if and only if $\mathbf{x} = \mathbf{0}$.

  3. Commutative: $\mathbf{x}' \mathbf{y} = \mathbf{y}' \mathbf{x}$.

  4. Associative with scalar multiplication: $(\alpha \mathbf{x})' \mathbf{y} = \alpha (\mathbf{x}' \mathbf{y}) = \mathbf{x}' (\alpha \mathbf{y})$.

  5. Distributive with vector addition: $(\mathbf{x} + \mathbf{y})' \mathbf{z} = \mathbf{x}' \mathbf{z} + \mathbf{y}' \mathbf{z}$.

Examples of inner product:

  1. Inner product with unit vector: $\mathbf{e}_i' \mathbf{x} = x_i$.

  2. Differencing: $(\mathbf{e}_i - \mathbf{e}_j)' \mathbf{x} = x_i - x_j$.

  3. Sum: $\mathbf{1}' \mathbf{x} = x_1 + \cdots + x_n$.

  4. Average: $(\frac 1n \mathbf{1})' \mathbf{x} = (x_1 + \cdots + x_n) / n$.
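These four identities are easy to verify numerically; below is a minimal check in Julia (the example vector is arbitrary, and mean comes from the Statistics package loaded above):

x = [1, 2, 3, 4, 5]
e2 = [0, 1, 0, 0, 0]             # unit vector e₂
e4 = [0, 0, 0, 1, 0]             # unit vector e₄
@show e2'x                       # picks out x₂
@show (e2 - e4)'x                # differencing: x₂ - x₄
@show ones(5)'x                  # sum of the entries
@show (ones(5) / 5)'x            # average of the entries
@show (ones(5) / 5)'x ≈ mean(x)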

  • Vector transpose: $$ \mathbf{x}' = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}' = (x_1 \, \ldots \, x_n). $$

    Properties:

    1. $(\mathbf{x}')' = \mathbf{x}$.

    2. Transposition is a linear operator: $(\alpha \mathbf{x} + \beta \mathbf{y})' = \alpha \mathbf{x}' + \beta \mathbf{y}'$.

Superscript $^t$ or $^T$ is also commonly used to denote transpose: $\mathbf{x}^t$, $\mathbf{x}^T$.
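A quick numerical check of these two properties (the vectors and scalars below are arbitrary):

x = [1, 2, 3]
y = [4, 5, 6]
α, β = 2.0, -1.0
@show (x')' == x                           # transposing twice gives back x
@show (α * x + β * y)' ≈ α * x' + β * y'   # linearity of transposition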

Computational complexity of vector operations

  • One floating point operation (flop) is one basic arithmetic operation in $\mathbb{R}$ or $\mathbb{C}$: $+, -, *, /, \sqrt{\cdot}, \ldots$

  • The flop count or operation count is the total number of flops in an algorithm.

  • A (very) crude predictor of run time of the algorithm is $$ \text{run time} \approx \frac{\text{flop count}}{\text{computer speed}}. $$

  • Dominant term: the highest-order term in the flop count. $$ \frac 13 n^3 + 100 n^2 + 10n + 5 \approx \frac 13 n^3. $$

  • Order: the power in the dominant term. $$ \frac 13 n^3 + 100 n^2 + 10n + 5 = \text{order } n^3 = O(n^3). $$

  • Assume vectors are all of length $n$.

    1. Vector addition and subtraction: $n$ flops.

    2. Scalar multiplication: $n$ flops.

    3. Elementwise multiplication: $n$ flops.

    4. Inner product: $2n - 1$ flops.

  • These vector operations are all order $n$ algorithms.

In [17]:
# info about my computer 
versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4
In [18]:
n = 10^7
x = randn(n)
sum(x) # compile
@time sum(x);
  0.004684 seconds (1 allocation: 16 bytes)
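As a rough illustration of the run-time formula above (a crude estimate; the timing varies from run to run and assumes a single trial is representative), one can time an inner product and back out an effective flop rate:

n = 10^7
x, y = randn(n), randn(n)
dot(x, y)               # first call triggers compilation
t = @elapsed dot(x, y)  # elapsed seconds for one inner product
@show t
@show (2n - 1) / t      # ≈ effective flops per second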

Norm, distance, angle

  • The Euclidean norm or L2 norm of a vector $\mathbf{x} \in \mathbb{R}^n$ is $$ \|\mathbf{x}\| = \|\mathbf{x}\|_2 = (\mathbf{x}'\mathbf{x})^{1/2} = \sqrt{x_1^2 + \cdots + x_n^2}. $$

  • The L1 norm of a vector $\mathbf{x} \in \mathbb{R}^n$ is $$ \|\mathbf{x}\|_1 = |x_1| + \cdots + |x_n|. $$

  • Properties of L2 norm.

    1. Positive definiteness: $\|\mathbf{x}\| \ge 0$ for any vector $\mathbf{x}$. $\|\mathbf{x}\| = 0$ if and only if $\mathbf{x}=\mathbf{0}$.

    2. Homogeneity: $\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|$ for any scalar $\alpha$ and vector $\mathbf{x}$.

    3. Triangle inequality: $\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$ for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$. Proof: use the Cauchy-Schwarz inequality.

    4. Cauchy-Schwarz inequality: $|\mathbf{x}' \mathbf{y}| \le \|\mathbf{x}\| \|\mathbf{y}\|$ for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$. The equality holds when (1) $\mathbf{x} = \mathbf{0}$ or $\mathbf{y}=\mathbf{0}$ or (2) $\mathbf{x} \ne \mathbf{0}$, $\mathbf{y} \ne \mathbf{0}$, and $\mathbf{x} = \alpha \mathbf{y}$ for some $\alpha \ne 0$.

      Proof: For $\mathbf{y} \ne \mathbf{0}$, the function $f(t) = \|\mathbf{x} - t \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 - 2t (\mathbf{x}' \mathbf{y}) + t^2\|\mathbf{y}\|_2^2$ is minimized at $t^\star =(\mathbf{x}'\mathbf{y}) / \|\mathbf{y}\|^2$ with minimal value $f(t^\star) = \|\mathbf{x}\|^2 - (\mathbf{x}'\mathbf{y})^2 / \|\mathbf{y}\|^2 \ge 0$; rearranging gives $(\mathbf{x}'\mathbf{y})^2 \le \|\mathbf{x}\|^2 \|\mathbf{y}\|^2$. The case $\mathbf{y} = \mathbf{0}$ is immediate.

In [19]:
# check the triangle inequality
@show x = randn(5)
@show y = randn(5)
@show norm(x + y)
@show norm(x) + norm(y)
@show norm(x + y) ≤ norm(x) + norm(y)
x = randn(5) = [0.7738530709037769, 1.344750883612216, -0.01000755421055196, -1.543182965649529, 1.3848416572338424]
y = randn(5) = [0.7823014256853303, -1.4645887807781848, -1.3848949741410768, 0.6386517857683485, -0.25131514633683166]
norm(x + y) = 2.546525085039811
norm(x) + norm(y) = 4.858168254537617
norm(x + y) ≤ norm(x) + norm(y) = true
Out[19]:
true
In [20]:
# check the Cauchy-Schwarz inequality
@show x = randn(5)
@show y = randn(5)
@show abs(x'y)
@show norm(x) * norm(y)
@show abs(x'y) ≤ norm(x) * norm(y)
x = randn(5) = [1.0646056965408952, 0.7848733126530085, 1.1509742958697582, -0.503293236154652, 1.4424561169317982]
y = randn(5) = [-0.6624673653939196, 0.16075565612441853, -0.5029725911423841, 0.24792071486232836, -1.4234612064756527]
abs(x' * y) = 3.336059373941377
norm(x) * norm(y) = 3.895118899741072
abs(x' * y) ≤ norm(x) * norm(y) = true
Out[20]:
true
  • The (Euclidean) distance between vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as $\|\mathbf{x} - \mathbf{y}\|$.

  • Properties of distances.

    1. Nonnegativity: $\|\mathbf{x} - \mathbf{y}\| \ge 0$ for all $\mathbf{x}$ and $\mathbf{y}$, and $\|\mathbf{x} - \mathbf{y}\| = 0$ if and only if $\mathbf{x} = \mathbf{y}$.

    2. Triangle inequality: $\|\mathbf{x} - \mathbf{y}\| \le \|\mathbf{x} - \mathbf{z}\| + \|\mathbf{z} - \mathbf{y}\|$.

  • The average of a vector $\mathbf{x}$ is $$ \operatorname{avg}(\mathbf{x}) = \bar{\mathbf{x}} = \frac{x_1 + \cdots + x_n}{n} = \frac{\mathbf{1}' \mathbf{x}}{n}. $$

  • The root mean square (RMS) of a vector is $$ \operatorname{rms}(\mathbf{x}) = \sqrt{\frac{x_1^2 + \cdots + x_n^2}{n}} = \frac{\|\mathbf{x}\|}{\sqrt n}. $$

  • The standard deviation of a vector $\mathbf{x}$ is $$ \operatorname{std}(\mathbf{x}) = \sqrt{\frac{(x_1 - \bar{\mathbf{x}})^2 + \cdots + (x_n - \bar{\mathbf{x}})^2}{n}} = \operatorname{rms}(\mathbf{x} - \bar{\mathbf{x}} \mathbf{1}) = \frac{\|\mathbf{x} - (\mathbf{1}' \mathbf{x} / n) \mathbf{1}\|}{\sqrt n}. $$

  • Theorem: $\operatorname{avg}(\mathbf{x})^2 + \operatorname{std}(\mathbf{x})^2 = \operatorname{rms}(\mathbf{x})^2$.

    This result underlies the famous bias-variance tradeoff in statistics.

    Proof: TODO in class. $\operatorname{std}(\mathbf{x})^2 = \frac{\|\mathbf{x} - (\mathbf{1}' \mathbf{x} / n) \mathbf{1}\|^2}{n} = ... = \operatorname{rms}(\mathbf{x})^2 - \operatorname{avg}(\mathbf{x})^2$.

In [21]:
x = randn(5)
@show mean(x)^2 + std(x, corrected = false)^2
@show norm(x)^2 / length(x)
# floating point arithmetic is not exact
@show mean(x)^2 + std(x, corrected = false)^2 ≈ norm(x)^2 / length(x)
mean(x) ^ 2 + std(x, corrected = false) ^ 2 = 0.27681813422052637
norm(x) ^ 2 / length(x) = 0.2768181342205265
mean(x) ^ 2 + std(x, corrected = false) ^ 2 ≈ norm(x) ^ 2 / length(x) = true
Out[21]:
true
  • The angle between two nonzero vectors $\mathbf{x}, \mathbf{y}$ is $$ \theta = \operatorname{arccos} \left(\frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}\right). $$ This is the unique value of $\theta \in [0, \pi]$ that satisfies $$ \mathbf{x}'\mathbf{y} = \|\mathbf{x}\|\|\mathbf{y}\| \cos \theta. $$ A numerical example follows the terminology table below.

  • The Cauchy-Schwarz inequality guarantees that $$ -1 \le \cos \theta = \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|} \le 1. $$

  • Terminology

Angle | Sign of inner product | Terminology
----- | --------------------- | -----------
$\theta = 0$ | $\mathbf{x}'\mathbf{y} = \|\mathbf{x}\| \|\mathbf{y}\|$ | $\mathbf{x}$ and $\mathbf{y}$ are aligned or parallel
$0 \le \theta < \pi/2$ | $\mathbf{x}'\mathbf{y} > 0$ | $\mathbf{x}$ and $\mathbf{y}$ make an acute angle
$\theta = \pi/2$ | $\mathbf{x}'\mathbf{y} = 0$ | $\mathbf{x}$ and $\mathbf{y}$ are orthogonal, $\mathbf{x} \perp \mathbf{y}$
$\pi/2 < \theta \le \pi$ | $\mathbf{x}'\mathbf{y} < 0$ | $\mathbf{x}$ and $\mathbf{y}$ make an obtuse angle
$\theta = \pi$ | $\mathbf{x}'\mathbf{y} = -\|\mathbf{x}\| \|\mathbf{y}\|$ | $\mathbf{x}$ and $\mathbf{y}$ are anti-aligned or opposed

TODO: Picture.
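A quick numerical illustration of the angle formula (the two vectors below are arbitrary):

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
cosθ = x'y / (norm(x) * norm(y))
θ = acos(cosθ)          # angle in radians, in [0, π]
@show cosθ
@show θ
@show x'y > 0           # positive inner product, so an acute angle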

Linear independence

  • A set of vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n$ is linearly dependent if there exist constants $\alpha_1, \ldots, \alpha_k$, not all zero, such that $$ \alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k = \mathbf{0}. $$ The vectors are linearly independent if they are not linearly dependent; that is, $\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k = \mathbf{0}$ implies $\alpha_1 = \cdots = \alpha_k = 0$.

  • Theorem: Unit vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n \in \mathbb{R}^n$ are linearly independent.

    Proof: TODO in class.

  • Theorem: If $\mathbf{x}$ is a linear combination of linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k$, that is, $\mathbf{x} = \alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k$, then the coefficients $\alpha_1, \ldots, \alpha_k$ are unique.

    Proof: TODO in class.

  • Independence-dimension inequality or order-dimension inequality. If the vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n$ are linearly independent, then $k \le n$.

    In words, a linearly independent set of vectors in $\mathbb{R}^n$ can have at most $n$ elements. Or any set of $n+1$ or more vectors in $\mathbb{R}^n$ is linearly dependent.

    Proof (optional): We show this by induction on the dimension $n$. Base case $n = 1$: let $a_1, \ldots, a_k \in \mathbb{R}^1$ be linearly independent. We must have $a_1 \ne 0$, so every element of the collection is a multiple $a_i = (a_i / a_1) a_1$ of the first element. If $k \ge 2$ this contradicts linear independence, so $k = 1 \le n$.
    Induction step: suppose $n \ge 2$ and the independence-dimension inequality holds in $\mathbb{R}^{n-1}$. Let $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n$ be linearly independent; we partition them as $$ \mathbf{a}_i = \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix}, \quad i = 1,\ldots,k, $$ where $\mathbf{b}_i \in \mathbb{R}^{n-1}$ and $\alpha_i \in \mathbb{R}$.
    First suppose $\alpha_1 = \cdots = \alpha_k = 0$. Then the vectors $\mathbf{b}_1, \ldots, \mathbf{b}_k$ are linearly independent: $\sum_{i=1}^k \beta_i \mathbf{b}_i = \mathbf{0}$ if and only if $\sum_{i=1}^k \beta_i \mathbf{a}_i = \mathbf{0}$, which is only possible for $\beta_1 = \cdots = \beta_k = 0$ because the vectors $\mathbf{a}_i$ are linearly independent. The vectors $\mathbf{b}_i$ therefore form a linearly independent collection of $(n-1)$-vectors. By the induction hypothesis we have $k \le n-1$ so $k \le n$.
    Next we assume the scalars $\alpha_i$ are not all zero, say $\alpha_j \ne 0$. We define a collection of $k-1$ vectors $\mathbf{c}_i$ of length $n-1$ as follows: $$ \mathbf{c}_i = \mathbf{b}_i - \frac{\alpha_i}{\alpha_j} \mathbf{b}_j, \quad i = 1, \ldots, j-1, \qquad \mathbf{c}_i = \mathbf{b}_{i+1} - \frac{\alpha_{i+1}}{\alpha_j} \mathbf{b}_j, \quad i = j, \ldots, k-1. $$ These $k-1$ vectors are linearly independent: if $\sum_{i=1}^{k-1} \beta_i \mathbf{c}_i = \mathbf{0}$, then $$ \sum_{i=1}^{j-1} \beta_i \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix} + \gamma \begin{pmatrix} \mathbf{b}_j \\ \alpha_j \end{pmatrix} + \sum_{i=j+1}^k \beta_{i-1} \begin{pmatrix} \mathbf{b}_i \\ \alpha_i \end{pmatrix} = \mathbf{0} $$ with $\gamma = - \alpha_j^{-1} \left( \sum_{i=1}^{j-1} \beta_i \alpha_i + \sum_{i=j+1}^k \beta_{i-1} \alpha_i \right)$. Since the vectors $\mathbf{a}_i$ are linearly independent, the coefficients $\beta_i$ and $\gamma$ are all zero, so the vectors $\mathbf{c}_1, \ldots, \mathbf{c}_{k-1}$ are indeed linearly independent. They have length $n-1$, so by the induction hypothesis $k-1 \le n-1$, and therefore $k \le n$.
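A small numerical illustration of the inequality, using the rank function from the LinearAlgebra package (rank counts the number of linearly independent columns of a matrix; matrices are covered in later chapters):

# any 4 vectors in ℝ³ must be linearly dependent
A = randn(3, 4)     # 4 random vectors in ℝ³ stored as columns
@show rank(A)       # at most 3
@show rank(A) < 4   # so the 4 columns are linearly dependent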

Basis

  • A set of $n$ linearly independent vectors $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^n$ is called a basis for $\mathbb{R}^n$.

  • Fact: the zero vector $\mathbf{0}_n$ cannot be a basis vector in $\mathbb{R}^n$. Why?

  • Any vector $\mathbf{x} \in \mathbb{R}^n$ can be expressed as a linear combination of basis vectors $\mathbf{x} = \alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n$ for some $\alpha_1, \ldots, \alpha_n$, and these coefficients are unique. This is called expansion of $\mathbf{x}$ in the basis $\mathbf{a}_1, \ldots, \mathbf{a}_n$.

    Proof of existence by contradiction: Suppose $\mathbf{x}$ can NOT be expressed as a linear combination of the basis vectors. Consider an arbitrary linear combination $\alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n + \beta \mathbf{x} = \mathbf{0}$. Then $\beta = 0$, since otherwise $\mathbf{x} = -\beta^{-1}(\alpha_1 \mathbf{a}_1 + \cdots + \alpha_n \mathbf{a}_n)$, contradicting our assumption. Then $\alpha_1 = \cdots = \alpha_n = 0$ by the linear independence of $\mathbf{a}_1, \ldots, \mathbf{a}_n$. Therefore $\alpha_1 = \cdots = \alpha_n = \beta = 0$, so $\mathbf{a}_1, \ldots, \mathbf{a}_n, \mathbf{x}$ are linearly independent, contradicting the independence-dimension inequality ($n+1$ linearly independent vectors in $\mathbb{R}^n$).

    Proof of uniqueness: TODO in class.

  • Example: Unit vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ form a basis for $\mathbb{R}^n$. Expansion of a vector $\mathbf{x} \in \mathbb{R}^n$ in this basis is $$ \mathbf{x} = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n. $$
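As a sketch of computing an expansion numerically (this peeks ahead to matrices: hcat collects the basis vectors as columns and \ solves the resulting linear system; the basis and vector below are arbitrary):

a1 = [1, 1, 0]
a2 = [1, -1, 0]
a3 = [0, 0, 2]
x  = [3.0, 1.0, 4.0]
α  = hcat(a1, a2, a3) \ x    # the unique coefficients of x in this basis
@show α
@show α[1] * a1 + α[2] * a2 + α[3] * a3 ≈ x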

Orthonormal basis

  • A set of vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k$ are (mutually) orthogonal if $\mathbf{a}_i \perp \mathbf{a}_j$ for any $i \ne j$. They are normalized if $\|\mathbf{a}_i\|=1$ for all $i$. They are orthonormal if they are both orthogonal and normalized.

    Orthonormality is often expressed compactly by $\mathbf{a}_i'\mathbf{a}_j = \delta_{ij}$, where $$ \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases} $$ is the Kronecker delta notation.

  • Theorem: An orthonormal set of vectors is linearly independent.

    Proof: TODO in class. Expand $\|\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k\|^2 = (\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k)'(\alpha_1 \mathbf{a}_1 + \cdots + \alpha_k \mathbf{a}_k)$.

  • By the independence-dimension inequality, we must have $k \le n$. When $k = n$, the vectors $\mathbf{a}_1, \ldots, \mathbf{a}_n$ are called an orthonormal basis.

  • Examples of orthonormal basis:

    1. Unit vectors $\mathbf{e}_1, \ldots, \mathbf{e}_n$ in $\mathbb{R}^n$.
    2. The 3 vectors in $\mathbb{R}^3$: $$ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \quad \frac{1}{\sqrt 2} \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \quad \frac{1}{\sqrt 2} \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}. $$

TODO: Picture.

In [22]:
a1 = [0, 0, 1]
a2 = (1 / sqrt(2)) * [1, 1, 0]
a3 = (1 / sqrt(2)) * [1, -1, 0]
@show norm(a1), norm(a2), norm(a3)
@show a1'a2, a1'a3, a2'a3
(norm(a1), norm(a2), norm(a3)) = (1.0, 0.9999999999999999, 0.9999999999999999)
(a1' * a2, a1' * a3, a2' * a3) = (0.0, 0.0, 0.0)
Out[22]:
(0.0, 0.0, 0.0)
  • Orthonormal expansion. If $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^n$ is an orthonormal basis, then for any vector $\mathbf{x} \in \mathbb{R}^n$, $$ \mathbf{x} = (\mathbf{a}_1'\mathbf{x}) \mathbf{a}_1 + \ldots + (\mathbf{a}_n'\mathbf{x}) \mathbf{a}_n. $$

    Proof: Since $\mathbf{a}_1, \ldots, \mathbf{a}_n$ form a basis, we can write $\mathbf{x} = \beta_1 \mathbf{a}_1 + \cdots + \beta_n \mathbf{a}_n$ for some coefficients $\beta_1, \ldots, \beta_n$. Taking the inner product with $\mathbf{a}_i$ on both sides and using orthonormality gives $\mathbf{a}_i'\mathbf{x} = \beta_i$.

In [23]:
@show x = randn(3)
@show (a1'x) * a1 + (a2'x) * a2 + (a3'x) * a3
@show x ≈ (a1'x) * a1 + (a2'x) * a2 + (a3'x) * a3
x = randn(3) = [2.0627778719698555, 1.414919144471015, -0.45520068621714055]
(a1' * x) * a1 + (a2' * x) * a2 + (a3' * x) * a3 = [2.0627778719698555, 1.414919144471015, -0.45520068621714055]
x ≈ (a1' * x) * a1 + (a2' * x) * a2 + (a3' * x) * a3 = true
Out[23]:
true

Gram-Schmidt algorithm

  • Given vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n$, the Gram-Schmidt (G-S) algorithm generates an orthonormal set of vectors $\mathbf{q}_1, \mathbf{q}_2, \ldots$.

  • For $i=1,\ldots,k$:

    1. Orthogonalization: $\tilde{\mathbf{q}}_i = \mathbf{a}_i - (\mathbf{q}_1' \mathbf{a}_i) \mathbf{q}_1 - \cdots - (\mathbf{q}_{i-1}' \mathbf{a}_i) \mathbf{q}_{i-1}$.

    2. Test for linear independence: if $\tilde{\mathbf{q}}_i = \mathbf{0}$, quit.

    3. Normalization: $\mathbf{q}_i = \tilde{\mathbf{q}}_i / \|\tilde{\mathbf{q}}_i\|$.

  • If G-S does not stop early (in step 2), $\mathbf{a}_1, \ldots, \mathbf{a}_k$ are linearly independent.

  • If G-S stops early in iteration $i=j$, then $\mathbf{a}_j$ is a linear combination of $\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}$, so $\mathbf{a}_1, \ldots, \mathbf{a}_k$ are linearly dependent (the earlier vectors $\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}$ are themselves linearly independent).
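The cells below carry out G-S by hand for three vectors. As a compact sketch (the function name, tolerance, and early-return convention are my own choices, not from BV), the same steps can be packaged into a Julia function:

# minimal Gram-Schmidt sketch: returns the orthonormal vectors computed
# before the first (numerically) zero q̃ᵢ is encountered
function gram_schmidt(a::Vector{<:Vector}; tol = 1e-10)
    q = Vector{Vector{Float64}}()
    for ai in a
        q̃ = float(ai)
        for qj in q
            q̃ -= (qj'ai) * qj       # orthogonalization against earlier qⱼ
        end
        norm(q̃) < tol && return q   # early termination: aᵢ is dependent
        push!(q, q̃ / norm(q̃))       # normalization
    end
    return q
end

Calling gram_schmidt([a1, a2, a3]) on the vectors generated in the next cells should match the q1, q2, q3 computed step by step there, up to floating point error.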

In [24]:
# n = 5, k = 3
@show a1 = randn(5)
@show a2 = randn(5)
@show a3 = randn(5);
a1 = randn(5) = [0.3794474614022756, -1.8588962079349254, 1.1184926614886757, 0.018548652538052303, -0.16650209737984434]
a2 = randn(5) = [0.386337896419044, -0.6124240184915484, 0.40703118404375854, 0.4987380584188193, 1.4437657000426827]
a3 = randn(5) = [-2.1799967231158695, 1.6267724851759806, 0.5422451069937892, 0.4868174173021037, -2.310038880156914]
In [25]:
# for i = 1
# orthogonalization
@show q̃1 = copy(a1)
# test for linear independence
@show norm(q̃1) ≈ 0
# normalization
@show q1 = q̃1 / norm(q̃1);
q̃1 = copy(a1) = [0.3794474614022756, -1.8588962079349254, 1.1184926614886757, 0.018548652538052303, -0.16650209737984434]
norm(q̃1) ≈ 0 = false
q1 = q̃1 / norm(q̃1) = [0.1717929968742402, -0.8416062378152109, 0.5063921249832146, 0.008397812428932689, -0.07538301663429839]
In [26]:
# for i = 2
# orthogonalization
@show q̃2 = a2 - (q1'a2) * q1
# test for linear independence
@show norm(q̃2) ≈ 0
# normalization
@show q2 = q̃2 / norm(q̃2);
q̃2 = a2 - (q1' * a2) * q1 = [0.2689585617873683, -0.03738788771764001, 0.061033548664228776, 0.49300016657686413, 1.4952719226248516]
norm(q̃2) ≈ 0 = false
q2 = q̃2 / norm(q̃2) = [0.16821910046553434, -0.02338411091423074, 0.03817319882396163, 0.3083450625248457, 0.9352120865896744]
In [27]:
# for i = 3
# orthogonalization
@show q̃3 = a3 - (q1'a3) * q1 - (q2'a3) * q2
# test for linear independence
@show norm(q̃3) ≈ 0
# Normalization
@show q3 = q̃3 / norm(q̃3);
q̃3 = (a3 - (q1' * a3) * q1) - (q2' * a3) * q2 = [-1.5554755602603014, 0.48444174163375003, 1.2872928987631225, 1.2359361903751251, -0.16813936962635756]
norm(q̃3) ≈ 0 = false
q3 = q̃3 / norm(q̃3) = [-0.6421710176235804, 0.19999957193298795, 0.5314530243342332, 0.5102506406196355, -0.06941557479532266]
In [28]:
# test for orthonormality of q1, q2, q3
@show norm(q1), norm(q2), norm(q3)
@show q1'q2, q1'q3, q2'q3;
(norm(q1), norm(q2), norm(q3)) = (1.0, 0.9999999999999999, 0.9999999999999999)
(q1' * q2, q1' * q3, q2' * q3) = (-9.71445146547012e-17, 2.6020852139652106e-18, 4.163336342344337e-17)

Show by induction that $\mathbf{q}_1, \ldots, \mathbf{q}_i$ are orthonormal:

  • Assume it's true for $i-1$.

  • Orthogonalization step ensures that $\tilde{\mathbf{q}}_i \perp \mathbf{q}_1, \ldots, \tilde{\mathbf{q}}_i \perp \mathbf{q}_{i-1}$. To show this, take inner product of both sides with $\mathbf{q}_j$, $j < i$ $$ \mathbf{q}_j' \tilde{\mathbf{q}}_i = \mathbf{q}_j' \mathbf{a}_i - (\mathbf{q}_1' \mathbf{a}_i) (\mathbf{q}_j' \mathbf{q}_1) - \cdots - (\mathbf{q}_{i-1}' \mathbf{a}_i) (\mathbf{q}_j' \mathbf{q}_{i-1}) = \mathbf{q}_j' \mathbf{a}_i - \mathbf{q}_j' \mathbf{a}_i = 0. $$

  • So $\mathbf{q}_1, \ldots, \mathbf{q}_i$ are orthogonal. The normalization step ensures that $\mathbf{q}_i$ has unit norm.

Suppose G-S has not terminated by iteration $i$. Then

  • $\mathbf{a}_i$ is a combination of $\mathbf{q}_1, \ldots, \mathbf{q}_i$, and

  • $\mathbf{q}_i$ is a combination of $\mathbf{a}_1, \ldots, \mathbf{a}_i$.
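Continuing with the vectors from the cells above (reusing a3, q1, q2, q3, and q̃3 already in the workspace), the first claim can be checked directly; the coefficient of q3 is ‖q̃3‖:

@show a3 ≈ (q1'a3) * q1 + (q2'a3) * q2 + norm(q̃3) * q3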

Computational complexity of G-S algorithm:

  • Step 1 of iteration $i$ requires $i-1$ inner products, $\mathbf{q}_1' \mathbf{a}_i, \ldots, \mathbf{q}_{i-1}' \mathbf{a}_i$, which costs $(i-1)(2n-1)$ flops.

  • $2n(i-1)$ flops to compute $\tilde{\mathbf{q}}_i$.

  • $3n$ flops to compute $\|\tilde{\mathbf{q}}_i\|$ and $\mathbf{q}_i$.

  • Assuming no early termination, total computational cost is: $$ \sum_{i=1}^k ((4n-1) (i - 1) + 3n) = (4n - 1) \frac{k(k-1)}{2} + 3nk \approx 2nk^2, $$ using $\sum_{i=1}^k (i - 1) = k(k-1)/2$.
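A quick sanity check of this closed form, for one arbitrary choice of n and k:

n, k = 100, 10
exact  = sum((4n - 1) * (i - 1) + 3n for i in 1:k)
closed = (4n - 1) * k * (k - 1) ÷ 2 + 3n * k
@show exact == closed
@show 2 * n * k^2      # dominant-term approximation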