LyX Document

1 Tactics

Telescoping, and similar adjacent cancellations.
Symmetry. May be multidimensional. E.g. $\sum x, \sum x^{2}, \sum x^{3}, \dots$
Make both sides the same form. E.g. solve $x^{x} = 2^{x + 4}$ .
Log-derivative trick: $\nabla log f (x) = \frac{\nabla f (x)}{f (x)}$ . E.g. many places in RL, e.g. policy gradient derivation:
Shortcut for substitutions: add/subtract equations

2 Identities, approximations, limits

Identity: ${lim}_{x \to \infty} {(1 + \frac{a}{x})}^{x} = e^{a}$
$e^{x} > 1 + x$ for $x > 0$ and $e^{x} \approx 1 + x$ for $- .1 < x < .1$
Euler's identity: $e^{i π} + 1 = 0$ (from $e^{i x} = cos x + i sin x$ )
Telescoping series: series with terms $t_{n} = a_{n + 1} - a_{n}$ , e.g. for solving $\sum_{n = 1}^{\infty} \frac{1}{n (n + 1)}$
Golden ratio $φ = \frac{1 + \sqrt{5}}{2}$

3 Combinatorics

Binomial coefficient aka $k$ -combinations: $(\binom{n}{k}) = \frac{n!}{(n - k)! k!}$
Combinations with repetition aka multicombination aka multisubset: $((\binom{n}{k})) = (\binom{n + k - 1}{k})$ ways to choose $k$ items from $n$ types.
- Imagine $n$ types of objects to instantiate $k$ times—place $n - 1$ dividers among the $k$ objects, so there are $n - 1 + k$ slots for placing either dividers or objects. (Order of types, objects doesn't matter.)
Multinomial coefficient aka multiple bins: $(\binom{n}{k_{1}, k_{2}, \dots, k_{m}}) = \frac{n!}{k_{1}! k_{2}! \dots k_{m}!}$
- 2 bins is equivalent to binomial coefficient: $(\binom{n}{k, n - k}) = (\binom{n}{k})$
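
A minimal sketch of the counting formulas above, using only the standard library. The helper names `multiset_coefficient` and `multinomial` are illustrative, not from any library.

```python
from math import comb, factorial

def multiset_coefficient(n: int, k: int) -> int:
    """Ways to choose k items from n types with repetition (stars and bars)."""
    return comb(n + k - 1, k)

def multinomial(*ks: int) -> int:
    """Ways to split sum(ks) labeled items into bins of sizes ks."""
    result = factorial(sum(ks))
    for k in ks:
        result //= factorial(k)
    return result

print(multiset_coefficient(3, 2))  # 6: aa, ab, ac, bb, bc, cc
print(multinomial(2, 2))           # 6
print(comb(4, 2))                  # 6, the 2-bin case is the binomial coefficient
```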

4 Calculus

\begin{matrix} \frac{d}{d x} [z = f (y = g (x))] & = \frac{d z}{d y} \frac{d y}{d x} \\ \frac{d}{d x} \frac{1}{f (x)} & = - {(f (x))}^{- 2} \frac{d f (x)}{d x} \\ \frac{d}{d x} e^{x} & = e^{x} \end{matrix}

5 General

geometric mean ${(\prod_{i} x_{i})}^{\frac{1}{n}}$ is exp of arith mean of logs, $exp (\frac{1}{n} \sum_{i} log x_{i})$
- eg annualizing compounding: given annual growths $a, b, c > 1$ and initial price $p_{0}$ , $p_{3} = a b c p_{0} = μ^{3} p_{0}$ where geometric mean $μ = \sqrt[3]{a b c}$
harmonic mean ${(\frac{1}{n} \sum_{i = 1}^{n} x_{i}^{- 1})}^{- 1}$
- if $x_{i}$ subject to (arithmetic-)mean-preserving spread, harmonic mean decreases
- preferable way to avg multiples, e.g. P/E ratio
- vs arith mean
  - A travels 20mph for 1h then 30mph for 1h, avg speed is arith mean
  - A travels 20mph for 1mi then 30mph for 1mi, avg speed is harmonic mean
- F-1 score is harmonic mean of precision & recall
power mean $M_{r} (\{x_{i}\}) = {(\frac{1}{n} \sum_{i} x_{i}^{r})}^{\frac{1}{r}}$
- $r = - 1$ harmonic, $r = 0$ geom, $r = 1$ arith, $r = 2$ quadratic (root mean square), $r = - \infty$ min, $r = \infty$ max
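
The special cases of the power mean can be checked numerically. This is a sketch assuming positive inputs; `power_mean` is an illustrative name, and the speed example reuses the 20 mph / 30 mph trip from the harmonic mean notes.

```python
import math

def power_mean(xs, r):
    """Power mean M_r of positive numbers xs; r = 0 is the geometric-mean limit."""
    n = len(xs)
    if r == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    if r == math.inf:
        return max(xs)
    if r == -math.inf:
        return min(xs)
    return (sum(x ** r for x in xs) / n) ** (1 / r)

speeds = [20.0, 30.0]
print(power_mean(speeds, 1))   # equal time at each speed: 25 mph
print(power_mean(speeds, -1))  # equal distance at each speed: about 24 mph
print(power_mean(speeds, 0))   # geometric mean, between the two
```

The means are ordered by $r$ : harmonic, then geometric, then arithmetic, then quadratic.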
Stirling's approx
- $n! \approx \sqrt{2 π n} {(\frac{n}{e})}^{n}$ . Prove with gamma function, Taylor approximation, and Gaussian integral.
- $ln n! = n ln n - n + O (log n)$ where last term is $\frac{1}{2} ln (2 π n)$
- ${lim}_{n \to \infty} \frac{n!}{\sqrt{2 π n} {(\frac{n}{e})}^{n}} = 1$
Taylor series: represent the function using derivatives around some location a . Intuition: capture the position, slope, curvature, etc. within some locality.
- $f (x) = f (a) + \frac{f' (a)}{1!} (x - a) + \frac{f'' (a)}{2!} {(x - a)}^{2} + \frac{f^{(3)} (a)}{3!} {(x - a)}^{3} + \dots = \sum_{n = 0}^{\infty} \frac{f^{(n)} (a)}{n!} {(x - a)}^{n}$
- Maclaurin series is Taylor series at $a = 0$
- Taylor polynomial: the first $n + 1$ terms of the series
- common Maclaurin series
  - of polynomial is the polynomial
  - Geometric series: ${(1 - x)}^{- 1} = 1 + x + x^{2} + x^{3} + \dots$
  - Integrating the above gives $- ln (1 - x) = x + \frac{1}{2} x^{2} + \frac{1}{3} x^{3} + \dots$ , so $ln (1 - x) = - x - \frac{1}{2} x^{2} - \frac{1}{3} x^{3} - \dots$ , and $ln (1 + x) = x - \frac{x^{2}}{2} + \frac{x^{3}}{3} - \dots$
  - $e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \dots$

6 Information Theory

surprisal: $- log P (x) = log \frac{1}{P (x)}$ ; in bits; additive; used in entropy, KLIC, etc.
- $P (x) = \frac{1}{n} ⟹ - log P (x) = log n$
entropy $H (X) = E [I (X)] = \sum_{x} p (x) lg \frac{1}{p (x)} = - \sum_{x} p (x) lg p (x) \geq 0$ (expected information content)
- lower prob events have higher information content
- measured in bits, hence use lg
- e.g. 25% rain vs 75% sunny - knowing the weather gives you $- \frac{1}{4} lg \frac{1}{4} - \frac{3}{4} lg \frac{3}{4} \approx 0.81$ bits
- Intuition: how random is $X$ ? How large is log prob in expectation under itself?
- for Bernoulli, max entropy when $p$ is half
- More random is more entropy
mutual information $I (X; Y) = \sum_{y} \sum_{x} p_{X, Y} (x, y) log \frac{p_{X, Y} (x, y)}{p_{X} (x) p_{Y} (y)} \geq 0$
- self-information is entropy: $I (X; X) = H (X)$
- $I (X; Y) = H (X) - H (X ∣ Y) = H (Y) - H (Y ∣ X) = H (X) + H (Y) - H (X, Y) = H (X, Y) - H (X ∣ Y) - H (Y ∣ X)$
- symmetric uncertainty $U (X, Y) = 2 \frac{I (X; Y)}{H (X) + H (Y)} \in [0, 1]$
- It is the KL divergence of the true joint distribution from the outer product of its marginals: $I (X; Y) = D_{K L} ({Pr}_{X, Y} | {Pr}_{X} \otimes {Pr}_{Y})$ where the outer product distribution just assigns ${Pr}_{X} (x) {Pr}_{Y} (y)$ to each $x, y$ .
- relationship to correlation
  - MI measures general dependence, correlation measures linear dependence; MI is better for measuring dependence
  - MI applicable to symbolic sequences; correlation applicable only to numerical sequences; but MI must estimate continuous distributions
  - http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=D065413DAA29F4C500219B28221E904A?doi=10.1.1.15.672&rep=rep1&type=pdf
cross-entropy $H (p, q) = - \sum_{i} p_{i} lg q_{i}$
- $p$ is “true” class distribution, $q$ is predicted class distribution
- Use this to measure the expected number of bits used by given coding scheme
- E.g., given coding scheme for reporting type of weather, see what's the expected number of bits, with respect to the true weather distribution; hopefully use few bits for most common types of weather. Can compare to the entropy of the weather, which is the “optimal”/“truth”.
Kullback–Leibler divergence aka KL divergence aka KLIC: non-symmetric measure of difference btwn dists P,Q
- Cross-entropy is entropy + KL: $H (p, q) = H (p) + D_{K L} (p | q)$
- expected # extra bits to code samples from $P$ when using code based on $Q$ rather than on $P$
- alt intuition: avg log likelihood of data distributed as $P$ given $Q$ as model: $D_{K L} (P | Q) = - log \overline{L}$ where $L = Pr [X \sim P ∣ Q]$
  - e.g., consider multinomial classification, with true distribution having one-hot encoding—common to use cross entropy over softmax as a loss
  - Minimizing cross entropy loss is just minimizing the KL loss
- $D_{K L} (P | Q) = E_{P (x)} [log \frac{P (x)}{Q (x)}] = \sum_{i} P (i) log \frac{P (i)}{Q (i)} = \sum_{i} P (i) (log P (i) - log Q (i))$ ; integral for continuous
- $D_{K L} \geq 0$ ; $D_{K L} = 0$ for $P = Q$ ; asymmetric.
- mutual information $I (X; Y) = D_{K L} (Pr [X, Y] | Pr [X] Pr [Y])$
- If $q$ is the model, $D_{K L} (p | q)$ is “forward” and will learn solutions that spread out to cover most of $p$ (mass-covering), $D_{K L} (q | p)$ is “reverse” and will learn mode-seeking solutions. https://andrewcharlesjones.github.io/journal/klqp.html
- MLE equates to minimizing KL divergence (there's a proof)
- http://www.snl.salk.edu/~shlens/kl.pdf
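
A small sketch of entropy, cross-entropy, and KL divergence for discrete distributions, using the 25% rain / 75% sun example from the entropy notes. The model distribution `q` is a made-up fair-coin guess, and all function names are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected bits when events from p are coded with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p | q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.25, 0.75]   # "true" weather: rain, sun
q = [0.5, 0.5]     # hypothetical model that treats weather as a fair coin
print(entropy(p))                                  # about 0.811 bits
print(cross_entropy(p, q))                         # 1 bit per report
print(kl_divergence(p, q), kl_divergence(q, p))    # differ: KL is asymmetric
```

The extra bits paid for using the wrong code are exactly the KL term: `cross_entropy(p, q)` equals `entropy(p) + kl_divergence(p, q)`.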
perplexity (PPL): $PPL (p) = 2^{H (p)} = \prod_{x} {p (x)}^{- p (x)}$ , i.e. it's the exp of (cross) entropy
- Since it's 2 to the power of the entropy (avg # bits), and entropy is the expected bits to encode information in a random var, PPL is the effective number of equally likely choices
  - If average sentence can be encoded in 100 bits, model perplexity is $2^{100}$ per sentence
- In language models
  - PPL of a string of words $\{s_{i}\}$ is inverse probability (likelihood) of words, normalized to number of words $m$ : $P {(s_{1}, \dots, s_{m})}^{- \frac{1}{m}} = \sqrt[m]{\frac{1}{P (s_{1}, \dots, s_{m})}} = \sqrt[m]{\prod_{i = 1}^{m} \frac{1}{P (s_{i} ∣ s_{1}, \dots, s_{i - 1})}}$
  - As exp of entropy: $e^{C E (X)} = exp (- \frac{1}{t} \sum_{i}^{t} log p (x_{i} ∣ x_{< i}))$
  - PPL is just reciprocal of geometric mean of the per-word conditional probs
  - Taking the geometric mean (normalizing by $m$ ) means you can compare across different lengths of sentences
Smaller PPL means better model. Higher likelihood is better. Smaller cross entropy is better.
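
A sketch of language-model perplexity computed from per-token conditional probabilities, showing that the $m$ -th root normalization is the reciprocal of a geometric mean. The probabilities are invented for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities p(s_i | s_<i)."""
    m = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / m
    return math.exp(avg_neg_log)

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

probs = [0.5, 0.25, 0.125, 0.5]   # hypothetical model outputs for 4 tokens
print(perplexity(probs))
print(1 / geometric_mean(probs))  # same value
print(perplexity([0.25] * 10))    # uniform over 4 choices: about 4
```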
normalized compression distance (NCD): $N C D (x, y) = \frac{C (x y) - min \{C (x), C (y)\}}{max \{C (x), C (y)\}}$

7 Finance

rate of return (ROR) aka return on investment (ROI) aka return
- let $V_{f}$ be final value, $V_{i}$ be initial value
- ratio: $r = \frac{V_{f}}{V_{i}}$
- arithmetic return aka yield: $r_{a r i t h} = \frac{V_{f} - V_{i}}{V_{i}} = r - 1$
- logarithmic/continuous compound return: $r_{log} = ln \frac{V_{f}}{V_{i}} = ln r = ln (1 + r_{a r i t h})$
- compound annual growth rate (CAGR): ${(\frac{V_{f}}{V_{i}})}^{\frac{1}{n}} - 1$ where $n$ is # years
- annual percentage rate (APR)
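
The return definitions above can be sketched as follows. The prices and the 3-year horizon are made-up numbers, chosen so that the CAGR comes out near 10%.

```python
import math

def arithmetic_return(v_i, v_f):
    return (v_f - v_i) / v_i

def log_return(v_i, v_f):
    return math.log(v_f / v_i)

def cagr(v_i, v_f, years):
    return (v_f / v_i) ** (1 / years) - 1

v_i, v_f, years = 100.0, 133.1, 3   # hypothetical prices
print(arithmetic_return(v_i, v_f))  # about 0.331
print(log_return(v_i, v_f))         # ln(1.331), about 0.286
print(cagr(v_i, v_f, years))        # about 0.10, since 1.1**3 = 1.331
```

Log returns add across periods, while arithmetic returns compound; that is the main reason to work in logs.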

8 Signal Processing

DFT: $X_{k} = \sum_{n = 0}^{N - 1} x_{n} exp (- \frac{2 π i}{N} k n)$
- IDFT: $x_{n} = \frac{1}{N} \sum_{k = 0}^{N - 1} X_{k} exp (i 2 π k \frac{n}{N})$ (normalized, changed exp sign)
- interesting presentation: strength of freq $k$ is distance from origin of the average position (centroid) of your signal's points as the signal is spun around a circle http://altdevblogaday.org/2011/05/17/understanding-the-fourier-transform/
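
A direct $O (N^{2})$ sketch of the DFT and its inverse, written straight from the definitions rather than as a fast transform. The 4-sample signal is an invented example.

```python
import cmath

def dft(x):
    """X_k = sum_n x_n exp(-2 pi i k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x_n = (1/N) sum_k X_k exp(+2 pi i k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

signal = [1.0, 0.0, -1.0, 0.0]                # one cycle of a cosine over 4 samples
spectrum = dft(signal)
print([round(abs(c), 6) for c in spectrum])   # energy only at k = 1 and k = 3
recovered = idft(spectrum)
print([round(c.real, 6) for c in recovered])  # matches the input up to rounding
```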
IIR, FIR: TODO

9 Probability

PMF for discrete, PDF for continuous

9.1 Distributions

Binomial: # successes in n Bernoulli trials each with success prob p
- $Pr [X = k] = (\binom{n}{k}) p^{k} {(1 - p)}^{n - k}$
- $E [X] = n p$
- $V a r [X] = n p (1 - p)$
Geometric: # trials until Bernoulli success with prob p
- $Pr [X = k] = {(1 - p)}^{k - 1} p$
- $E [X] = \frac{1}{p}$
- $V a r [X] = \frac{1 - p}{p^{2}}$
Hypergeom: # successes in n draws from population of N containing m successes
- $Pr [X = k] = \frac{(\binom{m}{k}) (\binom{N - m}{n - k})}{(\binom{N}{n})}$
- $E [X] = n \frac{m}{N}$
- $V a r [X] = n \frac{m}{N} \frac{(N - m)}{N} \frac{N - n}{N - 1}$
Negative binomial: # successes (each with prob p) in Bernoulli trials before r failures (generalization of geom)
- $Pr [X = k] = (\binom{k + r - 1}{k}) {(1 - p)}^{r} p^{k}$
- $E [X] = \frac{p r}{1 - p}$
- $V a r [X] = \frac{p r}{{(1 - p)}^{2}}$
Poisson: # arrivals in sliver of time (infinite-granularity binomial) assuming mean λ arrival rate
- $Pr [X = k] = \frac{λ^{k}}{k!} e^{- λ}$
- $E [X] = λ$
- $V a r [X] = λ$
- Simple interesting proof from binomial
Normal/Gaussian: mean μ , standard deviation σ
- $f (x) = \frac{1}{\sqrt{2 π σ^{2}}} exp (- \frac{{(x - μ)}^{2}}{2 σ^{2}}) = \dots exp (- \frac{Z^{2}}{2})$
- Multivariate: $f (x) = \frac{1}{\sqrt{{(2 π)}^{k} | Σ |}} exp (- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ))$
- $E [X] = μ$
- $V a r [X] = σ^{2}$
- Empirical rule: z-scores of 1/2/3 span 68%/95%/99.7%
- Is its own Fourier transform
Beta: density shape over ( 0,1 )
- uniform dist is a beta dist
- params $a, b$ s.t. $b e t a [a, b] (θ) = α θ^{a - 1} {(1 - θ)}^{b - 1}$
- $E [X] = \frac{a}{a + b}$ : higher $a$ suggests $Θ$ closer to 1 than 0
- conjugate prior for Bernoulli/binomial dists
Exponential: time btwn Poisson process events
- $f (x) = \begin{cases} λ e^{- λ x}, & x \geq 0 \\ 0, & x < 0 \end{cases}$ ; $Pr [X < x] = \begin{cases} 1 - e^{- λ x}, & x \geq 0 \\ 0, & x < 0 \end{cases}$
- $E [X] = \frac{1}{λ}$
- $V a r [X] = \frac{1}{λ^{2}}$
- memoryless: $Pr [X > s ∣ X > t] = Pr [X > s - t]$ / constant event rate $λ$ / constant hazard $λ$
Gamma: scale θ and shape k
- models waiting times: sum of $k$ indep exponentially distributed RVs, each with mean $θ$
- also the sample variance of normal data
- conjugate prior for many dists TODO
Student's t-distribution: a more spread-out normal distribution
- for when sample size is small, population SD unknown
- converges to normal as DF (corresponds to sample size) increases
Chi-square, chi distributions
- chi-square: sum of squares of $k$ normal RVs $\sum_{i} {(\frac{X_{i} - μ_{i}}{σ_{i}})}^{2}$
- chi: length of vector of $k$ normal components $\sqrt{\sum_{i} {(\frac{X_{i} - μ_{i}}{σ_{i}})}^{2}}$

9.2 Conjugate prior relationships

Likelihood $Pr [x ∣ θ]$	Conjugate prior $Pr [θ]$	Posterior $Pr [θ ∣ x]$
Gaussian	Gaussian	Gaussian
Binomial $(N, θ)$	Beta $(r, s)$	Beta $(r + n, s + N - n)$
Poisson $(θ)$	Gamma $(r, s)$	Gamma $(r + n, s + 1)$
Multinomial $(θ_{1}, \dots, θ_{k})$	Dirichlet $(α_{1}, \dots, α_{k})$	Dirichlet $(α_{1} + n_{1}, \dots, α_{k} + n_{k})$

9.3 General definitions and properties

Union bound aka Boole's inequality: $Pr [A \cup B] \leq Pr [A] + Pr [B]$
Bonferroni's inequality: $Pr [A \cap B] \geq Pr [A] + Pr [B] - 1$
$Pr [A ∣ B] > Pr [A] ⟺ Pr [B ∣ A] > Pr [B]$
Linearity: $E [a X + b Y] = a E [X] + b E [Y]$
$E [X | Y = y] = \sum_{x} x \cdot Pr [X = x | Y = y]$
Iterated: $E [E [X | Y]] = E [X]$
$V a r [X] = E [{(X - μ)}^{2}] = E [X^{2}] - {(E [X])}^{2}$
$V a r [a X + b] = V a r [a X] = a^{2} V a r [X]$
$V a r [a X + b Y] = a^{2} V a r [X] + b^{2} V a r [Y] + 2 a b C o v [X, Y]$
$V a r [X + Y] = V a r [X] + V a r [Y]$ if $X, Y$ indep/uncorrelated
$C o v [X, Y] = E [(X - E [X]) (Y - E [Y])] = E [X Y] - E [X] E [Y]$
Pearson's product-moment coefficient is a “normalized” covariance: $ρ_{X, Y} = \frac{C o v [X, Y]}{σ_{X} σ_{Y}} \in [- 1, 1]$
Law of total variance: $V a r [X] = E [V a r [X ∣ Y]] + V a r [E [X ∣ Y]]$ (unexplained and explained components)
coefficient of variation aka unitized risk aka variation coefficient: $c = \frac{σ}{| μ |}$
- normalized measure of dispersion of a distribution
signal to noise ratio (SNR): $\frac{μ}{σ}$
- reciprocal of coefficient of variation; only sensical for positive variables
Markov's inequality: $Pr [f (X) \geq t] \leq \frac{E [f (X)]}{t}$ , for any non-neg function $f$
- corollary: Chebyshev's inequality, $Pr [| X - E [X] | \geq a] \leq \frac{V a r [X]}{a^{2}}$
Hoeffding's inequality: upper bound on prob that sum of RVs deviates from expected value
- for Bernoullis (important special case): $Pr [\sum_{i} X_{i} \leq (p - ε) n] \leq exp (- 2 ε^{2} n)$ where $X_{i} \sim Bernoulli (p)$
Jensen's inequality: for any convex function $f$ (e.g. $f'' (x) \geq 0$ ), $E_{X} [f (X)] \geq f (E_{X} [X])$
Transformation of a discrete RV $X$ via a 1:1 function $f$ (say, $X^{2}$ , considering only $X \geq 0$ ) doesn't change the probability masses, only their locations, since each $x$ mass maps 1:1 to another $f (x)$ mass. (Unless not 1:1, where multiple masses can be folded into the same mass.) But transforming a continuous RV does change the density values, because it stretches/squeezes the density.
- https://math.stackexchange.com/questions/3650976/continuous-random-variable-transformations-vs-discrete

9.4 Betting

Kelly criterion
- problem: in even odds, fixed-fraction betting (win/lose full bet) where $p = Pr [win]$ , $q = Pr [loss] = 1 - p$ , and start with $V_{0}$ , how much to bet (fraction $ℓ$ )?
- betting everything ( $ℓ = 1$ ) maximizes expected outcome after some finite $N$ steps ( $E [V_{N}] = {(2 p)}^{N} V_{0}$ ), but this is skewed by the rare all-win streak $V_{N} = 2^{N} V_{0}$ , which has prob $p^{N}$ ; with $N \to \infty$ , will hit zero at some point (short- vs long-term); ${lim}_{N \to \infty} p^{N} = 0$ , so $V_{N} \to 0$ almost surely
- instead, how much to bet such that as $N \to \infty$ , will make more than any other bet?
- outcome after $W$ wins and $L$ losses: $V_{N} = {(1 + ℓ)}^{W} {(1 - ℓ)}^{L} V_{0}$
- maximize rate of growth $G = {lim}_{N \to \infty} \frac{1}{N} log \frac{V_{N}}{V_{0}}$ (log geometric mean ${(\frac{V_{N}}{V_{0}})}^{\frac{1}{N}}$ ) aka avg return per step over all time
- $G = {lim}_{N \to \infty} [\frac{W}{N} log (1 + ℓ) + \frac{L}{N} log (1 - ℓ)] = p log (1 + ℓ) + q log (1 - ℓ)$
- maximizing $G$ (set $\frac{d G}{d ℓ} = \frac{p}{1 + ℓ} - \frac{q}{1 - ℓ} = 0$ ) yields $ℓ = p - q = 2 p - 1$
- more generally for continuous non-binary returns: let $X_{i} \sim X$ be returns for year $i$ , then $V_{N} = V_{0} \prod_{i = 1}^{N} (1 + X_{i})$
  - $G = {lim}_{N \to \infty} \frac{1}{N} [\sum_{i = 1}^{N} log (1 + X_{i})] = E [log (1 + X)]$
  - setting to constant return ${(1 + r)}^{N}$ we get $r = exp (E [log (1 + X)]) - 1$
- more involved proof to show maximizing $G$ maximizes $V_{\infty}$ more than any other strategy (fixed-fraction or not)
- general formula for uneven odds $b$ -to-1: $ℓ = \frac{edge}{odds} = \frac{p - m p}{1 - m p} = \frac{expected net winnings}{net winnings if you win} = \frac{b p - q}{b}$ (TODO)
- http://www.lucent.com/bstj/vol35-1956/articles/bstj35-4-917.pdf
- http://www.stat.berkeley.edu/~aldous/157/Writeups/finance1.pdf
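
A numerical check of the even-odds Kelly fraction, as a sketch: it takes the long-run growth rate to be $p log (1 + ℓ) + q log (1 - ℓ)$ with $p = Pr [win]$ and compares a grid search against the closed form $ℓ = 2 p - 1$ . The win probability 0.6 is an arbitrary example.

```python
import math

def growth_rate(p, frac):
    """Long-run log growth per even-odds bet: p log(1 + l) + q log(1 - l)."""
    q = 1 - p
    return p * math.log(1 + frac) + q * math.log(1 - frac)

def kelly_fraction(p):
    """Closed-form maximizer for even odds, l = p - q = 2p - 1 (0 if no edge)."""
    return max(0.0, 2 * p - 1)

p = 0.6  # hypothetical win probability
grid = [i / 1000 for i in range(0, 1000)]        # fractions in [0, 0.999]
best = max(grid, key=lambda f: growth_rate(p, f))
print(best, kelly_fraction(p))                   # grid search agrees with 2p - 1
print(growth_rate(p, 0.2), growth_rate(p, 0.5))  # over-betting turns growth negative
```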

10 Algorithms

10.1 LSH

usually use shingling
minhash: for clustering sets by similarity
- $h_{min} (X) = {min}_{x \in X} h (x)$
- for two sets of numbers $A, B$ , $Pr [min (A) = min (B)] = J (A, B) = \frac{| A \cap B |}{| A \cup B |}$
- with k hash fns:
  - $Pr [⋀_{i = 1}^{k} h_{min}^{(i)} (A) = h_{min}^{(i)} (B)] = {(\frac{| A \cap B |}{| A \cup B |})}^{k}$ (low false positives)
  - $Pr [⋁_{i = 1}^{k} h_{min}^{(i)} (A) = h_{min}^{(i)} (B)] = 1 - {(1 - \frac{| A \cap B |}{| A \cup B |})}^{k}$ (low false negatives)
  - estimate $J (A, B)$ as ratio of matching hash fns
- with k smallest hashes:
  - $Pr [all match] = \frac{(\binom{| A \cap B |}{k})}{(\binom{| A \cup B |}{k})} \approx {(\frac{| A \cap B |}{| A \cup B |})}^{k}$ for $k ≪ | A \cap B |$
  - estimate $J (A, B)$ as ratio of matching hashes
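
A sketch of the $k$ -hash-function variant of minhash. The hash family here (SHA-256 with a seed prefix) is an illustrative stand-in for whatever family a real system would use, and the sets `A` and `B` are made up.

```python
import hashlib

def h(seed: int, item: str) -> int:
    """One member of a hash family, derived from SHA-256 (an illustrative choice)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, k=200):
    """k min-hashes, one per seeded hash function."""
    return [min(h(seed, x) for x in items) for seed in range(k)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

A = {f"w{i}" for i in range(0, 60)}
B = {f"w{i}" for i in range(30, 90)}
true_j = len(A & B) / len(A | B)   # 30 / 90
est_j = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(true_j, est_j)               # the estimate should land near 1/3
```

With $k$ hash functions the estimate's standard error is roughly $\sqrt{J (1 - J) / k}$ , so more hash functions trade time and space for accuracy.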
simhash: similar documents have low Hamming distance between their simhashes (Moses Charikar, STOC02)
- $V = [0] \times 64$ for 64-bit simhash
- for each item, if bit $i$ of $h (x)$ set, increment $V [i]$ , else decrement $V [i]$
- bit $i$ of simhash is 1 if $V [i] > 0$ else 0
- patented by Google
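
The simhash steps above can be sketched as below, with unweighted word features. The 64-bit hash is a truncated SHA-256, which is an assumption for illustration only, and the three example documents are invented.

```python
import hashlib

BITS = 64

def h64(item: str) -> int:
    """64-bit hash of one feature; SHA-256 truncation is an illustrative choice."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

def simhash(features) -> int:
    v = [0] * BITS
    for x in features:
        hx = h64(x)
        for i in range(BITS):
            # Each feature votes +1 or -1 on every bit position.
            v[i] += 1 if (hx >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog near the river bank".split()
doc2 = "the quick brown fox jumps over the lazy cat near the river bank".split()
doc3 = "stock markets fell sharply as investors weighed new inflation data".split()
print(hamming(simhash(doc1), simhash(doc2)))  # near-duplicates: typically few bits differ
print(hamming(simhash(doc1), simhash(doc3)))  # unrelated text: typically around half the bits
```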

11 Statistics

11.1 Basics

coefficient of variation (CV) aka unitized risk: $\frac{σ}{μ}$ (to scale/normalize SDs)

11.2 Moments

nth moment: $E [X^{n}]$
nth central moment (around the mean): $μ_{n} = E [{(X - μ)}^{n}]$
nth standardized aka normalized moment (normalized by deviation): $E [\frac{{(X - μ)}^{n}}{σ^{n}}]$
Mean is 1st raw moment: $E [X]$
Variance is 2nd central moment: $E [{(X - μ)}^{2}]$
Skewness is 3rd standardized moment: $E [{(\frac{X - μ}{σ})}^{3}]$
(Pearson's) kurtosis: fourth standardized moment: $E [{(\frac{X - μ}{σ})}^{4}] = \frac{μ_{4}}{σ^{4}}$
- measure of peakedness, but it's argued this really measures heavy tails

11.3 Measures

11.4 Fisherian tests TODO

z-test/z-statistic: approximations are OK when n>30
- one-sample: $z = \frac{\overline{x} - μ_{\overline{x}}}{σ_{\overline{x}}}, σ_{\overline{x}} = \frac{σ}{\sqrt{n}} \approx \frac{S}{\sqrt{n}}$
- two-sample: $z = \frac{(\overline{X} - \overline{Y}) - (μ_{\overline{X} - \overline{Y}} = 0)}{σ_{\overline{X} - \overline{Y}} = \sqrt{\frac{σ_{X}^{2}}{n} + \frac{σ_{Y}^{2}}{m}} \approx \sqrt{\frac{S_{X}^{2}}{n} + \frac{S_{Y}^{2}}{m}}}$ (just z test over the difference of independent variables, thus variance is sum)
t-test/t-statistic: same as z-test but refer to student's t-distribution; use when n ≤ 30 and don't know the population variance (usually true)
- one-sample: $t = \frac{\overline{x} - μ_{\overline{x}}}{S / \sqrt{n}}$ (same quantity as in z-test), DOF is $n - 1$
- two-sample: $t = \frac{\overline{X} - \overline{Y} - (μ_{\overline{X} - \overline{Y}} = 0)}{S_{\overline{X} - \overline{Y}} = \sqrt{\frac{S_{X}^{2}}{n} + \frac{S_{Y}^{2}}{m}}}$
chi-square test: $X^{2} = \sum_{i j} \frac{{(O_{i j} - E_{i j})}^{2}}{E_{i j}} = \sum_{i} {(\frac{X_{i} - μ_{i}}{σ_{i}})}^{2}$ , where $i j$ enumerate cells in contingency table
- $E_{i j}$ are usually computed from marginal (global) frequencies of table
- for multinomials?
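- DOF is $n - 1$ where $n$ is number of cells (goodness of fit); for an $r \times c$ contingency table testing independence, DOF is $(r - 1) (c - 1)$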
- compare freqs of a sample against theoretical dist
- asymptotically approaches $χ^{2}$ distribution
- uses normal approximation of multinomial distribution
multinomial test
Fisher's exact test

11.5 Neyman-Pearson tests

Wald test: reject iff $2 \cdot \frac{| \hat{θ} - θ_{0} |}{σ} > z_{α}$
- depends on representation of $θ$ (eg log-scaling); LR works with any monotonic transformation
- uses 2 approximations (know standard error, dist is $χ^{2}$ ); LR only assumes dist is $χ^{2}$ (if using that test)
- only deals with scalars; LR can deal with vector params
likelihood ratio test: for comparing fit of 2 models/hyps, where null is special case of the alternative
- model with more params will be better fit, but is it significantly better?
- likelihood ratio test statistic for simple hyps $θ_{0}, θ_{1}$ : $Λ (x) = \frac{L (θ_{0} ∣ x)}{L (θ_{1} ∣ x)}$
- log-likelihood ratio test statistic: $D = - 2 ln Λ (x) = - 2 ln (\frac{L (θ_{0} ∣ x)}{L (θ_{1} ∣ x)}) = - 2 ln (L (θ_{0} ∣ x)) + 2 ln (L (θ_{1} ∣ x))$
  - approx. distributed as $χ^{2}$ with dof $d f_{1} - d f_{0}$
  - G-test is more accurate than $χ^{2}$
- for composite hyps: $Λ (x) = \frac{sup {L (θ ∣ x) : θ \in Θ_{0}}}{sup {L (θ ∣ x) : θ \in Θ}}$ (compare MLEs)
- Z-test, F-test, chi-square, G-test are tests for nested models and can be phrased as (approximations of) log-likelihood ratios
F-test: compare models; ANOVA; TODO
G-test: for log-likelihood ratio tests; $G = 2 \sum_{i j} O_{i j} ln (\frac{O_{i j}}{E_{i j}})$ TODO
- chi-square tests are approximations; useful before computers
- G-test much better where for any contingency cell $| O_{i} - E_{i} | > E_{i}$
- for small samples, use multinomial test, Fisher's exact test, or even Bayesian hyp selection
Bayes factors: vs likelihood ratio tests

11.6 Hypothesis testing

effect size: basically, how big a difference; fluffy notion/many formulations
- absolute: e.g., “difference between groups is 30lb”
- standardized difference of means: e.g., $\frac{\overline{X} - \overline{Y}}{S}$ , where $S$ is SD of either or both groups
problem: can get significance (low $p$ -value) with big effect and small sample or big sample and small effect
- hence, report effect/sample size with $p$ -value
power analysis: prob of rejecting false $H_{0}$ (not making a type II error, false neg)
- power aka sensitivity is $1 - β$ where $β$ is false negative rate
- can use to find min sample size to likely detect given effect size, or min effect size likely detected by given sample size
- e.g., more powerful experiments may have more subjects

11.7 Unsorted

kalman filter: TODO understand
- untrusted predictions, untrusted measurements; combine them
- optimal estimate $\hat{y} = \text{prediction} + (\text{Kalman gain}) (\text{measurement} - \text{prediction})$
- predict using previous data, measure, fuse/correct prediction and measurement
- usually no control signals ${\vec{u}}_{k}$
- http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html
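
A scalar sketch of the predict/correct loop, assuming the simplest possible model: a constant hidden state with no control input. The gain formula `pred_var / (pred_var + meas_var)` is the standard scalar Kalman gain, not something spelled out in these notes, and the readings are invented.

```python
def kalman_step(estimate, variance, measurement, meas_var, process_var=0.0):
    """One predict/correct cycle for a constant scalar state (no control input)."""
    # Predict: the state model is x_t = x_{t-1} + noise, so only uncertainty grows.
    pred = estimate
    pred_var = variance + process_var
    # Correct: the gain weighs prediction uncertainty against measurement noise.
    gain = pred_var / (pred_var + meas_var)
    new_estimate = pred + gain * (measurement - pred)
    new_variance = (1 - gain) * pred_var
    return new_estimate, new_variance

estimate, variance = 0.0, 1000.0       # vague initial guess
for z in [5.1, 4.9, 5.2, 4.8, 5.0]:   # hypothetical noisy readings of a value near 5
    estimate, variance = kalman_step(estimate, variance, z, meas_var=0.1)
print(estimate, variance)              # estimate moves close to 5, variance shrinks
```

A large gain (uncertain prediction) trusts the measurement; a small gain (noisy measurement) trusts the prediction.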
Mahalanobis distance: similarity of a sample to a distribution
- $D_{M} = \sqrt{{(x - μ)}^{T} S^{- 1} (x - μ)}$ where $x$ is new sample, $μ$ is dist mean, $S$ is covar matrix
- same as normalized Euclidean dist if $S$ is diagonal: $D_{M} = \sqrt{\sum_{i} \frac{{(x_{i} - μ_{i})}^{2}}{s_{i}^{2}}}$ (plain Euclidean dist if $S = I$ )

11.8 Smoothing

additive smoothing aka Laplace smoothing: ${\hat{θ}}_{i} = \frac{n_{i} + k}{n + d k}, i = 1, \dots, d$
- called “rule of succession” for $k = 1$
Good-Turing estimation: complex

11.9 Estimation

90% confidence interval: in repeated samplings, the computed intervals $I (X)$ would contain the true param 90% of the time (0.1 miss rate); $Pr [θ \in I (X)] = .9$ ( $θ$ is const, $I (X)$ is RV)
90% credible interval $C (X)$ : $Pr [θ \in C (X)] = .9$ ( $θ, C (X)$ are RV); “Bayesian confidence interval”
statistic: any function of data $δ (X)$
estimator: any statistic used to estimate an (unknown) param $θ$ ; usu denoted $\hat{θ} = δ (X)$
bias: $E [\hat{θ} - θ] = E [\hat{θ}] - θ$ (unbiased if 0 for every $n$ ; asymptotically unbiased if it goes to 0 as $n \to \infty$ )
variance: $V a r [\hat{θ} - θ] = V a r [\hat{θ}]$
unbiased estimator: estimates average out to the true param over repeated sampling
- e.g.: sample variance
  - unbiased sample variance ( $E [s^{2}] = σ^{2}$ ) is $s^{2} = \frac{1}{n - 1} \sum_{i} {(X_{i} - \overline{X})}^{2}$ (Bessel's correction)
  - biased is $s_{n}^{2} = \frac{1}{n} \sum_{i} {(X_{i} - \overline{X})}^{2}$
  - note: $s$ is not unbiased for SD
- can be terrible; they may average to true value but individual estimates may be ridiculous
  - e.g. for Poisson $X$ , the unbiased estimator $δ (X = x)$ of the quantity $Pr {[X = 0]}^{2} = e^{- 2 λ} = E [δ (X)]$ is ${(- 1)}^{x}$ , which is nonsense
  - MLE is $e^{- 2 x}$ , which is always positive and has smaller MSE
  - besides bias, look also at efficiency, the MSE of individual estimates
consistent: ${lim}_{n \to \infty} δ (X_{n}) = θ$
- bias & variance must go to 0
- biased but consistent estimate of mean: $\frac{1}{n} \sum_{i} x_{i} + \frac{1}{n}$
- unbiased but inconsistent estimate of mean: $x_{1}$
maximum likelihood estimation (MLE): ${arg max}_{θ} Pr [X ∣ θ]$
- find the peak in the likelihood function (pdf, but function of parameters rather than data; over params, it is not a probability distribution that sums to 1)
- example: estimate uniform distribution range $[0, d]$ given some samples $X$ . What would maximize likelihood of seeing those samples is the smallest possible range, thus $\hat{d} = max X$ .
- Fisher popularized this by showing that typically MLE is unbiased, consistent, & asymptotically the lowest variance estimator
- least squares: in OLS, if errors are normal, then least squares estimate is MLE
- now clear that MLE sometimes bad in practice, and finite sample behavior not the same as asymptotic behavior, leading to Bayesian strategies
- disregards the uncertainty (“spread” in the likelihood function, i.e. Fisher information)
- property: $f (\hat{θ})$ is MLE of $f (θ)$ for any $f$
- minimizes KL divergence
maximum a posteriori (MAP): ${arg max}_{θ} (Pr [θ ∣ X] \propto Pr [X ∣ θ] Pr [θ])$
- the Bayesian approach
- since we have a (posterior) prob dist over $θ$ , can extract whatever we want: mean, median, mode, intervals
Estimation: intervals, consistency, bias, MLE, MAP
- choose the single most probable $θ$ given $X$ (mode of posterior)
- note: prediction using MAP is approx to Bayes prediction using all $θ$ and their probabilities
- equiv: ${arg min}_{θ} (- log Pr [θ ∣ X] = - log Pr [X ∣ θ] - log Pr [θ] + const)$
  - $- log Pr [θ]$ is bits to describe $θ$ , $- log Pr [X ∣ θ]$ is add'l bits to describe data
  - hence, MAP chooses $θ$ that provides max compression, or MDL
Bayes estimation: given loss $L (θ, δ)$ (eg sq err), minimize Bayes risk $E [L (θ, δ)]$
https://github.com/johnmyleswhite/JAGSExamples/blob/master/slides/

11.10 Exponential smoothing for time series analysis

exponential moving average (EMA) aka exponentially weighted moving average (EWMA) aka single exponential smoothing (SES): $S_{t} = (1 - α) \cdot S_{t - 1} + α \cdot y_{t - 1}$ , $S_{2} = y_{1}$
- Does not spot trends
double exponential moving average (DEMA) aka linear exponential smoothing (LES) $\begin{matrix} S_{t} & = & α y_{t} + (1 - α) (S_{t - 1} + b_{t - 1}), & 0 \leq α \leq 1 \\ b_{t} & = & γ (S_{t} - S_{t - 1}) + (1 - γ) b_{t - 1}, & 0 \leq γ \leq 1 \end{matrix}$
triple exponential moving average (TEMA) aka triple exponential smoothing (TES)
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
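
A sketch of single exponential smoothing, using the unlagged recurrence `s[t] = alpha * y[t] + (1 - alpha) * s[t-1]` (the same shape as the level update in double smoothing). Initializing with the first observation is an assumption, one common convention among several, and the series is invented.

```python
def exponential_smoothing(ys, alpha):
    """Single exponential smoothing: s[t] = alpha * y[t] + (1 - alpha) * s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [ys[0]]                 # assumed initialization: first observation
    for y in ys[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 14.0, 16.0, 18.0]    # hypothetical steady upward trend
print(exponential_smoothing(series, 0.5))  # lags further and further behind the trend
print(exponential_smoothing(series, 1.0))  # alpha = 1 reproduces the series
```

The growing lag on a trending series is why single smoothing "does not spot trends" and why the double version adds a slope term.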

11.11 Applied

recency, frequency, monetary value (RFM): simple customer behavior analysis
- segment customers along these 3 axes into discrete bins
- identify highest-value intersections of bins

11.12 Time series

autoregressive model: $X_{t} = c + \sum_{i = 1}^{p} ϕ_{i} X_{t - i} + ϵ_{t}$
moving avg model: $X_{t} = μ + ϵ_{t} + \sum_{i = 1}^{q} θ_{i} ϵ_{t - i}$
autoregressive moving avg (ARMA) aka Box-Jenkins model: $X_{t} = μ + a_{1} X_{t - 1} + \dots + a_{p} X_{t - p} + ϵ_{t} + b_{1} ϵ_{t - 1} + \dots + b_{q} ϵ_{t - q}$ where $E [ϵ_{t} ϵ_{s}] = 0 \forall t \neq s$ and $ϵ_{t} \sim N (0, σ^{2})$
- one of most common univariate time series models
- Kalman filter can calculate exact log-likelihood, but “conditional” likelihood is easier/more commonly used
vector autoregression (VAR) model: $X_{t} = A_{1} X_{t - 1} + \dots + A_{p} X_{t - p} + ϵ_{t}$ where each $X_{i}$ is a vector, $A_{i}$ is a matrix, and $ϵ_{t} \sim N (0, Σ)$
- multivariate time series generalization of univariate AR models
- widely used, esp. in macroecon
http://www.slideshare.net/wesm/scipy-2011-time-series-analysis-in-python

12 Time Series

12.1 Processes

second-order stationary process: $μ, σ^{2}$ are time-indep
hazard rate at time $t$ : event rate at $t$ conditional on survival until $t$ ; eg bathtub curve; eg constant (in exponential dists)

12.2 Autocorrelation

autocorrelation function (ACF): for second-order stationary process, $R (τ) = \frac{E [(X_{t} - μ) (X_{t + τ} - μ)]}{σ^{2}}$ ( $τ$ is lag)
- when normalized by mean and variance, called autocorrelation coefficient
partial ACF (PACF): TODO

12.3 Models

Order- $q$ moving avg model MA( $q$ ): $X_{t} = μ + ϵ_{t} + \sum_{i = 1}^{q} θ_{i} ϵ_{t - i}$
Autoregressive model: $X_{t} = c + \sum_{i = 1}^{p} ϕ_{i} X_{t - i} + ϵ_{t}$
Autoregressive moving avg (ARMA) aka Box-Jenkins model: $X_{t} = c + ϵ_{t} + \sum_{i = 1}^{p} ϕ_{i} X_{t - i} + \sum_{i = 1}^{q} θ_{i} ϵ_{t - i}$
Autoregressive integrated moving avg (ARIMA): TODO

13 Linear Algebra

13.1 Some references

https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

13.2 Vectors

$a \cdot b = ∥ a ∥ ∥ b ∥ cos θ$ . Larger when more parallel, thus measures “similarity.”
$∥ a \times b ∥ = ∥ a ∥ ∥ b ∥ sin θ$ = area of parallelogram with sides $a, b$ (direction given by right hand rule). Larger when less parallel, thus measures “orthogonality.”
Cauchy-Schwarz inequality: $| a \cdot b | \leq ∥ a ∥ ∥ b ∥$
Triangle inequality: $∥ a + b ∥ \leq ∥ a ∥ + ∥ b ∥$
Tensor product aka outer product $\vec{a} \otimes \vec{b} = \vec{a} {\vec{b}}^{T}$ is $| \vec{a} | \times | \vec{b} |$ matrix

13.3 Matrices

Elementary matrix: matrix that differs from $I$ by one elementary row op
Think of matrices as transformations. Its columns are where unit vectors get mapped.
Transposes
- ${(A B)}^{T} = B^{T} A^{T}$
Symmetry
- Comes up frequently. Hessians, covariance matrices, kernel matrices, etc.
- $A$ is symmetric iff $A^{T} = A$
- $A$ is anti-symmetric iff $A^{T} = - A$
- Besides writing that $A : R^{n \times n}$ , can more specifically say it's a symmetric matrix with notation $S^{n}$
- For any matrix $A$ , $A + A^{T}$ is symmetric and $A - A^{T}$ is anti-symmetric (easy)
- Any matrix $A$ is sum of symmetric and anti-symmetric $A = \frac{1}{2} (A + A^{T}) + \frac{1}{2} (A - A^{T})$
Let $A$ be $n \times n$ over field $K$ (e.g. $R$ ). Then these are equiv:
- $A$ is nonsingular aka nondegenerate
- $det A \neq 0$
- $A$ is invertible: $A^{- 1} A = I = A A^{- 1}$
  - ${(A B)}^{- 1} = B^{- 1} A^{- 1}$
  - ${(k A)}^{- 1} = k^{- 1} A^{- 1}$ for $k \neq 0$
  - $det (A^{- 1}) = det {(A)}^{- 1}$
  - ${(A^{- 1})}^{T} = {(A^{T})}^{- 1}$
  - The linear transformation mapping $x$ to $A x$ is a bijection from $K^{n}$ to $K^{n}$
  - $N (A) = {0}$ : only the trivial nullspace containing just the zero vector
- A as system of equations has single solution
  - $A x = 0$ has only one trivial sol $x = 0$ (i.e. $N (A) = {0}$ )
  - $A x = b$ has exactly one sol for each $b \in K^{n}$ ( $x \neq 0$ when $b \neq 0$ )
- $rank A = n$ , i.e. full rank
  - $A$ is row-equiv and col-equiv to $I$ (can be changed to reduced row echelon form $I$ using elementary row ops)
  - $A$ has $n$ pivot positions, by the above
  - rows/cols of $A$ are lin indep
  - rows/cols of $A$ span $K^{n}$ (i.e. $C o l s (A) = K^{n}$ )
  - rows/cols of $A$ form a basis of $K^{n}$
- $A$ can be expressed as product of finitely many elementary matrices
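- 0 is not an eigenvalue of $A$ , i.e. every eigenvalue is nonzero
  - In particular, a definite matrix (positive definite or negative definite) is invertible; the converse does not hold, e.g. $diag (1, - 1)$ is invertible but indefinite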
Fundamental theorem of linear algebra
- Relates the 4 subspaces:
- $dim C (A) = dim C (A^{T}), dim C (A) + dim N (A) = n$
- $N (A) ⊥ C (A^{T})$
- Orthonormal bases exist for both $C (A), C (A^{T})$
- $A$ is diagonal with respect to the orthonormal bases (TODO)
Trace $t r A = \sum_{i = 1}^{n} A_{i i}$
- For any $A, B$ where $A B$ is square, $t r A B = t r B A$ (easy)
- For any $A, B, C$ where $A B C$ is square, trace is invariant under cyclic permutations only: $t r A B C = t r B C A = t r C A B$ (in general $\neq t r A C B$ )
Basics
- ${(A B)}^{T} = B^{T} A^{T}$
Norms
- $x^{T} x = x \cdot x = \sum x_{i}^{2}$
- $ℓ^{2}$ norm: ${∥ x ∥}_{2} = \sqrt{\sum x_{i}^{2}}$
- $ℓ^{1}$ norm: ${∥ x ∥}_{1} = \sum_{i} | x_{i} |$
- $ℓ^{p}$ norm: ${∥ x ∥}_{p} = {(\sum_{i} {| x_{i} |}^{p})}^{\frac{1}{p}}$ (note the abs.!)
- $ℓ^{\infty}$ norm: ${∥ x ∥}_{\infty} = {max}_{i} | x_{i} |$ (just find the max element)
- $x$ is normalized iff ${∥ x ∥}_{2} = 1$ . Normalize with $\frac{x}{{∥ x ∥}_{2}}$
- Frobenius norm: simply ${∥ A ∥}_{F} = \sqrt{\sum_{i j} A_{i j}^{2}}$
  - Useful for a “squared error” loss over a matrix, e.g. for low rank approximation, want to find a $\hat{A}$ that minimizes $∥ A - \hat{A} ∥$
Linear independence and rank
- Set of vectors is linearly independent if no vector can be represented as linear combination of others
- Column rank of $A$ is size of largest subset of columns of $A$ that is LI. Similar for row rank. Both are actually the same(!), so just rank of $A$ .
- Rank is the number of pivots in row echelon form
- $rank (A) = rank (A^{T})$
- $rank (A B) \leq min (rank (A), rank (B))$
- $rank (A + B) \leq rank (A) + rank (B)$ (equals if vectors are all independent)
- Full (column) rank matrices are transforms that preserve all dimensions. Lower rank matrices lose dimensions, mapping (say) a 3D shape to a 2D/1D/0D space. Hence they are not invertible.
  - Hence a 3x2 matrix can still have full column rank (taking 2D points into a plane in 3D space without collapsing anything), but a 2x3 matrix that takes 3D points to a plane always loses a dimension, even at its maximum rank of 2.
Inverse: $A^{- 1} A = I = A A^{- 1}$
- ${(A B)}^{- 1} = B^{- 1} A^{- 1}$
- ${(k A)}^{- 1} = k^{- 1} A^{- 1}$ for $k \neq 0$
- $det (A^{- 1}) = det {(A)}^{- 1}$
- ${(A^{- 1})}^{T} = {(A^{T})}^{- 1}$
Orthogonality
- Vectors $x, y$ are orthogonal iff $x^{T} y = 0$
- [Unfortunate naming] Matrix $A$ is orthogonal iff its columns are orthogonal and normalized (orthonormal).
- If $U$ is not square — i.e., $U \in R^{m \times n}, n < m$ — but its columns are still orthonormal, then $U^{T} U = I$ but $U U^{T} \neq I$
- Intuition: orthonormal transforms are rotations (possibly combined with a reflection) with no stretching/squishing; they preserve dot products
- Properties
  - If $A$ is orthogonal then $A^{T} = A^{- 1}$ , i.e. $A^{T} A = I = A A^{T}$
  - $det A = \pm 1$
  - ${∥ A x ∥}_{2} = {∥ x ∥}_{2}$ , i.e. no change to norm
Range and nullspace
- Span: $span ({x_{1}, \dots, x_{n}}) = {v : v = \sum_{i} a_{i} x_{i}, a_{i} \in R}$
  - If the $n$ vectors $x_{i} \in R^{n}$ are LI, then $span ({x_{i}}) = R^{n}$
- Range of $A$ is span of its column vectors: $R (A) = {v \in R^{m} : v = A x, x \in R^{n}}$
- Projection of vector onto range of $A$ : $Proj (y; A) = arg {min}_{v \in R (A)} {∥ v - y ∥}_{2} = A {(A^{T} A)}^{- 1} A^{T} y$ (assuming $A$ has LI columns, so that $A^{T} A$ is invertible)
  - Project $y$ onto vector $v$ : $\frac{v v^{T}}{v^{T} v} y = \frac{v}{∥ v ∥} {\frac{v}{∥ v ∥}}^{T} y = \tilde{v} {\tilde{v}}^{T} y$ where $\tilde{v}$ is the normalized $v$
  - Project $u$ onto $v$ : $\frac{v v^{T}}{v^{T} v} u = \frac{v^{T} u}{v^{T} v} v$
  - E.g. find matrix $P$ that projects any vector onto line in direction of $(2, 1, 3)$ : $A = [\begin{matrix} 2 \\ 1 \\ 3 \end{matrix}], P = A {(A^{T} A)}^{- 1} A^{T}$
  - Projection matrix eigenvalues are always 0 or 1
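- Nullspace or kernel is set of vectors that are zero when multiplied by $A$ : $N (A) = {x \in R^{n} : A x = 0}$ . Always contains $\vec{0}$ ; nontrivial iff the columns of $A$ are linearly dependent (for square $A$ , iff determinant is 0 / non-full-rank)!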
  - $N (A) = {C (A^{T})}^{⊥}$ : orthogonal complement to the row space
  - $N (A^{T}) = {C (A)}^{⊥}$ : left null space is orthogonal complement to the column space
  - Calculating it means solving $A x = 0$ (e.g. via RREF); for square $A$ , nonzero solutions exist exactly when $det A = 0$
  - [TRASH] Assuming $A \vec{x} = \vec{b}$ has solution, solution set will be $P r o j (\vec{b}; A) + N (A)$ . For unique solution, nullspace is just $\vec{0}$ .
- Basis of a space is minimal spanning set for that space
- Full rank: only $\vec{0}$ maps to $\vec{0}$ , but lower rank squishes many things to $\vec{0}$ . E.g. starting from a plane, a rank-1 transform will map a whole line of vectors to $\vec{0}$ (for symmetric $A$ , the line normal to the resulting spanned line). E.g. starting from a volume, a rank-2 transform will map a whole line of vectors to $\vec{0}$ (for symmetric $A$ , the normal line to the resulting spanned plane). E.g. starting from a volume, a rank-1 transform will map a whole plane to $\vec{0}$ .
- Dot product with a vector $v$ is a rank-1 transform that maps other vectors to a number line. Where it maps unit vectors to is its elements $v_{i}$ .
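A minimal NumPy sketch of the projection example above (the line in direction $(2, 1, 3)$ ). The test vector `y` is an arbitrary choice for illustration, not from the notes.

```python
import numpy as np

# Column vector spanning the line, taken from the example above.
A = np.array([[2.0], [1.0], [3.0]])

# P = A (A^T A)^{-1} A^T projects any vector onto the column space of A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

y = np.array([1.0, 0.0, 0.0])   # arbitrary test vector (an assumption)
proj = P @ y                    # closest point to y on the line
residual = y - proj             # orthogonal to the line, i.e. in N(A^T)

# P is symmetric, so eigvalsh applies; eigenvalues should be 0 or 1.
eigvals = np.sort(np.linalg.eigvalsh(P))
```

Projecting twice changes nothing ( $P P = P$ ), which is why the eigenvalues can only be 0 or 1.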
Determinant
- Sometimes written $det A = | A |$
- Absolute value of determinant is the volume of the set $S$ formed by all linear combinations of the row vectors $a_{i}$ where coeffs are between 0 and 1 (cf. cross product): $S = {v \in R^{n} : v = \sum_{i = 1}^{n} α_{i} a_{i}, 0 \leq α_{i} \leq 1, i = 1, \dots, n}$
  - For 2D matrices, $S$ is parallelogram. In 3D, “parallelepiped”
- Intuition: determinant of a transform is how much the unit square area/volume/etc. grows (can be negative if flipping)
  - Hence lower rank matrices have determinant 0, since volume flattens into area, area flattens into length, etc.
  - Hence determinant 0 matrices have a nontrivial nullspace (more than just $\vec{0}$ )
  - Hence determinant of matrix product is product of determinants
  - Hence is also the $\frac{\text{volume of output shape}}{\text{volume of input shape}}$
- $det I = 1$
- $(det A) (det B) = det A B$
- $det A^{- 1} = {(det A)}^{- 1}$
- Scaling a single row of $A$ by $t$ yields determinant of $t det A$
- Swapping two rows of $A$ yields determinant of $- det A$
- Determinant is 0 iff $A$ is singular / non-invertible / non-full-rank / linearly dependent. I.e. subspace is “flat sheet” and thus no volume.
Quadratic form
- Scalar $x^{T} A x = \sum_{i} x_{i} {(A x)}_{i} = \sum_{i} x_{i} (\sum_{j} A_{i j} x_{j}) = \sum_{i} \sum_{j} A_{i j} x_{i} x_{j}$
- $x^{T} A x = {(x^{T} A x)}^{T} = x^{T} A^{T} x = x^{T} (\frac{1}{2} A + \frac{1}{2} A^{T}) x$
- Only the symmetric part of $A$ contributes to quadratic form, so just assume that it's symmetric
- Definiteness
  - Symmetric matrix $A$ is positive definite if for all non-zero vectors $x \in R^{n}$ , $x^{T} A x > 0$ . Denoted $A ≻ 0$
  - Similarly, positive semidefinite: $A ⪰ 0$ if $x^{T} A x \geq 0$
  - Similar for negative definite, negative semidefinite, indefinite
  - $A ≻ 0 ⟺ - A ≺ 0$ , so on
- Positive definite means all eigenvalues are positive, etc.
  - Note: as an example, rotation has no real eigenvectors but is positive definite if $- \frac{π}{2} < θ < \frac{π}{2}$
- Intuitions
  - It's inner product (angle) between input and output of transformation: $x^{T} (A x)$
  - Eigenvectors are always scaled by their eigenvalues, which may be positive, negative, or zero
    - If positive, then still in the same direction. That and nearby inputs would have <90deg angle with their outputs, i.e. $x^{T} A x > 0$ and positive eigenvalue.
    - If negative, then flipped. That and nearby inputs would have >90deg angle with their outputs, i.e. $x^{T} A x < 0$ and negative eigenvalue
    - If zeroed, then something is getting flattened to lower dimension?
    - What about indefinite?
- Gram matrix $G = A^{T} A$ is always positive semidefinite for any $A \in R^{m \times n}$ , and positive definite iff $A$ has LI columns (full column rank)
- Useful for quadratic approximation of multivariable function in vector form
Eigenvalues/eigenvectors
- Given square matrix $A$ , say that $λ$ is an eigenvalue of $A$ and $x \in C^{n}$ is a corresponding eigenvector if $A x = λ x, x \neq 0$
- Multiplying by $A$ results in new vector pointing in same direction but scaled by $λ$
- “The” eigenvector for some $λ$ usu. means a unit vector (but can still be either $x$ or $- x$ )
- To find:
  - Get values by solving $det (λ I - A) = 0$ (the det is the characteristic polynomial). E.g. if 2x2 you'll get 2 solutions to a quadratic.
  - Get vectors by solving $(λ I - A) x = 0, x \neq 0$ , for each of the $λ$
- Properties
  - $t r A = \sum_{i} λ_{i}$
  - $det A = \prod_{i} λ_{i}$ . Since determinant is how much things are scaled.
  - Eigenvalues of $A^{k}$ are $λ_{1}^{k}, \dots$
  - Eigenvalues of $A^{- 1}$ are $λ_{1}^{- 1}, \dots$
  - Matrix is singular iff any $λ = 0$
  - Eigenvalues of diag are the diag entries
- Examples/cases, considering 2D
  - Non-uniform scaling: 2 eigenvectors
  - Pure shear: 1 eigenvector
  - Horizontal shear and vertical scale: 2 eigenvectors, unlike pure shear
  - Rotation: 0 eigenvectors
  - Uniform scaling: infinite eigenvectors
  - 180 degree rotation: infinite eigenvectors
    - In 3D: rotation can still leave eigenvector if only rotating around that axis
- Spectrum of matrix is set of its eigenvalues
- Spectral theorem: matrix has all real eigenvalues & real orthonormal eigenvectors iff it's real symmetric (see also framing within SVD)
  - Sum of eigenvalues is the trace
- For a symmetric transform, the eigenvectors are the principal axes of the resulting shape from transforming unit sphere. This is because the points along that vector will not change direction, only scale (by eigenvalue). (For a general transform, the principal axes come from the SVD instead.)
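A quick numerical check of the eigenvalue properties listed above (trace, determinant, powers, inverse), sketched with NumPy. The matrix is an arbitrary symmetric example chosen so that the eigenvalues are real.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # arbitrary symmetric example (an assumption)

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

trace_matches = np.isclose(np.trace(A), eigvals.sum())        # tr A = sum of eigenvalues
det_matches = np.isclose(np.linalg.det(A), eigvals.prod())    # det A = product of eigenvalues
inv_eigvals = np.linalg.eigvals(np.linalg.inv(A))             # should be 1 / eigvals
cubed_eigvals = np.linalg.eigvals(np.linalg.matrix_power(A, 3))  # should be eigvals ** 3
```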

13.4 Some algorithms

Gaussian elim: algo for solving system of equations
- phase 1, forward elim: use elementary row ops to get row echelon form (REF), where pivots move to right as you move down
- phase 2, back substitution: solve for unknowns using simplified row echelon form
Gauss-Jordan elimination: extension of Gaussian elimination to get reduced row echelon form (RREF), where each pivot is 1 and is the only nonzero entry in its column, so the solution can be read off without back-substitution
Reduced row echelon form
- $[\begin{matrix} 1 & 2 & 0 & 3 \\ 0 & 0 & 1 & - 2 \\ 0 & 0 & 0 & 0 \end{matrix}]$ (read as the coefficient matrix of $A x = 0$ ) with pivot variables $x_{1}, x_{3}$ and free vars $x_{2}, x_{4}$ has inf solutions
- $[\begin{matrix} 1 & 2 & 0 & 3 \\ 0 & 0 & 1 & - 2 \\ 0 & 0 & 0 & - 4 \end{matrix}]$ (read as an augmented matrix $[A | b]$ ) has no solutions (last row says $0 = - 4$ )
- Identity means there's a unique solution
Inverting matrix: either by Gauss-Jordan elimination, or by minors/cofactors/adjugate
- To invert $A$ with elimination: $[A I] \Rightarrow [I A^{- 1}]$
To convert some vectors into orthogonal vectors: for each vector $v$ , compute $v - P r o j (v; V)$ as a new orthogonal vector, where $V$ are the already-computed orthogonal vectors. (Take the first one just as-is.)
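The orthogonalization procedure above (Gram-Schmidt) can be sketched as follows. This version also normalizes each vector, which the description above does not require, and assumes the inputs are linearly independent. The function name and example vectors are illustrative.

```python
import numpy as np

def gram_schmidt(vectors):
    """Return orthonormal rows spanning the same space as the LI input rows."""
    basis = []
    for v in vectors:
        w = v.astype(float)            # copy so the input is not modified
        for q in basis:
            w = w - (q @ w) * q        # subtract the projection onto q
        basis.append(w / np.linalg.norm(w))  # extra step: normalize
    return np.array(basis)

vectors = np.array([[1, 1, 0],
                    [1, 0, 1],
                    [0, 1, 1]])
Q = gram_schmidt(vectors)
```

The first vector is kept as-is apart from scaling, and `Q @ Q.T` should be (numerically) the identity.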

13.5 Matrix calculus

Function	Example	1st derivative	2nd derivative
$f : R \to R$	$x^{2}$	$R$	$R$
$f : R^{n} \to R$	Loss function	Gradient $\nabla f : R^{n}$	Hessian $\nabla^{2} f : S^{n}$ (symmetric $R^{n \times n}$ )
$f : R^{d} \to R^{p}$	NN linear layer	Jacobian $J f : R^{p \times d}$	$R^{p \times d \times d}$ (higher order tensor)

(Assume looking at zero gradients) Positive definite Hessian means bowl-shaped (same as 1D case, except now infinitely many directions for $x^{T} A x$ ), positive semidefinite means “valley,” indefinite means saddle points
Scalar field: any $f : R^{n} \to R$
Vector field: any $f : R^{n} \to R^{m}$
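Gradient of $f : R^{n} \to R$ : $\nabla f = [\begin{matrix} \frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \\ ⋮ \\ \frac{\partial f}{\partial x_{n}} \end{matrix}]$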
- Gradient points in direction of steepest ascent, trivial due to dot product: at point $\vec{x}$ , $\frac{\nabla f (\vec{x})}{∥ \nabla f (\vec{x}) ∥} = arg {max}_{∥ \vec{v} ∥ = 1} (\nabla f (\vec{x}) \cdot \vec{v})$
Jacobian of $f : R^{n} \to R^{m}$ : $J f = [\begin{matrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \dots & \frac{\partial f_{2}}{\partial x_{n}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial f_{m}}{\partial x_{1}} & \frac{\partial f_{m}}{\partial x_{2}} & \dots & \frac{\partial f_{m}}{\partial x_{n}} \end{matrix}]$
Hessian $\nabla^{2} f = [\begin{matrix} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \dots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \dots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \dots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{matrix}]$ is Jacobian of gradient. Symmetric whenever the second partial derivatives are continuous.
Product rule:

\begin{matrix} \nabla_{x} x^{T} A x & = (\nabla x^{T}) A x + x^{T} (\nabla A x) \\  & = A x + A^{T} x \end{matrix}
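A finite-difference sanity check of $\nabla_{x} x^{T} A x = A x + A^{T} x$ , as a sketch. The matrix is deliberately non-symmetric so that both terms matter; the helper names and the step size `eps` are illustrative choices.

```python
import numpy as np

def quad_form(A, x):
    return x @ A @ x

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of scalar f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

A = np.array([[1.0, 2.0],
              [0.0, 3.0]])   # deliberately non-symmetric
x = np.array([0.5, -1.5])

analytic = A @ x + A.T @ x
numeric = numerical_gradient(lambda v: quad_form(A, v), x)
```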

Least squares solution to $y = X w$ (sometimes $A x = b$ ) where there is no exact solution
- Geometrically, $y$ points partially out of $C (X)$ , i.e. has a $N (X^{T})$ component, and any $X w$ by definition is in $C (X)$
- Find instead the closest point. This is the projection of $y$ onto $C (X)$
- Thus solution is $X w = X {(X^{T} X)}^{- 1} X^{T} y$ , or $w = {(X^{T} X)}^{- 1} X^{T} y$
- Can use calculus to derive, involving quadratic forms since you are solving: $\nabla_{x} {∥ A x - b ∥}_{2}^{2} = \nabla_{x} (x^{T} A^{T} A x - 2 b^{T} A x + b^{T} b) = 2 A^{T} A x - 2 A^{T} b = 0$
- https://www.youtube.com/watch?v=MC7l96tW8V8
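A small sketch of the least squares solution via the normal equations, compared against NumPy's `np.linalg.lstsq`. The data is made up for illustration (fitting an intercept and slope to 4 points).

```python
import numpy as np

# Overdetermined system: 4 observations, 2 unknowns (intercept and slope).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library routine for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to C(X), i.e. it lies in N(X^T).
residual = y - X @ w_normal
```

Solving the linear system is usually preferred over explicitly inverting $X^{T} X$ , and `lstsq` is more robust still when $X^{T} X$ is ill-conditioned.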
TODO
- Gradients of determinant
- Eigenvalues as optimizations
- Laplacian: $d i v \nabla f = \nabla \cdot \nabla f$
Divergence: how much vector field is diverging from the given point
Only six combinations of scalars/vectors/matrices can be neatly organized in matrix form:

13.6 Decomposition aka factorization

Name	Conditions	Decomposition	$A^{- 1}$	$det A$
LU	Square matrix	$L U$ , $L$ is lower tri, $U$ is upper tri. (With row swaps: $P A = L U$ , $P$ is a permutation.)	$U^{- 1} L^{- 1} P$	$det A = s \prod_{i} U_{i i}$ , $s = \pm 1$ based on # row swaps
QR	Any matrix	$Q R$ , $Q$ is ortho, $R$ is upper tri. Or $Q D R$ , $D$ is diag.	$R^{- 1} Q^{- 1} = R^{- 1} Q^{T}$	$\| det A \| = \prod_{i} \| R_{i i} \|$
Cholesky	PD symmetric matrix	$L L^{T}$ where $L$ is lower tri.	${(L^{- 1})}^{T} L^{- 1}$	$det A = \prod_{i} L_{i i}^{2}$
SVD	Any matrix	$U D V^{T}$ , $D$ is diag (ordered singular values), $U, V$ ortho.	$V D^{- 1} U^{T}$	$\| det A \| = \prod_{i} D_{i i}$
EVD / Spectral	Square (diagonalizable) matrix	$U D U^{- 1}$ , $U$ is usually ortho (eigenvectors), $D$ is diag (eigenvalues).	$U D^{- 1} U^{- 1}$	$det A = \prod_{i} D_{i i}$
NMF	Non-neg matrix	$W H$ , also non-neg (and usu. lower rank approx)

LU, QR, Cholesky: makes solving/inverting easier, e.g. for OLS
Some of these involve diagonal matrices, which are generally easier to work with when calculating $A^{n}$
- E.g., if you need to find the result of repeatedly applying $A$
- $D^{n}$ has just the same diagonal entries each pow'd, so with EVD $A^{n} = U D^{n} U^{- 1}$ . Thus EVD (and SVD when $U = V$ , e.g. symmetric PSD $A$ ) is much more efficient to work with once you move to its basis.
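A sketch of computing a matrix power through the eigendecomposition, $A^{n} = U D^{n} U^{- 1}$ , checked against repeated multiplication. The example matrix is an arbitrary symmetric one (eigenvalues 1 and 3).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, so eigenvalues are real (1 and 3)
n = 5

eigvals, U = np.linalg.eig(A)

# Only the diagonal entries get raised to the power.
A_pow = U @ np.diag(eigvals ** n) @ np.linalg.inv(U)

# Reference: repeated multiplication.
A_pow_direct = np.linalg.matrix_power(A, n)
```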
LU: just matrix form of Gaussian elim's fwd elim phase
- a square matrix $A = L U$ where $L, U$ are lower/upper triangular matrices
- efficient; good for “good” matrices
- when solving $A x = b$ , faster to reuse $L U$ : if know $A = L U$ , then can solve $L U x = b$ by solving $L y = b$ for $y$ then $U x = y$ for $x$ , with each of these done efficiently using fwd-/back-subst
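A sketch of reusing one LU factorization for several right-hand sides. This is a plain Doolittle factorization without row swaps, so it assumes every pivot is nonzero; real libraries pivot for stability. All function names here are illustrative.

```python
import numpy as np

def lu_no_pivot(A):
    """Doolittle LU without row swaps; assumes every pivot is nonzero."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float)                 # working copy, reduced to upper tri
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]       # multiplier from fwd elimination
            U[i, :] -= L[i, k] * U[k, :]
    return L, U

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
L, U = lu_no_pivot(A)                   # factor once

rhs_list = [np.array([10.0, 12.0]), np.array([1.0, 0.0])]
solutions = [back_sub(U, forward_sub(L, b)) for b in rhs_list]  # reuse L, U
```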
QR
- Gram-Schmidt algo: just compute the orthogonal matrix $Q$ . Then straightforward to find $R = Q^{T} A$ .
- Any real $m \times n$ matrix $A = Q R$ where $Q$ is orthog $m \times m$ , $R$ is upper triangular $m \times n$
- “Good balance btwn LU and SVD” (?)
Cholesky decomposition: LU decomp when $A = A^{T}$ is positive definite square matrix ( $U = L^{T}$ ), but with more efficient algo
SVD: very robust
- any matrix $A = U S V^{T}$ where $U, V$ are orthonormal, $S$ is diagonal of reals (the singular values)
- $A^{T} A$ and $A A^{T}$ are PSD, and have the same nonzero eigenvalues (the squared singular values)
- Intuition: (1) rotation, (2) scaling, (3) some other rotation
- Geometrically: any transform $A$ maps $R^{m} \to R^{n}$ , any $m$ -dim unit sphere to $n$ -dim ellipsoid. Singular values are the lengths of the ellipsoid's semi-axes.
- Same as EVD for square symmetric PSD matrices!
- Singular values are conventionally in decreasing order
- Links with eigenvectors/eigenvalues
  - $V$ is the eigenvectors of $A^{T} A$ , and $S^{2}$ are the eigenvalues, and has EVD/SVD of $A^{T} A = V S^{2} V^{T}$ . Similar for $U$ and $A A^{T}$ . https://www.youtube.com/watch?v=Ed6CSJbyVak
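  - Spectral theorem: if $A$ is symmetric and PSD, then SVD is eigendecomposition (if symmetric but not PSD, the singular values are $| λ_{i} |$ and the signs get absorbed into $U$ or $V$ )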
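A numerical check of the link between singular values and the eigenvalues of $A^{T} A$ , sketched with NumPy on an arbitrary 3x2 matrix.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0],
              [0.0, 1.0]])   # arbitrary 3x2 example (an assumption)

# Thin SVD: U is 3x2, S holds the singular values, Vh is V^T (2x2).
U, S, Vh = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A, re-sorted to descending order to line up with S.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Reconstruction of A^T A from V and the squared singular values.
gram_from_svd = Vh.T @ np.diag(S ** 2) @ Vh
```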
Eigendecomposition (EVD) aka spectral decomposition
- square matrix $A = U D U^{- 1}$ where $U$ 's columns are the eigenvectors (and usually but not nec. orthonormal), $D$ is diagonal of eigenvalues (may be complex values)
- Intuition: (1) rotation, (2) scaling, (3) invert rotation (hence eigenvectors before/after are same directions)
- Identical to SVD for symmetric and PSD
Nonnegative matrix factorization (NMF): use multiplicative update algorithm by Lee, Seung 1999
Weighted matrix factorization: so that the zeros don't dominate the loss, good for many real world sparse matrices (e.g. in collab filtering)
Clustering: low rank factorization into AB where B is one-hot selection of columns of A . E.g. k-means. Each matrix entry is just choosing the closest representative column.

13.7 Optimization

Eckart–Young–Mirsky theorem: SVD gives the optimal low rank approximation to a matrix (just take the truncated portion of the matrices up to the desired rank)
- I think it may just be less efficient than other approaches?

import numpy as np

# Create a matrix
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Compute the singular value decomposition (SVD) of the matrix
U, S, Vh = np.linalg.svd(A)

# Truncate the SVD to a lower rank
k = 2
U_k = U[:, :k]
S_k = S[:k]
Vh_k = Vh[:k]

# Reconstruct the matrix from the truncated SVD
A_approx = U_k @ np.diag(S_k) @ Vh_k

13.8 HITS

$\forall i, \forall k = 1, 2, 3, \dots,$ $\begin{matrix} a_{i}^{(k)} & = & \sum_{j : e_{j i} \in E} h_{j}^{(k - 1)} \\ h_{i}^{(k)} & = & \sum_{j : e_{i j} \in E} a_{j}^{(k)} \end{matrix}$ and $a_{i}^{(1)} = h_{i}^{(1)} = \frac{1}{n}$ .
if $\vec{h}, \vec{a}$ are vectors of $h_{i}, a_{i}$ and $\vec{L}$ is adj matrix ( $L_{i j} = \begin{cases} 1 & \text{page } i \text{ links to } j \\ 0 & \text{otherwise} \end{cases}$ ) then $\begin{matrix} {\vec{a}}^{(k)} & = & {\vec{L}}^{T} {\vec{h}}^{(k - 1)} \\ {\vec{h}}^{(k)} & = & \vec{L} {\vec{a}}^{(k)} \end{matrix}$ or $\begin{matrix} {\vec{a}}^{(k)} & = & {\vec{L}}^{T} \vec{L} {\vec{a}}^{(k - 1)} \\ {\vec{h}}^{(k)} & = & \vec{L} {\vec{L}}^{T} {\vec{h}}^{(k - 1)} \end{matrix}$ i.e., power method applied to positive semi-definite matrices ${\vec{L}}^{T} \vec{L}$ (auth matrix) and $\vec{L} {\vec{L}}^{T}$ (hub matrix).
Thus, HITS amounts to finding the dominant eigenvectors, i.e. those for the largest eigenvalue $λ_{1}$ , in ${\vec{L}}^{T} \vec{L} \vec{a} = λ_{1} \vec{a}$ and $\vec{L} {\vec{L}}^{T} \vec{h} = λ_{1} \vec{h}$ .
http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf

13.9 PageRank

formulate a Markov chain transition matrix—prob of clicking link from any page to any other page
rank of page $i$ at iteration $k + 1$ : $r_{i}^{(k + 1)} = \sum_{j \in I_{i}} \frac{r_{j}^{(k)}}{| O_{j} |}$ , where $I_{i}$ is pages linking to $i$ and $O_{j}$ is pages $j$ links to
start with uniform $r_{i}^{(0)} = \frac{1}{n} \forall i$
in matrix notation, ${π^{(k + 1)}}^{T} = {π^{(k)}}^{T} H$ , where $H_{i j} = \frac{1}{| O_{i} |}$ if page $i$ links to $j$ , else 0
problems
- dangling nodes i.e. pages w no outlinks (rows of all 0s)
  - replace rows with $\vec{u}$ of all $\frac{1}{n}$ ; call resulting matrix $S$
- graph may not be strongly connected
  - replace $S$ with $G = α S + (1 - α) E$ where $E$ is teleportation matrix where all rows are $\vec{u}$
$π^{T} = π^{T} G$ , usu. with normalization condition $\sum_{i} π_{i} = 1$
properties
- $S, G$ are stochastic matrices (Markov chain transition matrix; rows sum to 1), hence there's always a solution
- actually a unique solution (stationary distribution), either by Markov theory or Perron-Frobenius theorem
http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf http://cacm.acm.org/magazines/2011/6/108660-pagerank-standing-on-the-shoulders-of-giants/fulltext
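A sketch of the PageRank iteration above with both fixes applied (uniform rows for dangling nodes, then teleportation). The damping value `alpha=0.85`, the tolerance, and the 3-page graph are illustrative assumptions.

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=1000):
    """adj[i, j] = 1 if page i links to page j. Returns the stationary vector."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)

    # S: row-normalized links; dangling rows (no outlinks) stay uniform 1/n.
    S = np.full((n, n), 1.0 / n)
    has_links = out_degree > 0
    S[has_links] = adj[has_links] / out_degree[has_links, None]

    # G = alpha S + (1 - alpha) E, where every entry of E is 1/n.
    G = alpha * S + (1 - alpha) / n

    pi = np.full(n, 1.0 / n)            # uniform start
    for _ in range(max_iter):
        new_pi = pi @ G                 # pi^T <- pi^T G
        if np.abs(new_pi - pi).sum() < tol:
            break
        pi = new_pi
    return new_pi

# Page 0 links to 1 and 2, page 1 links to 2, page 2 is dangling.
adj = np.array([[0.0, 1.0, 1.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
ranks = pagerank(adj)
```

Since rows of $G$ sum to 1, the iterate stays a probability vector without explicit renormalization.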

13.10 Unsorted

graph/Markov chain is irreducible iff strongly connected; matrix is irreducible iff it's transition matrix of irreducible graph/Markov chain
Power method or power iteration: an eigenvalue algo
- Given matrix $A$ , produce scalar $λ$ and non-0 vector $v$ (eigenvector) s.t. $A v = λ v$
- Doesn't compute matrix decomposition so suitable for large sparse $A$
- But will only find one eigenvalue (the one with largest absolute value) and may converge slowly
- Start with random vector $b_{0}$ and iteratively multiply by $A$ and normalize: $b_{k + 1} = \frac{A b_{k}}{∥ A b_{k} ∥}$
- TODO
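A minimal sketch of power iteration as described above. The eigenvalue estimate uses the Rayleigh quotient, which is an addition not spelled out in the notes; the iteration count, seed, and test matrix are arbitrary.

```python
import numpy as np

def power_iteration(A, num_iters=500, seed=0):
    """Return (dominant eigenvalue estimate, unit eigenvector estimate)."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=A.shape[0])     # random start vector b_0
    b /= np.linalg.norm(b)
    for _ in range(num_iters):
        Ab = A @ b
        b = Ab / np.linalg.norm(Ab)     # multiply by A, then normalize
    eigenvalue = b @ A @ b              # Rayleigh quotient (b has unit norm)
    return eigenvalue, b

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
```

Convergence speed depends on the ratio between the two largest eigenvalue magnitudes; a ratio near 1 means slow convergence.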

14 Optimization

14.1 Convex optimization

lin prog, quad prog

14.2 Linear programming

$arg {max}_{\vec{x}} {\vec{c}}^{T} \vec{x}$ subject to $A \vec{x} \leq \vec{b}$ and $\vec{x} \geq 0$
algo families: basis exchange (eg simplex), interior point (eg Karmarkar's algorithm), ellipsoid method

14.3 Quadratic programming

$arg {min}_{\vec{x}} f (\vec{x}) = \frac{1}{2} {\vec{x}}^{T} Q \vec{x} + {\vec{c}}^{T} \vec{x}$ subject to $A \vec{x} \leq \vec{b}$ and $E \vec{x} = \vec{d}$
algos: interior point, active set, augmented Lagrangian, conjugate gradient, simplex extensions
for pos-def $Q$ , ellipsoid method solves in poly time
for indef $Q$ , NP-hard, even if $Q$ has only 1 negative eigenvalue

15 Machine Learning

15.1 Information criteria

Akaike's information criterion (AIC): $- 2 log L + 2 p$ = deviance + 2 × (# params)
- $L$ is maximized likelihood using all avail data for estimation
- $p$ is # free params in model
- also seen (TODO resolve?)
  - $- 2 log \frac{RSS}{n} + \frac{2 n k}{n - k - 1}$ where RSS is residual sum of squares and $k$ is # params
  - $- 2 log L + 2 p + \frac{2 p (p + 1)}{n - p - 1}$ http://scott.fortmann-roe.com/docs/MeasuringError.html
- asymptotically, minimizing AIC equiv to minimizing CV value
Schwarz Bayesian information criterion (BIC): $- 2 log L + p log n$
- $n$ is # obs
- heavier penalty means model chosen by BIC is same or simpler (fewer params) than AIC
- asymptotically, minimizing BIC equiv to minimizing leave- $v$ -out CV where $v = n (1 - \frac{1}{log n - 1})$
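A small sketch of the criteria above, using the $- 2 log L$ forms. The two candidate fits (log-likelihoods and parameter counts) are made-up numbers, picked so that AIC and BIC disagree.

```python
import math

def aic(log_likelihood, num_params):
    return -2.0 * log_likelihood + 2.0 * num_params

def aicc(log_likelihood, num_params, num_obs):
    """AIC with the small-sample correction term 2p(p+1)/(n-p-1)."""
    p, n = num_params, num_obs
    return aic(log_likelihood, p) + 2.0 * p * (p + 1) / (n - p - 1)

def bic(log_likelihood, num_params, num_obs):
    return -2.0 * log_likelihood + num_params * math.log(num_obs)

# Hypothetical fits: (maximized log-likelihood, # params) on n = 100 observations.
n = 100
small = (-120.0, 3)
large = (-116.0, 6)

aic_prefers_large = aic(*large) < aic(*small)
bic_prefers_small = bic(*small, n) < bic(*large, n)
```

Here the larger model wins under AIC but loses under BIC, illustrating BIC's heavier penalty ( $log n > 2$ once $n \geq 8$ ).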

15.2 Bayesian learning

hypothesis prior $Pr [Θ]$ , where $Θ$ is hyp RV
likelihood $L (θ) = Pr [\vec{x} ∣ θ]$ ; log-likelihood $ℓ (θ) = log Pr [\vec{x} ∣ θ]$
posterior $Pr [θ ∣ \vec{x}] = α Pr [\vec{x} ∣ θ] Pr [θ]$
Bayesian learning: predict $Pr [X' ∣ \vec{x}] = \sum_{θ} Pr [X' ∣ \vec{x}, θ] Pr [θ ∣ \vec{x}] = \sum_{θ} Pr [X' ∣ θ] Pr [θ ∣ \vec{x}]$
- calculate prob of each hyp and predict over all hyps
maximum a posteriori (MAP): predict $Pr [X' ∣ θ_{MAP}]$ where $θ_{MAP} = arg {max}_{θ} Pr [θ ∣ \vec{x}]$
- predict from just the single hyp with greatest posterior (easier)
- usually $Pr [X' ∣ θ_{M A P}] \to Pr [X' ∣ \vec{x}]$ as more data arrives; otherwise, may snap to incorrect hypothesis
- equiv to MDL: $θ_{M A P} = arg {min}_{θ} - log Pr [\vec{x} ∣ θ] - log Pr [θ]$
maximum likelihood (ML): assume uniform prior over hyps, then $θ_{MAP} = θ_{ML} = arg {max}_{θ} Pr [\vec{x} ∣ θ]$
- reasonable for more data, since data swamps priors

15.3 Complete data

Modeling $p (x)$ or $p (y ∣ x)$
observations are commonly IID: $Pr [\vec{x} ∣ θ] = \prod_{i = 1}^{n} Pr [{\vec{x}}_{i} ∣ θ]$
maximum likelihood
- log likelihood easier to maximize because products become sums: $log Pr [\vec{x} ∣ θ] = \sum_{i = 1}^{n} log Pr [{\vec{x}}_{i} ∣ θ]$
- Naive Bayes: MLE on Bayesian network where (discrete) class is root, attrs are leaves, and attrs are conditionally independent given class
  - generative model: either class is a component of $\vec{x}$ ( ${\vec{x}}_{0}$ ) or say $Pr [\vec{x}, C ∣ θ]$
- same for both discrete and continuous models
- for network $A \to B$ where $A, B$ are continuous, MLE over $Pr [b ∣ a] = \frac{1}{\sqrt{2 π} σ} exp (- \frac{{(b - (θ_{1} a + θ_{2}))}^{2}}{2 σ^{2}})$ same as minimizing the exponent, the sum of squared errors $E = \sum_{i = 1}^{n} {(b_{i} - (θ_{1} a_{i} + θ_{2}))}^{2}$
Bayesian parameter learning (incorporating hyp probs)
- commonly use conjugate prior for hyp prior $Pr [Θ]$ to simplify math
- e.g. if $Θ = Pr [X = head]$ and $Pr [θ] = beta [a, b] (θ) = α θ^{a - 1} {(1 - θ)}^{b - 1}$ and our data is $x = head$ , then $Pr [θ ∣ x] = α Pr [x ∣ θ] Pr [θ] = α' θ θ^{a - 1} {(1 - θ)}^{b - 1} = beta [a + 1, b] (θ)$
- usu. assume param independence so each param can have own beta dist
- incorporating param RVs into Bayesian network itself requires making copies of the variables describing each instance
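The conjugate beta update above amounts to counting: each head increments $a$ , each tail increments $b$ . A sketch in plain Python; the function names and the prior `(2, 2)` are illustrative.

```python
def beta_update(a, b, observations):
    """Posterior beta[a', b'] after observing a sequence of 'head'/'tail' outcomes."""
    heads = sum(1 for x in observations if x == "head")
    tails = len(observations) - heads
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of beta[a, b], i.e. the expected Pr[X = head]."""
    return a / (a + b)

prior = (2, 2)
posterior = beta_update(*prior, ["head", "head", "tail", "head"])
posterior_mean = beta_mean(*posterior)
```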

15.4 Incomplete data: latent variable models

hidden/latent variables: indirection that dramatically reduces number of parameters in Bayesian network
Modeling (assuming discrete latent $z$ , but could also be continuous)
- $p (x) = \sum_{z} p (x ∣ z) p (z)$
- $p (y ∣ x) = \sum_{z} p (y ∣ x, z) p (z)$
  - Alternatives: $\sum_{z} p (y ∣ x, z) p (z ∣ x)$ (z depends on x) or $\sum_{z} p (y ∣ z) p (z ∣ x)$ (y depends only on z)
Classic example: mixture of Gaussians
Although $p (x)$ may be any complicated distribution, we pick simple distributions (such as normals) for both prior $p (z)$ and $p (x ∣ z)$
- Not all latent variable models are generative models and vice versa, but usu. much easier to represent generative models as latent variable models since they're complex distributions and can be decomposed into simple ones
Variational model: To train model $p_{θ} (x) = \int p (x ∣ z) p (z) ⅆ z$
- MLE is $\begin{matrix} θ & \leftarrow {arg max}_{θ} \frac{1}{N} \sum_{i} log p_{θ} (x_{i}) \\  & = {arg max}_{θ} \frac{1}{N} \sum_{i} log (\int p (x_{i} ∣ z) p (z) ⅆ z) \end{matrix}$
- But intractable to compute the integral over all $z$ , so let's sample them from posterior $θ \leftarrow arg max \frac{1}{N} \sum_{i} E_{z \sim p (z ∣ x_{i})} [log p_{θ} (x_{i}, z)]$
- But we don't have the posterior. Approximate it with normal. TODO
https://www.youtube.com/watch?v=UTMpM4orS30

15.5 Incomplete data: EM algo

EM algo
- E-step: calculate expected log-likelihood given current param estimates: $Q (θ ∣ θ^{(t)}) = E_{Z ∣ X, θ^{(t)}} [log L (θ; X, Z)]$
  - often, in reality, we don't actually have to calculate expected likelihood, just as we don't have to compute cost function in gradient descents
  - this step usually just updates hidden value estimates, which will then be used in M-step
- M-step: find params that maximize expected likelihood: $θ^{(t + 1)} = arg {max}_{θ} Q (θ ∣ θ^{(t)})$
EM algo examples (all have N data points)
- unsupervised clustering: Gaussian mixture model
  - single Gaussian: $N (x ∣ μ, Σ) = \frac{1}{{(2 π)}^{d / 2} \sqrt{| Σ |}} exp (- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ))$
  - GMM: $Pr [x] = \sum_{i = 1}^{k} Pr [C = i] \cdot N (x ∣ μ_{i}, Σ_{i})$ where $k$ is # Gaussians
  - while ANNs are universal approximators of functions, GMMs are universal approximators of densities
  - even diagonal GMMs are universal approximators; full-rank are unwieldy, since # params is square of # dims
  - like K-means with probabilistic assignments and Gaussians instead of means (K-means is a hard EM algo)
    - very sensitive to initialization; may initialize with K-means
  - mixture model of $k$ components: $Pr [\vec{x}] = \sum_{i = 1}^{k} Pr [C = i] Pr [\vec{x} ∣ C = i]$
  - Bayes net: $C \to \vec{X}$ , $C$ hidden, $C$ discrete, $\vec{X}$ continuous
  - E-step: calculate some quantities useful later (assignment probabilities, which are the hidden variables): $\begin{matrix} p_{i j} & \leftarrow & Pr [C_{j} = i ∣ {\vec{x}}_{j}] = α Pr [{\vec{x}}_{j} ∣ C_{j} = i] Pr [C_{j} = i] \\ p_{i} & \leftarrow & \sum_{j} p_{i j} \end{matrix}$
  - M-step: maximize expected likelihood of observed & hidden vars $\begin{matrix} {\vec{μ}}_{i} & \leftarrow & \sum_{j} p_{i j} {\vec{x}}_{j} / p_{i} \\ {\vec{Σ}}_{i} & \leftarrow & \sum_{j} p_{i j} ({\vec{x}}_{j} - {\vec{μ}}_{i}) {({\vec{x}}_{j} - {\vec{μ}}_{i})}^{T} / p_{i} \\ Pr [C = i] = θ_{i} & \leftarrow & p_{i} / N \end{matrix}$
  - to avoid local maxima (component shrinking to a single point, two components merging, etc.):
    - use priors on params to apply MAP version of EM
    - restart components with new random params if it gets too small/too close to another component
    - initialize params with reasonable values
  - can also do MAP GMM
  - http://bengio.abracadoudou.com/lectures/gmm.pdf
- Naive Bayes with hidden class
  - Bayes net: $X \to \vec{Y}$ , $X$ hidden, $X, Y$ discrete
  - E-step: $\begin{matrix} p_{i j} & \leftarrow & Pr [X = i ∣ \vec{Y} = {\vec{y}}_{j}] \\ p_{i} & \leftarrow & \sum_{j} p_{i j} \end{matrix}$
  - M-step: $\hat{N}$ are expected counts: $\begin{matrix} Pr [X = i] = θ_{i} & \leftarrow & \frac{\hat{N} (X = i)}{N} = \frac{1}{N} \sum_{j = 1}^{N} p_{i j} = \frac{p_{i}}{N} \\ Pr [\vec{Y} = \vec{y} ∣ X = i] = {\vec{θ}}_{i} & \leftarrow & \frac{\hat{N} (\vec{Y} = \vec{y}, X = i)}{\hat{N} (X = i)} = \frac{\sum_{j : {\vec{y}}_{j} = \vec{y}} p_{i j}}{p_{i}} \end{matrix}$
- HMMs: dynamic Bayes net with single discrete state var
  - each data point is sequence of observations
  - transition probs repeat across time: $\forall t, θ_{i j t} = θ_{i j}$
  - E-step: modify forward-backward algo to compute expected counts below
    - obtained by smoothing rather than filtering: must pay attn to subsequent evidence in estimating prob of a particular transition (eg evidence is obtained after crime)
  - M-step: $θ_{i j} \leftarrow \frac{\sum_{t} \hat{N} (X_{t + 1} = j, X_{t} = i)}{\sum_{t} \hat{N} (X_{t} = i)}$
EM algo
- pretend we know params, then “complete” the data by inferring prob dists over hidden vars, then find params that maximize likelihood of observed & hidden vars
- gist: ${\vec{θ}}^{(t + 1)} = arg {max}_{\vec{θ}} \sum_{\vec{z}} Pr [\vec{Z} = \vec{z} ∣ \vec{x}, {\vec{θ}}^{(t)}] ℓ (\vec{x}, \vec{Z} = \vec{z} ∣ θ)$
- E-step: compute $Q (\vec{θ} ∣ {\vec{θ}}^{(t)}) = E_{\vec{Z} ∣ \vec{x}, {\vec{θ}}^{(t)}} [ℓ (\vec{x}, \vec{Z} ∣ \vec{θ})]$
  - expected likelihood over $\vec{Z}$ under current ${\vec{θ}}^{(t)}$
  - misnomer: what's calculated are fixed, data-dependent params of $Q$
- M-step: compute ${\vec{θ}}^{(t + 1)} = arg {max}_{\vec{θ}} Q (\vec{θ} ∣ {\vec{θ}}^{(t)})$
  - new $\vec{θ}$ that maximizes the expected likelihood
- resembles gradient-based hill-climbing but no “step size” param
- monotonically increases likelihood
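A sketch of EM for a 1D Gaussian mixture, following the E-step/M-step structure of the GMM example above. Assumptions not in the notes: scalar data, variances computed around the freshly updated means, means initialized at data quantiles, and synthetic data from two well-separated normals.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, num_iters=100):
    # Deterministic init (an assumption): means at evenly spaced quantiles.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(num_iters):
        # E-step: p[i, j] = Pr[C_j = i | x_j], normalized over components i.
        joint = weights[:, None] * gaussian_pdf(x[None, :], mu[:, None], var[:, None])
        log_liks.append(np.log(joint.sum(axis=0)).sum())
        p = joint / joint.sum(axis=0)
        p_i = p.sum(axis=1)
        # M-step: weighted means, weighted variances, mixing weights.
        mu = (p @ x) / p_i
        var = (p * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / p_i
        weights = p_i / x.size
    return mu, var, weights, log_liks

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu, var, weights, log_liks = em_gmm_1d(data, k=2)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is a handy debugging check for any EM implementation.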

15.6 Kernel models

aka Parzen-Rosenblatt window
each instance contributes small density function $K (\vec{x}, {\vec{x}}_{i})$
density estimation: $p (\vec{x}) = \frac{1}{N} \sum_{i = 1}^{N} K (\vec{x}, {\vec{x}}_{i})$
kernel normally depends only on distance $D (\vec{x}, {\vec{x}}_{i})$
- eg $d$ -dimensional Gaussian $K (\vec{x}, {\vec{x}}_{i}) = \frac{1}{{(w \sqrt{2 π})}^{d}} exp (- \frac{D {(\vec{x}, {\vec{x}}_{i})}^{2}}{2 w^{2}})$ (normalized to integrate to 1)
supervised learning: take weighted combination of all predictions
- vs kNN's unweighted combination of $k$ instances

15.7 Classification

linear classifiers: simplest type of feedforward neural network
- TODO: single- vs multi-layer perceptron; feedforward vs backpropagation
perceptron learning algorithm: for each iteration, if $y_{t} (\vec{θ} \cdot {\vec{x}}_{t}) \leq 0$ (mistake), then $\vec{θ} \leftarrow \vec{θ} + y_{t} {\vec{x}}_{t}$ .
- makes at most $\frac{R^{2}}{γ_{g}^{2}}$ mistakes on training set, where $∥ \vec{x_{i}} ∥ \leq R$ and $γ_{g} \leq \frac{y_{i} (\vec{θ^{*}} \cdot \vec{x_{i}})}{∥ \vec{θ^{*}} ∥}$ is the margin
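The mistake-driven update above, sketched in plain Python without an offset term. Labels are $\pm 1$ ; the toy data set and the `max_epochs` cap are illustrative.

```python
def perceptron(points, labels, max_epochs=100):
    """Train theta (no offset term) with the mistake-driven update; labels are +1/-1."""
    theta = [0.0] * len(points[0])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in zip(points, labels):
            if y * sum(t * xi for t, xi in zip(theta, x)) <= 0:   # mistake
                theta = [t + y * xi for t, xi in zip(theta, x)]   # theta += y x
                mistakes += 1
                clean_pass = False
        if clean_pass:          # a full pass with no mistakes: converged
            break
    return theta, mistakes

points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
theta, mistakes = perceptron(points, labels)
```

On linearly separable data the loop terminates, consistent with the mistake bound above; on non-separable data only the `max_epochs` cap stops it.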
support vector machine (SVM): maximum margin classifier with some slack: minimize $\frac{1}{2} {∥ \vec{θ} ∥}^{2} + C \sum_{t = 1}^{n} ξ_{t}$ subject to $y_{t} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0}) \geq 1 - ξ_{t}$ and $ξ_{t} \geq 0$ $\forall t = 1, \dots, n$
- equivalent formulation assuming $y_{t}$ are 1/0 instead of $\pm 1$ : $\begin{matrix} minimize & C \sum_{t = 1}^{n} [y_{t} {cost}_{1} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0}) + (1 - y_{t}) {cost}_{0} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0})] + \frac{1}{2} {∥ \vec{θ} ∥}^{2} \\ where & {cost}_{1} (z) = max (0, 1 - z) \\ {cost}_{0} (z) = max (0, 1 + z) \end{matrix}$ since $\begin{matrix} y_{t} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0}) & \geq & 1 - ξ_{t} \\ ξ_{t} & \geq & 0 \end{matrix}$ becomes (given that we're trying to minimize all the $ξ_{t}$ ) $\begin{matrix} ξ_{t} & = & max (0, 1 - y_{t} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0})) \\ = & {cost}_{1} ({\vec{θ}}^{T} {\vec{x}}_{t} + θ_{0}) \end{matrix}$
- LOOCV error $\frac{1}{n} \sum_{i = 1}^{n} Loss (y_{i}, f (x_{i}; {\hat{\vec{θ}}}^{- i}, {\hat{\vec{θ}}}_{0}^{- i}))$ where Loss is the 0-1 loss. Upper bound is (# support vectors / $n$ ).
- quadratic programming optimization problem: single global optimum that can be found efficiently
- dual: $\begin{matrix} maximize & \sum_{i} α_{i} - \frac{1}{2} \sum_{i, j} α_{i} α_{j} y_{i} y_{j} ({\vec{x}}_{i} \cdot {\vec{x}}_{j}) \\ subject to & 0 \leq α_{i} \leq C \forall i \\ and & \sum_{i} α_{i} y_{i} = 0 \end{matrix}$ (without slack, the constraint is just $α_{i} \geq 0$ )
- kernel trick: substitute kernel function $K ({\vec{x}}_{i}, {\vec{x}}_{j}) = F ({\vec{x}}_{i}) \cdot F ({\vec{x}}_{j})$ where $F$ maps to high/infinite dimensions but $K$ can still be computed efficiently
- Mercer's theorem: any “reasonable” (positive definite) kernel function corresponds to some feature space
- kernels
  - quadratic: ${({\vec{x}}_{i} \cdot {\vec{x}}_{j})}^{2}$ (common illustration: slicing hyperparabola yields circular separator)
  - polynomial: ${(1 + {\vec{x}}_{i} \cdot {\vec{x}}_{j})}^{d}$
  - radial basis function (RBF): often the best
logistic regression: optimize using logit/logistic/sigmoid function $Pr [y = 1 ∣ \vec{x}; \vec{θ}] = h_{\vec{θ}} (\vec{x}) = g (z) = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{- z}}, z = θ_{0} + \vec{θ} \cdot \vec{x}$
- loss function
  - $Cost (h_{\vec{θ}} (\vec{x}), y) = \begin{cases} - log h_{\vec{θ}} (\vec{x}), & \text{if } y = 1 \\ - log (1 - h_{\vec{θ}} (\vec{x})), & \text{if } y = 0 \end{cases}$
  - Total loss $J (θ) = \frac{1}{m} \sum_{i = 1}^{m} Cost (h_{\vec{θ}} ({\vec{x}}^{(i)}), y^{(i)}) = - \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)} log h_{\vec{θ}} ({\vec{x}}^{(i)}) + (1 - y^{(i)}) log (1 - h_{\vec{θ}} (\vec{x^{(i)}}))]$ (a clever differentiable form)
  - intuition: 0 when completely correct and $\infty$ when completely incorrect
- softmax is a generalization of sigmoid, but they are equivalent when $k = 2$ only if you set one of the weights to 0, i.e. drop one of the terms from the regression problem. This is because it is redundant with the others, as they must sum to 1. $Pr [y = m ∣ x] = \frac{exp (x β_{m})}{\sum_{j = 1}^{J} exp (x β_{j})}$
- gradient descent: identical (LMS) update rule & derivation as in linear regression but $h$ is non-linear (logit)
- for perfectly separable data, $θ \to \infty$ ; need regularization
- another algo: Newton's method to find zero of $ℓ' (θ)$
- coefficients & intercept have log-odds interpretation
  - $z = log \frac{p}{1 - p}$ , where $p = Pr [y = 1 ∣ \vec{x}; \vec{θ}] = g (z)$
  - intercept: log-odds if $\vec{x} = \vec{0}$
  - coefficient for indicator: difference in log-odds (log odds ratio) between the 1 and 0 groups
  - coefficient for continuous: change in log-odds per unit increase in value
  - http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm
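A small sketch of the logistic pieces above: the sigmoid $g$, the total loss $J$, and the log-odds reading of the parameters. The data and the parameter values are made up for illustration, and the function names are not from any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(theta0, theta, xs, ys):
    # Average cross-entropy J(theta) over the m examples
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(theta0 + sum(t * xi for t, xi in zip(theta, x)))
        total += -(y * math.log(h) + (1 - y) * math.log(1 - h))
    return total / len(xs)

def log_odds(p):
    return math.log(p / (1 - p))

xs = [(0.0,), (1.0,), (2.0,), (3.0,)]   # toy inputs (assumed)
ys = [0, 0, 1, 1]
theta0, theta = -3.0, (2.0,)            # hand-picked, not fitted
loss = log_loss(theta0, theta, xs, ys)
p_at_0 = sigmoid(theta0)                # Pr[y = 1] when x = 0
p_at_1 = sigmoid(theta0 + theta[0])     # Pr[y = 1] when x = 1
```

Here `log_odds(p_at_0)` recovers the intercept, and `log_odds(p_at_1) - log_odds(p_at_0)` recovers the coefficient, which is the log-odds interpretation listed above.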
softmax: a generalization of sigmoid/logistic function $\frac{e^{x}}{1 + e^{x}}$
- For logistic

15.8 Regression

squared error $SE = \sum_{i} {(y_{i} - {\hat{y}}_{i})}^{2}$ , $MSE = \frac{1}{n} SE$ , $RMSE = \sqrt{MSE}$
- bias-variance tradeoff: $MSE = V a r [y_{i} - {\hat{y}}_{i}] + {(E [y_{i} - {\hat{y}}_{i}])}^{2} = Variance + {Bias}^{2}$
relative sq err $RSE = \frac{\sum_{i} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i} {(y_{i} - \overline{y})}^{2}}$ , root rel sq err $RRSE = \sqrt{RSE}$
$MAE = \frac{1}{n} \sum_{i} | y_{i} - {\hat{y}}_{i} |$ , rel abs err $RAE = \frac{\sum_{i} | y_{i} - {\hat{y}}_{i} |}{\sum_{i} | y_{i} - \overline{y} |}$
mean abs pct err $MAPE = \frac{1}{n} \sum_{i} | \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} |$ (drawbacks: zeros, unbounded)
relative errors can compare models of different units
coefficient of determination $R^{2} = \frac{S S_{reg}}{S S_{tot}} = 1 - \frac{S S_{err}}{S S_{tot}} = 1 - RSE$ where
- $S S_{tot} = \sum_{i} {(y_{i} - \overline{y})}^{2}$ : total sum of squares (proportional to sample variance)
- $S S_{reg} = \sum_{i} {(f_{i} - \overline{y})}^{2}$ : regression/explained sum of squares
- $S S_{err} = \sum_{i} {(y_{i} - f_{i})}^{2}$ : residual sum of squares; sum of squared residuals
- $R^{2} = 1$ is perfect; $R^{2} < 0$ means $\overline{y}$ is better than the model
adjusted $R^{2}$ is $1 - (1 - R^{2}) \frac{n - 1}{n - p - 1} = 1 - \frac{S S_{err}}{S S_{tot}} \cdot \frac{d f_{t}}{d f_{e}} = 1 - \frac{{V a r}_{err}}{{V a r}_{tot}}$ , where
- $p$ is total number of regressors in linear model
- $n$ is sample size
- $d f_{t}$ is dof $n - 1$ of estimate of population variance of $Y$
- $d f_{e}$ is dof $n - p - 1$ of estimate of underlying population error variance
- ${V a r}_{err} = \frac{S S_{err}}{n - p - 1}$ , ${V a r}_{tot} = \frac{S S_{tot}}{n - 1}$
- increases only if new regressor improves model more than would be expected by chance; penalizes too-many-regressors
- useful for feature selection, with small samples
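The error measures and $R^{2}$ variants above can be computed directly from their definitions. This is a sketch on made-up numbers; `regression_metrics` and its argument names are illustrative, and `p` is the number of regressors.

```python
import math

def regression_metrics(y, y_hat, p):
    n = len(y)
    y_bar = sum(y) / n
    ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual SS
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # total SS
    mse = ss_err / n
    r2 = 1 - ss_err / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(yi - fi) for yi, fi in zip(y, y_hat)) / n,
        "R2": r2,
        "adj_R2": adj_r2,
    }

y = [1.0, 2.0, 3.0, 4.0, 5.0]           # toy targets (assumed)
y_hat = [1.5, 2.0, 2.5, 4.0, 5.0]       # toy predictions (assumed)
metrics = regression_metrics(y, y_hat, p=1)
```

On this toy data adjusted $R^{2}$ comes out below plain $R^{2}$, reflecting the penalty for the one regressor at a small sample size.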
intervals
- confidence interval: tells us about population params (mean/variance, or model params)
  - “at confidence level 95%, 95% of samples (repeating this experiment) would generate CIs containing true params”
- prediction interval: tells us about data values; always wider than CI
  - “next value is in PI of 95% of samples (repeated experiments)”
  - or: “95% of repeated experiments generate PI containing next random value”
  - non-parametric assumes nothing about population; simply think of numbers of points/gaps
- tolerance interval: tells us about percentage of all future values; usu. wider than PI
  - “95% of all future values are in TI in 95% of samples/repeated experiments”
- eg: in simple linear regression, $\hat{y} = \hat{α} + \hat{β} x = E [y ∣ x]$ is the mean response, while $y = α + β x + ε$ is actual response
  - $\hat{y}$ use CI, since $\hat{α}, \hat{β}$ use CI; draw confidence bands in plot for what the possible regression line is
  - $y$ use PI; draw prediction bands in plot for what the possible values are

15.9 Support vector regression (SVR)

linear regression formulation with no slack: minimize $\frac{1}{2} {∥ \vec{w} ∥}^{2}$ subject to $y_{i} - ⟨ \vec{w}, {\vec{x}}_{i} ⟩ - b \leq ϵ$ and $⟨ \vec{w}, {\vec{x}}_{i} ⟩ + b - y_{i} \leq ϵ$ (require prediction $⟨ \vec{w}, {\vec{x}}_{i} ⟩ + b$ to be within $\pm ϵ$ of $y_{i}$ )
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.2073&rep=rep1&type=pdf

15.10 Linear regression

ordinary least squares (OLS) regression: a linear regression
- minimize cost function $J (θ) = \frac{1}{2} \sum_{i = 1}^{m} {(y^{(i)} - h_{\vec{θ}} ({\vec{x}}^{(i)}))}^{2}, h_{\vec{θ}} (\vec{x}) = \vec{θ} \cdot \vec{x}$ (sum of squared errors)
- gradient descent: repeatedly $θ_{j} \leftarrow θ_{j} - α \frac{\partial}{\partial θ_{j}} J (θ)$
  - $α$ is configurable learning rate
  - batch: cost over all training instances
    - $θ_{j} \leftarrow θ_{j} + α \sum_{i = 1}^{m} (y^{(i)} - \vec{θ} \cdot {\vec{x}}^{(i)}) {\vec{x}}_{j}^{(i)}$
  - stochastic aka incremental: converges faster
    - $θ_{j} \leftarrow θ_{j} + α (y^{(i)} - \vec{θ} \cdot {\vec{x}}^{(i)}) {\vec{x}}_{j}^{(i)}$
- update rule is called least mean squares (LMS) rule aka Widrow-Hoff rule
  - derivation for single training instance: $\frac{\partial}{\partial θ_{j}} J (\vec{θ}) = \frac{\partial}{\partial θ_{j}} \frac{1}{2} {(y - \vec{θ} \cdot \vec{x})}^{2} = 2 \cdot \frac{1}{2} (y - \vec{θ} \cdot \vec{x}) \frac{\partial}{\partial θ_{j}} (y - \vec{θ} \cdot \vec{x}) = - (y - \vec{θ} \cdot \vec{x}) x_{j}$
  - magnitude of update proportional to error term
- can minimize in closed form without iterative algo (some matrix calculus): $θ = {(X^{T} X)}^{- 1} X^{T} \vec{y}$
- LOOCV can also be computed without training $n$ models: $\frac{1}{n} \sum_{i = 1}^{n} {(\frac{e_{i}}{1 - h_{i}})}^{2}$ where $e_{i}$ is residual and $h_{i}$ is leverage ( $i$ th diagonal entry of hat matrix $X {(X^{T} X)}^{- 1} X^{T}$ )
- probabilistic interp: why linear regression/why J ?
  - assume $y = \vec{θ} \cdot \vec{x} + ε$ where $ε$ is normally distributed
  - in the following, design matrix $X$ has training inputs as rows, and examples are indep
  - to maximize likelihood $L (\vec{θ}) = p (\vec{y} ∣ X; \vec{θ}) = \prod_{i = 1}^{m} \frac{1}{\sqrt{2 π} σ} exp (- \frac{{(y^{(i)} - \vec{θ} \cdot {\vec{x}}^{(i)})}^{2}}{2 σ^{2}})$ , must minimize cost function in exponent
  - can work it out by writing log likelihood
- always passes through mean of $x$ s and $y$ s
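A sketch comparing batch gradient descent (the LMS rule) with the closed-form solution, specialised to one feature plus an intercept so it needs no matrix library. The data, `alpha`, and `steps` are assumptions chosen so the iteration converges on this toy set.

```python
def fit_gd(xs, ys, alpha=0.01, steps=5000):
    # theta = (t0, t1); the intercept feature x_0 = 1 is implicit
    t0, t1 = 0.0, 0.0
    for _ in range(steps):
        errs = [y - (t0 + t1 * x) for x, y in zip(xs, ys)]
        # Batch LMS rule: theta_j += alpha * sum_i (y_i - h(x_i)) * x_ij
        t0 += alpha * sum(errs)
        t1 += alpha * sum(e * x for e, x in zip(errs, xs))
    return t0, t1

def fit_closed_form(xs, ys):
    # Normal equations written out for one feature plus intercept
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

xs = [0.0, 1.0, 2.0, 3.0, 4.0]          # toy data (assumed)
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
gd = fit_gd(xs, ys)
cf = fit_closed_form(xs, ys)
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
```

Both fits agree, and evaluating the fitted line at `x_mean` returns `y_mean`, which is the "passes through the means" property noted above.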

15.11 Neural networks

$tanh$ is just a rescaled logistic sigmoid ( $tanh x = 2 g (2 x) - 1$ where $g (x) = \frac{e^{x}}{1 + e^{x}}$ ) and is useful in NNs because its range is $(- 1, 1)$ , centered at 0
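A quick numerical check of the rescaling identity, as a sketch; the sample points are arbitrary.

```python
import math

def g(x):
    # Logistic sigmoid in the form used above
    return math.exp(x) / (1.0 + math.exp(x))

xs = [-3.0, -0.5, 0.0, 0.5, 3.0]
# Largest gap between tanh(x) and 2 g(2x) - 1 over the sample points
max_gap = max(abs(math.tanh(x) - (2.0 * g(2.0 * x) - 1.0)) for x in xs)
```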

15.12 Multi-layer feed-forward neural networks

Assume $m$ examples, $K$ outputs, $L$ layers, $s_{l}$ nodes in layer $l$
Params $Θ^{(l)}$ is $s_{l + 1} \times s_{l}$ such that $Θ_{i j}^{(l)}$ is weight from node $j$ in layer $l$ to node $i$ in layer $l + 1$ (subscripts are “backward”)
Regularized cost $J (Θ) = - \frac{1}{m} [\sum_{i = 1}^{m} \sum_{k = 1}^{K} y_{k}^{(i)} log {(h_{Θ} (x^{(i)}))}_{k} + (1 - y_{k}^{(i)}) log (1 - {(h_{Θ} (x^{(i)}))}_{k})] + \frac{λ}{2 m} \sum_{l = 1}^{L - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} {(Θ_{j i}^{(l)})}^{2}$
- Or squared error for regressions; these turn out to have the same derivations
Backpropagation: algorithm used to compute gradients
- Need random initialization of $Θ$ to break symmetry or all params will be equal
- $Δ_{i j}^{(l)} \leftarrow 0 \forall l, i, j$
- for $i = 1$ to $m$ :
  - Forward-propagate activations ${\vec{z}}^{(l)} = Θ^{(l - 1)} {\vec{a}}^{(l - 1)}$ , ${\vec{a}}^{(l)} = g ({\vec{z}}^{(l)})$ for $l = 2, \dots, L$ where ${\vec{a}}^{(1)} = {\vec{x}}^{(i)}$
  - Backpropagate errors ${\vec{δ}}^{(l)} = {(Θ^{(l)})}^{T} {\vec{δ}}^{(l + 1)} ⊙ g' ({\vec{z}}^{(l)})$ for $l = L - 1, \dots, 2$ where
    - ${\vec{δ}}^{(L)} = {\vec{a}}^{(L)} - y^{(i)}$
    - $g' ({\vec{z}}^{(l)}) = {\vec{a}}^{(l)} ⊙ (1 - {\vec{a}}^{(l)})$ since $g' (x) = g (x) (1 - g (x))$
    - $a ⊙ b$ is element-wise multiplication (http://math.stackexchange.com/questions/52578/symbol-for-elementwise-multiplication-of-vectors)
  - $Δ_{i j}^{(l)} \leftarrow Δ_{i j}^{(l)} + δ_{i}^{(l + 1)} a_{j}^{(l)}$ for $l = L - 1, \dots, 1$
- Gradient $\frac{\partial}{\partial Θ_{i j}^{(l)}} J (Θ) = {\begin{matrix} \frac{1}{m} Δ_{i j}^{(l)} + \frac{λ}{m} Θ_{i j}^{(l)} & if j \neq 0 \\ \frac{1}{m} Δ_{i j}^{(l)} & if j = 0 \end{matrix}$
Excellent explanations of derivation at http://www.ml-class.org/course/qna/view?id=3740, http://www.scribd.com/doc/72228829/Back-Propagation
Derivation of gradient for layer $L - 1$ focusing on a single example ( $m = 1$ )
- Define $δ_{i}^{(l)} = \frac{\partial J}{\partial z_{i}^{(l)}}$ ; the second factor reduces because only the $k = j$ term of $z_{i}^{(l)} = \sum_{k} Θ_{i k}^{(l - 1)} a_{k}^{(l - 1)}$ depends on $Θ_{i j}^{(l - 1)}$ : $\begin{matrix} \frac{\partial J}{\partial Θ_{i j}^{(l - 1)}} & = & \frac{\partial J}{\partial z_{i}^{(l)}} \frac{\partial z_{i}^{(l)}}{\partial Θ_{i j}^{(l - 1)}} = δ_{i}^{(l)} \frac{\partial}{\partial Θ_{i j}^{(l - 1)}} (\sum_{k} Θ_{i k}^{(l - 1)} a_{k}^{(l - 1)}) = δ_{i}^{(l)} a_{j}^{(l - 1)} \end{matrix}$
- For output layer: $\begin{matrix} δ_{i}^{(L)} & = & \frac{\partial J}{\partial z_{i}^{(L)}} \\ = & \frac{\partial}{\partial z_{i}^{(L)}} (- \sum_{k = 1}^{K} [y_{k} log g (z_{k}^{(L)}) + (1 - y_{k}) log (1 - g (z_{k}^{(L)}))]) \\ = & - \frac{\partial}{\partial z_{i}^{(L)}} [y_{i} log g (z_{i}^{(L)}) + (1 - y_{i}) log (1 - g (z_{i}^{(L)}))] \\ = & - (y_{i} \frac{1}{g (z_{i}^{(L)})} - (1 - y_{i}) \frac{1}{1 - g (z_{i}^{(L)})}) g (z_{i}^{(L)}) (1 - g (z_{i}^{(L)})) \\ = & - (y_{i} (1 - g (z_{i}^{(L)})) - (1 - y_{i}) g (z_{i}^{(L)})) \\ = & - (y_{i} - g (z_{i}^{(L)})) = a_{i}^{(L)} - y_{i} \end{matrix}$
- Build on this to get previous layers
- For simpler squared error $J (Θ) = \frac{1}{2} \sum_{j = 1}^{K} {(y_{j} - a_{j})}^{2}$ , slightly different: $\begin{matrix} δ_{i} & = & \frac{\partial}{\partial z_{i}^{(L)}} (\frac{1}{2} \sum_{k = 1}^{K} {(y_{k} - g (z_{k}^{(L)}))}^{2}) \\ = & - (y_{i} - g (z_{i}^{(L)})) g' (z_{i}^{(L)}) \end{matrix}$
Useful to test with gradient checking (numerical gradient)
Generally if single hidden layer then choose more hidden units than inputs/outputs, and choose same # hidden units in each layer if more than 1 hidden layer
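A sketch of gradient checking on the smallest case: one sigmoid output unit with cross-entropy cost on a single example, where the analytic gradient is $δ a_{j}$ with $δ = a - y$. The weights, inputs, and `eps` are made-up values; a full network check would perturb every entry of every $Θ^{(l)}$ the same way.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, a_prev, y):
    # Cross-entropy for one example and one sigmoid output unit
    h = g(sum(t * a for t, a in zip(theta, a_prev)))
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

def analytic_grad(theta, a_prev, y):
    # delta = a - y at the output; dJ/dtheta_j = delta * a_prev_j
    delta = g(sum(t * a for t, a in zip(theta, a_prev))) - y
    return [delta * a for a in a_prev]

def numeric_grad(theta, a_prev, y, eps=1e-6):
    # Central differences, one parameter at a time
    grads = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + eps] + theta[j + 1:]
        dn = theta[:j] + [theta[j] - eps] + theta[j + 1:]
        grads.append((cost(up, a_prev, y) - cost(dn, a_prev, y)) / (2 * eps))
    return grads

theta = [0.3, -0.2, 0.5]        # assumed weights
a_prev = [1.0, 0.7, -1.2]       # a_prev[0] = 1 plays the role of the bias unit
y = 1
grad_analytic = analytic_grad(theta, a_prev, y)
grad_numeric = numeric_grad(theta, a_prev, y)
```

If the two lists disagree beyond numerical noise, the backpropagation code (not the numeric estimate) is the usual suspect.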

15.13 Local regression

locally weighted scatterplot smoothing (LOESS aka LOWESS) aka locally weighted regression (LWR)
- a type of smoother
- typically LOESS is variable-bandwidth (fixed-span) smoother (like nearest-neighbors)
- typically LOESS is locally quadratic or linear
- weight functions/kernels: tricubic (traditional), Gaussian, ...
fixed-bandwidth example: to predict at $x$ , fit $θ$ to minimize $\sum_{i} w^{(i)} {(y^{(i)} - \vec{θ} \cdot {\vec{x}}^{(i)})}^{2}$ where typically $w^{(i)} = exp (- \frac{{(x^{(i)} - x)}^{2}}{2 τ^{2}})$ (some kernel) and $τ$ is bandwidth param
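A sketch of the fixed-bandwidth example in one dimension: Gaussian weights around the query point, then a weighted least-squares line (with intercept) evaluated at that point. The function name, the sine data, and `tau=0.2` are assumptions for illustration.

```python
import math

def lwr_predict(x0, xs, ys, tau):
    # Gaussian weights centred on the query point x0
    w = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, xs)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, x, y in zip(w, xs, ys))
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, xs))
    slope = sxy / sxx
    # Weighted least-squares line evaluated at x0
    return y_bar + slope * (x0 - x_bar)

xs = [i / 10 for i in range(63)]        # grid on [0, 6.2]
ys = [math.sin(x) for x in xs]          # nonlinear target (assumed)
pred = lwr_predict(1.5, xs, ys, tau=0.2)
line_pred = lwr_predict(2.0, [0.0, 1.0, 2.0, 3.0, 4.0],
                        [1.0, 3.0, 5.0, 7.0, 9.0], tau=1.0)
```

A new fit is solved for every query point, which is why this is a non-parametric smoother; smaller `tau` tracks curvature more closely at the cost of noisier fits.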

15.14 Regularization

overfitting tends to occur when large weights found in $\vec{β}$
regularization pressures $\vec{β}$ to be small
LASSO/L1: minimize $\sum_{i} {(\vec{β} {\vec{x}}_{i} - y_{i})}^{2}$ where ${∥ \vec{β} ∥}_{1} \leq s$ and ${∥ \vec{β} ∥}_{1} = \sum_{j} | β_{j} |$ ( $ℓ_{1}$ norm)
- equiv: minimize $\sum_{i} \frac{1}{2 n} {(\vec{β} {\vec{x}}_{i} - y_{i})}^{2} + λ {∥ \vec{β} ∥}_{1}$
- more notes in AI.page
- better than L2 when many features (vs examples)
ridge/Tikhonov: instead of ${∥ X \vec{β} - \vec{y} ∥}_{2}^{2}$ , minimize ${∥ X \vec{β} - \vec{y} ∥}_{2}^{2} + {∥ Γ \vec{β} ∥}_{2}^{2}$ for some suitable Tikhonov matrix $Γ$
- L2 regularization: when $Γ$ is a multiple of the identity, $Γ = \sqrt{λ} I$ , giving penalty $λ {∥ \vec{β} ∥}_{2}^{2}$
- effect aka shrinkage
- $Γ$ can also be highpass to enforce smoothing
- explicit solution $\hat{β} = {(X^{T} X + Γ^{T} Γ)}^{- 1} X^{T} \vec{y}$ ( $O (n^{3})$ time)
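A sketch of shrinkage using the explicit ridge solution in the simplest setting: a single feature with no intercept, assuming $Γ = \sqrt{λ}$ (a scalar stand-in for a multiple of the identity), so the solution reduces to $\sum x y / (\sum x^{2} + λ)$. Data and $λ$ values are made up.

```python
def ridge_1d(xs, ys, lam):
    # (X^T X + Gamma^T Gamma)^{-1} X^T y with one column and Gamma = sqrt(lam)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                    # exactly y = 2x
betas = [ridge_1d(xs, ys, lam) for lam in (0.0, 1.0, 14.0)]
```

With `lam = 0` this is plain least squares (slope 2); increasing `lam` pulls the coefficient toward 0.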
elastic net: handles highly correlated vars; balance L1 & L2
- minimize $\frac{1}{2 n} {∥ X \vec{β} - \vec{y} ∥}_{2}^{2} + λ α {∥ \vec{β} ∥}_{1} + \frac{λ (1 - α)}{2} {∥ \vec{β} ∥}_{2}^{2}$
http://www-stat.stanford.edu/~owen/courses/305/Rudyregularization.pdf http://scikit-learn.org/stable/modules/linear_model.html

15.15 GLM

exponential family: $p (y; η) = b (y) exp (η^{T} T (y) - a (η))$
- $η$ is natural param aka canonical param
- $T (y)$ is sufficient statistic; often $T (y) = y$
- $a (η)$ is log partition function
- $e^{- a (η)}$ is normalizing const; ensures $p$ sums/integrates to 1 over $y$
fixed $T, a, b$ defines a family (set) of dists param'd by $η$ (vary $η$ for diff dists in fam)
assumptions to derive GLM
- $y ∣ x; θ \sim (some exponential family) (η)$
- predicting $h (x) = E [T (y) ∣ x] = E [y ∣ x]$
- $η = \vec{θ} \cdot \vec{x}$ (more of a “design choice” rather than an assumption)

15.16 PCA

algorithm: normalize data, then find eigenvectors of covariance matrix
- can use SVD: $U, S, V = SVD (C o v (\frac{X - μ}{σ}))$ , where $U$ cols are eigenvectors in decreasing importance, $S$ is diagonal matrix of singular values
- down-project: $\vec{z} = U_{k}^{T} \vec{x}$ , where $U_{k}$ is the first $k$ cols of $U$
- recover: $\vec{\hat{x}} = U_{k} \vec{z}$
use: viz (project to 2D/3D), compression, speeding up learning
don't use to prevent overfitting (by reducing dimensionality before learning), since it doesn't consider labels; use regularization instead
choose smallest $k$ where lost variance $\frac{\frac{1}{m} \sum_{i = 1}^{m} {∥ x^{(i)} - {\hat{x}}^{(i)} ∥}^{2}}{\frac{1}{m} \sum_{i = 1}^{m} {∥ x^{(i)} ∥}^{2}} \leq .01$
- or (equivalently) retained variance $\frac{\sum_{i = 1}^{k} S_{i i}}{\sum_{i = 1}^{n} S_{i i}} \geq .99$ (“99% variance retained”)
PCA transpose trick http://blog.echen.me/2011/03/14/pca-transpose-trick/
- given $m \times n$ obs matrix $A$ , often $n ≫ m$ (dims $≫$ obs)
- finding evecs of big $n \times n$ matrix $A^{T} A$ is expensive
- insight: if $v$ is evec of $A A^{T}$ , then $A^{T} v$ is evec of $A^{T} A$ w same eval (short proof)
- so, find evecs of $A A^{T}$ , then multiply by $A^{T}$
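A sketch of the transpose trick on a tiny matrix, using power iteration so it needs no linear algebra library. The matrix `A` and the helper names are made up; in practice one would call an eigen-solver on the small $m \times m$ matrix instead.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_eigvec(S, iters=200):
    # Power iteration on a symmetric PSD matrix; returns (eigenvalue, unit vector)
    v = [1.0] * len(S)
    for _ in range(iters):
        w = [sum(s * x for s, x in zip(row, v)) for row in S]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    lam = sum(vi * sum(s * x for s, x in zip(row, v)) for vi, row in zip(v, S))
    return lam, v

A = [[2.0, 0.0, 1.0, 3.0],
     [1.0, 1.0, 0.0, 2.0]]              # m = 2 observations, n = 4 dims
At = transpose(A)
small = matmul(A, At)                   # 2 x 2: cheap to decompose
big = matmul(At, A)                     # 4 x 4: built here only to verify
lam, v = top_eigvec(small)
u = [sum(a * x for a, x in zip(row, v)) for row in At]      # u = A^T v
big_u = [sum(b * x for b, x in zip(row, u)) for row in big]
residual = max(abs(bu - lam * ui) for bu, ui in zip(big_u, u))
```

`residual` measures how far $A^{T} A u$ is from $λ u$; it is at rounding-error level, so `u` is an eigenvector of the big matrix with the same eigenvalue (normalize `u` if a unit vector is needed).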
Related to the Eckart-Young theorem, that SVD yields the best lower rank approximation (by Frobenius norm)

15.17 SVD

$M = U Σ V^{*}$ , where
- $U$ is $m \times m$ real/complex unitary matrix; cols (left singular vectors) are evecs of $M M^{*}$
- $Σ$ is $m \times n$ non-neg real diag matrix
  - entries are singular values
  - non-0 singular values are sqrts of non-0 evals of $M M^{*}$ or $M^{*} M$
- $V$ is $n \times n$ real/complex unitary matrix ( $V^{*}$ is its conjugate transpose); cols of $V$ (right singular vectors) are evecs of $M^{*} M$
when all real, $U, V$ are rotations and $Σ$ is scaling
eval decomp: similar concept but only for square matrices M
- $M^{*} M = V Σ^{*} U^{*} U Σ V^{*} = V (Σ^{*} Σ) V^{*}$
- $M M^{*} = U Σ V^{*} V Σ^{*} U^{*} = U (Σ Σ^{*}) U^{*}$
- RHSs are eval decomps of LHSs
use SVD to perform PCA
- let $M$ be deviations matrix; covar matrix is $\frac{1}{n} M M^{T} = \frac{1}{n} U Σ^{2} U^{T}$
low-rank approx to $M$ : $\tilde{M} = U \tilde{Σ} V^{*}$ , where $\tilde{Σ}$ is same as $Σ$ but with only $r$ largest singular values (rest replaced by 0)

15.18 Matrix factorization

$R \approx M U$ where $M$ is $n_{M} \times d$ , $U$ is $d \times n_{U}$
- $d$ latent features
EM-like algo (6.867 project)
- randomly init $M$
- learn column $i$ of $U$ with OLS: $\vec{u_{i}} = {(M^{T} M)}^{- 1} M^{T} \vec{r_{i}}$ where $\vec{r_{i}}$ has just known values from column $i$ of $R$
- learn $M$ from $U$ ; iterate
- strongly overfits; can try introducing prior to prevent overfitting and to coerce values to be within 1 to 5
- can't run if too sparse because $M^{T} M$ (restricted to the known rows) becomes singular, so ${(M^{T} M)}^{- 1}$ doesn't exist
gradient descent http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
- $e_{i j}^{2} = {(r_{i j} - \hat{r_{i j}})}^{2} = {(r_{i j} - \sum_{k = 1}^{K} p_{i k} q_{k j})}^{2}$
- gradient: $\begin{matrix} \frac{\partial}{\partial p_{i k}} e_{i j}^{2} & = & - 2 (r_{i j} - \hat{r_{i j}}) q_{k j} = - 2 e_{i j} q_{k j} \\ \frac{\partial}{\partial q_{k j}} e_{i j}^{2} & = & - 2 (r_{i j} - \hat{r_{i j}}) p_{i k} = - 2 e_{i j} p_{i k} \end{matrix}$
- update rule, stepping against the gradient (usu. $α = .0002$ ): $\begin{matrix} p_{i k}' & = & p_{i k} - α \frac{\partial}{\partial p_{i k}} e_{i j}^{2} = p_{i k} + 2 α e_{i j} q_{k j} \\ q_{k j}' & = & q_{k j} - α \frac{\partial}{\partial q_{k j}} e_{i j}^{2} = q_{k j} + 2 α e_{i j} p_{i k} \end{matrix}$
- with regularization to avoid overfitting: $\begin{matrix} e_{i j}^{2} & = & {(r_{i j} - \sum_{k = 1}^{K} p_{i k} q_{k j})}^{2} + \frac{β}{2} \sum_{k = 1}^{K} (p_{i k}^{2} + q_{k j}^{2}) \\ p_{i k}' & = & p_{i k} - α \frac{\partial}{\partial p_{i k}} e_{i j}^{2} = p_{i k} + α (2 e_{i j} q_{k j} - β p_{i k}) \\ q_{k j}' & = & q_{k j} - α \frac{\partial}{\partial q_{k j}} e_{i j}^{2} = q_{k j} + α (2 e_{i j} p_{i k} - β q_{k j}) \end{matrix}$
- important extension: non-negative matrix factorization (NMF)
  - require $P, Q$ to be non-negative
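A sketch of the regularized gradient-descent factorization above, with `None` marking unknown ratings (an assumed convention). The step size `alpha=0.01`, `beta=0.02`, the number of steps, and the toy matrix are illustrative choices, not tuned values.

```python
import random

def factorize(R, d=2, alpha=0.01, beta=0.02, steps=2000, seed=0):
    rng = random.Random(seed)
    n, m = len(R), len(R[0])
    P = [[rng.random() for _ in range(d)] for _ in range(n)]
    Q = [[rng.random() for _ in range(m)] for _ in range(d)]
    known = [(i, j) for i in range(n) for j in range(m) if R[i][j] is not None]
    for _ in range(steps):
        for i, j in known:
            e = R[i][j] - sum(P[i][k] * Q[k][j] for k in range(d))
            for k in range(d):
                p, q = P[i][k], Q[k][j]
                # Step against the gradient of the regularised squared error
                P[i][k] = p + alpha * (2 * e * q - beta * p)
                Q[k][j] = q + alpha * (2 * e * p - beta * q)
    return P, Q

def sq_error(R, P, Q):
    # Squared reconstruction error over the known entries only
    return sum((R[i][j] - sum(P[i][k] * Q[k][j] for k in range(len(Q)))) ** 2
               for i in range(len(R)) for j in range(len(R[0]))
               if R[i][j] is not None)

R = [[5, 3, None, 1],
     [4, None, None, 1],
     [1, 1, None, 5],
     [1, None, 4, 4]]
P0, Q0 = factorize(R, steps=0)          # the random initialisation only
P, Q = factorize(R)                     # same seed, after training
err_before = sq_error(R, P0, Q0)
err_after = sq_error(R, P, Q)
```

The product of `P` and `Q` also fills in the `None` cells, which is the point of the exercise for recommendation-style data.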

15.19 Decision trees

information gain or entropy discretization
- entropy $H (.)$ is information uncertainty
- in total, need $H (Y)$ bits to classify (e.g. 1 bit for even 2-class distribution)
- each branch gives some information/removes uncertainty; want to move toward 0 uncertainty
- after branching on an attr with value dist $V$ , avg remaining uncertainty is $E_{V} [H (Y_{V})] \leq H (Y)$ , where $Y_{v}$ is class dist down branch for attr value $v$
- choose attr w lowest remaining uncertainty, or greatest information gain $H (Y) - E_{V} [H (Y_{V})]$
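A sketch of entropy and information gain for choosing a split attribute, on a toy 4-example dataset (made up). Entropy is in bits, matching the "1 bit for even 2-class distribution" remark.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    n = len(labels)
    remaining = 0.0
    for v in set(attr_values):
        branch = [y for a, y in zip(attr_values, labels) if a == v]
        remaining += (len(branch) / n) * entropy(branch)    # E_V[H(Y_V)]
    return entropy(labels) - remaining

labels = ["+", "+", "-", "-"]
perfect = ["a", "a", "b", "b"]          # separates the classes exactly
useless = ["a", "b", "a", "b"]          # every branch stays 50/50
gain_perfect = information_gain(perfect, labels)
gain_useless = information_gain(useless, labels)
```

The tree builder would pick the attribute with the larger gain, here `perfect`.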
stopping criterion
- AIMA suggests building full tree then prune instead of early stop (eg consider learning xor), but this doesn't seem to be done elsewhere
- AIMA suggests chi-square test
- MDL supposed to be best-performing http://www.jair.org/media/279/live-279-1538-jair.pdf

15.20 Markov Chain Monte Carlo (MCMC)

given multivariate dist, simpler to sample from conditional dists than the joint (hard to marginalize by integrating over joint dist)
Gibbs sampling: initialize vars $X_{i}^{(0)} \forall i$ , then iteratively sample $X_{i}^{(t)} \sim P (X_{i}^{(t)} ∣ \{X_{j}^{(t - 1)} ∣ j \neq i\}) \forall i$
- typically discard samples from initial burn-in period
- typically keep only every $n$ th sample (thinning), since successive samples are correlated
- useful when conditional distributions known, e.g. in Bayesian networks
- M-H special case where proposal dist is conditional dist; proposals accepted 100% of time
Metropolis Hastings: (slower) generalization of Gibbs sampling for when conditional distributions unknown
- use proposal conditional dist $Q (X'; X)$ , eg $N (X, σ^{2} I)$ (needn't be symmetric)
- $a = a_{1} a_{2} = \frac{P (X') Q (X^{(t)}; X')}{P (X^{(t)}) Q (X'; X^{(t)})}$ where:
  - $a_{1} = \frac{P (X')}{P (X^{(t)})}$ is likelihood ratio btwn proposed sample $X'$ and previous sample $X^{(t)}$
  - $a_{2} = \frac{Q (X^{(t)}; X')}{Q (X'; X^{(t)})}$ is ratio of proposal density in both directions
- $X^{(t + 1)} = {\begin{matrix} X' & if a \geq 1 \\ X' & with prob a if a < 1 \\ X^{(t)} & with prob 1 - a if a < 1 \end{matrix}$
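A sketch of Metropolis-Hastings with a Gaussian random-walk proposal, targeting an unnormalized standard normal. Because this proposal is symmetric, $a_{2} = 1$ and the acceptance ratio is just $a_{1}$. The target, `sigma`, sample counts, and seed are assumptions for illustration.

```python
import math
import random

def metropolis_hastings(log_p, x0, sigma, n_samples, burn_in, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for t in range(burn_in + n_samples):
        x_new = rng.gauss(x, sigma)     # proposal Q(X'; X) = N(X, sigma^2)
        # Symmetric proposal, so a = P(X') / P(X); work in logs for stability
        log_a = log_p(x_new) - log_p(x)
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x = x_new                   # accept; otherwise keep the old x
        if t >= burn_in:
            samples.append(x)
    return samples

def log_p(x):
    # Unnormalised log density of a standard normal
    return -0.5 * x * x

samples = metropolis_hastings(log_p, x0=5.0, sigma=1.0,
                              n_samples=20000, burn_in=1000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Only ratios of $P$ appear, so the normalizing constant is never needed; the deliberately bad start `x0=5.0` is what the burn-in discards.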
Burn-in commentary: http://www.johndcook.com/blog/2011/08/10/markov-chains-dont-converge/

15.21 Unsupervised learning

K-means
- Algo: given inputs $x^{(1)}, \dots, x^{(m)}$ , repeatedly:
  - update cluster assignments $c^{(1)}, \dots, c^{(m)}$ : assign each point to nearest cluster
  - update cluster centroids $μ_{1}, \dots, μ_{K}$ to mean of assigned points
- Objective: minimize distortion function $J (c^{(1)}, \dots, c^{(m)}, μ_{1}, \dots, μ_{K}) = \frac{1}{m} \sum_{i = 1}^{m} {∥ x^{(i)} - μ_{c^{(i)}} ∥}^{2}$
- Use random initialization (good to initialize to $K$ random data points); run many times; choose clustering with min $J$
- Choose $K$ at knee/elbow in curve of $J$ over $K$ (or choose based on the problem you're solving)
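A sketch of the two alternating K-means steps and the distortion $J$, in one dimension to keep it short. Initial centroids are taken from the data points, as suggested above; the data and the fixed iteration count are assumptions.

```python
def kmeans_1d(xs, centroids, iters=20):
    centroids = list(centroids)
    assign = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point
        assign = [min(range(len(centroids)), key=lambda k: (x - centroids[k]) ** 2)
                  for x in xs]
        # Update step: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [x for x, c in zip(xs, assign) if c == k]
            if members:                 # leave an empty cluster where it is
                centroids[k] = sum(members) / len(members)
    distortion = sum((x - centroids[c]) ** 2 for x, c in zip(xs, assign)) / len(xs)
    return centroids, assign, distortion

xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, assign, J = kmeans_1d(xs, centroids=[xs[0], xs[3]])
```

To follow the advice above, one would rerun this from several random initializations and keep the run with the smallest `J`.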

15.22 Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process

nonparametric: params can change w data
in the following
- $n$ is # diners, $α$ is dispersion param
- $G_{0}$ is base dist
Chinese restaurant process
- each new diner sits at new table w prob $\frac{α}{n + α}$ or table $k$ w prob $\frac{n_{k}}{n + α}$ ; call table assignments $g_{i}$
- table $k$ serves some set of food parameterized by $φ_{k} \sim G_{0}$
- generate data points $p_{i} \sim F (φ_{g_{i}})$
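A sketch of the seating step of the Chinese restaurant process (table assignments $g_{i}$ only; drawing $φ_{k}$ and the data points is left out). The function name, seed, and parameter values are illustrative.

```python
import random

def chinese_restaurant_process(n_diners, alpha, seed=0):
    rng = random.Random(seed)
    counts = []                         # counts[k] = diners at table k
    assignments = []                    # g_i for each diner
    for n in range(n_diners):           # n diners are already seated
        r = rng.random() * (n + alpha)
        k, acc = 0, 0.0
        # Table k has weight n_k; the leftover weight alpha opens a new table
        while k < len(counts) and r >= acc + counts[k]:
            acc += counts[k]
            k += 1
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

assignments, counts = chinese_restaurant_process(100, alpha=1.0)
```

Larger `alpha` opens new tables more often, so the number of clusters grows with both the data size and the dispersion parameter.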
Polya urn model
- start w urn containing $α G_{0} (φ_{k})$ balls of “color” $φ_{k}$ , for ea possible color $φ_{k}$
- draw ball, note color $φ_{i}$ , put back orig ball & new ball of same color
- generate data points $p_{i} \sim F (φ_{g_{i}})$
stick-breaking process
- start w stick of length 1
- sample $β_{1} \sim B e t a (1, α)$ ; $w_{1} = β_{1}$ is the eventual proportion of customers at table 1
- sample $β_{2} \sim B e t a (1, α)$ ; $w_{2} = (1 - β_{1}) β_{2}$ is eventual prop of customers at table 2
- generate group assignments $g_{i} \sim M u l t i n o m i a l (w_{1}, \dots, w_{\infty})$
- generate data points $p_{i} \sim F (φ_{g_{i}})$
Dirichlet process
- generate a dist $G \sim D P (G_{0}, α)$
- generate group-level params $x_{i} \sim G$ , where $x_{i}$ is group param for $i$ th data point
- generate each data point $p_{i} \sim F (x_{i})$
inference: Gibbs sampling
- randomly init group assignments
- pick a point; fix assignments of other points; assign point to most likely group (existing or new)
- repeat till convergence
Indian buffet process: customers belong to multiple tables instead of just 1
CRP, Polya, sticks: sequential; DP: parallel
http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/

16 Natural language processing (NLP)

16.1 Naive Bayes

NB usu uses multi-variate Bernoulli event model: $X_{i} \sim {Bernoulli}_{c, i}$ are whether dict word $i$ is present (for class $c$ )
multinomial event model: better for text classif; $X_{i} \sim {Multinomial}_{c}$ is $i$ th word in doc, sampled from dict (for class $c$ )
- model: $Pr [D ∣ {\vec{θ}}_{c}] = \frac{(\sum_{i} f_{i})!}{\prod_{i} f_{i}!} \prod {(θ_{c, i})}^{f_{i}}$ where $D$ is document (this is just multinomial PMF)
- prediction: $arg {max}_{c} [log p ({\vec{θ}}_{c}) + \sum_{i} f_{i} log θ_{c, i}]$
- estimation: ${\hat{θ}}_{c, i} = \frac{N_{c, i} + α_{i}}{N_{c} + α}$ where $α = \sum_{i} α_{i}$
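A sketch of the multinomial event model with the smoothed estimate and log-space prediction rule above. It assumes a uniform $α_{i} = 1$ (Laplace smoothing) and uses the empirical class frequency as the prior term; the tiny corpus and function names are made up.

```python
import math
from collections import Counter

def train(docs, labels, vocab, alpha_i=1.0):
    # Returns log class priors and smoothed log word probabilities per class
    log_prior, log_theta = {}, {}
    for c in set(labels):
        words = [w for d, y in zip(docs, labels) if y == c for w in d if w in vocab]
        counts = Counter(words)
        denom = len(words) + alpha_i * len(vocab)           # N_c + alpha
        log_prior[c] = math.log(labels.count(c) / len(labels))
        log_theta[c] = {w: math.log((counts[w] + alpha_i) / denom) for w in vocab}
    return log_prior, log_theta

def predict(doc, log_prior, log_theta, vocab):
    f = Counter(w for w in doc if w in vocab)               # word frequencies f_i
    def score(c):
        return log_prior[c] + sum(n * log_theta[c][w] for w, n in f.items())
    return max(log_prior, key=score)

docs = [["goal", "match", "goal"], ["match", "team"],
        ["vote", "party"], ["party", "vote", "vote"]]
labels = ["sport", "sport", "politics", "politics"]
vocab = {"goal", "match", "team", "vote", "party"}
log_prior, log_theta = train(docs, labels, vocab)
guess = predict(["vote", "match", "vote"], log_prior, log_theta, vocab)
```

The multinomial coefficient is dropped from the score because it is the same for every class.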
data skew bias: sensitive to training class dist
- fix with complement NB: ${\hat{θ}}_{\tilde{c}, i} = \frac{N_{\tilde{c}, i} + α_{i}}{N_{\tilde{c}} + α}$ and $arg {max}_{c} [log p ({\vec{θ}}_{c}) - \sum_{i} f_{i} log θ_{\tilde{c}, i}]$ (notice the $-$ )
- note: doesn't do anything for binary; need 3+ classes
weight magnitude errors: deal with deps like “San Francisco” vs. “Boston”
- use weight normalization $\frac{log {\hat{θ}}_{c, i}}{\sum_{k} | log {\hat{θ}}_{c, k} |}$ instead of $log {\hat{θ}}_{c, i}$ (given without much explanation)
transforms to model text more accurately based on empirical text dists
- term frequency: empirical dists were heavier-tailed than predicted by multinomial, more like power-law
  - eg multinomial says $Pr [f_{i} = 9]$ (see a word 9 times) is tiny, but in reality pretty high (“burstiness”)
  - use $f_{i}' = log (d + f_{i})$ where $d = 1$ (or some optimized value)
  - brings closer to the Bernoulli (0-1) model
- document frequency
  - common words unlikely to be predictive, but random variations can create fictitious correlations
  - use $f_{i}' = f_{i} log \frac{\sum_{j} 1}{\sum_{j} δ_{i, j}}$ to downweight common words where $δ_{i, j} = 1$ iff word $i$ in doc $j$
- document length
  - docs have strong inter-word deps; if word appears, likely to re-appear
  - longer docs have disproportionately higher empirical term freqs/heavier tails; can have strong effect
  - use $f_{i}' = \frac{f_{i}}{\sqrt{\sum_{k} {(f_{k})}^{2}}}$ ; denom is doc length; discounts long docs; common in IR
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
BLEU score: $BLEU (reference, generated) = BP \cdot {(\prod_{n = 1}^{4} {precision}_{n})}^{\frac{1}{4}} = BP \cdot exp (\frac{1}{4} \sum_{n = 1}^{4} log {precision}_{n})$ where $BP = min (1, exp (1 - \frac{reference length}{generated length}))$ is brevity penalty
- modified n-gram precision is $\frac{\sum_{w} clipped match count (w)}{generated n-gram count}$ so that for (say) unigrams, “the the the ...” doesn't get scored highly just because “the” showed up once or twice in any single reference sentence. Count only up to the max # occurrences in reference sentences.
- averages over several different n-gram precisions, using geometric mean (since it's a product)
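A sketch of the modified (clipped) n-gram precision only, not the full BLEU score, using the classic "the the the" example. Whitespace tokenization and the function names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(generated, references, n):
    gen_counts = Counter(ngrams(generated, n))
    # Largest count of each n-gram within any single reference
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    # Clip each generated count at the reference maximum
    clipped = sum(min(c, max_ref[gram]) for gram, c in gen_counts.items())
    return clipped / sum(gen_counts.values())

generated = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(generated, references, 1)
p1_exact = modified_precision(references[0], references, 1)
```

Unclipped precision would be 7/7 for the degenerate sentence; clipping at the two occurrences of "the" in the first reference brings it down to 2/7.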

17 Complexity and computability

17.1 Turing completeness

Every optimization problem has corresponding decision problem: does there exist a solution of at least/most K
P: decision problems that can be solved in polynomial time
- NP: decision problems that can be verified in polynomial time
- P is subset of NP
- NP hard: at least as hard as every problem in NP (see below); need not itself be verifiable in polynomial time
- Most believe P is not NP
NP
- NP hard: problems that can solve any problem in NP (any NP problem can be reduced to any NP hard problem)
- NP complete: problems that are in both NP and NP hard (NP hard that can be verified in polynomial time)
Turing complete: can do everything a Turing machine can do

18 Diffusion

TODO main derivation

19 Misc math

19.1 Metric spaces

Requirements: symmetry, identity, and triangle inequality
Examples: Euclidean, taxicab, Chebyshev (diagonal is dist 1)
Non-examples: KL divergence

19.2 Elo ratings

Elo rating is based on logistic curve.
- Design choice: if A is rated 400 points above B, then prob of A winning is 10x the prob of B winning (odds 10:1).
- Define scores as probabilities: A winning = 1, losing = 0, and draw = 0.5.
- Expected score is the prob, based on the difference in their ratings: $E_{A} = Pr [A w i n s] = σ_{400} (R_{A} - R_{B}) = \frac{1}{1 + 1 0^{- (R_{A} - R_{B}) / 400}}$
- Originally used normal dist, but switched to logistic which better fit data
Update formula: $R_{A} \leftarrow R_{A} + K (S_{A} - E_{A})$ where $K = η \frac{log 10}{s} = 32$ ( $s$ is scaling factor of 400, and $η$ is gradient adaptation step), $S_{A}$ is the actual score, and $E_{A}$ is expected.
- It's a stochastic gradient update step to maximize the probability/minimize the log loss of $- η log σ_{s} (R_{A} - R_{B})$ . The bigger the surprise, the bigger the correction.
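A sketch of the expected score and the update step, with $K = 32$ and $s = 400$ as above. The ratings are made up, and updating B symmetrically with the same $K$ is an assumption.

```python
def expected_score(r_a, r_b, s=400.0):
    # Logistic curve in base 10 on the rating difference
    return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / s))

def update(r_a, r_b, s_a, k=32.0):
    # s_a is A's actual score: 1 for a win, 0.5 for a draw, 0 for a loss
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

e = expected_score(1600, 1200)          # 400-point favourite
upset = update(1600, 1200, s_a=0.0)     # favourite loses
routine = update(1600, 1200, s_a=1.0)   # favourite wins
```

The upset moves the ratings roughly ten times as far as the routine win, which is the "bigger surprise, bigger correction" behaviour; the two players' changes cancel, so total rating is conserved.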

20 Problems

Solve for $x$ in $x^{x^{x^{\dots}}} = 2$ .
Which is greater, $e^{π}$ or $π^{e}$ ? (Hint: $e^{x} > 1 + x$ for $x > 0$ )
http://www.quantnet.com/forum/threads/quantitative-interview-questions-and-answers.437/