# A Story of Basis and Kernel - Part II: Reproducing Kernel Hilbert Space

## 1. Opening Words

In the previous blog, the function basis was briefly discussed. We began with viewing a function as an infinite vector, and then defined the inner product of functions. Similar to $\mathcal{R}^n$ space, we can also find orthogonal function basis for a function space.

This blog will move a step further discussing about kernel functions and reproducing kernel Hilbert space (RKHS). Kernel methods have been widely used in a variety of data analysis techniques. The motivation of kernel method arises in mapping a vector in $\mathcal{R}^n$ space as another vector in a feature space. For example, imagine there are some red points and some blue points as the next figure shows, which are not easily separable in $\mathcal{R}^n$ space. However, if we map them into a high-dimension feature space, we may be able to seperate them easily. This article will not provide strict theoretical definition, but rather intuitive description on the basic ideas.

## 2. Eigen Decomposition

For a real symmetric matrix $\mathbf{A}$ , there exists real number $\lambda$ and vector $\mathbf{x}$ so that

$\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$

Then $\lambda$ is an eigenvalue of $\mathbf{A}$ and $\mathbf{x}$ is the corresponding eigenvector. If $\mathbf{A}$ has two different eigenvalues $\lambda_1$ and $\lambda_2$ , $\lambda_1 \neq \lambda_2$ , with corresponding eigenvectors $\mathbf{x}_1$ and $\mathbf{x}_2$ respectively,

$\lambda_1 \mathbf{x}_1^T \mathbf{x}_2 = \mathbf{x}_1^T \mathbf{A}^T \mathbf{x}_2 = \mathbf{x}_1^T \mathbf{A} \mathbf{x}_2 = \lambda_2 \mathbf{x}_1^T \mathbf{x}_2$

Since $\lambda_1 \neq \lambda_2$ , we have $\mathbf{x}_1^T \mathbf{x}_2 = 0$ , i.e., $\mathbf{x}_1$ and $\mathbf{x}_2$ are orthogonal.

For $\mathbf{A} \in \mathcal{R}^{n \times n}$ , we can find $n$ eigenvalues a long with $n$ orthogonal eigenvectors. As a result, $\mathbf{A}$ can be decomposited as

$\mathbf{A} = \mathbf{Q} \mathbf{D} \mathbf{Q}^T$

where $\mathbf{Q}$ is an orthogonal matrix (i.e., $\mathbf{Q} \mathbf{Q}^T = \mathbf{I}$ ) and $\mathbf{D} = \text{diag} (\lambda_1, \lambda_2, \cdots, \lambda_n)$ . If we write $\mathbf{Q}$ column by column

$\mathbf{Q}=\left( \mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n \right)$

then

$\begin{array}{rl} \mathbf{A}=\mathbf{Q} \mathbf{D} \mathbf{Q}^T &= \left( \mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n \right) \begin{pmatrix} \lambda_1\ &&& \\ & \lambda_2\ && \\ && \ddots\ & \\ &&&\lambda_n \end{pmatrix} \begin{pmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_n^T \end{pmatrix} \\ &= \left( \lambda_1 \mathbf{q}_1, \lambda_2 \mathbf{q}_2, \cdots, \lambda_n \mathbf{q}_n \right) \begin{pmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_n^T \end{pmatrix} \\ &=\sum_{i=1}^n \lambda_i \mathbf{q}_i \mathbf{q}_i^T \end{array}$

Here ${ \{\mathbf{q}_i \} }_{i=1}^n$ is a set of orthogonal basis of $\mathcal{R}^n$ .

## 3. Kernel Function

A function $f(\mathbf{x})$ can be viewed as an infinite vector, then for a function with two independent variables $K(\mathbf{x},\mathbf{y})$ , we can view it as an infinite matrix. Among them, if $K(\mathbf{x},\mathbf{y}) = K(\mathbf{y},\mathbf{x})$ and

$\int \int f(\mathbf{x}) K(\mathbf{x},\mathbf{y}) f(\mathbf{y}) d\mathbf{x} d\mathbf{y} \geq 0$

for any function $f$ , then $K(\mathbf{x},\mathbf{y})$ is symmetric and positive definite, in which case $K(\mathbf{x},\mathbf{y})$ is a kernel function.

Similar to matrix eigenvalue and eigenvector, there exists eigenvalue $\lambda$ and eigenfunction $\psi(\mathbf{x})$ so that

$\int K(\mathbf{x},\mathbf{y}) \psi(\mathbf{x}) d\mathbf{x} = \lambda \psi(\mathbf{y})$

For different eigenvalues $\lambda_1$ and $\lambda_2$ with corresponding eigenfunctions $\psi_1(\mathbf{x})$ and $\psi_2(\mathbf{x})$ , it is easy to show that

$\begin{array}{rl} \int \lambda_1 \psi_1(\mathbf{x}) \psi_2(\mathbf{x}) d\mathbf{x} & = \int \int K(\mathbf{y},\mathbf{x}) \psi_1(\mathbf{y}) d\mathbf{y} \psi_2(\mathbf{x}) d\mathbf{x} \\ & = \int \int K(\mathbf{x},\mathbf{y}) \psi_2(\mathbf{x}) d\mathbf{x} \psi_1(\mathbf{y}) d\mathbf{y} \\ & = \int \lambda_2 \psi_2(\mathbf{y}) \psi_1(\mathbf{y}) d\mathbf{y} \\ & = \int \lambda_2 \psi_2(\mathbf{x}) \psi_1(\mathbf{x}) d\mathbf{x} \end{array}$

Therefore,

$< \psi_1, \psi_2 > = \int \psi_1(\mathbf{x}) \psi_2(\mathbf{x}) d\mathbf{x} = 0$

Again, the eigenfunctions are orthogonal. Here $\psi$ denotes the function (the infinite vector) itself.

For a kernel function, infinite eigenvalues ${ \{\lambda_i\} }_{i=1}^{\infty}$ along with infinite eigenfunctions ${ \{\psi_i\} }_{i=1}^{\infty}$ may be found. Similar to matrix case,

$K(\mathbf{x},\mathbf{y}) = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i (\mathbf{y})$

which is the Mercer's theorem. Here $< \psi_i, \psi_j > = 0$ for $i \neq j$ . Therefore, ${ \{\psi_i\} }_{i=1}^{\infty}$ construct a set of orthogonal basis for a function space.

Here are some commonly used kernels:

• Polynomial kernel $K(\mathbf{x},\mathbf{y}) = ( \gamma \mathbf{x}^T \mathbf{y} + C)^d$
• Gaussian radial basis kernel $K(\mathbf{x},\mathbf{y}) = \exp (-\gamma \Vert \mathbf{x} - \mathbf{y} \Vert^2 )$
• Sigmoid kernel $K(\mathbf{x},\mathbf{y}) = \tanh (\gamma \mathbf{x}^T \mathbf{y} + C )$

## 4. Reproducing Kernel Hilbert Space

Treat ${ \{\sqrt{\lambda_i} \psi_i\} }_{i=1}^{\infty}$ as a set of orthogonal basis and construct a Hilbert space $\mathcal{H}$ . Any function or vector in the space can be represented as the linear combination of the basis. Suppose

$f = \sum_{i=1}^{\infty} f_i \sqrt{\lambda_i} \psi_i$

we can denote $f$ as an infinite vector in $\mathcal{H}$ :

$f = (f_1, f_2, ...)_\mathcal{H}^T$

For another function $g = (g_1, g_2, ...)_\mathcal{H}^T$ , we have

$< f,g >_\mathcal{H} = \sum_{i=1}^{\infty} f_i g_i$

For the kernel function $K$, here I use $K(\mathbf{x},\mathbf{y})$ to denote the evaluation of $K$ at point $\mathbf{x},\mathbf{y}$ which is a scalar, use $K(\cdot,\cdot)$ to denote the function (the infinite matrix) itself, and use $K(\mathbf{x},\cdot)$ to denote the $\mathbf{x}$th "row" of the matrix, i.e., we fix one parameter of the kernel function to be $\mathbf{x}$ then we can regard it as a function with one parameter or as an infinite vector. Then

$K(\mathbf{x},\cdot) = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i$

In space $\mathcal{H}$ , we can denote

$K(\mathbf{x},\cdot) = (\sqrt{\lambda_1} \psi_1 (\mathbf{x}), \sqrt{\lambda_2} \psi_2 (\mathbf{x}), \cdots )_\mathcal{H}^T$

Therefore

$< K(\mathbf{x},\cdot), K(\mathbf{y},\cdot) >_\mathcal{H} = \sum_{i=0}^{\infty} \lambda_i \psi_i (\mathbf{x}) \psi_i(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$

This is the reproducing property, thus $\mathcal{H}$ is called reproducing kernel Hilbert space (RKHS).

Now it is time to return to the problem from the beginning of this article: how to map a point into a feature space? If we define a mapping

$\bold{\Phi} (\mathbf{x}) = K(\mathbf{x},\cdot) = (\sqrt{\lambda_1} \psi_1 (\mathbf{x}), \sqrt{\lambda_2} \psi_2 (\mathbf{x}), \cdots )^T$

then we can map the point $\mathbf{x}$ to $\mathcal{H}$ . Here $\bold{\Phi}$ is not a function, since it points to a vector or a funtion in the feature space $\mathcal{H}$ . Then

$< \bold{\Phi} (\mathbf{x}), \bold{\Phi} (\mathbf{y}) >_\mathcal{H} = < K(\mathbf{x},\cdot), K(\mathbf{y},\cdot) >_\mathcal{H} = K(\mathbf{x},\mathbf{y})$

As a result, we do not need to actually know what is the mapping, where is the feature space, or what is the basis of the feature space. For a symmetric positive-definite function $K$ , there must exist at least one mapping $\bold{\Phi}$ and one feature space $\mathcal{H}$ so that

$< \bold{\Phi} (\mathbf{x}), \bold{\Phi} (\mathbf{y}) > = K(\mathbf{x},\mathbf{y})$

which is the so-called kernel trick.

## 5. A Simple Example

Consider kernel function

$K(\mathbf{x},\mathbf{y}) = \left( x_1, x_2, x_1 x_2 \right) \begin{pmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{pmatrix} = x_1 y_1 + x_2 y_2 + x_1 x_2 y_1 y_2$

where $\mathbf{x}=(x_1,x_2)^T, \mathbf{y}=(y_1,y_2)^T$ . Let $\lambda_1=\lambda_2=\lambda_3=1$ , $\psi_1(\mathbf{x})=x_1$ , $\psi_2(\mathbf{x})=x_2$ , $\psi_3(\mathbf{x})=x_1 x_2$ . We can define the mapping as

$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \overset{\bold{\Phi}}{\longrightarrow} \begin{pmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{pmatrix}$

Then

$< \bold{\Phi} (\mathbf{x}), \bold{\Phi}(\mathbf{y}) > = \left( x_1, x_2, x_1 x_2 \right) \begin{pmatrix} y_1 \\ y_2 \\ y_1 y_2 \end{pmatrix} = K(\mathbf{x},\mathbf{y})$

## 6. Support Vector Machine

Support vector machine (SVM) is one of the most widely known application of RKHS. Suppose we have data pairs ${ (\mathbf{x}_i, y_i) }_{i=1}^n$ where $y_i$ is either 1 or -1 denoting the class of the point $\mathbf{x}_i$. SVM assumes a hyperplane to best seperate the two classes.

$\min_{\boldsymbol{\beta}, \beta_0} \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i$

$\text{subject to } \xi_i \geq 0, y_i (\mathbf{x}_i^T \boldsymbol{\beta} + \beta_0 ) \geq 1 - \xi_i, \forall i$

Sometimes the two classes cannot be easily seperated in $\mathcal{R}^n$ space, thus we can map $\mathbf{x}_i$ into a high-dimension feature space where the two classes may be easily seperated. The original problem can be reformulated as

$\min_{\boldsymbol{\beta}, \beta_0} \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i$

$\text{subject to } \xi_i \geq 0, y_i (\bold{\Phi}(\mathbf{x}_i)^T \boldsymbol{\beta} + \beta_0 ) \geq 1 - \xi_i, \forall i$

The Lagrange function is

$L_p = \frac{1}{2} \Vert \boldsymbol{\beta} \Vert^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i [y_i (\bold{\Phi}(\mathbf{x}_i)^T \boldsymbol{\beta} + \beta_0) - (1-\xi_i)] -\sum_{i=1}^n \mu_i \xi_i$

Since

$\frac{\partial L_p}{\partial \boldsymbol{\beta}} = \mathbf{0}$

we get

$\boldsymbol{\beta} = \sum_{i=1}^n \alpha_i y_i \bold{\Phi}(\mathbf{x}_i)$

That is, $\boldsymbol{\beta}$ can be writen as the linear combination of $\mathbf{x}_i$s! We can substitute $\boldsymbol{\beta}$ and get the new optimization problem. The objective function changes to:

$\begin{array}{rl} &\frac{1}{2} \Vert \sum_{i=1}^n \alpha_i y_i \bold{\Phi} (\mathbf{x}_i) \Vert^2 + C \sum_{i=1}^n \xi_i \\ =& \frac{1}{2} < \sum_{i=1}^n \alpha_i y_i \bold{\Phi} (\mathbf{x}_i), \sum_{j=1}^n \alpha_j y_j \bold{\Phi} (\mathbf{x}_j) > + C \sum_{i=1}^n \xi_i \\ =& \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j < \bold{\Phi} (\mathbf{x}_i), \bold{\Phi} (\mathbf{x}_j) > + C \sum_{i=1}^n \xi_i \\ = & \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) + C \sum_{i=1}^n \xi_i \end{array}$

The constraints changes to:

$\begin{array}{rl} & y_i \left[\bold{\Phi}(\mathbf{x}_i)^T \left( \sum_{j=1}^n \alpha_j y_j \bold{\Phi}(\mathbf{x}_j) \right) + \beta_0 \right] \\ =& y_i \left[ \left( \sum_{j=1}^n \alpha_j y_j < \bold{\Phi}(\mathbf{x}_i), \bold{\Phi}(\mathbf{x}_j) > \right) + \beta_0 \right] \\ =& y_i \left[ \left( \sum_{j=1}^n \alpha_j y_j K(\mathbf{x}_i, \mathbf{x}_j) \right) + \beta_0 \right] \geq 1 - \xi_i, \forall i \end{array}$

What we need to do is determining a kernel function and solve for $\boldsymbol{\alpha}, \beta_0, \xi_i$ . We do not need to actually construct the feature space. For a new data $\mathbf{x}$ with unknown class, we can predict its class by

$\begin{array}{ccl} \hat{y} &=& \text{sign} \left[ \bold{\Phi} (\mathbf{x})^T \boldsymbol{\beta} + \beta_0 \right] \\ &=& \text{sign} \left[ \bold{\Phi} (\mathbf{x})^T \left( \sum_{i=1}^n \alpha_i y_i \bold{\Phi}(\mathbf{x}_i) \right) + \beta_0 \right] \\ &=& \text{sign} \left( \sum_{i=1}^n \alpha_i y_i < \bold{\Phi} (\mathbf{x}), \bold{\Phi}(\mathbf{x}_i) > + \beta_0 \right) \\ &=& \text{sign} \left( \sum_{i=1}^n \alpha_i y_i K(\mathbf{x},\mathbf{x}_i) + \beta_0 \right) \end{array}$

Kernel methods greatly strengthen the discriminative power of SVM.

## 7. Summary and Reference

Kernel method has been widely utilized in data analytics. Here, the fundamental property of RKHS is introduced. With kernel trick, we can easily map the data to a feature space and do analysis. Here is a video with nice demonstration on why we can easily do classification with kernel SVM in a high-dimension feature space.

The example in Section 5 is from

• Gretton A. (2015): Introduction to RKHS, and some simple kernel algorithms, Advanced Topics in Machine Learning, Lecture conducted from University College London.

Other reference includes

• Paulsen, V. I. (2009). An introduction to the theory of reproducing kernel Hilbert spaces. Lecture Notes.
• Daumé III, H. (2004). From zero to reproducing kernel hilbert spaces in twelve pages or less.
• Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning. Springer, Berlin: Springer series in statistics.)