Somebody else looking in  # Preceptron – A simple binary classifier

## What is a preceptron?

Suppose we have a linearly separable two-category dataset with an input space \mathcal{X}\subseteq\mathbb{R}^n and a output space \mathcal{Y}=\{+1,-1\}, preceptron tries to find a hyper-plane that can completely seperate these two class (-1 and 1). Here is a intutive plot of it: The hyper-plane cuts the dataset into two parts with the values of -1 and 1, that white vacancy is the imagined hyper-plane.

## The Model

As we give the input and the output space above, the function for that reflection is:

f(x)=\mathrm{sign}(w\cdot x+b)

Here the \cdot stand for inner product of two vector, w for weight vector and b for bias.
Plus, the \mathrm{sign} here is the sign function:

\begin{cases}
&amp;  +1,x\ge 0 \\
&amp;  -1,x&lt; 0
\end{cases}

For a given dataset T where T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}, preceptron model is to get the w,b that satisfy the classfication requires. Then using the model to forecast any given data is avaliable.

## The Strategy

### Linear Separability

For a given dataset

T=\{(x_1,y_1),...(x_N,y_N)\}

If it exists a hyper-plane that can completely divide dataset into positive and negative subsets, in which for all
i\space\text{that}\space y_i=+1\space\text{s.t.}\space w\cdot x_i+b>0, we consider this a linearly separable data set, and vice versa.

### The Loss Function

#### Distance

For a random point x_0 in the input space \mathbb{R}^n, we have the distance to the given hyper-plane S is

\frac{1}{||w||}|w\cdot x_0+b|

For misclassified data, -y_i(w\cdot x_i+b)>0, so the distance to the hyper-plane is

\frac{1}{||w||}y_i(w\cdot x_i+b)

Let M be the set contains all the misclassified data, the total distance would be

\displaystyle -\frac{1}{||w||}\sum_{x_i\in M}y_i(w\cdot x_i+b)

Dismiss \frac{1}{||w||}, we have the loss function

\displaystyle L(w,b)=-\sum_{x_i\in M}y_i(w\cdot x_i+b)

## The Algoritm

The algoritm of preceptron is a optimization-algoritm, which is, calculating the parameters w,b by minimizing the loss function

\displaystyle \min_{w,b}L(w,b)=-\sum_{x_i\in M}y_i(w\cdot x_i+b)

Here we use stochastic gradient descent to find the minimum of the loss function

\displaystyle \nabla_wL(w,b)=-\sum_{x_i\in M}y_ix_i\\
\nabla_wL(w,b)=-\sum_{x_i\in M}y_i

In this case the update of w and b would be

w\leftarrow w+\eta y_ix_i\\
b\leftarrow b+\eta y_i

Where \eta(0<\eta\leq1) is the stepsize (some would like to say learning rate).

## The Dual Problem

In the primary problem, when the procedure stops, the final w,b will be

\displaystyle w=\sum^{N}_{i=1}\alpha_i y_i x_i\\
\displaystyle b= \sum^{N}_{i=1}\alpha_i y_i

So we can conclude a model like

\displaystyle f(x)=\mathrm{sign}(\sum_{j=1}^N \alpha_iy_jx_j\cdot x+b)\\
\mathrm{where}\quad \alpha = (\alpha_1,\alpha_2,...,\alpha_N)^T

And the condition would be like

\displaystyle y_i(\sum^N_{j=1}\alpha_jy_jx_j\cdot x_i+b)\leq 0

We only need the inner product of each data, and store them in a matrix call the Gram.

G=[x_i\cdot x_j]_{N\times N}