Pattern Recognition

Buzzwords like “Machine Learning”, “AI”, and “Big Data” are now everywhere. If a company has nothing to do with any of these fashionable terms, it may even be regarded as somewhat “conservative”, even if that is not true.

Yes, the rapid development of information technology gives humans more and more powerful tools to deal with problems that were extremely difficult or complicated in the past. Pattern Recognition, the old name of “Machine Learning”, still works well for most of today’s applications. As an (almost) “white-box” approach, Pattern Recognition is still widely chosen because of its explainability. For applications where people need to know how and why an answer is produced by the algorithm, Pattern Recognition methods are still the first, or even the only, methods to choose from.

As someone who keeps an eye on this chaotic battlefield, I would like to share part of my understanding from an EE&CS&Math perspective, together with my limited project experience. I hope a few of these viewpoints resonate with you.

In the remaining parts of these notes, I would like to talk about the necessary background knowledge and the basic ideas of Pattern Recognition in different applications.

I know that there must be a massive number of errors, typos, and even silly grammatical mistakes. Please help me by pointing them out in an email. If something confuses you and you would like to ask why, you are also welcome to do so. My email is wangy52@rpi.edu. Thank you very much!!!

Probability

Probability is the foundation of Pattern Recognition, especially the results from the Bayesians. The most important, powerful, and straightforward conclusion here is the optimal decision rule (MAP): \(y^*=\arg\max_y P(y|x)\), where \(x\) is the data point and \(y^*\) is the optimal predicted label. As you will see, much of the work is about figuring out the posterior \(P(y|x)\), or at least finding a good approximation of it. Of course, there are many other useful results, but as a researcher, “optimal” means way more than “working”, right?
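
To see why the likelihood and the prior keep showing up throughout these notes, it helps to rewrite the MAP rule with Bayes’ theorem; since the evidence \(P(x)\) does not depend on \(y\), it drops out of the maximization:

\[
y^* = \arg\max_y P(y|x) = \arg\max_y \frac{P(x|y)\,P(y)}{P(x)} = \arg\max_y P(x|y)\,P(y).
\]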

This note talks about four commonly used probability distributions. Pay great attention to the concept called “conjugate prior” here.
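
As a small illustration of conjugacy (a standard textbook example, not taken from the note itself), the Beta distribution is conjugate to the Bernoulli likelihood: starting from a Beta(a, b) prior on a coin’s head probability, observing flips only adds counts to a and b, so the posterior is again a Beta distribution.

    # Minimal sketch of the Beta-Bernoulli conjugate pair.
    # Prior Beta(a, b); after observing some 0/1 coin flips the
    # posterior is Beta(a + #heads, b + #tails).

    def beta_bernoulli_update(a, b, flips):
        """Update Beta(a, b) with a list of 0/1 coin-flip observations."""
        heads = sum(flips)
        tails = len(flips) - heads
        return a + heads, b + tails

    a, b = 2.0, 2.0                                   # assumed prior pseudo-counts
    a, b = beta_bernoulli_update(a, b, [1, 0, 1, 1, 0, 1])
    print("posterior mean of head probability:", a / (a + b))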

The mean and variance of the conditional Gaussian are derived in the note.
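
For reference, the standard result for a jointly Gaussian vector partitioned into \((x_a, x_b)\), with means \(\mu_a, \mu_b\) and covariance blocks \(\Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}\), is

\[
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b), \qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}.
\]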

Sequential Updating is an online algorithm, which can be turned into a form of Gradient Descent.
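
A typical instance is the sequential update of the maximum-likelihood mean: each new data point \(x_N\) nudges the current estimate by an amount proportional to the prediction error, which has exactly the shape of a gradient-descent step with step size \(1/N\):

\[
\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right).
\]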

MLE and MAP correspond to the Frequentist and Bayesian views of estimation. This note uses MLE and MAP to solve the coin-flip problem, so you can see the differences between the two methods.
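
For the coin-flip problem with head probability \(\theta\), \(N_H\) observed heads out of \(N\) flips, and a Beta(\(a, b\)) prior on the Bayesian side, the two point estimates are the standard results

\[
\hat{\theta}_{MLE} = \frac{N_H}{N}, \qquad
\hat{\theta}_{MAP} = \frac{N_H + a - 1}{N + a + b - 2},
\]

so the prior acts like a set of pseudo-observations whose influence fades as \(N\) grows.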

Linear Algebra is useful throughout the whole learning process. Matrix Calculus is extremely useful if you want to jump to Neural Networks.

Regression

Regression and classification are the two basic tasks of Machine Learning. The output of a regression model is a continuous, usually uncountable, value, while classification produces a label from a countable and often finite set. Many of you may have heard of polynomial fitting, which uses a polynomial to approximate a trend. Here, we go a little bit deeper. We introduce the Singular Value Decomposition (SVD), which shows how the polynomial coefficients are calculated. You will also get a good understanding of where the fitting error comes from.
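
Here is a minimal NumPy sketch of the idea (the toy data and the polynomial degree are made up for illustration): build the polynomial design matrix, take its SVD, and solve the least-squares problem through the pseudo-inverse.

    import numpy as np

    # Toy data, assumed for illustration: a noisy sine curve.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 20)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

    degree = 3
    A = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, x^3

    # Least-squares coefficients via the SVD: w = V diag(1/s) U^T t
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    w = Vt.T @ ((U.T @ t) / s)

    print("polynomial coefficients:", w)
    print("fitting error (residual norm):", np.linalg.norm(A @ w - t))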

Classification

Classification tells you which category the given data point belongs to.

Linear Regression is one typical representative of this kind of classification method.

This is a simple but proven-to-be-powerful method for modeling some seemingly difficult problems.

Support Vector Machine

Before learning this part, I recommend that you know something about Convex Optimization. The derivation of SVM requires solving one minimization problem from both the primal and the dual directions. Knowledge about the Lagrangian is important.
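
For concreteness, the hard-margin primal problem and its Lagrangian (the soft-margin version only adds slack variables) are

\[
\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_n\,(w^\top x_n + b) \ge 1,\quad n=1,\dots,N,
\]

\[
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} \alpha_n \left[\, y_n (w^\top x_n + b) - 1 \,\right], \qquad \alpha_n \ge 0,
\]

and the dual problem comes from minimizing \(L\) over \(w, b\) and then maximizing over \(\alpha\).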

Graphical Models

Graphical Models are used to simplify the characterization of \(P(y|x) \sim P(y,x)\) when we know some independence information about \(x\). If we know features \(x_1\) and \(x_2\) are independent, the joint probability factorizes as \(P(x_1,x_2)=P(x_1)P(x_2)\). Since figuring out \(P(x_1)\) and \(P(x_2)\) separately only needs \(O(2N)\) estimates while characterizing \(P(x_1, x_2)\) jointly requires \(O(N^2)\), the independence information saves tons of effort. Besides that, we can also infer the cause from the result with the model, e.g. you are reading this example => you are visiting my website => you have a smart device and/or a laptop.
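
A quick back-of-the-envelope check of the counting argument above, with a made-up feature size:

    # Hypothetical example: two discrete features, each taking N = 100 values.
    N = 100

    joint_table = N * N        # entries needed to tabulate P(x1, x2) directly
    factored_tables = N + N    # entries needed for P(x1) and P(x2) if independent

    print(joint_table, "vs", factored_tables)   # 10000 vs 200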

This part is not finished yet. Algorithms, such as forward propagation and Viterbi, will be added to it.

Mixture Model and EM

I strongly recommend that you read the Graphical Models notes in the previous section before reading this section’s notes.

Multi-armed Bandit Machine Problem

In online learning, there is a class of problems that requires the agent to learn about the system while taking actions. When the cost of taking an action is negligible, the problem can be treated as an offline learning problem. However, when the cost is considerable, ideally the agent should quickly figure out which action is the most beneficial (exploration) and then spend almost all of the budget on it (exploitation). When the ground truth is considered not to change with time (or to change only slowly), these problems are called multi-armed bandit machine (with limited budget) problems. The greedy algorithm, or its more practical variation \(\epsilon\)-greedy, is one solution. But here, we would like to introduce a more involved and adaptive algorithm, Thompson Sampling.
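
Below is a minimal sketch of Thompson Sampling on Bernoulli-reward arms (the arm probabilities and the number of rounds are made up for illustration): keep a Beta posterior per arm, sample one plausible success rate per arm from its posterior, pull the arm with the highest sample, and update that arm’s posterior with the observed reward. Exploration happens automatically because uncertain arms occasionally produce large samples.

    import random

    # Hypothetical true success probabilities of the arms (unknown to the agent).
    true_probs = [0.3, 0.5, 0.7]
    n_arms = len(true_probs)

    # Beta(1, 1) prior for every arm: alpha counts successes, beta counts failures.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms

    for t in range(1000):
        # Sample a plausible success probability for each arm from its posterior.
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = samples.index(max(samples))       # pull the arm that looks best right now

        reward = 1 if random.random() < true_probs[arm] else 0   # simulated feedback
        alpha[arm] += reward                    # posterior update for the pulled arm
        beta[arm] += 1 - reward

    print("pull counts per arm:", [int(a + b - 2) for a, b in zip(alpha, beta)])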

Neural Network and Deep Neural Network