The kNN Machine Learning Algorithm in OpenCV 3, Using Python

Date: 2019-11-13


English original: http://docs.opencv.org/master/d5/d26/tutorial_py_knn_understanding.html

Goal

In this chapter, we will understand the concepts of the k-Nearest Neighbour (kNN) algorithm.

Theory

kNN is one of the simplest classification algorithms available for supervised learning. The idea is to search for the closest match of the test data in feature space. We will look into it with the image below.


In the image, there are two families, Blue Squares and Red Triangles. We call each family a Class. Their houses are shown on their town map, which we call the feature space. *(You can consider a feature space as a space where all data are projected. For example, consider a 2D coordinate space. Each data point has two features, its x and y coordinates. You can represent this data in your 2D coordinate space, right? Now imagine there are three features: you need a 3D space. Now consider N features, where you need an N-dimensional space, right? This N-dimensional space is the feature space. In our image, you can consider it a 2D case with two features.)*
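
To make the feature-space idea concrete, here is a minimal NumPy sketch. The house coordinates are made up for illustration, and Euclidean distance is an assumption; the text above only speaks of "closest" without naming a metric.

```python
import numpy as np

# Hypothetical houses in a 2D feature space: one row per house, columns are (x, y).
houses = np.array([[1.0, 2.0], [4.0, 6.0], [7.0, 1.0]], dtype=np.float32)
new_house = np.array([4.0, 2.0], dtype=np.float32)

# Euclidean distance from the new house to every existing house (assumed metric).
distances = np.linalg.norm(houses - new_house, axis=1)
print(distances)

# The same expression works for N features: only the number of columns changes.
points_5d = np.zeros((3, 5), dtype=np.float32)
query_5d = np.ones(5, dtype=np.float32)
distances_5d = np.linalg.norm(points_5d - query_5d, axis=1)
print(distances_5d)
```

Each row is one data point and each column one feature, which is the same layout the OpenCV example later in this article uses for its training data.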


Now a new member comes into the town and builds a new home, which is shown as a green circle. He should be added to one of these Blue/Red families. We call that process Classification. What do we do? Since we are dealing with kNN, let us apply this algorithm.


One method is to check who his nearest neighbour is. From the image, it is clear that it is the Red Triangle family, so he is also added to Red Triangle. This method is simply called Nearest Neighbour, because classification depends only on the nearest neighbour.


But there is a problem with that. Red Triangle may be the nearest, but what if there are a lot of Blue Squares near him? Then Blue Squares have more strength in that locality than Red Triangle, so checking only the nearest one is not sufficient. Instead we check the k nearest families, and the new guy belongs to whichever family holds the majority among them. In our image, let's take k=3, i.e. the 3 nearest families. He has two Red and one Blue (there are two Blues equidistant, but since k=3, we take only one of them), so again he should be added to the Red family. But what if we take k=7? Then he has 5 Blue families and 2 Red families, so now he should be added to the Blue family. So it all changes with the value of k. Even more interesting: what if k=4? He has 2 Red and 2 Blue neighbours. It is a tie! So it is better to take k as an odd number. This method is called k-Nearest Neighbour, since classification depends on the k nearest neighbours.
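
The majority vote described above can be sketched in plain NumPy. This is an illustration, not OpenCV's implementation: the function name `knn_classify`, the coordinates, and the use of Euclidean distance are all assumptions. The data is arranged so that k=3 gives two Red and one Blue neighbour, and k=7 gives five Blue and two Red, as in the story.

```python
import numpy as np

def knn_classify(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(train_points - query, axis=1)
    # Stable sort: among equidistant points, the one listed first is taken.
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = np.bincount(train_labels[nearest])
    # On a tied vote, argmax returns the smaller label (an arbitrary choice).
    return int(np.argmax(votes))

# Hypothetical town: label 0 = Red, label 1 = Blue.
train_points = np.array(
    [[1, 0], [0, 1], [2, 2], [3, 0], [0, 3], [3, 3], [4, 1]], dtype=np.float32
)
train_labels = np.array([0, 0, 1, 1, 1, 1, 1])
newcomer = np.array([0, 0], dtype=np.float32)

for k in (1, 3, 7):
    print(k, knn_classify(train_points, train_labels, newcomer, k))
```

With this data the newcomer is labelled Red (0) for k=1 and k=3, and Blue (1) for k=7, so the answer depends on k just as the paragraph describes.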


Again, in kNN, it is true that we are considering k neighbours, but we are giving equal importance to all of them, right? Is that fair? For example, take the case of k=4. We said it is a tie. But see, the 2 Red families are closer to him than the 2 Blue families, so he is more eligible to be added to Red. How do we express that mathematically? We give each family a weight depending on its distance to the new-comer: those who are near him get higher weights, while those who are far away get lower weights. Then we add up the total weight of each family separately. The new-comer goes to whichever family gets the highest total weight. This is called modified kNN.
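
A minimal sketch of this weighted vote follows. The text does not say how the weights are computed, so the inverse-distance weight `1 / distance` used here is an assumption, as are the function name `weighted_knn_totals` and the coordinates.

```python
import numpy as np

def weighted_knn_totals(train_points, train_labels, query, k, eps=1e-9):
    """Return the summed weight per class among the k nearest training points."""
    distances = np.linalg.norm(train_points - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    # Assumed weighting: closer neighbours count more (eps avoids division by zero).
    weights = 1.0 / (distances[nearest] + eps)
    n_classes = int(train_labels.max()) + 1
    return np.bincount(train_labels[nearest], weights=weights, minlength=n_classes)

# The k=4 tie: two Red (label 0) neighbours close by, two Blue (label 1) further away.
train_points = np.array([[1, 0], [0, 1], [3, 0], [0, 3]], dtype=np.float32)
train_labels = np.array([0, 0, 1, 1])
newcomer = np.array([0, 0], dtype=np.float32)

totals = weighted_knn_totals(train_points, train_labels, newcomer, k=4)
label = int(np.argmax(totals))
print(totals, label)
```

By head count the vote is 2 to 2, but the Red total is larger because both Red houses are nearer, so the newcomer is labelled Red (0).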


So what are some important things you see here?

• You need to have information about all the houses in town, right? Because we have to check the distance from the new-comer to all the existing houses to find the nearest neighbour. If there are plenty of houses and families, it takes a lot of memory, and more time for calculation too.
• There is almost zero time for any kind of training or preparation.

Now let's see it in OpenCV.


kNN in OpenCV

We will do a simple example here, with two families (classes), just like above. Then in the next chapter, we will do an even better example.

So here, we label the Red family as Class-0 (denoted by 0) and the Blue family as Class-1 (denoted by 1). We create 25 families, i.e. 25 training data points, and label each of them either Class-0 or Class-1. We do all this with the help of the random number generator in NumPy.

Then we plot it with the help of Matplotlib. Red families are shown as Red Triangles and Blue families are shown as Blue Squares.


# -*- coding: utf-8 -*-
"""
Created on Tue Jan 28 18:00:18 2014
@author: duan
"""

import cv2
import numpy as np
import matplotlib.pyplot as plt

# Feature set containing (x,y) values of 25 known/training data
trainData = np.random.randint(0,100,(25,2)).astype(np.float32)

# Labels each one either Red or Blue with numbers 0 and 1
responses = np.random.randint(0,2,(25,1)).astype(np.float32)

# Take Red families and plot them
red = trainData[responses.ravel()==0]
plt.scatter(red[:,0],red[:,1],80,'r','^')

# Take Blue families and plot them
blue = trainData[responses.ravel()==1]
plt.scatter(blue[:,0],blue[:,1],80,'b','s')
plt.show()

You will get something similar to our first image. Since you are using a random number generator, you will get different data each time you run the code.
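
If you want the same data on every run, one option is to seed NumPy's generator before drawing the samples. This is a small sketch; the seed value 42 is arbitrary and `trainData_again` is a name introduced only for this check.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed; any fixed integer gives repeatable data
trainData = np.random.randint(0, 100, (25, 2)).astype(np.float32)
responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)

# Re-seeding with the same value reproduces exactly the same training points.
np.random.seed(42)
trainData_again = np.random.randint(0, 100, (25, 2)).astype(np.float32)
print(np.array_equal(trainData, trainData_again))
```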

Next, initialize the kNN algorithm and pass it trainData and responses to train it (it constructs a search tree).

Then we will bring in one new-comer and classify him into a family with the help of kNN in OpenCV. Before going to kNN, we need to know something about our test data (the data of new-comers). Our data should be a floating-point array of size (number of test data) × (number of features). Then we find the nearest neighbours of the new-comer. We can specify how many neighbours we want. It returns:

1. The label given to the new-comer, based on the kNN theory we saw earlier. If you want the Nearest Neighbour algorithm, just specify k=1, where k is the number of neighbours.
2. The labels of the k nearest neighbours.
3. The corresponding distances from the new-comer to each nearest neighbour.

So let's see how it works. The new-comer is marked in green.


newcomer = np.random.randint(0,100,(1,2)).astype(np.float32)
plt.scatter(newcomer[:,0],newcomer[:,1],80,'g','o')

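# OpenCV 3 API: kNN lives in cv2.ml (OpenCV 2.4 used cv2.KNearest() and find_nearest)
knn = cv2.ml.KNearest_create()
knn.train(trainData, cv2.ml.ROW_SAMPLE, responses)
ret, results, neighbours, dist = knn.findNearest(newcomer, 3)

print("result: ", results, "\n")
print("neighbours: ", neighbours, "\n")
print("distance: ", dist)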
plt.show()

I got the result as follows:

result:  [[ 1.]]
neighbours:  [[ 1.  1.  1.]]
distance:  [[ 53.  58.  61.]]

It says our new-comer has 3 neighbours, all from the Blue family. Therefore, he is labelled as Blue family. It is obvious from the plot.

If you have a large amount of data, you can pass it all at once as an array. The corresponding results are also returned as arrays.


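# 10 new comers
newcomers = np.random.randint(0,100,(10,2)).astype(np.float32)
# Pass all ten rows at once (findNearest is the OpenCV 3 name; OpenCV 2.4 called it find_nearest)
ret, results, neighbours, dist = knn.findNearest(newcomers, 3)
# The results also will contain 10 labels.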
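
To see what "results as arrays" looks like without OpenCV installed, here is a NumPy-only stand-in for the batch query. It is not OpenCV's implementation: it is a brute-force sketch that assumes two classes, an odd k, and squared Euclidean distances, and its output shapes simply mirror the single-newcomer output printed earlier.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable
trainData = np.random.randint(0, 100, (25, 2)).astype(np.float32)
responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)
newcomers = np.random.randint(0, 100, (10, 2)).astype(np.float32)
k = 3

# Pairwise squared distances via broadcasting, shape (10, 25): one row per newcomer.
diff = newcomers[:, np.newaxis, :] - trainData[np.newaxis, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Indices of the k nearest training points for each newcomer, shape (10, 3).
nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
neighbours = responses.ravel()[nearest]              # labels of the neighbours, (10, 3)
dist = np.take_along_axis(sq_dist, nearest, axis=1)  # their squared distances, (10, 3)

# Majority vote for two classes with odd k: label 1 wins if more than half are 1.
results = (neighbours.mean(axis=1, keepdims=True) > 0.5).astype(np.float32)
print(results.shape, neighbours.shape, dist.shape)
```

Every query row produces one label, k neighbour labels, and k distances, so ten newcomers yield ten rows in each returned array.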

Additional Resources

1. NPTEL notes on Pattern Recognition, Chapter 11