You can read the previous article here.
I'm 薛银亮. My thanks go to the original English online book; as I learn machine learning, I've found it a very good introduction for beginners. In the spirit of sharing knowledge, I want to translate it and make it available to everyone who wants to learn about machine learning. My translation skills are limited, so readers are welcome to point out problems and mistakes, and guidance from experts is even more welcome. Since the book is long, the translation will be published in several parts; if you're interested, you can follow my collection, which will be updated continuously.
Using what we learned in the previous article, let's implement our handwritten-digit recognition program with stochastic gradient descent and the MNIST data set. If you haven't covered the prerequisites yet, please go back to the previous article first; you can follow my collection to get future updates. We'll use Python (2.7), and the program is only 74 lines of code. One note: if our goal is to learn the ideas of machine learning and apply them to other fields, I suggest not focusing too much on the code itself, and certainly not trying to memorize it, since that would be pointless.
The first thing to do is get the MNIST data set. If you're a git user (I won't explain what git is; every programmer or researcher should know it already), you can fetch the data with the following command:
git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
If you've never used git, you can also download the data and code from GitHub here.
A note: in an earlier article I said MNIST has 60,000 training images and 10,000 test images, which is how MNIST officially describes it. Here we split the data a little differently: we carve a validation set out of the official training data. That is, we split the 60,000 training images into 50,000 that form our training set, with the remaining 10,000 set aside as a separate validation set (the official 10,000-image test set is still used for testing).
We'll also use a Python library called Numpy for its linear-algebra routines. If you don't have Numpy installed, you can get it here.
Let me first explain the structure of the code. The core is the Network class, which represents a neural network. Here is the code that initializes a Network object:
class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
Here sizes is a list containing the number of neurons in each layer of the network. For example, to build a network with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the last layer, we would write:
net = Network([2, 3, 1])
The biases and weights are initialized to random values, using Numpy's np.random.randn function to generate Gaussian-distributed random numbers with mean 0 and variance 1. These initial values give stochastic gradient descent a place to start from. In a later chapter we'll see better ways to initialize the weights and biases, but this will do for now. Note that the first layer of the network is the input layer, and no biases are set for it, since biases are only ever used in computing the outputs of later layers.
All the biases and weights are stored as lists of Numpy matrices. For example, net.weights[1] is the Numpy matrix storing the weights connecting the second and third layers of neurons (not the first and second, because Python list indices start at 0). Since net.weights[1] is a mouthful, let's just call that matrix w, so that w_jk is the weight for the connection between the k-th neuron in the second layer and the j-th neuron in the third layer. Applying the σ function elementwise (i.e. in vectorized form), the activations of the third layer are

a′ = σ(wa + b)    (22)

where a is the vector of activations of the second-layer neurons. It's easy to see that equation (22) has the same form as equation (4).
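To make the indexing concrete, here is a small Python-shell check of my own (not from the book) showing the shapes these lists take for the [2, 3, 1] network above. net.weights[1] has shape (1, 3): one row per third-layer neuron and one column per second-layer neuron, so its entry [j, k] is exactly the weight w_jk.

>>> import network
>>> net = network.Network([2, 3, 1])
>>> [b.shape for b in net.biases]
[(3, 1), (1, 1)]
>>> [w.shape for w in net.weights]
[(3, 2), (1, 3)]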
Next we define the sigmoid function:
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))
Note that when z is a vector or a Numpy array, Numpy automatically applies the sigmoid function to every element, i.e. it operates in vectorized form.
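As a tiny illustration of my own (the exact formatting of the output depends on your Numpy version), passing a Numpy array to sigmoid applies it elementwise:

>>> import numpy as np
>>> def sigmoid(z): return 1.0/(1.0+np.exp(-z))
...
>>> sigmoid(np.array([-1.0, 0.0, 1.0]))
array([ 0.26894142,  0.5       ,  0.73105858])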
Then we add a feedforward method to the Network class: given an input a to the network, it returns the corresponding output. The method simply applies equation (22) layer by layer:
    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
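As a quick sanity check of my own (not in the book), you can push a suitably shaped column vector through a freshly initialized network. The exact numbers depend on the random weights, so the sketch below only prints the output's shape, which has one row per output neuron:

import numpy as np
import network

net = network.Network([2, 3, 1])
a = np.array([[0.5], [0.5]])     # the input must be a 2x1 column vector
print net.feedforward(a).shape   # prints (1, 1)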
Of course, the main thing we want a Network object to do is learn. To that end we give it a method called SGD, which implements stochastic gradient descent. Here is the code:
    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic gradient descent. The "training_data" is a list of tuples "(x, y)" representing the training inputs and the desired outputs. The other non-optional parameters are self-explanatory. If "test_data" is provided then the network will be evaluated against the test data after each epoch, and partial progress printed out. This is useful for tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)
The training_data is a list of tuples (x, y) representing the training inputs and the desired outputs. epochs is the number of epochs to train for, and mini_batch_size is the size of the mini-batches to use when sampling. eta is the learning rate η. If the optional argument test_data is supplied, the program evaluates the network after each epoch and prints partial progress; this is useful for tracking progress, but it slows things down considerably.
In each epoch the training data is randomly shuffled and then partitioned into mini-batches of the requested size (mini_batches), which is a simple way of sampling the training data. Then, for each mini_batch, we apply a single step of gradient descent; this is done by the line self.update_mini_batch(mini_batch, eta), which updates the weights and biases according to one iteration of gradient descent using just the training data in that mini-batch. Here is the update_mini_batch method:
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying gradient descent using backpropagation to a single mini batch. The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta`` is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
Most of the work is done by this line:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
The method it invokes is the backpropagation algorithm, a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch and then updating self.weights and self.biases accordingly.
I'm not going to show the code for self.backprop just yet; we'll study the backpropagation algorithm and its implementation in a later chapter. For now, just assume that it returns the appropriate gradient for the cost associated with the training example x.
Let's look at the full program, including the documentation strings and the parts I omitted above. Note that self.backprop uses a helper, sigmoid_prime, which computes the derivative of the σ function, as well as self.cost_derivative, which you can understand just by reading the code and its docstring; we'll explain both in detail in the next chapter. All of the code can be downloaded here:
""" network.py ~~~~~~~~~~ A module to implement the stochastic gradient descent learning algorithm for a feedforward neural network. Gradients are calculated using backpropagation. Note that I have focused on making the code simple, easily readable, and easily modifiable. It is not optimized, and omits many desirable features. """
#### Libraries
# Standard library
import random
# Third-party libraries
import numpy as np
class Network(object):
def __init__(self, sizes):
"""The list ``sizes`` contains the number of neurons in the respective layers of the network. For example, if the list was [2, 3, 1] then it would be a three-layer network, with the first layer containing 2 neurons, the second layer 3 neurons, and the third layer 1 neuron. The biases and weights for the network are initialized randomly, using a Gaussian distribution with mean 0, and variance 1. Note that the first layer is assumed to be an input layer, and by convention we won't set any biases for those neurons, since biases are only ever used in computing the outputs from later layers."""
self.num_layers = len(sizes)
self.sizes = sizes
self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
self.weights = [np.random.randn(y, x)
for x, y in zip(sizes[:-1], sizes[1:])]
def feedforward(self, a):
"""Return the output of the network if ``a`` is input."""
for b, w in zip(self.biases, self.weights):
a = sigmoid(np.dot(w, a)+b)
return a
def SGD(self, training_data, epochs, mini_batch_size, eta, test_data=None):
"""Train the neural network using mini-batch stochastic gradient descent. The ``training_data`` is a list of tuples ``(x, y)`` representing the training inputs and the desired outputs. The other non-optional parameters are self-explanatory. If ``test_data`` is provided then the network will be evaluated against the test data after each epoch, and partial progress printed out. This is useful for tracking progress, but slows things down substantially."""
if test_data: n_test = len(test_data)
n = len(training_data)
for j in xrange(epochs):
random.shuffle(training_data)
mini_batches = [
training_data[k:k+mini_batch_size]
for k in xrange(0, n, mini_batch_size)]
for mini_batch in mini_batches:
self.update_mini_batch(mini_batch, eta)
if test_data:
print "Epoch {0}: {1} / {2}".format(
j, self.evaluate(test_data), n_test)
else:
print "Epoch {0} complete".format(j)
def update_mini_batch(self, mini_batch, eta):
"""Update the network's weights and biases by applying gradient descent using backpropagation to a single mini batch. The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta`` is the learning rate."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
for x, y in mini_batch:
delta_nabla_b, delta_nabla_w = self.backprop(x, y)
nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
self.weights = [w-(eta/len(mini_batch))*nw
for w, nw in zip(self.weights, nabla_w)]
self.biases = [b-(eta/len(mini_batch))*nb
for b, nb in zip(self.biases, nabla_b)]
def backprop(self, x, y):
"""Return a tuple ``(nabla_b, nabla_w)`` representing the gradient for the cost function C_x. ``nabla_b`` and ``nabla_w`` are layer-by-layer lists of numpy arrays, similar to ``self.biases`` and ``self.weights``."""
nabla_b = [np.zeros(b.shape) for b in self.biases]
nabla_w = [np.zeros(w.shape) for w in self.weights]
# feedforward
activation = x
activations = [x] # list to store all the activations, layer by layer
zs = [] # list to store all the z vectors, layer by layer
for b, w in zip(self.biases, self.weights):
z = np.dot(w, activation)+b
zs.append(z)
activation = sigmoid(z)
activations.append(activation)
# backward pass
delta = self.cost_derivative(activations[-1], y) * \
sigmoid_prime(zs[-1])
nabla_b[-1] = delta
nabla_w[-1] = np.dot(delta, activations[-2].transpose())
# Note that the variable l in the loop below is used a little
# differently to the notation in Chapter 2 of the book. Here,
# l = 1 means the last layer of neurons, l = 2 is the
# second-last layer, and so on. It's a renumbering of the
# scheme in the book, used here to take advantage of the fact
# that Python can use negative indices in lists.
for l in xrange(2, self.num_layers):
z = zs[-l]
sp = sigmoid_prime(z)
delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
nabla_b[-l] = delta
nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
return (nabla_b, nabla_w)
def evaluate(self, test_data):
"""Return the number of test inputs for which the neural network outputs the correct result. Note that the neural network's output is assumed to be the index of whichever neuron in the final layer has the highest activation."""
test_results = [(np.argmax(self.feedforward(x)), y)
for (x, y) in test_data]
return sum(int(x == y) for (x, y) in test_results)
def cost_derivative(self, output_activations, y):
"""Return the vector of partial derivatives \partial C_x / \partial a for the output activations."""
return (output_activations-y)
#### Miscellaneous functions
def sigmoid(z):
"""The sigmoid function."""
return 1.0/(1.0+np.exp(-z))
def sigmoid_prime(z):
"""Derivative of the sigmoid function."""
return sigmoid(z)*(1-sigmoid(z))复制代码
So how well does the program do? Well, let's first load the MNIST data. We'll use a small helper program, mnist_loader.py, to do that; run the following commands in a Python shell:
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
Next we set up a Network with 30 hidden neurons:
>>> import network
>>> net = network.Network([784, 30, 10])
Then we train for 30 epochs (epochs=30), with a mini-batch size of 10 (mini_batch_size=10) and a learning rate of 3.0 (η=3.0):
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
If you run the code now, it may take a little while to finish. I suggest you set it running, keep reading, and check the output periodically. If you're in a hurry, you can speed things up by training for fewer epochs, using fewer hidden neurons, or using only part of the training data. Please note: this code is meant to help you understand how neural networks work; it is not high-performance code. Of course, once a network has been trained well, it can be ported almost anywhere, for example to a web page (in JavaScript) or an app, and it will run very quickly there. As you can see, after just a single epoch the network already classifies 9,129 of the 10,000 test images correctly:
Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000
Note that because we initialize the weights and biases randomly, your results won't be exactly the same as mine.
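As an aside of my own (not something the book's code does), if you want your runs to be repeatable you can seed both random-number generators before building the network; the seed value 12345 is arbitrary:

>>> import random
>>> import numpy as np
>>> random.seed(12345)     # used by random.shuffle in SGD
>>> np.random.seed(12345)  # used for the initial weights and biases
>>> net = network.Network([784, 30, 10])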
Let's increase the number of hidden neurons to 100 and see what happens:
>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
We find that the accuracy improves; at least in this case, using more hidden neurons helps us get better results.
If instead we reduce the learning rate to η=0.001:
>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)
the results are far less encouraging:
Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000
If we raise the learning rate to 0.01, the results improve again. In general, when you find that changing a parameter improves things, try a few more values; in this way we can eventually settle on the parameter values that work best for us.
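For example, here is a rough sketch of my own (not from the book) of how you might compare a few learning rates by briefly training a fresh, small network at each value; it assumes training_data and test_data have already been loaded as above:

>>> for eta in [0.01, 0.1, 1.0, 3.0]:
...     print "eta = {0}".format(eta)
...     net = network.Network([784, 30, 10])
...     net.SGD(training_data, 5, 10, eta, test_data=test_data)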
In general, debugging a neural network can be difficult. This is especially true when our chosen hyperparameters give results no better than random guessing. Suppose, for instance, we use 30 hidden neurons and a learning rate of η=100.0:
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)
This time the learning rate is far too high:
Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000
At this point we would naturally reduce the learning rate to improve the accuracy. But if this were our first time seeing such a result, we might not immediately suspect that the learning rate is too large. Instead we might suspect the network itself: is there a problem with how we initialized the weights and biases? Is something wrong with the training data? Have we trained for the wrong number of epochs? Or should we change the learning algorithm altogether? With so many possible causes, the first time you run into this situation you can't be sure what is actually going wrong. I won't go into it here; these questions will be discussed in later articles. For now I just want to show the source code.
Let's look at the details of how the MNIST data is loaded, as mentioned earlier; the source is below. The data structures involved are described in the docstrings: straightforward stuff, tuples and lists of Numpy ndarray objects (if you're not familiar with ndarrays, think of them as vectors).
""" mnist_loader ~~~~~~~~~~~~ A library to load the MNIST image data. For details of the data structures that are returned, see the doc strings for ``load_data`` and ``load_data_wrapper``. In practice, ``load_data_wrapper`` is the function usually called by our neural network code. """
#### Libraries
# Standard library
import cPickle
import gzip
# Third-party libraries
import numpy as np
def load_data():
"""Return the MNIST data as a tuple containing the training data, the validation data, and the test data. The ``training_data`` is returned as a tuple with two entries. The first entry contains the actual training images. This is a numpy ndarray with 50,000 entries. Each entry is, in turn, a numpy ndarray with 784 values, representing the 28 * 28 = 784 pixels in a single MNIST image. The second entry in the ``training_data`` tuple is a numpy ndarray containing 50,000 entries. Those entries are just the digit values (0...9) for the corresponding images contained in the first entry of the tuple. The ``validation_data`` and ``test_data`` are similar, except each contains only 10,000 images. This is a nice data format, but for use in neural networks it's helpful to modify the format of the ``training_data`` a little. That's done in the wrapper function ``load_data_wrapper()``, see below. """
f = gzip.open('../data/mnist.pkl.gz', 'rb')
training_data, validation_data, test_data = cPickle.load(f)
f.close()
return (training_data, validation_data, test_data)
def load_data_wrapper():
"""Return a tuple containing ``(training_data, validation_data, test_data)``. Based on ``load_data``, but the format is more convenient for use in our implementation of neural networks. In particular, ``training_data`` is a list containing 50,000 2-tuples ``(x, y)``. ``x`` is a 784-dimensional numpy.ndarray containing the input image. ``y`` is a 10-dimensional numpy.ndarray representing the unit vector corresponding to the correct digit for ``x``. ``validation_data`` and ``test_data`` are lists containing 10,000 2-tuples ``(x, y)``. In each case, ``x`` is a 784-dimensional numpy.ndarry containing the input image, and ``y`` is the corresponding classification, i.e., the digit values (integers) corresponding to ``x``. Obviously, this means we're using slightly different formats for the training data and the validation / test data. These formats turn out to be the most convenient for use in our neural network code."""
tr_d, va_d, te_d = load_data()
training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
training_results = [vectorized_result(y) for y in tr_d[1]]
training_data = zip(training_inputs, training_results)
validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
validation_data = zip(validation_inputs, va_d[1])
test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
test_data = zip(test_inputs, te_d[1])
return (training_data, validation_data, test_data)
def vectorized_result(j):
"""Return a 10-dimensional unit vector with a 1.0 in the jth position and zeroes elsewhere. This is used to convert a digit (0...9) into a corresponding desired output from the neural network."""
e = np.zeros((10, 1))
e[j] = 1.0
return e复制代码
We know that the image of a 2 is usually a bit darker than that of a 1, simply because more of its area is inked black.

One suggested baseline, then, is to use the training data to compute the average darkness of each digit 0 through 9; given a new image, we first compute its darkness and then guess the digit whose average darkness is closest. This isn't hard to implement, so the code isn't written out here; it is in the GitHub repository. This simple method already does noticeably better than guessing at random.
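For illustration only, here is a minimal sketch of my own of such an average-darkness baseline (it is not the code from the repository, and the helper names average_darknesses and guess_digit are made up). It uses the raw format returned by mnist_loader.load_data(), in which images are 784-element arrays of pixel intensities and labels are plain digits:

import numpy as np
import mnist_loader

def average_darknesses(images, labels):
    """Return a dict mapping each digit 0..9 to its average total darkness."""
    totals, counts = {}, {}
    for img, digit in zip(images, labels):
        totals[digit] = totals.get(digit, 0.0) + np.sum(img)
        counts[digit] = counts.get(digit, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

def guess_digit(image, avg_darkness):
    """Guess the digit whose average darkness is closest to this image's."""
    darkness = np.sum(image)
    return min(avg_darkness, key=lambda d: abs(avg_darkness[d] - darkness))

training_data, validation_data, test_data = mnist_loader.load_data()
avgs = average_darknesses(training_data[0], training_data[1])
correct = sum(int(guess_digit(img, avgs) == digit)
              for img, digit in zip(test_data[0], test_data[1]))
print "Baseline accuracy: {0} / {1}".format(correct, len(test_data[1]))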
But if you want to push the accuracy as high as possible without a neural network, you can use a support vector machine, or SVM. Don't worry, for now we don't need to understand the details of the SVM algorithm; we can simply use the library scikit-learn, which provides a convenient Python interface to a fast C-based SVM implementation. The code is here. It turns out the SVM can beat our simple network, which is a little deflating; so later on we'll improve our algorithm until its accuracy surpasses the SVM's.
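In case you're curious, here is a rough sketch of my own of what using scikit-learn's SVM on MNIST can look like. This is not the code behind the link above; it uses the default SVC parameters, and training on all 50,000 raw images with those defaults can take quite a while:

import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()
clf = svm.SVC()                              # default parameters; tuning helps a lot
clf.fit(training_data[0], training_data[1])  # train on the 50,000 raw images
predictions = clf.predict(test_data[0])
correct = sum(int(p == y) for p, y in zip(predictions, test_data[1]))
print "SVM accuracy: {0} / {1}".format(correct, len(test_data[1]))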
SVMs also have many tunable parameters; if you're interested, you can learn more from this blog post by Andreas Mueller.
We can carry this same technique over to another problem: deciding whether an image shows a face. One approach is to break the question into sub-questions, each handled by a sub-network, and each sub-question can in turn be decomposed into still smaller questions.

In this way the network becomes a deep neural network. People now routinely train networks with 5 to 10 hidden layers, and it turns out that on many problems these perform far better than shallow neural networks, i.e. networks with just a single hidden layer. The reason is the ability of deep networks to build up a complex hierarchy of concepts.
If my articles help you, I suggest following my collection or leaving a tip; I suggest ¥5, but of course you can give whatever you like.
If you have questions, feel free to contact me at space-x@qq.com; I'll reply as promptly as I can.