[Python Debug] Kernel Crash While Running Neural Network with Keras | Why the Jupyter Notebook Kernel Dies When Running Keras, and How to Fix It

Date: 2019-11-20

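Recently, for a Machine Learning assignment, I had to build a neural network with Keras in Jupyter Notebook. Even the simplest single-layer network would not run. Stranger still, the same code ran fine on the iris dataset, but as soon as I ran it on the fashion MNIST data the professor gave us, the notebook reported that the kernel had died and was restarting. Stranger yet, the same code ran without any problem on a classmate's computer, so for a while I assumed my MacBook was simply too old and underpowered, and I nearly bought a new one >_<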

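In class today, after a few rounds of debugging by my ML professor, the problem was completely solved. A CMU expert indeed! (A big shout-out to the Prof here, even though he can't read Chinese ><) I only started learning Python recently and am still far from fluent, so I picked up quite a few new skills along the way ✌️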

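The complete code that triggers the problem is below. It implements logistic regression in Keras as a simple single-layer network, but every time execution reaches the last line the kernel dies and then restarts.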

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, FastICA
from sklearn.linear_model import LogisticRegression
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D
from keras.utils import to_categorical
from keras.datasets import fashion_mnist

(x3_train, y_train), (x3_test, y_test) = fashion_mnist.load_data()
n_classes = np.max(y_train) + 1

# Vectorize image arrays, since most methods expect this format
x_train = x3_train.reshape(x3_train.shape[0], np.prod(x3_train.shape[1:]))
x_test = x3_test.reshape(x3_test.shape[0], np.prod(x3_test.shape[1:]))

# Binary vector representation of targets (for one-hot or multinomial output networks)
y3_train = to_categorical(y_train)
y3_test = to_categorical(y_test)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics on the test set
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

n_output = y3_train.shape[1]
n_input = x_train_scaled.shape[1]

nn_lr = Sequential() 
nn_lr.add(Dense(units=n_output, input_dim= n_input, activation = 'softmax'))
nn_lr.compile(optimizer = 'sgd', loss = 'categorical_crossentropy', metrics = ['accuracy'])

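Because Jupyter Notebook just kept restarting the kernel without showing any error message, I had no idea where to start. The professor pointed out that the terminal opened automatically when Jupyter Notebook starts keeps a log of what is going on (a newbie discovering this for the first time...), including the details of, and the reason for, the kernel stopping and restarting: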

[I 22:11:54.603 NotebookApp] Kernel interrupted: 7e7f6646-97b0-4ec7-951c-1dce783f60c4

[I 22:13:49.160 NotebookApp] Saving file at /Documents/[Rutgers]Study/2019Spring/MACHINE LEARNING W APPLCTN LARGE DATASET/hw/Untitled1.ipynb

2019-03-28 22:13:49.829246: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

2019-03-28 22:13:49.829534: I tensorflow/core/common_runtime/process_util.cc:69] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.

OMP: Hint: This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

[I 22:13:51.049 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

kernel c1114f5a-3829-432f-a26a-c2db6c330352 restarted

Another option is to paste the code into IPython, which yields similar information. Either way, the error I finally pinned down is:

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
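A related way to surface this kind of message without hunting through the terminal is to run the suspect code in a child Python process and capture its stderr. The sketch below is only an illustration: `run_isolated` is a hypothetical helper name, and the snippet it runs is a stand-in that simulates a crash rather than the Keras code above.

```python
import subprocess
import sys


def run_isolated(code, timeout=600.0):
    """Run `code` in a fresh Python interpreter and capture its output.

    A native crash kills only the child process, so whatever it wrote to
    stderr before dying is still available to the caller.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Stand-in snippet (an assumption for illustration): it writes to stderr
    # and exits abnormally. Replace it with the code under suspicion.
    snippet = "import sys; sys.stderr.write('simulated crash\\n'); sys.exit(3)"
    result = run_isolated(snippet)
    print("exit code:", result.returncode)
    print("stderr:", result.stderr.strip())
```

Because the crash happens in the child process, the notebook kernel itself stays alive and the captured stderr can be inspected in the next cell.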

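I googled it and found a very detailed discussion thread on GitHub. The original poster hit the problem while running XGBoost, which reminded me that installing XGBoost over the winter break had been a bumpy process. I may have accidentally ended up with the same file in several different paths, so that a conflict arises when the program loads its packages. The thread offers several possible causes and fixes: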

1. Uninstall clang-omp

brew uninstall libiomp clang-omp

as long as u got gcc v5 from brew it come with openmp

follow steps in:
https://github.com/dmlc/xgboost/tree/master/python-package

I tried uninstalling and reinstalling xgboost and then uninstalling clang-omp, which produced this error:

No such keg: /usr/local/Cellar/libiomp

pip uninstall xgboost
pip install xgboost
brew uninstall libiomp clang-omp

2. Run this directly in Jupyter Notebook:

# DANGER! DANGER!
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

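The professor said this line tells the runtime to ignore the conflict between the duplicate libraries and carry on with one of them. I tried it and it does work, but it is a very dangerous approach and strongly discouraged!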

3. Find the duplicate libiomp5.dylib files and delete one of them

Finder did turn up two copies, one in ~/anaconda3/lib and one in ~/anaconda3/lib/python3.6/site-packages/_solib_darwin/_U@mkl_Udarwin_S_S_Cmkl_Ulibs_Udarwin___Uexternal_Smkl_Udarwin_Slib (????). But I was not sure which one to delete, and this approach also felt risky: delete the wrong one and nothing runs at all.
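The same search can be done from Python instead of Finder. The following is a minimal sketch, assuming the Anaconda install lives at ~/anaconda3 as in the paths above; `find_copies` is a hypothetical helper name, and it only lists files, it does not delete anything.

```python
from pathlib import Path


def find_copies(root, filename="libiomp5.dylib"):
    """Return every regular file under `root` whose name equals `filename`."""
    root_path = Path(root).expanduser()
    if not root_path.is_dir():
        return []
    # rglob walks the whole directory tree below root_path.
    return sorted(p for p in root_path.rglob(filename) if p.is_file())


if __name__ == "__main__":
    # ~/anaconda3 is taken from the paths mentioned above; adjust as needed.
    for path in find_copies("~/anaconda3"):
        # resolve() shows whether two entries are really the same file
        # reached through a symlink.
        print(path, "->", path.resolve())
```

If two entries resolve to the same target, they are one file reached through a symlink rather than two independent copies.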

4. OpenMP conflict

Hint: This means that multiple copies of the OpenMP runtime have been linked into the program

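Following the Hint in the message, I searched for TensorFlow OpenMP. OpenMP is an API for multi-threaded parallel programming. TensorFlow appears to have its own parallel execution framework and does not seem to need OpenMP (see https://github.com/tensorflow/tensorflow/issues/12434).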

5. Install nomkl

I had the same error on my Mac with a python program using numpy, keras, and matplotlib. I solved it with 'conda install nomkl'.

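This is the fix that finally worked! MKL stands for Math Kernel Library, a library developed by Intel to speed up mathematical operations, and packages installed through conda use MKL automatically. nomkl is the conda package that opts out of it. For more details, see the official Anaconda documentation on MKL Optimizations.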

To opt out, run conda install nomkl and then use conda install to install packages that would normally include MKL or depend on packages that include MKL, such as scipy, numpy, and pandas.

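Possibly some conflict crept in when packages such as numpy were updated. After installing nomkl the problem was, amazingly, gone. Later I also tried uninstalling MKL, and the program still ran fine. The uninstall command is: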

conda remove mkl mkl-service

Summary:

1. The professor is amazing; he solved the problem in no time ><

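2. As the expert reminded me, creating a virtual environment before running Python is a good way to avoid package conflicts like this one. For how to do it, see https://www.jianshu.com/p/d8e7135dca40.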
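As a rough illustration of the virtual environment idea, here is a minimal sketch using Python's standard library `venv` module. This is an assumption on my part: the linked article may use a different tool (conda environments, for example), and `make_env` and the `ml-env` directory name are hypothetical.

```python
import venv
from pathlib import Path


def make_env(env_dir, with_pip=True):
    """Create an isolated Python environment at `env_dir` and return its path."""
    env_path = Path(env_dir).expanduser()
    # with_pip=True bootstraps pip so packages can be installed into the env.
    venv.create(env_path, with_pip=with_pip)
    return env_path


if __name__ == "__main__":
    # "ml-env" is a hypothetical directory name chosen for this example.
    env = make_env("ml-env")
    print("created", env)
```

On macOS or Linux the environment is then activated with `source ml-env/bin/activate`, and packages installed afterwards stay inside it instead of mixing with the base installation.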