本文首发于我的博客kezunlin.me/post/658069…,欢迎阅读!html
compile opencv with CUDA support on windows 10 java
requirements:python
see cuda compute capacitylinux
笔记本版本的显卡和台式机的计算能力是有差距的。git
for opencv functionsgithub
Get opencv 3.1.0 for git and fix some bugsweb
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout -b v3.1.0 3.1.0复制代码
# fix bugs for 3.1.0
git cherry-pick 10896
git cherry-pick cdb9c
git cherry-pick 24dbb复制代码
git branch 复制代码
master
* v3.1.0复制代码
mkdir build && cd build && cmake-gui ..复制代码
configure with VS 2015 win64
with options算法
BUILD_SHARED_LIBS ON
CMAKE_CONFIGURATION_TYPES Release # Release
CMAKE_CXX_FLAGS_RELEASE /MD /O2 /Ob2 /DNDEBUG /MP # for multiple processor复制代码
WITH_VTK OFF
BUILD_PERF_TESTS OFF # if ON, build errors occur复制代码
WITH_CUDA ON
CUDA_TOOLKIT_ROOT_DIR C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
#CUDA_ARCH_BIN 3.0 3.5 5.0 5.2 6.0 6.1 # very time-consuming
CUDA_ARCH_PTX 3.0复制代码
for opencvdocker
CUDA_ARCH_BIN 3.0 3.5 5.0 5.2 6.0 6.1
relate withubuntu
-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;复制代码
CUDA_ARCH_PTX 3.0
relate with
-gencode;arch=compute_30,code=compute_30;
复制代码
for caffe
the
CUDA_ARCH_BIN
parameter specifies multiple architectures so as to support a variety of GPU boards. otherwise, cuda programs will not run with other type of GPU boards.为了支持在多个不一样计算能力的GPU上运行可执行程序,opencv/caffe编译过程当中须要支持多个不一样架构,
eg. CUDA_ARCH_BIN 3.0 3.5 5.0 5.2 6.0 6.1
, 所以编译过程很是耗时。在编译的而过程当中尽量选择须要发布release版本的GPU架构进行配置编译。
configure and output:
Selecting Windows SDK version 10.0.14393.0 to target Windows 10.0.17134.
found IPP (ICV version): 9.0.1 [9.0.1]
at: C:/compile/opencv/3rdparty/ippicv/unpack/ippicv_win
CUDA detected: 8.0
CUDA NVCC target flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_30,code=compute_30
Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
To enable PlantUML support, set PLANTUML_JAR environment variable or pass -DPLANTUML_JAR=<filepath> option to cmake
Could NOT find PythonInterp: Found unsuitable version "1.4", but required is at least "3.4" (found C:/Users/zunli/.babun/cygwin/bin/python)
Could NOT find PythonInterp: Found unsuitable version "1.4", but required is at least "3.2" (found C:/Users/zunli/.babun/cygwin/bin/python)
Could NOT find Matlab (missing: MATLAB_MEX_SCRIPT MATLAB_INCLUDE_DIRS MATLAB_ROOT_DIR MATLAB_LIBRARIES MATLAB_LIBRARY_DIRS MATLAB_MEXEXT MATLAB_ARCH MATLAB_BIN) 复制代码
General configuration for OpenCV 3.1.0 =====================================
Version control: 3.1.0-3-g5e9beb8复制代码
Platform:
Host: Windows 10.0.17134 AMD64
CMake: 3.10.0
CMake generator: Visual Studio 14 2015 Win64
CMake build tool: C:/Program Files (x86)/MSBuild/14.0/bin/MSBuild.exe
MSVC: 1900复制代码
C/C++:
Built as dynamic libs?: YES
C++ Compiler: C:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/bin/x86_amd64/cl.exe (ver 19.0.24215.1)
C++ flags (Release): /DWIN32 /D_WINDOWS /W4 /GR /EHa /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi /wd4251 /wd4324 /wd4275 /wd4589 /MP8 /MD /O2 /Ob2 /DNDEBUG /MP /Zi
C++ flags (Debug): /DWIN32 /D_WINDOWS /W4 /GR /EHa /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi /wd4251 /wd4324 /wd4275 /wd4589 /MP8 /MDd /Zi /Ob0 /Od /RTC1
C Compiler: C:/Program Files (x86)/Microsoft Visual Studio 14.0/VC/bin/x86_amd64/cl.exe
C flags (Release): /DWIN32 /D_WINDOWS /W3 /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi /MP8 /MD /O2 /Ob2 /DNDEBUG /Zi
C flags (Debug): /DWIN32 /D_WINDOWS /W3 /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi /MP8 /MDd /Zi /Ob0 /Od /RTC1
Linker flags (Release): /machine:x64 /INCREMENTAL:NO /debug
Linker flags (Debug): /machine:x64 /debug /INCREMENTAL
Precompiled headers: YES
Extra dependencies: comctl32 gdi32 ole32 setupapi ws2_32 vfw32 cudart nppc nppi npps cufft -LC:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/lib/x64
3rdparty dependencies: zlib libjpeg libwebp libpng libtiff libjasper IlmImf复制代码
OpenCV modules:
To be built: cudev core cudaarithm flann imgproc ml video cudabgsegm cudafilters cudaimgproc cudawarping imgcodecs photo shape videoio cudacodec highgui objdetect ts features2d calib3d cudafeatures2d cudalegacy cudaobjdetect cudaoptflow cudastereo stitching superres videostab python2
Disabled: world
Disabled by dependency: -
Unavailable: java python3 viz复制代码
Windows RT support: NO复制代码
GUI:
QT: NO
Win32 UI: YES
OpenGL support: NO
VTK support: NO复制代码
Media I/O:
ZLib: build (ver 1.2.8)
JPEG: build (ver 90)
WEBP: build (ver 0.3.1)
PNG: build (ver 1.6.19)
TIFF: build (ver 42 - 4.0.2)
JPEG 2000: build (ver 1.900.1)
OpenEXR: build (ver 1.7.1)
GDAL: NO复制代码
Video I/O:
Video for Windows: YES
DC1394 1.x: NO
DC1394 2.x: NO
FFMPEG: YES (prebuilt binaries)
codec: YES (ver 56.41.100)
format: YES (ver 56.36.101)
util: YES (ver 54.27.100)
swscale: YES (ver 3.1.101)
resample: NO
gentoo-style: YES
GStreamer: NO
OpenNI: NO
OpenNI PrimeSensor Modules: NO
OpenNI2: NO
PvAPI: NO
GigEVisionSDK: NO
DirectShow: YES
Media Foundation: NO
XIMEA: NO
Intel PerC: NO复制代码
Parallel framework: Concurrency复制代码
Other third-party libraries:
Use IPP: 9.0.1 [9.0.1]
at: C:/compile/opencv/3rdparty/ippicv/unpack/ippicv_win
Use IPP Async: NO
Use Eigen: NO
Use Cuda: YES (ver 8.0)
Use OpenCL: YES
Use custom HAL: NO复制代码
NVIDIA CUDA
Use CUFFT: YES
Use CUBLAS: NO
USE NVCUVID: NO
NVIDIA GPU arch: 30 35 50 52 60 61
NVIDIA PTX archs: 30
Use fast math: NO复制代码
OpenCL:
Version: dynamic
Include path: C:/compile/opencv/3rdparty/include/opencl/1.2
Use AMDFFT: NO
Use AMDBLAS: NO复制代码
Python 2:
Interpreter: C:/Python27/python.exe (ver 2.7.13)
Libraries: C:/Python27/libs/python27.lib (ver 2.7.13)
numpy: C:/Python27/lib/site-packages/numpy/core/include (ver 1.11.3)
packages path: C:/Python27/Lib/site-packages复制代码
Python 3:
Interpreter: NO复制代码
Python (for build): C:/Python27/python.exe复制代码
Java:
ant: NO
JNI: C:/Program Files/Java/jdk1.8.0_161/include C:/Program Files/Java/jdk1.8.0_161/include/win32 C:/Program Files/Java/jdk1.8.0_161/include
Java wrappers: NO
Java tests: NO复制代码
Matlab: Matlab not found or implicitly disabled复制代码
Documentation:
Doxygen: NO
PlantUML: NO复制代码
Tests and samples:
Tests: YES
Performance tests: NO
C/C++ Examples: NO复制代码
Install path: C:/compile/opencv/build/install复制代码
cvconfig.h is in: C:/compile/opencv/build
-----------------------------------------------------------------复制代码
Configuring done
Generating done复制代码
Notice for gencode
CUDA NVCC target flags: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_30,code=compute_30
复制代码
Open OpenCV.sln
with VS 2015
and build release version.
this may take hours to finish.
possible solutions
With
BUILD_PERF_TESTS
andBUILD_TESTS
disabled, I managed to build OpenCV 3.1 with CUDA 8.0 on Windows 10 with VS2015 x64 arch target. Without building test/performance modules, the build process costs less time as well : )I actually got it to work both on my laptop and my desktop (GTX960M and GTX970 respectively) running with OpenCV 3.2 and the latest version of CUDA 8.0 for Win10 in Visual Studio 15 Community! What I did was to enable
WITH_CUBLAS
aswell asWITH_CUDA
. I also turned offBUILD_PERF_TESTS
andBUILD_TESTS
. The configuration was built using the Visual Studio 14 2015 C++ compiler.
my solution:
disable `BUILD_PERF_TESTS`复制代码
configure and build again. this time cost only about 1 minutes.
after error fixed,build results
OpenCV GPU
module is written using CUDA
, therefore it benefits from the CUDA
ecosystem.
GPU modules includes class cv::cuda::GpuMat
which is a primary container for data kept in GPU memory. It’s interface is very similar with cv::Mat
, its CPU counterpart. All GPU functions receive GpuMat as input and output arguments. This allows to invoke several GPU algorithms without downloading data. GPU module API interface is also kept similar with CPU interface where possible. So developers who are familiar with Opencv on CPU could start using GPU straightaway.
The GPU module is designed as a host-level API. This means that if you have pre-compiled OpenCV GPU binaries, you are not required to have the CUDA Toolkit installed or write any extra code to make use of the GPU.
find_package(OpenCV REQUIRED COMPONENTS core highgui imgproc features2d calib3d
cudaarithm cudabgsegm cudafilters cudaimgproc cudawarping cudafeatures2d # for cuda-enabled
) #
MESSAGE( [Main] " OpenCV_INCLUDE_DIRS = ${OpenCV_INCLUDE_DIRS}")
MESSAGE( [Main] " OpenCV_LIBS = ${OpenCV_LIBS}")复制代码
In the sample below an image is loaded from local file, next it is uploaded to GPU, thresholded, downloaded and displayed.
#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudabgsegm.hpp>
#include <opencv2/cudafilters.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/cudafeatures2d.hpp>
int test_opencv_gpu()
{
try
{
cv::Mat src_host = cv::imread("file.png", CV_LOAD_IMAGE_GRAYSCALE);
cv::cuda::GpuMat dst, src;
src.upload(src_host);
cv::cuda::threshold(src, dst, 128.0, 255.0, CV_THRESH_BINARY);
cv::Mat result_host;
dst.download(result_host);
cv::imshow("Result", result_host);
cv::waitKey();
}
catch (const cv::Exception& ex)
{
std::cout << "Error: " << ex.what() << std::endl;
}
return 0;
}复制代码
(1) 使用cuda版本的opencv caffe网络的第一次建立很是耗时,后面的网络建立则很是快。
(2) opencv的gpu代码比cpu代码慢,初次启动多耗费20s左右。(事实是因为编译的caffe和GPU计算力不匹配致使的)
Your problem is that CUDA needs to initialize! And it will generally takes between serveral seconds
Why first function call is slow?
That is because of initialization overheads. On first GPU function call
Cuda Runtime API
is initialized implicitly.The first gpu function call is always takes more time, because CUDA initialize context for device.
The following calls will be faster.
Not Reasons:
(1) CPU clockspeed is 10x faster than GPU clockspeed.
(2) memory transfer times between host (CPU) and device (GPU) (upload,downloa data)
gtx 1060 编译的opencv caffe在gtx 970m上运行出现错误
im2col.cu Check failed: error == cudaSuccess (8 vs. 0) invalid device function
gtx 1060 sm_61
gtx 970m sm_52复制代码
im2col 是caffe的源文件,代表gtx 970m
的计算能力不支持可执行文件的运行。
see what-is-the-purpose-of-using-multiple-arch-flags-in-nvidias-nvcc-compiler
Roughly speaking, the code compilation flow goes like this:
CUDA C/C++ device code source --> PTX --> SASS
The virtual architecture (e.g.
compute_20
, whatever is specified by-arch compute
...) determines what type of PTX code will be generated. The additional switches (e.g.-code sm_21
) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
for win7, if we install 398.82-desktop-win8-win7-64bit-international-whql.exe
,errors may occur:
> nvidia-smi.exe
Failed to initialize NVML: Unknown error复制代码
Solutions: use older drivers 385.69
(1) api在linux平均耗时3ms;一样的代码在windows平均耗时14ms
(2) vs编译开启代码优化先后性能相差接近5倍,125ms vs 25ms
(3) cmake编译RELEASE选项默认已经开启了代码优化 -O3