Classification task: 3 classes (identifying three sub-types of lymphoma: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL))
Network model: AlexNet
Dataset: the original images are 1388*1040, 374 images in total, 1.4G. CLL: 113, FL: 139, MCL: 122
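Preparation:
Caffe environment configured; dataset and code downloaded.
Cut the large images into small patches. Code: step1_make_patches.m.
The only thing that needs changing in the code is the paths, so take care with them. For convenience, put the dataset in the same directory as the .m file.
Before that, to match the image naming used in the tutorial, add the class name as a prefix to the images of each class. Here is a way to batch-rename files under Ubuntu: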
cd into the directory of the sub-class
Suppose the class prefix to add is CLL-
sudo rename 's/^/CLL-/' *tif
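Where the Perl `rename` utility is not available, the same prefixing can be sketched in Python. This is a minimal sketch and not part of the tutorial: the function name `add_prefix` and the demo file names are hypothetical, and the demo runs on a temporary directory rather than on the real dataset.

```python
import tempfile
from pathlib import Path


def add_prefix(folder, prefix, pattern="*.tif"):
    """Prepend `prefix` to every file in `folder` that matches `pattern`.

    Files that already start with the prefix are skipped, so running the
    function twice does not add the prefix twice.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed
    for path in sorted(Path(folder).glob(pattern)):
        if path.name.startswith(prefix):
            continue
        target = path.with_name(prefix + path.name)
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("img-001.tif", "img-002.tif"):
        (Path(tmp) / name).touch()
    demo_result = add_prefix(tmp, "CLL-")
    print(demo_result)
```

For the real data the call would be something like `add_prefix("path/to/CLL", "CLL-")`, repeated for FL and MCL with their own prefixes.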
The corrected code, with brief explanations, is as follows:
    clc; clear all;

    % Output directory for all of the patch files
    outdir='./subs/';
    mkdir(outdir)

    % Stride between neighbouring patches
    step_size=32;
    % Patch size; bigger than 32 because Caffe will randomly crop 32 x 32 patches from them
    patch_size=36;

    % Extract patches class by class
    classes={'CLL','FL','MCL'};
    class_struct={};
    for classi=1:length(classes)
        % List every image in the folder of the current class
        files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']);
        % Build the list of unique patient IDs. The author's pipeline is organised per patient.
        % This dataset carries no patient information, but the same structure is kept for
        % consistency, so each image acts as its own "patient".
        % strsplit splits the file name at '.', and x{1}{1} takes the part before the first '.'
        patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0));
        patient_struct=[];
        parfor ci=1:length(patients) % patients are processed in parallel
            % base: the base file name, kept so that patients can be split into sets later
            patient_struct(ci).base=patients{ci};
            % sub_file: holds the patches extracted for this patient
            patient_struct(ci).sub_file=[];
            % All image files associated with this particular patient
            files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci}));
            for fi=1:length(files)
                % In this dataset each patient corresponds to exactly one large image
                disp([ci,length(patients),fi,length(files)])
                fname=files(fi).name;
                patient_struct(ci).sub_file(fi).base=fname; % each image name is saved as well
                io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),fname]); % read the image
                [nrow,ncol,ndim]=size(io);
                fnames_sub={};
                i=1;
                % Patch extraction is plain sub-matrix indexing. Patches per image:
                % [floor((1388-36-1)/32)+1]*[floor((1040-36-1)/32)+1]*2
                for rr=1:step_size:nrow-patch_size
                    for cc=1:step_size:ncol-patch_size
                        for rot=1:2 % rotate by 0 and 90 degrees, doubling the dataset
                            try
                                % Alternative: loop rr=1:step_size:nrow-patch_size+1 (same for cc) and use
                                % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);
                                subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                                subio=imrotate(subio,(rot-1)*90);
                                % Patches are named by their running index
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio,[outdir,subfname]);
                                i=i+1;
                            catch err
                                disp(err);
                                continue
                            end
                        end
                    end
                end
                patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
            end
        end
        class_struct{classi}=patient_struct;
    end
    save('class_struct.mat','class_struct') % save in case the computer crashes before the next step finishes
Each image yields 2752 patches.
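The figure of 2752 can be checked by counting the loop start positions. The sketch below mirrors MATLAB's `1:step_size:(length - patch_size)` range in Python. The helper names are hypothetical, and it assumes the images load as 1040 rows by 1388 columns (the product is the same if the two are swapped).

```python
def n_starts(length, patch_size, step_size):
    """Number of values in the MATLAB range 1:step_size:(length - patch_size)."""
    last = length - patch_size
    if last < 1:
        return 0
    return (last - 1) // step_size + 1


def patches_per_image(nrow, ncol, patch_size, step_size, n_rot=2):
    """Start positions along rows times columns, times the number of rotations."""
    rows = n_starts(nrow, patch_size, step_size)
    cols = n_starts(ncol, patch_size, step_size)
    return rows * cols * n_rot


per_image = patches_per_image(1040, 1388, patch_size=36, step_size=32)
total = per_image * 374  # 374 images in the dataset

print(per_image)  # 32 * 43 * 2 = 2752
print(total)      # 1029248
```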
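Generate the cross-validation sets in order to obtain the best model. 5-fold cross-validation is used. Code: step2_make_training_lists.m.
Each cross-validation fold needs 4 txt files. Taking the first fold as an example:
train_w32_parent_1.txt, test_w32_parent_1.txt: lists of the patient names included in this fold
train_w32_1.txt, test_w32_1.txt: lists of the patch file names in this fold together with their class labels
The code is fairly straightforward as long as you understand how 5-fold cross-validation works. A brief record of the code: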
    load class_struct % load the structure saved by step 1

    nfolds=5; % number of folds used during cross-validation
    fidtrain=[];
    fidtest=[];
    fidtrain_parent=[];
    fidtest_parent=[];

    % Open all file handles: each fold has 4 files (as discussed in the tutorial)
    for zz=1:nfolds
        fidtrain(zz)=fopen(sprintf('train_w32_%d.txt',zz),'w');
        fidtest(zz)=fopen(sprintf('test_w32_%d.txt',zz),'w');
        fidtrain_parent(zz)=fopen(sprintf('train_w32_parent_%d.txt',zz),'w');
        fidtest_parent(zz)=fopen(sprintf('test_w32_parent_%d.txt',zz),'w');
    end

    % Patient IDs go to the *_parent_* files. Patch file names and their class
    % labels (CLL: 0, FL: 1, MCL: 2) go to the other two files.
    % 5-fold cross-validation: 4 parts form the training set and the remaining
    % part is the test set, which gives 5 different train/test splits.
    for classi=1:length(class_struct)
        patient_struct=class_struct{classi};
        npatients=length(patient_struct); % number of patients in this class
        indices=crossvalind('Kfold',npatients,nfolds); % MATLAB function that generates a k-fold assignment
        for fi=1:npatients % for each patient
            disp([fi,npatients]);
            for k=1:nfolds % for each fold
                if(indices(fi)==k) % this patient is in the test set of fold k
                    fid=fidtest(k);
                    fid_parent=fidtest_parent(k);
                else % otherwise it is in the training set
                    fid=fidtrain(k);
                    fid_parent=fidtrain_parent(k);
                end
                fprintf(fid_parent,'%s\n',patient_struct(fi).base); % write this patient's ID to the parent file
                subfiles=patient_struct(fi).sub_file; % this patient's images
                for subfi=1:length(subfiles) % for each of the patient's images
                    try
                        subfnames=subfiles(subfi).fnames_subs; % all patches of this image
                        % Important: change the original '%s\t%d' to '%s %d' so that a space is the
                        % separator. Otherwise the later format conversion fails with
                        % "could not open or find file ..."
                        cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); % the label is classi-1
                    catch err
                        disp(err)
                        disp([patient_struct(fi).base,' ',patient_struct(fi).sub_file(subfi).base]) % display any errors, but continue
                        continue
                    end
                end
            end
        end
    end

    for zz=1:nfolds % now that we are done, close all of the files
        fclose(fidtrain(zz));
        fclose(fidtest(zz));
        fclose(fidtrain_parent(zz));
        fclose(fidtest_parent(zz));
    end
This gives 5 train/test splits. Each test set has roughly 203648 patches and each training set roughly 825600, so test set : training set is about 1:4.
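The fold assignment can be illustrated outside MATLAB. The sketch below is a stand-in for `crossvalind('Kfold', n, k)`, not a reimplementation: the function name `kfold_indices` and the seed are assumptions, and the exact assignment will differ from MATLAB's. It then converts the patch counts quoted above back into image counts using the 2752 patches per image.

```python
import random
from collections import Counter


def kfold_indices(n_items, n_folds, seed=0):
    """Give each of n_items a fold number in 1..n_folds.

    Fold sizes differ by at most one. Item i is a test item in fold
    folds[i] and a training item in every other fold.
    """
    folds = [(i % n_folds) + 1 for i in range(n_items)]
    random.Random(seed).shuffle(folds)
    return folds


# Example: the 113 CLL images split into 5 folds
folds = kfold_indices(113, 5)
fold_sizes = sorted(Counter(folds).values())
print(fold_sizes)  # [22, 22, 23, 23, 23]

# Converting the quoted patch counts back to images (2752 patches per image)
test_images = 203648 // 2752
train_images = 825600 // 2752
print(test_images, train_images)  # 74 300
```

Because the split is made per patient (here, per image) and not per patch, all patches cut from one image land on the same side of a given fold.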
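Generate the databases. Caffe's command-line tools are used here to produce leveldb-format data and the corresponding mean file. The image layer is not used directly because the mean would still have to be computed in the required format, and the image layer is not designed for reading large amounts of data, so the Caffe command-line tools are more convenient.
Code: step3_make_dbs.sh. **Run it inside the sub folder** so that the paths resolve correctly. Some paths and a few small errors in the original script still need fixing: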
    #!/bin/bash
    filepath=$(cd "$(dirname "$0")"; pwd)

    for kfoldi in {1..5}
    do
        echo "doing fold $kfoldi"
        # Note: the relative form in the two commented lines below works only if the
        # experiment directory is under the Caffe path; otherwise use an absolute path.
        # Also, the {{kfoldi}} inside the for loop of the original script should be changed to kfoldi.
        #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb subs/ train_w32_${kfoldi}.txt DB_train_${kfoldi} &
        #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb subs/ test_w32_${kfoldi}.txt DB_test_${kfoldi} &
        /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb subs/ train_w32_$kfoldi.txt DB_train_$kfoldi &
        /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb subs/ test_w32_$kfoldi.txt DB_test_$kfoldi &
    done

    # Wait for every background conversion and count the failures
    FAIL=0
    for job in $(jobs -p)
    do
        echo $job
        wait $job || let "FAIL+=1"
    done
    echo "number failed: $FAIL"

    cd ../
    for kfoldi in {1..5}
    do
        echo "doing fold $kfoldi"
        # Same path fix as above
        /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train_$kfoldi DB_train_w32_$kfoldi.binaryproto -backend leveldb &
    done
    wait # let the mean computations finish before the script exits
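Since a tab-separated list file leads to the "could not open or find file" error during conversion, it can be worth checking the list files before building the databases. The sketch below is a hypothetical helper, not part of the tutorial. It assumes every line should have the form `name label` with a single space before an integer label in 0..2, and the sample file names are made up.

```python
def check_list_lines(lines, n_classes=3):
    """Return (line_number, reason) pairs for lines that are not 'name label'."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if "\t" in line:
            problems.append((lineno, "contains a tab"))
            continue
        # Split at the last space: everything before it is the file name
        name, sep, label = line.rpartition(" ")
        if not sep or not name:
            problems.append((lineno, "no space separator"))
        elif not label.isdigit() or int(label) >= n_classes:
            problems.append((lineno, "bad label"))
    return problems


sample = [
    "CLL-example_sub_1.tif 0\n",   # well-formed
    "FL-example_sub_1.tif\t1\n",   # tab-separated, rejected
    "MCL-example_sub_1.tif 7\n",   # label outside 0..2, rejected
]
problems = check_list_lines(sample)
print(problems)  # [(2, 'contains a tab'), (3, 'bad label')]
```

On a real list file the call would be `check_list_lines(open("train_w32_1.txt"))`; an empty result means every line passed these checks.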
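Note: the network used is called alexnet, but it is actually slimmer than the official AlexNet, with only 3 convolution-pooling pairs and two fully connected layers. It is in fact the network used for CIFAR-10 classification. Keep in mind that the input images here are 32*32 (the network definition also crops the input to 32) and that this is a 3-class problem (AlexNet has 1000 classes), so the model does not need to be as complex as AlexNet. The depth and some parameters therefore have to change, and AlexNet cannot be copied as-is. One experiment worth trying: do pathology images really have to be cut into small patches? Could larger patches be used, with fewer pooling layers and a deeper network? How that would perform is an open question.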
Files needed, from the common folder at the same level as 7-lymphoma: BASE-alexnet_solver_ada.prototxt; BASE-alexnet_traing_32w_db.prototxt and BASE-alexnet_traing_32w_dropout_db.prototxt (without and with dropout); deploy_train32.prototxt and deploy_train32_dropout.prototxt (test network definitions).
Make 5 copies, one for each of the 5 models (5-fold cross-validation), named 1-alexnet_solver_ada.prototxt and so on.
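What to modify:
Set Caffe's number of test iterations, test_iter, in the solver file. It is computed as the number of test samples / the batch_size used for testing. Here batch_size = 128, and the number of test samples can be obtained quickly with the following command: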
wc -l test_w32_1.txt
Alternatively, open the file, scroll to the last line, and read the line count shown at the bottom of the text editor.
Train:
/home/mz/py-R-FCN/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt
Model 5, 600000 iterations, without dropout: accuracy 0.841879, loss = 0.513672;
with dropout: accuracy 0.826787, loss = 0.576846
Model 4, 600000 iterations, without dropout: accuracy 0.86142, loss = 0.364765;
with dropout: accuracy 0.85352, loss = 0.500288
Model 3, 600000 iterations, without dropout: accuracy 0.840632, loss = 0.448586;
with dropout: accuracy 0.814813, loss = 0.546735
Model 2, 600000 iterations, without dropout: accuracy 0.817167, loss = 0.466199;
with dropout: accuracy 0.797229, loss = 0.557098
Model 1, 600000 iterations, without dropout: accuracy 0.85496, loss = 0.435163;
with dropout: accuracy 0.828828, loss = 0.577961
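Averaging over the five folds gives one summary figure per variant. The sketch below only restates the numbers listed above and assumes that the first value on each line is the test accuracy. The dictionary layout is illustrative.

```python
from statistics import mean

# model number -> (value without dropout, value with dropout), copied from the results above
accuracy = {
    1: (0.85496, 0.828828),
    2: (0.817167, 0.797229),
    3: (0.840632, 0.814813),
    4: (0.86142, 0.85352),
    5: (0.841879, 0.826787),
}

mean_plain = mean([plain for plain, _ in accuracy.values()])
mean_dropout = mean([drop for _, drop in accuracy.values()])

print(f"without dropout: {mean_plain:.4f}")
print(f"with dropout:    {mean_dropout:.4f}")
```

In every fold the variant without dropout scores higher, and the averages show the same ordering.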
step1. Cut the large images into 227*227 patches (step size 227, output to ./subs_227/), optionally writing a horizontally flipped copy of each patch.
    clc; clear all;

    % Output directory for all of the patch files
    outdir='./subs_227/';
    mkdir(outdir)

    % Stride between neighbouring patches
    step_size=227;
    % Patch size
    patch_size=227;
    % Whether to also write a horizontally flipped copy of each patch
    flip = true;

    % Extract patches class by class
    classes={'CLL','FL','MCL'};
    class_struct={};
    for classi=1:length(classes)
        % List every image in the folder of the current class
        files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']);
        % Build the list of unique patient IDs. The author's pipeline is organised per patient.
        % This dataset carries no patient information, but the same structure is kept for
        % consistency, so each image acts as its own "patient".
        % strsplit splits the file name at '.', and x{1}{1} takes the part before the first '.'
        patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0));
        patient_struct=[];
        parfor ci=1:length(patients) % patients are processed in parallel
            % base: the base file name, kept so that patients can be split into sets later
            patient_struct(ci).base=patients{ci};
            % sub_file: holds the patches extracted for this patient
            patient_struct(ci).sub_file=[];
            % All image files associated with this particular patient
            files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci}));
            for fi=1:length(files)
                % In this dataset each patient corresponds to exactly one large image
                disp([ci,length(patients),fi,length(files)])
                fname=files(fi).name;
                patient_struct(ci).sub_file(fi).base=fname; % each image name is saved as well
                io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}), fname]); % read the image
                [nrow,ncol,ndim]=size(io);
                fnames_sub={};
                i=1;
                % Patch extraction is plain sub-matrix indexing. Patches per image:
                % [floor((1388-227-1)/227)+1]*[floor((1040-227-1)/227)+1]*2, doubled again when flip is true
                for rr=1:step_size:nrow-patch_size
                    for cc=1:step_size:ncol-patch_size
                        for rot=1:2 % rotate by 0 and 90 degrees, doubling the dataset
                            try
                                % Alternative: loop rr=1:step_size:nrow-patch_size+1 (same for cc) and use
                                % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);
                                subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                                subio=imrotate(subio,(rot-1)*90);
                                % Patches are named by their running index
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio,[outdir,subfname]);
                                i=i+1;
                                if flip
                                    % Horizontal flip: reverse the column order
                                    subio_flip = subio(:,end:-1:1,1:3);
                                    subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                    fnames_sub{end+1}=subfname;
                                    imwrite(subio_flip,[outdir,subfname]);
                                    i=i+1;
                                end
                            catch err
                                disp(err);
                                continue
                            end
                        end
                    end
                end
                patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
            end
        end
        class_struct{classi}=patient_struct;
    end
    save('class_struct.mat','class_struct') % save in case the computer crashes before the next step finishes
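step2. Generate the TXT files containing the image name lists for the training, validation and test sets. Training set: 17856; test set: 9120; validation set: 8928.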
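These set sizes can be cross-checked against the patch-cutting parameters. The sketch below assumes 1040 rows by 1388 columns per image, 227*227 patches with step 227, 2 rotations and the horizontal flip enabled. The helper names are hypothetical, and `split_sizes` mirrors the `ceil`-based 2:1:1 indexing used in the step2 script, shown here for the 113 CLL images only.

```python
import math


def n_starts(length, patch_size, step_size):
    """Number of values in the MATLAB range 1:step_size:(length - patch_size)."""
    last = length - patch_size
    if last < 1:
        return 0
    return (last - 1) // step_size + 1


def split_sizes(n):
    """(train, val, test) image counts for one class of n images."""
    n_test = math.ceil(0.25 * n)
    n_val = math.ceil(0.5 * n) - n_test
    n_train = n - math.ceil(0.5 * n)
    return n_train, n_val, n_test


# 4 row positions * 6 column positions * 2 rotations * 2 (original + flipped)
per_image = n_starts(1040, 227, 227) * n_starts(1388, 227, 227) * 2 * 2
print(per_image)  # 96

# Patch counts quoted in step2, converted back to image counts
images = {name: count // per_image
          for name, count in {"train": 17856, "val": 8928, "test": 9120}.items()}
print(images)  # {'train': 186, 'val': 93, 'test': 95}

cll_split = split_sizes(113)
print(cll_split)  # (56, 28, 29)
```

The three image counts add up to the 374 images of the dataset, so every image ends up in exactly one of the three sets.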
    load class_struct % load the structure saved by step1

    % Open the file handles
    fidtrain=fopen(sprintf('train_w227.txt'),'w');
    fidval=fopen(sprintf('val_w227.txt'),'w');
    fidtest=fopen(sprintf('test_w227.txt'),'w');

    fidtrain_parent = fopen(sprintf('train_w227_parent.txt'),'w');
    fidval_parent = fopen(sprintf('val_w227_parent.txt'),'w');
    fidtest_parent = fopen(sprintf('test_w227_parent.txt'),'w');

    % Write each patient's patch file names and class labels (CLL: 0, FL: 1, MCL: 2)
    % to the train, validation and test txt files.
    % Training : validation : test = 2:1:1
    for classi=1:length(class_struct)
        patient_struct=class_struct{classi};
        npatients=length(patient_struct); % number of patients in this class
        % Shuffle the patient order
        RandIndex = randperm(npatients);
        test_index = RandIndex(1:ceil(0.25*npatients));
        val_index = RandIndex(ceil(0.25*npatients)+1:ceil(0.5*npatients));
        train_index = RandIndex(ceil(0.5*npatients)+1:end);
        for fi=1:npatients % for each patient
            disp([fi,npatients]);
            if(ismember(fi, test_index)) % this patient is in the test set
                fid=fidtest;
                fid_parent=fidtest_parent;
            elseif(ismember(fi, train_index)) % this patient is in the training set
                fid=fidtrain;
                fid_parent=fidtrain_parent;
            else % otherwise it is in the validation set
                fid=fidval;
                fid_parent=fidval_parent;
            end

            fprintf(fid_parent,'%s\n',patient_struct(fi).base); % write this patient's ID to the parent file
            subfiles=patient_struct(fi).sub_file; % this patient's images
            for subfi=1:length(subfiles) % for each of the patient's images
                try
                    subfnames=subfiles(subfi).fnames_subs; % all patches of this image
                    cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); % the label is classi-1
                catch err
                    disp(err)
                    disp([patient_struct(fi).base,' ',patient_struct(fi).sub_file(subfi).base]) % display any errors, but continue
                    continue
                end
            end
        end
    end

    % Now that we are done, close all of the files
    fclose(fidtrain);
    fclose(fidtest);
    fclose(fidval);
    fclose(fidtrain_parent);
    fclose(fidtest_parent);
    fclose(fidval_parent);
step3. Generate the lmdb-format data and the corresponding mean file. Put the step3 script, modified as follows, into subs_227
    #!/bin/bash
    echo "create lmdb data"
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb ./ ../train_w227.txt ../DB_train &
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb ./ ../test_w227.txt ../DB_test &
    # The validation database is built from val_w227.txt, not from the test list
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb ./ ../val_w227.txt ../DB_val &

    # Wait for every background conversion and count the failures
    FAIL=0
    for job in $(jobs -p)
    do
        echo $job
        wait $job || let "FAIL+=1"
    done
    echo "number failed: $FAIL"

    cd ../
    echo "create mean binary"
    /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train DB_train_w227.binaryproto -backend lmdb &
    wait # let the mean computation finish before the script exits
sudo /home/mz/py-R-FCN/caffe/build/tools/caffe test -model=train_test.prototxt -weights=../models/caffe_alexnet_train_iter_50000.caffemodel -gpu 0 -iterations=183
The -iterations argument is computed as: test set size / batch_size
test-accuracy: 0.856721 loss=1.03727