case7 Lymphoma Subtype Classification: Experiment Log

Date: 2019-11-11


Introduction

Classification problem: 3-class classification, identifying three sub-types of lymphoma: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL).
Network model: AlexNet
Dataset: the original images are 1388×1040 pixels, 374 images in total, 1.4 GB. CLL: 113, FL: 139, MCL: 122.

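Experimental procedure

Preparation:
Set up the Caffe environment, and download the dataset and the code.

Cut the large images into small patches. Code: step1_make_patches.m.
The only thing that needs to change in the code is the paths, so check them carefully. For convenience, put the dataset in the same directory as the .m file.
Before this step, to match the image naming described in the tutorial, add the class name as a prefix to every image in each class folder. Batch renaming on Ubuntu can be done as follows:

cd into the directory of the class
Suppose the prefix to add is CLL-
sudo rename 's/^/CLL-/' *tif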
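
The `rename` used above is the Perl-based tool shipped with Ubuntu. On systems where it is missing or behaves differently, a plain `mv` loop does the same job. The following is a minimal sketch, not part of the tutorial code: the function name `add_prefix` and the example path are illustrative assumptions.

```shell
# add_prefix DIR PREFIX
# Prepend PREFIX to every .tif file in DIR, skipping files that already
# carry the prefix so the function can be re-run safely.
add_prefix() {
    local dir=$1 prefix=$2 f base
    for f in "$dir"/*.tif; do
        [ -e "$f" ] || continue          # no .tif files: the glob stays literal
        base=$(basename "$f")
        case $base in
            "$prefix"*) continue ;;      # already prefixed
        esac
        mv -- "$f" "$dir/$prefix$base"
    done
}

# Example (path is illustrative):
# add_prefix ./case7_lymphoma_classification/CLL CLL-
```

Skipping already-prefixed files is a design choice: running the `rename` one-liner twice would produce `CLL-CLL-...` names, while this loop is idempotent.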

The corrected code, with brief annotations, follows:

    clc
    clear all
    % Output directory for the patches
    outdir='./subs/'; %output directory for all of the sub files
    mkdir(outdir)

    % Stride used when extracting patches
    step_size=32;
    % Patch size. As the tutorial author notes, Caffe will crop the patches to 32x32 at input time
    patch_size=36; %size of the patches we would like to extract, bigger since Caffe will randomly crop 32 x 32 patches from them
    % Extract patches class by class
    classes={'CLL','FL','MCL'};
    class_struct={};
    for classi=1:length(classes)
        % List all images in the folder of the current class
        files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']); % we only use images for which we have a mask available, so we use their filenames to limit the patients
        % Build the list of unique patient IDs. The tutorial author wants a database keyed by patient; this dataset has no patient information, but the same structure is kept for consistency
        % arrayfun applies a function to every element of an array; x{1}{1} takes the first token of the split name (the file name without its extension)
        patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
        patient_struct=[];
        % parfor: parallel loop
        parfor ci=1:length(patients) % for each of the *patients* we extract patches
            % The base field holds the patient name
            patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
            % The sub_file field holds the patch file names of this patient (image)
            patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
            % Find the images that belong to this patient
            files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient

            for fi=1:length(files) %for each of the files..... % as noted above, this dataset has no repeated patients, so each patient corresponds to exactly one image
                disp([ci,length(patients),fi,length(files)])
                fname=files(fi).name;
                % Save the name of each image of this patient
                patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well

                io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),fname]); %read the image

                [nrow,ncol,ndim]=size(io);
                fnames_sub={};
                i=1;
                % Extract the patches (sub-blocks of the image matrix). For a 1388x1040 image the count is
                % (floor((1388-36-1)/32)+1) * (floor((1040-36-1)/32)+1) * 2 = 43 * 32 * 2 = 2752
                for rr=1:step_size:nrow-patch_size
                    for cc=1:step_size:ncol-patch_size
                        for rot=1:2  % rotate by 0 and 90 degrees, which doubles the dataset
                            try
                                % This could also be written as rr=1:step_size:nrow-patch_size+1, ... with
                                % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);

                                subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                                subio=imrotate(subio,(rot-1)*90);
                                % Patches are named by their running index within the image
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio,[outdir,subfname]);
                                i=i+1;
                            catch err
                                disp(err);
                                continue
                            end
                        end
                    end
                end


                patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
            end

        end
        class_struct{classi}=patient_struct;

    end

    save('class_struct.mat','class_struct') %save this just in case the computer crashes before the next step finishes

Each image yields 2752 patches.
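
The figure of 2752 can be checked from the loop bounds: MATLAB's `1:step:(n-patch)` visits `floor((n-patch-1)/step)+1` start positions along each axis, and every position is written in two rotations. The sketch below repeats that arithmetic; `patches_per_image` is a hypothetical helper name, and the image is assumed to be 1040 rows by 1388 columns.

```shell
# patches_per_image ROWS COLS PATCH STEP VARIANTS
# VARIANTS is the number of copies written per position (rotations, flips).
patches_per_image() {
    local rows=$1 cols=$2 patch=$3 step=$4 variants=$5
    # Number of start positions visited by 1:step:(n-patch) along each axis
    local n_r=$(( (rows - patch - 1) / step + 1 ))
    local n_c=$(( (cols - patch - 1) / step + 1 ))
    echo $(( n_r * n_c * variants ))
}

patches_per_image 1040 1388 36 32 2    # prints 2752 (32 rows x 43 columns x 2 rotations)
```

The same function reproduces the count for the larger-patch experiment later in this log: with patch 227, step 227 and 4 variants per position it gives 96.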

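Generate the cross-validation sets in order to select the best model; 5-fold cross-validation is used. Code: step2_make_training_lists.m.
Each fold needs 4 txt files. Taking the first fold as an example:
train_w32_parent_1.txt, test_w32_parent_1.txt: lists of the patient names in the training and test sets of this fold
train_w32_1.txt, test_w32_1.txt: lists of the patch file names in this fold together with their class labels
The code is fairly straightforward once the principle of 5-fold cross-validation is understood. A brief record of the code: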

  load class_struct %load the patch lists saved by step 1
  % 5-fold cross-validation
  nfolds=5; %determine how many folds we want to use during cross validation
  fidtrain=[];
  fidtest=[];


  fidtrain_parent=[];
  fidtest_parent=[];
  % Open the file handles for all output files
  for zz=1:nfolds %open all of the file Ids for the training and testing files
      %each fold has 4 files created (as discussed in the tutorial)
      fidtrain(zz)=fopen(sprintf('train_w32_%d.txt',zz),'w');
      fidtest(zz)=fopen(sprintf('test_w32_%d.txt',zz),'w');

      fidtrain_parent(zz)=fopen(sprintf('train_w32_parent_%d.txt',zz),'w');
      fidtest_parent(zz)=fopen(sprintf('test_w32_parent_%d.txt',zz),'w');
  end

  % Write the patient IDs to the *_parent txt files, and the patch file names with their class labels (CLL: 0, FL: 1, MCL: 2) to the other two txt files
  % 5-fold cross-validation: 4 folds are used for training and the remaining one for testing, which gives 5 different train/test combinations
  for classi=1:length(class_struct)

      patient_struct=class_struct{classi};

      npatients=length(patient_struct); %get the number of patients that we have
      indices=crossvalind('Kfold',npatients,nfolds); %use the matlab function to generate a k-fold set

      for fi=1:npatients %for each patient
          disp([fi,npatients]);
          for k=1:nfolds %for each fold

              if(indices(fi)==k) %if this patient is in the test set for this fold, set the file descriptor accordingly
                  fid=fidtest(k);
                  fid_parent=fidtest_parent(k);
              else %otherwise its in the training set
                  fid=fidtrain(k);
                  fid_parent=fidtrain_parent(k);
              end

              fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patient's ID to the parent file

              subfiles=patient_struct(fi).sub_file; %get the patient's images

              for subfi=1:length(subfiles) %for each of the patient images
                  try
                      subfnames=subfiles(subfi).fnames_subs; %get all patch names of this image
                      % !!! Note: change the original '%s\t%d' to '%s %d' so that a space is used as the separator; otherwise the later format conversion fails with: could not open or find file...
                      cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write them to the list with the zero-based class label

                  catch err
                      disp(err)
                      disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                      continue
                  end
              end

          end
      end

  end

  for zz=1:nfolds %now that we're done, make sure that we close all of the files
      fclose(fidtrain(zz));
      fclose(fidtest(zz));

      fclose(fidtrain_parent(zz));
      fclose(fidtest_parent(zz));

  end

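This gives 5 datasets, one model per fold. Each test set contains roughly 203648 patches and each training set roughly 825600, so test : training ≈ 1 : 4.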
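
Because the split is made per patient, no patient ID should appear in both parent lists of the same fold. A quick way to confirm this is to print the IDs the two lists have in common; an empty result means the sets are disjoint. This is a small sketch, with `shared_patients` as a hypothetical helper name; the file names are the ones written by step 2.

```shell
# shared_patients TRAIN_PARENT_LIST TEST_PARENT_LIST
# Print every line of the training list that also occurs, as a whole line,
# in the test list. No output means there is no patient-level overlap.
shared_patients() {
    grep -Fx -f "$2" "$1"
}

# Example: check all five folds.
# for k in 1 2 3 4 5; do
#     shared_patients "train_w32_parent_$k.txt" "test_w32_parent_$k.txt"
# done
```

`-F` treats the IDs as fixed strings and `-x` requires whole-line matches, so an ID that is merely a prefix of another one is not reported.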

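Generate the databases. Caffe's command-line tools are used here to produce the data in LevelDB format together with the corresponding mean files. The image layer is not used directly because the mean would still have to be computed in the required format, and the image layer is not designed for reading large amounts of data, so the Caffe command-line tools are more convenient.
Code: step3_make_dbs.sh. **Run it inside the subs folder** to make sure the paths are correct. Some path issues and a few minor errors in the original script still need to be fixed: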

#!/bin/bash
# Run this script from inside the subs folder: the patches are in ./ and the
# list files written by step 2 are one level up.

for kfoldi in {1..5}
do
    echo "doing fold $kfoldi"
    # Note: a path relative to the Caffe tree (as in the two commented lines) only works if the
    # experiment directory lives under Caffe; otherwise use an absolute path. Also, the {{kfoldi}}
    # inside the for loop of the original script must be changed to $kfoldi.
    #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../train_w32_${kfoldi}.txt ../DB_train_${kfoldi} &
    #~/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../test_w32_${kfoldi}.txt ../DB_test_${kfoldi} &
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../train_w32_$kfoldi.txt ../DB_train_$kfoldi &
    /home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend leveldb ./ ../test_w32_$kfoldi.txt ../DB_test_$kfoldi &
done

# Wait for all conversions and count the failures
FAIL=0
for job in `jobs -p`
do
    echo $job
    wait $job || let "FAIL+=1"
done

echo "number failed: $FAIL"

cd ../

for kfoldi in {1..5}
do
    echo "doing fold $kfoldi"
    # Same path fix as above
    /home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train_$kfoldi DB_train_w32_$kfoldi.binaryproto -backend leveldb &
done

# Wait for the mean computations before the script exits
wait
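
Since `convert_imageset` splits each list line at the last space, a list written with tab separators leads to the "could not open or find file" error mentioned in step 2. Before starting the long conversion it may be worth checking the list files. The sketch below assumes lines of the form `<file name> <label>`; `check_list` is a hypothetical helper, not part of Caffe.

```shell
# check_list LIST_FILE NUM_CLASSES
# Report every line that is not "<name> <label>" with exactly one space as
# separator and an integer label in 0..NUM_CLASSES-1. Exit status 1 if any
# such line is found.
check_list() {
    awk -v n="$2" '
        !/^[^ \t]+ [0-9]+$/ || $NF + 0 >= n {
            printf "%s:%d: %s\n", FILENAME, FNR, $0
            bad = 1
        }
        END { exit bad }' "$1"
}

# Example: check_list ../train_w32_1.txt 3 && echo "list looks fine"
```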

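Training the DL classifier

Notes: the network is called AlexNet, but it is much smaller than the official AlexNet, with only 3 convolution/pooling pairs and two fully connected layers. It is in fact the network used for CIFAR-10 classification. The input patches here are 32×32 (the network definition also crops the input to 32), and this is a 3-class problem (AlexNet is 1000-class), so the model does not need to be as complex as AlexNet. The depth and some parameters therefore have to change; AlexNet cannot be copied as is. One thing worth testing is whether pathology images really have to be cut into small patches: how would performance change with larger patches, fewer pooling layers, and a deeper network?

Required files: in the common folder at the same level as 7-lymphoma, BASE-alexnet_solver_ada.prototxt, the network definitions BASE-alexnet_traing_32w_db.prototxt and BASE-alexnet_traing_32w_dropout_db.prototxt (without and with dropout), and the test network definitions deploy_train32.prototxt and deploy_train32_dropout.prototxt.
Make 5 copies, one per model (5-fold cross-validation), named 1-alexnet_solver_ada.prototxt and so on.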
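
Copying the BASE files by hand five times is error-prone. A small loop can generate the per-fold copies and, if the templates contain a fold placeholder, fill it in at the same time. This is a sketch under assumptions: `render_fold` is a hypothetical helper, and the placeholder string has to be copied exactly from the BASE files (this post writes it as `$(kfoldi)d`).

```shell
# render_fold TEMPLATE FOLD PLACEHOLDER
# Print TEMPLATE with every literal occurrence of PLACEHOLDER replaced by FOLD.
render_fold() {
    awk -v fold="$2" -v ph="$3" '
        {
            out = ""
            rest = $0
            # index() does a plain substring search, so no regex escaping is needed
            while (ph != "" && (i = index(rest, ph)) > 0) {
                out = out substr(rest, 1, i - 1) fold
                rest = substr(rest, i + length(ph))
            }
            print out rest
        }' "$1"
}

# Example: create 1-alexnet_solver_ada.prototxt ... 5-alexnet_solver_ada.prototxt
# for k in 1 2 3 4 5; do
#     render_fold BASE-alexnet_solver_ada.prototxt "$k" '$(kfoldi)d' > "$k-alexnet_solver_ada.prototxt"
# done
```

A literal substring search is used instead of `sed` because the placeholder contains characters such as `$` and parentheses that have a special meaning in regular expressions.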
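What to modify:

Check every $(kfoldi)d in all files and replace it with the corresponding fold number 1-5. Change the output of the last ip layer in the prototxt files to 3.
Fix the paths. The paths in the files (data, prototxt) assume that the model definitions are in Caffe's ./model directory and that the databases and mean files are in the Caffe root directory. If that is not the case, replace them with absolute paths.
Set Caffe's number of test iterations, test_iter in the solver file. It is computed as (number of test samples) / (test batch_size). Here batch_size = 128, and the number of test samples can be obtained quickly with the following command: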
```
wc -l test_w32_1.txt
```
Alternatively, open the file, scroll to the last line, and read the line number shown at the bottom of the text editor.
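
The division usually does not come out even, so the result should be rounded up to make sure every test sample is seen at least once. A minimal sketch of the calculation follows; `ceil_div` and `test_iterations` are hypothetical helper names.

```shell
# ceil_div A B: smallest integer >= A / B, for positive integers.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# test_iterations LIST_FILE BATCH_SIZE: batches needed to cover LIST_FILE once.
test_iterations() { ceil_div "$(wc -l < "$1")" "$2"; }

ceil_div 203648 128    # prints 1591

# Example with a list file: test_iterations test_w32_1.txt 128
```

With the 203648 test patches reported for a fold and batch_size = 128, this gives test_iter = 1591.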

Run the training:

/home/mz/py-R-FCN/caffe/build/tools/caffe train --solver=1-alexnet_solver_ada.prototxt

Model 5, 600000 iterations: without dropout, accuracy 0.841879, loss = 0.513672; with dropout, accuracy 0.826787, loss = 0.576846
Model 4, 600000 iterations: without dropout, accuracy 0.86142, loss = 0.364765; with dropout, accuracy 0.85352, loss = 0.500288
Model 3, 600000 iterations: without dropout, accuracy 0.840632, loss = 0.448586; with dropout, accuracy 0.814813, loss = 0.546735
Model 2, 600000 iterations: without dropout, accuracy 0.817167, loss = 0.466199; with dropout, accuracy 0.797229, loss = 0.557098
Model 1, 600000 iterations: without dropout, accuracy 0.85496, loss = 0.435163; with dropout, accuracy 0.828828, loss = 0.577961
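
To compare the two variants across folds, the five accuracies can be averaged. The sketch below only does the arithmetic on the numbers listed above; `mean` is a hypothetical helper.

```shell
# mean X1 X2 ...: arithmetic mean of the arguments, printed with 4 decimals.
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { if (NR) printf "%.4f\n", s / NR }'
}

mean 0.841879 0.86142 0.840632 0.817167 0.85496     # without dropout: 0.8432
mean 0.826787 0.85352 0.814813 0.797229 0.828828    # with dropout:    0.8242
```

In these runs the model without dropout is about two percentage points better on average, and it is also better in every individual fold.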

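Trying larger patches with different network architectures (AlexNet, VGG-16, GoogLeNet, ResNet)

Data preparation.
Now try larger patches, cropped to 227×227. The following experiments no longer use cross-validation. All the data is merged into a single dataset and split into training, validation, and test sets in a 2:1:1 ratio.
Method: create a new folder to hold the experiment data, and modify the code of the original step1 and step2 files as shown below:
step1. Cut 227×227 patches from the original images, and augment the data by rotating each patch by 90° and flipping it horizontally. One original image yields 96 patches (24 positions × 2 rotations × 2 flips).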

clc
clear all
% Output directory for the patches
outdir='./subs_227/'; %output directory for all of the sub files
mkdir(outdir)

% Stride used when extracting patches (equal to the patch size, so patches do not overlap)
step_size=227;
% Patch size: 227x227, the input size of the larger networks, so no extra margin is added for cropping
patch_size=227; %size of the patches we would like to extract

% Whether to add horizontally flipped copies
flip = true;

% Extract patches class by class
classes={'CLL','FL','MCL'};
class_struct={};
for classi=1:length(classes)
    % List all images in the folder of the current class
    files=dir([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'*.tif']); % we only use images for which we have a mask available, so we use their filenames to limit the patients
    % Build the list of unique patient IDs. The tutorial author wants a database keyed by patient; this dataset has no patient information, but the same structure is kept for consistency
    % arrayfun applies a function to every element of an array; x{1}{1} takes the first token of the split name (the file name without its extension)
    patients=unique(arrayfun(@(x) x{1}{1},arrayfun(@(x) strsplit(x.name,'.'),files,'UniformOutput',0),'UniformOutput',0)); %this creates a list of patient id numbers
    patient_struct=[];
    % parfor: parallel loop
    parfor ci=1:length(patients) % for each of the *patients* we extract patches
        % The base field holds the patient name
        patient_struct(ci).base=patients{ci}; %we keep track of the base filename so that we can split it into sets later. a "base" is for example 12750 in 12750_500_f00003_original.tif
        % The sub_file field holds the patch file names of this patient (image)
        patient_struct(ci).sub_file=[]; %this will hold all of the patches we extract which will later be written
        % Find the images that belong to this patient
        files=dir(sprintf([sprintf('./case7_lymphoma_classification/%s/', classes{classi}),'%s*.tif'],patients{ci})); %get a list of all of the image files associated with this particular patient

        for fi=1:length(files) %for each of the files..... % as noted above, this dataset has no repeated patients, so each patient corresponds to exactly one image
            disp([ci,length(patients),fi,length(files)])
            fname=files(fi).name;
            % Save the name of each image of this patient
            patient_struct(ci).sub_file(fi).base=fname; %each individual image name gets saved as well

            io=imread([sprintf('./case7_lymphoma_classification/%s/', classes{classi}), fname]); %read the image

            [nrow,ncol,ndim]=size(io);
            fnames_sub={};
            i=1;
            % Extract the patches (sub-blocks of the image matrix). For a 1388x1040 image there are
            % 6 x 4 = 24 positions; with 2 rotations and 2 flips each, that is 96 patches
            for rr=1:step_size:nrow-patch_size
                for cc=1:step_size:ncol-patch_size
                    for rot=1:2  % rotate by 0 and 90 degrees, which doubles the dataset
                        try
                            % This could also be written as rr=1:step_size:nrow-patch_size+1, ... with
                            % subio=io(rr:rr+patch_size-1,cc:cc+patch_size-1,:);

                            subio=io(rr+1:rr+patch_size,cc+1:cc+patch_size,:);
                            subio=imrotate(subio,(rot-1)*90);
                            % Patches are named by their running index within the image
                            subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                            fnames_sub{end+1}=subfname;
                            imwrite(subio,[outdir,subfname]);
                            i=i+1;
                            if flip
                                % Horizontally flipped copy, which doubles the dataset again
                                subio_flip = subio(:,end:-1:1,1:3);
                                subfname=sprintf('%s_sub_%d.tif',fname(1:end-4),i);
                                fnames_sub{end+1}=subfname;
                                imwrite(subio_flip,[outdir,subfname]);
                                i=i+1;
                            end
                        catch err
                            disp(err);
                            continue
                        end
                    end
                end
            end


            patient_struct(ci).sub_file(fi).fnames_subs=fnames_sub;
        end

    end
    class_struct{classi}=patient_struct;

end

save('class_struct.mat','class_struct') %save this just in case the computer crashes before the next step finishes

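step2. Generate the TXT files containing the image name lists of the training, validation, and test sets. Training set: 17856; test set: 9120; validation set: 8928.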

load class_struct %load the patch lists saved by step 1

% Open the file handles for all output files

fidtrain=fopen('train_w227.txt','w');
fidval=fopen('val_w227.txt','w');
fidtest=fopen('test_w227.txt','w');
fidtrain_parent=fopen('train_w227_parent.txt','w');
fidval_parent=fopen('val_w227_parent.txt','w');
fidtest_parent=fopen('test_w227_parent.txt','w');
% Write the patch file names with their class labels (CLL: 0, FL: 1, MCL: 2) to the training, validation, and test txt files

% Training : validation : test = 2:1:1


for classi=1:length(class_struct)

    patient_struct=class_struct{classi};

    npatients=length(patient_struct); %get the number of patients that we have
    % Shuffle the patients, then take the first quarter for testing, the second quarter for validation, and the rest for training
    RandIndex = randperm(npatients);
    test_index = RandIndex(1:ceil(0.25*npatients));
    val_index = RandIndex(ceil(0.25*npatients)+1:ceil(0.5*npatients));
    train_index = RandIndex(ceil(0.5*npatients)+1:end);

    for fi=1:npatients %for each patient
        disp([fi,npatients]);

        if(ismember(fi, test_index)) %if this patient is in the test set, set the file descriptor accordingly
            fid=fidtest;
            fid_parent=fidtest_parent;
        elseif(ismember(fi, train_index)) %if it is in the training set
            fid=fidtrain;
            fid_parent=fidtrain_parent;
        else %otherwise it is in the validation set
            fid=fidval;
            fid_parent=fidval_parent;
        end

        fprintf(fid_parent,'%s\n',patient_struct(fi).base); %print this patient's ID to the parent file

        subfiles=patient_struct(fi).sub_file; %get the patient's images

        for subfi=1:length(subfiles) %for each of the patient images
            try
                subfnames=subfiles(subfi).fnames_subs; %get all patch names of this image
                cellfun(@(x) fprintf(fid,'%s %d\n',x,classi-1),subfnames); %write them to the list with the zero-based class label

            catch err
                disp(err)
                disp([patient_struct(fi).base,'  ',patient_struct(fi).sub_file(subfi).base]) %if there are any errors, display them, but continue
                continue
            end
        end

    end
end


%now that we're done, make sure that we close all of the files
fclose(fidtrain);
fclose(fidtest);
fclose(fidval);
fclose(fidtrain_parent);
fclose(fidtest_parent);
fclose(fidval_parent);
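
The set sizes quoted for step2 follow directly from the `ceil` boundaries in the code, applied per class, times 96 patches per image. The sketch below redoes that arithmetic with 113, 139 and 122 images per class (the per-class counts that sum to the 374 images of the dataset); `ceil_frac` and `split_counts` are hypothetical helper names.

```shell
# ceil_frac N NUM DEN: ceil(N * NUM / DEN) using integer arithmetic.
ceil_frac() { echo $(( ($1 * $2 + $3 - 1) / $3 )); }

# split_counts PATCHES_PER_IMAGE N_CLASS1 N_CLASS2 ...
# Print the "train test val" patch totals produced by the 2:1:1 split.
split_counts() {
    local per_image=$1 train=0 test_n=0 val=0 n q h
    shift
    for n in "$@"; do
        q=$(ceil_frac "$n" 1 4)      # ceil(0.25*n): images that go to the test set
        h=$(ceil_frac "$n" 1 2)      # ceil(0.5*n): end of the validation range
        test_n=$(( test_n + q ))
        val=$(( val + h - q ))
        train=$(( train + n - h ))
    done
    echo "$(( train * per_image )) $(( test_n * per_image )) $(( val * per_image ))"
}

split_counts 96 113 139 122    # prints: 17856 9120 8928
```

Because the boundaries are rounded up per class, the test set (95 images) ends up slightly larger than the validation set (93 images), which is why the two quarters differ in size.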

step3. Generate the databases (LMDB this time) and the corresponding mean file. Put the step3 script, modified as follows, into subs_227 and run it there:

#!/bin/bash

echo "create lmdb data"
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../train_w227.txt ../DB_train &
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../test_w227.txt ../DB_test &
# The validation database must be built from the validation list, not the test list
/home/mz/py-R-FCN/caffe/build/tools/convert_imageset -shuffle -backend lmdb   ./ ../val_w227.txt ../DB_val &

# Wait for all conversions and count the failures
FAIL=0
for job in `jobs -p`
do
    echo $job
    wait $job || let "FAIL+=1"
done

echo "number failed: $FAIL"

cd ../

echo "create mean binary"
/home/mz/py-R-FCN/caffe/build/tools/compute_image_mean DB_train DB_train_w227.binaryproto -backend lmdb  &

# Wait for the mean computation before the script exits
wait

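Different models

AlexNet

Copy the bvlc_alexnet folder from caffe/models to get AlexNet's model definition prototxt and solver.prototxt. Change the relevant parameters and train.
Parameters: iterations: 50000; test_iter=179; test_interval=200; fc8 num_output=3;
Results
val-accuracy: 0.927598; train-loss = 8.55613e-05;
For testing, train_val.prototxt is still used: save a copy named train_test.prototxt and change the validation-set path in it to the test-set path. Then run the following command: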

sudo /home/mz/py-R-FCN/caffe/build/tools/caffe test -model=train_test.prototxt -weights=../models/caffe_alexnet_train_iter_50000.caffemodel -gpu 0 -iterations=183

The -iterations value is computed as (test set size) / batch_size, rounded up.
test-accuracy: 0.856721 loss=1.03727
