<div id="article_content" class="article_content clearfix csdn-tracking-statistics" data-pid="blog" data-mod="popu_307" data-dsm="post" style="height: 1264px; overflow: hidden;"> Reposted from: https://blog.csdn.net/lanchunhui/article/details/50521648 <div class="markdown_views"> <pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline</code></pre>
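<p>The pipeline mechanism is useful in machine learning because a set of parameters learned on one dataset has to be <strong>reused</strong> on new data, such as the test set.</p>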
<p>A pipeline wraps and manages all processing steps as one streamlined workflow (<strong>streaming workflows with pipelines</strong>).</p>
<p>Note: the pipeline mechanism is more an innovation in programming technique than an innovation in algorithms.</p>
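<p>To make the idea of parameter reuse concrete, here is a minimal sketch in plain Python. <code>ToyScaler</code> is a hypothetical stand-in written for this illustration, not a scikit-learn class; it only mimics the fit/transform convention, and the numbers are made up.</p>

```python
from statistics import mean, pstdev


class ToyScaler:
    """Hypothetical stand-in for a transformer such as StandardScaler."""

    def fit(self, values):
        # Learn the parameters (mean and standard deviation) from the data
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored parameters without re-estimating them
        return [(v - self.mean_) / self.std_ for v in values]


train = [2.0, 4.0, 6.0, 8.0]  # made-up training values
test = [5.0, 10.0]            # made-up test values

scaler = ToyScaler().fit(train)        # parameters come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # the same parameters are reused on the test set
```

<p>The test values are scaled with the mean and standard deviation of the training values, which is the bookkeeping a pipeline automates across several steps.</p>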
<p>Next, we use a concrete example to demonstrate the powerful Pipeline class in the sklearn library:</p>
<h2 id="1-加载数据集"><a name="t0"></a>1. <strong>Load the dataset</strong></h2>
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">import pandas as pd
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Breast Cancer Wisconsin dataset
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'breast-cancer-wisconsin/wdbc.data', header=None)

# Columns 2 onward are the features; column 1 holds the string labels
X, y = df.values[:, 2:], df.values[:, 1]

# Use LabelEncoder to convert the string labels to integers starting from 0
encoder = LabelEncoder()
y = encoder.fit_transform(y)
# encoder.transform(['M', 'B']) returns array([1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
</code></pre>
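<p>The label conversion can be checked in isolation. The sketch below uses a few made-up labels rather than the real dataset, and shows that <code>LabelEncoder</code> stores the sorted class names in <code>classes_</code> and can map the integers back with <code>inverse_transform</code>.</p>

```python
from sklearn.preprocessing import LabelEncoder

labels = ['M', 'B', 'B', 'M']  # illustrative labels, not the real dataset

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so 'B' maps to 0 and 'M' maps to 1
classes = list(encoder.classes_)

# inverse_transform recovers the original string labels
decoded = list(encoder.inverse_transform(encoded))
```

<p>Keeping the fitted encoder around is useful when predictions need to be reported as 'M' or 'B' instead of 1 or 0.</p>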
<h2 id="2-构思算法的流程"><a name="t1"></a>2. <strong>Design the algorithm workflow</strong></h2>
<p>Steps that could go into a Pipeline include:</p>
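<ul> <li>Feature standardization is needed and can serve as the first step</li> <li>Since this is a classification task, a classifier is indispensable and is naturally the last step</li> <li>Steps such as dimensionality reduction (PCA) can be added in between</li> <li>...</li> </ul>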
<pre class="prettyprint" name="code"><code class="hljs python has-numbering">from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))

# Test accuracy: 0.947</code></pre>
<p>A Pipeline object takes a <strong>list of 2-tuples</strong>. The first element of each tuple is an arbitrary <strong>identifier string</strong>, which we use to access the individual elements of the Pipeline object; the second element is a scikit-learn-compatible <strong>transformer or estimator</strong>.</p>
<pre class="prettyprint" name="code"><code class="hljs bash has-numbering">Pipeline([('sc', StandardScaler()),
          ('pca', PCA(n_components=2)),
          ('clf', LogisticRegression(random_state=1))])</code></pre>
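<p>As a brief sketch of what the identifier strings are for: the names can be read back from <code>steps</code>, used as keys in <code>named_steps</code>, and combined with a double underscore in <code>set_params</code> to change a parameter of one step. The variable names below are illustrative only.</p>

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])

# The identifier strings, in pipeline order
step_names = [name for name, _ in pipe.steps]

# Access an individual step by its identifier
pca_step = pipe.named_steps['pca']

# Change a parameter of one step using the <identifier>__<parameter> convention
pipe.set_params(pca__n_components=3)
new_n_components = pipe.named_steps['pca'].n_components
```

<p>The same <code>identifier__parameter</code> naming is what tools such as grid search use to address parameters inside a pipeline.</p>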
<h2 id="3-pipeline执行流程的分析"><a name="t2"></a>3. <strong>Analysis of the Pipeline execution flow</strong></h2>
<p>The intermediate steps of a Pipeline are scikit-learn-compatible transformers, and the last step is an estimator. In the code above, for example, the <em>StandardScaler</em> and <em>PCA</em> <strong>transformers</strong> form the intermediate steps, and LogisticRegression serves as the final <strong>estimator</strong>.</p>
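<p>When we call <code>pipe_lr.fit(X_train, y_train)</code>, <em>StandardScaler</em> first runs its <em>fit</em> and <em>transform</em> methods on the training set, and the transformed data is passed to the next step of the Pipeline object, namely PCA(). Like <em>StandardScaler</em>, PCA also runs fit and transform, and finally passes the transformed data to <em>LogisticRegression</em>. The whole flow is shown in the figure below:</p>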
<p></p><center> <br> <img src="https://img-blog.csdn.net/20160115095855517" height="400" width="500"> <br> </center><p></p>
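<p>The flow described above can be written out by hand and compared with the pipeline. This is a sketch under one assumption: it uses the copy of the breast cancer data bundled with scikit-learn (<code>load_breast_cancer</code>) instead of the UCI download, so the label encoding and the resulting accuracy may differ from the figures in this article.</p>

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled copy of the dataset, used here to avoid a network download
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)

# Pipeline version
pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('clf', LogisticRegression(random_state=1))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: fit and transform on the training set, step by step
sc = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=1)
X_train_pca = pca.fit_transform(sc.fit_transform(X_train))
clf.fit(X_train_pca, y_train)

# At prediction time only transform is called, reusing the fitted parameters
manual_pred = clf.predict(pca.transform(sc.transform(X_test)))

same = np.array_equal(pipe_pred, manual_pred)
```

<p>Note that the test set only ever goes through <code>transform</code>, never <code>fit</code>; the pipeline takes care of this distinction automatically.</p>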
<h2 id="4-pipeline-与深度神经网络的multi-layers"><a name="t3"></a>4. <strong>Pipeline and the multiple layers of deep neural networks</strong></h2>
<p>The only difference is that the concept of a step is replaced by the concept of a layer; even the last step and the output layer play the same role.</p>
<p>This is just a question for thought: isn't there a bit of similarity between the two?</p> </div> <link rel="stylesheet" href="https://csdnimg.cn/release/phoenix/template/css/markdown_views-ea0013b516.css"> </div>