Writing the Simplest Possible Nutch Plugin

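Nutch is highly extensible; its plugin system is based on the Eclipse 2.x plugin system. In this article I explain how to write a Nutch plugin and describe the pitfalls I ran into along the way.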

First make sure you have Nutch running successfully in Eclipse; for reference, see "Running Nutch in Eclipse".

The plugin we are going to implement takes over the fetching step: no matter which URL is being fetched, we always return hello world. Simple enough, right?

The plugin mechanism

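Nutch's plugin mechanism works roughly like this: Nutch itself exposes several extension points, and each extension point is an interface. We implement the interface to plug into that extension point, and such an implementation is called an extension. A single plugin can contain several extensions.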
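To make the terms concrete, here is a minimal sketch of the idea in plain Java. It does not use any Nutch classes: Greeter, EnglishGreeter, ShoutingGreeter and ExtensionRegistry are hypothetical names I made up for illustration, and the real Nutch plugin system is more involved than this toy registry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical extension point: a plain interface exposed by the host application.
interface Greeter {
    String greet(String name);
}

// Two hypothetical extensions: each one is simply an implementation of the interface.
class EnglishGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "hello " + name;
    }
}

class ShoutingGreeter implements Greeter {
    @Override
    public String greet(String name) {
        return "HELLO " + name.toUpperCase();
    }
}

// Toy registry standing in for a plugin system: it maps an extension point
// (the interface) to every extension registered for it.
class ExtensionRegistry {
    private final Map<Class<?>, List<Object>> extensions = new LinkedHashMap<>();

    <T> void register(Class<T> point, T extension) {
        extensions.computeIfAbsent(point, k -> new ArrayList<>()).add(extension);
    }

    <T> List<T> lookup(Class<T> point) {
        List<T> result = new ArrayList<>();
        for (Object o : extensions.getOrDefault(point, new ArrayList<>())) {
            result.add(point.cast(o));
        }
        return result;
    }
}

class ExtensionDemo {
    public static void main(String[] args) {
        ExtensionRegistry registry = new ExtensionRegistry();
        registry.register(Greeter.class, new EnglishGreeter());
        registry.register(Greeter.class, new ShoutingGreeter());
        // The host only knows the interface; it runs whatever extensions were registered.
        for (Greeter g : registry.lookup(Greeter.class)) {
            System.out.println(g.greet("world"));
        }
    }
}
```

The point of the sketch is only the division of roles: the host owns the interface, and extensions are looked up by that interface rather than by their concrete class.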

These are the main Nutch extension points listed on the official Nutch site:

  • IndexWriter -- Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).

  • IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

  • Parser -- Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.

  • HtmlParseFilter -- Permits one to add additional metadata to HTML parses (from javadoc).

  • Protocol -- Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.

  • URLFilter -- URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.

  • URLNormalizer -- Interface used to convert URLs to normal form and optionally perform substitutions.

  • ScoringFilter -- A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.

  • SegmentMergeFilter -- Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
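As a rough illustration of what one of these extension points does, here is a sketch of a URL filter chain in plain Java. SimpleUrlFilter is a local stand-in that I define here, not Nutch's own URLFilter interface; it only borrows the convention of returning the URL to keep it and null to drop it. SchemeFilter, SuffixFilter and FilterChain are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Local stand-in for the idea of a URL filter: return the URL to keep it, null to drop it.
interface SimpleUrlFilter {
    String filter(String url);
}

// Keeps only http and https URLs.
class SchemeFilter implements SimpleUrlFilter {
    @Override
    public String filter(String url) {
        return (url.startsWith("http://") || url.startsWith("https://")) ? url : null;
    }
}

// Drops URLs that look like images.
class SuffixFilter implements SimpleUrlFilter {
    @Override
    public String filter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        boolean image = lower.endsWith(".png") || lower.endsWith(".jpg") || lower.endsWith(".gif");
        return image ? null : url;
    }
}

class FilterChain {
    // Runs the filters in order and stops as soon as one of them rejects the URL.
    static String apply(List<SimpleUrlFilter> filters, String url) {
        String current = url;
        for (SimpleUrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<SimpleUrlFilter> chain = Arrays.asList(new SchemeFilter(), new SuffixFilter());
        System.out.println(apply(chain, "http://nutch.apache.org/index.html"));
        System.out.println(apply(chain, "ftp://example.org/file"));
        System.out.println(apply(chain, "http://example.org/logo.png"));
    }
}
```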

We want to take over the page-fetching part, so the Protocol extension point is our target.

Analyzing the protocol-http plugin

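Nutch ships with many default plugins, whose source code lives in src/plugin. When the URL being fetched uses the http protocol, Nutch uses the protocol-http plugin. Reading existing code is the best way to learn, so let's see how protocol-http is implemented.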

Directory structure

The directory structure of the protocol-http source:

protocol-http:                                            
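│  build.xml    // Ant build file for the plugin; describes how to build it
│  ivy.xml      // declares the third-party libraries the plugin depends on
│  plugin.xml   // plugin descriptor; Nutch reads it to learn which extension points the plugin implements, and therefore when to call it
│
└─src           // plugin source directory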
    ├─java                                      
    │  └─org                                    
    │      └─apache                             
    │          └─nutch                          
    │              └─protocol                   
    │                  └─http                   
    │                          Http.java        
    │                          HttpResponse.java
    │                          package.html     
    │                                           
    └─test                                      
        └─org                                   
            └─apache                            
                └─nutch                         
                    └─protocol                  
                        └─http

Analyzing the plugin files

Let's look at the contents of each file in turn.

build.xml:

<?xml version="1.0"?>
<project name="protocol-http" default="jar-core">  <!-- the name attribute defines the plugin name -->

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-http"/>  <!-- protocol-http depends on another plugin, lib-http -->
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-http/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-http"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>

</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
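<plugin
   id="protocol-http"
   name="Http Protocol Plug-in"
   version="1.0.0"
   provider-name="nutch.org">
   <!-- id: plugin ID; name: plugin name; version: plugin version; provider-name: plugin provider -->

   <runtime>
      <!-- name of the jar the plugin is built into -->
      <library name="protocol-http.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- other plugins this plugin requires -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-http"/>
   </requires>

   <!-- the extensions this plugin contains -->
   <!-- id: extension ID; name: extension name; point: extension point -->
   <extension id="org.apache.nutch.protocol.http"
              name="HttpProtocol"
              point="org.apache.nutch.protocol.Protocol">

      <!-- an extension can contain several implementations -->
      <!-- id: implementation ID; class: implementation class -->
      <implementation id="org.apache.nutch.protocol.http.Http"
                      class="org.apache.nutch.protocol.http.Http">
        <!-- this implementation is used when protocolName is http (I could not find this defined anywhere in the Nutch documentation) -->
        <parameter name="protocolName" value="http"/>
      </implementation>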

      <implementation id="org.apache.nutch.protocol.http.Http"
                       class="org.apache.nutch.protocol.http.Http">
           <parameter name="protocolName" value="https"/>
      </implementation>

   </extension>

</plugin>
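When reading an unfamiliar plugin.xml, it can help to list which class is bound to which extension point and with which parameters. The following is a small sketch using only the JDK's DOM parser. It is not how Nutch itself loads descriptors; DescriptorReader and its SAMPLE string (a trimmed copy of the descriptor above) are names and data I introduce purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DescriptorReader {
    // Trimmed copy of the protocol-http descriptor, kept inline so the sketch is self-contained.
    static final String SAMPLE =
        "<plugin id=\"protocol-http\">"
      + "<extension id=\"org.apache.nutch.protocol.http\" point=\"org.apache.nutch.protocol.Protocol\">"
      + "<implementation id=\"org.apache.nutch.protocol.http.Http\" class=\"org.apache.nutch.protocol.http.Http\">"
      + "<parameter name=\"protocolName\" value=\"http\"/>"
      + "</implementation>"
      + "<implementation id=\"org.apache.nutch.protocol.http.Http\" class=\"org.apache.nutch.protocol.http.Http\">"
      + "<parameter name=\"protocolName\" value=\"https\"/>"
      + "</implementation>"
      + "</extension></plugin>";

    // Returns one "point -> class (name=value)" line per <parameter> in the descriptor.
    static List<String> describe(String xml) {
        List<String> lines = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList extensions = doc.getElementsByTagName("extension");
            for (int i = 0; i < extensions.getLength(); i++) {
                Element ext = (Element) extensions.item(i);
                NodeList impls = ext.getElementsByTagName("implementation");
                for (int j = 0; j < impls.getLength(); j++) {
                    Element impl = (Element) impls.item(j);
                    NodeList params = impl.getElementsByTagName("parameter");
                    for (int k = 0; k < params.getLength(); k++) {
                        Element p = (Element) params.item(k);
                        lines.add(ext.getAttribute("point") + " -> " + impl.getAttribute("class")
                            + " (" + p.getAttribute("name") + "=" + p.getAttribute("value") + ")");
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse descriptor", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Run against the sample, this lists the same Http class twice, once for protocolName=http and once for protocolName=https, which matches the two implementation elements in the descriptor.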

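The minimal configuration files

By generalizing from the above, we can work out the minimal form of the configuration files (I could not find a detailed definition in the Nutch documentation, so this is inferred from the examples).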

build.xml:

<?xml version="1.0"?>
<project name="plugin-id" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

If the plugin does not depend on any third-party libraries, ivy.xml can simply be written like this:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
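<plugin
   id="plugin-id"
   name="plugin name"
   version="plugin version x.x.x"
   provider-name="plugin author">

   <runtime>
      <library name="plugin-id.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="extension ID (usually the extension's package name)"
              name="extension name"
              point="extension point (fully qualified interface name)">

      <implementation id="implementation ID"
                      class="implementation class (fully qualified name)">
         <!-- parameters differ from one extension point to another -->
         <parameter name="protocolName" value="http"/>
      </implementation>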

   </extension>

</plugin>

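Building the plugin

Now that we know how a plugin is structured, we can follow the same pattern. We name the plugin protocol-test, and the extension point it implements is again org.apache.nutch.protocol.Protocol.

Create the directory protocol-test under src/plugin.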

Writing the descriptor files

Create build.xml:

<?xml version="1.0"?>
<project name="protocol-test" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

Create ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

Create plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="protocol-test"
   name="Protocol Plug-in Test"
   version="1.0.0"
   provider-name="mushan">

   <runtime>
      <library name="protocol-test.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.mushan.protocol"
              name="ProtocolTest"
              point="org.apache.nutch.protocol.Protocol">

      <implementation id="com.mushan.protocol.Test"
                      class="com.mushan.protocol.Test">
                      <parameter name="protocolName" value="http"/>
      </implementation>

   </extension>

</plugin>

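Importing the plugin directory into Eclipse

Inside protocol-test, create the directory src/java/com/mushan/protocol. Note that the directory structure mirrors the package part of the class name given in the implementation element.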
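The mapping from class name to source path is mechanical, as this tiny sketch shows. SourcePath is a hypothetical helper written for this article, not part of Nutch or Eclipse.

```java
class SourcePath {
    // Maps a fully qualified class name to its file under a source root.
    static String forClass(String sourceRoot, String className) {
        return sourceRoot + "/" + className.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        // prints src/java/com/mushan/protocol/Test.java
        System.out.println(forClass("src/java", "com.mushan.protocol.Test"));
    }
}
```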

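Refresh the project first, then open the properties of the nutch project and go to Java Build Path > Source > Add Folder...

[Screenshot: the Source tab]

In the dialog, select the plugin's source directory and add it:

[Screenshot: adding the plugin folder]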

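Writing the core plugin code

Create the Test class, and remember to select org.apache.nutch.protocol.Protocol as its interface, which is the extension point we are implementing:

[Screenshot: the New Class dialog for Test]

The code of the Test class is as follows: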

package com.mushan.protocol;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.RobotRulesParser;

import crawlercommons.robots.BaseRobotRules;
public class Test implements Protocol {
    private Configuration conf = null;
    @Override
    public Configuration getConf() {
        return this.conf;
    }
    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
    @Override
    public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
        // The page content returned for every URL is "hello world"
        Content c = new Content(url.toString(), url.toString(), "hello world".getBytes(), "text/html", new Metadata(), this.conf);
        return new ProtocolOutput(c);
    }
    @Override
    public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
        return RobotRulesParser.EMPTY_RULES;    // no robots rules
    }
}
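One small detail in the class above: "hello world".getBytes() uses the platform's default charset. For plain ASCII text this makes no practical difference, but if you later return non-ASCII content you may prefer to name the charset explicitly. A minimal sketch of that variation follows; HelloBytes is a hypothetical helper, and choosing UTF-8 is my assumption rather than anything Nutch requires.

```java
import java.nio.charset.StandardCharsets;

class HelloBytes {
    // Encodes the page body with an explicit charset instead of the platform default.
    static byte[] body() {
        return "hello world".getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints 11, one byte per ASCII character
        System.out.println(body().length);
    }
}
```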

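That completes the plugin code. To get Nutch to build and load the plugin, however, some configuration is still needed.

Integrating the plugin into Nutch

Edit conf/nutch-site.xml to enable our plugin: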

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>

  <property>
    <name>plugin.folders</name>
    <value>build/plugins</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <!-- the original protocol-http is replaced with protocol-test -->
    <value>protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
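The plugin.includes value is a regular expression over plugin IDs. The sketch below checks a few IDs against the value used above with java.util.regex. It assumes the whole plugin ID has to match the expression, which is my understanding of how Nutch applies it; PluginIncludes is a hypothetical helper, not a Nutch class.

```java
import java.util.regex.Pattern;

class PluginIncludes {
    // The same expression as in nutch-site.xml above.
    static final String INCLUDES =
        "protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
      + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is enabled when its whole ID matches the expression.
    static boolean isIncluded(String pluginId) {
        return Pattern.compile(INCLUDES).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"protocol-test", "protocol-http", "parse-tika", "index-anchor"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

With this value, protocol-test matches and protocol-http does not, which is exactly the swap we want.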

In src/plugin/build.xml, add the following inside <target name="deploy">:

<ant dir="protocol-test" target="deploy"/>

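This way the plugin's jar is generated whenever Ant builds the plugins.

Building everything takes a while, though, so we define a temporary Ant target that builds only our plugin. Add the following to src/plugin/build.xml: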

<target name="my-plugin">
    <ant dir="protocol-test" target="deploy"/>
</target>

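Run an Ant build in Eclipse to build the plugin:

[Screenshot: Ant build settings]

[Screenshot: selecting the Ant target]

Click Run, and plugin.xml and protocol-test.jar are generated in the plugins\protocol-test directory. If you get an error, see the ivy problem in the error log section.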

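Writing the main class

With the plugin in place, we define a main class to test it. The Main class runs a simulated crawl and dumps the fetched data to the data/readseg directory.

Create the class Main.java: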

package com.mushan;

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class Main {

  public static void main(String[] args) {
    String[] injectArgs = {"data/crawldb", "urls/"};
    String[] generatorArgs = {"data/crawldb", "data/segments", "-noFilter"};
    String[] fetchArgs = {"data/segments/"};
    String[] readsegArgs = {"-dump", "data/segments/", "data/readseg", "-noparsetext", "-noparse", "-noparsedata"};

    // Start from a clean state: remove the output of any previous run.
    File dataFile = new File("data");
    if (dataFile.exists()) {
      print("delete");
      deleteDir(dataFile);
    }

    try {
      // Inject the seed URLs, then generate a segment to fetch.
      ToolRunner.run(NutchConfiguration.create(), new Injector(), injectArgs);
      ToolRunner.run(NutchConfiguration.create(), new Generator(), generatorArgs);

      // The segment directory is named by the generator, so look it up afterwards.
      File segPath = new File("data/segments");
      String[] list = segPath.list();
      print(Arrays.asList(list));

      // Fetch the generated segment.
      fetchArgs[0] = fetchArgs[0] + list[0];
      ToolRunner.run(NutchConfiguration.create(), new Fetcher(), fetchArgs);

      // Dump the fetched segment to data/readseg.
      readsegArgs[1] += list[0];
      SegmentReader.main(readsegArgs);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  // Recursively deletes a file or directory.
  private static boolean deleteDir(File dir) {
    if (dir.isDirectory()) {
      String[] children = dir.list();
      for (int i = 0; i < children.length; i++) {
        boolean success = deleteDir(new File(dir, children[i]));
        if (!success) {
          return false;
        }
      }
    }
    return dir.delete();
  }

  public static final void print(Object text) {
    System.out.println(text);
  }
}
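Main simply takes list[0] as the segment, which works here because the data directory is deleted before every run and so only one segment exists. If you ever keep older segments around, a slightly more defensive variation is to pick the newest one by name, since segment directories are named with a timestamp such as 20150212094811. The sketch below is a hypothetical helper, SegmentPicker, not something Main or Nutch provides.

```java
import java.util.Arrays;

class SegmentPicker {
    // Returns the lexicographically greatest name, which is the newest segment when
    // names are fixed-width timestamps; returns null for a null or empty listing.
    static String newest(String[] names) {
        if (names == null || names.length == 0) {
            return null;
        }
        String[] sorted = names.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1];
    }

    public static void main(String[] args) {
        String[] segments = {"20150211080000", "20150212094811"};
        // prints 20150212094811
        System.out.println(newest(segments));
    }
}
```

Returning null for an empty listing also gives the caller a chance to report a missing segment instead of failing with an array index error.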

Run the Main class. Now comes the exciting part, because you will run into plenty of errors...

Error log

The ivy problem

BUILD FAILED
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build.xml:81: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\protocol-test\build.xml:4: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build-plugin.xml:47: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.

This happens because the Ant bundled with Eclipse does not have ivy installed. Go to Window > Preferences:

[Screenshot: adding ivy.jar]

Just select your ivy.jar file.

File permission error

Generator: java.io.IOException: Failed to set permissions of path: \tmp\hadoop-mzb\mapred\staging\mzb1466704581\.staging to 0700

See "Running Nutch in Eclipse".

Verification

If running Main prints the following:

...
Injector: starting at 2015-02-12 09:48:04
Injector: crawlDb: data/crawldb

...
Fetcher: finished at 2015-02-12 09:48:18, elapsed: 00:00:05
SegmentReader: dump segment: data/segments/20150212094811
SegmentReader: done

then the fetch succeeded!

Open data/readseg/dump, the file that holds the dumped fetch data:

Recno:: 0
URL:: http://nutch.apache.org/index.html

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Feb 12 09:48:07 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815

Content::
Version: -1
url: http://nutch.apache.org/index.html
base: http://nutch.apache.org/index.html
contentType: text/html
metadata: nutch.segment.name=20150212094811 _fst_=33 nutch.crawl.score=1.0 
Content:
hello world
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Feb 12 09:48:14 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815
  _pst_=success(1), lastModified=0
  Content-Type=text/html

As you can see, the content is the hello world we returned.
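If you would rather check this programmatically than by eye, the dump is plain text and easy to scan. The sketch below pulls out whatever sits between the "Content:" line and the next "CrawlDatum::" marker. DumpCheck is a hypothetical helper, and the markers are taken from the dump shown above rather than from any documented format, so treat it as a convenience for this experiment only.

```java
class DumpCheck {
    // Returns the text between the first "Content:" line and the next "CrawlDatum::"
    // marker, or null when the dump has no "Content:" line.
    static String contentOf(String dump) {
        String text = dump.replace("\r\n", "\n");
        String marker = "Content:\n";
        int start = text.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = text.indexOf("CrawlDatum::", start);
        if (end < 0) {
            end = text.length();
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String sample = "Content::\nVersion: -1\ncontentType: text/html\n"
            + "Content:\nhello world\nCrawlDatum::\nVersion: 7\n";
        // prints hello world
        System.out.println(contentOf(sample));
    }
}
```

To use it on a real run, read data/readseg/dump into a string first and pass that to contentOf.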
