Deploying Nutch 1.4 on Ubuntu

Date: 2019-11-20

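I have been studying the Heritrix web crawler and Lucene for a while, and I set my mind on building my graduation project with Heritrix+Lucene. Teaching yourself is tiring when you have no clear direction. I kept hoping to intern at a search company for a while, but graduation is almost here and that hope is fading, so for now I just want to get my hands on as many new things as I can.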

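I have now started learning Nutch 1.4. Few of the articles online cover version 1.4, so I wrote this one in the hope that it helps people who want to learn about web crawlers, and that it spares you the many detours I took. Enough preamble, let's get to the point.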

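The Nutch wiki has a dedicated tutorial at http://wiki.apache.org/nutch/NutchTutorial. I am translating it here in the hope that it is useful to people who want to learn. The first step is installing Nutch. I will not cover that; just download it from the official site.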

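I will not go into installing the JDK or configuring environment variables either, since there are plenty of examples online. After downloading Nutch 1.4, extract it into a directory such as /home/chenyanting/nutch with: tar zxvf apache-nutch-1.4-bin.tar.gz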

Once it is extracted, change into /home/chenyanting/nutch/apache-nutch-1.4-bin/runtime/local
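
The extract-and-enter steps can be combined into a short script. This is a minimal sketch, assuming the archive sits in the current directory and that `$HOME/nutch` stands in for the `/home/chenyanting/nutch` directory used above. The variable names are illustrative and not part of Nutch.

```shell
# Assumed locations (illustrative names, adjust to your own setup).
ARCHIVE="apache-nutch-1.4-bin.tar.gz"   # downloaded archive, assumed to be in the current directory
NUTCH_PARENT="$HOME/nutch"              # stands in for /home/chenyanting/nutch
NUTCH_LOCAL="$NUTCH_PARENT/apache-nutch-1.4-bin/runtime/local"

if [ -f "$ARCHIVE" ]; then
    mkdir -p "$NUTCH_PARENT"
    # -C makes tar extract into the target directory rather than the current one
    tar zxvf "$ARCHIVE" -C "$NUTCH_PARENT"
    cd "$NUTCH_LOCAL" || echo "expected directory not found: $NUTCH_LOCAL"
else
    echo "archive not found: $ARCHIVE"
fi
```

Keeping the path in one variable makes it easy to reuse in later steps, since every following command is run from that `runtime/local` directory.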

Run ./bin/nutch in this directory. If you do not see the following output:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl             one-step crawler for intranets
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
parse             parse a segment's pages
readseg           read / dump segment data
mergesegs         merge several segments, with optional filtering and slicing
updatedb          update crawl db from segments after fetching
invertlinks       create a linkdb from parsed segments
mergelinkdb       merge linkdb-s, with optional filtering
solrindex         run the solr indexer on parsed segments and linkdb
solrdedup         remove duplicates from solr
solrclean         remove HTTP 301 and 404 documents from solr
parsechecker      check the parser for a given url
indexchecker      check the indexing filters for a given url
domainstats       calculate domain statistics from crawldb
webgraph          generate a web graph from existing segments
linkrank          run a link analysis program on the generated web graph
scoreupdater      updates the crawldb with linkrank scores
nodedumper        dumps the web graph's node scores
plugin            load a plugin and run one of its classes main()
junit             runs the given JUnit test
or
CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

then the runtime/local/bin/nutch script under the Nutch directory needs execute permission. From inside the bin directory, run: chmod 755 nutch
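
The permission fix can also be written so that it only acts when needed. The sketch below is one way to do that; `ensure_executable` is a hypothetical helper name, and the call assumes the current directory is `runtime/local`.

```shell
# Make a script executable only if it exists and is not executable yet.
# ensure_executable is a hypothetical helper name, not part of Nutch.
ensure_executable() {
    if [ ! -f "$1" ]; then
        echo "no such file: $1"
        return 1
    fi
    if [ ! -x "$1" ]; then
        chmod 755 "$1"
    fi
}

# Run from runtime/local; prints a message if bin/nutch is missing.
ensure_executable bin/nutch || true
```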

Then set JAVA_HOME to the directory where Java is installed:

export JAVA_HOME='/path/to/your/java'
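
If you are unsure where Java is installed, one common approach is to resolve the `java` executable on the PATH and strip the trailing `/bin/java`. The following is a sketch under that assumption; `java_home_from_binary` is a hypothetical helper name, and the result should be checked by eye, because some installations place `java` under a `jre` subdirectory.

```shell
# Derive a JAVA_HOME candidate from the path of a java executable.
# java_home_from_binary is a hypothetical helper name.
java_home_from_binary() {
    # readlink -f (GNU coreutils, as shipped with Ubuntu) resolves symlinks such as /usr/bin/java
    real_path=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java to get the installation directory
    printf '%s\n' "${real_path%/bin/java}"
}

if command -v java >/dev/null 2>&1; then
    JAVA_HOME=$(java_home_from_binary "$(command -v java)")
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
else
    echo "java not found on PATH; set JAVA_HOME by hand"
fi
```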

Next, edit conf/nutch-site.xml under runtime/local and add the following property inside its <configuration> element:
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
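
For orientation, here is a sketch of what a complete minimal file looks like with the property nested inside the root `<configuration>` element. `write_nutch_site` is a hypothetical helper name. It overwrites its target, so the example writes to a scratch file instead of the real `conf/nutch-site.xml`; copy the result over only if your file holds no other settings.

```shell
# Write a minimal nutch-site.xml containing only http.agent.name.
# write_nutch_site is a hypothetical helper; $1 = target file, $2 = agent name.
write_nutch_site() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$2</value>
  </property>
</configuration>
EOF
}

# Write to a scratch file and show it, leaving the real config untouched.
SCRATCH=$(mktemp)
write_nutch_site "$SCRATCH" "My Nutch Spider"
cat "$SCRATCH"
```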

Create a directory to hold the seed URLs:

mkdir -p urls
cd urls
Create a file named seeds.txt in it
and add the addresses you want to crawl, for example:
```
http://nutch.apache.org/
```
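
The seed file can also be created without opening an editor. This is a small sketch that assumes it is run from `runtime/local`, so no `cd` into `urls` is needed:

```shell
# Run from runtime/local.
mkdir -p urls
# One URL per line; > overwrites any existing seeds.txt
printf '%s\n' 'http://nutch.apache.org/' > urls/seeds.txt
cat urls/seeds.txt
```

To add more seeds later, append with `>>` instead of `>`.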
Edit conf/regex-urlfilter.txt (under runtime/local) and replace its last line with:

+^http://([a-z0-9]*\.)*nutch.apache.org/
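
Before crawling, it can help to check which URLs the pattern accepts. The sketch below approximates the filter with `grep -E`; Nutch itself evaluates the rule with Java regular expressions, so treat this as a rough check. The leading `+` is the filter's "include" marker and is not part of the regex. `matches_filter` is a hypothetical helper, and the test URLs other than the seed are made-up inputs.

```shell
# The regex part of the filter rule, without the leading "+" marker.
PATTERN='^http://([a-z0-9]*\.)*nutch.apache.org/'

# matches_filter is a hypothetical helper: succeeds if the URL matches PATTERN.
matches_filter() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Made-up sample URLs, used only to exercise the pattern.
for url in 'http://nutch.apache.org/' \
           'http://www.nutch.apache.org/index.html' \
           'http://example.com/'; do
    if matches_filter "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Note that the dots in `nutch.apache.org` are unescaped, so each one matches any character. Writing `nutch\.apache\.org` would be stricter, though the looser form is harmless for this walkthrough.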


Then go back to the local directory and run:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
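
To make the options easier to read, the same command can be assembled from named variables. In this sketch the variable names are illustrative; the follow-up `readdb -stats` call assumes the crawl wrote its database to `crawl/crawldb`, which is where the one-step crawl command places it under the `-dir` directory.

```shell
# Illustrative variable names; the values are the ones used in the command above.
SEED_DIR="urls"     # directory containing seeds.txt
CRAWL_DIR="crawl"   # -dir: where the crawl data is written
DEPTH=3             # -depth: how many link levels to follow from the seed pages
TOPN=5              # -topN: maximum number of pages fetched at each level

CRAWL_CMD="bin/nutch crawl $SEED_DIR -dir $CRAWL_DIR -depth $DEPTH -topN $TOPN"
echo "$CRAWL_CMD"

if [ -x bin/nutch ]; then
    # Word splitting of $CRAWL_CMD is intended here.
    # After a successful crawl, readdb -stats prints a summary of the crawl db.
    $CRAWL_CMD && bin/nutch readdb "$CRAWL_DIR/crawldb" -stats
else
    echo "bin/nutch not found; run this from runtime/local"
fi
```

Small values for depth and topN keep a first test crawl short; raise them once the setup works.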

This article comes from the "陈砚羲" blog. Please contact the author before reposting.