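Periodically crawl web page links, extract the page content, and store it in a database
Workflow
- Provide the list-page URLs to crawl
- Extract all target links from each list page
- Fetch every page those links point to (the crawler)
- Parse the main body content of each page
- Store the result in the database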
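The link-extraction step in the workflow above can be sketched as follows. This is only an illustration, not the article's own implementation: the class `LinkFilterSketch` and the method `linksWithPrefix` are hypothetical names, and a regular expression stands in for a real HTML parser, so unusual markup will be missed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article's code.
public class LinkFilterSketch {

    // Matches href="..." or href='...' inside <a ...> tags (rough approximation).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the distinct links in html that start with prefix, in page order. */
    public static List<String> linksWithPrefix(String html, String prefix) {
        Set<String> links = new LinkedHashSet<>(); // drops duplicates, keeps order
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String link = matcher.group(1).trim();
            if (link.startsWith(prefix)) {
                links.add(link);
            }
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        // Made-up sample markup: two links under the target path (one repeated)
        // and one link outside it.
        String html = "<a href=\"http://www.free9.net/others/gift/1.html\">one</a>"
                + "<a href='http://www.free9.net/about.html'>about</a>"
                + "<a class=\"x\" href=\"http://www.free9.net/others/gift/1.html\">dup</a>";
        System.out.println(linksWithPrefix(html, "http://www.free9.net/others/gift/"));
    }
}
```

Filtering by URL prefix keeps the crawler inside one section of the site, and removing duplicates avoids fetching the same page twice.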
1. Crawl job (main program)
    package com.test;

    import java.util.List;

    /**
     * Main crawl job: downloads a list page, collects the links under the
     * target path, then fetches and parses every linked page.
     */
    public class CatchJob {

        /**
         * Crawls the list page at url. Returns "success" when the run finished
         * and "failure" when an exception interrupted it.
         */
        public String catchJob(String url) {
            try {
                // Download the list page.
                String document = ExtractPage.getContentByUrl(url);

                // Keep only the links under the target path.
                List<?> allLinks = ExtractPage.getLinksByConditions(document, "http://www.free9.net/others/gift/");
                if (allLinks != null && !allLinks.isEmpty()) {
                    for (Object item : allLinks) {
                        String link = (String) item;
                        // Fetch each linked page and hand its HTML to the parser.
                        String content = ExtractPage.getContentByUrl(link);
                        ExtractPage.readByHtml(content);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
                return "failure";
            }
            return "success";
        }

        public static void main(String[] args) {
            long startTime = System.currentTimeMillis();
            System.out.println(">>start.......");

            // Route HTTP requests through a proxy; adjust or remove for your network.
            String httpProxyHost = "211.167.0.131";
            String httpProxyPort = "80";
            System.setProperty("http.proxyHost", httpProxyHost);
            System.setProperty("http.proxyPort", httpProxyPort);

            CatchJob job = new CatchJob();
            System.out.println(job.catchJob("http://www.free9.net/others/gift/"));

            // Report elapsed time as a duration. Formatting the difference as a
            // Date would shift the value by the local time-zone offset.
            long elapsedSeconds = (System.currentTimeMillis() - startTime) / 1000;
            System.out.println(">>end.......USE " + elapsedSeconds + " seconds");
        }
    }

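The title mentions periodic crawling, while `main` in the listing runs the job a single time. One possible way to repeat it is sketched below with the JDK's `ScheduledExecutorService`. The class `ScheduledCrawlSketch`, the method `schedule`, and the interval are assumptions made for illustration; the article does not say how the job is actually scheduled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper, not part of the original article's code.
public class ScheduledCrawlSketch {

    /**
     * Runs task immediately, then again each time the given delay has passed
     * since the previous run ended. Returns the executor so the caller can
     * shut it down.
     */
    public static ScheduledExecutorService schedule(Runnable task, long delay, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable guarded = () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs,
                // so log it and let the next run proceed.
                e.printStackTrace();
            }
        };
        scheduler.scheduleWithFixedDelay(guarded, 0, delay, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder task; in the crawler this would be something like
        // () -> new CatchJob().catchJob("http://www.free9.net/others/gift/").
        ScheduledExecutorService scheduler =
                schedule(() -> System.out.println("crawl run"), 1, TimeUnit.SECONDS);
        Thread.sleep(2500);   // let a few runs happen
        scheduler.shutdown(); // stop scheduling further runs
    }
}
```

A fixed delay is used instead of a fixed rate so that a slow crawl never overlaps with, or immediately triggers, the next run.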
2. Fetching and parsing page content