API Example: Downloading a Content Extractor with Java/JavaScript

Date: 2019-12-06


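1. Introduction

This article explains how to download a content extractor through the GooSeeker API using Java and JavaScript; it is a sample program. What is a content extractor, and why take this approach? The idea comes from the open-source Python instant web crawler project: generating content extractors saves programmers a great deal of time. For details, see "Definition of a Content Extractor".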

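2. Downloading the content extractor with Java

This is one in a series of sample programs. Given where programming languages stand today, Java is not a good fit for web content extraction: besides the language being less flexible and convenient, its ecosystem is not very active and the range of available libraries grows slowly. In addition, extracting content from JavaScript-driven dynamic pages is awkward in Java because it requires a JavaScript engine. To download the content extractor with JavaScript instead, skip straight to Part 3.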

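Implementation

Notes:

Use the Java library jsoup (version 1.8.3 or later) to obtain the page DOM quickly and conveniently.
Obtain the XSLT through the GooSeeker API (see "Generate an XSLT for web content extraction in 1 minute").
Use Java's built-in TransformerFactory class to perform the content transformation.

The source code is as follows: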

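// Imports needed by the methods below (place them at the top of the enclosing class file)
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

public static void main(String[] args)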
{
    InputStream xslt = null;
    try
    {
        String grabUrl = "http://m.58.com/cs/qiuzu/22613961050143x.shtml"; // URL of the page to crawl
        String resultPath = "F:/temp/xslt/result.xml"; // where the result file is saved
        // Obtain the XSLT through the GooSeeker API
        xslt = getGsExtractor();
        // Fetch the page and transform it into the result file
        convertXml(grabUrl, xslt, resultPath);
    } catch (Exception e)
    {
        e.printStackTrace();
    } finally
    {
        try
        {
            if (xslt != null)
                xslt.close();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

/**
 * @description Transform the DOM
 */
public static void convertXml(String grabUrl, InputStream xslt, String resultPath) throws Exception
{
    // doc is a jsoup Document
    org.jsoup.nodes.Document doc = Jsoup.parse(new URL(grabUrl).openStream(), "UTF-8", grabUrl);
    W3CDom w3cDom = new W3CDom();
    // w3cDoc is a W3C DOM Document
    org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(doc);
    Source srcSource = new DOMSource(w3cDoc);
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer(new StreamSource(xslt));
    // Close the output file once the transformation has finished
    try (FileOutputStream out = new FileOutputStream(resultPath))
    {
        transformer.transform(srcSource, new StreamResult(out));
    }
}

/**
 * @description Fetch the result returned by the API
 */
public static InputStream getGsExtractor()
{
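    // API endpoint
    String apiUrl = "http://www.gooseeker.com/api/getextractor";
    // Request parameters
    Map<String,Object> params = new HashMap<String, Object>();
    params.put("key", "xxx");  // API key applied for in the GooSeeker member center
    params.put("theme", "xxx");  // extractor name, i.e. the rule name defined in MS谋数台
    params.put("middle", "xxx");  // rule number; required if several rules are defined under the same rule name
    params.put("bname", "xxx"); // 整理箱 (sorting box) name; required if the rule contains several of them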
    String httpArg = urlparam(params);
    apiUrl = apiUrl + "?" + httpArg;
    InputStream is = null;
    try
    {
        URL url = new URL(apiUrl);
        HttpURLConnection urlCon = (HttpURLConnection) url.openConnection();
        urlCon.setRequestMethod("GET");
        is = urlCon.getInputStream();
    } catch (ProtocolException e)
    {
        e.printStackTrace();
    } catch (IOException e)
    {
        e.printStackTrace();
    }
    return is;
}

/**
 * @description Build the request parameters
 */
public static String urlparam(Map<String, Object> data)
{
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Object> entry : data.entrySet())
    {
        try
        {
            sb.append(entry.getKey()).append("=").append(URLEncoder.encode(entry.getValue() + "", "UTF-8")).append("&");
        } catch (UnsupportedEncodingException e)
        {
            e.printStackTrace();
        }
    }
    // Drop the trailing "&" left by the loop
    if (sb.length() > 0)
        sb.setLength(sb.length() - 1);
    return sb.toString();
}

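Running the sample program on the target page returns the following result:
[Figure: screenshot of the returned result]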
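The transformation step in `convertXml` can be tried in isolation, without jsoup or network access. The sketch below uses only the JDK and applies a small XSLT to an in-memory page. The stylesheet and the sample page are hypothetical stand-ins made up for illustration; a real extractor would be downloaded from the GooSeeker API and would look different. The input here must already be well-formed XML, which is why the original program parses real pages with jsoup first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltDemo {
    // Hypothetical stand-in for an extractor; not a real GooSeeker rule
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<listing><title><xsl:value-of select=\"//h1\"/></title></listing>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the given XSLT to a well-formed XML string and return the output as text
    static String transform(String xml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; a real page would come from jsoup via W3CDom
        String page = "<html><body><h1>Room wanted</h1><p>details</p></body></html>";
        System.out.println(transform(page, XSLT));
    }
}
```

The output is a `listing` element whose `title` holds the text of the page's `h1`. Writing to a `StringWriter` instead of a file makes the result easy to inspect while experimenting.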

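3. Downloading the content extractor with JavaScript

Note that if the JavaScript code in this example runs inside a web page, cross-origin restrictions prevent it from crawling pages that belong to other sites. It therefore has to run on a privileged JavaScript engine, such as a browser extension, a self-developed browser, or a JavaScript engine embedded in your own program.

For ease of experimentation, this example still runs in a web page. To get around the cross-origin problem, the target page is saved locally and modified by inserting the JavaScript into it. All this manual work is only for the experiment; production use needs a different approach.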
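To make "cross-origin" concrete: browsers treat two URLs as the same origin only when scheme, host, and port all match. The Java sketch below is a simplified illustration of that comparison, not a browser's actual implementation. The `example.com` URLs are hypothetical pages that would host the script, and only the default ports for http and https are handled.

```java
import java.net.URI;

class SameOriginCheck {
    // Use the explicit port if present, otherwise the scheme's default port
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    // Two URLs share an origin when scheme, host, and port are all equal
    static boolean sameOrigin(String first, String second) {
        URI a = URI.create(first);
        URI b = URI.create(second);
        return a.getScheme().equalsIgnoreCase(b.getScheme())
            && a.getHost().equalsIgnoreCase(b.getHost())
            && effectivePort(a) == effectivePort(b);
    }

    public static void main(String[] args) {
        String target = "http://m.58.com/cs/qiuzu/22613961050143x.shtml";
        // Hypothetical page on another site that hosts the script
        System.out.println(sameOrigin("http://example.com/tool.html", target)); // false
        // A page on the same site as the target
        System.out.println(sameOrigin("http://m.58.com/index.html", target));   // true
    }
}
```

A script loaded from the first page could not read the target page in a normal browser, whereas a script running inside the target page itself, as in the saved-page workaround, can read that page's own DOM.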

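Implementation

Notes:

Include the jQuery library (jQuery 1.9.0 or later).
To work around the cross-origin problem, save the target page to disk in advance.
Insert the JavaScript code into the target page.
Use the GooSeeker API to download the content extractor. The content extractor is an XSLT program; the example below uses jQuery's ajax method to get the XSLT from the API.
Use an XSLT processor to extract the content.

The source code is as follows: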

// The target page is http://m.58.com/cs/qiuzu/22613961050143x.shtml; save it as a local HTML file in advance and insert the code below
$(document).ready(function(){
    $.ajax({
        type: "get", 
        url: "http://www.gooseeker.com/api/getextractor?key=YOUR_APP_KEY&theme=YOUR_RULE_THEME_NAME", 
        dataType: "xml", 
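        success: function(xslt)
        {
            var result = convertXml(xslt, window.document);
            // result is an XML Document, so serialize it to text before displaying it
            alert("result:" + new XMLSerializer().serializeToString(result));
        }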
    });  
});

/* Transform the DOM into an XML document with XSLT */
function convertXml(xslt, dom)
{
    // Create the XSLTProcessor object
    var xsltProcessor = new XSLTProcessor();
    xsltProcessor.importStylesheet(xslt);
    // Use the transformToDocument method
    var result = xsltProcessor.transformToDocument(dom);
    return result;
}

[Figure: screenshot of the returned result]

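4. Outlook

Python can likewise be used to fetch the content of a given page, and its syntax feels more concise. Python examples will be added later; anyone interested is welcome to join the research.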

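5. Related documents

1. Python Instant Web Crawler: API Description

6. GooSeeker open-source code download

1. GooSeeker open-source Python web crawler on GitHub

7. Document revision history

1. 2016-06-24: V1.0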