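A few days ago I saw someone on cnblogs Q&A (博问) asking for a complete list of Chinese idioms (chengyu). I had just been reading about jsoup, so I decided to try it out. The asker had supplied a website, and it turned out to be simple, with only two kinds of pages: an index page that links to the idioms containing a given character, and the page that lists the idioms containing that character.
Below is my code. It uses the jsoup jar.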
package cnblogs.spider;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class IdiomScratch {
    public static void main(String[] args) {
        final String url = "http://www.hydcd.com/cy/chengyu/cy.htm";
        final String urlSub = "http://www.hydcd.com/cy/chengyu/";
        BufferedWriter writer = null;
        try {
            // Index page: decode as gb18030, not the gb2312 the page declares.
            Document doc = Jsoup.parse(new URL(url).openStream(), "gb18030", "http://www.hydcd.com");
            Element cyTable = doc.getElementById("table1");
            Elements aElements = cyTable.getElementsByTag("a");
            List<String> aHrefs = new ArrayList<String>();
            if (null != aElements && aElements.size() > 0) {
                // Collect the link to each per-character idiom page.
                for (int i = 0, size = aElements.size(); i < size; i++) {
                    aHrefs.add(urlSub + aElements.get(i).attr("href"));
                }
                File cytxt = new File("c://cengyu.txt");
                if (!cytxt.exists()) {
                    cytxt.createNewFile();
                }
                // Note: the output file uses the platform default encoding.
                writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(cytxt)));
                String cy = null;
                for (int i = 0, size = aHrefs.size(); i < size; i++) {
                    // Per-character page: every link text in table1 is one idiom.
                    doc = Jsoup.parse(new URL(aHrefs.get(i)).openStream(), "gb18030", "http://www.hydcd.com");
                    cyTable = doc.getElementById("table1");
                    aElements = cyTable.getElementsByTag("a");
                    if (null != aElements && aElements.size() > 0) {
                        int b = 0; // idioms written on the current line
                        for (int j = 0, size2 = aElements.size(); j < size2; j++) {
                            cy = aElements.get(j).text();
                            writer.write(cy + " ");
                            b++;
                            if (b == 8) { // start a new line after 8 idioms
                                b = 0;
                                writer.write("\r\n");
                            }
                        }
                        // Separate the idioms of one page from the next.
                        writer.write("\r\n");
                        if (b != 0) {
                            writer.write("\r\n");
                        }
                        writer.flush();
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (null != writer) {
                try {
                    writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
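A word about the pitfall I hit. At first I paid no attention to encoding, and the resulting txt file always contained some garbled characters. Then I looked at the page source, which declares its encoding as gb2312, so I switched to gb2312, but the output was still wrong. On reflection, gb2312 targets simplified characters and surely cannot cover every character that appears in idioms. So I looked up the Chinese character encodings, found gb18030, gave it a try, and sure enough the garbled characters were gone.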
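To make the coverage difference concrete, here is a minimal sketch that asks each charset whether it can encode a given character. The class name `CharsetCoverage`, the helper `canEncode`, and the two sample characters are my own choices for illustration and are not part of the scraper above. It also assumes a JDK that ships the extended charsets (GB2312 and GB18030), which standard full JDK builds do.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration; not part of the original scraper.
class CharsetCoverage {

    // Returns true if the named charset has an encoding for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // Sample characters, written as escapes so the source file encoding does not matter.
        char common = '\u4e2d';      // U+4E2D, a very common simplified character
        char traditional = '\u9f8d'; // U+9F8D, a traditional character picked as an example

        for (String name : new String[] {"GB2312", "GB18030"}) {
            System.out.println(name + " can encode U+4E2D: " + canEncode(name, common));
            System.out.println(name + " can encode U+9F8D: " + canEncode(name, traditional));
        }
    }
}
```

A character that the decoder's charset does not cover is the kind that shows up as garbage in the output, which is why reading the pages as gb18030 (a superset of gb2312) is the safer choice even though the page declares gb2312.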
The result looks like this:
Below are the txt file with all the idioms and the code:
All idioms + code