Blog Migration Series (3): Crawling cnblogs (博客园) Posts

Date: 2020-05-28  Tags: Blog Migration Series

1. Recap

Blog Migration Series (1): Introduction: http://www.javashuo.com/article/p-ctgxpaub-bu.html

Blog Migration Series (2): Crawling CSDN Blogs: http://www.javashuo.com/article/p-eegrpfzv-x.html

Blog Migration Series (4): Crawling Jianshu (简书) Articles: https://blog.csdn.net/rico_zhou/article/details/83619538

Blog Migration Series (5): Crawling OSChina (开源中国) Blogs: https://blog.csdn.net/rico_zhou/article/details/83619561

Blog Migration Series (6): Crawling Toutiao (今日头条) Articles: https://blog.csdn.net/rico_zhou/article/details/83619564

Blog Migration Series (7): Converting Local Word Documents to HTML: https://blog.csdn.net/rico_zhou/article/details/83619573

Blog Migration Series (8): Summary: https://blog.csdn.net/rico_zhou/article/details/83619599

2. Getting Started (Collecting the Article URLs)

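Crawling cnblogs follows the same approach as crawling CSDN, and the image-download step is even simpler because no request headers need to be set. As before, we take ricozhou's home page as the example and inspect its source: https://www.cnblogs.com/ricozhou/

The home page shows the article list, and the URLs are again quite clean. To make the pattern easier to spot, we look at a blogger with more posts, for example https://www.cnblogs.com/xdp-gacl/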

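When we click through to the next page, the URL changes, and the trailing 2 is clearly the page number. That gives us the pattern for the list-page URLs. As before, right-click to view the page source and work out which tag the articles sit in.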
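The page-number rule can be sketched as a small URL builder. Note that the article does not reproduce the actual list-page URL, so the `default.html?page=N` suffix below is an assumption (it does end in the page number, as described), and `ListPageUrls`, `pageUrl`, and `firstPages` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ListPageUrls {
    // ASSUMPTION: the suffix form is not given in the article; adjust it to
    // whatever the browser shows after clicking "next page".
    static final String ASSUMED_PAGE_SUFFIX = "default.html?page=%d";

    /** Builds the URL of list page number `page` (1-based) for a blog home URL. */
    static String pageUrl(String blogHome, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        String base = blogHome.endsWith("/") ? blogHome : blogHome + "/";
        return base + String.format(ASSUMED_PAGE_SUFFIX, page);
    }

    /** Builds the URLs of list pages 1..count. */
    static List<String> firstPages(String blogHome, int count) {
        List<String> urls = new ArrayList<>();
        for (int p = 1; p <= count; p++) {
            urls.add(pageUrl(blogHome, p));
        }
        return urls;
    }

    public static void main(String[] args) {
        firstPages("https://www.cnblogs.com/xdp-gacl/", 3).forEach(System.out::println);
    }
}
```

Each generated URL can then be passed as the list-page argument of the crawler method.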

Inspection shows that every article URL sits inside an element with class postTitle. The code is as follows:

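// Imports used by this method (HtmlUnit 2.x package names)
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

/**
 * @date Oct 17, 2018 12:30:46 PM
 * @Desc Collects the article URLs found on one list page
 * @param blogMove migration settings; getMoveNum() caps how many URLs are collected
 * @param oneUrl   URL of the list page to scan
 * @param urlList  output list that the article URLs are appended to
 * @throws IOException
 * @throws MalformedURLException
 * @throws FailingHttpStatusCodeException
 */
public void getCnBlogArticleUrlList(Blogmove blogMove, String oneUrl, List<String> urlList)
		throws FailingHttpStatusCodeException, MalformedURLException, IOException {
	// Simulate a browser; try-with-resources closes the WebClient afterwards
	try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
		// Do not abort on JavaScript errors, and skip CSS processing
		webClient.getOptions().setThrowExceptionOnScriptError(false);
		webClient.getOptions().setCssEnabled(false);
		// Do not throw on failing HTTP status codes (for example a missing js file)
		webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
		// Fetch the list page HTML
		HtmlPage page = webClient.getPage(oneUrl);
		Document doc = Jsoup.parse(page.asXml());
		// Each article link is an a.postTitle2 inside a div.postTitle
		for (Element e : doc.select("div.postTitle")) {
			Element linkNode = e.select("a.postTitle2").first();
			if (linkNode == null) {
				continue;
			}
			// Stop once the configured number of articles has been reached
			if (urlList.size() >= blogMove.getMoveNum()) {
				break;
			}
			urlList.add(linkNode.attr("href"));
		}
	}
}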
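Since the method above handles one list page at a time, something has to walk the pages until enough URLs are gathered. The calling code is not shown in the article, so the following is only a sketch of how that loop might look; `UrlCollector`, `collect`, and the `fetchPage` callback are hypothetical, and the callback stands in for the real HtmlUnit and Jsoup fetch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class UrlCollector {
    /**
     * Calls fetchPage for page 1, 2, 3, ... and gathers links until `limit`
     * links are collected, a page comes back empty, or `maxPages` is reached.
     */
    static List<String> collect(IntFunction<List<String>> fetchPage, int limit, int maxPages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= maxPages && urls.size() < limit; page++) {
            List<String> links = fetchPage.apply(page);
            if (links.isEmpty()) {
                break; // ran past the last list page
            }
            for (String link : links) {
                if (urls.size() >= limit) {
                    break;
                }
                if (!urls.contains(link)) {
                    urls.add(link); // skip links already seen on an earlier page
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for the real fetch: two fake pages of two links, then nothing.
        IntFunction<List<String>> fakePages = page -> page <= 2
                ? List.of("post-" + page + "a", "post-" + page + "b")
                : List.of();
        System.out.println(collect(fakePages, 3, 10)); // prints [post-1a, post-1b, post-2a]
    }
}
```

The `maxPages` guard is there so that a site which never returns an empty page cannot keep the loop running forever.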

This gives us the collection of article URLs.

3. Getting Started (Extracting the Article Details)

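As before, we open a single post, taking the one titled 使用爬虫框架htmlunit整合springboot出现的一个不兼容问题 ("An incompatibility when integrating the HtmlUnit crawler framework with Spring Boot") as the example. Opening it in Chrome shows the basic information: the article type (original), title, time, author, view count, body text, images, and so on.

Again, right-click to view the source, locate the corresponding elements, and extract their content.

Part of the code: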

/**
 * @date Oct 17, 2018 12:46:52 PM
 * @Desc Fetches one article and extracts its details
 * @param blogMove migration settings
 * @param url      URL of the article page
 * @param bList    articles collected so far, used for the duplicate check
 * @return the populated Blogcontent, or null if the title duplicates one in bList
 * @throws IOException
 * @throws MalformedURLException
 * @throws FailingHttpStatusCodeException
 */
public Blogcontent getCnBlogArticleMsg(Blogmove blogMove, String url, List<Blogcontent> bList)
		throws FailingHttpStatusCodeException, MalformedURLException, IOException {
	Blogcontent blogcontent = new Blogcontent();
	blogcontent.setArticleSource(blogMove.getMoveWebsiteId());
	// Simulate a browser; try-with-resources closes the WebClient afterwards
	try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
		// Do not abort on JavaScript errors, and skip CSS processing
		webClient.getOptions().setThrowExceptionOnScriptError(false);
		webClient.getOptions().setCssEnabled(false);
		// Do not throw on failing HTTP status codes (for example a missing js file)
		webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
		// Fetch the article page HTML
		HtmlPage page = webClient.getPage(url);

		Document doc = Jsoup.parse(page.asXml());
		// Title
		String title = BlogMoveCnBlogUtils.getCnBlogArticleTitle(doc);
		// A value of 0 means duplicates should be removed
		if (blogMove.getMoveRemoveRepeat() == 0) {
			// Skip this article if its title was already collected
			if (BlogMoveCommonUtils.articleRepeat(bList, title)) {
				return null;
			}
		}
		blogcontent.setTitle(title);
		// Author
		blogcontent.setAuthor(BlogMoveCnBlogUtils.getCnBlogArticleAuthor(doc));
		// Creation time: 0 means keep the original post time, otherwise use now
		if (blogMove.getMoveUseOriginalTime() == 0) {
			blogcontent.setGtmCreate(BlogMoveCnBlogUtils.getCnBlogArticleTime(doc));
		} else {
			blogcontent.setGtmCreate(new Date());
		}
		blogcontent.setGtmModified(new Date());
		// Type
		blogcontent.setType(BlogMoveCnBlogUtils.getCnBlogArticleType(doc));
		// Body
		blogcontent.setContent(BlogMoveCnBlogUtils.getCnBlogArticleContent(doc, blogMove, blogcontent));
	}

	// Remaining fields
	blogcontent.setStatus(blogMove.getMoveBlogStatus());
	blogcontent.setBlogColumnName(blogMove.getMoveColumn());
	// Special handling
	blogcontent.setArticleEditor(blogMove.getMoveArticleEditor());
	blogcontent.setShowId(DateUtils.format(new Date(), DateUtils.YYYYMMDDHHMMSSSSS));
	blogcontent.setAllowComment(0);
	blogcontent.setAllowPing(0);
	blogcontent.setAllowDownload(0);
	blogcontent.setShowIntroduction(1);
	blogcontent.setIntroduction("");
	blogcontent.setPrivateArticle(1);

	return blogcontent;
}
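The duplicate check above relies on `BlogMoveCommonUtils.articleRepeat`, whose body is not shown in the article. A minimal sketch of what such a check might do, assuming it simply compares titles, is given below. `TitleDedup` and `isRepeat` are hypothetical names, and the sketch works on plain title strings rather than on `Blogcontent` objects.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleDedup {
    /** Returns true when `title` equals a title already collected, ignoring surrounding whitespace. */
    static boolean isRepeat(List<String> seenTitles, String title) {
        if (title == null) {
            return false; // nothing to compare against
        }
        String wanted = title.trim();
        for (String seen : seenTitles) {
            if (seen != null && seen.trim().equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        for (String title : new String[] {"First post", "Second post", " First post "}) {
            if (isRepeat(seen, title)) {
                System.out.println("skipped duplicate: " + title.trim());
            } else {
                seen.add(title);
            }
        }
    }
}
```

Comparing by title alone is a simplification: two different posts with the same title would be treated as duplicates.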

The detail extractors:

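/**
 * @date Oct 17, 2018 1:10:19 PM
 * @Desc Extracts the title
 * @param doc parsed article page
 * @return the title text, or an empty string if the title element is missing
 */
public static String getCnBlogArticleTitle(Document doc) {
	// Title link: div#post_detail > h1.postTitle > a
	Element titleLink = doc.select("div#post_detail h1.postTitle a").first();
	return titleLink == null ? "" : titleLink.ownText();
}

/**
 * @date Oct 17, 2018 1:10:28 PM
 * @Desc Extracts the author
 * @param doc parsed article page
 * @return the author name, or an empty string if the author link is missing
 */
public static String getCnBlogArticleAuthor(Document doc) {
	// The first link inside div.postDesc is the author
	Element authorLink = doc.select("div.postDesc a").first();
	return authorLink == null ? "" : authorLink.ownText();
}

/**
 * @date Oct 17, 2018 1:10:33 PM
 * @Desc Extracts the post time
 * @param doc parsed article page
 * @return the post time, or the current time if it is missing or cannot be parsed
 */
public static Date getCnBlogArticleTime(Document doc) {
	Element dateNode = doc.select("div.postDesc span#post-date").first();
	if (dateNode == null) {
		return new Date();
	}
	String date = dateNode.ownText().trim();
	// The time format varies too much, so only one pattern is handled for now
	Date d = DateUtils.formatStringDate(date, DateUtils.YYYY_MM_DD_HH_MM_SS4);
	// Some values do not match the pattern; fall back to the current time
	return d == null ? new Date() : d;
}

/**
 * @date Oct 17, 2018 1:10:37 PM
 * @Desc Extracts the article type
 * @param doc parsed article page
 * @return always "原创" (original)
 */
public static String getCnBlogArticleType(Document doc) {
	// Detection of original (原) / repost (转) / translation (译) is disabled;
	// every post is reported as original.
	// Element pageMsg2 =
	// doc.select("div.article-detail").first().select("h1.header").first().select("div.horizontal")
	// .first();
	// if ("原".equals(pageMsg2.html())) {
	// return "原创";
	// } else if ("转".equals(pageMsg2.html())) {
	// return "转载";
	// } else if ("译".equals(pageMsg2.html())) {
	// return "翻译";
	// }
	return "原创";
}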
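The time extractor above notes that the date format varies too much and handles only one pattern. One way to cope, sketched here with `java.time` instead of the project's own `DateUtils`, is to try a list of candidate patterns in turn. `PostDateParser` is a hypothetical name, and the three patterns are illustrative guesses rather than a list of formats confirmed for cnblogs.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

final class PostDateParser {
    // ASSUMPTION: candidate patterns chosen for illustration only.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"),
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss"));

    /** Tries each pattern in order and returns the first successful parse, if any. */
    static Optional<LocalDateTime> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String text = raw.trim();
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException ignored) {
                // this pattern did not match; try the next one
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("2018-10-17 12:30"));
        System.out.println(parse("not a date").isPresent()); // prints false
    }
}
```

An empty result maps naturally onto the existing fallback to the current time, and a `LocalDateTime` can be turned into a `java.util.Date` with `Date.from(value.atZone(ZoneId.systemDefault()).toInstant())`.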

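/**
 * @date Oct 17, 2018 1:10:41 PM
 * @Desc Extracts the body HTML and optionally localizes its images
 * @param doc         parsed article page
 * @param blogMove    migration settings; getMoveSaveImg() == 0 means save images locally
 * @param blogcontent article being populated; receives the folder name and image list
 * @return the body HTML (with image links replaced when images are saved locally),
 *         or an empty string if the body element is missing
 */
public static String getCnBlogArticleContent(Document doc, Blogmove blogMove, Blogcontent blogcontent) {
	Element body = doc.select("div#post_detail div#cnblogs_post_body").first();
	if (body == null) {
		return "";
	}
	String content = body.toString();
	// Check whether the images need to be replaced
	if (blogMove.getMoveSaveImg() == 0) {
		// Save images locally: collect all image links, download each one,
		// then replace the original links.
		// First create a folder for this article
		String blogFileName = String.valueOf(UUID.randomUUID());
		FileUtils.createFolder(FilePathConfig.getUploadBlogPath() + File.separator + blogFileName);
		blogcontent.setBlogFileName(blogFileName);
		// Match all image links in the body
		List<String> imgList = BlogMoveCommonUtils.getArticleImgList(content);
		// Download them and get back the newly generated image URLs
		List<String> newImgList = BlogMoveCommonUtils.getArticleNewImgList(blogMove, imgList, blogFileName);
		// Join all of the article's image links into one string
		String images = BlogMoveCommonUtils.getArticleImages(newImgList);
		blogcontent.setImages(images);
		// Replace the links in order
		content = getCnBlogNewArticleContent(content, imgList, newImgList);
	}

	return content;
}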

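The remaining code is shared with the other crawlers, so it is not repeated here. The steps are the same as before: take the body HTML, match the img tags, download the images, replace the img links, and return the rewritten HTML.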
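The match and replace steps can be sketched with a regular expression over the body HTML. This is not the project's shared code, which lives in the repository; `ImgLinks`, `extract`, and `replaceInOrder` are hypothetical names. The sketch assumes quoted src attributes and leaves out the download step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgLinks {
    // Captures the quoted src value of each <img ...> tag (group 2).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the src values of all img tags, in document order. */
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    /** Replaces the i-th src value found with newUrls.get(i); extra images are left untouched. */
    static String replaceInOrder(String html, List<String> newUrls) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        int i = 0;
        while (i < newUrls.size() && m.find()) {
            out.append(html, last, m.start(2)).append(newUrls.get(i++));
            last = m.end(2);
        }
        return out.append(html.substring(last)).toString();
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"http://a/1.png\"><img alt='x' src='http://a/2.png'/></p>";
        System.out.println(extract(html));
        System.out.println(replaceInOrder(html, List.of("/local/1.png", "/local/2.png")));
    }
}
```

Replacing by position rather than by string search means an image URL that appears twice in the body is still mapped to the right local file each time. For irregular markup, walking the img elements with Jsoup would be the more robust choice.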

How the result looks on my own site:

Questions and discussion are welcome!

The complete source code is on GitHub: https://github.com/ricozhou/blogmove