python-80: Getting the Article Content

Date: 2019-11-29


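Getting the article content is the second step of this example, and it does not look hard to implement. All the articles we want to fetch are published on the same site, 伯乐在线 (blog.jobbole.com), so their pages share the same markup and structure. That means a single extraction rule is enough for every article. If the pages came from different sites, each site's markup style might differ, which would make extracting the body text harder. Fortunately, we don't have to deal with that here.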

So what is this rule? We can only work it out by looking for patterns in the page source, and to improve accuracy we should examine more than one page.

I won't write out the process here, though I do hope you will go through the analysis yourself; it doesn't take long. Let's go straight to the result.

First, the article title can be found here:

<title>这样的谷歌街景，你确定没见过 - 博客 - 伯乐在线</title>
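Note that the `<title>` text carries a site suffix after the article title. A minimal sketch of trimming it is shown here, written for Python 3. The helper name `strip_site_suffix` and the constant `SITE_SUFFIX` are made up for illustration, and the suffix format is inferred only from the single example above, so it may not hold for every page.

```python
# Hypothetical helper: the suffix is inferred from the one <title> example
# above, not from any documented convention of the site.
SITE_SUFFIX = " - 博客 - 伯乐在线"


def strip_site_suffix(page_title, suffix=SITE_SUFFIX):
    """Return the article title without the trailing site suffix."""
    page_title = page_title.strip()
    if page_title.endswith(suffix):
        return page_title[:-len(suffix)]
    # Leave the title untouched if the assumed suffix is not present.
    return page_title


print(strip_site_suffix("这样的谷歌街景，你确定没见过 - 博客 - 伯乐在线"))
```

If the suffix is absent, the function returns the stripped input unchanged, so it is safe to call on titles that do not follow the assumed pattern.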

Alternatively, you can get both the title and the content from this block of markup:

<div class="grid-8">    
    <!-- BEGIN .post -->
<div class="post-97162 post type-post status-publish format-standard hentry category-geeks tag-4751 tag-4750 odd" id="post-97162">
    
    <!-- BEGIN .entry-header -->
    <div class="entry-header">            
        <h1>这样的谷歌街景，你确定没见过</h1>                        
    </div>
    <!-- BEGIN .entry-header -->
    <!-- BEGIN .entry-meta -->
    <div class="entry-meta">
        <p class="entry-meta-hide-on-mobile">
            2016/01/14 &middot;  <a href="http://blog.jobbole.com/category/geeks/" title="查看 极客 中的所有文章" rel="category tag">极客</a>
                            &middot; <a href="#article-comment"> 2 评论 </a>
             &middot;  <a href="http://blog.jobbole.com/tag/%e5%be%ae%e7%bc%a9%e6%99%af%e8%a7%82/">微缩景观</a>, <a href="http://blog.jobbole.com/tag/%e8%b0%b7%e6%ad%8c%e8%a1%97%e6%99%af/">谷歌街景</a>   
</p>
<!-- JiaThis Button BEGIN -->
<div class="jiathis_style" style="display: block; margin: 0 0px; clear: both;"><span class="jiathis_txt">分享到：</span>
<a class="jiathis_button_tsina"></a>
<a class="jiathis_button_weixin"></a>
<a class="jiathis_button_qzone"></a>
<a class="jiathis_button_fb"></a>
<a class="jiathis_button_douban"></a>
<a class="jiathis_button_readitlater"></a>
<a class="jiathis_button_evernote"></a>
<a class="jiathis_button_ydnote"></a>
<a href="http://www.jiathis.com/share?uid=1745061" class="jiathis jiathis_txt jiathis_separator jtico jtico_jiathis" target="_blank"></a>
<a class="jiathis_counter_style"></a>
</div>
<!-- JiaThis Button END -->
    </div>
    <!-- END .entry-meta -->
    <!-- BEGIN .entry -->
    <div class="entry">
        <script src="http://www.imooc.com/open/courselistrandjs"></script><span style='display:block;margin-bottom:10px;'></span>
        <div class='copyright-area'>本文作者： <a href='http://blog.jobbole.com'>伯乐在线</a> - <a href='http://www.jobbole.com/members/aoi'>伯小乐</a> 。未经作者许可，禁止转载！<br/>欢迎加入伯乐在线<a href='http://group.jobbole.com/category/feedback/writer-team/' target='_blank'>作者团队</a>。</div><p>在德国港口城市汉堡有个历史悠久的城区叫库房区，其中有一个著名的旅游景点 —— Miniatur Wunderland（微缩仙境）。它是世界上最大的铁路微缩模型系统，因此也被称之为「微缩火车乐园」。</p>
<p>「微缩仙境」由格里特·布劳恩和弗雷德里克·布劳恩（他俩还是双胞胎哦）从 2000 年开始投资修建，于 2001 年 8 月完成 3 个主题展区的建设，当年开始对外开放接纳游客。</p>
<p>修完 3 个主题展区后，布劳恩两兄弟还在一直扩建。根据维基百科上的最新数据，「微缩仙境」目前已建完 8 个主题展区，2016 年春季预计将开放「意大利」展区。详情看下表：</p>
<table>
<tbody>

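The article title is in the element with class="entry-header" and the body text is in the element with class="entry"; both sit inside <div class="grid-8">. As mentioned before, this conclusion comes from pattern hunting: note the title shown on the page and the first sentence of the body, search the page source for those strings, find which block contains them, then confirm by comparison that it is the block you want. The process is simple, and after carefully working through it two or three times you will have the hang of it.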
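To make the rule concrete, here is a rough sketch that applies it to a small inline HTML sample using only the Python 3 standard library (`html.parser`), so it runs without network access. This is not the approach used in the post, which relies on BeautifulSoup. The names `ArticleParser`, `extract_article`, and `SAMPLE` are made up for illustration, and the sample markup is a simplified stand-in for a real page.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect <h1> text inside div.entry-header and all text inside div.entry."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []        # one label per currently open <div>
        self.in_h1 = False     # True while inside the header's <h1>
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "entry-header" in classes:
                self.stack.append("header")
            elif "entry" in classes:
                self.stack.append("entry")
            else:
                self.stack.append(None)
        elif tag == "h1" and "header" in self.stack:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()
        elif tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title_parts.append(data)
        elif "entry" in self.stack:
            self.body_parts.append(data)


# Simplified stand-in for a real article page (an assumption, not real data).
SAMPLE = """
<div class="grid-8">
  <div class="post">
    <div class="entry-header"><h1>Sample title</h1></div>
    <div class="entry-meta"><p>2016/01/14</p></div>
    <div class="entry"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </div>
</div>
"""


def extract_article(html):
    """Return (title, body) extracted by the entry-header / entry rule."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.title_parts).strip()
    body = "\n".join(p.strip() for p in parser.body_parts if p.strip())
    return title, body


title, body = extract_article(SAMPLE)
print(title)
print(body)
```

The sketch tracks open `<div>` tags with a stack, so it assumes the divs are properly nested and closed; a page with unbalanced `<div>` tags would need a more forgiving parser.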

So the code for fetching the body text looks like this:

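#!/usr/bin/env python
# -*- coding:UTF-8 -*-
__author__ = '217小月月坑'

'''
Get the contents of the article (Python 2)
'''

import urllib2
from bs4 import BeautifulSoup

# Python 2 workaround so implicit str/unicode conversions use UTF-8
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )

# Fetch the article page
url = "http://blog.jobbole.com/97183/"
request = urllib2.Request(url)
response = urllib2.urlopen(request)

# Name the parser explicitly so bs4 does not pick one implicitly
soup = BeautifulSoup(response.read(), "html.parser")
# The page <title> holds the article title plus the site suffix
title = soup.title.string
print title

# The body text lives in <div class="entry">
contents = soup.find("div", attrs={"class":"entry"})
print contents.get_text()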

The result is as follows:

And with that, the second part is done just as easily. Next we will cover the most interesting part: interaction.