Figure Out in One Minute Why Nobody Reads Your Blog

Date: 2020-06-14

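Many factors affect how much traffic a blog gets: your account's weight on the platform, how many posts you have written, and whether your titles catch the eye. Those take time to build up, so today we look at just one dimension: when you post. Every platform has its own peak hours, whether it is a technical blogging site such as csdn or cnblogs (博客园), or an entertainment platform such as Toutiao, Baidu Tieba, Douyin or Kuaishou. Take Baidu Tieba: its users skew young and "night owls" make up the bulk of them, so 21:00-23:00 is the online peak, when reads and comments are highest. Content creators naturally post in that window to get more of both.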

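What about cnblogs? We don't know yet. Since the site is full of programmers, the statistics ought to be done the way engineers would do them, so let's write a crawler that collects the posting time and read count of every post.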

Languages needed:

    python

    c#

    sql server

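Scraping the data

Open the cnblogs homepage: each entry in the post list shows its posting time and read count. The list goes no further than 200 pages, so all we need is the read count and posting time of every post on those 200 pages.

We'll write the crawler in python with scrapy.

Environment setup:

pip install scrapy installs the crawler framework. Installing scrapy can hit a few pitfalls; if you are new to scrapy, see the tutorial on scrapy and its common pitfalls linked from the original post.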

scrapy startproject cnblogs creates the project

scrapy genspider csblog cnblogs.com creates the spider file

Edit items.py inside the cnblogs project

title: post title

read: read count

date: posting time

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogsItem(scrapy.Item):
    title = scrapy.Field()
    read = scrapy.Field()
    date = scrapy.Field()

Next we write the spider itself. Start by inspecting the HTML structure of the homepage.

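First, a gripe about a pagination pitfall. In https://www.cnblogs.com/#p4 the #p4 looks like the page number, but no matter how many times I changed it, the crawler kept getting the first page.

Only after watching the requests in the browser's debugging tools did I find that this URL is rewritten. To turn pages you have to POST to https://www.cnblogs.com/mvc/AggSite/PostList.aspx and pass the page number in the PageIndex form field.

From here it gets much easier: send requests to that address and pull the data out of the HTML that comes back. Here is the code.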
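As a rough sketch of what such a paging request carries, the snippet below builds the form body for one page without sending anything. The field names and values are copied from the article's spider; the helper name build_post_data is hypothetical, and whether the endpoint still accepts these fields is an assumption.

```python
from urllib.parse import urlencode

# Paging endpoint used by the article's spider.
POST_LIST_URL = "https://www.cnblogs.com/mvc/AggSite/PostList.aspx"


def build_post_data(page_index):
    """Return the form fields for one page of the homepage post list.

    Hypothetical helper: the fields mirror the article's spider, and only
    PageIndex changes from page to page.
    """
    return {
        "CategoryId": "808",
        "CategoryType": "SiteHome",
        "ItemListActionName": "PostList",
        "PageIndex": str(page_index),
        "ParentCategoryId": "0",
        "TotalPostCount": "400",
    }


if __name__ == "__main__":
    # Show the URL and the url-encoded body that would be POSTed for page 4.
    print(POST_LIST_URL)
    print(urlencode(build_post_data(4)))
```

Keeping the payload in one small function makes it easy to check by hand against what the browser's network panel shows.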

# -*- coding: utf-8 -*-
import scrapy

from cnblogs.items import CnblogsItem


class CsblogSpider(scrapy.Spider):
    name = 'csblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/mvc/AggSite/PostList.aspx']
    PageIndex = 1

    def start_requests(self):
        url = self.start_urls[0]
        # cnblogs only serves 200 pages, so request pages 1 through 200
        for each in range(1, 201):
            print("Fetching page", each)
            post_data = {
                'CategoryId': '808',
                'CategoryType': "SiteHome",
                'ItemListActionName': "PostList",
                'PageIndex': str(each),
                'ParentCategoryId': '0',
                'TotalPostCount': '400'
            }
            yield scrapy.FormRequest(url=url, formdata=post_data)

    def parse(self, response):
        # Every post sits in a <div class="post_item">
        for each in response.xpath("/html/body/div[@class='post_item']"):
            # Extract the title
            title = each.xpath('div[@class="post_item_body"]/h3/a/text()').extract()
            # Extract the publication date
            date = each.xpath('div[@class="post_item_body"]/div/text()').extract()
            # Extract the read count
            read = each.xpath('div[@class="post_item_body"]/div/span[@class="article_view"]/a/text()').extract()

            item = CnblogsItem()
            item['title'] = title[0]
            # Drop the surrounding whitespace and the "发布于" (posted on) label
            item['date'] = "".join(date).replace("发布于", "").strip()
            # Keep only the number from "阅读(N)" (reads)
            item['read'] = read[0].replace("阅读(", "").replace(")", "")
            yield item
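The string clean-up in parse is the fragile part of the spider, so here is a minimal standalone sketch of the same idea using regular expressions instead of chained replace calls. The helper names clean_date and clean_read are hypothetical, and the sample strings only assume the footer looks like "发布于 2019-01-21 09:30" and the counter like "阅读(123)".

```python
import re

# Assumed timestamp shape: YYYY-MM-DD HH:MM
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def clean_date(text_nodes):
    """Join the footer's text nodes and return the timestamp, or None."""
    match = DATE_PATTERN.search(" ".join(text_nodes))
    return match.group(0) if match else None


def clean_read(label):
    """Return the integer inside a label such as '阅读(123)', or None."""
    match = re.search(r"\d+", label)
    return int(match.group(0)) if match else None


if __name__ == "__main__":
    # Made-up sample values shaped like the extracted text nodes.
    nodes = [" \r\n ", " \r\n发布于 2019-01-21 09:30 \r\n "]
    print(clean_date(nodes))
    print(clean_read("阅读(123)"))
```

Returning None for anything that does not match keeps one oddly formatted post from stopping the whole crawl.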

The spider is short, which is one of python's strengths.

Run scrapy crawl csblog -o data.xml to save the scraped data as xml.
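Before handing the file to another tool, it can help to sanity-check it. The sketch below reads a feed in the shape scrapy's XML exporter writes by default (an items root with one item element per record). The sample content is made up, and load_items is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the default layout of scrapy's XML feed export.
SAMPLE = """<items>
<item><title>Post A</title><read>120</read><date>2019-01-21 09:30</date></item>
<item><title>Post B</title><read>45</read><date>2019-01-19 22:10</date></item>
</items>"""


def load_items(xml_text):
    """Parse the exported feed into a list of dicts with integer read counts."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("item"):
        rows.append({
            "title": node.findtext("title"),
            "read": int(node.findtext("read")),
            "date": node.findtext("date"),
        })
    return rows


if __name__ == "__main__":
    for row in load_items(SAMPLE):
        print(row)
```

For the real file, ET.parse("data.xml").getroot() would take the place of ET.fromstring.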

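The scraped data is now in a local xml file, so what remains is the statistics. Every trade has its specialists, and for statistics no language beats sql, so python's job ends here.

Storing the data

To make the data easy to query, we load the xml into MS Sql Server, and for that job nothing fits better than Sql Server's old partner C#. There is not much to explain, just a few simple methods.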

// Namespaces used: System, System.Collections.Generic, System.Data, System.IO,
// System.Reflection, System.Xml.Serialization.
// The data class (with its nested item class) and SqlHelper belong to the
// author's project and are not shown here.
static void Main(string[] args)
{
    using (Stream stream = File.OpenRead(@"D:/MyCode/cnblogs/cnblogs/data.xml"))
    {
        data d = (data)Deserialize(typeof(data), stream);
        DataTable dt = ToDataTable<data.item>(d.items);
        dt.TableName = "t_article";
        dt.Columns.Remove("date");
        SqlHelper.ExecuteNonQuery(dt);
    }
}

/// <summary>
/// Convert a List{T} to a DataTable.
/// </summary>
private static DataTable ToDataTable<T>(List<T> items)
{
    var tb = new DataTable(typeof(T).Name);
    PropertyInfo[] props = typeof(T).GetProperties(BindingFlags.Public | BindingFlags.Instance);

    // One column per public instance property
    foreach (PropertyInfo prop in props)
    {
        Type t = GetCoreType(prop.PropertyType);
        tb.Columns.Add(prop.Name, t);
    }

    // One row per item
    foreach (T item in items)
    {
        var values = new object[props.Length];
        for (int i = 0; i < props.Length; i++)
        {
            values[i] = props[i].GetValue(item, null);
        }
        tb.Rows.Add(values);
    }
    return tb;
}

/// <summary>
/// Determine if the specified type is nullable.
/// </summary>
public static bool IsNullable(Type t)
{
    return !t.IsValueType || (t.IsGenericType && t.GetGenericTypeDefinition() == typeof(Nullable<>));
}

/// <summary>
/// Return the underlying type if the type is Nullable, otherwise return the type.
/// </summary>
public static Type GetCoreType(Type t)
{
    if (t != null && IsNullable(t))
    {
        return t.IsValueType ? Nullable.GetUnderlyingType(t) : t;
    }
    return t;
}

/// <summary>
/// Deserialize an object of the given type from an xml stream.
/// </summary>
/// <param name="type"></param>
/// <param name="stream"></param>
/// <returns></returns>
public static object Deserialize(Type type, Stream stream)
{
    XmlSerializer xmldes = new XmlSerializer(type);
    return xmldes.Deserialize(stream);
}

The data is now safely stored in sql server. Next comes the main event: the statistics.

Analyzing the data

-- Total number of posts across the 200 pages
select COUNT(*) from t_article

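-- Which hour of the day gets the most reads?
-- The result shows 9 a.m. has the most reads, which is no surprise.
-- 6 a.m. (5180) and 7 a.m. (55144) differ by roughly a factor of ten,
-- and 7 a.m. and 8 a.m. differ by another factor of three: programmers are
-- arriving at work, and writing code at work is prime time for looking things up.
-- Sure enough, 8, 9, 10, 11, 15 and 16 are the peak hours for reads,
-- all of them within working hours. The surprise is that 22:00 is not low either,
-- so programmers apparently keep at it after getting home (probably working overtime).
select
    CONVERT(INT, CONVERT(varchar(2), time, 108)) as [hour],
    SUM([read]) as [read]
from t_article
group by CONVERT(INT, CONVERT(varchar(2), time, 108))
order by [read] desc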
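For readers without a Sql Server instance at hand, the same hour-of-day grouping can be sketched in plain python. This is a stand-in, not the article's method; reads_by_hour is a hypothetical helper, the rows are made-up samples, and the timestamp format is assumed to be "YYYY-MM-DD HH:MM".

```python
from collections import defaultdict
from datetime import datetime


def reads_by_hour(rows):
    """Sum read counts per posting hour, highest total first.

    rows is an iterable of (posted_at, reads) pairs, where posted_at is a
    string in the assumed "YYYY-MM-DD HH:MM" format.
    """
    totals = defaultdict(int)
    for posted_at, reads in rows:
        hour = datetime.strptime(posted_at, "%Y-%m-%d %H:%M").hour
        totals[hour] += reads
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, not the article's data.
    sample = [
        ("2019-01-21 09:30", 120),
        ("2019-01-21 09:05", 80),
        ("2019-01-19 22:10", 45),
    ]
    print(reads_by_hour(sample))
```

Like the sql query, this groups reads by the hour a post was published, not by the hour it was read.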

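-- How are reads distributed across the week?
-- No surprises here: Wednesday is far ahead of the other six days.
-- Monday through Friday are working days and reads are high on every one of them.
-- On weekends reads drop sharply, because people are resting
-- (surprisingly, not working overtime).
select
    datename(weekday, time) as weekday,
    SUM([read]) as [read]
from t_article
group by datename(weekday, time)
order by [read] desc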
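The weekday breakdown can be sketched the same way in python. Again this is an illustrative stand-in under the same assumed timestamp format; reads_by_weekday is a hypothetical helper and the rows are made up. A fixed list of day names is used so the output does not depend on the system locale.

```python
from collections import defaultdict
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def reads_by_weekday(rows):
    """Sum read counts per posting weekday, highest total first."""
    totals = defaultdict(int)
    for posted_at, reads in rows:
        day = datetime.strptime(posted_at, "%Y-%m-%d %H:%M").weekday()
        totals[WEEKDAYS[day]] += reads
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, not the article's data.
    sample = [
        ("2019-01-21 09:30", 120),  # a Monday
        ("2019-01-23 10:00", 300),  # a Wednesday
        ("2019-01-19 22:10", 45),   # a Saturday
    ]
    print(reads_by_weekday(sample))
```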

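-- Rank posting dates by total reads.
-- Reads are roughly proportional to how long ago a post was published.
-- So if nobody reads the post you worked so hard on, don't worry:
-- time will not let you down.
select
    CONVERT(varchar(100), time, 111),
    sum([read])
from t_article
group by CONVERT(varchar(100), time, 111)
order by sum([read])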

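Summary

Reads peak at 9 a.m., so that is also the best time to post. 8, 9 and 10 are all good choices; if you want more reads, don't miss them.

Saturday and Sunday get the fewest reads, so you can skip posting on those two days and give yourself a break.

Reads grow slowly over time. It doesn't matter if a post gets no reads at first: as long as it has real substance, plenty of readers will keep arriving from search engines and the count will climb.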


Original source: https://www.cnblogs.com/abountme/p/10300737.html