The Scrapy crawler framework.
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
For details, see: https://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Python language basics.
For details, see: https://www.runoob.com/python3/python3-tutorial.html
Goal: fetch 3,000 housing-transaction records from Lianjia for the Pingguoyuan area of Shijingshan District, Beijing, in preparation for subsequent data cleaning and machine learning.
Only basic Scrapy techniques are used here, no advanced ones; readers looking to learn advanced Scrapy usage should look elsewhere.
Definition of the fields to collect: transaction date, price, floor plan, floor area, and so on.
    from scrapy import Item, Field

    class HomelinkItem(Item):
        # define the fields for your item here, like: name = Field()
        deal_time = Field()          # transaction date
        deal_totalPrice = Field()    # total transaction price
        deal_unitPrice = Field()     # transaction unit price
        household_style = Field()    # floor plan (rooms/halls)
        gross_area = Field()         # gross floor area
        usable_area = Field()        # usable floor area
        house_orientation = Field()  # orientation
        floor_number = Field()       # floor
        build_year = Field()         # year built
        year_of_property = Field()   # property tenure in years
        with_elevator = Field()      # elevator or not
        house_usage = Field()        # intended usage
        is_two_five = Field()        # held two/five years (tax status)
    import json

    import scrapy
    from scrapy import Request
    from lxml import etree

    from homelink.items import HomelinkItem  # adjust to your project's items module

    class LianjiaSpider(scrapy.Spider):
        name = 'lianjia'
        allowed_domains = ['bj.lianjia.com']
        start_urls = ['http://bj.lianjia.com/chengjiao/']
        regions = {'pingguoyuan1': '苹果园'}

        def start_requests(self):
            for region in list(self.regions.keys()):
                url = "https://bj.lianjia.com/chengjiao/" + region + "/"
                # One request per region, used to discover the page count
                yield Request(url=url, callback=self.parse, meta={'region': region})

        def parse(self, response):
            region = response.meta['region']
            selector = etree.HTML(response.text)
            # page-data holds a JSON string such as '{"totalPage":100,"curPage":1}'
            sel = selector.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
            sel = json.loads(sel)  # parse it into a dict
            total_pages = sel.get("totalPage")
            for i in range(int(total_pages)):
                url_page = "https://bj.lianjia.com/chengjiao/{}/pg{}/".format(region, str(i + 1))
                yield Request(url=url_page, callback=self.parse_sale)

        def parse_sale(self, response):
            selector = etree.HTML(response.text)
            # Returns a list of detail-page URLs
            house_urls = selector.xpath("//div[@class='content']//div[@class='title']//a/@href")
            for house_url in house_urls:
                yield Request(url=house_url, callback=self.parse_content)

        def parse_content(self, response):
            item = HomelinkItem()
            # Transaction date
            item["deal_time"] = ''.join(response.xpath("//section//p[@class='record_detail']/text()").re(r"\d{4}[-]\d{2}[-]\d{2}"))
            # Total transaction price
            item["deal_totalPrice"] = response.xpath("//section//span/i/text()").extract_first()
            # Transaction unit price
            item["deal_unitPrice"] = response.xpath("//section//div[@class='price']/b/text()").extract_first()
            # Remaining transaction details
            # Note: response.xpath returns a SelectorList, unlike the selector.xpath calls above
            deal_info = response.xpath("//section//ul/li/text()")
            item["household_style"] = deal_info.extract()[0].strip()     # floor plan
            item["gross_area"] = deal_info.extract()[2].strip()          # gross floor area
            item["usable_area"] = deal_info.extract()[4].strip()         # usable floor area
            item["house_orientation"] = deal_info.extract()[6].strip()   # orientation
            item["build_year"] = deal_info.extract()[7].strip()          # year built
            item["floor_number"] = deal_info.extract()[1].strip()        # floor
            item["year_of_property"] = deal_info.extract()[12].strip()   # property tenure
            item["with_elevator"] = deal_info.extract()[13].strip()      # elevator or not
            item["house_usage"] = deal_info.extract()[17].strip()        # intended usage
            item["is_two_five"] = deal_info.extract()[18].strip()        # held two/five years
            yield item
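The pagination step above hinges on the `page-data` attribute that Lianjia's listing pages expose as a JSON string. A minimal sketch of just that step, using a hard-coded sample attribute value instead of a live response (the sample string is an assumption for illustration):

```python
import json

# Sample value of the page-data attribute (hard-coded for illustration;
# on a live page this string comes from the XPath query in parse())
page_data = '{"totalPage":100,"curPage":1}'

sel = json.loads(page_data)            # the attribute is a JSON string
total_pages = int(sel.get("totalPage"))

# Build the per-page listing URLs exactly as the spider's parse() does
region = "pingguoyuan1"
page_urls = [
    "https://bj.lianjia.com/chengjiao/{}/pg{}/".format(region, i + 1)
    for i in range(total_pages)
]
print(page_urls[0])    # https://bj.lianjia.com/chengjiao/pingguoyuan1/pg1/
print(len(page_urls))  # 100
```

Each of these URLs would then be handed to `parse_sale` via a new `Request`.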
In run.py, write:

    from scrapy import cmdline
    cmdline.execute("scrapy crawl lianjia -o lianjia.csv".split())

and run it to start the crawl.
Sample of the fetched data:
    build_year,deal_time,deal_totalPrice,deal_unitPrice,floor_number,gross_area,house_orientation,house_usage,household_style,is_two_five,usable_area,with_elevator,year_of_property
    1999,2019-03-022012-11-25,269,40801,高楼层(共7层),65.93㎡,南 北,普通住宅,1室1厅1厨1卫,满五年,58.15㎡,无,70年
    1994,2019-03-02,359,41876,顶层(共16层),85.73㎡,东 南 北,普通住宅,3室1厅1厨1卫,满两年,暂无数据,有,70年
    1997,2019-03-02,296,50651,中楼层(共16层),58.44㎡,东 南,普通住宅,2室1厅1厨1卫,暂无数据,暂无数据,有,70年
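Since the whole point of the crawl is downstream cleaning, here is a hedged sketch of how the exported CSV might be read and lightly normalized with only the standard library. The helper name `parse_area` and the inlined two-row sample are illustrative assumptions, not part of the article's code:

```python
import csv
import io

# Two rows in the exported format, inlined for illustration
# (in practice you would open lianjia.csv instead)
sample = """build_year,deal_time,deal_totalPrice,gross_area,with_elevator
1994,2019-03-02,359,85.73㎡,有
1997,2019-03-02,296,暂无数据,有
"""

def parse_area(text):
    """Strip the ㎡ suffix; treat 暂无数据 ("no data yet") as missing."""
    if text == "暂无数据":
        return None
    return float(text.rstrip("㎡"))

rows = list(csv.DictReader(io.StringIO(sample)))
areas = [parse_area(r["gross_area"]) for r in rows]
print(areas)  # [85.73, None]
```

The same pattern extends to the other string-valued columns (prices to int, 有/无 to booleans) before the data is fed to a learning pipeline.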
A total of 3,000 transaction records were obtained.
This article used Scrapy's basic programming approach to fetch housing-transaction data from Lianjia.
Since the purpose is to gather data for subsequent cleaning and machine learning, advanced Scrapy usage is deliberately not explored in depth.