Data Extraction: JSON

Date: 2019-11-30

What is data extraction?

Put simply, data extraction is the process of pulling the data we want out of a response.

Unstructured data (HTML, etc.): handled with regular expressions or XPath.
Structured data (JSON, XML, etc.): handled by converting it into Python data types.
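
To make the contrast concrete, here is a minimal sketch. The HTML and JSON snippets and all variable names are made up for illustration; they do not come from any real response.

```python
import json
import re

# Hypothetical response bodies carrying the same piece of information.
html_body = '<div class="movie"><span class="title">Frozen II</span></div>'
json_body = '{"movie": {"title": "Frozen II"}}'

# Unstructured data: describe the surrounding markup with a regular expression.
match = re.search(r'<span class="title">(.*?)</span>', html_body)
title_from_html = match.group(1) if match else None

# Structured data: convert to Python types, then index it like a normal dict.
title_from_json = json.loads(json_body)["movie"]["title"]

print(title_from_html)
print(title_from_json)
```

The regular expression breaks as soon as the markup changes, while the JSON lookup depends only on the key names.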

Converting JSON into Python's built-in data types is straightforward, so when writing a crawler we prefer a URL that returns JSON whenever we can find one.

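JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for people to read and write, and easy for machines to parse and generate.

It suits any scenario where data is exchanged, for example between a website's front end and its back end.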
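
As a rough illustration of how JSON values map onto Python's built-in types, the sketch below parses a made-up document (the keys and values are invented for this example) and prints the resulting type of each value.

```python
import json

# A made-up JSON document covering each JSON value type.
text = '{"name": "douban", "count": 18, "score": 8.5, "tags": ["a", "b"], "ok": true, "extra": null}'

data = json.loads(text)

# object -> dict, string -> str, integer -> int, real number -> float,
# array -> list, true/false -> bool, null -> None
for key, value in data.items():
    print(key, type(value).__name__)
```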

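So the question is: where can we find URLs that return JSON?

1. Use Chrome's developer tools to switch to the mobile version of the page.

2. Capture the traffic of the site's mobile app with a packet-capture tool.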
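
Once a candidate URL turns up in Chrome or in a packet capture, one quick check is whether its body parses as JSON. This is a minimal sketch, assuming the body is already available as a string; is_json is a hypothetical helper name, not something from the original post.

```python
import json


def is_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_json('{"items": []}'))
print(is_json('<html><body>hello</body></html>'))
```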

Any object that has a read() or write() method is a file-like object. For example, after f = open("a.txt", "r"), f is a file-like object.
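
A file-like object does not have to be a file on disk. The sketch below uses io.StringIO from the standard library as an in-memory stand-in; the dictionary contents are made up for illustration.

```python
import io
import json

# io.StringIO is an in-memory text buffer with read() and write(),
# so it counts as a file-like object even though nothing is on disk.
buffer = io.StringIO()
json.dump({"title": "douban", "count": 18}, buffer)

buffer.seek(0)  # rewind to the start before reading back
restored = json.load(buffer)
print(restored)
```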

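import json

import requests


# parse_url is not shown in the original post. This is an assumed stand-in
# that fetches the URL and returns the response body as a string.
def parse_url(url):
    headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X)"}
    response = requests.get(url, headers=headers, timeout=10)
    return response.content.decode()


url = "https://m.douban.com/rexxar/api/v2/subject_collection/movie_showing/items?start=0&count=18&loc_id=108288"
html_str = parse_url(url)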

# json.loads converts a JSON string into a Python object
ret1 = json.loads(html_str)
# pprint(ret1)
# print(type(ret1))

# json.dumps converts a Python object into a JSON string
with open("douban.json","w",encoding="utf-8") as f:
    f.write(json.dumps(ret1,ensure_ascii=False,indent=4))
    # f.write(str(ret1))

# with open("douban.json","r",encoding="utf-8") as f:
#     ret2 = f.read()
#     ret3 = json.loads(ret2)
#     print(ret3)
#     print(type(ret3))


# Use json.load to read the data out of a file-like object
with open("douban.json","r",encoding="utf-8") as f:
    ret4 = json.load(f)
    print(ret4)
    print(type(ret4))

# json.dump writes a Python object into a file-like object
with open("douban1.json","w",encoding="utf-8") as f:
    json.dump(ret1,f,ensure_ascii=False,indent=2)
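
The ensure_ascii and indent arguments used above only affect how the output text looks, not the data it carries. A minimal sketch with a made-up dictionary:

```python
import json

data = {"title": "电影", "count": 18}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)
print(readable)

# Both strings decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == data

# indent only changes the layout, not the data.
print(json.dumps(data, ensure_ascii=False, indent=4))
```

With the default setting, non-ASCII characters are written as \uXXXX escapes, which is why the code above passes ensure_ascii=False to keep the saved file readable.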

In data exchange, JSON acts as the carrier that holds the data passed between the two sides.