from urllib.request import Request, urlopen
from urllib.parse import quote


def get_html(url):
    # Pretend to be a regular Firefox browser so the page is less
    # likely to reject the request as an obvious crawler.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
    }
    request = Request(url, headers=headers)
    response = urlopen(request)
    return response.read().decode()


def save_html(html, filename):
    # Write the fetched page source to a local file.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)


def main():
    content = input("Which Tieba forum do you want to fetch: ")
    num = int(input("How many pages do you want to fetch: "))
    for i in range(num):
        # Tieba paginates in steps of 50 posts via the pn parameter
        # (the original used tpl, which does not paginate).
        url = ('https://tieba.baidu.com/f?fr=ala0&kw=' + quote(content)
               + '&pn={}'.format(i * 50))
        html = get_html(url)
        filename = 'page_' + str(i + 1) + '.html'
        save_html(html, filename)


if __name__ == '__main__':
    main()
1. To crawl the pages, the script needs a main function as the entry point, plus a function that fetches a page (get_html) and one that saves it (save_html).
2. In get_html, set a request header (a browser-like User-Agent) so the site is less likely to detect the crawler; the response is then read and decoded to return the page's HTML source.
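The header trick can be checked without touching the network: urllib stores the headers on the Request object itself (capitalizing the name to "User-agent"), so we can confirm the spoofed browser string before any request is sent. The Tieba URL below is just a placeholder; no connection is opened.

```python
from urllib.request import Request

# A browser-like User-Agent so the server sees a normal Firefox visit.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:76.0) "
                  "Gecko/20100101 Firefox/76.0"
}
request = Request("https://tieba.baidu.com/f", headers=headers)

# urllib capitalizes stored header names, hence "User-agent".
print(request.get_header("User-agent"))
```

Building the Request object is free of side effects; the network round trip only happens once the request is passed to urlopen.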
3. In save_html, open the user-supplied filename in write mode and write the fetched page source into it.
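The save step can be sketched as a write-and-read-back round trip; the file name below is a throwaway example under the system temp directory, not part of the original script.

```python
import os
import tempfile


def save_html(html, filename):
    # 'w' creates or truncates the file; utf-8 keeps Chinese
    # page content intact on disk.
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)


path = os.path.join(tempfile.gettempdir(), 'demo_page.html')
save_html('<html>第1页</html>', path)

# Reading it back returns exactly what was written.
with open(path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```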
4. In main, read the required input and build each URL by string concatenation. Two points to watch: the forum name (e.g. a Tieba such as "python") must be percent-encoded with quote before it goes into the URL, and the page offset is filled in with format.
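The URL-building step can be sketched in isolation: quote percent-encodes the forum name (Chinese characters become %XX escapes, while plain ASCII passes through unchanged), and format fills in the page offset, which Tieba advances by 50 posts per page via the pn parameter. The forum names here are examples only.

```python
from urllib.parse import quote

content = "python"  # a forum name as typed by the user
base = 'https://tieba.baidu.com/f?fr=ala0&kw=' + quote(content)

# One URL per requested page, stepping the pn offset by 50.
urls = [base + '&pn={}'.format(i * 50) for i in range(3)]
print(urls[1])

# Chinese input gets percent-encoded into UTF-8 %XX escapes:
print(quote('李毅'))  # %E6%9D%8E%E6%AF%85
```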