My First Web Crawler

Date: 2021-08-12

Tags: html, array, markdown, ide, url, code, htm, beautifulsoup, class. Category: web crawlers


Libraries used

from urllib.request import urlopen
from bs4 import BeautifulSoup as bf

Send the request and fetch the HTML (the response body comes back as bytes and needs to be converted)

html=urlopen("http://www.baidu.com")
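The note about bytes can be illustrated without touching the network. This is a minimal sketch: the `raw` value is a made-up stand-in for what `html.read()` would return, and UTF-8 is an assumed encoding (a real page may declare a different one).

```python
# Hypothetical response body, standing in for the result of html.read().
raw = "<html><head><title>百度一下</title></head></html>".encode("utf-8")

# read() hands back bytes, not str.
is_bytes = isinstance(raw, bytes)

# Decode into text. UTF-8 is an assumption for this example.
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
```

BeautifulSoup accepts the raw bytes directly, so decoding by hand is only needed when the text is used without it.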


Use BeautifulSoup to turn the fetched content into a structured object

obj=bf(html.read(),'html.parser')

The find_all method finds every tag that matches

logo_pic_info=obj.find_all('img',class_="index-logo-src")


logo_url="http:"+logo_pic_info[1]['src']
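Prepending "http:" suggests the src value is protocol-relative (it starts with //). As an alternative sketch, `urljoin` from the standard library can resolve such a value against the page URL. The image path below is hypothetical, not the real Baidu logo path.

```python
from urllib.parse import urljoin

# URL of the page that was fetched.
page_url = "http://www.baidu.com"

# Hypothetical protocol-relative src value, as might appear in an <img> tag.
src = "//www.baidu.com/img/example_logo.png"

# urljoin borrows the scheme from page_url when src starts with "//".
logo_url = urljoin(page_url, src)
print(logo_url)
```

Unlike string concatenation, this also copes with src values that are already absolute or that are relative paths.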


The tags returned by find_all come back as a list-like collection, so they can be accessed by index just like a list.
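A small offline sketch of the indexing described above. The HTML snippet and its image paths are invented for illustration. Only the class name index-logo-src comes from the code earlier in this post.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two matching <img> tags and one that does not match.
html_doc = """
<div>
  <img class="index-logo-src" src="//www.example.com/img/logo_a.png">
  <img class="index-logo-src" src="//www.example.com/img/logo_b.png">
  <img class="other" src="//www.example.com/img/other.png">
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class is a reserved word in Python, so the keyword argument is class_.
matches = soup.find_all("img", class_="index-logo-src")

count = len(matches)              # the result supports len() like a list
second_src = matches[1]["src"]    # index into the result, then read an attribute
all_srcs = [tag["src"] for tag in matches]  # and it can be iterated

print(count, second_src)
```

If nothing matches, find_all returns an empty result, so indexing it raises IndexError. Checking the length first is a reasonable guard.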