Things that will come in handy
1 hashlib module -- hashing
update(string.encode('utf-8'))  m.hexdigest()
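A minimal sketch of the two calls above (the input string and its digest are just illustrative):

```python
import hashlib

m = hashlib.md5()
m.update('password'.encode('utf-8'))  # update() wants bytes, hence the encode
print(m.hexdigest())                  # 32-char hex digest: 5f4dcc3b5aa765d61d8327deb882cf99
```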
2 requests module
https://blog.csdn.net/shanzhizi/article/details/50903748
r = requests.get(url, params={}, headers={}, cookies=cookies, proxies=proxies)
cookies and proxies are both dicts
Search keywords go into the params argument as a dict
param = {'wd': '火影'}
r = requests.get('https://www.baidu.com/s', params=param)
print(r.status_code)
print(r.url)

Baidu has no anti-crawling measures (Sogou does), so Baidu makes for a simple demo. params holds the xx=xx pairs that come after the ? in a GET URL.
r = requests.post(url, data={}, headers={})
headers = {
'content-type':
'User-Agent':
'Referer':
'Cookie':
'Host':
}
r.encoding = ''  sets a custom encoding, used when decoding the text content. Works hand in hand with r.text.
r.text  the response body as a string
r.content  the response body as bytes
r.status_code
r.request  gives access to the request that was sent
import requests

url = 'https://www.cnblogs.com/654321cc/p/11013243.html'
headers = {
    'User-Agent': 'User-Agent',
}
r = requests.get(url=url, headers=headers)

# request headers
print(r.request.headers)
# response headers
print(r.headers)

# cookies sent with the request
print(r.request._cookies)
# cookies set by the response
print(r.cookies)
r.headers  stores the server's response headers as a dict-like object. Not sure yet when it will come in handy.
r.cookies
r.history
When is r.history useful? Sometimes with 302 redirects. A 302 Found status means the target resource has temporarily moved to another URI; because the redirect is temporary, the client should keep using the original URI in later requests. The server puts the new URI in the Location field of the response header, and the browser can use it to redirect automatically. r.history lets you see the status codes of the pages before the redirect, i.e. r.status_code alone can be misleading. The allow_redirects parameter disables automatic redirects:

r = requests.get('http://www.baidu.com/link?url=QeTRFOS7TuUQRppa0wlTJJr6FfIYI1DJprJukx4Qy0XnsDO_s9baoO8u1wvjxgqN', allow_redirects=False)
>>> r.status_code
302
r.headers  response headers
r.request.headers  request headers
Spoofing the request headers
headers = {'User-Agent': 'liquid'}
r = requests.get('http://www.zhidaow.com', headers=headers)
print(r.request.headers['User-Agent'])
Session objects
s = requests.Session()
s.get()
s.post()
A session object lets you persist certain parameters across requests. Most usefully, cookies persist across all requests made from the same Session instance, and this is all handled automatically. Very convenient.

import requests

headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'gzip, deflate, compress',
           'Accept-Language': 'en-us;q=0.5,en;q=0.3',
           'Cache-Control': 'max-age=0',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}

s = requests.Session()
s.headers.update(headers)
# s.auth = ('superuser', '123')
s.get('https://www.kuaipan.cn/account_login.htm')

_URL = 'http://www.kuaipan.cn/index.php'
s.post(_URL, params={'ac': 'account', 'op': 'login'},
       data={'username': '****@foxmail.com', 'userpwd': '********', 'isajax': 'yes'})
r = s.get(_URL, params={'ac': 'zone', 'op': 'taskdetail'})
print(r.json())
s.get(_URL, params={'ac': 'common', 'op': 'usersign'})
Timeouts and exceptions
The timeout parameter
r = requests.get('https://m.hcomic1.com',timeout = 1)
3 json module -- a lightweight data-interchange format
Files: dump / load
Strings: dumps / loads
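A quick sketch of all four calls (the data and the file name are made up):

```python
import json

data = {'name': 'foo', 'length': 3}

s = json.dumps(data)            # dict -> JSON string
assert json.loads(s) == data    # JSON string -> dict

with open('a.json', 'w', encoding='utf-8') as f:
    json.dump(data, f)          # dict -> file
with open('a.json', encoding='utf-8') as f:
    assert json.load(f) == data  # file -> dict
```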
4 re module
re.S extends the reach of "." (not counting the quotes) to the whole string, including "\n". By default, "." matches any character except a newline; in re.S mode, "." matches newlines too.
re.I makes matching case-insensitive.
In a regular expression, "." matches any character except "\n", which means matching happens within a single line; "lines" here are delimited by "\n", and each line ends with an invisible "\n". Without re.S, matching is attempted line by line: if a line has no match, matching restarts on the next line, and a match never crosses lines. With re.S, the regex treats the whole string as a single unit, folding "\n" in as an ordinary character, and matches across it.
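A tiny demonstration of the difference, with a made-up two-line string:

```python
import re

s = 'first line\nsecond line'

print(re.findall('first.*second', s))        # [] -- '.' will not cross the '\n'
print(re.findall('first.*second', s, re.S))  # ['first line\nsecond'] -- now it does
print(re.findall('FIRST', s, re.I))          # ['first'] -- re.I ignores case
```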
import re

s = 'agejaoigeaojdnghaw2379273589hahjhgoiaioeg87t98w825tgha9e89aye835yyaghe9857ahge878ahsohga9e9q30gja9eu73hga9w7ga8w73hgna9geuahge9aoi753uajghe9as' \
    '8837t5hga8u83758uaga98973gh8e'

res1 = re.findall(r'\d{2,3}[a-zA-Z]{1,}?\d{2,3}', s)
# [ ]    character set: matches any single character inside it
# {m,n}  quantifier: repeats the preceding token m to n times
# a ? after a quantifier makes it non-greedy
# print(res1)  ['589hahjhgoiaioeg87', '98w825', '89aye835', '857ahge878', '758uaga989']

res2 = re.search(r'(\d{2,3})[a-zA-Z]{1,}?(\d{2,3})', s)
# print(res2)  <re.Match object; span=(25, 43), match='589hahjhgoiaioeg87'>
print(res2.group())   # match and search only return the first match
print(res2.group(1))  # the first parenthesized subgroup; group(n) needs a matching ( ) in the pattern
print(res2.group(2))  # the second parenthesized subgroup
# res3 = re.finditer(r'\d{2,3}[a-zA-Z]{1,}?\d{2,3}', s)
# print(res3)  <callable_iterator object at 0x000001DE04A9E048> -- returns an iterator, saving memory
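For example (illustrative string), finditer lets you walk matches lazily instead of building a full list:

```python
import re

for m in re.finditer(r'\d+', 'ab12cd34ef56'):
    # each m is a Match object; no list of all matches is ever built
    print(m.group(), m.span())
```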
5 flask_httpauth module -- authentication?
from flask_httpauth import HTTPBasicAuth
from flask_httpauth import HTTPDigestAuth
6 beautifulsoup module
Beautiful Soup is a Python library for extracting data from HTML and XML files. It lets you navigate, search, and modify the document through your parser of choice, in the idioms you are used to, and can save you hours or even days of work.
https://www.cnblogs.com/linhaifeng/articles/7783586.html#_label2
# The usual beautifulsoup routine: use find_all to get a list of tags --
# this step is "searching the tree". Then pull what you want out of each
# tag (text, links, etc.) -- this step is "accessing a tag's attributes,
# name, and contents". This is usually all you need.
from bs4 import BeautifulSoup
import requests

URL = 'https://www.duodia.com/daxuexiaohua/'

def get_page(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r.text

content = get_page(URL)
soup = BeautifulSoup(content, 'lxml')

# 1 search the tree: name is the tag name; class_ is the class name (class is a keyword)
a_s = soup(name='a', class_='thumbnail-container')
# print(type(a_s[0]))  note the type of a_s[0]: <class 'bs4.element.Tag'>

# 2 access the tag's attributes, name, and contents
for a in a_s:
    print(a.attrs['href'])
    print(a.text)
    print(a.name)
try:
    a = 1
    b = 's'
    c = a + b
except TypeError as e:  # "except ... as" is probably better: it tells you which error occurred
    print('TypeError %s' % e)
else:                   # runs only if no exception was raised
    print('else')
finally:                # runs whether or not an exception was raised
    print('finally')
import os
import time
from threading import Thread

def foo():
    print(os.getpid())
    print('foo')
    time.sleep(2)

def bar():
    print(os.getpid())
    print('bar')
    time.sleep(5)

if __name__ == '__main__':
    t = time.time()
    t1 = Thread(target=foo)
    t2 = Thread(target=bar)
    t1.start()
    t2.start()
    t1.join()  # join blocks the main thread: only after t1 and t2 finish does
    t2.join()  # control return to the main thread and reach the print below
    print('time cost {}'.format(time.time() - t))
join on processes works the same way.
from multiprocessing import Pool
import time
import os
import random

def foo(n):
    time.sleep(random.random())
    return {'name': 'foo', 'length': n}

def save(dic):
    f = open('a.txt', 'a', encoding='utf-8')
    f.write('name:{},length:{}\n'.format(dic['name'], dic['length']))
    f.close()

if __name__ == '__main__':
    n = os.cpu_count()
    pool = Pool(n)
    # print(pool)  <multiprocessing.pool.Pool object at 0x000001DDE9D3E0B8>
    task_list = []
    for i in range(20):
        task = pool.apply_async(foo, args=(i,), callback=save)
        # print(task)  <multiprocessing.pool.ApplyResult object at 0x0000026084D5AFD0>
        task_list.append(task)
    pool.close()
    pool.join()
    for task in task_list:
        print(task.get())

p = Pool()
task = p.apply_async(func=, args=, kwds=, callback=)  note what kind of object task is; tasks are added asynchronously. callback takes one and only one argument: the return value of func. Used well, it saves a lot of work.
p.close()
p.join()
task.get()
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import requests

def get(url):
    r = requests.get(url)
    return {'url': url, 'text': r.text}

def parse(future):
    dic = future.result()  # call result() on the Future object to get its value
    f = open('db.text', 'a')
    date = 'url:%s\n' % len(dic['text'])
    f.write(date)
    f.close()

if __name__ == '__main__':
    executor = ThreadPoolExecutor()
    url_l = ['http://cn.bing.com/',
             'http://www.cnblogs.com/wupeiqi/',
             'http://www.cnblogs.com/654321cc/',
             'https://www.cnblogs.com/',
             'http://society.people.com.cn/n1/2017/1012/c1008-29581930.html',
             'http://www.xilu.com/news/shaonianxinzangyou5gedong.html',
             ]
    for url in url_l:
        # Unlike Pool, whose callback receives the function's return value
        # (what ApplyResult.get() would give), the callback here, parse,
        # receives the Future object produced by submit.
        executor.submit(get, url).add_done_callback(parse)
    executor.shutdown()
    print('主')
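The same submit/add_done_callback pattern, with the network call swapped for a pure function so it runs anywhere and the contrast with Pool's callback is easier to see (all names here are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

results = []

def collect(future):
    # unlike Pool's callback (which gets the return value directly), this
    # callback receives the Future itself and must call result() on it
    results.append(future.result())

with ThreadPoolExecutor(max_workers=4) as executor:
    for i in range(5):
        executor.submit(square, i).add_done_callback(collect)

# leaving the with-block waits for all futures, so results is complete here
print(sorted(results))  # [0, 1, 4, 9, 16]
```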
7 scrapy

In the scrapy shell, response.css("title").getall() returns, for example:
['<title>Quotes to Scrape</title>']
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

This gets all first <li> elements under whatever it is its parent:

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
['<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
['<li>1</li>']
next_page = response.urljoin(next_page)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages. What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request. You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before. Here we're passing callbacks to response.follow as positional arguments to make the code shorter; this also works for scrapy.Request. The parse_author callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

1. You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests. The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL in start_urls, with the parse method as the callback.

2. In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same one), will be downloaded by Scrapy, and their responses handled by the specified callback.

3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

4. Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
start_requests()
If you want to change the Requests used to start scraping a domain, this is the method to override. For example, if you need to start by logging in using a POST request, you could do:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield {"title": h3}
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

Instead of start_urls you can use start_requests() directly; to give data more structure you can use Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)
        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)
scrapy genspider -t crawl xx xx.com
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').get()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').get()
        item['link_text'] = response.meta['link_text']
        return item
Selectors
Overview
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.
A node converted to a string, however, puts together the text of itself plus of all its descendants:

>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']

So, using the .//text() node-set won't select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.
Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
Here’s an example to match an element based on its normalized string-value:

>>> str_to_match = "Name: My image 3"
>>> selector.xpath('//a[normalize-space(.)=$match]',
...                match=str_to_match).get()
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>'
Here’s another example using a position range passed as two integers:

>>> start, stop = 2, 4
>>> selector.xpath('//a[position()>=$_from and position()<=$_to]',
...                _from=start, _to=stop).getall()
['<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>']
Named variables can be useful when strings need to be escaped for single or double quote characters. The example below would be a bit tricky to get right (or legible) without a variable reference:

>>> html = u'''<html>
... <body>
... <p>He said: "I don't know why, but I like mixing single and double quotes!"</p>
... </body>
... </html>'''
>>> selector = Selector(text=html)
>>>
>>> selector.xpath('//p[contains(., $mystring)]',
...                mystring='''He said: "I don't know''').get()
'<p>He said: "I don\'t know why, but I like mixing single and double quotes!"</p>'
Built-in Selectors reference
Selector objects
attrib: Return the attributes dictionary for the underlying element.
register_namespace ?
remove_namespaces ?
SelectorList objects
attrib: Return the attributes dictionary for the first element. If the list is empty, return an empty dict.
... ...
Selecting element attributes
https://docs.scrapy.org/en/latest/topics/selectors.html#selecting-attributes
Three ways
xpath('//title/@href').getall()
The main goal in scraping is to extract structured data from unstructured sources, typically, web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
Field objects are used to specify metadata for each field. For example, the serializer function for the last_updated field illustrated in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project too, for your own needs.

The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.
class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field()

item = Myitem({'name': 'z', 'age': 28, 'salary': 10000})
print(item.fields)
print(Myitem.fields)
===>
{'age': {}, 'name': {}, 'salary': {}}
{'age': {}, 'name': {}, 'salary': {}}

class Myitem(scrapy.Item):
    name = scrapy.Field()
    age = scrapy.Field()
    salary = scrapy.Field(dd='geagd')

item = Myitem({'name': 'z', 'age': 28, 'salary': 10000})
print(item.fields)
print(Myitem.fields)
===>
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}
{'age': {}, 'name': {}, 'salary': {'dd': 'geagd'}}
Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is: fields
The Field class is just an alias to the built-in dict class and doesn’t provide any extra functionality or attributes. In other words, Field objects are plain-old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
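A toy re-implementation (not Scrapy's actual code) of the idea: a metaclass can collect class-attribute Field declarations into a fields dict, which is what makes the declaration syntax above work:

```python
class Field(dict):
    """Plain dict; exists only so declarations can be recognized by type."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        # pull Field declarations out of the class body into a fields dict
        fields = {k: v for k, v in attrs.items() if isinstance(v, Field)}
        attrs = {k: v for k, v in attrs.items() if not isinstance(v, Field)}
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls

class Item(metaclass=ItemMeta):
    pass

class Product(Item):
    name = Field()
    price = Field(serializer=str)

print(Product.fields)  # {'name': {}, 'price': {'serializer': <class 'str'>}}
```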
Extending Items
You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item. For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
Item Loader
Overview -- quickly populating items. This seems to be a newer Scrapy feature: no more manually extracting data and assigning each Item field yourself. Saves a ton of work.
In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container. Item Loaders are designed to provide a flexible, efficient and easy mechanism for extending and overriding different field parsing rules, either by spider, or by source format (HTML, XML, etc) without becoming a nightmare to maintain.
A simple example
from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    # instantiate first; the ItemLoader.default_item_class attribute
    # controls which class is instantiated by default
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')   # three ways to collect values:
    l.add_xpath('name', '//div[@class="product_title"]')  # add_xpath, add_css, add_value
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()  # returns the item populated with the data previously extracted
To use an Item Loader, you must first instantiate it. You can either instantiate it with a dict-like object (e.g. Item or dict) or without one, in which case an Item is automatically instantiated in the Item Loader constructor using the Item class specified in the ItemLoader.default_item_class attribute.

Then, you start collecting values into the Item Loader, typically using Selectors. You can add more than one value to the same item field; the Item Loader will know how to “join” those values later using a proper processing function.

Here is a typical Item Loader usage in a Spider, using the Product item declared in the Items chapter:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()

By quickly looking at that code, we can see the name field is being extracted from two different XPath locations in the page:

//div[@class="product_name"]
//div[@class="product_title"]

In other words, data is being collected by extracting it from two XPath locations, using the add_xpath() method. This is the data that will be assigned to the name field later.

Afterwards, similar calls are used for the price and stock fields (the latter using a CSS selector with the add_css() method), and finally the last_updated field is populated directly with a literal value (today) using a different method: add_value().

Finally, when all data is collected, the ItemLoader.load_item() method is called, which actually returns the item populated with the data previously extracted and collected with the add_xpath(), add_css(), and add_value() calls.
MapCompose

This processor provides a convenient way to compose functions that only work with single values (instead of iterables). For this reason the MapCompose processor is typically used as an input processor, since data is often extracted using the extract() method of selectors, which returns a list of unicode strings.
>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, str.upper)
>>> proc(['hello', 'world', 'this', 'is', 'scrapy'])
['HELLO', 'THIS', 'IS', 'SCRAPY']
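The semantics are easy to restate in plain Python (a sketch of the behaviour, not Scrapy's implementation; it skips MapCompose's flattening of list results): each value is passed through each function in turn, and None results are dropped.

```python
def map_compose(*functions):
    def processor(values):
        for fn in functions:
            # apply fn to each value, dropping values for which fn returns None
            values = [r for r in (fn(v) for v in values) if r is not None]
        return values
    return processor

proc = map_compose(lambda x: None if x == 'world' else x, str.upper)
print(proc(['hello', 'world', 'this', 'is', 'scrapy']))
# ['HELLO', 'THIS', 'IS', 'SCRAPY']
```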
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    name_in = MapCompose(unicode.title)  # Python 2 docs example; on Python 3 use str.title / str.strip
    name_out = Join()

    price_in = MapCompose(unicode.strip)
import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('name', [u'Welcome to my', u'<strong>website</strong>'])
>>> il.add_value('price', [u'€', u'<span>1000</span>'])
>>> il.load_item()
{'name': u'Welcome to my website', 'price': u'1000'}
loader = ItemLoader(item=Item())
# load stuff not in the footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class = "social"]/@href')
footer_loader.add_xpath('email', 'a[@class = "email"]/@href')
# no need to call footer_loader.load_item()
loader.load_item()
scrapy shell
Setting the shell tool to ipython (or bpython)
Configure it in scrapy.cfg:
[settings]
shell = bpython
from scrapy.shell import inspect_response
Here’s an example of how you would call it from your spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

When you run the spider, you will get something similar to this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it doesn’t. So you can open the response in your web browser and see if it’s the response you were expecting:

>>> view(response)
True

Finally you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
An item pipeline component is a Python class that implements a simple method. It receives an item and performs an action over it, also deciding if the item should continue through the pipeline or be dropped and no longer processed.
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
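The contract can be sketched without Scrapy at all; DropItem and the driver loop below are stand-ins for what Scrapy does internally, and all names are made up:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class RequireNamePipeline:
    def process_item(self, item, spider):
        if not item.get('name'):
            raise DropItem('missing name in %r' % item)
        return item  # whatever is returned feeds the next component

def run_pipelines(item, pipelines, spider=None):
    # mimic Scrapy: each component's output is the next component's input
    try:
        for p in pipelines:
            item = p.process_item(item, spider)
        return item
    except DropItem:
        return None  # dropped items go no further

print(run_pipelines({'name': 'z'}, [RequireNamePipeline()]))  # {'name': 'z'}
print(run_pipelines({}, [RequireNamePipeline()]))             # None
```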
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.
from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.
(boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request’s callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.
Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument. Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. The following example shows how to achieve this by using the Request.cb_kwargs attribute:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Caution: Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions.
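Reduced to plain Python (a hypothetical structure, no Scrapy), the cb_kwargs idea is just extra keyword arguments stored with the request and applied when the callback fires:

```python
def parse_page2(response, main_url, foo):
    return {'main_url': main_url, 'other_url': response, 'foo': foo}

# a "request" records its callback plus the extra cb_kwargs
request = {'url': 'http://www.example.com/index.html',
           'callback': parse_page2,
           'cb_kwargs': {'main_url': 'http://www.example.com/'}}
request['cb_kwargs']['foo'] = 'bar'  # add more arguments for the callback

# when the "response" arrives, the engine applies the stored kwargs
result = request['callback'](request['url'], **request['cb_kwargs'])
print(result['foo'])  # bar
```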
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
The FormRequest class extends the base Request with functionality for dealing with HTML forms.
class scrapy.http.FormRequest(url[, formdata, ...])

The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.

Parameters: formdata (dict or iterable of tuples) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.
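To see what that url-encoding amounts to, here is a minimal pure-Python sketch using the stdlib's urlencode (the form values are made up; this mirrors what FormRequest does to formdata, it is not Scrapy's own code):

```python
from urllib.parse import urlencode

# Hypothetical form data; FormRequest url-encodes it into the request body
formdata = {'username': 'john', 'password': 'secret'}
body = urlencode(formdata)
print(body)  # username=john&password=secret
```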
The following example may be useful.
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:

HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).

Response.request.url doesn't always equal Response.url.

This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta). Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
link extractors
https://docs.scrapy.org/en/latest/topics/link-extractors.html
from scrapy.linkextractors import LinkExtractor
The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once and their extract_links method called several times with different responses to extract links to follow.
Parameters
allow
deny
restrict_xpath
process_value
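A rough pure-Python illustration of how the allow and deny parameters filter candidate URLs (this is only a sketch of the idea, not the real LinkExtractor implementation; the URLs are made up):

```python
import re

def filter_links(urls, allow=None, deny=None):
    # keep a URL if it matches the allow pattern (when given)
    # and does not match the deny pattern (when given)
    result = []
    for url in urls:
        if allow and not re.search(allow, url):
            continue
        if deny and re.search(deny, url):
            continue
        result.append(url)
    return result

urls = [
    'http://example.com/category/page1.html',
    'http://example.com/login',
    'http://example.com/category/page2.html',
]
print(filter_links(urls, allow=r'/category/', deny=r'login'))
```

restrict_xpaths additionally limits where in the document links are collected from, before these regex filters are applied.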
When you use Scrapy, you have to tell it which settings you’re using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE.
The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings. Note that the settings module should be on the Python import search path.
1 Command line options (most precedence)
2 Settings per-spider
3 Project settings module
4 Default settings per-command
5 Default global settings (least precedence)
Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option. Example:

scrapy crawl myspider -s LOG_FILE=scrapy.log
Spiders (see the Spiders chapter for reference) can define their own settings that will take precedence and override the project ones. They can do so by setting their custom_settings attribute:

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }
In a spider, the settings are available through self.settings:

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
Note: The settings attribute is set in the base Spider class after the spider is initialized. If you want to use the settings before the initialization (e.g., in your spider's __init__() method), you'll need to override the from_crawler() method.

Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is passed to the from_crawler method in extensions, middlewares and item pipelines:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))

The settings object can be used like a dict (e.g., settings['LOG_ENABLED']), but it's usually preferred to extract the setting in the format you need it to avoid type errors, using one of the methods provided by the Settings API.
Default: 100 Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
Default: 0 The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain. This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
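Pulling the concurrency knobs above together, a settings-module sketch (the values simply echo the defaults described above; tune them for your crawl):

```python
# Illustrative concurrency settings for a Scrapy project's settings.py
CONCURRENT_REQUESTS = 16            # global cap across all downloads
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain
CONCURRENT_REQUESTS_PER_IP = 0      # leave 0 so the per-domain cap applies
```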
Default:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

The default headers used for Scrapy HTTP Requests. They're populated in the DefaultHeadersMiddleware.
Default: {} A dict containing the downloader middlewares enabled in your project, and their orders.
Default: 0

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. Decimal numbers are supported. Example:

DOWNLOAD_DELAY = 0.25  # 250 ms of delay

This setting is also affected by the RANDOMIZE_DOWNLOAD_DELAY setting (which is enabled by default). By default, Scrapy doesn't wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.

When CONCURRENT_REQUESTS_PER_IP is non-zero, delays are enforced per IP address instead of per domain.

You can also change this setting per spider by setting the download_delay spider attribute.
Default: {} A dict containing the request downloader handlers enabled in your project.
Default: 180 The amount of time (in secs) that the downloader will wait before timing out. Note This timeout can be set per spider using download_timeout spider attribute and per-request using download_timeout Request.meta key.
Default: 'scrapy.dupefilters.RFPDupeFilter' The class used to detect and filter duplicate requests. The default (RFPDupeFilter) filters based on request fingerprint using the scrapy.utils.request.request_fingerprint function. In order to change the way duplicates are checked you could subclass RFPDupeFilter and override its request_fingerprint method. This method should accept scrapy Request object and return its fingerprint (a string). You can disable filtering of duplicate requests by setting DUPEFILTER_CLASS to 'scrapy.dupefilters.BaseDupeFilter'. Be very careful about this however, because you can get into crawling loops. It’s usually a better idea to set the dont_filter parameter to True on the specific Request that should not be filtered.
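The fingerprinting idea behind RFPDupeFilter can be sketched with the hashlib module from the start of these notes. This is a simplified stand-in, not the actual scrapy.utils.request.request_fingerprint algorithm: hash the request's canonical parts into a stable string, and two identical requests collapse to the same fingerprint.

```python
import hashlib

def request_fingerprint(method, url, body=b''):
    # Simplified stand-in for Scrapy's request fingerprinting:
    # hash the method, URL, and body into one stable hex digest
    m = hashlib.sha1()
    m.update(method.encode('utf-8'))
    m.update(url.encode('utf-8'))
    m.update(body)
    return m.hexdigest()

fp1 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp2 = request_fingerprint('GET', 'http://example.com/page?id=1')
fp3 = request_fingerprint('GET', 'http://example.com/page?id=2')
print(fp1 == fp2, fp1 == fp3)  # identical requests share a fingerprint
```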
Default: {} A dict containing the extensions enabled in your project, and their orders.
Default: {}

A dict containing the item pipelines to use, and their orders. Order values are arbitrary, but it is customary to define them in the 0-1000 range. Lower orders process before higher orders. Example:

ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}
Default: False
Whether to enable memory debugging.
Default: [] When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log. Example: MEMDEBUG_NOTIFY = ['user@example.com']
Default: True If enabled, Scrapy will wait a random amount of time (between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY) while fetching requests from the same website. This randomization decreases the chance of the crawler being detected (and subsequently blocked) by sites which analyze requests looking for statistically significant similarities in the time between their requests. The randomization policy is the same used by wget --random-wait option. If DOWNLOAD_DELAY is zero (default) this option has no effect.
Default: {} A dict containing the spider middlewares enabled in your project, and their orders.
Default: "Scrapy/VERSION (+https://scrapy.org)" The default User-Agent to use when crawling, unless overridden.
exception scrapy.exceptions.DropItem
The exception that must be raised by item pipeline stages to stop processing an Item. For more information see Item Pipeline.
CloseSpider
exception scrapy.exceptions.CloseSpider(reason='cancelled')

This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

Parameters: reason (str) – the reason for closing

For example:

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')
...
built-in service
Logging
import logging
logging.warning("This is a warning")
import logging
logging.log(logging.WARNING, "This is a warning")
logger
Scrapy provides a logger within each Spider instance, which can be accessed and used like this:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
import logging
import scrapy

logger = logging.getLogger('mycustomlogger')

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://scrapinghub.com']

    def parse(self, response):
        logger.info('Parse function called on %s', response.url)
Loggers on their own don’t manage how messages sent through them are displayed. For this task, different “handlers” can be attached to any logger instance and they will redirect those messages to appropriate destinations, such as the standard output, files, emails, etc.
The statement above applies equally to the stdlib logging module.
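For instance, a handler can be attached to the 'mycustomlogger' logger from the earlier example to control where its messages go. Here the handler writes to an in-memory stream so the result is easy to inspect; a StreamHandler(sys.stdout) or FileHandler works the same way:

```python
import io
import logging

logger = logging.getLogger('mycustomlogger')
logger.setLevel(logging.INFO)

# attach a handler that redirects this logger's messages to a destination
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s:%(name)s:%(message)s'))
logger.addHandler(handler)

logger.info('Parse function called')
print(stream.getvalue().strip())  # INFO:mycustomlogger:Parse function called
```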
Because Scrapy uses the stdlib logging module, you can customize logging using all features of stdlib logging.

For example, let's say you're scraping a website which returns many HTTP 404 and 500 responses, and you want to hide all messages like this:

2016-12-16 22:00:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://quotes.toscrape.com/page/1-34/>: HTTP status code is not handled or not allowed

The first thing to note is the logger name - it is in brackets: [scrapy.spidermiddlewares.httperror]. If you get just [scrapy] then LOG_SHORT_NAMES is likely set to True; set it to False and re-run the crawl.

Next, we can see that the message has INFO level. To hide it we should set the logging level for scrapy.spidermiddlewares.httperror higher than INFO; the next level after INFO is WARNING. It could be done e.g. in the spider's __init__ method:

import logging
import scrapy

class MySpider(scrapy.Spider):
    # ...
    def __init__(self, *args, **kwargs):
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.setLevel(logging.WARNING)
        super().__init__(*args, **kwargs)

If you run this spider again then INFO messages from the scrapy.spidermiddlewares.httperror logger will be gone.
In other words, this raises the log level of the httperror logger to WARNING, so its INFO-level messages are no longer displayed.
class ExtensionThatAccessStats(object):

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)
from scrapy.mail import MailSender
mailer = MailSender()
mailer = MailSender.from_settings(settings)
Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
Scrapy default settings are optimized for focused crawls, not broad crawls. However, due to its asynchronous architecture, Scrapy is very well suited for performing fast broad crawls. This page summarizes some things you need to keep in mind when using Scrapy for doing broad crawls, along with concrete suggestions of Scrapy settings to tune in order to achieve an efficient broad crawl.
Scrapy's default scheduler priority queue is 'scrapy.pqueues.ScrapyPriorityQueue'. It works best during single-domain crawls, but does not work well when crawling many different domains in parallel. To apply the recommended priority queue use:

SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Disable JavaScript while inspecting the DOM looking for XPaths to be used in Scrapy (in the Developer Tools settings click Disable JavaScript).
Never use full XPath paths; use relative and clever ones based on attributes (such as id, class, width, etc) or any identifying features like contains(@href, 'image').
Never include <tbody> elements in your XPath expressions unless you really know what you're doing.
Scrapy supports this functionality out of the box by providing the following facilities:
a scheduler that persists scheduled requests on disk
a duplicates filter that persists visited requests on disk
an extension that keeps some spider state (key/value pairs) persistent between batches
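These facilities are all switched on by passing a JOBDIR setting; the Scrapy docs' standard invocation looks like this (the spider name and directory are just placeholders):

```
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

Stopping the crawl (Ctrl-C once) and re-running the same command later resumes it from where it left off, as long as the same JOBDIR is given.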
The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied.
Here we will implement a simple extension to illustrate the concepts described in the previous section. This extension will log a message every time:

a spider is opened
a spider is closed
a specific number of items are scraped

The extension will be enabled through the MYEXT_ENABLED setting and the number of items will be specified through the MYEXT_ITEMCOUNT setting. Here is the code of such extension:

import logging
from scrapy import signals
from scrapy.exceptions import NotConfigured

logger = logging.getLogger(__name__)

class SpiderOpenCloseLogging(object):

    def __init__(self, item_count):
        self.item_count = item_count
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # get the number of items from settings
        item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000)

        # instantiate the extension object
        ext = cls(item_count)

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        logger.info("opened spider %s", spider.name)

    def spider_closed(self, spider):
        logger.info("closed spider %s", spider.name)

    def item_scraped(self, item, spider):
        self.items_scraped += 1
        if self.items_scraped % self.item_count == 0:
            logger.info("scraped %d items", self.items_scraped)
Core API
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not provided out of the box.
from scrapy import signals
from scrapy import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
import pymysql
from bs4 import BeautifulSoup
from urllib.request import urlopen

conn = pymysql.connect(host='localhost', user='root', passwd='123',
                       db='zuo', charset='utf8')
cur = conn.cursor()
cur.execute('SELECT title FROM pages WHERE id = 2')
print(cur.fetchone())

def store(title, content):
    cur.execute("INSERT INTO pages (title, content) VALUES (%s, %s)",
                (title, content))
    cur.connection.commit()

def getLinks(url):
    html = urlopen(url)  # the URL must include a scheme, e.g. http://
    bs = BeautifulSoup(html, 'lxml')
    title = bs.find('title').get_text()
    content = bs.find('body').get_text()
    store(title, content)
    # return candidate links to crawl next
    return bs.find_all('a', href=True)

try:
    links = getLinks('http://www.baidu.com')
    while len(links) > 0:
        newLink = links[0].attrs['href']
        print(newLink)
        links = getLinks(newLink)
finally:
    cur.close()
    conn.close()
Connection objects (conn) and cursor objects (cur). The connection/cursor pattern is very common in database programming. Besides connecting to the database, the connection object also transmits database information, handles rollbacks (when a query or batch of queries is interrupted, the database needs to return to its initial state, usually implemented with transaction-control mechanisms), creates cursor objects, and so on. One connection can have many cursors, and each cursor tracks one piece of state information, such as which database is currently in use. If you have multiple databases and need to write to all of them, you need multiple cursors. A cursor also holds the result of the last query it executed; by calling cursor methods such as cur.fetchone(), you can retrieve that result.
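The same connection/cursor pattern, including the transaction rollback described above, can be demonstrated with the stdlib's sqlite3 (used here only so the example is self-contained; pymysql works the same way):

```python
import sqlite3

# one connection, one cursor; an in-memory database for illustration
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE pages (title TEXT)')
conn.commit()

cur.execute("INSERT INTO pages (title) VALUES ('kept')")
conn.commit()

# an interrupted batch can be rolled back to the last commit
cur.execute("INSERT INTO pages (title) VALUES ('discarded')")
conn.rollback()

cur.execute('SELECT title FROM pages')
rows = cur.fetchall()
print(rows)  # [('kept',)]
conn.close()
```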
jQuery is a fast, concise JavaScript framework. Its guiding principle is "Write Less, Do More": write less code and accomplish more. It wraps commonly used JavaScript functionality, provides a simple JavaScript design pattern, and streamlines HTML document manipulation, event handling, animation, and Ajax interaction.
AJAX is a technique for creating fast, dynamic web pages.
By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously. This means parts of a page can be updated without reloading the whole page.
Traditional web pages (without AJAX) must reload the entire page whenever the content needs to change.
The div is used to display information from the server. When the button is clicked, it calls a function named loadXMLDoc():

<html>
<body>
<div id="myDiv"><h3>Let AJAX change this text</h3></div>
<button type="button" onclick="loadXMLDoc()">Change Content</button>
</body>
</html>

Next, add a <script> tag to the page's head section. This tag contains the loadXMLDoc() function:

<head>
<script type="text/javascript">
function loadXMLDoc()
{
.... AJAX script goes here ...
}
</script>
</head>
var xmlhttp;
if (window.XMLHttpRequest)
{
  // code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp = new XMLHttpRequest();
}
else
{
  // code for IE6, IE5
  xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.open("GET", "test1.txt", true);
xmlhttp.send();
document.getElementById("myDiv").innerHTML=xmlhttp.responseText;
xmlDoc = xmlhttp.responseXML;
txt = "";
x = xmlDoc.getElementsByTagName("ARTIST");
for (i = 0; i < x.length; i++)
{
  txt = txt + x[i].childNodes[0].nodeValue + "<br />";
}
document.getElementById("myDiv").innerHTML = txt;
xmlhttp.onreadystatechange = function()
{
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200)
  {
    document.getElementById("myDiv").innerHTML = xmlhttp.responseText;
  }
}
As JavaScript frameworks became more and more widespread, many HTML-generation tasks moved from the server to the browser. The server may send the user's browser a hard-coded HTML template, but separate AJAX requests are still needed to load the content and place it into the right spots in that template. All of this happens in the browser/client.
Initially, this mechanism was a problem for web crawlers. In the past, when a crawler requested an HTML page, it got the page exactly as served, with all the content already in it; now the crawler receives an HTML template without any content.
selenium can solve this problem.
However, because content management has largely moved to the browser side, even the simplest website can balloon to several megabytes of content and a dozen or more HTTP requests.
Moreover, when using selenium, "extra" content the user doesn't need also gets loaded: sidebar ads, images, CSS, third-party fonts. That content may look nice, but when you are writing a crawler that needs to move fast, grab specific data, and put as little load as possible on the web server, it can mean loading hundreds of times more data than you actually need.
Still, for JavaScript, Ajax, and the modern Web there is a silver lining: because servers no longer render the data into HTML, they often act as thin wrappers around the database itself. The wrapper simply pulls data from the database and returns it to the page through an API.
Of course, these APIs were never intended to be used by anyone or anything other than the web page itself, so developers usually leave them undocumented.
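Once you find such an endpoint in the browser's Network tab, its response is typically plain JSON that can be parsed directly, skipping the HTML template entirely. The payload below is made up for illustration; in practice you would fetch it with requests.get(api_url).json():

```python
import json

# a made-up example of the kind of JSON such an undocumented API returns
payload = '{"items": [{"title": "post 1"}, {"title": "post 2"}], "total": 2}'
data = json.loads(payload)
titles = [item['title'] for item in data['items']]
print(titles)  # ['post 1', 'post 2']
```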
<form action="form_action.asp" method="get">
  <p>First name: <input type="text" name="fname" /></p>
  <p>Last name: <input type="text" name="lname" /></p>
  <input type="submit" value="Submit" />
</form>
You can simulate a form login with the requests module's post method.
The form's action attribute is the URL for the post method.
r = requests.post('url',data=params)
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('zuo', 'password')
r = requests.post('xx', auth=auth)
print(r.text)
Although this looks like an ordinary post request, an HTTPBasicAuth object is passed into the request as the auth parameter. On success, the result shown is the page for a verified username and password; on failure, it is an access-denied page.
Other form topics
Web forms are a favorite entry point for malicious bots. You certainly don't want bots creating junk accounts, hogging expensive server resources, or posting spam comments on your blog. That is why modern sites often build security measures into their HTML so that forms cannot be blasted through quickly by machines.
CAPTCHAs
Honeypots (honey pot)
Hidden fields (hidden field)
else
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('url')

# note: find_elements (plural) is needed to get a list to iterate over
links = driver.find_elements_by_tag_name('a')
for link in links:
    if not link.is_displayed():
        print('The link {} is a trap'.format(link.get_attribute('href')))

fields = driver.find_elements_by_tag_name('input')
for field in fields:
    if not field.is_displayed():
        print('Do not change value of {}'.format(field.get_attribute('name')))
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
...
# page_source is an attribute of the driver instance, not the webdriver module
pageSource = driver.page_source
bs = BeautifulSoup(pageSource, 'html.parser')
print(bs.find(id='content').get_text())