Python Simple Crawler css
Using XML.DOM or XML.sax to parser XML files. (https://www.tutorialspoint.com/python/python_xml_processing.htm )html
CSS (Cascading Style Sheets) 是级联样式表,它是用于解决如何显示HTML元素。要解决若是显示html元素,就要解决若是对html元素定位。node
为何要使用CSS来定义HTML元素,而不是直接用属性设置元素。python
直接使用属性:<P font-size=” ” “color = > abcd</P>。 这样你要手动一个个去修改属性。数据库
而CSS能够经过ID class或其余方法快速定位到一大批元素。json
h1就是一个selector。color和front-size就是属性,red和14px就是值。数组
元素选择器+类选择器浏览器
ID选择器:能够给每一个元素一个ID。ID选择器与class选择器很像,但class是能够共享,如,不一样的标签(head,body,p,等等)能够有共同的class选择器,如上,P.important和h1.important,但id是全局惟一的。app
id选择器和class选择器都是属性选择器的特殊选择器。dom
这里css文件中,只适用了class选择器。
XPath
http://www.w3school.com.cn/xpath/index.asp
节点之间的关系像是文件系统中的文件路径的方式,可看做一个目录树,能够更方便的遍历和查找节点。这是CSS作不到的,CSS只能顺序,而XPath能够顺序也能够逆序。
html的语言和xml语言长得很像,可是html语言的语法要求不是很严格的,对浏览器而言,它会尽可能解析。若是把一段html放到xml解析器里,颇有可能会失败。
JQuery 来解析xml。JQuery是用xpath和CSS来定位元素的。
#outcome:
字典是不保证顺序的,因此当用json导数据的时候,顺序是随机的。
量小用DOM,量大就用SAX。
https://www.tutorialspoint.com/python/python_xml_processing.htm
Here is the easiest way to quickly load an XML document and to create a minidom object using the xml.dom module. The minidom object provides a simple parser method that quickly creates a DOM tree from the XML file.
The sample phrase calls the parse( file [,parser] ) function of the minidom object to parse the XML file designated by file into a DOM tree object.
books.xml
1 <?xml version="1.0" encoding="ISO-8859-1"?> 2 3 <bookstore shelf="new arrives"> 4 5 <book category="COOKING"> 6 <title lang="en">Everyday Italian</title> 7 <author>Giada De Laurentiis</author> 8 <year>2005</year> 9 <price>30.00</price> 10 </book> 11 12 <book category="CHILDREN"> 13 <title lang="en">Harry Potter</title> 14 <author>J K. Rowling</author> 15 <year>2005</year> 16 <price>29.99</price> 17 </book> 18 19 <book category="WEB"> 20 <title lang="en">XQuery Kick Start</title> 21 <author>James McGovern</author> 22 <author>Per Bothner</author> 23 <author>Kurt Cagle</author> 24 <author>James Linn</author> 25 <author>Vaidyanathan Nagarajan</author> 26 <year>2003</year> 27 <price>49.99</price> 28 </book> 29 30 <book category="WEB"> 31 <title lang="en">Learning XML</title> 32 <author>Erik T. Ray</author> 33 <year>2003</year> 34 <price>39.95</price> 35 </book> 36 37 </bookstore>
XML_Dom.py
1 from xml.dom import minidom 2 import xml.dom.minidom 3 4 # Open XML document using minidom parser (parse, documentElement, xml.dom.minidom.Element) 5 doc = minidom.parse('books.xml') 6 root = doc.documentElement 7 print(type(root)) 8 print(root.nodeName) 9 10 # getting attribute values. (hasAttribute, getAttribute) 11 if root.hasAttribute("shelf"): 12 print ("Root element : %s" % root.getAttribute("shelf")) 13 else: 14 print('getting vlaue failure') 15 16 # Print detail of each books (childNodes[0].data, xml.dom.minicompat.NodeList) 17 books = root.getElementsByTagName('book') 18 book_list= [] 19 price_list = [] 20 for book in books: 21 if book.hasAttribute("category"): 22 print("**** %s ****" %book.getAttribute("category")) 23 titles = book.getElementsByTagName('title') 24 print(type(titles)) 25 book_list.append(titles[0].childNodes[0].data) 26 print("Book's Name: %s" %titles[0].childNodes[0].data) 27 prices = book.getElementsByTagName('price') 28 price_list.append(prices[0].childNodes[0].data) 29 print("prices: %s" %prices[0].childNodes[0].data) 30 print("\n") 31 #print(titles[0].childNodes[0].nodeValue,prices[0].childNodes[0].nodeValue 32 print(book_list) 33 print(price_list)
#outcome
1 <class 'xml.dom.minidom.Element'> 2 bookstore 3 Root element : new arrives 4 5 **** COOKING **** 6 <class 'xml.dom.minicompat.NodeList'> 7 Book's Name: Everyday Italian 8 prices: 30.00 9 10 11 **** CHILDREN **** 12 <class 'xml.dom.minicompat.NodeList'> 13 Book's Name: Harry Potter 14 prices: 29.99 15 16 17 **** WEB **** 18 <class 'xml.dom.minicompat.NodeList'> 19 Book's Name: XQuery Kick Start 20 prices: 49.99 21 22 23 **** WEB **** 24 <class 'xml.dom.minicompat.NodeList'> 25 Book's Name: Learning XML 26 prices: 39.95 27 28 29 ['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML'] 30 ['30.00', '29.99', '49.99', '39.95']
book.xml
<?xml version="1.0"?> <bookstore> #<bookstore> = <root>。 bookstore至关于root是一个根节点。 <book> <title>Learn Python</title> <price>100</price> </book> <book> <title>Learn XML</title> <price>100</price> </book> </bookstore>
XML_DOM.py
1 from xml.dom import minidom 2 3 doc = minidom.parse('book.xml') # <class 'xml.dom.minidom.Document'> doc是一个文档树,是从根节点开始遍历,对应的就是根节点。 4 root = doc.documentElement # <class 'xml.dom.minidom.Element'> 将doc对应的根节点传给root,因此root也对应根节点。 5 print(type(root)) 6 #print(dir(root)) # use 'dir' to see all methods of the function 7 print(root.nodeName) 8 books = root.getElementsByTagName('book') #getElements将全部元素给books,books是一个数组。<class 'xml.dom.minicompat.NodeList'>
9 for book in books: #books,titles,price are <class 'xml.dom.minicompat.NodeList'> 10 titles = book.getElementsByTagName('title') #在当前book节点下,找到全部元素名为title的元素。 11 prices = book.getElementsByTagName('price') 12 print(titles[0].childNodes[0].nodeValue,prices[0].childNodes[0].nodeValue)
#Outcome:
<class 'xml.dom.minidom.Element'> bookstore Learn Python 100 Learn XML 80
XML_SAX.py SAX处理方式在XML数据库里处理比较多,用于处理大型的xml页面。
1 import string 2 from xml.parsers.expat import ParserCreate 3 4 class DefaultSaxHandler(object): 5 def start_element(self,name,attrs): 6 self.name = name 7 print('element: %s, attrs: %s' %(name,str(attrs))) 8 def end_element(self,name): 9 print('end element: %s' %name) 10 def char_data(self,text): 11 if text.strip(): 12 print("%s's text is %s" %(self.name, text)) 13 14 handler = DefaultSaxHandler() 15 parser = ParserCreate() 16 parser.StartElementHandler = handler.start_element #<book> 17 parser.EndElementHandler = handler.end_element #</book> 18 parser.CharacterDataHandler = handler.char_data #<title>character data</title> 19 20 with open('book.xml','r') as f: 21 parser.Parse(f.read())
#Outcome:
element: book, attrs:{} element: title, attrs:{'lang':'eng'} title's text is Learn Python end element: title element: price, attrs:{} price's text is 100 end element: price end element: book ....
SAX Example: url = ""http://www.ip138.com/post/""
Regular Expression
http://www.javashuo.com/article/p-mdtezkro-d.html
Reg_EXP02.py
1 import re 2 3 #3位数字-3到8个数字 4 mr = re.match(r'\d{3}-\d{3,8}','010-23232323') 5 print(mr.string) 6 7 #加()分组 8 mr2 = re.match(r'(\d{3})-(\d{3,8})','010-23232323') 9 print(mr2.groups()) 10 print(mr2.group(0)) 11 print(mr2.group(1)) 12 print(mr2.group(2)) 13 14 t = '2:15:45' 15 tm = re.match(r'(\d{0,2}):(\d{0,2}):(\d{0,2})',t) 16 print(tm.groups()) 17 print(tm.group(0)) 18 19 #加()分组 20 tm = re.match(r'(\d{0,2}):(\d{0,2}):(\d{0,2})',t) 21 22 #分割字符串 23 p = re.compile(r'\d+') 24 print(p.split('one1two22three333'))
Selenium
https://selenium-python.readthedocs.io/index.html
回头看。