Python crawler: solving the HTML escape problem

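While crawling a foreign website I ran into what looked like an encoding problem. The data returned in the page was:

亞洲私人珍&#34255;賣,令仝好分享他為此所傾注的心血與熱愛。

The &#34255; in the middle is an HTML numeric character reference that was never converted back to a character, so this is an escaping issue rather than a wrong character encoding.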

 

The crawler source code is:

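import requests

url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'

# Fetch the page; retry once if the first request raises an error.
try:
    result = requests.get(url=url).text
except requests.RequestException:
    result = requests.get(url=url).text
# Retry once more if the response contains the JavaScript setTimeout snippet
# (presumably a placeholder page rather than the real content).
if 'javascript">setTimeout' in result:
    result = requests.get(url=url).text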

How can this be handled?

  
import requests
from HTMLParser import HTMLParser  # Python 2 module name

url = 'http://www.bonhams.com/auctions/24026/lot/120/?category=list&length=100&page=1'

try:
    result = requests.get(url=url).text
except requests.RequestException:
    result = requests.get(url=url).text
if 'javascript">setTimeout' in result:
    result = requests.get(url=url).text

# Convert HTML character references such as &#34255; back to characters.
result_HTMLParser = HTMLParser().unescape(result)
print result_HTMLParser

Printing the page source after this step shows that the text now displays correctly.
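The code above is written for Python 2. On Python 3 the HTMLParser module was renamed to html.parser, and the unescape method was removed in Python 3.9, so the same step can be sketched with html.unescape from the standard library. This is a minimal sketch that uses the beginning of the sample text from the page; the variable names raw and decoded are only illustrative.

```python
import html

# Sample text as returned by the page, containing a numeric character reference.
raw = '亞洲私人珍&#34255;賣'

# html.unescape replaces &#34255; with the character at code point 34255.
decoded = html.unescape(raw)
print(decoded)
```

In a crawler, the same call would be applied to the text returned by requests.get(url).text instead of the hard-coded sample.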

 

 

html = '&lt;abc&gt;'
In Python this can be handled as follows:

import HTMLParser
html_parser = HTMLParser.HTMLParser()
txt = html_parser.unescape(html)  # this gives txt = '<abc>'
To convert it back, you can do this:

import cgi
html = cgi.escape(txt)  # this brings us back to html = '&lt;abc&gt;'
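On Python 3, cgi.escape is no longer available (it was removed in Python 3.8), so the same round trip can be sketched with the standard html module instead. This is an equivalent under that assumption, not the original author's code, and the names escaped and back are illustrative.

```python
import html

escaped = '&lt;abc&gt;'

# Decode the entities back to plain characters.
txt = html.unescape(escaped)

# Encode the special characters again. quote=False leaves quote characters
# untouched, which matches the default behaviour of the old cgi.escape.
back = html.escape(txt, quote=False)

print(txt)
print(back)
```

Leaving quote at its default of True would also escape double and single quotes, which is usually what you want when the result goes into an HTML attribute.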