<li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
  【物理】牛顿(力单位)
  <div class="satis" style="display:none">
    <span>您对本词条的内容满意吗:</span>
    <font>
      <a href="###" tip-data="good" updateword="ньютон" satis="245057">满意</a>
      <a href="###" tip-data="update" updateword="ньютон" satis="2">请改进</a>
    </font>
  </div>
</li>
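I ran into this snippet of XML (HTML, strictly speaking) and needed to process it. After looking up some material, I solved it as follows: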
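# Python 2 code
from lxml import etree

def readFile(filen, decoding):
    # Read the raw bytes and decode them to unicode;
    # return an empty string if reading or decoding fails.
    html = ''
    try:
        html = open(filen).read().decode(decoding)
    except (IOError, UnicodeDecodeError):
        pass
    return html

def extract(filename, decoding, xpath):
    # Parse the decoded HTML and return the list of nodes matching the XPath.
    html = readFile(filename, decoding)
    tree = etree.HTML(html)
    return tree.xpath(xpath)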
These two functions take care of the encoding problems that come up when reading Chinese web pages.
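For readers on Python 3, the same idea (decode the bytes explicitly rather than trusting a default) might look roughly like the sketch below. The name `read_text` and the fallback to replacement characters are assumptions made for illustration; they are not part of the original functions, which simply return an empty string on failure.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a file as text, decoding its bytes with an explicit encoding."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumed fallback for this sketch: substitute U+FFFD for bad bytes
        # instead of discarding the whole page.
        return raw.decode(encoding, errors="replace")

# Round-trip demo with mixed Cyrillic and Chinese text.
sample = "ньютон 【物理】牛顿"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
try:
    restored = read_text(tmp.name, "utf-8")
finally:
    os.remove(tmp.name)
print(restored == sample)
```

Reading in binary mode and decoding in one place keeps the encoding decision visible, which is usually where this kind of bug hides.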
# Python 2 code
import urllib2

def GetXpath1(url, xpath, saveFile):
    # Download the page and save the raw bytes to a temporary file
    # (saveFile is not used in this function).
    response = urllib2.urlopen(url)
    data = response.read()
    with open("temp.txt", 'wb') as f:
        f.write(data)
    sections = extract('temp.txt', 'utf-8', xpath)
    print len(sections), type(sections)
    # prints: 1 <type 'list'>
    print sections
    # a list of elements: [<Element a at 0x26c8948>]
    print sections[0].tag, sections[0].attrib, sections[0].attrib.get("href")
    # prints: a {'href': u'/view/\u041d/\u041d\u043e\u0432\u0433\u043e\u0440\u043e\u0434', 'class': 'det'} /view/Н/Новгород
    print type(sections[0].attrib)
    # <type 'lxml.etree._Attrib'>
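This was the key part, and it took me some time to work out. The main goal was to extract the Russian text from
<li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
The things to pay attention to are the Element attributes tag and attrib, and the use of get("").
With that, we have basically obtained everything we need.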
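As a small self-contained illustration of tag, attrib and get(), here is a Python 3 sketch. It deliberately swaps lxml for the standard library's `xml.etree.ElementTree`, which exposes the same Element interface but only a limited XPath subset, and it works on a trimmed copy of the snippet rather than a downloaded page. The helper name `describe_link` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <li> snippet from the top of this post.
SNIPPET = (
    '<li><a href="/Н">Н</a>:'
    "<a class=\"det\" href='/view/Н/ньютон'>ньютон</a></li>"
)

def describe_link(markup, css_class="det"):
    """Return (tag, attributes, href, text) for the first <a> with this class."""
    root = ET.fromstring(markup)
    # ElementTree understands simple predicates such as [@attr='value'].
    link = root.find(".//a[@class='%s']" % css_class)
    if link is None:
        return None
    # tag is the element name, attrib is a dict-like mapping,
    # and get() looks up a single attribute by name.
    return link.tag, dict(link.attrib), link.get("href"), link.text

tag, attrib, href, text = describe_link(SNIPPET)
print(tag, attrib, href, text)
```

With lxml the calls on the element would be the same; only the parsing step (`etree.HTML`) and the fuller XPath support differ.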