Debugging XPath with lxml in Python

Date: 2019-11-08

<li>
<a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>
  【物理】牛顿(力单位)
<div class="satis" style="display:none">
<span>您对本词条的内容满意吗：</span>
<font>
<a href="###" tip-data="good" updateword="ньютон" satis="245057">满意</a>
<a href="###" tip-data="update" updateword="ньютон" satis="2">请改进</a>
</font>
</div>
</li>

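I ran into this piece of XML (in practice, an HTML fragment) that I needed to process. After looking up some material, I solved it as follows: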

 
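# Python 2
from lxml import etree

def readFile(filen, decoding):
    # Read the file and decode its bytes; return '' if anything goes wrong.
    html = ''
    try:
        with open(filen, 'rb') as f:
            html = f.read().decode(decoding)
    except Exception:
        pass
    return html

def extract(file, decoding, xpath):
    # Parse the decoded HTML and return whatever the XPath expression matches.
    html = readFile(file, decoding)
    tree = etree.HTML(html)
    return tree.xpath(xpath)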

These two functions work around the encoding problems that come up when reading Chinese web pages.
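For readers on Python 3, the same read-then-parse idea might look roughly like the sketch below. It is not the original code: the names `read_file` and `extract` and the early return on empty input are assumptions made for illustration, and the file is opened in binary mode so that the decoding step stays explicit.

```python
from lxml import etree


def read_file(path, encoding):
    """Return the file's contents decoded with `encoding`, or '' on failure."""
    try:
        with open(path, "rb") as f:
            return f.read().decode(encoding)
    except (OSError, UnicodeDecodeError):
        return ""


def extract(path, encoding, xpath):
    """Parse the file as HTML and return the XPath result list."""
    html = read_file(path, encoding)
    if not html:
        # Assumption: treat an unreadable or empty file as "no matches"
        # instead of handing empty input to the parser.
        return []
    tree = etree.HTML(html)
    return tree.xpath(xpath)
```

Passing the wrong encoding name makes `read_file` return an empty string, so a silent empty result is worth checking for when debugging.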

 
import urllib2

def GetXpath1(url, xpath, saveFile):
    # Note: saveFile is never used; the page is always written to temp.txt.
    response = urllib2.urlopen(url)
    data = response.read()
    f = file("temp.txt", 'wb')
    f.write(data)
    f.close()
    sections = extract('temp.txt', 'utf-8', xpath)
    print len(sections), type(sections)  # prints: 1 <type 'list'>
    print sections  # a list of elements: [<Element a at 0x26c8948>]
    print sections[0].tag, sections[0].attrib, sections[0].attrib.get("href")
    # prints: a {'href': u'/view/\u041d/\u041d\u043e\u0432\u0433\u043e\u0440\u043e\u0434', 'class': 'det'} /view/Н/Новгород
    print type(sections[0].attrib)  # <type 'lxml.etree._Attrib'>

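This was the key part and took some time to work out. The main goal was to extract the Russian text from

<li><a href="/Н">Н</a>:<a class="det" href='/view/Н/ньютон'>ньютон</a>

The thing to pay attention to is how the Element attributes tag and attrib, and the get("") method, are used.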
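To make the roles of `tag`, `attrib` and `get()` concrete, here is a small Python 3 sketch run against the fragment quoted above. The XPath `//a[@class="det"]` is an assumption (the post does not show the expression that was actually passed in), and a closing `</li>` has been added so the fragment is complete.

```python
from lxml import etree

# The fragment from the post as a Python 3 string (already Unicode).
snippet = (
    '<li><a href="/Н">Н</a>:'
    "<a class=\"det\" href='/view/Н/ньютон'>ньютон</a></li>"
)

tree = etree.HTML(snippet)

# Hypothetical XPath: select the link carrying class="det".
links = tree.xpath('//a[@class="det"]')
link = links[0]

print(link.tag)                 # the element name
print(link.text)                # the Russian word inside the link
print(dict(link.attrib))        # all attributes as a plain dict
print(link.attrib.get("href"))  # a single attribute; None if it is missing
print(link.get("href"))         # shorthand for the same lookup
```

`xpath()` always returns a list for a node-selecting expression, which is why the element has to be picked out with `[0]` before its attributes can be read.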

At this point I had essentially everything I needed.
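As a possible follow-up, the temporary file can be avoided by handing the downloaded bytes straight to lxml and telling the parser which encoding to use. This is a Python 3 sketch, not the approach used in the post; the names `xpath_from_bytes` and `get_xpath` and the default of `"utf-8"` are assumptions.

```python
from urllib.request import urlopen

from lxml import etree


def xpath_from_bytes(data, encoding, xpath):
    """Parse raw HTML bytes with an explicit encoding and evaluate the XPath."""
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.HTML(data, parser)
    return tree.xpath(xpath)


def get_xpath(url, xpath, encoding="utf-8"):
    """Download a page and evaluate the XPath against it, without a temp file."""
    with urlopen(url) as response:
        data = response.read()
    return xpath_from_bytes(data, encoding, xpath)
```

Keeping the parsing step in its own function means it can be tried on a saved page or a hand-written byte string without touching the network.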