1. Open the O'Reilly free programming ebooks page:
http://www.oreilly.com/programming/free/
Inspect the page elements in the browser, then run the following JS code in the console to get the list of book download links:
$.map($('body > article:nth-child(4) > div > section > div > a'), function(e){return e.href.replace(/free/, "free/files").replace(/csp.*/, "pdf")})
The resulting list looks like this:
["http://www.oreilly.com/programming/free/files/open-source-in-brazil.pdf", "http://www.oreilly.com/programming/free/files/ten-steps-to-linux-survival.pdf", "http://www.oreilly.com/programming/free/files/open-by-design.pdf", "http://www.oreilly.com/programming/free/files/getting-started-with-innersource.pdf", "http://www.oreilly.com/programming/free/files/microservices-in-production.pdf", "https://info.lightbend.com/COLL-20XX-Developing-Reactive-Microservices_Landing-Page.html?lst=OR", "http://www.oreilly.com/programming/free/files/microservices-antipatterns-and-pitfalls.pdf", "http://www.oreilly.com/programming/free/files/microservices-vs-service-oriented-architecture.pdf", "http://www.oreilly.com/programming/free/files/evolving-architectures-of-fintech.pdf", "http://www.oreilly.com/programming/free/files/software-architecture-patterns.pdf", "http://www.oreilly.com/programming/free/files/migrating-cloud-native-application-architectures.pdf", "http://www.oreilly.com/programming/free/files/reactive-microservices-architecture-orm.pdf"]
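For reference, the same link rewrite can be sketched in Python. This is a minimal sketch, not the author's code: the helper name `to_pdf_url` is hypothetical, and the `.csp` landing-page URL form is an assumption inferred from the regular expressions used in this post.

```python
import re


def to_pdf_url(href):
    """Rewrite a landing-page link into a PDF link, mirroring the JS snippet.

    Hypothetical helper: the '.csp' landing-page form is an assumption
    taken from the regexes in this post, not from O'Reilly documentation.
    """
    # A JS replace() with a non-global regex only changes the first match,
    # so only the first "free" is rewritten here as well.
    href = href.replace("free", "free/files", 1)
    # Replace "csp" and everything after it (e.g. a query string) with "pdf".
    return re.sub(r"csp.*", "pdf", href, count=1)


if __name__ == "__main__":
    print(to_pdf_url("http://www.oreilly.com/programming/free/open-by-design.csp"))
```

Links that contain neither `free` nor `csp`, such as the lightbend.com entry in the list above, pass through unchanged, which is why the download code has to filter them out.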
2. Write Python code to perform the download:
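First version: download each file directly with the urlretrieve function from the urllib library. The list may contain invalid entries (links that are not PDFs), so each entry is checked inside the loop and skipped if necessary.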
import urllib

path = "G:\\books\\auto_dowloading\\"

def downloading(books):
    for book in books:
        tmp = book.split("/")
        # Skip entries that are not PDF links
        if '.pdf' not in book:
            continue
        print "downloading %s" %(tmp[-1])
        # Save the file under its original name (last URL segment)
        urllib.urlretrieve(book, path+tmp[-1])
        print "download %s is over!" %(tmp[-1])
    print "all job done"
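The code above is Python 2 (print statements, `urllib.urlretrieve`). In Python 3 the function lives in `urllib.request`, so a rough equivalent might look like the sketch below. The names `download_all`, `dest_dir`, and the `fetch` parameter are hypothetical additions for illustration; `fetch` only exists so the loop can be exercised without touching the network.

```python
import os
from urllib.request import urlretrieve


def download_all(books, dest_dir, fetch=urlretrieve):
    """Download every PDF link in `books` into `dest_dir`.

    Returns the list of local paths written. `fetch` is a hypothetical
    hook (defaulting to urlretrieve) that makes the loop testable offline.
    """
    saved = []
    for book in books:
        # Same filter as the first version: skip anything that is not a PDF link.
        if ".pdf" not in book:
            continue
        name = book.split("/")[-1]
        target = os.path.join(dest_dir, name)
        print("downloading %s" % name)
        fetch(book, target)
        print("download %s is over!" % name)
        saved.append(target)
    return saved
```

Calling `download_all(links, "G:\\books\\auto_dowloading")` with the list from step 1 would then skip the lightbend.com entry and fetch the rest one by one.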
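Second version: given the page URL, scrape the address list of all the books, then pass that list to a process pool whose workers call the download function.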
import urllib
import os
import re
from multiprocessing import Pool

path = "G:\\books_new\\"
job =[]

def get_booklist(url):
    # Fetch the page and collect every .csp landing-page link
    page = urllib.urlopen(url)
    html = page.read()
    tmp = re.findall(r'http://.*?\.csp',html)
    # Rewrite each landing-page link into its PDF download link
    tmp2 = [i.replace('free','free/files').replace('csp','pdf') for i in tmp ]
    job.extend(tmp2)

def download_book(url,path=path):
    # Skip entries that are not PDF links
    if '.pdf' not in url:
        return
    name = url.split("/")[-1]
    print "downloading %s" %(name)
    urllib.urlretrieve(url, path+name)
    print "download %s is over!" %(name)

if __name__=='__main__':
    get_booklist('http://www.oreilly.com/programming/free/')
    # Each worker process downloads one book at a time
    pool=Pool()
    pool.map(download_book,job)
    print('The documents have been downloaded successfully !')
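A Python 3 take on the second version could be sketched as follows. It is a sketch under stated assumptions rather than a drop-in replacement: `extract_pdf_urls`, `main`, `dest_dir`, and `workers` are hypothetical names, the link pattern is tightened so that a match cannot run across attribute boundaries, and duplicate links are dropped on the assumption that the page may link the same book more than once.

```python
import os
import re
from functools import partial
from multiprocessing import Pool
from urllib.request import urlopen, urlretrieve

# Assumed link shape: an http:// URL ending in ".csp" inside one attribute value.
CSP_LINK = re.compile(r'http://[^"\s<>]+?\.csp')


def extract_pdf_urls(html):
    """Return de-duplicated PDF URLs for every .csp landing link in `html`."""
    urls = []
    for link in CSP_LINK.findall(html):
        # Insert "files/" after the first "/free/" and swap the extension.
        pdf = link.replace("/free/", "/free/files/", 1)[:-len(".csp")] + ".pdf"
        urls.append(pdf)
    # dict.fromkeys keeps the first occurrence of each URL, in page order.
    return list(dict.fromkeys(urls))


def download_book(url, dest_dir):
    """Download one PDF into `dest_dir`; defined at top level so Pool can pickle it."""
    name = url.split("/")[-1]
    urlretrieve(url, os.path.join(dest_dir, name))
    return name


def main(page_url, dest_dir, workers=4):
    """Scrape `page_url` and download all books using a pool of worker processes."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    with Pool(workers) as pool:
        return pool.map(partial(download_book, dest_dir=dest_dir), extract_pdf_urls(html))
```

As in the original, the call to `main('http://www.oreilly.com/programming/free/', 'G:\\books_new')` should sit under an `if __name__ == '__main__':` guard, since on Windows each worker process re-imports the module.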