The crawlers I found online no longer work. Thanks to their authors all the same, but I'm too lazy to rewrite somebody else's tutorial, so I'll only post my changes; see the original tutorial for how to use the script.
Here is my modified version, which crawls correctly:
#!/usr/bin/env python
#coding=utf-8
#
# Openwrt Package Grabber
#
# Copyright (C) 2016 sohobloo.me
#

import urllib2
import re
import os
import time

# the url of the package list page, must end with "/"
baseurl = 'https://downloads.openwrt.org/snapshots/trunk/ramips/mt7620/packages/'

# directory to save all the packages in, must end with "/";
# a fresh subdirectory named with the current timestamp
timestamp = time.strftime("%Y%m%d%H%M%S", time.localtime())
savedir = './' + timestamp + '/'

# capture link targets, skipping sort-order links that start with "?"
pattern = r'<a href="([^\?].*?)">'

cnt = 0

def fetch(url, path=''):
    global cnt  # cnt is assigned below, so it must be declared global
    if not os.path.exists(savedir + path):
        os.makedirs(savedir + path)
    print 'fetching package list from ' + url + path
    content = urllib2.urlopen(url + path, timeout=15).read()
    items = re.findall(pattern, content)
    for item in items:
        if item == '../':
            # skip the parent-directory link
            continue
        elif item.endswith('/'):
            # a subdirectory: recurse into it
            fetch(url, path + item)
        else:
            cnt += 1
            print 'downloading item %d: ' % cnt + path + item
            if os.path.isfile(savedir + path + item):
                print 'file exists, ignored.'
            else:
                rfile = urllib2.urlopen(baseurl + path + item)
                with open(savedir + path + item, "wb") as code:
                    code.write(rfile.read())

fetch(baseurl)
print 'done!'
Changes:
1. Added a top-level save directory named with the current timestamp
2. Changed the regex to filter out invalid links (those starting with a question mark)
3. Made it crawl the directory structure recursively
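Change 2 can be checked in isolation: Apache-style index pages contain sort-order links such as `?C=N;O=D`, and the `[^\?]` at the start of the capture group rejects them while keeping real files and subdirectories. A minimal sketch against a made-up snippet of listing HTML (the file and directory names are hypothetical):

```python
import re

# same pattern as in the script: capture hrefs that do not start with '?'
pattern = r'<a href="([^\?].*?)">'

# hypothetical fragment of an Apache-style directory listing page
html = (
    '<a href="?C=N;O=D">Name</a>'
    '<a href="../">Parent Directory</a>'
    '<a href="base/">base/</a>'
    '<a href="luci-app-firewall_git-16.018_all.ipk">luci-app-firewall</a>'
)

items = re.findall(pattern, html)
print(items)  # → ['../', 'base/', 'luci-app-firewall_git-16.018_all.ipk']
```

The `../` entry still gets through the regex, which is why the crawl loop skips it explicitly before recursing into the remaining `.../`-suffixed entries.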
On another note, I'm glad my Python knowledge finally came in handy. Hooray!
I tried to upload an updated screenshot but it failed; cnblogs looks like it's on its last legs.