Links are an important factor in SEO. To earn better rankings in search engines, you need to check regularly that the links on your site are still valid, especially since large changes to a site can easily introduce broken links. Online tools such as Google Analytics, Bing Webmaster Tools, and brokenlinkcheck.com can detect these on-site link problems. But even with ready-made tools available, we can write our own, and with Python it is very easy.
Original article: How to Check Broken Links with 404 Error in Python
Author: Xiao Ling
Translated by: yushulx
To make a site easier for search engines to crawl, most websites provide a sitemap.xml. So the basic steps are:
1. Read sitemap.xml and collect all of the site's page links.
2. From each of those pages, read out all the links they contain, which may include both inbound and outbound links.
3. Check the status of every link.
The BeautifulSoup library makes it very convenient to parse page elements:
pip install beautifulsoup4
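As a quick illustration of that convenience (a minimal sketch, independent of the article's own code), pulling every <a> href out of an HTML string takes only a few lines:

from bs4 import BeautifulSoup

html = '<a href="http://example.com/">home</a> <a href="/about">about</a>'
soup = BeautifulSoup(html, 'html.parser')
for element in soup.select('a'):
    print element.get('href')

# prints:
# http://example.com/
# /about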
Because the program may run for a long time, we need a way to interrupt it at any moment, so we register a handler for the keyboard interrupt (Ctrl+C) signal:
import signal
import threading

def ctrl_c(signum, frame):
    # Signal handler: mark the shared shutdown flag and abort.
    global shutdown_event
    shutdown_event.set()
    raise SystemExit('\nCancelling...')

# Shared flag that the crawl code polls to know when to stop.
shutdown_event = threading.Event()
signal.signal(signal.SIGINT, ctrl_c)
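The listings that follow also use urlopen, HTTPError, URLError, a GAME_OVER sentinel, and a build_request helper, none of which are defined in this excerpt. A minimal sketch of what they assume, using Python 2's urllib2 (the names come from the listings; the bodies here are assumptions):

from urllib2 import Request, urlopen, HTTPError, URLError

# Sentinel the crawl methods return when the user cancels (exact value assumed).
GAME_OVER = 'Game Over'

def build_request(url):
    # Assumed helper: attach a browser-like User-Agent so servers
    # are less likely to reject the crawler's requests.
    return Request(url, headers={'User-Agent': 'Mozilla/5.0'})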
Use BeautifulSoup to parse sitemap.xml and collect the page URLs:
# The def line is missing from the original listing; the function name is assumed.
def getPagesFromSitemap():
    pages = []
    try:
        request = build_request("http://kb.dynamsoft.com/sitemap.xml")
        f = urlopen(request, timeout=3)
        xml = f.read()
        f.close()
        soup = BeautifulSoup(xml)
        urlTags = soup.find_all("url")
        print "The number of url tags in sitemap: ", str(len(urlTags))
        for sitemap in urlTags:
            # Each <url> tag contains a <loc> tag holding the page URL.
            link = sitemap.findNext("loc").text
            pages.append(link)
    except HTTPError as e:
        print e.code
    except URLError as e:
        print e.reason
    return pages
Parse the HTML elements to collect all the links on a page:
# These two methods belong to the crawler class (class definition not shown in this excerpt).
def queryLinks(self, result):
    links = []
    content = ''.join(result)
    soup = BeautifulSoup(content)
    elements = soup.select('a')
    for element in elements:
        if shutdown_event.isSet():
            return GAME_OVER
        try:
            link = element.get('href')
            if link.startswith('http'):
                links.append(link)
        except:
            print 'href error!!!'
            continue
    return links

def readHref(self, url):
    result = []
    try:
        request = build_request(url)
        f = urlopen(request, timeout=3)
        # Read the page in 10 KB chunks, bailing out if the user cancelled.
        while not shutdown_event.isSet():
            tmp = f.read(10240)
            if len(tmp) == 0:
                break
            result.append(tmp)
        f.close()
    except HTTPError as e:
        print e.code
    except URLError as e:
        print e.reason
    if shutdown_event.isSet():
        return GAME_OVER
    return self.queryLinks(result)
Check the response status code of each link, and record the ones that return 404:
def crawlLinks(self, links, file=None):
    # Request every link and log the ones that come back as 404.
    for link in links:
        if shutdown_event.isSet():
            return GAME_OVER
        status_code = 0
        try:
            request = build_request(link)
            f = urlopen(request)
            status_code = f.code
            f.close()
        except HTTPError as e:
            status_code = e.code
        except URLError:
            pass  # connection-level failure; no HTTP status available
        if status_code == 404:
            if file != None:
                file.write(link + '\n')
        print str(status_code), ':', link
    return GAME_OVER
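Putting it all together, a driver along these lines would walk the whole site. This is a minimal sketch, assuming the three methods above live on a crawler class (LinkChecker is a name invented here) and that getPagesFromSitemap and the helpers shown earlier are in scope:

class LinkChecker:
    # Hypothetical host for the methods shown above:
    # queryLinks, readHref, crawlLinks would be pasted in here.
    pass

if __name__ == '__main__':
    checker = LinkChecker()
    pages = getPagesFromSitemap()

    # Gather the links found on every page, stopping early on Ctrl+C.
    all_links = []
    for page in pages:
        links = checker.readHref(page)
        if links == GAME_OVER:
            break
        all_links.extend(links)

    # Check each unique link once and record the 404s to a file.
    report = open('broken_links.txt', 'w')
    checker.crawlLinks(set(all_links), report)
    report.close()

On a site of any size this single-threaded loop is slow, which is exactly why the shutdown_event plumbing exists: a long run can be cancelled cleanly at any point with Ctrl+C.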