Because WebMagic makes its network requests through Apache HttpClient, all we need is to get hold of that object and perform the login with it; as long as subsequent requests reuse the same HttpClient instance, they will run in the logged-in state. The login cookies do not have to be managed by hand, since HttpClient handles them automatically.
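As a minimal sketch of that cookie behavior, independent of WebMagic (plain Apache HttpClient 4.x):

import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Every CloseableHttpClient carries a cookie store; here we wire one in explicitly
BasicCookieStore cookieStore = new BasicCookieStore();
CloseableHttpClient client = HttpClients.custom()
        .setDefaultCookieStore(cookieStore)
        .build();
// Any Set-Cookie header in a response is written into cookieStore, and every
// later request issued through this same client sends those cookies back, so a
// login performed on this client keeps working for all subsequent requests.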
Looking at the source, the HttpClient is used inside HttpClientDownloader:
@ThreadSafe
public class HttpClientDownloader extends AbstractDownloader {

    private Logger logger = LoggerFactory.getLogger(this.getClass());
    private final Map<String, CloseableHttpClient> httpClients = new HashMap();
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();
    private HttpUriRequestConverter httpUriRequestConverter = new HttpUriRequestConverter();
    private ProxyProvider proxyProvider;
    private boolean responseHeader = true;

    public HttpClientDownloader() {
    }

    // ... getHttpClient(), download() and the remaining methods follow
As the source shows, the CloseableHttpClient instances live in a Map held as a private field of the downloader, and getHttpClient is likewise a private method, so there is no way to reach the object from outside to log in or perform any other operation with it.
To get around this, copy HttpClientDownloader out into a class of your own, keep it extending AbstractDownloader, and change the get method to public:
public CloseableHttpClient getHttpClient(Site site)
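A condensed sketch of such a MyDownloader, assuming WebMagic 0.7.x (the getHttpClient body mirrors the original implementation; everything else is copied verbatim from HttpClientDownloader and elided here):

import java.util.HashMap;
import java.util.Map;
import org.apache.http.impl.client.CloseableHttpClient;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.AbstractDownloader;
import us.codecraft.webmagic.downloader.HttpClientGenerator;

public class MyDownloader extends AbstractDownloader {

    private final Map<String, CloseableHttpClient> httpClients = new HashMap<>();
    private HttpClientGenerator httpClientGenerator = new HttpClientGenerator();

    // The only change: private -> public, so callers can obtain the client
    public CloseableHttpClient getHttpClient(Site site) {
        if (site == null) {
            return httpClientGenerator.getClient(null);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            // Double-checked locking, as in the original: one client per domain
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }

    // download(Request, Task), setThread(int), etc. are copied unchanged
    // from HttpClientDownloader ...
}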
With your own class in place, just pass it to setDownloader when starting the crawler:
MyDownloader myDownloader = new MyDownloader();
Spider spider = Spider.create(new GithubRepoPageProcessor())
        .setDownloader(myDownloader)
        .addUrl("https://github.com/code4craft")
        .thread(5);
CloseableHttpClient httpClient = myDownloader.getHttpClient(spider.getSite());
// TODO: log in using httpClient
// ...
// ...
spider.run();
Before the spider runs, you can first obtain the HttpClient via getHttpClient and log in with it. The downloader can also be kept as a global variable and shared across your PageProcessors, which preserves the login state.
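As an illustration of the login step left as a TODO above, assuming the target site accepts a simple form POST (the URL, field names, and credentials are placeholders, not a real endpoint):

// Imports needed for this fragment: java.util.Arrays,
// java.nio.charset.StandardCharsets,
// org.apache.http.client.entity.UrlEncodedFormEntity,
// org.apache.http.client.methods.HttpPost,
// org.apache.http.message.BasicNameValuePair
HttpPost login = new HttpPost("https://example.com/session"); // placeholder URL
login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
        new BasicNameValuePair("login", "myUser"),     // placeholder field and value
        new BasicNameValuePair("password", "myPass")), // placeholder field and value
        StandardCharsets.UTF_8));
httpClient.execute(login).close(); // session cookies now live inside httpClient
spider.run();                      // every page fetch reuses the logged-in session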