Original post: http://www.fullstackyang.com/... If you repost this, please credit this blog or the segmentfault address. Thanks!
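When scraping web pages with HttpClient, you often need to keep the cookies sent back by the server so that later requests that depend on them can be made. In most cases the built-in cookie policy is enough to retrieve these cookies directly.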
The short snippet below requests http://www.baidu.com and prints the cookies it receives:
    @Test
    public void getCookie() {
        HttpGet get = new HttpGet("http://www.baidu.com");
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes the response first, then the client
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(get, context)) {
            System.out.println(">>>>>>headers:");
            Arrays.stream(response.getAllHeaders()).forEach(System.out::println);
            System.out.println(">>>>>>cookies:");
            // Cookies accepted by the active cookie policy end up in the context's cookie store
            context.getCookieStore().getCookies().forEach(System.out::println);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
Output:
    >>>>>>headers:
    Server: bfe/1.0.8.18
    Date: Tue, 12 Sep 2017 06:19:06 GMT
    Content-Type: text/html
    Last-Modified: Mon, 23 Jan 2017 13:28:24 GMT
    Transfer-Encoding: chunked
    Connection: Keep-Alive
    Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
    Pragma: no-cache
    Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
    >>>>>>cookies:
    [version: 0][name: BDORZ][value: 27315][domain: baidu.com][path: /][expiry: null]
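However, some sites return cookies that do not fully conform to the specification. In the example below, the printed header shows that the cookie's Expires attribute is a Unix timestamp rather than a standard date string. HttpClient therefore fails to process the cookie, the cookie store stays empty, and a warning is logged: "Invalid 'expires' attribute: 1505204523".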
    WARNING: Invalid cookie header: "Set-Cookie: yd_cookie=90236a64-8650-494b332a285dbd886e5981965fc4a93f023d; Expires=1505204523; Path=/; HttpOnly". Invalid 'expires' attribute: 1505204523
    >>>>>>headers:
    Date: Tue, 12 Sep 2017 06:22:03 GMT
    Content-Type: text/html
    Connection: keep-alive
    Set-Cookie: yd_cookie=90236a64-8650-494b332a285dbd886e5981965fc4a93f023d; Expires=1505204523; Path=/; HttpOnly
    Cache-Control: no-cache, no-store
    Server: WAF/2.4-12.1
    >>>>>>cookies:
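We could rebuild a cookie by hand from the header data, and many people do exactly that, but the approach is not very elegant. So how should this be solved? There is little material on the problem online, so the official documentation is the place to start. Section 3.4 of the official documentation, "Custom cookie policy", explains that a custom cookie policy can be created by implementing the CookieSpec interface and using a CookieSpecProvider to initialize and register the policy instance with HttpClient. The key clue is the CookieSpec interface, so let's look at its source: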
    public interface CookieSpec {
        ……
        /**
         * Parse the {@code "Set-Cookie"} Header into an array of Cookies.
         *
         * <p>This method will not perform the validation of the resultant
         * {@link Cookie}s</p>
         *
         * @see #validate
         *
         * @param header the {@code Set-Cookie} received from the server
         * @param origin details of the cookie origin
         * @return an array of {@code Cookie}s parsed from the header
         * @throws MalformedCookieException if an exception occurs during parsing
         */
        List<Cookie> parse(Header header, CookieOrigin origin) throws MalformedCookieException;
        ……
    }
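The source contains a parse method, and its comment makes clear that this is the method that turns a Set-Cookie header into Cookie objects. The natural next step is to look at HttpClient's default implementation, DefaultCookieSpec; its source is omitted here for brevity. DefaultCookieSpec mainly determines which cookie specification the header follows and then delegates to the corresponding concrete implementation. A cookie like the one above ends up being parsed by a NetscapeDraftSpec instance, and the NetscapeDraftSpec source defines the default expires format as "EEE, dd-MMM-yy HH:mm:ss z":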
    public class NetscapeDraftSpec extends CookieSpecBase {

        protected static final String EXPIRES_PATTERN = "EEE, dd-MMM-yy HH:mm:ss z";

        /** Default constructor */
        public NetscapeDraftSpec(final String[] datepatterns) {
            super(new BasicPathHandler(),
                    new NetscapeDomainHandler(),
                    new BasicSecureHandler(),
                    new BasicCommentHandler(),
                    new BasicExpiresHandler(
                            datepatterns != null ? datepatterns.clone() : new String[]{EXPIRES_PATTERN}));
        }

        NetscapeDraftSpec(final CommonCookieAttributeHandler... handlers) {
            super(handlers);
        }

        public NetscapeDraftSpec() {
            this((String[]) null);
        }
        ……
    }
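At this point the picture is fairly clear: we only need to convert the cookie's expires value into the expected format and then hand the header to the default parser.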
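To make the format mismatch concrete, here is a minimal standalone sketch that uses only the JDK. It assumes that java.text.SimpleDateFormat with Locale.US and the GMT time zone is a fair stand-in for HttpClient's own date handling; the class and method names (ExpiresFormatDemo, matchesPattern, fromEpochSeconds) are hypothetical and exist only for this illustration. It checks the raw timestamp against the Netscape pattern, converts it, and checks the converted string again.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class ExpiresFormatDemo {

    // Same pattern string as NetscapeDraftSpec.EXPIRES_PATTERN quoted above
    static final String NETSCAPE_PATTERN = "EEE, dd-MMM-yy HH:mm:ss z";

    // Assumption: Locale.US and GMT approximate how HttpClient formats and parses dates
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat(NETSCAPE_PATTERN, Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt;
    }

    /** Returns true if the text can be parsed with the Netscape expires pattern. */
    public static boolean matchesPattern(String text) {
        try {
            newFormat().parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    /** Formats a Unix timestamp given in seconds using the Netscape expires pattern. */
    public static String fromEpochSeconds(long epochSeconds) {
        return newFormat().format(new Date(epochSeconds * 1000L));
    }

    public static void main(String[] args) {
        String raw = "1505204523"; // Expires value from the warning above
        System.out.println("raw matches pattern:       " + matchesPattern(raw));
        String converted = fromEpochSeconds(Long.parseLong(raw));
        System.out.println("converted value:           " + converted);
        System.out.println("converted matches pattern: " + matchesPattern(converted));
    }
}
```

The raw value fails because the pattern expects a day name first, while the converted string parses cleanly, which is the property the fix relies on.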
Solution:
The implementation is as follows (the URL has been withheld):
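    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.List;

    import org.apache.http.Header;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.protocol.HttpClientContext;
    import org.apache.http.client.utils.DateUtils;
    import org.apache.http.config.Registry;
    import org.apache.http.config.RegistryBuilder;
    import org.apache.http.cookie.Cookie;
    import org.apache.http.cookie.CookieOrigin;
    import org.apache.http.cookie.CookieSpecProvider;
    import org.apache.http.cookie.MalformedCookieException;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.impl.cookie.DefaultCookieSpec;
    import org.apache.http.message.BasicHeader;
    import org.junit.Test;

    public class TestHttpClient {

        String url = sth; // placeholder: the real URL is withheld

        class MyCookieSpec extends DefaultCookieSpec {
            @Override
            public List<Cookie> parse(Header header, CookieOrigin cookieOrigin) throws MalformedCookieException {
                String value = header.getValue();
                String prefix = "Expires=";
                if (value.contains(prefix)) {
                    // Take the text between "Expires=" and the next ";"
                    String expires = value.substring(value.indexOf(prefix) + prefix.length());
                    expires = expires.substring(0, expires.indexOf(";"));
                    // Treat it as a Unix timestamp in seconds and reformat it with the Netscape pattern
                    String date = DateUtils.formatDate(new Date(Long.parseLong(expires) * 1000L),
                            "EEE, dd-MMM-yy HH:mm:ss z");
                    value = value.replaceAll(prefix + "\\d{10};", prefix + date + ";");
                }
                // Hand the rewritten header to the default parser
                header = new BasicHeader(header.getName(), value);
                return super.parse(header, cookieOrigin);
            }
        }

        @Test
        public void getCookie() {
            // Register the custom CookieSpec under the name "myCookieSpec"
            Registry<CookieSpecProvider> cookieSpecProviderRegistry = RegistryBuilder.<CookieSpecProvider>create()
                    .register("myCookieSpec", ctx -> new MyCookieSpec())
                    .build();
            HttpClientContext context = HttpClientContext.create();
            context.setCookieSpecRegistry(cookieSpecProviderRegistry);
            HttpGet get = new HttpGet(url);
            // Select the custom spec for this request
            get.setConfig(RequestConfig.custom().setCookieSpec("myCookieSpec").build());
            // try-with-resources closes the response first, then the client
            try (CloseableHttpClient httpClient = HttpClients.createDefault();
                 CloseableHttpResponse response = httpClient.execute(get, context)) {
                System.out.println(">>>>>>headers:");
                Arrays.stream(response.getAllHeaders()).forEach(System.out::println);
                System.out.println(">>>>>>cookies:");
                context.getCookieStore().getCookies().forEach(System.out::println);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }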
Run it again and the correct result is printed. Perfect!
    >>>>>>headers:
    Date: Tue, 12 Sep 2017 07:24:10 GMT
    Content-Type: text/html
    Connection: keep-alive
    Set-Cookie: yd_cookie=9f521fc5-0248-4ab3ee650ca50b1c7abb1cd2526b830e620f; Expires=1505208250; Path=/; HttpOnly
    Cache-Control: no-cache, no-store
    Server: WAF/2.4-12.1
    >>>>>>cookies:
    [version: 0][name: yd_cookie][value: 9f521fc5-0248-4ab3ee650ca50b1c7abb1cd2526b830e620f][domain: www.sth.com][path: /][expiry: Tue Sep 12 17:24:10 CST 2017]
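One caveat about the parse override above: it assumes the Expires attribute is spelled exactly "Expires=", holds a ten-digit timestamp, and is followed by a semicolon. If Expires is the last attribute, or if a server sends a normal date string, the substring and parseLong calls would throw. The following is a sketch of a more defensive header rewrite, not the author's method. It uses only the JDK so it can run on its own; the class name ExpiresRewriter, the method rewriteExpires, and the 9 to 11 digit range are assumptions made for this illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpiresRewriter {

    // Matches "; expires=<digits>" (case-insensitive) only when the digits are the whole
    // attribute value, i.e. followed by ";" or the end of the header.
    // Assumption: 9 to 11 digits covers timestamps expressed in seconds.
    private static final Pattern EPOCH_EXPIRES =
            Pattern.compile("(?i);\\s*expires=(\\d{9,11})(?=\\s*(?:;|$))");

    /** Returns the header value with a numeric Expires attribute rewritten as a Netscape-style date. */
    public static String rewriteExpires(String headerValue) {
        Matcher m = EPOCH_EXPIRES.matcher(headerValue);
        if (!m.find()) {
            return headerValue; // nothing to fix: leave well-formed headers untouched
        }
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd-MMM-yy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String date = fmt.format(new Date(Long.parseLong(m.group(1)) * 1000L));
        // Replace only the digits, keeping the rest of the header as received
        return headerValue.substring(0, m.start(1)) + date + headerValue.substring(m.end(1));
    }

    public static void main(String[] args) {
        System.out.println(rewriteExpires("yd_cookie=abc; Expires=1505204523; Path=/; HttpOnly"));
        System.out.println(rewriteExpires("yd_cookie=abc; Path=/; Expires=1505204523"));
    }
}
```

Inside MyCookieSpec.parse, the result of rewriteExpires(header.getValue()) could be wrapped in a new BasicHeader and passed to super.parse, in the same way the original override does.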