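When I used Jsoup to fetch data from several websites, everything went smoothly at first, but Jsoup threw an error when I requested data from a certain site (某浪). The cause appears to be that the Content-Type of the response is not one that Jsoup accepts by default.
The request code is as follows: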
private static void testOuGuanMatch() throws IOException {
    // Fetch the page with a browser-like User-Agent and a 5-second timeout.
    Document doc = Jsoup.connect("my URL")
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
            .timeout(5000)
            .get();
    System.out.println(doc);
}
As you can see, I set the User-Agent and the timeout (5000 ms) here.
The error message is as follows:
org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/javascript, URL=....
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:472)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
    at calendarSpider.SpiderTest.testOuGuanMatch(SpiderTest.java:174)
    at calendarSpider.SpiderTest.main(SpiderTest.java:39)
I found the solution on Google: add ignoreContentType(true) to the request.
The modified code:
private static void testOuGuanMatch() throws IOException {
    // ignoreContentType(true) lets Jsoup accept the application/javascript response.
    Document doc = Jsoup.connect("my URL")
            .ignoreContentType(true)
            .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.15)")
            .timeout(5000)
            .get();
    System.out.println(doc);
}
As the name suggests, ignoreContentType(true) tells Jsoup to skip its check of the response's Content-Type.
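To make the failure easier to reason about, here is a minimal sketch of the rule quoted in the exception message ("Must be text/*, application/xml, or application/xhtml+xml"). It is not Jsoup's actual source. The class ContentTypeCheckSketch and the helpers isAcceptedByDefault and wouldParse are hypothetical names I made up for illustration, and the handling of a missing header is an assumption.

```java
import java.util.Locale;

public class ContentTypeCheckSketch {

    // Hypothetical helper, NOT Jsoup's real code: mirrors the rule quoted in the
    // exception message (text/*, application/xml, or application/xhtml+xml).
    public static boolean isAcceptedByDefault(String contentType) {
        if (contentType == null) {
            return true; // assumption: a missing Content-Type header is not rejected
        }
        // Drop parameters such as "; charset=utf-8" and normalize the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml");
    }

    // Hypothetical helper: when ignoreContentType is set, the check is skipped.
    public static boolean wouldParse(String contentType, boolean ignoreContentType) {
        return ignoreContentType || isAcceptedByDefault(contentType);
    }

    public static void main(String[] args) {
        // The MIME type from the stack trace, first without and then with the flag.
        System.out.println(wouldParse("application/javascript", false)); // false
        System.out.println(wouldParse("application/javascript", true));  // true
        // An ordinary HTML page passes the default check.
        System.out.println(wouldParse("text/html; charset=utf-8", false)); // true
    }
}
```

Under these assumptions, application/javascript fails the default check, which matches the Mimetype reported in the exception, and setting the flag bypasses the check altogether.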