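I've spent the past few days working on a crawler for Sina Weibo, and found that you have to log in before you can crawl any Weibo data. My original plan was to simulate a browser login with the account name and password, but Weibo's login mechanism is fairly complicated these days, and I haven't managed to log in that way yet QAQ. So for now I'll just write down how to access your own Weibo homepage directly with a cookie.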
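The details of the Weibo login flow are already covered thoroughly in other blog posts. Roughly: after the user enters the account name and password, the browser and the server exchange several requests. If authentication succeeds, Weibo's server returns a cookie to the browser, and any later request to Weibo that carries this cookie is served normally. So accessing Weibo via cookie boils down to two steps: obtain the cookie, then have a program imitate the browser and request the Weibo homepage.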
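To make the cookie round trip concrete, here is a minimal sketch that turns `Set-Cookie` response values into the `Cookie` request header a browser would send back. It uses only the JDK's `java.net.HttpCookie`; the class name `CookieRoundTrip` and the cookie names in the usage note are made up for illustration and are not Weibo's real cookies. It also ignores path, domain, and expiry matching, which a real browser would apply.

```java
import java.net.HttpCookie;
import java.util.List;

// Hypothetical helper: not part of the original post's code.
final class CookieRoundTrip {
    // Builds a "name=value; name=value" Cookie header from Set-Cookie values.
    static String toCookieHeader(List<String> setCookieValues) {
        StringBuilder header = new StringBuilder();
        for (String value : setCookieValues) {
            // HttpCookie.parse drops attributes such as Path or HttpOnly
            // from the name/value pair and exposes them separately.
            for (HttpCookie cookie : HttpCookie.parse(value)) {
                if (header.length() > 0) {
                    header.append("; ");
                }
                header.append(cookie.getName()).append('=').append(cookie.getValue());
            }
        }
        return header.toString();
    }
}
```

For example, passing the two values `SID=abc123; Path=/; HttpOnly` and `TOKEN=xyz; Path=/` should yield the single header string `SID=abc123; TOKEN=xyz`.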
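The cookie can be captured with a packet-capture tool or with the browser's built-in developer tools. This post uses the HttpFox add-on for Firefox to obtain the Weibo cookie.
1. Open the Weibo homepage and start HttpFox.
2. Enter your user name and password, tick "Remember me", and click Log in. After you click, HttpFox lists a large number of URLs. Once your homepage has loaded, find the URL of your homepage in HttpFox, as shown in the figure below:
Click the homepage URL and some details appear in the lower-left pane, including "Headers", "Cookies", and so on.
3. Under "Headers" there is an entry named "Cookie". This is the cookie we need. Right-click it and save the cookie.
At this point we have the cookie needed for logging in!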
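The saved value is one long string of `name=value` pairs separated by semicolons. If you want to check what you captured before pasting it into a program, a small parser like the following sketch may help. The class name `CookieHeaderParser` is hypothetical, and the sketch assumes the string is a plain `Cookie` request header without attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting a captured Cookie header value.
final class CookieHeaderParser {
    // Splits "a=1; b=2" into an ordered name -> value map.
    static Map<String, String> parse(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null) {
            return cookies;
        }
        for (String pair : header.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip fragments without a name or without '='
            }
            // Split on the first '=' only, since values may contain '='.
            cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
        }
        return cookies;
    }
}
```

Printing the key set of the returned map is a quick way to confirm that the copy from HttpFox was not truncated.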
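Because we log in directly with the cookie, most of the authentication process is skipped. Using the HttpClient packages and attaching the cookie obtained above is enough to request your personal homepage. Once you have the homepage, you can analyze the Weibo data with regular expressions.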
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.cookie.CookieSpec;
import org.apache.http.cookie.CookieSpecProvider;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.DefaultCookieSpec;
import org.apache.http.message.BasicHeader;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

/**
 * Requests a Weibo homepage using a cookie captured from the browser.
 *
 * @author zkw
 */
public class cookieLogin {

    private HttpClient client;
    private HttpPost post;
    private HttpGet get;
    private BasicCookieStore cookieStore;

    public cookieLogin() {
        // Cookie policy: without one the client may report "cookie rejected".
        // The cookie store keeps whatever cookies the server sets.
        cookieStore = new BasicCookieStore();
        CookieSpecProvider myCookie = new CookieSpecProvider() {
            public CookieSpec create(HttpContext context) {
                return new DefaultCookieSpec();
            }
        };
        Registry<CookieSpecProvider> rg = RegistryBuilder.<CookieSpecProvider> create()
                .register("myCookie", myCookie)
                .build();
        // Registering a spec does not activate it; select it by name here.
        RequestConfig config = RequestConfig.custom().setCookieSpec("myCookie").build();
        client = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .setDefaultCookieSpecRegistry(rg)
                .setDefaultRequestConfig(config)
                .build();
        get = new HttpGet();
        post = new HttpPost();
    }

    public void Login() throws ClientProtocolException, IOException, URISyntaxException {
        // Placeholder: replace with the URL of your own Weibo homepage.
        String LoginUrl = "YOUR_WEIBO_HOMEPAGE_URL";
        get.setURI(new URI(LoginUrl));
        // Imitate the headers the browser sent.
        get.addHeader("Host", "weibo.com");
        get.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
        get.addHeader("Accept", "*/*");
        get.addHeader("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
        get.addHeader("Accept-Encoding", "gzip, deflate");
        get.addHeader("Referer", "http://weibo.com/");
        // Placeholder: replace with the cookie value saved from HttpFox.
        get.addHeader(new BasicHeader("Cookie", "YOUR_COOKIE_VALUE"));

        HttpResponse resp = client.execute(get);
        HttpEntity entity = resp.getEntity();
        // Decode as UTF-8 when the response does not declare a charset.
        String cont = EntityUtils.toString(entity, "UTF-8");
        System.out.println("Fetched Weibo content: " + cont);
    }

    public HttpClient getClient() {
        return client;
    }

    public void setClient(HttpClient client) {
        this.client = client;
    }

    public HttpPost getPost() {
        return post;
    }

    public void setPost(HttpPost post) {
        this.post = post;
    }

    public HttpGet getGet() {
        return get;
    }

    public void setGet(HttpGet get) {
        this.get = get;
    }

    public BasicCookieStore getCookieStore() {
        return cookieStore;
    }

    public void setCookieStore(BasicCookieStore cookieStore) {
        this.cookieStore = cookieStore;
    }

    public static void main(String[] args) throws ClientProtocolException, IOException, URISyntaxException {
        new cookieLogin().Login();
    }
}
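As a rough illustration of the regular-expression step, the sketch below pulls every double-quoted `href` value out of an HTML string with `java.util.regex`. The class name `LinkExtractor` and the pattern are my own assumptions for demonstration; they are not tied to Weibo's actual page markup, and a regex like this will miss single-quoted or unquoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts href values from fetched HTML.
final class LinkExtractor {
    // Matches href="..." case-insensitively and captures the quoted value.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1)); // group 1 is the text between the quotes
        }
        return links;
    }
}
```

The string printed by the homepage request can be passed straight to `LinkExtractor.extractLinks(...)`. For anything beyond simple patterns, an HTML parser is usually more robust than regular expressions.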
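Logging in to Weibo with a cookie is a shortcut, but it has quite a few problems. So I'm still studying Weibo's account authentication process, and I hope to make a breakthrough in the next few days QAQ.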