JD.com product: https://item.jd.com/100005603...
First, try the following code:
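    import requests

    url = 'https://item.jd.com/100005603522.html'
    try:
        r = requests.get(url)
        r.raise_for_status()              # raise an exception for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        print(r.text[:1000])
    except requests.RequestException:
        print('Fetch failed')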
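The same pattern can be wrapped in a reusable function. This is a minimal sketch, not part of the original post: the name `get_html_text` and the 10-second `timeout` are assumptions chosen for illustration.

```python
import requests


def get_html_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        return None


if __name__ == "__main__":
    text = get_html_text("https://item.jd.com/100005603522.html")
    print(text[:1000] if text is not None else "Fetch failed")
```

Returning `None` instead of printing inside the function lets the caller decide how to report a failure.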
Amazon product:
First, try the same approach that was used for JD.com:
    >>> import requests
    >>> r = requests.get('https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1')
    >>> r.status_code
    503
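The returned status code is 503 (Service Unavailable) instead of 200, so the server did not serve our request.
Let's find out what went wrong. First, change the encoding used to decode the response data.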
    >>> r.encoding
    'ISO-8859-1'
    >>> r.encoding = r.apparent_encoding
    >>> r.text
    <p class="a-last">抱歉,咱们只是想确认一下当前访问者并不是自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>
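The page text says, in effect, that the site wants to confirm the visitor is not an automated program. So the server recognized that we were accessing it from a program and refused the request.
The Response object also carries the request we sent, available as r.request. Let's look at the headers of that request.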
    >>> r.request.headers
    {'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
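Our program faithfully told the server, through the User-Agent header, that the request came from Python's requests library, and so it was refused. Next, we build a new set of headers that imitates a browser and send the request again.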
    >>> kv = {'User-Agent': 'Mozilla/5.0'}
    >>> r = requests.get('https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1',headers=kv)
    >>> r.status_code
    200
    >>> r.request.headers
    {'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    >>> r.text[:1000]
    '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n \n\n\n\n\n\n\n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n <!doctype html><html class="a-no-js" data-19ax5a9jf="dingo">\n <head>\n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>\n<script type="text/javascript">\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\nvar ue_hob=+new Date();\nvar ue_id=\'FXYNFB591SCNSB56G63B\',\nue_csm = window,\nue_err_chan = \'jserr-rw\',\nue = {};\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);\n\nue.stub(ue,"log");ue.stub(ue,"onunload");ue.stub(ue,'
Here is the complete code for fetching the Amazon product page:
    import requests

    url = 'https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1'
    try:
        kv = {'User-Agent': 'Mozilla/5.0'}  # present the request as coming from a browser
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[2000:3000])
    except requests.RequestException:
        print('Fetch failed')
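If several pages are fetched from the same site, the browser-like header can be set once on a `requests.Session` instead of being passed to every call. The sketch below is an addition rather than the original author's code. It only prepares the request, without sending it, so the outgoing headers can be inspected.

```python
import requests

url = ('https://www.amazon.cn/dp/B076SRZY65/ref=sr_1_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC'
       '%E9%80%8A%E7%BD%91%E7%AB%99&keywords=%E7%BA%A2%E6%A5%BC%E6%A2%A6&qid=1581427290&sr=8-1')

session = requests.Session()
# Every request made through this session will carry this User-Agent.
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Build the request without sending it, to see which headers would go out.
prepared = session.prepare_request(requests.Request('GET', url))
print(prepared.headers['User-Agent'])  # prints: Mozilla/5.0
```

Calling `session.get(url)` afterwards would send the request with the same headers.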
Submitting a search keyword to 360 Search:
360's keyword interface: http://www.so.com/s?q=keyword
So all we need to do is construct a URL of that form.
    >>> kv = {'q':'python'}
    >>> r = requests.get('http://www.so.com/s',params=kv)
    >>> r.status_code
    200
    >>> len(r.text)
    346592
    >>> r.url
    'https://www.so.com/s?q=python'
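To make the URL construction explicit, the sketch below builds the same kind of query string by hand with the standard library's `urlencode`. The helper name `build_search_url` is an assumption for illustration. Non-ASCII keywords are percent-encoded as UTF-8, which matches the `keywords=` parameter visible in the Amazon URL earlier.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.so.com/s"


def build_search_url(keyword):
    """Return the 360 Search URL for a keyword, percent-encoding it as needed."""
    return BASE_URL + "?" + urlencode({"q": keyword})


print(build_search_url("python"))  # http://www.so.com/s?q=python
print(build_search_url("红楼梦"))   # http://www.so.com/s?q=%E7%BA%A2%E6%A5%BC%E6%A2%A6
```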
Fetching and saving an image from the web:
Image URL: http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg
    >>> path = "/Users/liuneng/Pictures/abc.jpg"
    >>> url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
    >>> r = requests.get(url)
    >>> r.status_code
    200
    >>> with open(path,'wb') as f:
    ...     f.write(r.content)
    ...
    389618
That is all it takes. The complete code follows:
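    import requests
    import os

    url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
    root = "/Users/liuneng/Pictures/"
    path = root + url.split('/')[-1]  # use everything after the last forward slash as the file name
    try:
        if not os.path.exists(root):  # create the root directory if it does not exist
            os.mkdir(root)
        if not os.path.exists(path):  # download only if the file is not already there
            r = requests.get(url)
            r.raise_for_status()      # do not save an error page as an image
            with open(path, 'wb') as f:
                f.write(r.content)    # the with block closes the file automatically
            print('File saved')
        else:
            print('File already exists')
    except (requests.RequestException, OSError):
        print('Fetch failed')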
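Taking the file name with `url.split('/')[-1]` works for this URL, but it would keep any `?query` part if the URL had one. A slightly more defensive way to derive the local path is sketched below. The helper name `local_path_for` and the `example.com` URL are illustrative assumptions, not part of the original code.

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(url, root):
    """Build a local path under root from the last path segment of url."""
    # urlsplit separates the path from any ?query or #fragment.
    filename = posixpath.basename(urlsplit(url).path)
    if not filename:
        raise ValueError("URL has no file name: " + url)
    return os.path.join(root, filename)


url = "http://img0.dili360.com/ga/M00/48/F7/wKgBy1llvmCAAQOVADC36j6n9bw622.tub.jpg"
print(local_path_for(url, "/Users/liuneng/Pictures/"))
```

Before writing the file, `os.makedirs(root, exist_ok=True)` creates the directory, including missing parents, without a separate existence check.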
IP address location lookup:
The link is: http://www.ip138.com/iplookup...
The IP address to look up is carried in plain text in the URL.
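Because the address travels in the URL itself, a lookup request can be built by validating the IP and appending it to the query string. The sketch below illustrates only that idea. The endpoint `https://example.com/iplookup` and the parameter name `ip` are hypothetical placeholders, since the real link above is truncated.

```python
import ipaddress
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, used only for illustration.
LOOKUP_BASE = "https://example.com/iplookup"


def build_lookup_url(ip, base=LOOKUP_BASE, param="ip"):
    """Return a lookup URL with the IP address in its query string."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid IPv4/IPv6 address
    return base + "?" + urlencode({param: ip})


print(build_lookup_url("192.0.2.1"))  # https://example.com/iplookup?ip=192.0.2.1
```

The resulting URL could then be fetched with `requests.get`, in the same way as in the earlier examples.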