Original article; feel free to share! http://my.oschina.net/u/2306127/blog/613875
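Air pollution has been severe lately, and I also wanted to practice what I have learned about writing Orange add-ons and processing data, so I am developing an add-on that fetches and analyzes AQI data. Here is what it looks like so far; kind of cool, isn't it? [Once it is more polished I will share the source code. For now I would rather not mislead anyone with it, but feel free to get in touch if you are interested.]
Along the way I also noticed an important trend: within the whole North China Plain, Beijing's air quality is almost always the best!
This post mainly describes the process. The conclusion so far is only a preliminary observation; supporting charts will follow after further study.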
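Below I share the problems I ran into and how I handled them. Some issues are still unresolved; perhaps an expert out there can solve them:
The data comes from http://aqicn.org. I use the requests library to fetch it. It is very capable, and in particular it lets you set custom headers. Without custom headers, the site's anti-scraping measures return only stale data, so you cannot get the latest readings. The code is as follows: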
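import json
import re
import requests

# Get AQI data from the web for a region (bounding box).
# geturl() and gethead() are the plugin's own helpers, not shown here.
def getaqidata(left, right, bottom, top):
    aqi_url = geturl(left, right, bottom, top)
    aqi = requests.get(aqi_url, headers=gethead())
    raqi = aqi.text
    # Keep only the JSON array inside the response text.
    raqi2 = re.search(r'\[\{.*\}\]', raqi)
    cities = json.loads(raqi2.group(0))
    return cities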
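To find the exact headers, open Firefox's Developer Tools, select "Network", then pick the current data request from the request list to see all of its details. Next choose "Raw headers" to copy the relevant headers, put them into the gethead() function, and return them as a dict. Then call: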
aqi = requests.get(aqi_url,headers=gethead())
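As a rough illustration of the step above, the raw header text copied from the browser could be turned into a dict with a small parser. This is only a sketch: `parse_raw_headers` and `RAW_HEADERS` are hypothetical names, and the header values are placeholders rather than the ones the site actually requires.

```python
def parse_raw_headers(raw):
    """Turn a block of 'Name: value' lines into a dict usable by requests."""
    headers = {}
    for line in raw.strip().splitlines():
        # Skip lines without a "name: value" shape (e.g. the request line).
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        if not name.strip():
            continue
        headers[name.strip()] = value.strip()
    return headers


# Placeholder values only; paste the raw headers from your own browser here.
RAW_HEADERS = """
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0
Accept: */*
Referer: http://aqicn.org/map/
"""


def gethead():
    # Returns the dict that is passed as headers=... to requests.get().
    return parse_raw_headers(RAW_HEADERS)


print(gethead())
```

Splitting on the first colon only keeps values that themselves contain colons, such as URLs, intact.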
The returned value is a JSON string, but it is wrapped in some extra leading text, as follows:
mapShowLevel2Makers([{"lat":"38.871","lon":"115.521","aqi":"112", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"City Monitoring Station, Baoding", "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA", "pol":"pm25","tz":"+0800","idx":781,"x":668}, {"lat":"38.896","lon":"115.522","aqi":"93", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Huadian II, Baoding", "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6", "pol":"pm25","tz":"+0800","idx":783,"x":670}, ... {"lat":"40.152","lon":"118.311","aqi":"48", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Qianxi EPA, Tangshan", "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop", "pol":"pm25","tz":"+0800","idx":823,"x":4640}], [7.8,0]);
Use a regular expression to extract the data and load it into cities.
raqi2 = re.search(r'\[\{.*\}\]',raqi)
The extracted cities content is as follows:
[{"lat":"38.871","lon":"115.521","aqi":"112", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"City Monitoring Station, Baoding", "img":"_c_az8khNSs3Uf7J_7tN1s57uaNIH4uezJz7b2v189UwA", "pol":"pm25","tz":"+0800","idx":781,"x":668}, {"lat":"38.896","lon":"115.522","aqi":"93", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Huadian II, Baoding", "img":"_AR8A4P9DTjpIZWJlaS_kv53lrprluIIv5Y2O55S15LqM5Yy6", "pol":"pm25","tz":"+0800","idx":783,"x":670}, ... {"lat":"40.152","lon":"118.311","aqi":"48", "utime":" on Thursday, Feb 4th 2016, 16:00 pm","stamp":1454572800, "city":"Qianxi EPA, Tangshan", "img":"_ASUA2v9DTjpIZWJlaS_llJDlsbHluIIv6L-B6KW_546v5L-d5bGAKCop", "pol":"pm25","tz":"+0800","idx":823,"x":4640}]
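The extraction step can be tried without touching the network. The sketch below runs the same regular expression on a shortened sample in the shape of the response shown earlier; `extract_cities` is a hypothetical helper name, and the check for a missing match is an addition that the original code does not have.

```python
import json
import re


def extract_cities(text):
    """Pull the JSON array of stations out of the wrapped response text."""
    # Greedy match from the first "[{" to the last "}]"; DOTALL lets it
    # span line breaks in case the payload is not on a single line.
    match = re.search(r'\[\{.*\}\]', text, re.DOTALL)
    if match is None:
        raise ValueError("no station list found in response")
    return json.loads(match.group(0))


# Shortened sample; only a few of the real fields are kept.
sample = ('mapShowLevel2Makers([{"lat":"38.871","lon":"115.521","aqi":"112",'
          '"city":"City Monitoring Station, Baoding"},'
          '{"lat":"38.896","lon":"115.522","aqi":"93",'
          '"city":"Huadian II, Baoding"}], [7.8,0]);')

cities = extract_cities(sample)
print(len(cities), cities[0]["city"])
```

The pattern works here because the trailing `[7.8,0]` does not end in `}]`, so the greedy match stops at the end of the station list.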
cities is a standard list of dict objects, each holding a number of key-value pairs.
cities can be accessed with standard JSON operations or as an ordinary Python list.
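For instance, list indexing and dict lookups are enough to pick out individual stations. Note that the values shown above are strings, so numeric comparisons need a conversion first. The sketch below assumes some readings might not parse as integers; that is a guess on my part, not something seen in the sample data, and `to_int` is a hypothetical helper.

```python
def to_int(value, default=None):
    """Convert an AQI string such as "112" to int; return default on failure."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


# Shortened records in the same shape as the cities list above.
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
    {"lat": "40.152", "lon": "118.311", "aqi": "48",
     "city": "Qianxi EPA, Tangshan"},
]

# Index like a list, look up fields like a dict.
first_name = cities[0]["city"]

# Keep stations whose AQI parses as a number, then pick the highest reading.
valid = [c for c in cities if to_int(c["aqi"]) is not None]
worst = max(valid, key=lambda c: to_int(c["aqi"]))
print(first_name, "|", worst["city"], worst["aqi"])
```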
pandas has a very rich set of data manipulation functions, and it can convert the cities structure above directly into a pandas.DataFrame.
import pandas as pd

df = pd.DataFrame(cities)
You can also use pandas.DataFrame.to_csv() to save the data to a CSV file, or store it directly as an Excel sheet, and then... there is a lot you can do with it.
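A minimal sketch of that round trip, using two shortened records. Converting the string columns to numbers is an extra step I have added here, since the raw values arrive as text.

```python
import io

import pandas as pd

# Shortened records in the same shape as the cities list above.
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112",
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93",
     "city": "Huadian II, Baoding"},
]

df = pd.DataFrame(cities)

# The raw fields are strings; convert the numeric ones, turning anything
# unparsable into NaN instead of raising.
for col in ("lat", "lon", "aqi"):
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Write CSV to an in-memory buffer here; pass a file path to save to disk.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```

DataFrame.to_excel() covers the Excel case in the same way, but it needs an Excel writer package such as openpyxl to be installed.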
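GeoPandas adds a geometry column that can store geometric objects. The lon/lat fields of a pandas.DataFrame can be converted into point objects, but saving the result to shp failed. After removing the text fields it worked (the data contains pinyin and similar characters, which may have been treated as illegal characters because they were not handled). For now I have found a workaround.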
import pandas as pd
from shapely.geometry import Point
from geopandas import GeoSeries, GeoDataFrame

def aqi2geopandas(cities):
    df = pd.DataFrame(cities)
    ps = []  # point geometries
    ns = []  # short names
    for index, row in df.iterrows():
        print(index, ':', row['lat'], '-', row['lon'])
        ps.append(Point(float(row['lon']), float(row['lat'])))
        # Use the last comma-separated part of "city" as the name.
        addr = row["city"].split(",")
        ns.append(addr[-1].strip() or "noname")
    # Newer GeoPandas/pyproj releases prefer crs="EPSG:4326".
    gs = GeoSeries(ps, crs={'init': 'epsg:4326', 'no_defs': True})
    geodf = GeoDataFrame({'id': df["x"], 'name': ns,
                          'lon': df["lon"], 'lat': df["lat"],
                          'aqi': df["aqi"], 'utime': df["utime"], 'tz': df["tz"],
                          'geometry': gs})
    return geodf
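If a direct conversion were possible, the code above could be simplified a great deal. The priority was to get the data first; the functional code can be studied and optimized later.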
# Get the GeoPandas object.
gdf = aqi2geopandas(cities)
# fshp is the name of the file to save to.
gdf.to_file(fshp)
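One possible shape for the simplification mentioned above is to build the points in a single vectorized call instead of a row loop. This is a sketch, not the plugin's code: it assumes a GeoPandas version that provides `points_from_xy`, and it does not address the shp text-field failure described earlier.

```python
import geopandas as gpd
import pandas as pd

# Shortened records in the same shape as the cities list above.
cities = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112", "x": 668,
     "city": "City Monitoring Station, Baoding"},
    {"lat": "38.896", "lon": "115.522", "aqi": "93", "x": 670,
     "city": "Huadian II, Baoding"},
]

df = pd.DataFrame(cities)
lon = df["lon"].astype(float)
lat = df["lat"].astype(float)

gdf = gpd.GeoDataFrame(
    {
        "id": df["x"],
        # Last comma-separated part of "city", as in aqi2geopandas().
        "name": df["city"].str.split(",").str[-1].str.strip(),
        "aqi": df["aqi"],
    },
    geometry=gpd.points_from_xy(lon, lat),  # x = longitude, y = latitude
    crs="EPSG:4326",
)
print(gdf)
```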
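This step ran into some problems, mainly that text values could not be added when constructing an Orange.data.Table object. I do not know how to use some of the APIs and did not fully understand the source code, so I will look into it later. For now I save the data to a .tab file and read it back in. I have tested this and it works, but it needs a temporary file, so performance will suffer somewhat.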
def reformcity_tab(i, city):
    # One tab-separated row: ID, lat, lon, AQI, full station name,
    # three address parts (last part first), update time, time zone.
    rinfo = str(i+1) + "\t"
    rinfo = rinfo + city["lat"] + "\t"
    rinfo = rinfo + city["lon"] + "\t"
    rinfo = rinfo + city["aqi"] + "\t"
    rinfo = rinfo + city["city"] + "\t"
    # str.split() always returns at least one element.
    addr = city["city"].split(",")
    if len(addr) == 1:
        rinfo = rinfo + addr[0] + "\t-\t-\t"
    elif len(addr) == 2:
        rinfo = rinfo + addr[1] + "\t" + addr[0] + "\t-\t"
    else:
        rinfo = rinfo + addr[2] + "\t" + addr[1] + "\t" + addr[0] + "\t"
    rinfo = rinfo + city["utime"] + "\t"
    rinfo = rinfo + city["tz"]
    return rinfo

def writecityname_tab(cities, Filename):
    print("#Write to File:", Filename, "...")
    with open(Filename, 'w') as f:
        # Three header rows: column names, column types, column flags.
        f.write("ID\tLatitude\tLongitude\tAQI\tNAME\tPROV\tCONT\tSTA\tUTIME\tTZ" + "\n")
        f.write("\t".join(["discrete"] * 10) + "\n")
        f.write("\t".join([" "] * 10) + "\n")  # one entry per column
        for city in cities:
            try:
                rinfo = reformcity_tab(int(city["x"]), city)
                f.write(rinfo + "\n")
            except Exception as err:
                print("#ERROR: ", err)
                continue
    print("#Write AQI to Orange.data.Table Finished.")
Then read the .tab file back in:
# ftable is the file name saved above; it must be exactly the same.
self.table = Orange.data.Table(ftable)
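The temporary-file round trip could also be written with the standard csv and tempfile modules, which avoids hand-built tab strings and a fixed file name. This is a sketch under assumptions: `write_cities_tab` and `COLUMNS` are hypothetical names, and the column types and "meta" flags are my guess at a reasonable three-row .tab header, not what the plugin uses.

```python
import csv
import os
import tempfile

# Assumed layout: (name, type, flag) per column. Illustrative only.
COLUMNS = [
    ("ID", "string", "meta"),
    ("Latitude", "continuous", ""),
    ("Longitude", "continuous", ""),
    ("AQI", "continuous", ""),
    ("NAME", "string", "meta"),
    ("UTIME", "string", "meta"),
    ("TZ", "string", "meta"),
]


def write_cities_tab(cities):
    """Write cities to a temporary .tab file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tab")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([c[0] for c in COLUMNS])  # column names
        writer.writerow([c[1] for c in COLUMNS])  # column types
        writer.writerow([c[2] for c in COLUMNS])  # column flags
        for city in cities:
            writer.writerow([city["x"], city["lat"], city["lon"], city["aqi"],
                             city["city"], city["utime"].strip(), city["tz"]])
    return path


# Shortened record in the same shape as the cities list above.
sample = [
    {"lat": "38.871", "lon": "115.521", "aqi": "112", "x": 668,
     "utime": " on Thursday, Feb 4th 2016, 16:00 pm", "tz": "+0800",
     "city": "City Monitoring Station, Baoding"},
]

path = write_cities_tab(sample)
with open(path, encoding="utf-8") as f:
    print(f.read())
os.remove(path)  # clean up once the file is no longer needed
```

In the plugin, the returned path would be handed to Orange.data.Table(path) before the file is removed.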
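At this point I can fetch AQI data from the web for a specified region and convert it into an Orange.data.Table as well as pandas.DataFrame and geopandas.GeoDataFrame objects. GeoDataFrame.to_file(fname) then writes a shp file, which can be opened in various GIS packages and in data analysis software such as R for further analysis and mapping. I opened it in QGIS without any problems.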