How can recorded data be analyzed or visualized with Python?
This article introduces the numpy, matplotlib, pandas, and scipy packages and uses them for data analysis and visualization.
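For the Python environment, the Anaconda distribution is recommended; it is available from the Anaconda website.
Anaconda is a Python distribution for scientific computing that already bundles many popular scientific-computing and data-analysis packages.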
Running conda list shows the installed packages, and all of the packages covered in this article are already there:
    $ conda list | grep numpy
    numpy                     1.17.2           py37h99e6662_0
    $ conda list | grep "matplot\|seaborn\|plotly"
    matplotlib                3.1.1            py37h54f8f79_0
    seaborn                   0.9.0                    py37_0
    $ conda list | grep "pandas\|scipy"
    pandas                    0.25.1           py37h0a44026_0
    scipy                     1.3.1            py37h1410ff5_0
If you already have a Python environment, install the packages with pip:
    pip install numpy matplotlib pandas scipy
    # PyPI mirror: https://mirrors.tuna.tsinghua.edu.cn/help/pypi/
Environment used in this article: Python 3.7.4 (Anaconda3-2019.10)
This article assumes data in the following format, stored in data0.txt:
    id, data, timestamp
    0, 55, 1592207702.688805
    1, 41, 1592207702.783134
    2, 57, 1592207702.883619
    3, 59, 1592207702.980597
    4, 58, 1592207703.08313
    5, 41, 1592207703.183011
    6, 52, 1592207703.281802
    ...
This is CSV format: comma-separated, simple to read and write, and it can be opened in Excel.
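To experiment without a real recording, a file in this format can be generated. The following is a minimal sketch, not part of the original code: the helper name write_sample, the value range 40 to 60, and the roughly 0.1 s spacing are all assumptions chosen to resemble the excerpt above.

```python
import os
import random
import tempfile
import time


def write_sample(path, count=20, interval=0.1, seed=0):
    """Write `count` rows of "id, data, timestamp" to `path` (hypothetical helper)."""
    rng = random.Random(seed)
    stamp = time.time()
    with open(path, "w") as f:
        f.write("id, data, timestamp\n")
        for i in range(count):
            # Assumed value range; real recorded data will differ.
            f.write("{}, {}, {:.6f}\n".format(i, rng.randint(40, 60), stamp))
            # Add a little jitter so the intervals are not perfectly regular.
            stamp += interval + rng.uniform(-0.005, 0.005)
    return path


# Written to a temporary directory so nothing in the working directory is overwritten.
sample_path = write_sample(os.path.join(tempfile.mkdtemp(), "data0.txt"))
print(sample_path)
```

The resulting file has one header row followed by count data rows, so the loading code in this article can be pointed at sample_path.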
Next, we will work through the following goals together:
numpy can read CSV data directly with loadtxt:
    import numpy as np

    # id, (data), timestamp
    datas = np.loadtxt(path, dtype=np.int32, delimiter=",", skiprows=1, usecols=(1))
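dtype=np.int32: the data type, np.int32
delimiter=",": the separator, ","
skiprows=1: skip the first row (the header)
usecols=(1): read column 1 (columns are zero-based, so this is the data column)
To read multiple columns: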
    # id, (data, timestamp)
    dtype = {'names': ('data', 'timestamp'), 'formats': ('i4', 'f8')}
    datas = np.loadtxt(path, dtype=dtype, delimiter=",", skiprows=1, usecols=(1, 2))
For details on dtype, see: https://numpy.org/devdocs/reference/arrays.dtypes.html
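With a structured dtype, loadtxt returns a one-dimensional array of records, and each named field can be pulled out as an ordinary array. A small sketch of that, using the first rows of the excerpt held in an in-memory buffer (the io.StringIO stand-in is an assumption made so the example needs no file on disk):

```python
import io

import numpy as np

# A few rows in the article's data0.txt format, held in memory for the demo.
sample = io.StringIO(
    "id, data, timestamp\n"
    "0, 55, 1592207702.688805\n"
    "1, 41, 1592207702.783134\n"
    "2, 57, 1592207702.883619\n"
)

# 'i4' is a 32-bit integer, 'f8' is a 64-bit float.
dtype = {'names': ('data', 'timestamp'), 'formats': ('i4', 'f8')}
datas = np.loadtxt(sample, dtype=dtype, delimiter=",", skiprows=1, usecols=(1, 2))

# Index by field name to get each column as a plain 1-D array.
values = datas['data']
stamps = datas['timestamp']
print(values.dtype, stamps.dtype)
print(values.tolist())
```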
Compute the mean and the sample standard deviation with numpy:
    # average
    data_avg = np.mean(datas)
    # data_avg = np.average(datas)

    # standard deviation
    # data_std = np.std(datas)
    # sample standard deviation
    data_std = np.std(datas, ddof=1)

    print(" avg: {:.2f}, std: {:.2f}, sum: {}".format(
        data_avg, data_std, np.sum(datas)))
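The ddof=1 argument is what turns the population standard deviation into the sample one: the sum of squared deviations is divided by n - 1 instead of n. The sketch below works this out by hand on the first seven data values from the excerpt and compares the results with np.std; the variable names are illustrative only.

```python
import numpy as np

# First seven "data" values from the data0.txt excerpt.
datas = np.array([55, 41, 57, 59, 58, 41, 52], dtype=np.int32)

n = len(datas)
mean = datas.sum() / n
sq_dev = ((datas - mean) ** 2).sum()

pop_std = np.sqrt(sq_dev / n)            # matches np.std(datas)
sample_std = np.sqrt(sq_dev / (n - 1))   # matches np.std(datas, ddof=1)

print("population: {:.4f}, numpy: {:.4f}".format(pop_std, np.std(datas)))
print("sample:     {:.4f}, numpy: {:.4f}".format(sample_std, np.std(datas, ddof=1)))
```

The sample value is always the larger of the two, and the gap shrinks as the number of samples grows.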
The plotting itself takes just four lines:
    import sys

    import matplotlib.pyplot as plt
    import numpy as np


    def _plot(path):
        print("Load: {}".format(path))
        # id, (data), timestamp
        datas = np.loadtxt(path, dtype=np.int32, delimiter=",", skiprows=1, usecols=(1))

        fig, ax = plt.subplots()
        # Label the line with the file it came from.
        ax.plot(range(len(datas)), datas, label=path)
        ax.legend()
        plt.show()


    if __name__ == "__main__":
        if len(sys.argv) < 2:
            sys.exit("python data_plot.py *.txt")
        _plot(sys.argv[1])
In ax.plot(x, y, ...), the x-axis uses the data indices, range(len(datas)).
The complete code is in data_plot.py at the Gist linked at the end of this article. Running it gives:
    $ python data_plot.py data0.txt
    Args
     nonzero: False
    Load: data0.txt
     size: 20
     avg: 52.15, std: 8.57, sum: 1043
Multiple files can be read and shown together:
    $ python data_plot.py data*.txt
    Args
     nonzero: False
    Load: data0.txt
     size: 20
     avg: 52.15, std: 8.57, sum: 1043
    Load: data1.txt
     size: 20
     avg: 53.35, std: 6.78, sum: 1067
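One way to draw several files on the same axes is to loop over them and give each line its own label. This is only a sketch of the idea, not the code in data_plot.py: the helpers load_column and plot_series are hypothetical names, the arrays below are stand-in values, and the Agg backend is selected on the assumption that the figure is rendered off-screen.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np


def load_column(path):
    """Read the "data" column (index 1) of one file (hypothetical helper)."""
    return np.loadtxt(path, dtype=np.int32, delimiter=",", skiprows=1, usecols=1)


def plot_series(series):
    """Plot each (label, values) pair in `series` as one line on shared axes."""
    fig, ax = plt.subplots()
    for label, values in series:
        ax.plot(range(len(values)), values, label=label)
    ax.set_xlabel("index")
    ax.set_ylabel("data")
    ax.legend()
    return fig, ax


# Stand-in arrays for the demo; with real files this could be
# [(p, load_column(p)) for p in sys.argv[1:]].
fig, ax = plot_series([
    ("data0.txt", np.array([55, 41, 57, 59])),
    ("data1.txt", np.array([50, 48, 61, 53])),
])
```

Returning fig and ax leaves the caller free to call plt.show() or fig.savefig(...) afterwards.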
Given two data series x and y, scipy can interpolate between the points to smooth them into a curve:
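    import numpy as np
    from scipy import interpolate

    xnew = np.arange(xvalues[0], xvalues[-1], 0.01)
    # interp1d returns a function; call it on the new x values to get the curve.
    f = interpolate.interp1d(xvalues, yvalues, kind='cubic')
    ynew = f(xnew)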
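A self-contained sketch of the same idea, using the first seven data values from the excerpt as y and their indices as x. Two choices here are assumptions rather than the article's code: np.linspace replaces np.arange so that the last original point is included, and 601 sample points are used to keep the 0.01 spacing.

```python
import numpy as np
from scipy import interpolate

# Indices as x, first seven "data" values from the excerpt as y.
xvalues = np.arange(7)
yvalues = np.array([55, 41, 57, 59, 58, 41, 52], dtype=np.float64)

# interp1d builds a callable; kind='cubic' needs at least four points.
f = interpolate.interp1d(xvalues, yvalues, kind='cubic')

# Evaluate on a dense grid that stays inside [xvalues[0], xvalues[-1]];
# interp1d raises an error for points outside that range by default.
xnew = np.linspace(xvalues[0], xvalues[-1], 601)
ynew = f(xnew)

print(len(xnew), ynew.min(), ynew.max())
```

The smoothed curve passes through every original point, so plotting (xnew, ynew) together with (xvalues, yvalues) as markers is an easy visual check.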
The complete code is in data_interp.py at the Gist linked at the end of this article. Run it with:
    python data_interp.py data0.txt
How to configure, delay, and save matplotlib figures is shown in the code and its comments.
Here the timestamp column needs to be read:
    # id, data, (timestamp)
    stamps = np.loadtxt(path, dtype=np.float64, delimiter=",", skiprows=1, usecols=(2))
Compute the differences between consecutive timestamps with numpy:
stamps_diff = np.diff(stamps)
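np.diff returns stamps[i + 1] - stamps[i] for each neighbouring pair, so the result is one element shorter than the input. A short sketch on the seven timestamps from the excerpt; the derived mean interval and approximate rate are illustrative additions, not values reported by the article.

```python
import numpy as np

# Timestamps from the data0.txt excerpt, in seconds.
stamps = np.array([
    1592207702.688805, 1592207702.783134, 1592207702.883619,
    1592207702.980597, 1592207703.08313, 1592207703.183011,
    1592207703.281802,
], dtype=np.float64)

stamps_diff = np.diff(stamps)        # interval between consecutive samples
mean_interval = stamps_diff.mean()   # average spacing in seconds
rate_hz = 1.0 / mean_interval        # approximate samples per second

print("intervals:", np.round(stamps_diff, 4))
print("mean: {:.4f} s, max: {:.4f} s, rate: {:.2f} Hz".format(
    mean_interval, stamps_diff.max(), rate_hz))
```

An unusually large value in stamps_diff is a quick indicator of a dropped or delayed sample.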
Count the number of samples per second with pandas:
    import pandas as pd

    stamps_int = np.array(stamps, dtype='int')
    stamps_int = stamps_int - stamps_int[0]

    stamps_s = pd.Series(data=stamps_int)
    stamps_s = stamps_s.value_counts(sort=False)
The approach: truncate each timestamp to whole seconds, then let pandas count how often each value occurs.
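The per-second count can be tried on the seven timestamps from the excerpt. In this sketch, sort_index() is an addition (an assumption about the desired output order) so that the seconds come out in ascending order regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Timestamps from the data0.txt excerpt, in seconds.
stamps = np.array([
    1592207702.688805, 1592207702.783134, 1592207702.883619,
    1592207702.980597, 1592207703.08313, 1592207703.183011,
    1592207703.281802,
], dtype=np.float64)

# Truncate to whole seconds, then make them relative to the first sample.
stamps_int = stamps.astype('int')
stamps_int = stamps_int - stamps_int[0]

# Count how many samples fall into each second.
counts = pd.Series(stamps_int).value_counts(sort=False).sort_index()
print(counts)  # second 0 holds 4 samples, second 1 holds 3
```

Because the bins follow whole-second boundaries, the first and last seconds usually cover only part of a second and tend to show lower counts than the ones in between.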
The complete code is in stamp_diff.py at the Gist linked at the end of this article. Run it with:
    python stamp_diff.py data0.txt
How to show multiple charts in one matplotlib figure can also be seen in the code.
Gist with the code for this article: https://gist.github.com/ikuokuo/8629cc28079199c65e0eedb0d02a9e74