Energy Consumption (能源消耗）数据集预测（2）: 传统季节性信号分解PK prophet

时间 2020-05-31

标签 energy consumption 能源消耗数据预测传统季节性信号分解 prophet 繁體版

原文原文链接

前言

上一篇咱们讲解了如何在2小时快速完成时间序列的预测。采用的算法是facebook的prophet。文末，咱们留了一个问题：如何知道咱们快速建立的prophet的模型相对是好仍是坏？html

我一直信奉的原则是：机器学习，尤为是优化问题，必定要先找到一个baseline。以后咱们就能够找到咱们改善的力度有多大？或者是我咱们的差距有多大。这样你就能说服客户为你的解决方案付款。redis

因为本文中涉及一些信号处理相关的知识，因此会混用信号和时间序列两个词。算法

prophet预测新数据集

上文中咱们采用的数据集是AEP_hourly.csv。很遗憾，我在有限的时间内没有找到网上有人作过该数据集预测。不少人都预测的是PJME_hourly.csv。bash

为了对比标杆，咱们也用这个数据集快速作预测。代码仍是和上一篇代码一致，只是更改数据集的名称以及columns的名称，属于很基本的操做。这里就只贴出代码和结果，不作详细的解释。机器学习

from statsmodels.tsa.seasonal import seasonal_decompose
import numpy as np
from scipy.fft import fft
import pandas as pd
import matplotlib.pyplot as plt
from fbprophet import Prophet
from sklearn.metrics import mean_absolute_error


# we only analyze the dataset pjm_hourly_est.csv
file = 'data/PJME_hourly.csv'
raw_df = pd.read_csv(file)
ax = raw_df.plot(kind='line')
plt.show()


# predict new dataset with prophet
dt_format = '%Y-%m-%d %H:%M:%S'
raw_df['Datetime']= pd.to_datetime(raw_df['Datetime'],format=dt_format)
raw_df.rename({'PJME_MW':'y','Datetime':'ds'},axis=1,inplace=True)
split_dt = pd.Timestamp('2015-01-01 00:00:00')
train_df = raw_df[raw_df['ds']< split_dt]
test_df = raw_df[raw_df['ds']>= split_dt]
model = Prophet()
model= model.fit(train_df)
y_test  = model.predict(test_df)
fig  = model.plot(y_test)
print(mean_absolute_error(test_df['y'].values,y_test['yhat']))
model.plot_components(y_test)
复制代码

MAE 偏差为5183。看到kaggle上置顶的prophet预测，ta的结果是5181.782050398612。ide

也就是说，咱们两个小时速成的预测模型，和“大神”预测的差很少。或许prophet的模型基本成绩就是这个水平。那么这个水平在模型界属于什么地位呢？函数

Convergence detected: relative gradient magnitude is below tolerance
5183.653998193845
复制代码

传统的季节性信号分解

咱们能够试试传统的季节性信号分解的模型预测该时间集。prophet的核心思想也是信号分解，将一个时间序列中抽取不一样周期的季节性信号。工具

时间序列的圣经中提到，时间序列通常会被分解为trend + seasonal + remiander （趋势+季节性+残余信号）。post

statsmodels模块中的seasonal_decompose 已经有成熟的分解方法：移动平均值法。学习

下面咱们对该函数进行说明：

statsmodels.tsa.seasonal.seasonal_decompose(x, model='additive', filt=None, period=None, two_sided=True, extrapolate_trend=0)[source]¶
Seasonal decomposition using moving averages.

复制代码

主要参数：

X：就是要分析的时间序列
model: 模型有两种，加法（additive）或（multiplicative）。所谓的加法或乘法，都是指的季节性信号的幅值变化，若是是幅值变化不大，那就是加法。若是幅值是变化的，那就是乘法。
period：信号要分解的周期

输出的结果为：

observe（原始信号）
seasonal（季节信号）
trend（趋势信号）
resid（残余信号）

基本了解了函数的用法，咱们就能够进行实战了。

传统信号分解建模

首先咱们准备好数据，该数据和prophet采用的数据集一致。

咱们先分解出半天的信号。由于白天和晚上的用电量很明显不一样，这是常识。因此有了这个先验知识，咱们尝试去分解，period 设为12。

half_day_result = seasonal_decompose(raw_df['PJME_MW'], model='additive', period=12)
half_day_result.plot()
复制代码

从分解结果中能够看到，seasonal 信号是幅值恒定的一个周期信号。咱们能够plot更少的点，让其显示更清楚。由于咱们假定的周期是12小时，因此咱们plot 24小时的信号。能够看到这是几乎很完美的正（余）弦信号（由于相位不清楚）。

half_day_result.seasonal.iloc[0:24].plot()
复制代码

当咱们对比trend信号和原始信号时，发现trend相比来讲，趋势仍不明显，可是相对原始信号来讲有点进步。

回到咱们信号分解的基本思路：若是咱们能预测分解后的每一个信号，那么合并起来的信号，咱们就能预测。 seasonal信号，是正（余）弦信号，只须要知道幅值，频率，相位和偏移量便可。

傅里叶变换求解

关于傅里叶变换这个深奥的信号分析工具，网上有太多的讲解。这里咱们只须要知道的是：傅里叶变换能够将时域信号变成频域信号。白话文就是：傅里叶变换能够帮咱们快速求解幅值，频率。我写了一个函数，思路是：

首先对信号进行fft变换
以后找到幅值最大的点对应的频率和幅值
顺便返回时间信号的均值，做为偏移量

def find_period(x, plot=True):
    yf = fft(x)
    xf = np.linspace(0, len(x) // 2, len(x) // 2)
    yf = 2.0 / len(x) * np.abs(yf[0:len(x) // 2])
    all_df = pd.DataFrame([])
    all_df['time'] = xf
    all_df['freq'] = yf
    period = len(x) / all_df.loc[all_df['freq'].idxmax(), 'time']
    all_df.sort_values(by='freq',inplace=True,ascending=False)
    all_df.reset_index(inplace=True)
    all_df['period']= len(x)/all_df['time']
    all_df.replace([np.inf, -np.inf], 0,inplace=True)
    period = len(x) / all_df.loc[all_df['freq'].idxmax(), 'time']
    if plot:
        plt.plot(all_df['period'], yf)
    return period,all_df['freq'].max(),np.mean(x)
复制代码

咱们来调用函数对seaonal信号进行变换。

half_day_season_period = find_period(half_day_result_seasonal)
print('*' * 30)
print(f'half_day_season_period: {half_day_season_period}')
复制代码

周期为11.99 小时,也就是12小时，幅值为554，偏移量为0。咱们能够用数学表达式表示这个信号了，可是还缺了相位信息。

******************************
half_day_season_period: (11.999669803533104, 554.1702081757178, 5.666447331079827e-05)
复制代码

seaonal信号咱们(几乎）能够表示了，其余的信号怎么办？看不懂？？继续分解！！直到看懂（信号简单明了，或者单调，或者周期）。

继续分解trend 以及resid信号。

后面咱们会继续分解天天的信号，每周的信号，每个月，每半年，每一年，而后对seasonal 信号进行fft变换，求出幅值和频率。代码几乎雷同，因此不一一讲解。

# split day compose
day_result = seasonal_decompose(half_day_result.trend.dropna(), model='additive', period=24)
day_result.plot()

# find period of day cyclic
day_result_seasonal = day_result.seasonal.values
day_season_period = find_period(day_result_seasonal)
print('*' * 30)
print(f'day_season_period: {day_season_period}')
# find the period of redis
day_redis_period = find_period(day_result.resid.dropna().values)
print(f'day_redis_period: {day_redis_period}')

# find period of week cyclic
week_result = seasonal_decompose(
    day_result.trend.dropna(),
    model='additive',
    period=7 * 24)
week_result.plot()
week_season_period = find_period(week_result.seasonal.values)
print('*' * 30)
print('week split result')
print(f'week_season_period: {week_season_period}')
week_redis_period = find_period(week_result.resid.dropna().values)
print(f'week_redis_period: {week_redis_period}')


month_result = seasonal_decompose(
    day_result.trend.dropna(),
    model='additive',
    period=30 * 24)
month_result.plot()
month_season_period = find_period(month_result.seasonal.values)
print('*' * 30)
print('month split result')
print(f'month_season_period: {month_season_period}')

month_redis_period = find_period(month_result.resid.dropna().values)
print(f'month_redis_period: {month_redis_period}')



half_year_result = seasonal_decompose(
    month_result.trend.dropna(),
    model='additive',
    period=365 * 12)
half_year_result.plot()
half_year_season_period = find_period(half_year_result.seasonal.values)
print('*' * 30)
print('half year split result')
print(f'half_year_season_period: {half_year_season_period}')

half_year_redis_period = find_period(half_year_result.resid.dropna().values)
print(f'half_year_redis_period: {half_year_redis_period}')

year_result = seasonal_decompose(
    half_year_result.trend.dropna(),
    model='additive',
    period=365 * 24)
year_result.plot()
year_season_period = find_period(year_result.seasonal.values)
print('*' * 30)
print('yearly split result')
print(f'year_season_period: {year_season_period}')
复制代码

咱们能够看一下信号分解的结果。

每日份量分解后的信号图：

每周份量分解的信号图：

每个月份量分解的信号图：

半年份量分解的信号图：

能够看到，分解到最后，trend几乎是单调递减的信号。

fft结果以下：

******************************
day_season_period: (24.00132100396301, 655.682736730418, 0.022879188988473007)
day_redis_period: (23.993396070662044, 2129.021843848859, 0.006176697324649406)
******************************
week split result
week_season_period: (168.0092485549133, 169.52905012773715, 0.010345771248929707)
week_redis_period: (169.1841491841492, 1035.9007684637975, 0.34519449979470296)
******************************
month split result
month_season_period: (719.4455445544555, 148.40567700918697, -0.09128365805888033)
month_redis_period: (169.13216374269007, 1144.5380532729368, -1.1388580520215974)
******************************
half year split result
half_year_season_period: (4382.060606060606, 4156.997946762282, 0.6291791975756035)
half_year_redis_period: (2921.416666666667, 848.789097533505, 22.611620632847078)
******************************
yearly split result
year_season_period: (8764.25, 837.7800370550049, 0.3809718680342369)
复制代码

构建数学模型

提到数学模型，我会头大。不过这里的数学模型很简单：原始时间信号=咱们分解出来的全部seasonal信号+ trend份量。 trend份量咱们就用简单的线性拟合，seasonal信号咱们假定都是正弦信号，可是相位信息也很重要，这里咱们采用优化求解的算法，获得每一个正弦信号的相位。具体见代码

from sklearn.linear_model import LinearRegression
print(np.linspace(0,year_result.trend.shape[0], year_result.trend.shape[0]).shape)
y = year_result.trend.dropna()
lnr = LinearRegression().fit(np.linspace(0,y.shape[0], y.shape[0]).reshape(-1,1),y)
print(f'{lnr.coef_},{lnr.intercept_}')


from scipy.optimize import basinhopping
from sklearn.metrics import mean_squared_error


def energy_consumption_math_model(coefs, X, y):  # kwargs without ** to make it as positional para
    y_hat = -0.01029388 * X + 32815.50 \
            + 2*1280.4696176751042 * np.sin(X / 24 * np.pi * 2 + coefs[0]) \
            + 2 * 2129.021843848859 * np.sin(X / 24 * np.pi * 2 + coefs[1]) \
            + 2 * 1035.9007684637975 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[2]) \
            + 2 * 1144.5380532729368 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[3]) \
            + 2*4156.997946762282 * np.sin(X / 24 / 183 * np.pi * 2 + coefs[4]) + 22\
            +2*848.789097533505 * np.sin(X / 2921 * np.pi * 2 + coefs[5]) + \
            +2*837.7800370550049 * np.sin(X / 24 / 365 * np.pi * 2 + coefs[6])     + coefs[7]

    error = mean_squared_error(y_hat,y)
    return  error
coef = np.zeros(8)
minimizer_kwargs = {"method": "BFGS"}
minimizer_kwargs['args'] = (raw_df.index.to_numpy(), raw_df['PJME_MW'])
opt_result = basinhopping(energy_consumption_math_model, coef, minimizer_kwargs=minimizer_kwargs)
print(opt_result)
复制代码

优化的结果以下：

coefs = np.array([  0.92350449,  -2.21809036,   4.51689507,   7.65849104,
         2.54063322,  -0.67270678,  -0.92850058, -48.15463512])
复制代码

咱们将求解出来的coef 从新带入y_hat 公式：

y_hat = -0.01029388 * X + 32815.50 \
        + 2 * 1280.4696176751042 * np.sin(X / 24 * np.pi * 2 + coefs[0]) \
        + 2 * 2129.021843848859 * np.sin(X / 24 * np.pi * 2 + coefs[1]) \
        + 2 * 1035.9007684637975 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[2]) \
        + 2 * 1144.5380532729368 * np.sin(X / 24 / 7 * np.pi * 2 + coefs[3]) \
        + 2 * 4156.997946762282 * np.sin(X / 24 / 183 * np.pi * 2 + coefs[4]) + 22 \
        + 2 * 848.789097533505 * np.sin(X / 2921 * np.pi * 2 + coefs[5]) + \
        +2 * 837.7800370550049 * np.sin(X / 24 / 365 * np.pi * 2 + coefs[6]) + coefs[7]
复制代码

以上就是咱们创建的数学模型（包含1个线性趋势+7个正弦信号+常量），只要输入x的值，就会计算（预测）出对应的y值（用电量）。

结果展现与对比

由于咱们有了数学模型，能够很方便对全部的数据进行预测。咱们对比整个数据集上的预测结果与原始数据，看起来拟合效果还不错，周期捕捉的很准确，幅值也相差不大。

用数据来讲话，咱们对数据集测试部分进行求MAE，结果是4743！很差意思，简单的季节性信号分解加上简单的正弦信号拟合，居然赛过了prophet的预测模型，提升了精度达到9%。

4743.114944224841

复制代码

总结

本文咱们采用传统的季节性信号分解方法来分解信号，对分解后的季节性信号用正弦函数拟合，对于最后的趋势信号，采用线性拟合。

一样的测试集，传统季节性信号分解方法居然赛过了prophet。

prophet真的这么弱吗？可否挽回这个模型的颜面？后面咱们能够作成尝试。