如何在python中计算点过程的残差答案

【问题标题】：How to compute residuals of a point process in python如何在python中计算点过程的残差
【发布时间】：2014-09-07 06:15:50
【问题描述】：

我正在尝试在 python 中重现 http://jheusser.github.io/2013/09/08/hawkes.html 的工作，但数据不同。我编写了代码来模拟泊松过程以及他们描述的霍克斯过程。

为了进行 Hawkes 模型 MLE，我将对数似然函数定义为

def loglikelihood(params, data):
    (mu, alpha, beta) = params
    tlist = np.array(data)
    r = np.zeros(len(tlist))
    for i in xrange(1,len(tlist)):
        r[i] = math.exp(-beta*(tlist[i]-tlist[i-1]))*(1+r[i-1])
    loglik  = -tlist[-1]*mu
    loglik = loglik+alpha/beta*sum(np.exp(-beta*(tlist[-1]-tlist))-1)
    loglik = loglik+np.sum(np.log(mu+alpha*r))
    return -loglik

使用一些虚拟数据，我们可以计算 Hawkes 过程的 MLE

atimes=[58.98353497,   59.28420225,   59.71571013,   60.06750179,   61.24794134,
61.70692463,   61.73611983,   62.28593814,   62.51691723,   63.17370423
,63.20125152,   65.34092403,  214.24934446,  217.0390236,   312.18830525,
319.38385604,  320.31758188,  323.50201334,  323.76801537,  323.9417007]

res = minimize(loglikelihood, (0.01, 0.1,0.1),method='Nelder-Mead',args = (atimes,))
print res

但是，我不知道如何在 python 中执行以下操作。

如何进行等效的 evalCIF 以获得与它们类似的拟合强度与经验强度图？
如何计算 Hawkes 模型的残差，以制作与他们拥有的 QQ 图相当的值。他们说他们使用了一个名为 ptproc 的 R 包，但我找不到对应的 python。

【问题讨论】：

你确定这是一个编程问题吗？您能否至少说出您正在尝试做什么而不强迫每个人阅读您所指代码的文档？作为一名科学家，如果您需要，我建议您与作者联系以获得一些帮助。但怀疑他们是否会为您将他们的代码翻译成 Python。
@AleksanderLidtke 我想比较霍克斯模型和泊松模型与我的数据的拟合程度。他们似乎有一个很好的方法来做到这一点，但这一切都在 R 中。我想在我的 python 数据中重现它。 evalCIF 代码似乎以某种方式将经验强度（即数据）与拟合模型进行比较。
我知道 statsmodels 会绘制 QQ 图，但这无助于创建残差。
好吧，为什么不通过你拥有的数据来拟合一个模型，然后将经验数据与它进行比较呢？你有这样的模型吗？如果没有办法解决它，我发现高斯混合非常适合通过几乎任何东西来拟合它们，并且似乎在 Python 中有一个实现（未经我自己测试）：scikit-learn.org/0.5/modules/gmm.html
@AleksanderLidtke 我正在测试的两个模型是泊松过程（只有一个参数，强度）和霍克斯模型，它有三个。我在问题中给出的代码显示了如何计算霍克斯模型的 MLE。我知道如何将这些模型与经验数据进行比较的唯一方法是绘制他们使用 evalCIF 所做的事情或计算残差并进行 QQ 图。但我不知道如何在 python 中做到这一点。高斯混合与自激点过程相关吗？我不是这方面的专家

标签： python r statistics scipy stochastic-process

【解决方案1】：

好的，所以您可能希望做的第一件事是绘制数据。为了简单起见，我复制了this figure，因为它只发生了 8 个事件，因此很容易看到系统的行为。以下代码：

import numpy as np
import math, matplotlib
import matplotlib.pyplot
import matplotlib.lines

mu = 0.1 # Parameter values as found in the article http://jheusser.github.io/2013/09/08/hawkes.html Hawkes Process section.
alpha = 1.0
beta = 0.5

EventTimes = np.array([0.7, 1.2, 2.0, 3.8, 7.1, 8.2, 8.9, 9.0])

" Compute conditional intensities for all times using the Hawkes process. "
timesOfInterest = np.linspace(0.0, 10.0, 100) # Times where the intensity will be sampled.
conditionalIntensities = [] # Conditional intensity for every epoch of interest.
for t in timesOfInterest:
     conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in EventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after t have no contribution.

" Plot the conditional intensity time history. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()

labelsFontSize = 16
ticksFontSize = 14

fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize) 
matplotlib.rc('ytick', labelsize=ticksFontSize)

eventsScatter = ax.scatter(EventTimes,np.ones(len(EventTimes))) # Just to indicate where the events took place.

ax.plot(timesOfInterest, conditionalIntensities, color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)
 fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None,  markerfacecolor='blue', markersize=12)

fig.legend([fittedPlot, eventsScatter], [r'$Conditional\ intensity\ computed\ from\    events$', r'$Events$'])
matplotlib.pyplot.show()

非常准确地再现了这个数字，即使我有些武断地选择了事件时期：

这也可以应用于example set of data 的 5000 笔交易集，方法是对数据进行分箱并将每个分箱视为一个事件。但是，现在发生的情况是，每个事件的权重略有不同，因为每个仓中发生的交易数量不同。 the article 在将比特币贸易到达与霍克斯流程匹配部分也提到了这一点，并提出了解决此问题的建议方法：The only difference to the original dataset is that I added a random millisecond timestamp to all trades that share a timestamp with another trade. This is required as the model requires to distinguish every trade (i.e. every trade must have a unique timestamp). 这包含在以下代码中：

import numpy as np
import math, matplotlib, pandas
import scipy.optimize
import matplotlib.pyplot
import matplotlib.lines

" Read example trades' data. "
all_trades = pandas.read_csv('all_trades.csv', parse_dates=[0], index_col=0) # All trades' data.
all_counts = pandas.DataFrame({'counts': np.ones(len(all_trades))}, index=all_trades.index) # Only the count of the trades is really important.
empirical_1min = all_counts.resample('1min', how='sum') # Bin the data so find the number of trades in 1 minute intervals.

baseEventTimes = np.array( range(len(empirical_1min.values)), dtype=np.float64) # Dummy times when the events take place, don't care too much about actual epochs where the bins are placed - this could be scaled to days since epoch, second since epoch and any other measure of time.
eventTimes = [] # With the event batches split into separate events.
for i in range(len(empirical_1min.values)): # Deal with many events occurring at the same time - need to distinguish between them by splitting each batch of events into distinct events taking place at almost the same time.
    if not np.isnan(empirical_1min.values[i]):
        for j in range(empirical_1min.values[i]):
            eventTimes.append(baseEventTimes[i]+0.000001*(j+1)) # For every event that occurrs at this epoch enter a dummy event very close to it in time that will increase the conditional intensity.

eventTimes = np.array( eventTimes, dtype=np.float64 ) # Change to array for ease of operations.

" Find a fit for alpha, beta, and mu that minimises loglikelihood for the input data. "
#res = scipy.optimize.minimize(loglikelihood, (0.01, 0.1,0.1), method='Nelder-Mead', args = (eventTimes,))
#(mu, alpha, beta) =  res.x
mu = 0.07 # Parameter values as found in the article.
alpha = 1.18
beta = 1.79

" Compute conditional intensities for all epochs using the Hawkes process - add more points to see how the effect of individual events decays over time. "
conditionalIntensitiesPlotting = [] # Conditional intensity for every epoch of interest.
 timesOfInterest = np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size*10) # Times where the intensity will be sampled. Sample at much higher frequency than the events occur at.
for t in timesOfInterest:
    conditionalIntensitiesPlotting.append( mu + np.array( [alpha*math.exp(-beta*(t-ti))   if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Find the contributions of all preceding events to the overall chance of another one occurring. All events that occur after time of interest t have no contribution.

" Compute conditional intensities at the same epochs as the empirical data are known. "
 conditionalIntensities=[] # This will be used in the QQ plot later, has to have the same size as the empirical data.
for t in np.linspace(eventTimes.min(), eventTimes.max(), eventTimes.size):
    conditionalIntensities.append( mu + np.array( [alpha*math.exp(-beta*(t-ti)) if t > ti else 0.0 for ti in eventTimes] ).sum() ) # Use eventTimes here as well to feel the influence of all the events that happen at the same time.

" Plot the empirical and fitted datasets. "
fig = matplotlib.pyplot.figure()
ax = fig.gca()

labelsFontSize = 16
ticksFontSize = 14

fig.suptitle(r"$Conditional\ intensity\ VS\ time$", fontsize=20)
ax.grid(True)
ax.set_xlabel(r'$Time$',fontsize=labelsFontSize)
ax.set_ylabel(r'$\lambda$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize) 
matplotlib.rc('ytick', labelsize=ticksFontSize)

# Plot the empirical binned data.
ax.plot(baseEventTimes,empirical_1min.values, color='blue', linestyle='solid',   marker=None, markerfacecolor='blue', markersize=12)
empiricalPlot = matplotlib.lines.Line2D([],[],color='blue', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)

# And the fit obtained using the Hawkes function.
ax.plot(timesOfInterest, conditionalIntensitiesPlotting, color='red', linestyle='solid',   marker=None, markerfacecolor='blue', markersize=12)
fittedPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='blue', markersize=12)

fig.legend([fittedPlot, empiricalPlot], [r'$Fitted\ data$', r'$Empirical\ data$'])
matplotlib.pyplot.show()

这会生成以下拟合图：一切看起来都不错，但是，当您查看细节时，您会发现仅通过取交易数量的一个向量并减去拟合的向量来计算残差是行不通的，因为它们的长度不同：然而，可以在与经验数据记录时相同的时期提取强度，然后计算残差。这使您能够找到经验数据和拟合数据的分位数并将它们相互绘制，从而生成 QQ 图：

""" GENERATE THE QQ PLOT. """
" Process the data and compute the quantiles. "
orderStatistics=[]; orderStatistics2=[];
for i in range( empirical_1min.values.size ): # Make sure all the NANs are filtered out and both arrays have the same size.
    if not np.isnan( empirical_1min.values[i] ):
        orderStatistics.append(empirical_1min.values[i])
        orderStatistics2.append(conditionalIntensities[i])
orderStatistics = np.array(orderStatistics); orderStatistics2 = np.array(orderStatistics2);

orderStatistics.sort(axis=0) # Need to sort data in ascending order to make a QQ plot.    orderStatistics is a column vector.
 orderStatistics2.sort()

 smapleQuantiles=np.zeros( orderStatistics.size ) # Quantiles of the empirical data.
 smapleQuantiles2=np.zeros( orderStatistics2.size ) # Quantiles of the data fitted using the Hawkes process.
for i in range( orderStatistics.size ):
    temp = int( 100*(i-0.5)/float(smapleQuantiles.size) ) # (i-0.5)/float(smapleQuantiles.size) th quantile. COnvert to % as expected by the numpy  function.
    if temp<0.0:
        temp=0.0 # Avoid having -ve percentiles.
    smapleQuantiles[i] = np.percentile(orderStatistics, temp)
    smapleQuantiles2[i] = np.percentile(orderStatistics2, temp)

" Make the quantile plot of empirical data first. "
fig2 = matplotlib.pyplot.figure()
ax2 = fig2.gca(aspect="equal")

fig2.suptitle(r"$Quantile\ plot$", fontsize=20)
ax2.grid(True)
ax2.set_xlabel(r'$Sample\ fraction\ (\%)$',fontsize=labelsFontSize)
ax2.set_ylabel(r'$Observations$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize) 
matplotlib.rc('ytick', labelsize=ticksFontSize)

distScatter = ax2.scatter(smapleQuantiles, orderStatistics, c='blue', marker='o') # If these are close to the straight line with slope line these points come from a normal distribution.

ax2.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)
normalDistPlot = matplotlib.lines.Line2D([],[],color='red', linestyle='solid', marker=None, markerfacecolor='red', markersize=12)

fig2.legend([normalDistPlot, distScatter], [r'$Normal\ distribution$', r'$Empirical\ data$'])
 matplotlib.pyplot.show()

" Make a QQ plot. "
fig3 = matplotlib.pyplot.figure()
ax3 = fig3.gca(aspect="equal")

fig3.suptitle(r"$Quantile\ -\ Quantile\ plot$", fontsize=20)
ax3.grid(True)
ax3.set_xlabel(r'$Empirical\ data$',fontsize=labelsFontSize)
ax3.set_ylabel(r'$Data\ fitted\ with\ Hawkes\ distribution$',fontsize=labelsFontSize)
matplotlib.rc('xtick', labelsize=ticksFontSize) 
matplotlib.rc('ytick', labelsize=ticksFontSize)

distributionScatter = ax3.scatter(smapleQuantiles, smapleQuantiles2, c='blue', marker='x') # If these are close to the straight line with slope line these points come from a normal distribution.

ax3.plot(smapleQuantiles, smapleQuantiles, color='red', linestyle='solid', marker=None,     markerfacecolor='red', markersize=12)
normalDistPlot2 = matplotlib.lines.Line2D([],[],color='red', linestyle='solid',  marker=None, markerfacecolor='red', markersize=12)

fig3.legend([normalDistPlot2, distributionScatter], [r'$Normal\ distribution$', r'$Comparison\ of\ datasets$'])
matplotlib.pyplot.show()

这会生成以下图：

经验数据的分位数图并不完全是the same as in the article，我不知道为什么，因为我不擅长统计。但是，从编程的角度来看，这就是你可以做这一切的方式。

【讨论】：

谢谢。但是，我不确定该方法是否正确。我认为我们必须重新调整时间，然后针对指数分布绘制 QQ 图。这可以解释为什么情节不一样
我真的希望你的代码在 GitHub 上。它已经在某个 GitHub 存储库中了吗？或者你能做一个吗？
@jeongmin.cha 嗨，对不起，有点忙，所以直到知道才读到这个。谢谢，我很高兴这很有用。事实上，我的 git 上已经有了所有这些。如果你愿意，我很乐意在这方面进行合作。这是链接：github.com/AleksanderLidtke/Hawkes-Process
@jeongmin.cha 我同意，那里可以做很多。回购中的代码正是我为这个答案写的。我以为我会在某个时候进一步发展它，但从来没有任何刺激，可能就是这样，谢谢:)
@jeongmin.cha 如果你想在 Python 中试验 Hawkes 进程，你也可以看看这个刚刚开源的 github repo github.com/X-DataInitiative/tick :) 免责声明：我是第一个贡献者之一