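You didn't provide many details, so I'll make up some data and make some assumptions; hopefully this is useful anyway.
Start by importing packages and sampling a dataset: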
import random
import networkx as nx
# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)
# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0
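One property of the sampled data is worth checking before measuring distances: `scale_free_graph` returns a directed multigraph, so a path from `a` to `b` has to follow edge directions. The snippet below is a minimal sketch of how to confirm that, and of one way to get an undirected view if your real network is undirected. The graph size of 200, the `seed` value and the name `U` are my own choices for illustration, not part of the code above.

```python
import networkx as nx

# Small graph with a fixed seed so the check is reproducible (illustrative values)
G = nx.scale_free_graph(200, seed=42)

# scale_free_graph builds a MultiDiGraph: directed, parallel edges allowed
print(type(G).__name__, G.is_directed(), G.is_multigraph())

# Assumption: if ties in your network are symmetric, collapse to a simple
# undirected graph; direction and parallel edges are dropped
U = nx.Graph(G)
print(type(U).__name__, U.is_directed(), U.is_multigraph())
```

Whether the directed or undirected view is appropriate depends on what an edge means in your data, so treat the conversion as optional.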
Next, compute the shortest path between random pairs of nodes:
results = []
# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a, b = random.sample(list(G.nodes), 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))
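To make the `-1` convention concrete, here is a tiny sketch on a hand-built directed graph. The helper name `distance` and the three-node graph `D` are hypothetical and only for illustration; the try/except pattern is the same one used in the loop above.

```python
import networkx as nx

# Tiny directed chain: 0 -> 1 -> 2
D = nx.DiGraph()
D.add_edges_from([(0, 1), (1, 2)])

def distance(graph, a, b):
    """Return the shortest path length from a to b, or -1 if b is unreachable."""
    try:
        return nx.shortest_path_length(graph, a, b)
    except nx.NetworkXNoPath:
        return -1

print(distance(D, 0, 2))  # 2: follows 0 -> 1 -> 2
print(distance(D, 2, 0))  # -1: no edges lead back from 2 to 0
```

Because edges are directed, the distance from `a` to `b` can differ from the distance from `b` to `a`, which is why unreachable pairs get their own bucket.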
Finally, here is some code to summarize the results and print them out:
from collections import Counter
from scipy import stats
# somewhere to keep counts for: both 0, both 1, different labels
c_0 = Counter()
c_1 = Counter()
c_d = Counter()
# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1
# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title}, n={s}')
    for k, n in sorted(c.items()):
        # calculate some sort of CI over monte carlo error
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')
show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')
The above prints:
both 0, n=63930
   -1: 60806 = 95.11% [94.94%, 95.28%]
    1:   107 =  0.17% [ 0.14%,  0.20%]
    2:   753 =  1.18% [ 1.10%,  1.26%]
    3:  1137 =  1.78% [ 1.68%,  1.88%]
    4:   584 =  0.91% [ 0.84%,  0.99%]
    5:   334 =  0.52% [ 0.47%,  0.58%]
    6:   154 =  0.24% [ 0.21%,  0.28%]
    7:    50 =  0.08% [ 0.06%,  0.10%]
    8:     3 =  0.00% [ 0.00%,  0.01%]
    9:     2 =  0.00% [ 0.00%,  0.01%]
both 1, n=3978
   -1:  3837 = 96.46% [95.83%, 96.99%]
    1:     6 =  0.15% [ 0.07%,  0.33%]
    2:    34 =  0.85% [ 0.61%,  1.19%]
    3:    34 =  0.85% [ 0.61%,  1.19%]
    4:    31 =  0.78% [ 0.55%,  1.10%]
    5:    30 =  0.75% [ 0.53%,  1.07%]
    6:     6 =  0.15% [ 0.07%,  0.33%]
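To save space, I've cut the section where the labels differ. A distance of -1 means no path was found between the pair. The values in square brackets are the 95% CI for the Monte Carlo error. Using more iterations above would reduce this error, obviously at the cost of more CPU time.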
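To illustrate how the interval narrows with more iterations, here is a small sketch that reuses the same Beta quantile calculation as `show`. With parameters `1 + n` and `1 + s - n` this corresponds to a uniform Beta(1, 1) prior on the proportion, so strictly speaking it is a credible interval. The helper name `beta_interval` is hypothetical, and the "four times as many samples" case simply scales the counts from the first `both 0` row; it is an assumed scenario, not a measured result.

```python
from scipy import stats

def beta_interval(n, s, level=0.95):
    """Equal-tailed interval for the proportion n/s under a uniform Beta(1, 1) prior."""
    tail = (1 - level) / 2
    lo, hi = stats.beta.ppf([tail, 1 - tail], 1 + n, 1 + s - n)
    return lo, hi

# Counts taken from the "both 0" row for distance 1 above
lo1, hi1 = beta_interval(107, 63930)

# Hypothetical: same observed proportion, four times as many sampled pairs
lo4, hi4 = beta_interval(4 * 107, 4 * 63930)

print(f'original : [{lo1:.2%}, {hi1:.2%}] width={hi1 - lo1:.4%}')
print(f'4x pairs : [{lo4:.2%}, {hi4:.2%}] width={hi4 - lo4:.4%}')
print(f'width ratio: {(hi4 - lo4) / (hi1 - lo1):.2f}')
```

The width shrinks roughly with the square root of the number of samples, so halving the interval costs about four times the CPU time.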