How to select a good sample size of nodes from a graph (answers)

【Question title】: How to select a good sample size of nodes from a graph
【Posted】: 2021-04-01 06:09:03
【Question description】:

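I have a network whose nodes carry an attribute labelled 0 or 1. I want to find out how the distance between nodes with the same attribute differs from the distance between nodes with different attributes. Since it is computationally hard to find the distances between all combinations of nodes, I want to work with a sample of nodes. How should I choose the sample size? I am working in Python with networkx.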

【Comments】:

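I would start with sample sizes S={10,100,1000,10000} and, for every S, compute the mean and standard deviation over N=100 repetitions. That should give you a good idea of how well the estimate works. Also take a look at networkx.org/documentation/stable/reference/algorithms/…
Good to know: how many nodes do you have? What is the proportion of nodes per label? How much clustering do you expect?
In my experience, you can even within a certain amount of time (e.g.
@Sparky05 that would mean 10**12 pairwise comparisons, so if you want it in shortest_path_length takes me a few microseconds (with a ~1k-node graph), so it is at least ~1000 times slower than what you are suggesting...
I'm not sure how you arrived at the number of pairwise comparisons, but in an unweighted graph you can simply run a breadth-first search from each node. That sounds endless, but in practice it can be run for fairly large graphs and the result stored afterwards. For a discussion of the complexity, see math.stackexchange.com/questions/58198/… - even though I don't think that discussion is what the OP is after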
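
The suggestion in the first comment can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: a small `gnp_random_graph` stands in for the real network, all distances are precomputed so that repeated sampling is cheap, and the helper names `estimate_mean_distance` and `spread_of_estimates` are made up for this example. The largest sample size is capped at 1000 to keep the run short.

```python
import random
import statistics
import networkx as nx

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumption: a small undirected random graph stands in for the real network
G = nx.gnp_random_graph(200, 0.05, seed=0)
nodes = list(G.nodes)

# One breadth-first search per node; cheap for a graph this small
dist = dict(nx.all_pairs_shortest_path_length(G))

def estimate_mean_distance(sample_size):
    """Mean shortest-path length over randomly sampled, reachable pairs."""
    lengths = []
    for _ in range(sample_size):
        a, b = random.sample(nodes, 2)
        if b in dist[a]:  # skip pairs with no path between them
            lengths.append(dist[a][b])
    return statistics.mean(lengths) if lengths else float('nan')

def spread_of_estimates(sample_size, repeats=100):
    """Mean and standard deviation of the estimate over repeated samples."""
    estimates = [estimate_mean_distance(sample_size) for _ in range(repeats)]
    return statistics.mean(estimates), statistics.stdev(estimates)

summary = {s: spread_of_estimates(s) for s in (10, 100, 1000)}
for s, (m, sd) in summary.items():
    print(f'S={s:5}: mean={m:.3f}  std={sd:.3f}')
```

The standard deviation across repetitions should shrink roughly as 1/sqrt(S), so one option is to pick the smallest S whose spread is small compared with the difference between groups that you want to detect.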

Tags: python-3.x random networkx social-networking

【Solution 1】:

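You haven't given many details, so I'll make up some data and make some assumptions, in the hope that it is useful.

Start by importing packages and generating a sample dataset: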

import random
import networkx as nx

# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

Next, compute the shortest paths between random pairs of nodes:

results = []

# I had to use 100,000 pairs to get the CI small enough below
for _ in range(100000):
    a, b = random.sample(list(G.nodes), 2)
    try:
        n = nx.algorithms.shortest_path_length(G, a, b)
    except nx.NetworkXNoPath:
        # no path between nodes found
        n = -1
    results.append((a, b, n))

Finally, here is some code that summarises the results and prints them:

from collections import Counter
from scipy import stats

# counters of distances for pairs where both labels are 0, both are 1, or the labels differ
c_0 = Counter()
c_1 = Counter()
c_d = Counter()

# accumulate distances into the above counters
node_data = {i: a['label'] for i, a in G.nodes.data()}
cc = { (0,0): c_0, (0,1): c_d, (1,0): c_d, (1,1): c_1 }
for a, b, n in results:
    cc[node_data[a], node_data[b]][n] += 1

# code to display the results nicely
def show(c, title):
    s = sum(c.values())
    print(f'{title},  n={s}')
    for k, n in sorted(c.items()):
        # calculate some sort of CI over monte carlo error
        lo, hi = stats.beta.ppf([0.025, 0.975], 1 + n, 1 + s - n)
        print(f'{k:5}: {n:5} = {n/s:6.2%} [{lo:6.2%}, {hi:6.2%}]')

show(c_0, 'both 0')
show(c_1, 'both 1')
show(c_d, 'different')

The above prints:

both 0,  n=63930
   -1: 60806 = 95.11% [94.94%, 95.28%]
    1:   107 =  0.17% [ 0.14%,  0.20%]
    2:   753 =  1.18% [ 1.10%,  1.26%]
    3:  1137 =  1.78% [ 1.68%,  1.88%]
    4:   584 =  0.91% [ 0.84%,  0.99%]
    5:   334 =  0.52% [ 0.47%,  0.58%]
    6:   154 =  0.24% [ 0.21%,  0.28%]
    7:    50 =  0.08% [ 0.06%,  0.10%]
    8:     3 =  0.00% [ 0.00%,  0.01%]
    9:     2 =  0.00% [ 0.00%,  0.01%]

both 1,  n=3978
   -1:  3837 = 96.46% [95.83%, 96.99%]
    1:     6 =  0.15% [ 0.07%,  0.33%]
    2:    34 =  0.85% [ 0.61%,  1.19%]
    3:    34 =  0.85% [ 0.61%,  1.19%]
    4:    31 =  0.78% [ 0.55%,  1.10%]
    5:    30 =  0.75% [ 0.53%,  1.07%]
    6:     6 =  0.15% [ 0.07%,  0.33%]

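To save space, I've cut the section where the labels differ. The values in square brackets are the 95% CI of the Monte Carlo error. Using more iterations above reduces this error, while obviously taking more CPU time.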
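
As a rough planning aid, the width of such an interval can be related to the number of sampled pairs. The sketch below uses the normal approximation to a binomial proportion rather than the Beta interval used above, so it is only approximate, and the function names `pairs_needed` and `total_pairs_needed` are hypothetical helpers introduced for this illustration.

```python
import math

def pairs_needed(p, half_width, z=1.96):
    """Approximate number of pairs in one label group so that a proportion
    near p gets a 95% interval of about +/- half_width.
    Assumption: normal approximation, not the Beta interval used above."""
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)

def total_pairs_needed(p, half_width, group_fraction, z=1.96):
    """Scale up by the share of sampled pairs that fall into the group,
    e.g. roughly 0.04 for 'both 1' when about 20% of nodes are labelled 1."""
    return math.ceil(pairs_needed(p, half_width, z) / group_fraction)

# A distance bucket holding about 1% of a group's pairs,
# estimated to within +/- 0.1 percentage points:
n_group = pairs_needed(0.01, 0.001)
print(n_group)  # 38032

# The same precision for a group that receives only ~4% of the sampled pairs
n_total = total_pairs_needed(0.01, 0.001, 0.04)
print(n_total)
```

This is consistent with the output above: the 'both 0' group had about 64,000 pairs, and its buckets near 1% came out with half-widths just under 0.1 percentage points. The smaller 'both 1' group is what drives the total number of pairs required.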

【Comments】:

【Solution 2】:

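This is more or less an extension of my discussion with Sam Mason. I just want to give you some timing numbers because, as discussed, retrieving all distances may be feasible and possibly even faster. Based on the code in Sam Mason's answer, I tested both variants, and retrieving all distances for 1000 nodes is much faster than sampling 100 000 pairs. The main advantage is that all of the retrieved distances are used.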

import random
import networkx as nx

import time


# human social networks tend to be "scale-free"
G = nx.generators.scale_free_graph(1000)

# set labels to either 0 or 1
for i, attr in G.nodes.data():
    attr['label'] = 1 if random.random() < 0.2 else 0

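def timing(f):
    # decorator that prints how long each call to f takes
    def wrap(*args, **kwargs):
        time1 = time.perf_counter()  # clock intended for measuring intervals
        ret = f(*args, **kwargs)
        time2 = time.perf_counter()
        print('{:s} function took {:.3f} ms'.format(f.__name__, (time2-time1)*1000.0))

        return ret
    return wrap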

@timing
def get_sample_distance():
    results = []
    # I had to use 100,000 pairs to get the CI small enough below
    for _ in range(100000):
        a, b = random.sample(list(G.nodes), 2)
        try:
            n = nx.algorithms.shortest_path_length(G, a, b)
        except nx.NetworkXNoPath:
            # no path between nodes found
            n = -1
        results.append((a, b, n))
    return results

@timing
def get_all_distances():
    # with neither source nor target given, shortest_path_length returns
    # an iterator, so wrap it in dict() to force every distance to be computed
    return dict(nx.shortest_path_length(G))

get_sample_distance()
# get_sample_distance function took 2338.038 ms

get_all_distances()
# get_all_distances function took 304.247 ms
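
To make use of every retrieved distance, the all-pairs result can be tallied by label pair in the same spirit as the counters in Solution 1. The following is a minimal sketch under assumptions: a small undirected `gnp_random_graph` with random 0/1 labels stands in for the real data, and the `group` helper and the `counts` layout are invented for this example.

```python
import random
from collections import Counter
import networkx as nx

random.seed(0)  # fixed seed so the sketch is repeatable

# Assumption: a small random graph with random labels stands in for the real data
G = nx.gnp_random_graph(200, 0.03, seed=0)
label = {v: 1 if random.random() < 0.2 else 0 for v in G.nodes}

# one Counter per label group, mapping distance -> number of ordered pairs
counts = {'both 0': Counter(), 'both 1': Counter(), 'different': Counter()}

def group(a, b):
    """Name of the label group that the pair (a, b) belongs to."""
    if label[a] != label[b]:
        return 'different'
    return 'both 1' if label[a] == 1 else 'both 0'

# all_pairs_shortest_path_length yields (source, {target: distance}) tuples
for a, lengths in nx.all_pairs_shortest_path_length(G):
    for b in G.nodes:
        if a == b:
            continue
        # -1 marks "no path", matching the sentinel used in the sampled version
        counts[group(a, b)][lengths.get(b, -1)] += 1

for name, c in counts.items():
    reachable = {d: k for d, k in c.items() if d > 0}
    total = sum(reachable.values())
    mean = sum(d * k for d, k in reachable.items()) / total if total else float('nan')
    print(f'{name}: pairs={sum(c.values())}, mean distance over reachable pairs={mean:.3f}')
```

Because every ordered pair is counted exactly once, there is no Monte Carlo error left to put an interval around; the remaining question is only whether the all-pairs computation fits in time and memory for the real graph.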

【Comments】: