如何提高脚本的性能？答案

【问题标题】：How can I improve the performance of my script?如何提高脚本的性能？
【发布时间】：2020-06-05 20:36:11
【问题描述】：

我有一个“种子”GeoDataFrame (GDF)(RED)，其中包含一个 0.5 弧分的全球网格 ((180*2)*(360*2) = 259200)。每个单元格都包含一个绝对总体估计值。此外，我有一个“水蛭”GDF（绿色），大约有 8250 个相邻的各种大小的不规则形状（分水岭）。

我编写了一个脚本，根据网格单元（种子 GDF）和水蛭 GDF 中的几何形状之间的重叠区域，将总体估计分配给水蛭 GDF 中的几何形状。该脚本非常适合我的示例数据（见下文）。但是，一旦我在我的实际数据上运行它，它就会非常慢。我运行了一夜，第二天早上只执行了 27% 的计算。我将不得不多次运行这个脚本并且每次等待两天，这根本不是一个选择。

在做了一些文献研究之后，我已经用 for index i in df.iterrows() 替换了 (?) for 循环（或者这与“常规”python for 循环相同），但它并没有带来我希望的性能提升为。

任何建议儿子我可以如何加快我的代码？在 12 小时内，我的脚本只处理了 ~200000 行中的 ~30000 行。

我的预期输出是 leech_df['leeched_values'] 列。

import geopandas as gpd
import time
from datetime import datetime
from shapely.geometry import Polygon

# =============================================================================
# Geometries for testing
# =============================================================================

polys1 = gpd.GeoSeries([Polygon([(0.00,0.00), (0.00,0.25), (0.25,0.25), (0.25,0.00)]),
                        Polygon([(0.00,0.25), (0.00,0.50), (0.25,0.50), (0.25,0.25)]),
                        Polygon([(0.00,0.50), (0.00,0.75), (0.25,0.75), (0.25,0.50)]),
                        Polygon([(0.25,0.00), (0.25,0.25), (0.50,0.25), (0.50,0.00)]),
                        Polygon([(0.25,0.25), (0.25,0.50), (0.50,0.50), (0.50,0.25)]),
                        Polygon([(0.25,0.50), (0.25,0.75), (0.50,0.75), (0.50,0.50)]),
                        Polygon([(0.50,0.00), (0.50,0.25), (0.75,0.25), (0.75,0.00)]),
                        Polygon([(0.50,0.25), (0.50,0.50), (0.75,0.50), (0.75,0.25)]),
                        Polygon([(0.50,0.50), (0.50,0.75), (0.75,0.75), (0.75,0.50)]),
                        ])

polys2 = gpd.GeoSeries([Polygon([(0.125,0.125), (0.125,0.375), (0.375,0.375), (0.375,0.125)]),
                        Polygon([(0.050,0.550), (0.050,0.700), (0.200,0.700), (0.200,0.550)]),
                        Polygon([(0.25,0.625), (0.25,0.375), (0.750,0.375), (0.750,0.625)]),
                        ])

seed_df = gpd.GeoDataFrame({'geometry': polys1, 'seed_value':[10,10,10,10,10,10,10,10,10]})
leech_df = gpd.GeoDataFrame({'geometry': polys2})

del polys1, polys2

seed_value = 'seed_value'

# =============================================================================
# Prepare DataFrames
# =============================================================================

start = time.time()

print('\n\nPrepare DataFrames ... ')

# Create a new index for the seed DF
# The old index will be copied into the seed DF as a new column
# and transferred into the merged DF
seed_df = seed_df.reset_index()
seed_df = seed_df.rename(columns={'index': 'seed_index'})

# Create a new index for the seed DF
# The old index will be copied into the seed DF as a new column
# and transferred into the merged DF
leech_df = leech_df.reset_index()
leech_df = leech_df.rename(columns={'index': 'leech_index'})

end = time.time()
print(end - start)

# Add the area to the geometries

start = time.time()

print('Calculating the area of the leech DF geometries ...')
leech_df['leech_area'] = leech_df['geometry'].area

print('Calculating the area of the seed DF geometries ...')
seed_df['seed_area'] = seed_df['geometry'].area

leech_df['leeched_value'] = 0.0

end = time.time()
print(end - start)

# =============================================================================
# Merge seed and leech data and count overlaps
# =============================================================================

start = time.time()

print('Merging DataFrames ... ')

merged_df = gpd.overlay(leech_df, seed_df, how='union')
# Drop NaNs
merged_df = merged_df.dropna(axis='rows')

# =============================================================================
# Allocate seed values
# =============================================================================

# Count with how many leech geometries each seed geometry overlaps

print('Count overlaps ... ')

overlaps = merged_df['seed_index'].value_counts()

neglected_values = list()
one_overlaps_values = list()
more_overlaps_values = list()

no_overlaps = list()
one_overlaps = list()
more_overlaps = list()

end = time.time()
print(end - start)

start = time.time()

print('Allocate seed values ... ')

i = 0

for index, row in seed_df.iterrows(): # For each row in seed_df MAX 70123

    if row['seed_index'] % 1000 == 0:
        print(str(row['seed_index'])+'k  at  '+str(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

    # If seed geometry does not have an overlap
    # Get the value with the seed_index == 0
    # Otherwise return 0
    # So whenever, the seedindex does not turn up in overlaps return zero

    if overlaps.get(row['seed_index'],0) == 0: # If seed geometry does not have an overlap
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': No overlap')
        no_overlaps.append(1) # Count the values which have not been considered
        neglected_values.append(row[seed_value]) # Count the values which have not been considered
        i = i + 1
    elif overlaps.get(row['seed_index'],0) == 1:
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': One overlap')
        one_overlaps.append(1)
        # What is for row the leech index (with which leech geometry does an overlap exist)?
        temp_int = int(merged_df[merged_df['seed_index'] == row['seed_index']]['leech_index'])
        # For this leech index replace leeched_value with leeched_value + leeched_value
        seed_value_amount = int(seed_df[seed_df['seed_index'] == row['seed_index']][seed_value])
        leech_df.loc[temp_int,'leeched_value'] = leech_df.loc[temp_int,'leeched_value'] + seed_value_amount
        one_overlaps_values.append(row[seed_value]) # Count the values which have not been considered
        i = i + 1
    elif overlaps.get(row['seed_index'],0) > 1:
        #print('Grid cell ' + str(i) + ' - ' + str(row['seed_index']) + ': More than one overlap')
        more_overlaps.append(1)
        # Create a DF which contains the overlaps of the seed geometries with geoemtries of the leech df
        temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]
        # Calculate the overlaying area
        temp_df['overlay_area'] = temp_df['geometry'].area
        # And comapre this to the total overlaying area of this grid cell
        temp_df['rel_contribution'] = 0.0
        temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)
        # Calcualte the absolute distribution of the seed value to each leech row ('leech index')
        temp_df['abs_contribution'] = temp_df[seed_value]*temp_df['rel_contribution']
        # Reset index
        temp_df = temp_df.reset_index(drop=True)
        more_overlaps_values.append(row[seed_value]) # Count the values which have not been considered
        i = i + 1
        #For each overlap between grid cell (seed) and leech geometry:
        for j in range(len(leech_df)):
            #print('==   LEECH ' + str(j) + '.' +str(i))
            # Create a DF which only contains the temp_df row for the specific leech geometry
            contribution = temp_df[temp_df['leech_index'] == j]
            # Save the contribution to the leech_df
            #leech_df['leeched_value'][j] = leech_df.loc[:,('leeched_value')][j] + sum(contribution['abs_contribution'])
            leech_df.loc[j,'leeched_value'] = leech_df.loc[:,('leeched_value')][j] + sum(contribution['abs_contribution'])

end = time.time()
print(end - start)
print('Done')

# =============================================================================
# Data check
# =============================================================================

print('>1 Overlaps: ' + str(sum(more_overlaps)) + ' (sum neglected values ' + str(sum(more_overlaps_values)) + ')' )
print('=1 Overlaps: ' + str(sum(one_overlaps)) + ' (sum neglected values ' + str(sum(one_overlaps_values)) + ')' )
print('No Overlaps: ' + str(sum(no_overlaps)) + ' (sum neglected values ' + str(sum(neglected_values)) + ')' )

print('Unconsidered: ' + str(sum(seed_df[seed_value]) - (sum(more_overlaps_values)+sum(one_overlaps_values)+sum(neglected_values))))


# =============================================================================
# Plot
# =============================================================================

# Plot
base = seed_df.plot(color='red', edgecolor='black');
leech_df.plot(ax=base, color='green', alpha=0.5);

编辑：pip freeze > requirements_output.txt 返回：

affine==2.3.0
alabaster==0.7.12
appdirs==1.4.3
argh==0.26.2
asn1crypto==1.3.0
astroid==2.3.3
atomicwrites==1.3.0
attrs==19.3.0
autopep8==1.4.4
Babel==2.8.0
backcall==0.1.0
bcrypt==3.1.7
bleach==3.1.0
cachetools==4.0.0
Cartopy==0.17.0
certifi==2019.11.28
cffi==1.13.2
cftime==1.0.4.2
chardet==3.0.4
Click==7.0
click-plugins==1.1.1
cligj==0.5.0
cloudpickle==1.2.2
colorama==0.4.3
country-converter==0.6.6
cryptography==2.8
cycler==0.10.0
dask==2.10.0
datacube==1.7
decorator==4.4.1
defusedxml==0.6.0
Deprecated==1.2.7
descartes==1.1.0
diff-match-patch==20181111
docutils==0.16
entrypoints==0.3
Fiona==1.8.13
flake8==3.7.9
future==0.18.2
GDAL==3.0.2
geocube==0.0.10
geopandas==0.6.2
h5py==2.9.0
idna==2.8
imageio==2.6.1
imagesize==1.2.0
importlib-metadata==1.4.0
intervaltree==3.0.2
ipykernel==5.1.4
ipython==7.11.1
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.14.1
Jinja2==2.10.3
joblib==0.14.1
jsonschema==3.2.0
jupyter-client==5.3.4
jupyter-core==4.6.1
keyring==21.1.0
kiwisolver==1.1.0
lark-parser==0.8.1
lazy-object-proxy==1.4.3
lxml==4.4.2
mapclassify==2.2.0
MarkupSafe==1.1.1
matplotlib==3.1.1
mccabe==0.6.1
mistune==0.8.4
mkl-fft==1.0.15
mkl-random==1.1.0
mkl-service==2.3.0
more-itertools==8.0.2
munch==2.5.0
nbconvert==5.6.1
nbformat==5.0.4
netCDF4==1.5.3
notebook==6.0.0
numpy==1.16.4
numpydoc==0.9.2
olefile==0.46
OWSLib==0.19.1
packaging==20.1
pandas==0.25.0
pandocfilters==1.4.2
paramiko==2.6.0
parso==0.6.0
pathtools==0.1.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.0.0
pluggy==0.13.1
prometheus-client==0.7.1
prompt-toolkit==3.0.3
psutil==5.6.7
psycopg2==2.8.4
pycodestyle==2.5.0
pycparser==2.19
pydocstyle==4.0.1
pyepsg==0.4.0
pyflakes==2.1.1
Pygments==2.5.2
pykdtree==1.3.1
pylint==2.4.4
pymatreader==0.0.20
Pympler==0.7
pymrio==0.4.0
PyNaCl==1.3.0
pyOpenSSL==19.1.0
pyparsing==2.4.6
pyPEG2==2.15.2
pyproj==2.4.2.post1
PyQt5-sip==4.19.19
pyrsistent==0.15.7
pyshp==2.1.0
PySocks==1.7.1
python-dateutil==2.8.1
python-jsonrpc-server==0.3.4
python-language-server==0.31.7
pytz==2019.3
pywin32==227
pywin32-ctypes==0.2.0
pywinpty==0.5.7
PyYAML==5.2
pyzmq==18.1.0
QDarkStyle==2.8
QtAwesome==0.6.1
qtconsole==4.6.0
QtPy==1.9.0
rasterio==1.1.2
requests==2.22.0
rioxarray==0.0.19
rope==0.16.0
Rtree==0.9.3
scikit-learn==0.22.1
scipy==1.3.2
Send2Trash==1.5.0
Shapely==1.7.0
singledispatch==3.4.0.3
six==1.14.0
snowballstemmer==2.0.0
snuggs==1.4.7
sortedcontainers==2.1.0
Sphinx==2.3.1
sphinxcontrib-applehelp==1.0.1
sphinxcontrib-devhelp==1.0.1
sphinxcontrib-htmlhelp==1.0.2
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.2
sphinxcontrib-serializinghtml==1.1.3
spyder==4.0.0
spyder-kernels==1.8.1
spyder-notebook==0.1.4
SQLAlchemy==1.3.13
terminado==0.8.3
testpath==0.4.4
toolz==0.10.0
tornado==6.0.3
tqdm==4.43.0
traitlets==4.3.3
ujson==1.35
urllib3==1.25.8
watchdog==0.9.0
wcwidth==0.1.7
webencodings==0.5.1
win-inet-pton==1.1.0
wincertstore==0.2
wrapt==1.11.2
xarray==0.14.1
xlrd==1.2.0
xmltodict==0.12.0
yapf==0.28.0
zipp==0.6.0

【问题讨论】：

我想，我也可以立即从 merge_df 中删除 nan，而不是在每次循环迭代中一遍又一遍地这样做。
你看过使用df.apply吗？顺便说一句，我尝试运行您的代码。我已经安装了geopandas，但是如果我运行代码，错误提示我需要安装一个名为rtree 的包。试过pip install rtree，还是不行。
如果您可以运行pip freeze > requirements_output.txt 并粘贴requirements_output.txt 的内容，这可能对那些想回答这个问题的人有所帮助。
我已经用 GeoPandas 和 sjoin 添加了一个完整的例子。你能回顾一下，看看它是否符合你的需要。如果没有，您能否完整描述您的预期输出。干杯
因此，如果来自水蛭的两个多边形（例如区域 1 和 2 平方米）重叠 10 平方米的种子。种子值将以 1/3 和 2/3 分配到水蛭中相应的多边形，还是 1/10 和 2/10？如果水蛭多边形相交会发生什么

标签： python pandas performance geopandas

【解决方案1】：

简介

可能值得详细分析您的代码，以准确了解您的瓶颈所在。

以下一些建议可以提高您的脚本性能：

避免list.append(1) 计算出现次数，改用collection.Counter；
避免使用pandas.DataFrame.iterrows，改用pandas.DataFrame.itertuples；
避免不必要的额外分配，改用pandas.DataFrame.fillna：

例如。这一行：

temp_df['rel_contribution'] = 0.0
temp_df['rel_contribution'] = temp_df['overlay_area']/sum(temp_df.area)

可能的瓶颈

布尔索引是一项强大的功能，但将其嵌套在循环中可能是算法中最大的性能问题。你的瓶颈可能来自那里。

例如。此行在循环中执行布尔索引：

temp_df = merged_df[merged_df['seed_index'] == row['seed_index']]

利用索引，因为它通常会大大降低算法的时间复杂度（如评论中建议的rtree，选择最佳索引需要经验）；
使用 Pandas 时，请尽量避免循环：可能有一个现有的“优化”快捷方式来执行聚合，而无需借助显式循环。它将简化维护并使您的代码更具可读性。

寻求解决方案

另请注意，GeoPandas 有一个空间连接 geopandas.sjoin 方法（依赖于 rtree），它可以连接空间操作中的行（默认为交集）。

我觉得你可以通过左相交连接和 group-by 来解决你的问题。然后你可以对不同的行和聚合应用不同的逻辑。

举例来说，假设我们要在seed_df 中找到所有多边形的交叉点计数和覆盖区域，并根据交叉区域的比率分配seed_value。我们可以这样实现：

# Merge datafarmes on geometry intersection
# keep geometries in seed_df which does not intersects at all:
df = gpd.sjoin(seed_df, leech_df, how='left').reset_index()
df = df.rename(columns={'index': 'seed_id', 'geometry': 'seed_geom', 'index_right': 'leech_id'})

# Add leech_df geometry to merged dataframe:
df = df.merge(leech_df, left_on='leech_id', right_index=True, how='left')
df = df.rename(columns={'geometry': 'leech_geom'})

# Create a function computing intersection area (fault tolerant)
def func(x):
    if x['leech_geom']:
        return x['seed_geom'].intersection(x['leech_geom']).area

# Apply function:
df['intersection'] = df.apply(func, axis=1)

结果看起来像 (df.tails(4))：

    seed_id                                          seed_geom  seed_value  \
8         5  POLYGON ((0.25000 0.50000, 0.25000 0.75000, 0....          10   
9         6  POLYGON ((0.50000 0.00000, 0.50000 0.25000, 0....          10   
10        7  POLYGON ((0.50000 0.25000, 0.50000 0.50000, 0....          10   
11        8  POLYGON ((0.50000 0.50000, 0.50000 0.75000, 0....          10   

    leech_id                                         leech_geom  intersection  
8        2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125  
9        NaN                                               None           NaN  
10       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125  
11       2.0  POLYGON ((0.25000 0.62500, 0.25000 0.37500, 0....       0.03125

此时，我们分组并聚合：

# Group by and aggregate:
agg = df.groupby('seed_id')['int_area'].agg(['count', 'sum']).reset_index()
agg = agg.rename(columns={'count': 'int_count', 'sum': 'int_sum'})

它导致：

   seed_id  int_count   int_sum
0        0          1  0.015625
1        1          2  0.015625
2        2          2  0.022500
3        3          1  0.015625
4        4          2  0.046875
5        5          1  0.031250
6        6          0  0.000000
7        7          1  0.031250
8        8          1  0.031250

然后，我们将聚合与原始数据框合并并执行最终计算：

final = df.merge(agg)
final['leech_value'] = final['seed_value']*final['int_area']/final['int_sum']

最终结果是：

    seed_id  seed_value  leech_id  int_area  int_count   int_sum  leech_value
0         0          10       0.0  0.015625          1  0.015625    10.000000
1         1          10       0.0  0.015625          2  0.015625    10.000000
2         1          10       2.0  0.000000          2  0.015625     0.000000
3         2          10       1.0  0.022500          2  0.022500    10.000000
4         2          10       2.0  0.000000          2  0.022500     0.000000
5         3          10       0.0  0.015625          1  0.015625    10.000000
6         4          10       0.0  0.015625          2  0.046875     3.333333
7         4          10       2.0  0.031250          2  0.046875     6.666667
8         5          10       2.0  0.031250          1  0.031250    10.000000
9         6          10       NaN       NaN          0  0.000000          NaN
10        7          10       2.0  0.031250          1  0.031250    10.000000
11        8          10       2.0  0.031250          1  0.031250    10.000000

根据重叠交集的比例将seed_value分配给leech_df中的多边形。

如果您想知道leech_value 是如何分布到水蛭多边形中的，请再次聚合：

final.groupby('leech_id')['leech_value'].sum()

leech_id
0.0    33.333333
1.0    10.000000
2.0    36.666667

图形化：

注意

要使sjoin 工作，您需要安装rtree，首先您需要安装libspatialindex-dev 库。在 Debian 上，它归结为：

apt-get update && apt-get install libspatialindex-dev
python3 -m pip install --upgrade geopandas

【讨论】：