ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply

【Question title】: ValueError: could not broadcast input array from shape (2) into shape (1) when using df.apply
【Posted】: 2018-08-30 04:39:45
【Question】:

I have code that goes through each row/item of a Series and converts it to bigrams/trigrams. The code is as follows:

def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
    if gram==2:
        return map(tuple,set(map(frozenset,list(nltk.bigrams(txlis)))))
    else:
        return map(tuple,set(map(frozenset,list(nltk.trigrams(txlis)))))

#pdb.set_trace()
print len(namedat)
prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))

The error occurs on the last line, when I apply the function to the data named namedat, which looks like this:

0                                       inter-burgo ansan
1                                        dogo glory condo
2                                                 w hotel
3                                      onyang grand hotel
4                                 onyang hot spring hotel
5            onyang cheil hotel (ex. onyang palace hotel)
6                springhill suites paso robles atascadero
7                            best western plus colony inn
8                                                  hesse 
9                                 ibis styles aachen city
10                              pullman aachen quellenhof
11                             mercure aachen europaplatz
12                                  leonardo hotel aachen
13                                  aquis grana cityhotel
14                                            buschhausen
...                                                   ...
[166295 rows x 1 columns]

ValueError: could not broadcast input array from shape (2) into shape (1)

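I tried debugging: txt and the bigrams are both generated successfully, so the splitting function itself seems to be fine. I don't know how to solve this. Please help.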

Full error message:

Traceback (most recent call last):
  File "data_playground.py", line 163, in <module>
    main()
  File "data_playground.py", line 156, in main
    createparams(db.hotelbeds_properties,"hotelbeds")
  File "data_playground.py", line 139, in createparams
    prop_params = analyze(prop_subdf)
  File "data_playground.py", line 110, in analyze
    prop_data = pd.DataFrame(namedat.apply(splitting,axis=1))
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4877, in apply
    ignore_failures=ignore_failures)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 4990, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 330, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 461, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/frame.py", line 6173, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4642, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/home/shubhang/.virtualenvs/pa/local/lib/python2.7/site-packages/pandas/core/internals.py", line 4604, in construction_error
    raise e
ValueError: could not broadcast input array from shape (2) into shape (1)

An example of what my code does: it takes a row from the table shown above, for example:

name    shaba boutique hotel
Name: 166278, dtype: object

and then returns the bigrams generated from it:

[(u'shaba', u'boutique'), (u'boutique', u'hotel')]

If I run a simple for loop (using iterrows), the function works and I get a list. I don't understand why apply fails.
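The loop-based approach mentioned above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the asker's actual code: the `bigrams` helper and the two-row DataFrame are hypothetical stand-ins, and stop-word filtering is omitted. The loop works because a plain Python list can hold results of any length, one per row.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real namedat DataFrame.
namedat = pd.DataFrame({"name": ["shaba boutique hotel", "w hotel"]})


def bigrams(text):
    """Return consecutive word pairs of `text` as a list of tuples."""
    words = text.split()
    return list(zip(words, words[1:]))


# Collect one result per row; each result may have a different length.
results = []
for _, row in namedat.iterrows():
    results.append(bigrams(row["name"]))

print(results)
```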

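【Comments】:

Please include the full error message and a minimal example.
Hey, thanks @DyZ! I have added the full error message and an example of what the code does.

Tags: python pandas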

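【Solution 1】:

The reason for this error is that df.apply(axis=1) expects the function to return a single value per row so that it can build a Series; the pandas documentation for DataFrame.apply has more detail. Your code returns the result of map(tuple, ...), which in Python 2 is a list, and for any row with more than two words left after stop-word filtering that list has more than one element. You can try this on a small fake DataFrame and see that it works fine, like so: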

namedat_s = pd.Series(['inter-burgo ansan', 'glory condo', 'w hotel'])
namedat = pd.DataFrame(namedat_s)

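...but put 'dogo' back (so that the row reads 'dogo glory condo') and you will get the error again. This is a good example of why long one-liners are not always helpful, especially when you are just starting out.

If you had tried something like the following, you might have found the answer sooner: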
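As a rough illustration of what the message itself means, the same kind of ValueError can be triggered in plain NumPy by assigning two values into an array that has room for one. This is only a sketch of the shape mismatch, not a reproduction of what pandas does internally, and the exact wording of the message can vary between NumPy versions.

```python
import numpy as np

target = np.zeros(1)              # room for exactly one value
values = np.array([10.0, 20.0])   # two values, e.g. two bigrams for one row

try:
    target[:] = values            # shapes differ, so the assignment fails
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```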

def splitting(txt,gram=2):
    tx1 = txt.str.replace('[^\w\s]','').str.split().tolist()[0]
    if(len(tx1)==0):
        return np.nan
    txlis = [w for w in tx1 if w.lower() not in stop_wrds]
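    # find_ngrams is not defined in this post; presumably a helper equivalent to nltk.bigrams / nltk.trigrams
    print 1, txlis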
    print 2, find_ngrams(txlis,2)
    print 3, list(find_ngrams(txlis,2))
    print 4, map(frozenset,list(find_ngrams(txlis,2)))
    print 5, set(map(frozenset,list(find_ngrams(txlis,2))))
    print 6, map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    print len(map(tuple,set(map(frozenset,list(find_ngrams(txlis,2))))))
    if gram==2:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,2)))))
    else:
        return map(tuple,set(map(frozenset,list(find_ngrams(txlis,3)))))

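As you said, you will see that the error does not occur inside the splitting function but in what happens after it returns, and knowing what is being returned gives you an important clue as to why.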
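One possible way around the problem, sketched here in Python 3 under several assumptions, is to apply the function to the column itself with Series.apply, so that each returned list is stored as a single object in its row. The `STOP_WORDS` set and the `find_ngrams` helper below are hypothetical (the question's stop_wrds is not shown, and nltk is left out to keep the sketch self-contained), the column is assumed to be called `name` as in the question's sample row, and the frozenset de-duplication from the original code is omitted because it discards word order.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stop-word set; the question's stop_wrds is not shown.
STOP_WORDS = {"the", "a", "of"}


def find_ngrams(words, n):
    """Return consecutive n-grams of `words` as a list of tuples."""
    return list(zip(*(words[i:] for i in range(n))))


def splitting(text, gram=2):
    """Turn one name (a plain string) into a list of n-gram tuples."""
    words = re.sub(r"[^\w\s]", "", text).split()
    if not words:
        return np.nan
    kept = [w for w in words if w.lower() not in STOP_WORDS]
    return find_ngrams(kept, gram)


namedat = pd.DataFrame(
    {"name": ["inter-burgo ansan", "dogo glory condo", "w hotel"]}
)

# Series.apply passes each cell to the function and keeps whatever it
# returns as one object per row, so lists of different lengths are fine.
prop_data = namedat["name"].apply(splitting).to_frame("ngrams")
print(prop_data)
```

Here the function receives a plain string rather than a one-element row, so the `.str` accessor and `tolist()[0]` from the original are no longer needed.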
