Extract new feature from sentence column - Python

【Question title】: Extract new feature from sentence column - Python
【Posted】: 2019-12-10 03:47:20
【Question】:

I have two data frames:

The city_state data frame

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and the sentence data frame

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

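I want to add a new feature to the sentence data frame: a column named city. The city column is extracted from sentence. If the sentence contains one of the city names in city_state['city'], the value is that name; if it contains none of them, the value is Null.

The expected new data frame would look like this: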

    sentence                                            city
0   marthy was born in dothan                           dothan
1   michelle reads some books at her home               Null
2   hasan is highschool student in chicago              chicago
3   hartford of the west is the nickname of des moines  des moines

I have run this code:

sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
    else:
        sentence['city'].append(None)

But the result of this code is:

ValueError: Length of values does not match length of index

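If you have experience with feature engineering for a similar case, could you give me some advice on how to write correct code for the expected result?

Thanks

Note: this is the full error log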

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
      2 
      3 for city in city_state.city:
      4     for text in sentence.sentence:
      5         words = text.split()

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index

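【Comments】:

Tags: python pandas dataframe machine-learning feature-extraction

【Solution 1】:

A quick and dirty approach using apply. It has not been tested on large data frames, so use it with caution. First define a function that extracts the city names: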

def ex_city(col, cities):
    output = []
    for w in cities:
        if w in col:
            output.append(w)
    return ','.join(output) if output else None

Then apply it to your sentence data frame:

city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
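One thing to watch with the substring test `w in col` above: it also matches inside longer words, so "mobile" would be found in a sentence that only mentions "automobile". A possible variant, sketched here under the assumption that the text is lowercase like the sample data, builds a single regular expression with word boundaries and uses `Series.str.extract`. The data frames are rebuilt from the question so that the snippet runs on its own; unlike `ex_city`, it returns only the first match and leaves NaN where nothing matches.

```python
import re
import pandas as pd

# Sample data copied from the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama"] * 5 + ["illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# Longest names first, so that at a given position a longer name
# is tried before a shorter one that is a prefix of it.
cities = sorted(city_state["city"].unique(), key=len, reverse=True)

# \b anchors keep "mobile" from matching inside "automobile".
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# expand=False returns a Series; rows without a match become NaN.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
print(sentence)
```

Multi-word names such as "des moines" need no special handling here, because the pattern is matched against the whole sentence rather than against individual words.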

【Comments】:

【Solution 2】:

Let sdf = the sentence dataframe and cdf = the city_state dataframe.

des moines causes a problem when using str.split, because its name contains a space.

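Handle that city separately. The apply call further down assigns the whole city column, so this line needs to run after it rather than before: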

sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

For the rest:

def get_city(sentence, cities):
    # Return the first word that is a known (single-word) city name.
    for word in sentence.split(' '):
        if word in cities:
            return word
    return None

cities = cdf['city'].tolist()
sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))
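The special case for "des moines" can be avoided by testing each city name against the whole sentence instead of against single words. The following is a minimal sketch, not the answerer's code: `find_city` is a hypothetical helper, and it assumes lowercase text whose words are separated by single spaces with no punctuation, as in the sample data.

```python
def find_city(text, cities):
    """Return the first name in `cities` found in `text` as whole words, else None."""
    # Padding both sides with a space forces matches on whole tokens
    # and works unchanged for names that contain spaces.
    padded = f" {text} "
    for city in cities:
        if f" {city} " in padded:
            return city
    return None


# Sample data copied from the question.
cities = ["huntsville", "montgomery", "birmingham", "mobile",
          "dothan", "chicago", "boise", "des moines"]
sentences = [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]

result = [find_city(s, cities) for s in sentences]
print(result)
```

With the data frames from this answer, the same helper could be passed to apply: `sdf['sentence'].apply(lambda s: find_city(s, cities))`.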

【Comments】:

【Solution 3】:

Something like this might work. I would try it myself, but I am on my phone.

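sentence_cities = []
# A set gives value-based membership tests ("in" on a Series checks the index).
cities = set(city_state.city)

for text in sentence.sentence:
    # One entry per sentence: the first word that is a city, else None.
    # Splitting on whitespace means multi-word names like "des moines" are missed.
    sentence_cities.append(next((word for word in text.split() if word in cities), None))

sentence['city'] = sentence_cities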
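A pandas detail that matters for membership tests like `word in cities`: when `cities` is a Series, the `in` operator checks the index labels, not the values. A short sketch with a made-up three-city Series illustrates the difference; converting to a set or using `.values` are two ways to test against the values instead.

```python
import pandas as pd

# Hypothetical small Series with the default integer index 0, 1, 2.
cities = pd.Series(["dothan", "chicago", "boise"])

in_index = 0 in cities                   # 0 is an index label
in_series = "dothan" in cities           # "dothan" is a value, not a label
in_values = "dothan" in cities.values    # checks the underlying array
in_set = "dothan" in set(cities)         # checks a set built from the values

print(in_index, in_series, in_values, in_set)
```

For repeated lookups over many words, the set is usually the more convenient choice, since it is built once and each test is a hash lookup.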

【Comments】: