【Question title】:Extract new feature from sentence column - Python
【Posted】:2019-12-10 03:47:20
【Question】:

I have two data frames:

The city_state data frame

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and the sentence data frame

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

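I want to add a new feature, a column named city, to the sentence data frame. The city column is extracted from sentence: if a sentence contains one of the city names in city_state['city'], the value is that city name; if it contains none of them, the value is Null.

The expected new data frame looks like this: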

    sentence                                              city
0   marthy was born in dothan                             dothan
1   michelle reads some books at her home                 Null
2   hasan is highschool student in chicago                chicago
3   hartford of the west is the nickname of des moines    des moines

I have run this code

sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
    else:
        sentence['city'].append(None)

but it fails with this error

ValueError: Length of values does not match length of index

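If you have experience with feature engineering for a similar case, could you give me some advice on how to write correct code for the expected result?

Thanks

Note: this is the full error log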

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
      2 
      3 for city in city_state.city:
      4     for text in sentence.sentence:
      5         words = text.split()

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index
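
The traceback points at the first line of the snippet: the empty dict assigned to sentence['city'] has length 0, while the index has four rows. A minimal sketch of one way around that, assuming the two data frames are rebuilt from the tables above and using a hypothetical helper list named found: collect exactly one value per sentence in a plain Python list, then assign the list as the new column in a single step.

```python
import pandas as pd

# Data frames rebuilt from the tables in the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama"] * 5 + ["illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

found = []  # hypothetical helper: one entry per sentence, in row order
for text in sentence["sentence"]:
    match = None
    for city in city_state["city"]:
        if city in text:  # substring test, so "des moines" is found too
            match = city
            break
    found.append(match)

# The list has the same length as the index, so the assignment is valid.
sentence["city"] = found
```

The outer loop runs over sentences rather than cities, which is what guarantees one value per row.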

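【Question discussion】:

    Tags: python pandas dataframe machine-learning feature-extraction


    【Solution 1】:

    A quick-and-dirty approach using apply; it has not been tested on large data frames, so use it with caution. First define a function that extracts the city names: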

    def ex_city(col, cities):
        output = []
        for w in cities:
            if w in col:
                output.append(w)
        return ','.join(output) if output else None
    

    Then apply it to your sentence data frame:

    city_list = city_state.city.unique().tolist()
    sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
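
A possible variant of the same idea, sketched here as an assumption rather than as part of the answer above: the `in` test matches substrings, so a name such as mobile would also be found inside a longer word. Building one regular expression with word boundaries and passing it to `Series.str.extract` avoids that and keeps only the first match. The variable names alternatives and pattern are made up for this sketch.

```python
import re
import pandas as pd

cities = ["huntsville", "montgomery", "birmingham", "mobile",
          "dothan", "chicago", "boise", "des moines"]
sentence = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

# Longest names first, so a longer name wins over a shorter one it contains.
alternatives = sorted(set(cities), key=len, reverse=True)

# \b anchors the match at word boundaries; re.escape protects special characters.
pattern = r"\b(" + "|".join(re.escape(c) for c in alternatives) + r")\b"

# With a single capture group and expand=False the result is a Series;
# rows without a match are missing values.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
```

Unlike the function above, this keeps a single city per sentence instead of joining several with commas.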
    

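    【Discussion】:

      【Solution 2】:

      sdf = sentence dataframe, cdf = city_state dataframe

      des moines causes a problem when using str.split, because the name contains a space.

      Handle that city separately. Run this line after the apply step below, because that step assigns the whole city column and would overwrite it: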

      sdf.loc[sdf['sentence'].str.contains('des moines'), 'city'] = 'des moines'

      For the rest:

      def get_city(sentence, cities):
          # Return the first word of the sentence that is a known city name
          for word in sentence.split(' '):
              if word in cities:
                  return word
          return None
      
      cities = cdf['city'].tolist()
      sdf['city'] = sdf['sentence'].apply(lambda x: get_city(x, cities))
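
One way to avoid special-casing des moines, offered as a sketch and not as the answerer's method: instead of testing single words, test every run of consecutive words up to the length of the longest city name. The function find_city and the variables city_set and max_words are hypothetical names introduced for this example; sdf and cdf are rebuilt from the question's tables.

```python
import pandas as pd

def find_city(text, city_set, max_words):
    """Return a city name found in text, preferring names with more words."""
    words = text.split()
    for n in range(max_words, 0, -1):          # try longer word groups first
        for i in range(len(words) - n + 1):    # slide a window of n words
            candidate = " ".join(words[i:i + n])
            if candidate in city_set:
                return candidate
    return None

cdf = pd.DataFrame({"city": ["huntsville", "montgomery", "birmingham", "mobile",
                             "dothan", "chicago", "boise", "des moines"]})
sdf = pd.DataFrame({"sentence": [
    "marthy was born in dothan",
    "michelle reads some books at her home",
    "hasan is highschool student in chicago",
    "hartford of the west is the nickname of des moines",
]})

city_set = set(cdf["city"])                          # set membership tests values
max_words = max(len(c.split()) for c in city_set)    # longest name, in words
found = [find_city(s, city_set, max_words) for s in sdf["sentence"]]
sdf["city"] = found
```

Because whole words are compared, a city name is never matched inside a longer word, and multi-word names need no separate step.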
      

      【Discussion】:

        【Solution 3】:

        Something like this might work. I would try it myself, but I am on my phone.

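        sentence_cities = []
        # Use a set: `in` on a pandas Series tests the index, not the values
        cities = set(city_state.city)
        
        for text in sentence.sentence:
            # Append exactly one entry per sentence: the first word that is a
            # known city, else None (multi-word names such as "des moines" are missed)
            sentence_cities.append(next((word for word in text.split() if word in cities), None))
        
        sentence['city'] = sentence_cities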
        

        【Discussion】:
