Using np.where and langdetect in pandas

【Question title】: Using np.where, langdetect in pandas
【Posted】: 2020-06-20 22:37:07
【Question】:

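I want to add a new column to a DataFrame that copies the text from another column when that text is written in English and stays empty otherwise, using the langdetect library to detect the language.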

df['lyrics_english'] = np.where(detect(df["lyrics"]) == 'en', df["lyrics"], '')

I hope the intent is clear. However, I get this error:

  File "C:/Users/PycharmProjects/pythontask/example", line 128, in <module>
    df['lyrics_english'] = np.where(detect(df["lyrics"]) == 'en', df["lyrics"], '')
  File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\site-packages\langdetect\detector_factory.py", line 129, in detect
    detector.append(text)
  File "C:\Users\AppData\Local\Programs\Python\Python38-32\lib\site-packages\langdetect\detector.py", line 104, in append
    text = self.URL_RE.sub(' ', text)
TypeError: expected string or bytes-like object

If I enter

df['lyrics_english'] = np.where(detect(df["lyrics"]) == 'en', 0, '')

the same error, pointing into the AppData paths, appears again. What can I do?

【Comments】:

What does the detect function contain?

Tags: python pandas numpy dataframe sentiment-analysis

【Solution 1】:

You can apply a function row by row with df.apply and put the condition inside it:

df['lyrics_english'] = df.apply(lambda x: x.lyrics if detect(x.lyrics)=='en' else '', axis=1)

Or you can call map on the Series that makes up the lyrics column:

df['lyrics_english'] = df['lyrics'].map(lambda x: x if detect(x)=='en' else '')

In an unrealistic example I tried, the second approach seemed faster. Note that I did not use the real langdetect.detect function in that test.
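The traceback in the question comes from passing the whole df["lyrics"] Series to detect, which expects a single string. Below is a minimal sketch of how the element-wise map could be combined with np.where. It makes two assumptions: toy_detect is a hypothetical stand-in (not part of langdetect) so the snippet runs without the library installed, and the sample lyrics are made up.

```python
import numpy as np
import pandas as pd


def toy_detect(text):
    """Hypothetical stand-in for langdetect.detect: 'en' for pure-ASCII text."""
    return "en" if text.isascii() else "other"


# Made-up sample data for illustration only
df = pd.DataFrame({"lyrics": ["hello world", "привет мир", "good morning"]})

# The detector handles one string at a time, so apply it element-wise
lang = df["lyrics"].map(toy_detect)

# np.where now receives a boolean mask with one entry per row
df["lyrics_english"] = np.where(lang == "en", df["lyrics"], "")
```

With the real library, langdetect.detect would take the place of toy_detect in the map call. The rest should stay the same, provided every value in the column is a string.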

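【Comments】:

Thank you, David Wierichs. But my problem is still not solved. It is something related to the lyb files.
Sorry, I somewhat missed the point, right. @YOLO's answer looks like a good route to me.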

【Solution 2】:

I guess there may be some non-string values such as NaN. You could try:

df['lyrics_english'] = np.where(detect(df["lyrics"].fillna("")) == 'en', df["lyrics"], '')

If this does not work, you need to look at df["lyrics"].unique() and figure out what is going on there.

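【Comments】:

Thanks, @YOLO. But I still get the same error. df['lyrics'].unique() contains the lyrics of about 8000 songs, and I do not see what could be wrong there, given that your code uses fillna. Each lyrics value is either NaN or a string.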