Finding specific text in a pandas DataFrame column

【Question title】: Finding specific text in pandas dataframe column
【Posted】: 2020-05-13 11:35:09
【Question】:

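I have a dataframe with a column of paper references, and I want to find every reference that is repeated anywhere in that column.

Here are a few rows of the dataframe: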

In [1]:

df4.iloc[0:2]

Out[2]:

 **cit2ref**    **reference**                                                                                                    **_id**
0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
            http://www.allaboutdepression.com/dia_03.html                                                                   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). 
            Retrieved December 7, 2016, from American Psychological Association, 
            http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx  Y17-1020

More rows:

 **cit2ref** **reference**                                                                                                                                 **_id**

0   NaN     All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html   Y17-1020
0   NaN     American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx   Y17-1020
0   NaN     American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx  Y17-1020
0   NaN     Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html   Y17-1020
0   Burton (2012)   Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism  Y17-1020
0   NaN     Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf    Y17-1020
0   Choudhury, 2014     De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data.  Y17-1020
0   NaN     De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media.     Y17-1020
0   Gotlib and Joormann (2010)  Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98.  Y17-1020
0   NaN     Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley.   Y17-1020
0   NaN     Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312.   Y17-1020
0   NaN     Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web.   Y17-102

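Here "0" is the index of the first paper, which has many references. There are 40k papers, each with about 20 references.

I am looking for any reference that is used again in other papers (each paper has a different index here), together with the indices where it occurs and the number of repetitions.

I tried regex and the pandas sorting methods

value_counts(sort=True).sort_index()

and

sort_values()

but that did not help.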

Here is the screenshot of the dataframe with 2 papers as indexed '0' and '1'

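【Comments】:

Could you explain what you mean by a reference? Is "American Psychological Association. (2016)." a reference? "Beattie, G.S. (2005, November)."? An example of what you want to achieve would help.
@sammywemmy The 'reference' column value (i.e. the whole text up to the '_id' column value) is a research paper reference. Scroll horizontally to see the whole row.
@Chris I added more images of the indexed dataframe. I don't know how to write the expected output as code or as a dataframe, but I highlighted what I expect. cit2ref has many NaN values because those rows belong to the same reference paper and the value is unknown; they cannot be dropped, because the column helps align references with the actual papers.
You can reply to this comment once you have edited the question and I will take another look. Reading minimal reproducible example or this link may also be useful. These are meant to guide you in writing better questions.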

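Tags: python regex pandas dataframe

【Answer 1】:

IIUC, use pandas.DataFrame.index.groupby (the groupby method of the frame's index).

Using a dummy dataframe, df (note that I added the last three rows for demonstration):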

print(df)
   cit2ref                                          reference       _id
0      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
0      NaN  American Psychological Association. (2016). Ce...  Y17-1020
0      NaN  American Psychological Association. (2016). Pa...  Y17-1020
0      NaN  Beattie, G.S. (2005, November). Social Causes ...  Y17-1020
0      NaN  Burton   (2012)   Burton, N. (2012, June 5). D...  Y17-1020
0      NaN  Clark, P., Niblett, T. (1988, October 25). The...  Y17-1020
0      NaN  Choudhury, 2014     De Choudhury, M., Counts, ...  Y17-1020
0      NaN  De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  Y17-1020
0      NaN  Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  Y17-1020
0      NaN  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  Y17-1020
0      NaN  Gotlib IH, Joormann J. Cognition and depressio...  Y17-1020
0      NaN  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...   Y17-102
1      NaN  All about depression: Diagnosis. (2013). Retri...  Y17-1020
1      NaN  American Psychological Association. (2016). Ce...  Y17-1020
1      NaN                StackOverflow. Not to be grouped-by   Y17-102
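
For anyone who wants to follow along, a frame of the same shape can be built roughly as below. This is only a sketch: the reference strings are shortened stand-ins rather than the asker's real data, only a few rows are kept, and the repeated index labels 0 and 1 play the role of paper numbers.

```python
import numpy as np
import pandas as pd

# Shortened stand-in strings; the real column holds full citation text.
rows = [
    (0, "All about depression: Diagnosis. (2013).", "Y17-1020"),
    (0, "American Psychological Association. (2016). CESD.", "Y17-1020"),
    (0, "Beattie, G.S. (2005, November).", "Y17-1020"),
    (1, "All about depression: Diagnosis. (2013).", "Y17-1020"),
    (1, "American Psychological Association. (2016). CESD.", "Y17-1020"),
    (1, "StackOverflow. Not to be grouped-by", "Y17-102"),
]

df = pd.DataFrame(
    {
        "cit2ref": np.nan,                   # unknown citation keys
        "reference": [r[1] for r in rows],
        "_id": [r[2] for r in rows],
    },
    index=[r[0] for r in rows],              # paper number, repeated per reference
)
print(df)
```

The index is deliberately non-unique, since every reference of one paper shares that paper's label.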

Then groupby:

df.index.groupby(df['reference'])
# or
d = {k: list(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)
# this looks prettier

                                                index       0
0   All about depression: Diagnosis. (2013). Retri...  [0, 1]
1   American Psychological Association. (2016). Ce...  [0, 1]
2   American Psychological Association. (2016). Pa...     [0]
3   Beattie, G.S. (2005, November). Social Causes ...     [0]
4   Burton   (2012)   Burton, N. (2012, June 5). D...     [0]
5   Choudhury, 2014     De Choudhury, M., Counts, ...     [0]
6   Clark, P., Niblett, T. (1988, October 25). The...     [0]
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...     [0]
8   Gotlib IH, Joormann J. Cognition and depressio...     [0]
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...     [0]
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...     [0]
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...     [0]
12                StackOverflow. Not to be grouped-by     [1]

You can see which references appear under which indices. If you want counts, use len instead of list:

d = {k: len(v) for k, v in df.index.groupby(df['reference']).items()}
new_df = pd.DataFrame.from_dict(d, orient='index').reset_index()
print(new_df)

Output:

                                                index  0
0   All about depression: Diagnosis. (2013). Retri...  2
1   American Psychological Association. (2016). Ce...  2
2   American Psychological Association. (2016). Pa...  1
3   Beattie, G.S. (2005, November). Social Causes ...  1
4   Burton   (2012)   Burton, N. (2012, June 5). D...  1
5   Choudhury, 2014     De Choudhury, M., Counts, ...  1
6   Clark, P., Niblett, T. (1988, October 25). The...  1
7   De Choudhury, M., Gamon, M., Couns, S., %27 Ho...  1
8   Gotlib IH, Joormann J. Cognition and depressio...  1
9   Gotlib and Joormann (2010)  Gotlib IH, Kasch K...  1
10  Gotlib, I. H., %27 Hammen, C. L. (1992). Psych...  1
11  Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and T...  1
12                StackOverflow. Not to be grouped-by  1
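
The counts above still list references that occur only once. A minimal sketch of keeping only the repeated ones, together with the papers they occur in, might look like this. The tiny frame and the column names `papers` and `count` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Made-up data: six references spread over papers 0 and 1.
df = pd.DataFrame(
    {"reference": ["ref A", "ref B", "ref C", "ref A", "ref B", "ref D"]},
    index=[0, 0, 0, 1, 1, 1],
)

# Index.groupby returns a dict: reference text -> Index of paper labels.
groups = df.index.groupby(df["reference"])

# Build both columns explicitly so each list stays in a single cell.
summary = pd.DataFrame(
    {
        "reference": list(groups.keys()),
        "papers": [list(idx) for idx in groups.values()],
    }
)
summary["count"] = summary["papers"].map(len)

# Keep only references that appear more than once.
repeated = summary[summary["count"] > 1].reset_index(drop=True)
print(repeated)
```

If the same reference can be listed twice within one paper, the count here is the number of rows, not the number of distinct papers.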

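【Comments】:

Thanks. Does this look up every reference across the whole 'reference' column and check for repeated values, giving the count and indices if found?
Yes. The first part gives the indices of the duplicates, and taking the length of each is equivalent to counting. However, this will include items that are not repeated (see "StackOverflow. Not to be grouped-by").
The dict comprehension just converts the values from arrays to lists for a prettier repr. Making it a dataframe has the same effect, but the result is easier to manage than in a dict.
I got this; I don't know what the results in the index column are, or why there are 521 new columns with NaN values. imgur.com/pAwBcTv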
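
A likely explanation for the extra NaN columns, assuming the lists were passed to `from_dict(..., orient='index')` as in the answer: each list is treated as a row, so its items are spread over as many columns as the longest list has, and shorter lists are padded with NaN. Several hundred columns would be consistent with one reference occurring several hundred times. The sketch below uses made-up data to show the effect, and one way to keep each list in a single cell; the column names `reference` and `papers` are illustrative only.

```python
import pandas as pd

# Made-up groups: reference text -> list of paper labels.
d = {"ref A": [0, 1, 2], "ref B": [0], "ref C": [1]}

# Each list becomes a row, so shorter lists are padded with NaN.
wide = pd.DataFrame.from_dict(d, orient="index")
print(wide)

# Keeping every list in one cell: build the two columns explicitly.
tall = pd.DataFrame({"reference": list(d.keys()), "papers": list(d.values())})
print(tall)
```

With the second form the frame always has two columns, however often a reference repeats.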