【Posted】: 2020-05-13 11:35:09
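【Question】:
I have a dataframe with a column of paper references, and I want to find every reference that appears more than once anywhere in that column.
Here are a few rows from the dataframe: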
In [1]:
df4.iloc[0:2]
Out[2]:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016,from All About Depression,
http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD).
Retrieved December 7, 2016, from American Psychological Association,
http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
More rows:
**cit2ref** **reference** **_id**
0 NaN All about depression: Diagnosis. (2013). Retrieved December 7, 2016, from All About Depression, http://www.allaboutdepression.com/dia_03.html Y17-1020
0 NaN American Psychological Association. (2016). Center for epidemiological studies depression (CESD). Retrieved December 7, 2016, from American Psychological Association, http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/depression-scale.aspx Y17-1020
0 NaN American Psychological Association. (2016). Patient health questionnaire (PHQ-9 %27 PHQ-2). Retrieved December 09, 2016, from http://www.apa.org/pi/ about/publications/caregivers/practice-settings/ assessment/tools/patient-health.aspx Y17-1020
0 NaN Beattie, G.S. (2005, November). Social Causes of Depression. Retrieved May 31, 2017, from http:// www.personalityresearch.org/papers/beattie.html Y17-1020
0 Burton (2012) Burton, N. (2012, June 5). Depressive Realism. Retrieved May 31, 2017, from https:// www.psychologytoday.com/blog/hide-and-seek/ 201206/depressive-realism Y17-1020
0 NaN Clark, P., Niblett, T. (1988, October 25). The CN2 induction Algorithm. Retrieved May 10, 2017, from https://pdfs.semanticscholar.org/766f/ e3586bda3f36cbcce809f5666d2c2b96c98c.pdf Y17-1020
0 Choudhury, 2014 De Choudhury, M., Counts, S., Horvits, E., %27 Hoff, A. (2014). Characterizing and Predicting Postpartum Depression from Shared Facebook Data. Y17-1020
0 NaN De Choudhury, M., Gamon, M., Couns, S., %27 Horvitz, E. (2013). Predicting Depression via Social Media. Y17-1020
0 Gotlib and Joormann (2010) Gotlib IH, Kasch KL, Traill S, Joormann J, Arnow BA, Johnson SL. (2010) Coherence and specificity of information-processing biases in depression and social phobia. J Abnorm Psychol. 2004;113(3): 386-98. Y17-1020
0 NaN Gotlib, I. H., %27 Hammen, C. L. (1992). Psychological aspects of depression: Toward a cognitive- interpersonal integration. New York: Wiley. Y17-1020
0 NaN Gotlib IH, Joormann J. Cognition and depression: current status and future directions. Annu Rev Clin Psychol. 2010;6:285-312. Y17-1020
0 NaN Hu, Quan, Ang Li, Fei Heng, Jianpeng Li, and Tingshao Zhu. "Predicting Depression of Social Media User on Different Observation Windows." 2015 IEEE/ WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI- IAT) (2015): n. pag. Web. Y17-102
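Here "0" is the index of the first paper, which has many references. There are 40k papers, each with about 20 references.
I am looking for any reference that is used again in other papers (each paper has a different index), along with the indices of those papers and the number of repetitions.
I tried regular expressions and the pandas sorting methods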
value_counts(sort=True).sort_index()
and
sort_values()
but neither helped.
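For illustration, here is a minimal sketch of one way the goal above could be expressed in pandas. It is not a confirmed solution. It assumes the dataframe index identifies the paper and that two references count as the same when they match after lowercasing and collapsing whitespace. The toy rows, the `_id` values other than Y17-1020, and the names `normalize`, `paper`, `key`, `n_papers`, and `papers` are hypothetical and not from the original data.

```python
import pandas as pd

# Hypothetical stand-in for df4: the index is the paper number,
# and each row holds one reference of that paper.
df4 = pd.DataFrame(
    {
        "reference": [
            "Gotlib, I. H., & Hammen, C. L. (1992). Psychological aspects of depression.",
            "Beattie, G.S. (2005, November). Social Causes of Depression.",
            "Gotlib, I. H.,  & Hammen, C. L. (1992). Psychological aspects of Depression. ",
            "De Choudhury, M. (2013). Predicting Depression via Social Media.",
            "Beattie, G.S. (2005, November). Social Causes of Depression.",
        ],
        "_id": ["Y17-1020", "Y17-1020", "X00-0001", "X00-0001", "X00-0002"],
    },
    index=[0, 0, 1, 1, 2],
)


def normalize(ref: pd.Series) -> pd.Series:
    """Lowercase, collapse runs of whitespace, and trim (an assumed notion of 'same')."""
    return ref.str.lower().str.replace(r"\s+", " ", regex=True).str.strip()


# Turn the paper index into a regular column so it can be aggregated.
refs = df4.rename_axis("paper").reset_index()
refs["key"] = normalize(refs["reference"])

# Count a reference at most once per paper.
refs = refs.drop_duplicates(["paper", "key"])

# For each normalized reference: how many papers cite it, and which ones.
summary = (
    refs.groupby("key")
    .agg(n_papers=("paper", "nunique"), papers=("paper", list))
    .query("n_papers > 1")
    .sort_values("n_papers", ascending=False)
)

print(summary)
```

Matching on a normalized string only catches references that are identical apart from case and spacing. References that differ by a typo or by citation style would need some form of fuzzy matching, which this sketch does not attempt.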
Here is a screenshot of the dataframe with two papers, indexed '0' and '1'.
【Comments】:
-
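Can you explain what you mean by a reference? Is "American Psychological Association. (2016)." a reference? What about "Beattie, G.S. (2005, November)."? An example of what you want to achieve would help.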
-
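@sammywemmy The 'reference' column value (the whole text up to the '_id' column value) is one reference of the research paper. Scroll horizontally to see the full row.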
-
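@Chris I added more images of the indexed dataframe. I don't know how to write the expected output as code or as a dataframe, but I have highlighted what I expect from the problem.
cit2ref has many NaN values because those rows are references of the same paper whose value is unknown. I can't drop them because the column helps align the references with the actual paper.
-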
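You can reply to this comment once you have edited the question, and I will take another look. Reading minimal reproducible example or this link may also be useful. They are meant to guide you toward writing better questions.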
Tags: python regex pandas dataframe