Web scrape titles in R: answers

【Question title】: Web Scrape titles in r
【Posted】: 2023-01-13 07:25:29
【Question】:

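I am trying to create a function get_CIDname().

Every compound has an assigned CID (Compound ID) from PubChem's chemical database.

I have a data frame with a column of these CIDs and some other columns of character values. I want to mutate a new column that names each CID with the title shown for it on that site.

Example:

i.e. all instances of 962 in this identifier column are replaced with "Water" and all instances of 176 with "Acetic Acid", the primary name shown on the website at https://pubchem.ncbi.nlm.nih.gov/compound/CID

Example dataset: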

df <- data.frame("Compound" = c(176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(12, mean = 500000, sd = 600000))

Desired output:

df <- data.frame("Compound" = c(176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(12, mean = 500000, sd = 600000),
Match = c("Acetic Acid", "Dihydromyrcenol", etc....))

Currently, I have:

library(rvest)  # assumed source of read_html() (re-exported from xml2)

get_CIDname <- function(CID) {
  read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", CID))
}

But I don't know how to decipher the HTML of the PubChem site. What comes next? What is this type of syntax/programming called?

【Discussion】:

Tags: r web-scraping

【Solution 1】:

We can use their PUG REST API to pull the JSON data for each compound and link each CID to its compound title.

#libraries
library(jsonlite)
library(data.table)

#data
df <- data.frame("Compound" = c(10413, 176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(13, mean = 500000, sd = 600000))


#set to data.table
df <- as.data.table(df)

#set up progressbar
pb <- txtProgressBar(min = 0, max = nrow(df), style = 3)

#loop through df rows
for(i in seq_len(nrow(df))){
  #update progressbar
  setTxtProgressBar(pb, i)  
  
  #extract compound data 
  data <- fromJSON(readLines(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", df[i,]$Compound, "/JSON/?response_type=save&response_basename=compound_CID_", df[i,]$Compound)))
   
  #extract title
  compound_title <- data$Record$RecordTitle
  
  #add to df
  df[i, name := compound_title]
}
#close progressbar
close(pb)
head(df)

   Compound    Value                   name
1:    10413 898404.7 4-Hydroxybutanoic acid
2:      176 174150.1            Acetic Acid
3:    29096 516514.0        Dihydromyrcenol
4:     6341 499010.7             Ethylamine
5:     8914 783220.9             Nonan-1-ol
6:  5366204 217092.8  (Z)-1-Methoxy-2-buten
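
Since the question asked for a function, the loop body above could be wrapped up as get_CIDname(). This is only a minimal sketch, assuming the response keeps the Record$RecordTitle field used above. The helper names cid_url() and title_from_record() are made up here for illustration, and the URL drops the optional download parameters from the loop.

```r
# Build the PUG View URL for one CID. format() keeps large or round IDs
# (e.g. 100000) from being written in scientific notation.
cid_url <- function(cid) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
         format(cid, scientific = FALSE, trim = TRUE), "/JSON")
}

# Pull the title out of an already parsed record (a nested list);
# returns NA when the field is missing.
title_from_record <- function(record) {
  title <- record$Record$RecordTitle
  if (is.null(title)) NA_character_ else as.character(title)
}

# Look up the PubChem title for a single CID.
# Needs network access and the jsonlite package.
get_CIDname <- function(cid) {
  lines <- readLines(cid_url(cid), warn = FALSE, encoding = "UTF-8")
  title_from_record(jsonlite::fromJSON(paste(lines, collapse = "\n")))
}

# Example (not run here because it sends one request per row):
# df$name <- vapply(df$Compound, get_CIDname, character(1))
```

Keeping the URL builder and the title extraction separate from the request makes both easy to check without touching the network.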

If your dataset contains duplicate Compound values, it may be faster to loop over the unique compounds only, i.e. for(i in unique(df$Compound)), and adjust the code accordingly.
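
One way to do that, as a rough sketch: look up each distinct CID once and map the results back onto the rows with match(). The function name add_names_unique() and the stand-in fake_lookup() are hypothetical; in practice `lookup` would wrap the request from the loop above.

```r
# Look up each distinct CID once, then map the results back onto every row.
# `lookup` is any function that takes one CID and returns one name.
add_names_unique <- function(df, lookup) {
  ids <- unique(df$Compound)
  titles <- vapply(ids, lookup, character(1))
  df$name <- titles[match(df$Compound, ids)]
  df
}

# Offline demonstration with a stand-in lookup instead of a real request.
fake_lookup <- function(cid) paste0("compound_", cid)
demo <- data.frame(Compound = c(176, 962, 176))
demo <- add_names_unique(demo, fake_lookup)
```

With three rows but only two distinct CIDs, the lookup runs twice rather than three times.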

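Edit: in the description of the PUG REST API they state that PUG REST is not designed for very large volumes (millions) of requests. They ask that any script or application make no more than 5 requests per second, to avoid overloading the PubChem servers. See https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest for things to keep in mind.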
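
A simple way to stay under that limit is to pause between requests. The sketch below is one possible approach, not something taken from the PubChem documentation: fetch_throttled() is a hypothetical helper, and `fetch` stands in for whatever function performs the actual request.

```r
# Call `fetch` on each CID, pausing so that no more than `max_per_sec`
# requests are started per second.
fetch_throttled <- function(cids, fetch, max_per_sec = 5) {
  min_gap <- 1 / max_per_sec
  out <- vector("list", length(cids))
  last_start <- -Inf  # no previous request yet, so the first call never waits
  for (k in seq_along(cids)) {
    wait <- min_gap - (as.numeric(Sys.time()) - last_start)
    if (wait > 0) Sys.sleep(wait)
    last_start <- as.numeric(Sys.time())
    out[k] <- list(fetch(cids[[k]]))  # list() keeps NULL results in place
  }
  out
}
```

At 5 requests per second, roughly 5000 rows would take at least about 17 minutes, so looping over unique CIDs first is worth doing.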

【Discussion】:

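I keep getting this error: Error in parse_con(txt, bigint_as_char) : lexical error: invalid bytes in UTF8 string. ed Substances Act (21 U.S.C. �801 et seq.) schedule], and th (right here) ------^ Is this related to your edit? Is there a way to throttle this script so that it works for a large data frame like mine (~5000 rows)?
For which compound do you get the error? You can add print(i) inside the loop to check.
What does #create temporary folder if not present mean? The error occurs for CID 10413.
Same for me. I have adjusted the code (it now uses readLines in combination with fromJSON). Does this work for you?
Yes! Cheers @marrvd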
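
As an alternative to print(i) for finding the compound that fails, the request can be wrapped in tryCatch() so that the loop keeps going and records the error message per CID. This is only a sketch: fetch_safely() and the stand-in flaky() are hypothetical names, and `fetch` represents the real request function.

```r
# Run `fetch` for every CID; on failure, store NA as the name and keep the
# error message instead of stopping the loop.
fetch_safely <- function(cids, fetch) {
  name <- rep(NA_character_, length(cids))
  error <- rep(NA_character_, length(cids))
  for (k in seq_along(cids)) {
    result <- tryCatch(fetch(cids[[k]]), error = function(e) e)
    if (inherits(result, "error")) {
      error[k] <- conditionMessage(result)
    } else {
      name[k] <- as.character(result)
    }
  }
  data.frame(Compound = cids, name = name, error = error,
             stringsAsFactors = FALSE)
}

# Offline demonstration: a stand-in fetcher that fails for one CID.
flaky <- function(cid) {
  if (cid == 10413) stop("invalid bytes in UTF8 string")
  "ok"
}
res <- fetch_safely(c(176, 10413), flaky)
```

Rows with a non-missing error column can then be inspected or retried on their own.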

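【Solution 2】:

I have a somewhat related question. I am trying to loop over a list of 11,500 PubChem CIDs to retrieve the BioAssay Results table for each one, where available.
For example, for CID 2965821, this is the table I want to get. I only need the rows where the activity is "Active".

Following this script, I can only get the numbers of the active AIDs, but I cannot get the full table that includes the target names and so on.

Here is the code for a single compound: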

library(jsonlite)
library(data.table)

df <- data.frame("Compound" = 2965821)
df <- as.data.table(df)

#set up progressbar
pb <- txtProgressBar(min = 0, max = nrow(df), style = 3)

#loop through df rows
for(i in 1:nrow(df)){
  #update progressbar
  setTxtProgressBar(pb, i)  
  
  #extract active aids data 
  data <- fromJSON(readLines(paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/", df[i,]$Compound, "/aids/JSON/?aids_type=active")))

  #extract active aid numbers
  compound_active_aid_numbers <- data$InformationList$Information$AID
  
  #add to df
  df[i, name := compound_active_aid_numbers]
}
head(df)

How can I get the full data table in a format that I can manipulate further in R?

Thanks!

【Discussion】:
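
One possible route, offered as an untested sketch rather than a confirmed answer: PUG REST documents an assaysummary operation for compounds that can be requested as CSV. The column name "Activity Outcome" and the value "Active" are assumptions about that CSV and should be checked against a real response. The helper names assay_summary_url() and keep_active() are hypothetical.

```r
# Build the URL for the assay summary of one CID in CSV form.
assay_summary_url <- function(cid) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/",
         format(cid, scientific = FALSE, trim = TRUE), "/assaysummary/CSV")
}

# Keep only the rows whose outcome column equals `active`.
# The default column name is an assumption; pass the real one if it differs.
keep_active <- function(tbl, outcome_col = "Activity Outcome",
                        active = "Active") {
  stopifnot(outcome_col %in% names(tbl))
  outcome <- tbl[[outcome_col]]
  tbl[!is.na(outcome) & outcome == active, , drop = FALSE]
}

# Example (not run, needs network access):
# tbl <- read.csv(assay_summary_url(2965821), check.names = FALSE)
# active_rows <- keep_active(tbl)
```

Reading with check.names = FALSE keeps column names that contain spaces unchanged, so the filter can refer to them as they appear in the file.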