Creating a new column based on a partial match in a second data frame

【Question title】: creating new column based on partial match in 2nd dataframe
【Posted】: 2014-03-14 05:44:55
【Question】:

I have two data frames, top3df:

http://dpaste.com/hold/1714336/

and qw:

qw <- structure(list(id = structure(1:25, .Label = c("w01", "w02", "w03", "w04", "w05", "w06", "w07", "w08", "w09", "w10", "w11", "w12", "w13", "w14", "w15", "w16", "w17", "w18", "w19", "w20", "w21", "w22", "w23", "w24", "w25"), class = "factor"), link = structure(c(5L, 4L, 19L, 2L, 18L, 24L, 20L, 23L, 7L, 12L, 14L, 15L, 21L, 17L, 10L, 13L, 16L, 25L, 22L, 6L, 11L, 3L, 1L, 9L, 8L), .Label = c("http://gezondheid.blog.nl/overgewicht/2008/06/07/dik-zijn-heeft-veel-nadelen", "http://home.deds.nl/~obesitasinfo.nl/", "http://mens-en-gezondheid.infonu.nl/ziekten/18079-risicos-van-overgewicht-en-de-gevolgen-van-obesitas.html", "http://nl.wikipedia.org/wiki/Obesitas", "http://overgewicht.pilliewillie.nl/obesitas/behandeling.overgewicht.3.php", "http://www.afslankacademie.nl/page/2634/overgewicht.html", "http://www.afvallen-voeding.nl/", "http://www.erfelijkheid.nl/node/325", "http://www.gewoongezond.nl/", "http://www.gezondafvallen.net/", "http://www.gezonderafvallen.nl/page/938/overgewicht-als-gevolg-van-de-evolutie.html", "http://www.gr.nl/nl/adviezen/overgewicht-en-obesitas", "http://www.hely.net/oorzaken.html", "http://www.kiloafvallen.nl/", "http://www.nisb.nl/kennisplein-sport-bewegen/dossiers/bewegen-en-overgewicht/oorzaken-obesitas.html", "http://www.novarum.nl/eetproblemen/obesitas/signalen-en-gevolgen", "http://www.obesitas.azdamiaan.be/nl/index.aspx?n=280", "http://www.obesitaskliniek.nl/", "http://www.obesitasvereniging.nl/", "http://www.sagbmaagband.nl/minder-gewicht/morbideobesitas.html", "http://www.tipsbijafvallen.nl/", "http://www.tweestedenziekenhuis.nl/script/Template_SubsubMenu.asp?PageID=1144&SSMID=1247", "http://www.vgz.nl/zorg-en-gezondheid/ziektes-en-aandoeningen/obesitas", "http://www.volkskrant.nl/vk/nl/2672/Wetenschap-Gezondheid/article/detail/3143483/2012/01/30/Balanstop-in-Madurodam-mueslireep-tegen-obesitas.dhtml", "http://www.zuivelengezondheid.nl/?pageID=332"), class = "factor"), quality = c(3.875, 6.25, 7.875, 3.5, 6, 4.75, 3.625, 4.125, 
2.375, 6, 2.125, 6.5, 2.5, 5.375, 2.5, 6.625, 5.125, 5, 6.875, 5.75, 6.125, 3.25, 1.75, 2.5, 7.375), q1 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L), q2 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), q3 = c(0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L)), .Names = c("id", "link", "quality", "q1", "q2", "q3"), class = "data.frame", row.names = c(NA, -25L))

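Using top3df$id = qw$id[match(top3df$url,qw$link)] I can look up exact matches, but this also produces NA for every URL that has no exact counterpart. How can I look up links that match only partially?

I need to match on the first part of the link (up to and including the top-level domain, but excluding everything after the TLD). For example, http://www.hely.net/oorzaken.html from qw should match http://www.hely.net/gevolgen.html from top3df.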

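【Question discussion】:

partial <- function(txt) { sub("http://(.*?)/.*", "\\1", txt) }; qw$id[match(partial(top3df$url), partial(qw$link))] - like this?
Extract the TLD and then match? See also stackoverflow.com/questions/17285439/…
@EDi It needs to match not only the TLD, but also the part before the TLD.
@lukeA Your solution looks like exactly what I was looking for. If you post it as an answer, I will accept it.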

Tags: regex r

【Solution 1】:

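# Reduce each URL to its host: the text between "http://" and the next "/"
partial <- function(txt)
  sub("http://(.*?)/.*", "\\1", txt)

# For each URL in top3df, find the qw row with the same host and return its id
qw$id[match(partial(top3df$url), partial(qw$link))]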
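
Since top3df is only linked in the question and not reproduced here, the following is a minimal sketch of how this answer behaves on small stand-in data. The data frames qw_demo and top3_demo and the example.org URL are made up for illustration; only partial() is taken from the answer.

```r
# Stand-in data (illustrative only, not the real top3df / qw)
qw_demo <- data.frame(
  id   = c("w16", "w02"),
  link = c("http://www.hely.net/oorzaken.html",
           "http://nl.wikipedia.org/wiki/Obesitas"),
  stringsAsFactors = FALSE
)
top3_demo <- data.frame(
  url = c("http://www.hely.net/gevolgen.html",
          "http://nl.wikipedia.org/wiki/Overgewicht",
          "http://example.org/page.html"),
  stringsAsFactors = FALSE
)

# Keep only the host: the text between "http://" and the next "/"
partial <- function(txt) sub("http://(.*?)/.*", "\\1", txt)

hosts <- partial(top3_demo$url)
# hosts: "www.hely.net" "nl.wikipedia.org" "example.org"

# Look up each host among the hosts of qw_demo and store the matching id
top3_demo$id <- qw_demo$id[match(hosts, partial(qw_demo$link))]
top3_demo$id
# "w16" "w02" NA   (no row of qw_demo has the host example.org)
```

Note that match() returns the position of the first hit only, so if several rows of qw share a host, the id of the first such row is used. A URL with no "/" after the host does not match the pattern at all, and sub() then returns it unchanged.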

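【Discussion】:

【Solution 2】:

As @lukeA and @EDi mentioned, you can use a regular expression to extract the part of the URL up to and including the TLD and match on that part, for example: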

top3df$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", top3df$url)
qw$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", qw$link)

match(top3df$tld, qw$tld)
# [1] 22 11 25  5 14 16 18  2 16 25
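
The key built above keeps the scheme ("\\1"), so an http:// and an https:// link to the same site would get different keys. If that matters, one possible variant is sketched below. host_of is a hypothetical helper name, not part of either answer, and the sketch makes no attempt to handle ports or user info in the URL.

```r
# Hypothetical helper: reduce a URL to its bare host name
host_of <- function(url) {
  url <- as.character(url)                # factor columns become plain strings
  url <- sub("^[A-Za-z]+://", "", url)    # drop a leading scheme such as "http://"
  sub("[/?#].*$", "", url)                # drop path, query string and fragment
}

urls <- c("http://www.hely.net/oorzaken.html",
          "https://www.hely.net/gevolgen.html",
          "http://www.gewoongezond.nl",
          "www.gewoongezond.nl/")
host_of(urls)
# "www.hely.net" "www.hely.net" "www.gewoongezond.nl" "www.gewoongezond.nl"
```

With such a helper, the new column could be filled the same way as before: top3df$id <- qw$id[match(host_of(top3df$url), host_of(qw$link))].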

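【Discussion】:

It needs to match not only the TLD, but also the part before the TLD.
OK. Since I don't have much experience with regular expressions, I may have misunderstood your answer.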