【Question title】: Creating a new column based on a partial match in a second data frame
【Posted】: 2014-03-14 05:44:55
【Question description】:

I have two data frames, top3df:

http://dpaste.com/hold/1714336/

and qw:

qw <- structure(list(id = structure(1:25, .Label = c("w01", "w02", "w03", "w04", "w05", "w06", "w07", "w08", "w09", "w10", "w11", "w12", "w13", "w14", "w15", "w16", "w17", "w18", "w19", "w20", "w21", "w22", "w23", "w24", "w25"), class = "factor"), link = structure(c(5L, 4L, 19L, 2L, 18L, 24L, 20L, 23L, 7L, 12L, 14L, 15L, 21L, 17L, 10L, 13L, 16L, 25L, 22L, 6L, 11L, 3L, 1L, 9L, 8L), .Label = c("http://gezondheid.blog.nl/overgewicht/2008/06/07/dik-zijn-heeft-veel-nadelen", "http://home.deds.nl/~obesitasinfo.nl/", "http://mens-en-gezondheid.infonu.nl/ziekten/18079-risicos-van-overgewicht-en-de-gevolgen-van-obesitas.html", "http://nl.wikipedia.org/wiki/Obesitas", "http://overgewicht.pilliewillie.nl/obesitas/behandeling.overgewicht.3.php", "http://www.afslankacademie.nl/page/2634/overgewicht.html", "http://www.afvallen-voeding.nl/", "http://www.erfelijkheid.nl/node/325", "http://www.gewoongezond.nl/", "http://www.gezondafvallen.net/", "http://www.gezonderafvallen.nl/page/938/overgewicht-als-gevolg-van-de-evolutie.html", "http://www.gr.nl/nl/adviezen/overgewicht-en-obesitas", "http://www.hely.net/oorzaken.html", "http://www.kiloafvallen.nl/", "http://www.nisb.nl/kennisplein-sport-bewegen/dossiers/bewegen-en-overgewicht/oorzaken-obesitas.html", "http://www.novarum.nl/eetproblemen/obesitas/signalen-en-gevolgen", "http://www.obesitas.azdamiaan.be/nl/index.aspx?n=280", "http://www.obesitaskliniek.nl/", "http://www.obesitasvereniging.nl/", "http://www.sagbmaagband.nl/minder-gewicht/morbideobesitas.html", "http://www.tipsbijafvallen.nl/", "http://www.tweestedenziekenhuis.nl/script/Template_SubsubMenu.asp?PageID=1144&SSMID=1247", "http://www.vgz.nl/zorg-en-gezondheid/ziektes-en-aandoeningen/obesitas", "http://www.volkskrant.nl/vk/nl/2672/Wetenschap-Gezondheid/article/detail/3143483/2012/01/30/Balanstop-in-Madurodam-mueslireep-tegen-obesitas.dhtml", "http://www.zuivelengezondheid.nl/?pageID=332"), class = "factor"), quality = c(3.875, 6.25, 7.875, 3.5, 6, 4.75, 3.625, 4.125, 
2.375, 6, 2.125, 6.5, 2.5, 5.375, 2.5, 6.625, 5.125, 5, 6.875, 5.75, 6.125, 3.25, 1.75, 2.5, 7.375), q1 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L), q2 = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L), q3 = c(0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L)), .Names = c("id", "link", "quality", "q1", "q2", "q3"), class = "data.frame", row.names = c(NA, -25L))

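Using top3df$id = qw$id[match(top3df$url,qw$link)] I can look up an exact match, but this also produces NAs for URLs that have no exact counterpart. How can I look up links that match only partially?

I need to match on the first part of the link (up to and including the top-level domain, but excluding everything after the TLD). For example, http://www.hely.net/oorzaken.html from qw should match http://www.hely.net/gevolgen.html from top3df.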

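【Comments】:

  • partial <- function(txt) { sub("http://(.*?)/.*", "\\1", txt) }; qw$id[match(partial(top3df$url), partial(qw$link))] - like this?
  • Extract the TLD and then match? See also stackoverflow.com/questions/17285439/…
  • @EDi It needs to match not only the TLD, but also the part before the TLD.
  • @lukeA It looks like your solution is exactly what I was looking for. If you post it as an answer, I will accept it.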

Tags: regex r


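【Solution 1】:
# Reduce a URL to its host: the text between "http://" and the next "/"
partial <- function(txt)
  sub("http://(.*?)/.*", "\\1", txt)

# Look up each top3df host among the qw hosts and return the matching id
qw$id[match(partial(top3df$url), partial(qw$link))]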
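The same idea can be sketched so that it also accepts https links and stores the result in a new column. The data frames `qw_toy` and `top3_toy` and the helper `host()` are hypothetical stand-ins made up for illustration (the real top3df is only available through the paste link), so treat this as a sketch under those assumptions rather than the answer's own code.

```r
# Hypothetical toy data standing in for qw and top3df
qw_toy <- data.frame(
  id   = c("w16", "w02"),
  link = c("http://www.hely.net/oorzaken.html",
           "http://nl.wikipedia.org/wiki/Obesitas"),
  stringsAsFactors = FALSE
)
top3_toy <- data.frame(
  url = c("http://www.hely.net/gevolgen.html",
          "https://nl.wikipedia.org/wiki/Overgewicht",
          "http://www.example.org/"),
  stringsAsFactors = FALSE
)

# host() is an illustrative helper (not from the original answer):
# strip an optional http:// or https:// prefix, then drop everything
# from the first "/" onward, leaving only the host name
host <- function(txt) {
  txt <- sub("^https?://", "", as.character(txt))
  sub("/.*$", "", txt)
}

# Match on the host only and write the ids into a new column
top3_toy$id <- qw_toy$id[match(host(top3_toy$url), host(qw_toy$link))]
top3_toy$id
# [1] "w16" "w02" NA
```

Rows whose host does not occur in the lookup table still get NA, which makes it easy to spot URLs that have no counterpart at all.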

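【Comments】:

    【Solution 2】:

    As @lukeA and @EDi mentioned, you can use a regular expression to extract the part of the URL up to and including the TLD, and then match on that part, for example: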

    top3df$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", top3df$url)
    qw$tld <- sub("(http[s]?://)?([^/]+)/.*$", "\\1\\2", qw$link)
    
    match(top3df$tld, qw$tld)
    # [1] 22 11 25  5 14 16 18  2 16 25 
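Note that match() returns row positions in qw, not ids. A minimal sketch of the remaining step, indexing qw$id with those positions, might look like the following. The toy data frames `qw_toy` and `top3_toy` are hypothetical and only mirror the column names from the question. Because the pattern keeps the scheme via `\\1`, an http link and an https link to the same host would not match under this approach.

```r
# Hypothetical toy data; column names mirror the question
qw_toy <- data.frame(
  id   = c("w01", "w02", "w03"),
  link = c("http://www.gewoongezond.nl/",
           "http://www.hely.net/oorzaken.html",
           "http://www.erfelijkheid.nl/node/325"),
  stringsAsFactors = FALSE
)
top3_toy <- data.frame(
  url = c("http://www.erfelijkheid.nl/node/12",
          "http://www.hely.net/gevolgen.html",
          "http://www.unknown-site.nl/page"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: keep scheme and host, drop the path
pattern <- "(http[s]?://)?([^/]+)/.*$"
top3_toy$tld <- sub(pattern, "\\1\\2", top3_toy$url)
qw_toy$tld   <- sub(pattern, "\\1\\2", qw_toy$link)

# idx holds row positions in qw_toy, with NA where nothing matched
idx <- match(top3_toy$tld, qw_toy$tld)

# Turn the positions into ids and collect the URLs without a match
top3_toy$id <- qw_toy$id[idx]
unmatched   <- top3_toy$url[is.na(idx)]
```

Here `unmatched` collects the URLs whose host is absent from the lookup table, which is a quick way to check whether remaining NAs are genuine misses.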
    

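    【Comments】:

    • It needs to match not only the TLD, but also the part before the TLD.
    • OK. Since I don't have much experience with regular expressions, I may have misunderstood your answer.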