【Question Title】:Good way to scrape usernames from a web page
【Posted】:2012-12-20 11:10:16
【Question】:

I want to scrape usernames from YouTube comments, for example on this page:

http://www.youtube.com/all_comments?v=mIA0W69U2_Y

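I want to get every username/display name, for example "fedfields" and "mystik dread", together with the corresponding link (clicking "fedfields" takes you to that user's profile). I want to scrape them with an automated bash script, and I have the following questions:

1. My first approach was an automated script that downloads each page with wget and then runs a regular expression over it to pull out the names. But that means downloading the whole page, and each page is several MB, so downloading many pages takes a lot of space. Is there a better way?

2. There are many pages (the link above has 7). Is it possible to put them all on a single page?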

【Comments】:

    Tags: web scrape


    【Solution 1】:

    You can use HtmlAgilityPack in a C# application.

            HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
            IEnumerable<HtmlNode> userNames = doc.DocumentNode.Descendants("a").Where(
                d => d.Attributes.Contains("class") &&   
                d.Attributes["class"].Value.Contains("yt-user-name"));
    

    Useful info about parsing html with RegEx
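
The same class-based selection can be sketched in Python using only the standard library's `html.parser`, which avoids running a regular expression over raw HTML. This is a minimal sketch, not the answer's method: the `UserNameParser` class, the `extract_users` helper, and the `SAMPLE` fragment are made up for illustration, and the `yt-user-name` class is taken from the C# snippet above, so it may not match what the live page serves today.

```python
from html.parser import HTMLParser


class UserNameParser(HTMLParser):
    """Collect (display name, href) pairs from <a> tags whose class contains a marker."""

    def __init__(self, marker="yt-user-name"):
        super().__init__()
        self.marker = marker
        self.users = []      # finished (name, href) pairs
        self._href = None    # href of the matching <a> currently being read, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.marker in (attrs.get("class") or ""):
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.users.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_users(html):
    """Parse an HTML string and return the collected (name, href) pairs."""
    parser = UserNameParser()
    parser.feed(html)
    parser.close()
    return parser.users


# Hypothetical fragment, made up for illustration only.
SAMPLE = """
<div><a class="yt-uix-sessionlink yt-user-name" href="/user/fedfields">fedfields</a></div>
<div><a class="other" href="/x">ignore me</a></div>
<div><a class="yt-user-name" href="/user/mystikdread">mystik dread</a></div>
"""

if __name__ == "__main__":
    for name, href in extract_users(SAMPLE):
        print(name, href)
```

Because the parser works on a string, the page never has to be written to disk, which addresses the space concern in the question.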

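    I don't know whether YouTube serves its content with native gzip compression, but you can check with the WebRequest class. If it does, that will reduce traffic significantly.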

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.Method = WebRequestMethods.Http.Get;
    webRequest.KeepAlive = true;
    webRequest.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    webRequest.Headers.Add("Accept-Encoding", "gzip,deflate");
    HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse(); 
    MessageBox.Show(webResponse.ContentEncoding.ToString());
    

    You can then read the stream with HtmlAgilityPack and extract the usernames.
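
For readers scripting this outside .NET, the gzip check can be sketched in Python. This is an illustrative sketch under stated assumptions: `build_request`, `decode_body`, and `fetch` are hypothetical helper names, and whether a given server actually compresses its responses has to be checked against the `Content-Encoding` header it returns. `urllib.request` does not decompress automatically, so the sketch does it by hand.

```python
import gzip
import urllib.request
import zlib


def build_request(url):
    """Build a GET request that advertises gzip/deflate support."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})


def decode_body(raw, content_encoding):
    """Undo the Content-Encoding the server reported, if any."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw


def fetch(url):
    """Return (bytes received on the wire, decoded body) for url."""
    with urllib.request.urlopen(build_request(url)) as response:
        raw = response.read()
        return len(raw), decode_body(raw, response.headers.get("Content-Encoding"))


if __name__ == "__main__":
    # Offline demonstration with made-up, repetitive markup.
    page = b"<a class='yt-user-name'>fedfields</a>" * 1000
    packed = gzip.compress(page)
    assert decode_body(packed, "gzip") == page
    print(len(page), "bytes uncompressed,", len(packed), "bytes gzip-compressed")
```

Comparing the first value returned by `fetch` with the length of the decoded body shows how much traffic compression saved for a particular page.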

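    【Comments】:

    • What language are you planning to use? Either way, you now know how to reduce the download size.
    【Solution 2】:

    Use ScrapeGoat on Mashape to get all the usernames back as a JSON object :)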

    https://www.mashape.com/warting/scrapegoat/

    curl --include --request GET 'https://scrapegoat.p.mashape.com/?url=http%3A%2F%2Fwww.youtube.com%2Fall_comments%3Fv%3DmIA0W69U2_Y&selector=.yt-user-name' --header "X-Mashape-Authorization: <MASHAPE API KEY>"
    

    Result:

    {"message":"ok","payload":["whitehouse","Osambasucks2","Osambasucks2","Osambasucks2","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","omar barazanji","HigherPlanes","HigherPlanes","HigherPlanes","RamonaFromPomona","RamonaFromPomona","Osambasucks2","Osambasucks2","Osambasucks2","RamonaFromPomona","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","terminator360tm","Osambasucks2","Osambasucks2","Osambasucks2","Joe Lackey","Joe Lackey","Joe Lackey","ThaGenius101","ThaGenius101","ThaGenius101","Joe Lackey","Ed Patowski","Ed Patowski","Ed Patowski","toughdogyt","toughdogyt","toughdogyt","Osambasucks2","Osambasucks2","Osambasucks2","goodkarmaband","goodkarmaband","Martynas Valiukas","Martynas Valiukas","Martynas Valiukas","goodkarmaband","goodkarmaband","goodkarmaband","Martynas Valiukas","XRedstone688X","XRedstone688X","XRedstone688X","goodkarmaband","Trevor Jones","Trevor Jones","Trevor Jones","goodkarmaband","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","V V","leeman6417","leeman6417","leeman6417","Osambasucks2","Osambasucks2","Osambasucks2","leeman6417","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","liamdudeeee","liamdudeeee","liamdudeeee","sosocrazy1234","sosocrazy1234","sosocrazy1234","sosocrazy1234","leeman6417","Ed Patowski","Ed Patowski","Ed Patowski","mastershakelock","mastershakelock","mastershakelock","VGQgex","VGQgex","VGQgex","Osambasucks2","Osambasucks2","Osambasucks2","VGQgex","MindzEnt","MindzEnt","MindzEnt","William willie","William willie","William willie","William willie","William willie","William willie","bkdmd","bkdmd","bkdmd","Osambasucks2","Osambasucks2","Osambasucks2","bkdmd","Rafael Vargas","Rafael Vargas","Rafael Vargas","7even2wenty1","7even2wenty1","7even2wenty1","cashlessbread","cashlessbread","cashlessbread","base3798","base3798","base3798","Ed Patowski","Ed Patowski","Ed Patowski","base3798","john smith","john smith","john 
smith","Ed Patowski","Neftali Acosta","Neftali Acosta","Neftali Acosta","Ed Patowski","Ed Patowski","Ed Patowski","Neftali Acosta","john smith","john smith","john smith","Neftali Acosta","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Canal YooCheckTheFloow","Abandonbeast","Abandonbeast","Abandonbeast","Canal YooCheckTheFloow","Ironcitytony72","Ironcitytony72","Ironcitytony72","john smith","john smith","john smith","Ironcitytony72","Andrew Apelt","Andrew Apelt","Andrew Apelt","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","Andrew Apelt","Andrew Apelt","Andrew Apelt","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Andrew Apelt","incas94","incas94","incas94","Osambasucks2","William willie","William willie","William willie","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Osambasucks2","Osambasucks2","Osambasucks2","incas94","Andrew Apelt","Andrew Apelt","Osambasucks2","LawnMowerfromHell","LawnMowerfromHell","LawnMowerfromHell","Ironcitytony72","Osambasucks2","Osambasucks2","Osambasucks2","TheAndr3tzi","TheAndr3tzi","TheAndr3tzi","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","algett","algett","thumsupformyusername","thumsupformyusername","thumsupformyusername","thumsupformyusername","algett","ferkondenster","ferkondenster","ferkondenster","Christian Heinrich","Christian Heinrich","Christian Heinrich","erieejustice911","erieejustice911","erieejustice911","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","ferkondenster","ferkondenster","Seth Farsides","Seth Farsides","Seth Farsides","ferkondenster","Doky9889","Doky9889","Doky9889","ferkondenster","ferkondenster","ferkondenster","ferkondenster","Doky9889","sealrk19","sealrk19","sealrk19","wiljam12345","wiljam12345","wiljam12345","Dwayne Cole","Dwayne Cole","Dwayne 
Cole","Osambasucks2","Osambasucks2","Osambasucks2","Dwayne Cole","Jax Jr","Jax Jr","Jax Jr","Rafael Vargas","Rafael Vargas","Rafael Vargas","William willie","William willie","William willie","William willie","William willie","William willie","Gunnar Rowe","Gunnar Rowe","Gunnar Rowe","Rafael Vargas","Rafael Vargas","Rafael Vargas","Susan Porter","Susan Porter","Susan Porter","derp toth","derp toth","derp toth","MXNR16","nick62301","nick62301","nick62301","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","SeventhSun","SeventhSun","SeventhSun","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Rafael Vargas","Rafael Vargas","Rafael Vargas","senormierda","senormierda","senormierda","Rafael Vargas","chrisgilofficial","chrisgilofficial","chrisgilofficial","MXNR16","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","chrisgilofficial","chrisgilofficial","chrisgilofficial","Osambasucks2","Andrew Apelt","Andrew Apelt","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","aztecadog","aztecadog","aztecadog","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","Osambasucks2","Osambasucks2","Osambasucks2","chrisgilofficial","ThePhase20","ThePhase20","ThePhase20","ICE778","ICE778","ICE778","Sabrina Blacks","Sabrina Blacks","Sabrina Blacks","Darwin Gutierrez","Darwin Gutierrez","Darwin 
Gutierrez","lessonsfromryan","tooncrazy1","tooncrazy1","tooncrazy1","unbreackable3000","unbreackable3000","unbreackable3000","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","tooncrazy1","tooncrazy1","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","tooncrazy1","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","Americaunderduress","Americaunderduress","Americaunderduress","Barack Obama","Barack Obama","Barack Obama","Osambasucks2","Osambasucks2","Osambasucks2","Barack Obama","FoodStampBarry","FoodStampBarry","FoodStampBarry","Barack Obama","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","rondog ron","myviewsontheworld","myviewsontheworld","myviewsontheworld","SuperNikoYT","SuperNikoYT","SuperNikoYT","myviewsontheworld","Osambasucks2","Osambasucks2","Osambasucks2","myviewsontheworld","Americaunderduress","Americaunderduress","Americaunderduress","myviewsontheworld","Asuma741","Asuma741","Asuma741","RevolutionNewz","damonjo15","damonjo15","damonjo15","Osambasucks2","Osambasucks2","Osambasucks2","damonjo15","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","Osambasucks2"
,"Osambasucks2","Osambasucks2","Aries2012100","Aries2012100","Aries2012100","Aries2012100","Osambasucks2","tooncrazy1","tooncrazy1","tooncrazy1","Aries2012100","KH AK","KH AK","KH AK","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","Osambasucks2","Osambasucks2","Osambasucks2","Aries2012100","kangaroo3259","kangaroo3259","kangaroo3259","Aries2012100","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","youhan younen","youhan younen","tooncrazy1","tooncrazy1","tooncrazy1","youhan younen","Osambasucks2","Osambasucks2","Osambasucks2","youhan younen","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Stevejobsultimate2","Osambasucks2","Osambasucks2","Osambasucks2","Stevejobsultimate2","Rafael Vargas","Rafael Vargas","Rafael Vargas","drewpert0515","drewpert0515","drewpert0515","dv wfwefwe","TheAlienContactee","TheAlienContactee","TheAlienContactee","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Jordan Beckwith","Jordan Beckwith","Jordan Beckwith","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","gotwess","gotwess","gotwess","Michael Carrillo","Michael Carrillo","Michael Carrillo","Michael Carrillo","gotwess","Jawad Pullin","Jawad Pullin","Jawad Pullin","TreborHG93","tooncrazy1","tooncrazy1","tooncrazy1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","chickeneggchickeneg1","kinggrindhard","kinggrindhard","kinggrindhard","branoaas branoaas","branoaas branoaas","branoaas branoaas","Osambasucks2","Osambasucks2","Osambasucks2","branoaas branoaas","branoaas branoaas","branoaas branoaas","branoaas 
branoaas","Theindicud","Theindicud","Theindicud","eizieizz","eizieizz","eizieizz","Osambasucks2","Osambasucks2","Osambasucks2","eizieizz","1990Zuck","1990Zuck","1990Zuck","ArcoZakus","ArcoZakus","ArcoZakus","firemedic30ca","johnny grove","johnny grove","johnny grove","joost1v","joost1v","joost1v","Osambasucks2","Osambasucks2","Osambasucks2","joost1v","5sdk1","5sdk1","5sdk1","jeff brennan","jeff brennan","jeff brennan","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","jeff brennan","jeff brennan","jeff brennan","jeff brennan","Bo James","aztecadog","aztecadog","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","izizdropshotz","izizdropshotz","izizdropshotz","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","izizdropshotz","Paul Pascalau","Paul Pascalau","Paul Pascalau","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg Cimera","Greg Cimera","Greg Cimera","Greg Cimera","tooncrazy1","tooncrazy1","tooncrazy1","tooncrazy1","Greg 
Cimera","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","aztecadog","aztecadog","aztecadog","Osambasucks2","izizdropshotz","izizdropshotz","izizdropshotz","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Zajac Staszek","Zajac Staszek","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Osambasucks2","Osambasucks2","Osambasucks2","Zajac Staszek","Ed Patowski","Ed Patowski","Ed Patowski","Zajac Staszek","aztecadog","aztecadog","aztecadog","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","aztecadog","gotwess","gotwess","gotwess","aztecadog","JeremyTheMoose","JeremyTheMoose","JeremyTheMoose","5sdk1","5sdk1","5sdk1","fordbronco1991","fordbronco1991","fordbronco1991","andy kerver","andy kerver","andy kerver","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","Omarimage","justin lionti","justin lionti","justin lionti","Omarimage","Butheadbros2","Butheadbros2","Butheadbros2","Omarimage","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","justin lionti","justin lionti","moonbeamrider1","moonbeamrider1","moonbeamrider1","moonbeamrider1","justin lionti","fordbronco1991","fordbronco1991","fordbronco1991","pellenyberg","pellenyberg","pellenyberg","Son Goku","Son Goku","Son 
Goku","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","5ilv3rbvll","Butheadbros2","Butheadbros2","Butheadbros2","Butheadbros2","5ilv3rbvll","fisch kopf","fisch kopf","fisch kopf","andrew baker","andrew baker","andrew baker","FVCKDA POPO","FVCKDA POPO","FVCKDA POPO","MrChessmans","MrChessmans","MrChessmans","BryndisiDali","Brazzer man","Brazzer man","Brazzer man","Jack Thompson","ecw141685","ecw141685","ecw141685","Osambasucks2","Osambasucks2","Osambasucks2","ecw141685","lps24evelyn","lps24evelyn","lps24evelyn","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","erieejustice911","Keepskatin","Keepskatin","Keepskatin","erieejustice911","V V","V V","V V","Keepskatin","Abrahan Peraza","Abrahan Peraza","Abrahan Peraza","lexyloveful","Zratedguns","Zratedguns","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Zratedguns","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","MadNoys1","Joseph Pal","Joseph Pal","Joseph Pal","Joseph Pal","MadNoys1","MadNoys1","MadNoys1","MadNoys1","bear cat","laurynas stirbys","laurynas stirbys","laurynas stirbys","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","newjerusalem newtestament","Keepskatin","Keepskatin","Keepskatin","newjerusalem newtestament","amerilstones","amerilstones","amerilstones","Keepskatin","Noah Neo","Noah Neo","Noah Neo","charmander4533","charmander4533","charmander4533","Noah Neo","Noah Neo","Noah Neo","Noah Neo","charmander4533","Noah Neo","Noah Neo","Noah 
Neo","charmander4533","Osambasucks2","Osambasucks2","Osambasucks2","Noah Neo","George Washington","George Washington","George Washington","charmander4533","izizdropshotz","izizdropshotz","izizdropshotz","charmander4533","Wavanova","Wavanova","Wavanova","charmander4533","wisestfoolalive","wisestfoolalive","wisestfoolalive","Noah Neo","Noah Neo","Noah Neo","Noah Neo","wisestfoolalive","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","colin dooley","Silme037","Silme037","Silme037","colin dooley","Keepskatin","Keepskatin","Keepskatin","colin dooley","princelord55","princelord55","princelord55","Osambasucks2","Osambasucks2","Osambasucks2","princelord55","DriadonRapShow","DriadonRapShow","DriadonRapShow","eddrum100","eddrum100","eddrum100","Ryan S","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","Ryan S","Ryan S","Ryan S","Ryan S","Ryan S","eddrum100","eddrum100","eddrum100","eddrum100","RatedMForModz","RatedMForModz","RatedMForModz","alban97","alban97","alban97","RatedMForModz","Alex Bannon","Alex Bannon","Alex Bannon","alban97","alban97","alban97","alban97","Alex Bannon","james aaron","james aaron","james aaron","RatedMForModz","Ryan S","Ryan S","Ryan S","Dylan N","killllshot","killllshot","killllshot","Saadia Khan","Saadia Khan","talithatf17","talithatf17","talithatf17","amerilstones","amerilstones","amerilstones","talithatf17","BENGHAZIneverForget","BENGHAZIneverForget","BENGHAZIneverForget","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","Alexander Sigsworth","Alexander Sigsworth","Alexander 
Sigsworth","supergrover6868","Zratedguns","Zratedguns","Zratedguns","supergrover6868","Keepskatin","Keepskatin","Keepskatin","Zratedguns","Butheadbros2","Butheadbros2","Butheadbros2","Zratedguns","Omegeist","Omegeist","Omegeist","supergrover6868","2Dmensions","2Dmensions","2Dmensions","talithatf17","talithatf17","talithatf17","supergrover6868","supergrover6868","supergrover6868","talithatf17","newjerusalem newtestament","newjerusalem newtestament","newjerusalem newtestament","supergrover6868","VGQgex","VGQgex","VGQgex","talithatf17","talithatf17","talithatf17","talithatf17","Mandragara","Mandragara","Mandragara","talithatf17","deathzbo","deathzbo","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","Mandragara","Mandragara","Mandragara","deathzbo","deathzbo","deathzbo","deathzbo","Mandragara","eddrum100","eddrum100","eddrum100","Mandragara","Mandragara","Mandragara","Mandragara","eddrum100","Unit01232","Unit01232","Unit01232","supergrover6868","supergrover6868","supergrover6868","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","supergrover6868","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","Unit01232","Unit01232","Unit01232","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","Unit01232","eddrum100","eddrum100","eddrum100","senormierda","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","eddrum100","eddrum100","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","eddrum100","Kevin Koala","Kevin Koala","Kevin Koala","senormierda","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","GGRSC","GGRSC","GGRSC","GGRSC","eddrum100","michael smith","michael smith","michael 
smith","GGRSC","GGRSC","GGRSC","truthinvideos","supergrover6868","supergrover6868","supergrover6868","GGRSC","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","supergrover6868","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","eddrum100","eddrum100","eddrum100","eddrum100","bobothecreepyclown","eddrum100","eddrum100","eddrum100","bobothecreepyclown","supergrover6868","supergrover6868","supergrover6868","bobothecreepyclown","eddrum100","eddrum100","eddrum100","supergrover6868","supergrover6868","supergrover6868","supergrover6868","eddrum100","eddrum100","eddrum100","eddrum100","supergrover6868","bobothecreepyclown","bobothecreepyclown","bobothecreepyclown","willypdyer","willypdyer","willypdyer","Osambasucks2","Osambasucks2","Osambasucks2","willypdyer","spairtain","spairtain","sp
airtain","DigitalAcceptance","DigitalAcceptance","DigitalAcceptance","ElRancholo2","Osambasucks2","Osambasucks2","Osambasucks2","DigitalAcceptance","ElRancholo2","ElRancholo2","ElRancholo2","DigitalAcceptance","Osambasucks2","Osambasucks2","Osambasucks2","ElRancholo2","Mark Tse","Mark Tse","Mark Tse","DigitalAcceptance","Mark Tse","Mark Tse","Mark Tse","Mark Tse","The Best","The Best","The Best","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","eddrum100","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","creativeengineer","creativeengineer","creativeengineer","Ed Patowski","Ed Patowski","Ed Patowski","Ed Patowski","creativeengineer","eddrum100","eddrum100","eddrum100","creativeengineer","creativeengineer","creativeengineer","creativeengineer","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","creativeengineer","creativeengineer","creativeengineer","supergrover6868","supergrover6868","supergrover6868","creativeengineer","comicozy87","comicozy87","comicozy87","Raven Gomez","turbidhat","turbidhat","turbidhat","Daracon1010","Daracon1010","Daracon1010","Daracon1010","turbidhat","turbidhat","turbidhat","Daracon1010","VGQgex","VGQgex","VGQgex","Daracon1010","Daracon1010","Daracon1010","Daracon1010","VGQgex","WeThePeopleNoNWO","WeThePeopleNoNWO","WeThePeopleNoNWO","amerilstones","zmanthecool","zmanthecool","zmanthecool","metal220","supergrover6868","supergrover6868","supergrover6868","1974wolfman","1974wolfman","1974wolfman","William willie","William willie","William willie","1974wolfman","1974wolfman","1974wolfman","1974wolfman","William willie","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Barskor1","Kanwar Judge","Kanwar Judge","Kanwar 
Judge","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","abu bakr","abu bakr","abu bakr","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","eddrum100","eddrum100","eddrum100","Obamalies100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","eddrum100","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","Obamalies100","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","getsumdonginurmouth","amerilstones","amerilstones","amerilstones","amerilstones","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","getsumdonginurmouth","Obamalies100","Obamalies100","Obamalies100","eddrum100","ThaYayo","ThaYayo","ThaYayo","William willie","chrisn365","chrisn365","chrisn365","Eli Jackson","Eli Jackson","Eli Jackson","Jboulos12","Frank Adams","Frank Adams","Frank 
Adams","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","eddrum100","eddrum100","eddrum100","amerilstones","amerilstones","amerilstones","amerilstones","eddrum100","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","supergrover6868","supergrover6868","supergrover6868","amerilstones","amerilstones","amerilstones","amerilstones","supergrover6868","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","amerilstones","amerilstones","amerilstones","amerilstones","Osambasucks2","LiamborninDC","LiamborninDC","LiamborninDC","Osambasucks2","Osambasucks2","
Osambasucks2","Osambasucks2","LiamborninDC","Osambasucks2","Osambasucks2","Osambasucks2","William willie","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","Osambasucks2","killllshot","killllshot","killllshot","killllshot","Osambasucks2","supergrover6868","supergrover6868","supergrover6868","killllshot","Osambasucks2","Osambasucks2","Osambasucks2","killllshot"],"status":200}
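
The payload above lists many names more than once. A minimal Python sketch for collapsing it into unique names while keeping first-seen order follows; `unique_names` is a hypothetical helper, and the only assumption about the response is the `payload` key visible in the output above.

```python
import json


def unique_names(response_text):
    """Return the payload's names with duplicates removed, first-seen order kept."""
    payload = json.loads(response_text)["payload"]
    # dict preserves insertion order, so fromkeys() de-duplicates without sorting.
    return list(dict.fromkeys(payload))


if __name__ == "__main__":
    # Shortened stand-in for the response above.
    sample = (
        '{"message":"ok","payload":["whitehouse","Osambasucks2",'
        '"Osambasucks2","omar barazanji","Osambasucks2"],"status":200}'
    )
    print(unique_names(sample))
```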
    

    【Comments】:

    • Yes, I am the author of the API :)
    【Solution 3】:

    Do it like this:

    import re
    import sys
    import time
    import urllib2
    
    html = True
    
    argv_list = sys.argv
    if len(argv_list) == 2:
        vid = argv_list[1]
    else:
        vid = "mIA0W69U2_Y"
    
    regex = re.compile("<span class=\"author.*?<a href=\"(.*?)\".*? dir=\"ltr\">(.*?)</a>", re.DOTALL | re.UNICODE | re.IGNORECASE)
    
    index = 1
    author_lists = []
    t1 = time.time()
    print "######################### Start #########################"
    
    while 1:
        url = "http://www.youtube.com/watch_ajax?action_get_comments=1&v="+vid+"&commenttype=everything&source=w&page_size=500&p="+str(index)+"&format=XML"
        print "Retrieving page "+str(index)+": ", url
        o = urllib2.urlopen(url)
        r = o.read()
        elements = regex.findall(r)
        author_list = []
        for x, y in elements:
    
            if x.startswith("http://") or x.startswith("https://"):
                continue
            xx = "".join(["http://www.youtube.com", x])
            href = xx.strip()
            #print href
    
    
            if "</span>" not in y :
                uname = y.strip()
            else:
                uname = y.split("</span>")[0].strip()
    
            if uname.startswith("<a"):
                continue
    
            if not uname or not href:
                continue
    
            if html:
                #1 output html
                author = "".join(["<a href=\"", href, "\">", uname, "</a>"])
            else:
                #2 output txt
                author = " ".join([uname, href])
    
            author_list.append(author)
    
        t = "%02d:%02d:%02d" % reduce(lambda ll,b : divmod(ll[0],b) + ll[1:], [(time.time()-t1,),60,60])
        print "".join(["Time passed: ", t])
        if not author_list:
            break
        else:
            author_lists.extend(author_list)
        index+=1
        #break #uncomment it if you only want to test one page
    
    print "######################### Finished #########################"
    print "Total comments: ", len(author_lists)
    if author_lists:
        author_lists.sort()
        last = author_lists[-1]
        for i in range(len(author_lists)-2, -1, -1):
            if last == author_lists[i]:
                del author_lists[i]
            else:
                last = author_lists[i]
        if html:
            authors = "<br>".join(author_lists)
            authors = "".join(["<html><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><body>", authors, "</body></html>"])
            fname = vid+".html"
        else:
            authors = "\n".join(author_lists)
            fname = vid+".txt"
    
        #print "Authors: ", authors
        print "Total commenters: ", len(author_lists)
    
    
    
        oo = open(fname, "w")
        oo.write(authors)
        oo.close()
    print "######################### Exit #########################"
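
The script above targets Python 2. Its two core ideas, extracting authors with the regular expression and requesting page after page until one comes back empty, can be sketched in Python 3 as follows. This is a sketch under assumptions: `authors_on_page` and `collect_authors` are hypothetical names, and `fetch_page` is an injected callable so that the sketch does not depend on the `watch_ajax` URL, which is not assumed to still be available. The sample pages are made up.

```python
import re

# Same pattern as in the script above.
AUTHOR_RE = re.compile(
    r'<span class="author.*?<a href="(.*?)".*? dir="ltr">(.*?)</a>',
    re.DOTALL | re.IGNORECASE,
)


def authors_on_page(markup):
    """Return (name, absolute profile URL) pairs found in one page of markup."""
    found = []
    for href, name in AUTHOR_RE.findall(markup):
        name = name.split("</span>")[0].strip()
        # Skip absolute links and empty or malformed names, as the script does.
        if href.startswith(("http://", "https://")) or not name or name.startswith("<a"):
            continue
        found.append((name, "http://www.youtube.com" + href.strip()))
    return found


def collect_authors(fetch_page):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no authors."""
    seen = set()
    index = 1
    while True:
        page_authors = authors_on_page(fetch_page(index))
        if not page_authors:
            break
        seen.update(page_authors)  # the set removes repeated commenters
        index += 1
    return sorted(seen)


if __name__ == "__main__":
    # Made-up pages standing in for real responses.
    entry = ('<span class="author "><a href="/user/fedfields" '
             'class="yt-user-name" dir="ltr">fedfields</a></span>')
    pages = {1: entry, 2: entry}
    for name, link in collect_authors(lambda i: pages.get(i, "")):
        print(name, link)
```

Collecting into a set and sorting once at the end replaces the sort-then-delete loop, and merging all pages in memory yields a single combined list, which is what the second question asks for.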
    

    Example txt output:

    Example HTML output:

    【Comments】:

      【Solution 4】:

      C# can also help in this way (although HAP and WebRequest are better):

          // Placeholder for the optional by-ref arguments of Navigate
          object o = null;
          SHDocVw.InternetExplorer ie = new SHDocVw.InternetExplorerClass();
          WebBrowser wb = (WebBrowser)ie;
          wb.Visible = true;
          // Do anything else with the window here that you wish
          wb.Navigate("https://adwords.google.co.uk/um/Logout", ref o, ref o, ref o, ref o);
          // Wait until the page has finished loading
          while (wb.Busy) { Thread.Sleep(100); }
          HTMLDocument document = ((HTMLDocument)wb.Document);
          IHTMLElement element = document.getElementById("Email");
          HTMLInputElementClass email = (HTMLInputElementClass)element;
          email.value = "testtestingtton@gmail.com";
          email = null;
          element = document.getElementById("Passwd");
          HTMLInputElementClass pass = (HTMLInputElementClass)element;
          pass.value = "pass";
          pass = null;
          element = document.getElementById("signIn");
          HTMLInputElementClass subm = (HTMLInputElementClass)element;
          subm.click();
          subm = null;
      

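      【Comments】:

        【Solution 5】:

        Write an RSS feed for the name field and any other fields you want to extract, set up a crawler with an automation plugin, and follow the steps in How to extract the data from multiple website

        【Comments】:

          【Solution 6】:

          Here is a simple solution using Ruby with nokogiri and open-uri: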

          require 'nokogiri'
          require 'open-uri'
          url="https://www.youtube.com/all_comments?v=mIA0W69U2_Y"
          dom=Nokogiri::HTML(URI.open(url))
          dom.xpath("//div[@class='comment-entry']").each do |comment|
            username=comment.xpath(".//a[contains(@class,'user-name')]").first
            username=username.content.chomp.strip if username
            profilelink=comment.xpath(".//a[contains(@class,'user-name')]/@href").first
            profilelink=profilelink.content.chomp.strip if profilelink
            profilelink="http://www.youtube.com"+profilelink if profilelink && profilelink.match(/^\//)
            puts "#{username} #{profilelink}" if username and profilelink
          end
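
The link-normalization step in the snippet above (prefixing relative profile links with the site's base URL) can be sketched in Python with `urllib.parse.urljoin`, which also leaves already-absolute links untouched. `absolute_profile_link` is a hypothetical helper and `BASE` simply mirrors the prefix used in the Ruby code.

```python
from urllib.parse import urljoin

BASE = "http://www.youtube.com"


def absolute_profile_link(href, base=BASE):
    """Turn a relative profile href into an absolute URL; keep absolute ones as-is."""
    if not href:
        return None  # nothing to normalize when the link is missing
    return urljoin(base, href.strip())


if __name__ == "__main__":
    print(absolute_profile_link("/user/fedfields"))
```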
          

          For more information, see How to extract data easily from multiple websites

          【Comments】:
