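Chapter 9 describes a range of techniques for analyzing streams of text: part-of-speech tagging, topic extraction, and classification of text data.

In this chapter, you will learn the following recipes:
·Reading raw text from the Web
·Tokenizing and normalizing
·Identifying parts of speech, handling n-grams, and recognizing named entities
·Identifying the topic of an article
·Identifying sentence structure
·Classifying movies based on their reviews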

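9.1 Introduction
Building models from structured data collected in a controlled environment (as in the previous chapter) is relatively straightforward. In the real world, however, we rarely get structured data. This is especially true when we try to understand people's feedback or analyze newspaper articles.
NLP (Natural Language Processing) is a discipline that spans computer science, statistics, and linguistics. Its goal is to process human language (I deliberately avoid the word "understand") and to extract features for modeling. With NLP concepts we can, among other tasks, find the most frequent words in a text to get a rough idea of its topic, identify the names of people and places, find the subject and object of a sentence, or analyze the sentiment of someone's feedback.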
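As a rough illustration of the first task mentioned above (finding the most frequent words), here is a minimal sketch using only the standard library. The function name most_common_words, the regular-expression tokenizer, and the sample sentence are assumptions made for this example; they are not taken from the book.

```python
import re
from collections import Counter


def most_common_words(text, n=3):
    """Return the n most frequent lowercase words in text."""
    # crude tokenizer (an assumption): runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "Gun laws and gun shows: the gun debate continues."
top_words = most_common_words(sample, n=1)
print(top_words)
```

On real text, very common words such as "the" or "and" tend to dominate such counts, so they are usually filtered out before the counts say much about the topic.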
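These recipes use two datasets. The first comes from the website of The Seattle Times: Josh Lederman's article on Obama requiring background checks for more gun sales (http://www.seattletimes.com/nation-world/obama-starts-2016-with-a-fight-over-gun-control/, accessed January 4, 2016).
The other is a set of 50,000 movie reviews prepared by A. L. Maas et al.; the full dataset is available at http://ai.stanford.edu/~amaas/data/sentiment/. It was published in "Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts (2011), Learning Word Vectors for Sentiment Analysis, The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)".
From the 50,000 reviews, we select 2,000 positive and 2,000 negative reviews from each of the training and test batches.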
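The notes do not say how the 2,000 reviews per label were picked. The sketch below assumes simple random sampling with a fixed seed; the function sample_reviews and the toy data are hypothetical and only illustrate selecting an equal number of reviews per label from one batch.

```python
import random


def sample_reviews(reviews, n_per_label, seed=0):
    """Pick n_per_label reviews of each label from (text, label) pairs."""
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    by_label = {}
    for text, label in reviews:
        by_label.setdefault(label, []).append((text, label))
    selected = []
    for label in sorted(by_label):
        # raises ValueError if a label has fewer than n_per_label reviews
        selected.extend(rng.sample(by_label[label], n_per_label))
    return selected


# toy stand-in for one batch; the real batches hold thousands of reviews
toy_batch = [("review %d" % i, "pos" if i % 2 else "neg") for i in range(20)]
subset = sample_reviews(toy_batch, n_per_label=3)
print(len(subset))
```

Applied once to the training batch and once to the test batch with n_per_label=2000, this would give the selection described above.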

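9.2 Reading raw text from the Web

Most of the time, unformatted text lives in a text file. This recipe does not cover reading such files, because that was shown earlier (see Chapter 1 of the book).
The next recipe discusses a way of reading files that we have not covered yet.
Often, however, we need to read raw text directly from the Web: we might want to analyze a blog post, an article, or posts on Facebook/Twitter. Facebook and Twitter provide APIs (Application Programming Interfaces) that generally return data in XML or JSON format; processing HTML files is not as straightforward.
In this recipe, you will learn how to handle a web page: read its content and process it.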
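For contrast with HTML, the sketch below shows why a JSON response is easy to process: the standard json module turns it directly into Python dictionaries and lists. The payload and its field names ("posts", "author", "text") are hypothetical; real APIs define their own structure.

```python
import json

# hypothetical payload; real APIs define their own field names
payload = ('{"posts": [{"author": "alice", "text": "Hello"}, '
           '{"author": "bob", "text": "Hi"}]}')

# one call gives nested Python dicts and lists
data = json.loads(payload)
texts = [post["text"] for post in data["posts"]]
print(texts)
```

With HTML there is no such one-step mapping, because the text of interest is mixed with markup that describes the page layout.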
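Getting ready: you need urllib, html5lib, and Beautiful Soup.
Python 3 ships with urllib (https://docs.python.org/3/library/urllib.html). If Beautiful Soup is missing from your setup, it is easy to install.
In addition, to parse HTML with Beautiful Soup's html5lib parser, as this recipe does, html5lib must be installed as well.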
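A quick way to check the prerequisites is to ask Python whether each module can be found. This is a small sketch; the helper name missing_modules is made up for this example. Note that Beautiful Soup is imported as bs4, and that the packages are published on PyPI as beautifulsoup4 and html5lib.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# bs4 is the import name of Beautiful Soup
required = ["urllib", "bs4", "html5lib"]
print("missing:", missing_modules(required))
```

An empty list means everything needed for this recipe is available.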
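How to do it:
Accessing websites with urllib differs slightly between Python 2.x and Python 3.x: Python 2.x's urllib2 was split into urllib.request and urllib.error, and the Python 3 urllib package also contains urllib.parse and urllib.robotparser.
For more information, see https://docs.python.org/2/library/urllib2.html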
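To make the Python 3 layout concrete, the following sketch touches three of the submodules: urllib.parse to split a URL, urllib.request to open it, and urllib.error to catch failures. The helper try_open is an assumption for this example and is not part of nlp_read.py.

```python
import urllib.error
import urllib.parse
import urllib.request

st_url = ('http://www.seattletimes.com/nation-world/'
          'obama-starts-2016-with-a-fight-over-gun-control/')

# urllib.parse splits a URL into its components
parts = urllib.parse.urlparse(st_url)
print(parts.scheme, parts.netloc)


def try_open(url):
    """Return the raw bytes at url, or None if it cannot be opened."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so HTTP failures land here too
        print('could not open', url, '->', err.reason)
        return None
```

Calling try_open(st_url) needs network access, and the article may no longer be available at that address, so the helper returns None instead of raising when the request fails.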
In this recipe, we use urllib.request (the nlp_read.py file):

import urllib.request as u
import bs4 as bs

# link to the article at The Seattle Times
st_url = 'http://www.seattletimes.com/nation-world/obama-starts-2016-with-a-fight-over-gun-control/'

# read the contents of the webpage
with u.urlopen(st_url) as response:
    html = response.read()

# using beautiful soup -- let's parse the content of the HTML
read = bs.BeautifulSoup(html, 'html5lib')

# find the article tag
article = read.find('article')

# find all the paragraphs of the article
all_ps = article.find_all('p')

# object to hold the body of text
body_of_text = []

# get the title
body_of_text.append(read.title.get_text())
print(read.title)

# put all the paragraphs to the body of text list
for p in all_ps:
    body_of_text.append(p.get_text())

# we don't need some of the parts at the bottom of the page
body_of_text = body_of_text[:24]

# let's see what we got
print('\n'.join(body_of_text))

# and save it to a file (UTF-8, since the text contains curly quotes)
with open('../../Data/Chapter09/ST_gunLaws.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(body_of_text))

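How it works: as always, we first import the modules we need; here, these are urllib and Beautiful Soup.
The link to The Seattle Times article is stored in the st_url object. urllib's .urlopen(...) method opens that URL.
We use the with ... as ... construct, which we already know, because it closes the connection for us once we no longer need it. Of course, you could also do this: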

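import urllib.request

# download the page to a temporary local file
local_filename, headers = \
    urllib.request.urlretrieve(st_url)
html = open(local_filename)
# without a with block, the file has to be closed explicitly
html.close()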

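The response object's .read() method reads the entire content of the web page. Printed out, it looks like this (abbreviated, of course):

[Figure: abbreviated raw HTML of the page as returned by .read()]

This is the content of the web page as plain text, markup included. It is not what we want to analyze.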
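One detail worth knowing is that .read() returns bytes rather than a str. The sketch below shows this offline by using a data: URL as a stand-in for a real page (an assumption made so the example needs no network); it decodes the bytes with the charset announced in the response headers.

```python
import urllib.parse
import urllib.request

# a data: URL stands in for a real page so the sketch runs offline
page = '<p>caf\u00e9</p>'
url = 'data:text/html;charset=utf-8,' + urllib.parse.quote(page)

with urllib.request.urlopen(url) as response:
    raw = response.read()                             # bytes
    charset = response.headers.get_content_charset()  # from Content-Type

# fall back to UTF-8 when the server does not announce a charset
text = raw.decode(charset or 'utf-8')
print(type(raw).__name__, '->', text)
```

The recipe passes the raw bytes straight to Beautiful Soup, which handles the decoding itself, so an explicit decode like this is only needed when working with the text directly.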
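Beautiful Soup to the rescue! BeautifulSoup(...) takes the HTML or XML text as its first argument. The second argument specifies the parser to use.
For all available parsers, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
After parsing, the output is (somewhat) more readable:
[Figure: the parsed HTML as printed by Beautiful Soup]

However, we do not use Beautiful Soup to print results to the screen. Internally, a BeautifulSoup object is implemented as a hierarchy of the tags in the document.
You can think of a BeautifulSoup object as a tree.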
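The tree view can be made concrete with a tiny document. This is a sketch under two assumptions: the HTML string is a made-up stand-in for the real page, and the built-in 'html.parser' is used so the example runs without html5lib (the recipe itself uses 'html5lib').

```python
import bs4 as bs

# tiny stand-in document; the real page is much larger
html = ('<html><head><title>Demo</title></head>'
        '<body><article><p>First.</p><p>Second.</p></article></body></html>')

# 'html.parser' ships with Python; the recipe itself uses 'html5lib'
soup = bs.BeautifulSoup(html, 'html.parser')

# walk down the tree by tag name
first_p = soup.body.article.p
print(first_p.name, first_p.parent.name)

# list the tags directly under <article>
child_names = [child.name for child in soup.article.children]
print(child_names)
```

Attribute access such as soup.body.article.p returns the first matching descendant, and .parent moves back up the tree.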
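Given the preceding output, the good news is that you can find every tag in an HTML/XML file. Newer web pages (HTML5-compliant) have some new tags that make it easier to present content on the page.
For all the new elements, see: http://www.w3.org/TR/html5-diff/#new-elements.
In our example, we first find and extract the article tag; this narrows our search:
[Figure: the extracted article element]
Now we focus only on the article content and ignore the parts used to build the page. If you look at the web page yourself, you will recognize these sentences:
Obama moves to require background checks for more gun sales.
Originally published January 4, 2016 at 12:50 am.
We are on the right track. Scrolling down a little, we see more familiar sentences from the article:
[Figure: the article's paragraphs, each wrapped in a pair of <p> tags]
Clearly, the paragraphs of the article are enclosed in <p> tag pairs, so we extract them all with .find_all('p').
Then we add the title to the body_of_text list. We use the .get_text() method to extract the text between the tags; otherwise, what we get would include the tags: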

/*
<title>Obama moves to require background checks for more gun sales | The Seattle Times</title>
*/
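The difference between a tag and its text, and the effect of narrowing the search to the article element, can be reproduced on a small made-up document. In this sketch the HTML string and the helper article_paragraphs are assumptions for illustration, and 'html.parser' replaces 'html5lib' so that no extra package is needed.

```python
import bs4 as bs

html = ('<html><head><title>Headline | Site</title></head><body>'
        '<nav><p>Menu</p></nav>'
        '<article><p>First <b>bold</b> paragraph.</p><p>Second.</p></article>'
        '</body></html>')
soup = bs.BeautifulSoup(html, 'html.parser')


def article_paragraphs(soup):
    """Return the text of every <p> inside the first <article>, if any."""
    article = soup.find('article')
    if article is None:          # .find() returns None when nothing matches
        return []
    return [p.get_text() for p in article.find_all('p')]


paragraphs = article_paragraphs(soup)
print(str(soup.title))         # tag included
print(soup.title.get_text())   # text only
print(paragraphs)
```

The paragraph inside the nav element is skipped because the search starts at the article element, and .get_text() also drops nested tags such as b. The None check matters on pages that have no article element, where calling .find_all on the result would otherwise raise an AttributeError.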

We strip the tags from all the paragraphs in the same way. You will see the following on screen (abbreviated):

/*
Obama moves to require background checks for more gun sales | The Seattle Times
Although Obama can't unilaterally change gun laws, the president is hoping that beefing up enforcement of existing laws can prevent at least some gun deaths in a country rife with them.
WASHINGTON (AP) — President Barack Obama moved Monday to expand background checks to cover more firearms sold at gun shows, online and anywhere else, aiming to curb a scourge of gun violence despite unyielding opposition to new laws in Congress.
Obama’s plan to broaden background checks forms the centerpiece of a broader package of gun control measures the president plans to take on his own in his final year in office. Although Obama can’t unilaterally change gun laws, the president is hoping that beefing up enforcement of existing laws can prevent at least some gun deaths in a country rife with them.
Washington state voters last fall passed Initiative 594 that expanded background checks for gun buyers to include private sales and transfers, such as those conducted online or at gun shows.
Gun-store owner moving out of Seattle because of new tax
“This is not going to solve every violent crime in this country,” Obama said. Still, he added, “It will potentially save lives and spare families the pain of these extraordinary losses.”
Under current law, only federally licensed gun dealers must conduct background checks on buyers, but many who sell guns at flea markets, on websites or in other informal settings don’t register as dealers. Gun control advocates say that loophole is exploited to skirt the background check requirement.
Now, the Justice Department’s Bureau of Alcohol, Tobacco, Firearms and Explosives will issue updated guidance that says the government should deem anyone “in the business” of selling guns to be a dealer, regardless of where he or she sells the guns. To that end, the government will consider other factors, including how many guns a person sells and how frequently, and whether those guns are sold for a profit.
The executive actions on gun control fall far short of what Obama and likeminded lawmakers attempted to accomplish with legislation in 2013, after a massacre at a Connecticut elementary school that shook the nation’s conscience. Even still, the more modest measures were sure to spark legal challenges from those who oppose any new impediments to buying guns.
“We’re very comfortable that the president can legally take these actions,” said Attorney General Loretta Lynch.
Obama’s announcement was hailed by Democratic lawmakers and gun control groups like the Brady Campaign to Prevent Gun Violence, which claimed Obama was making history with “bold and meaningful action” that would make all Americans safer. Hillary Clinton, at a rally in Iowa, said she was “so proud” of Obama but warned that the next president could easily undo his changes.
“I won’t wipe it away,” Clinton said.
But Republicans were quick to accuse Obama of gross overreach. Sen Bob Corker, R-Tenn., denounced Obama’s steps as “divisive and detrimental to real solutions.”
“I will work with my colleagues to respond appropriately to ensure the Constitution is respected,” Corker said.
Far from mandating background checks for all gun sales, the new guidance still exempts collectors and gun hobbyists, and the exact definition of who must register as a dealer and conduct background checks remains exceedingly vague. The administration did not issue a number for how many guns someone must sell to be considered a dealer, instead saying it planned to remind people that courts have deemed people to be dealers in some cases even if they only sell a handful of guns.
And the background check provision rests in the murky realm of agency guidelines, which have less force than full-fledged federal regulations and can easily be rescinded. Many of the Republican presidential candidates running to succeed Obama have vowed to rip up his new gun restrictions upon taking office.
In an attempt to prevent gun purchases from falling through the cracks, the FBI will hire 230 more examiners to process background checks, the White House said, an increase of about 50 percent. Many of the roughly 63,000 background check requests each day are processed within seconds. But if the system kicks back a request for further review, the government only has three days before federal law says the buyer can return and buy the gun without being cleared.
That weak spot in the system came under scrutiny last summer when the FBI revealed that Dylann Roof, the accused gunman in the Charleston, S.C., church massacre, was improperly allowed to buy a gun because incomplete record-keeping and miscommunication among authorities delayed processing of his background check beyond the three-day limit.
The White House also said it planned to ask Congress for $500 million to improve mental health care, and Obama issued a memorandum directing federal agencies to conduct or sponsor research into smart gun technology that reduces the risk of accidental gun discharges. The Obama administration also plans to complete a rule, already in the works, to close another loophole that allows trusts or corporations to purchase sawed-off shotguns, machine guns and similar weapons without background checks.
Obama planned to announce the new measures at an event at the White House on Tuesday as he continued a weeklong push to promote the gun effort and push back on its critics.
He met at the White House on Monday with Democratic lawmakers who have supported stricter gun control, and planned to take his argument to prime time Thursday with a televised town hall discussion. The initiative also promised to be prominent in Obama’s final State of the Union address next week.
Whether the new steps will effectively prevent future gun deaths remained unclear. Philip Cook, a Duke University professor who researches gun violence and policy, said surveys of prisoners don’t show gun shows to be a major direct source of weapons used in violent crime. The attorney general, asked how many dealers would be newly forced to register, declined to give a number.
“It’s just impossible to predict,” Lynch said.
*/