Error getting text from an HTML table using BeautifulSoup and mechanize: answers

【Question title】: Error getting text from an HTML table using BeautifulSoup and mechanize
【Posted】: 2017-07-11 13:18:46
【Question】:

I am trying to get the text inside a table tag in some HTML, but I am not getting all of it. I only get part of the text, and the rest is ignored.

Here are my output and my code:

Output

Public Sector Organization (Recruitment Test)
Test held on: Saturday, 3rd & Sunday 4th, December 2016
>>>

Code

import mechanize
from bs4 import BeautifulSoup
import urllib
from PIL import Image
import os


Roll = 60170001

url = "http://nts.org.pk/Test&Products/Results/012017/PubSecOrg_24122016_Result/Search.php"

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt
br.open(url)
br.select_form(nr=0)  # select the first form on the page
rollnumber = str(Roll)
captcha = 11111
cap = str(captcha)
br["RollNo"] = rollnumber
br["captcha"] = cap
res = br.submit()
content = res.read()
soup = BeautifulSoup(content, "html.parser")
rolln = soup('table')[2]  # third <table> in the parsed response
rolln = rolln.text.encode('utf-8')
print rolln

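【Comments】:

What is the expected output?
In principle, my output should be the whole text inside table[2], something like this: Public Sector Organization (Recruitment Test) Test held on: Saturday, 3rd & Sunday 4th, December 2016 (Result) Uploaded on: Wednesday, 23 November 2016 Search Result for the keyword 60170001 Roll No Name Father Name CNIC Post NTS Marks 60170001 MUMTAZ ALI RAHMAN WALI 16101-1938424-7 Lecturer (BPS-17) (Electronics) 67 Current Date / Time: Wednesday 22nd, February 2017, 09:30:48 PM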

Tags: python html beautifulsoup mechanize

【Solution 1】:

This approach seems to give you what you need.

>>> content = open(r"C:\scratch\___National Testing Service___.html").read()
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(content, 'lxml')
>>> tables = soup.findAll('table')
>>> len(tables)
8
>>> tables[2].text
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPublic Sector Organization (Recruitment Test)\nTest held on: Saturday, 3rd & Sunday 4th, December 2016\n\n                            \n                            (Result)\n\n\n\n\n\n                                Search Result for the keyword   "\n                                60170001                             \n"\n\n\n\nRoll No\nName\nFather Name\nCNIC\n\nPost\n\n\nKDPH\n\n\nNTS Marks\n\n\n\n60170001\nSARA ISLAM                               \nNAZAR UL ISLAM  \n17301-2406027-4  \n\n    Assistant Manager(Electronics Engineering)   \n\n\n      \n\n\n    63   \n\n\n\n\n\n\n\n\n\n\nCurrent Date / Time: Tuesday 21st, February 2017 , 11:49:59 PM                           \n\n\n\n\n\xa0\n\n'

This assumes that mechanize gives you the same HTML that I got by simply opening the page in Chrome and saving it.
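
The raw .text of the table above is mostly newlines and padding. As a minimal sketch (not part of the original answer), the helper below collapses that whitespace into a single readable line. The function name table_text and the sample HTML are made up for illustration; it assumes bs4 is installed and uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup


def table_text(html, index=2, parser="html.parser"):
    """Return the text of the table at `index`, with whitespace collapsed.

    `table_text` is a hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, parser)
    tables = soup.find_all("table")
    if index >= len(tables):
        raise IndexError(
            "only %d table(s) found with parser %r" % (len(tables), parser)
        )
    # get_text(" ") joins all text nodes with a space; split() with no
    # arguments then breaks on every run of whitespace, so joining the
    # pieces with single spaces removes the padding and blank lines.
    return " ".join(tables[index].get_text(" ").split())


# Made-up sample standing in for the saved result page.
sample = """
<table><tr><td>first</td></tr></table>
<table><tr><td>second</td></tr></table>
<table>
  <tr><th>Roll No</th><th>NTS Marks</th></tr>
  <tr><td> 60170001 </td><td>   63   </td></tr>
</table>
"""

print(table_text(sample))
```

Passing parser="lxml" switches to the parser used in the session above, provided lxml is installed.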

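【Comments】:

Thanks a lot.. it finally works perfectly. I just had to install lxml.
You're welcome. I can't say for sure what the problem might have been.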
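
Since the only visible difference between the question and the answer is the parser ("html.parser" versus "lxml"), one way to investigate is to parse the same HTML with each installed parser and compare what each one finds. The sketch below is an assumption about how to diagnose this, not something from the thread; count_tables_per_parser is a hypothetical helper, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tables_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the number of <table> tags it finds.

    Hypothetical helper for comparing parsers; not part of bs4.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser
            # library is not installed; skip that parser.
            continue
        counts[name] = len(soup.find_all("table"))
    return counts


# Made-up, well-formed sample; replace it with the content returned
# by mechanize to compare parsers on the real page.
html = "<table><tr><td>a</td></tr></table><table><tr><td>b</td></tr></table>"
print(count_tables_per_parser(html))
```

On well-formed HTML the counts should agree. If they differ on the real response, the page markup is probably malformed and the parsers are repairing it differently, which would be consistent with the fix reported above.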