Fetching website name contains the HTML code in Python 2.7

【Question Title】: Fetching websites name contains the HTML code in Python 2.7
【Posted】: 2015-09-22 20:46:24
【Question Description】:

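I'm having a problem running a Python script that downloads business directory data such as company name, address, location, and web address.

When the script fetches a company's website name, e.g. www.example.com, it picks up the HTML code around the website name rather than the name alone, and then stores that HTML in the MySQL server as the website.

I use the following Python libraries: BeautifulSoup, lxml, html, hashlib, and urllib2. The website name ends up in the MySQL server together with its HTML code, for example: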

<input><tr><td>www.example.com</td></tr></input>

I want to remove this HTML markup and store only the company's web address (such as www.example.com) in the MySQL server.

My code is here:

for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="webSite_0" />', '')
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)

Any solution or suggestion on how to fix this would be appreciated.

【Question Discussion】:

Tags: html mysql python-2.7 beautifulsoup lxml

【Solution 1】:

You have (at least) two options: use re or BeautifulSoup.

Using re

import re

# Matches any HTML tag, e.g. <td> or </tr>
cleanse_url = re.compile(r'<[^>]*>')

web = ""  # default, so the check below still works when nothing matches
for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = str(hit).replace('<input type="hidden" value="', '')
    web = web.replace('" id="webSite_0" />', '')
    web = cleanse_url.sub('', web).strip()  # strip any remaining HTML tags
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)
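
As a quick check of what the regular expression does, here is a minimal standalone sketch applied to the sample string from the question. The helper name `strip_tags` is hypothetical and not part of the original script. Note that this pattern deletes whole tags, attribute values included, so it only helps when the URL appears as text between tags.

```python
import re

# Matches any single HTML tag such as <td> or </input>.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(fragment):
    """Remove every tag from an HTML fragment and trim surrounding whitespace."""
    return TAG_RE.sub('', fragment).strip()


# Sample value taken from the question.
sample = '<input><tr><td>www.example.com</td></tr></input>'
print(strip_tags(sample))
```

If the URL lives inside an attribute (for example `value="..."`), this approach returns an empty string, so reading the attribute through the parser is the safer route in that case.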

Using BeautifulSoup.Tag.text

I think this option is better, because tag.text returns only the text content and drops both the tags and their attributes.

web = ""  # default, so the check below still works when nothing matches
for hit in soup2.findAll(attrs={'id' : 'webSite_0'}):
    web = hit.text.strip()  # let BeautifulSoup return the text without tags
if web == "":
    flog.write("\nWebsite extraction... Failed")
    print "None"
else:
    flog.write("\nWebsite extraction... OK")
    print web
    companyObj.setWeb(web)
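
One caveat: the question's own code suggests the element is a hidden `<input>` whose URL sits in the `value` attribute, in which case the tag has no text to return. The following is a minimal sketch of one way to handle both cases. It assumes the `bs4` package (Beautiful Soup 4) with the built-in `html.parser` backend, which may differ from the version used in the original script, and the helper name `extract_website` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_website(html, element_id='webSite_0'):
    """Return the website held by the element with the given id, or ''."""
    soup = BeautifulSoup(html, 'html.parser')
    hit = soup.find(attrs={'id': element_id})
    if hit is None:
        return ''
    # A hidden <input> carries its data in the value attribute, not as text.
    value = hit.get('value')
    if value:
        return value.strip()
    # Otherwise fall back to the element's text content.
    return hit.get_text(strip=True)


html = '<input type="hidden" value="www.example.com" id="webSite_0" />'
print(extract_website(html))
```

The returned string could then be passed to `companyObj.setWeb(...)` in place of the manual `replace` calls.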

【Discussion】:

Thanks man... but I'm still getting an error. Running the following query does solve my problem, but I need to do this in my code. The query is UPDATE amt_austria_company SET web = REPLACE( web, '', '' ) WHERE web !="". Running these two queries cleans the URLs in the database and fixes them.