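mzjn's answer is correct. After some trial and error I managed to get it working. This is what the final code looks like. You need to put //text() at the end of the XPath. I haven't refactored it yet, so there are bound to be some mistakes and bad practices, but it works.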
import requests
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed connections up to 3 times, with a backoff between attempts.
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Fetch the page once and parse the raw bytes with lxml.
page = session.get("The url you are webscraping")
content = page.content
tree = html.fromstring(content)

# The trailing //text() returns every text node under the matched divs,
# including text inside nested child tags.
scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')
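I tried it on the team page of keeleyteton.com. It returned the correct list below (although it needs a lot of cleanup!), because the text sits in different tags, some of which are child tags. Thanks for the help!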
['\r\n ', '\r\n ', 'Nicholas F. Galluccio', '\r\n ', '\r\n ', 'Managing Director and Portfolio Manager', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Scott R. Butler', '\r\n ', '\r\n ', 'Senior Vice President and Portfolio Manager ', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Thomas E. Browne, Jr., CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Brian P. Leonard, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Robert M. Goldsborough', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value',
'\r\n ', '\r\n ', '\r\n ', 'Brian R. Keeley, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Edward S. Borland', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value',
'\r\n ', '\r\n ', '\r\n ', 'Kevin M. Keeley', '\r\n ', '\r\n ', 'President',
'\r\n ', '\r\n ', '\r\n ', 'Deanna B. Marotz', '\r\n ', '\r\n ', 'Chief Compliance Officer', '\r\n ']
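
As for the cleanup mentioned above, here is a minimal sketch, assuming you only want the non-empty strings with surrounding whitespace removed. `clean_text_nodes` is a hypothetical helper name I made up for illustration (it is not part of lxml), and the input is just an excerpt of the list shown above.

```python
def clean_text_nodes(nodes):
    """Strip each text node and drop the ones that were only whitespace."""
    stripped = (node.strip() for node in nodes)
    return [text for text in stripped if text]


# Excerpt of the scraped list shown above (the first person's text nodes).
scraped = ['\r\n ', '\r\n ', 'Nicholas F. Galluccio', '\r\n ', '\r\n ',
           'Managing Director and Portfolio Manager', '\r\n ',
           'Teton Small Cap Select Value', '\r\n ',
           'Keeley Teton Small Mid Cap Value', '\r\n ']

cleaned = clean_text_nodes(scraped)
print(cleaned)
```

This only removes the whitespace-only entries. It does not group names with their titles and funds, so that part would still need its own logic.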