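【Posted】: 2022-07-06 04:17:21
【Question】:
I am trying to click multiple divs that share the same class name. For each one I parse the HTML page, extract some information, and return to the original page. On this page:
- Select an item and extract the relevant information
- Go back to the original page
- Click the next item.
This works perfectly outside a for loop: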
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH,'//*[@class="product__wrapper"][1]'))).click()
But when I use the same command inside a loop, it throws InvalidSelectorException:
for i in range(1,len(all_profile_url)):
    WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH,'//*[@class="product__wrapper"][{i}]'))).click()
    time.sleep(10)
    wd.execute_script('window.scrollTo(0,1000)')
    page_source = BeautifulSoup(wd.page_source, 'html.parser')
    info_div = page_source.find('div', class_='ProductInfoCard__Breadcrumb-sc-113r60q-4 cfIqZP')
    info_block = info_div.find_all('a')
    try:
        info_category = info_block[1].get_text().strip()
    except IndexError:
        info_category = "Null"
    wd.back()
    time.sleep(5)
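The InvalidSelectorException is consistent with the XPath in the loop being an ordinary string rather than an f-string: the literal text `{i}` is sent to the browser, and braces are not valid XPath. Below is a minimal sketch of building the indexed expression. The helper name `nth_match_xpath` is hypothetical, and wrapping the expression in parentheses assumes the goal is the i-th match in the whole document rather than the i-th matching child of each parent.

```python
def nth_match_xpath(class_name: str, index: int) -> str:
    """Build an XPath for the index-th element (1-based) whose class attribute equals class_name."""
    if index < 1:
        raise ValueError("XPath positions are 1-based")
    # The parentheses make [index] apply to the full result set,
    # not to the position of each element among its siblings.
    return f'(//*[@class="{class_name}"])[{index}]'


# Without the f prefix, the braces reach the browser unchanged.
broken = '//*[@class="product__wrapper"][{i}]'
fixed = nth_match_xpath("product__wrapper", 1)

print(broken)
print(fixed)
```

In the loop, the result would be passed as `(By.XPATH, nth_match_xpath("product__wrapper", i))`. Assuming `all_profile_url` holds one entry per item, `range(1, len(all_profile_url) + 1)` would be needed to reach the last item, because XPath positions start at 1.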
This is the code I want to use to extract information from each page:
page_source = BeautifulSoup(wd.page_source, 'html.parser')
info_div = page_source.find('div', class_='ProductInfoCard__Breadcrumb-sc-113r60q-4 cfIqZP')
info_block = info_div.find_all('a')
try:
    info_category = info_block[1].get_text().strip()
except IndexError:
    info_category = "Null"
try:
    info_sub_category = info_block[2].get_text().strip()
except IndexError:
    info_sub_category = 'Null'
try:
    info_product_name = info_div.find_all('span')[0].get_text().strip()
except IndexError:
    info_product_name = 'null'
# Extract Brand name
info_div_1 = page_source.find('div', class_='ProductInfoCard__BrandContainer-sc-113r60q-9 exyKqL')
try:
    info_brand = info_div_1.find_all('a')[0].get_text().strip()
except IndexError:
    info_brand = 'null'
# Extract details for rest of the page
info_div_2 = page_source.find('div', class_='ProductDetails__RemoveMaxHeight-sc-z5f4ag-3 fOPLcr')
info_block_2 = info_div_2.find_all('div', class_='ProductAttribute__ProductAttributesDescription-sc-dyoysr-2 lnLDYa')
try:
    info_shelf_life = info_block_2[0].get_text().strip()
except IndexError:
    info_shelf_life = 'null'
try:
    info_country_of_origin = info_block_2[3].get_text().strip()
except IndexError:
    info_country_of_origin = 'null'
try:
    info_weight = info_block_2[9].get_text().strip()
except IndexError:
    info_weight = 'null'
try:
    info_expiry_date = info_block_2[7].get_text().strip()
except IndexError:
    info_expiry_date = 'null'
# Extract MRP and price
info_div_3 = page_source.find('div', class_='ProductVariants__VariantDetailsContainer-sc-1unev4j-7 fvkqJd')
info_block_3 = info_div_3.find_all('div', class_='ProductVariants__PriceContainer-sc-1unev4j-9 jjiIua')
info_price_raw = info_block_3[0].get_text().strip()
info_price = info_block_3[0].get_text().strip()[1:3]
info_MRP = info_price_raw[-2:]
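The repeated try/except blocks above could be folded into one helper. Two points are worth noting: BeautifulSoup's `find` returns `None` when nothing matches, so a later call such as `.find_all` raises AttributeError rather than IndexError; and the slices `[1:3]` and `[-2:]` only work for two-digit prices. The sketch below is one possible approach, not the asker's code. The names `text_at` and `extract_numbers` are hypothetical, and the sample price text is an assumption because the real page format is not shown.

```python
import re


def text_at(elements, index, default="null"):
    """Return the stripped text of elements[index], or default if it is unavailable."""
    try:
        return elements[index].get_text().strip()
    except (IndexError, TypeError, AttributeError):
        # IndexError: fewer matches than expected.
        # TypeError: elements is None (nothing was found upstream).
        # AttributeError: the item has no get_text method.
        return default


def extract_numbers(text):
    """Return every number found in text, in order of appearance, as floats."""
    return [float(match) for match in re.findall(r"\d+(?:\.\d+)?", text)]


# Hypothetical price text; the real format on the page may differ.
print(extract_numbers("₹145 MRP ₹160"))
```

With this, a lookup such as the category could be written as `text_at(info_div.find_all('a') if info_div else None, 1, default="Null")`, which also covers the case where the container div is missing.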
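【Discussion】:
-
Any chance you could narrow down your example?
-
@dosas Edited above.
-
I suggest you fetch the links for all the items at once, then visit the URLs one by one.
-
Let me know if you need me to code it.
-
@HimanshuPoddar That is exactly what I did on my first attempt, using wd.get(all_profile_url[i]). But after the first few iterations, the loop failed completely on the .get() call. I then took a longer route and restarted the webdriver on every iteration, but that also fails randomly on the wd.get() call.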
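The approach discussed in the comments (collect the links once, then visit each URL) could be combined with retries to tolerate the intermittent `wd.get()` failures. The sketch below only illustrates that idea and is not a confirmed fix for the failures described. The names `with_retries` and `visit_each` are hypothetical, and the retry count and delays are arbitrary assumptions.

```python
import time


def with_retries(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or attempts run out, then re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            # Wait a little longer after each failed try (linear back-off).
            time.sleep(delay * attempt)


def visit_each(urls, visit, **retry_options):
    """Visit every URL; map each URL to its result, or to None if it ultimately failed."""
    results = {}
    for url in urls:
        try:
            results[url] = with_retries(lambda: visit(url), **retry_options)
        except Exception:
            results[url] = None
    return results
```

With Selenium, `visit` might be a function that calls `wd.get(url)` and returns the parsed fields, and `exceptions` could be narrowed to `(WebDriverException,)` from `selenium.common.exceptions`. Recording `None` for a failed URL lets the loop continue instead of stopping at the first failure.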
Tags: python selenium selenium-webdriver web-scraping