[Question title]: How to parse info in a specific language from a multilingual site?
[Posted]: 2022-11-09 13:21:22
[Question]:

I am trying to parse information from a multilingual site. I cannot get the content in English: the soup I build always contains the Russian version of the page.

The link and my code are as follows.

'https://iherb.com/c/california-gold-nutrition'

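`import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

headers = {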
    "Accept-Language": "en",
    "user-agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

def make_soup(url):
    r = requests.get(url=url, headers=headers)
    r.encoding = 'utf-8'
    return BeautifulSoup(r.text, 'lxml')

url = 'https://iherb.com/c/california-gold-nutrition'

with webdriver.Chrome() as browser:
    browser.get(url)

    menue_goer = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, \
    '.language-select.hidden-xs.hidden-sm'))).click()

    language = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,
    '.select-language.gh-dropdown'))).click()

    English = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.CSS_SELECTOR,
    ".item.gh-dropdown-menu-item["data-val='en-US']"))).click()

    save_button = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.XPATH,
    "//button[@class='save-selection gh-btn gh-btn-primary']"))).click()

    time.sleep(10)

soup = make_soup(url)
names = [x['title'].replace(u'\xa0', u' ') for x in soup.find('div', id='ProductsPage').find_all('a', class_='absolute-link product-link')]

print(names)`

So far I have tried changing the language setting with Selenium and playing with the request headers, but neither worked. Is there any way to get the page in a specific language?

  • Check this locator: By.CSS_SELECTOR, ".item.gh-dropdown-menu-item["data-val='en-US']". Is it correct? You have to remove the double quote before the text data-val, so it should be: ".item.gh-dropdown-menu-item[data-val='en-US']"
  • This is entirely up to the website. If it provides a method for changing the language (and many sites do not), then you have to figure out how to select it.
  • @AbiSaran, Thank you, sir. I removed the double quote, but it wouldn't work anyway.
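
To illustrate the quoting issue raised in the first comment: as posted, the extra double quote closes the Python string literal early, so the line cannot even be compiled, and Selenium never receives the selector. Below is a small check that uses only the built-in `compile`; the helper name `is_valid_expression` is illustrative and not part of the original code.

```python
# The selector literal as posted in the question: the extra " ends the string early.
broken_src = '''".item.gh-dropdown-menu-item["data-val='en-US']"'''
# The selector literal as suggested in the comment: one string, single quotes inside.
fixed_src = '''".item.gh-dropdown-menu-item[data-val='en-US']"'''


def is_valid_expression(source: str) -> bool:
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<selector>", "eval")
        return True
    except SyntaxError:
        return False


print(is_valid_expression(broken_src))  # False
print(is_valid_expression(fixed_src))   # True
```

Fixing the literal only makes the script run; whether the language actually changes still depends on how the site stores that preference.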

Tags: python selenium parsing multilingual


[Solution 1]:

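I modified your code. The stray double quote in the en-US selector is removed, and the driver is now created through a Service object: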

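# headers, make_soup, url and the remaining imports are unchanged from the question
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager  # third-party package: webdriver-manager

with webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) as browser:  # included the service here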
    browser.get(url)
    menue_goer = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.language-select.hidden-xs.hidden-sm'))).click()
    language = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.select-language.gh-dropdown'))).click()
    English = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".item.gh-dropdown-menu-item[data-val='en-US']"))).click()
    save_button = WebDriverWait(browser,5).until(EC.element_to_be_clickable((By.XPATH, "//button[@class='save-selection gh-btn gh-btn-primary']"))).click()
    time.sleep(10)
soup = make_soup(url)
names = [x['title'].replace(u'\xa0', u' ') for x in soup.find('div', id='ProductsPage').find_all('a', class_='absolute-link product-link')]

print(names)

Got the output as:

['California Gold Nutrition, Gold C, USP Grade Vitamin C, 1,000 mg, 60 Veggie Capsules', 'California Gold Nutrition, Omega-3 Premium Fish Oil, 180 EPA / 120 DHA, 100 Fish Gelatin Softgels', 'California Gold Nutrition, LactoBif Probiotics, 30 Billion CFU, 60 Veggie Capsules', 'California Gold Nutrition, Vitamin D3, 125 mcg (5,000 IU), 360 Fish Gelatin Softgels', 'California Gold Nutrition, Immune 4, Immune System Support, 60 Veggie Capsules', 'California Gold Nutrition, Vitamin D3, 125 mcg (5,000 IU), 90 Fish Gelatin Softgels', 'California Gold Nutrition, LactoBif Probiotics, 5 Billion CFU, 60 Veggie Capsules', 'California Gold Nutrition, Vitamin D3, 50 mcg (2,000 IU), 90 Fish Gelatin Softgels', 'California Gold Nutrition, Omega 800 Pharmaceutical Grade Fish Oil, 80% EPA/DHA, Triglyceride Form, 1,000 mg, 30 Fish Gelatin Softgels', 'California Gold Nutrition, Vitamin C Gummies, 90 Gummies', 'California Gold Nutrition, FOODS, Variety Pack Snack Bars, 12 Bars, 1.4 oz (40 g) Each', 'California Gold Nutrition, Gold C, USP Grade Vitamin C, 500 mg, 240 Veggie Capsules', 'California Gold Nutrition, Omega-3 Premium Fish Oil, 240 Fish Gelatin Softgels', 'California Gold Nutrition, Baby Vitamin D3 Liquid, 10 mcg (400 IU), 0.34 fl oz (10 ml)', 'California Gold Nutrition, Silymarin Complex, Milk Thistle Extract Plus Dandelion, Artichoke, Curcumin C3 Complex®, Ginger, and BioPerine®, 120 Veggie Capsules', 'California Gold Nutrition, Vitamin D3 Gummies, No Gelatin, No Gluten, Mixed Berry & Fruit Flavors, 25 mcg (1,000 IU), 90 Gummies', 'California Gold Nutrition, Gold C Powder, Vitamin C, 1,000 mg, 8.81 oz (250 g)', 'California Gold Nutrition, Astaxanthin, Astaliff® Pure Icelandic, 12 mg, 120 Veggie Softgels', 'California Gold Nutrition, Immune 4, Immune System Support, 180 Veggie Capsules', 'California Gold Nutrition, Buffered Gold C, GOLD Standard Sodium Ascorbate (Vitamin C), 750 mg, 240 Veggie Capsules', 'California Gold Nutrition, Organic Spirulina, 500 mg, 60 Tablets', 
'California Gold Nutrition, Vitamin D3, 50 mcg (2,000 IU), 360 Fish Gelatin Softgels', 'California Gold Nutrition, Buffered Gold C, GOLD Standard Sodium Ascorbate (Vitamin C), 750 mg, 60 Veggie Capsules', 'California Gold Nutrition, Total Veggie Joint Support Formula, With Glucosamine, Chondroitin, MSM, and Hyaluronic Acid, 90 Veggie Capsules']
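
One point worth checking in the code above: make_soup downloads the page a second time with requests, which is a separate HTTP client and does not share the Selenium browser's session. The language chosen in the browser may therefore not reach that second request, and the English output may depend on where the request is made from. A minimal sketch of one alternative, assuming the goal is to parse the HTML the browser itself rendered after the language switch. The names `product_names`, `rendered_html` and the `switch_language` callback are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup


def product_names(html: str) -> list[str]:
    """Extract product titles from the listing page's HTML."""
    # 'html.parser' ships with Python; 'lxml' (used in the question) also works if installed.
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="ProductsPage")
    if container is None:  # layout changed, or the page was not served as expected
        return []
    links = container.find_all("a", class_="absolute-link product-link")
    # Replace non-breaking spaces, as the original list comprehension does.
    return [a["title"].replace("\xa0", " ") for a in links if a.has_attr("title")]


def rendered_html(url: str, switch_language) -> str:
    """Open `url` in Chrome, run `switch_language(browser)`, and return the rendered HTML.

    `switch_language` is a hypothetical callback that performs the four clicks
    shown above and waits until the page has reloaded.
    """
    from selenium import webdriver  # imported here so product_names works without Selenium

    with webdriver.Chrome() as browser:
        browser.get(url)
        switch_language(browser)
        return browser.page_source  # the HTML as the browser currently holds it
```

With this split, `product_names(rendered_html(url, switch_language))` parses the same page the browser shows, so the language selected through the menu is the one that gets scraped.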