Python Web Scraping - How do I avoid scraping duplicates for my database? (Answers)

【Question title】: Python Web Scraping - How do I avoid scraping duplicates for my database?
【Posted】: 2021-03-11 18:31:33
【Question description】:

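I'm new to Python and, with help from this site and YouTube, recently wrote a program that scrapes the pages of news articles published on a news website. My next step was to set up a database and feed the scraped articles into it.

Setting up the database worked. The obvious problem I ran into is that the same news articles get scraped again, and therefore added to the database again, every time I re-run the program. Unfortunately I haven't found an answer or video that solves this for me so far, so I'm hoping someone can help (maybe I've been searching with the wrong terms; I did my best).

The program code itself works as expected. I just need some way of "recognizing" articles that are already stored, or perhaps a different approach altogether. Any help is much appreciated :) The code is below: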

import requests
import sqlite3
from bs4 import BeautifulSoup
from time import sleep
from random import randint

connect = sqlite3.connect('StoredArticles.db')
cursor = connect.cursor()

# cursor.execute('''CREATE TABLE articlestable
# (article_page INT, article_time TEXT, article_title TEXT, article_link TEXT, article_description TEXT)''')

# Scraping Function
def getarticles(page):
    headers = {
        'User-Agent':
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) ' +
            'Version/14.0.1 Safari/605.1.15'
    }
    url = 'https://www.prnewswire.com/news-releases/news-releases-list/?page=' + str(page) + '&pagesize=100'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    articles = soup.select('.card-list > .row')  # select all rows that are under class "card-list"
    print("Scraping page " + str(page) + "...")
    sleep(randint(0, 1))

    for item in articles:
        article_page = page
        article_time = item.select_one('h3 small').text
        article_title = item.select_one('h3 small').find_next_sibling(text=True).strip()
        article_link = 'https://www.prnewswire.com/' + item.select_one('a')['href']
        article_description = item.select_one('p').get_text(strip=True, separator='\n')
        cursor.execute('''INSERT INTO articlestable VALUES(?,?,?,?,?)''',
                       (article_page, article_time, article_title, article_link, article_description))
    return

# Range of Pages to scrape through
for x in range(1, 3):
    getarticles(x)

# Add to Database and Finish Program
connect.commit()
cursor.execute('''SELECT * FROM articlestable''')
results = cursor.fetchall()
print(results)

connect.close()

【Question comments】:

Maybe this link will help: stackoverflow.com/questions/42381358/…
@AndrejKesely Thanks for your answer. I've tried a few things and I think I'll reach my goal! :)

Tags: python database sqlite web-scraping

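【Solution 1】:

Answering my own question: it turned out to be super simple.

Here is the code I added.

Below the declarations of the "connect" and "cursor" variables, I added this: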

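try:
    # Create the table and a unique index on the article URL (succeeds on the first run only).
    cursor.execute('''CREATE TABLE articlestable
    (article_page INT, article_time TEXT, article_title TEXT, article_link TEXT, article_description TEXT)''')
    cursor.execute('''CREATE UNIQUE INDEX index_article_link ON articlestable(article_link)''')
except sqlite3.OperationalError:
    # Raised when the table already exists, i.e. on every later run.
    pass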

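This simply tries to create the table and a unique index. In my case the unique index is on the URL, which makes sense because every article has its own distinct link. The creation only has to succeed once; on later runs it is skipped.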
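One thing to watch with the try/except version: if the table already exists from an earlier run, CREATE TABLE raises first and the CREATE UNIQUE INDEX statement is never reached. A possible alternative, sketched here under my own assumptions rather than taken from the answer, is to use SQLite's IF NOT EXISTS clauses so each statement is skipped independently. The helper name `ensure_schema` and the in-memory database are illustrative only; the table, column, and index names are the ones from the question.

```python
import sqlite3


def ensure_schema(connection):
    """Create the table and the unique index only if they are missing."""
    connection.execute(
        '''CREATE TABLE IF NOT EXISTS articlestable
           (article_page INT, article_time TEXT, article_title TEXT,
            article_link TEXT, article_description TEXT)''')
    connection.execute(
        '''CREATE UNIQUE INDEX IF NOT EXISTS index_article_link
           ON articlestable(article_link)''')


# In-memory database, used here only so the sketch runs without touching a file.
connection = sqlite3.connect(':memory:')
ensure_schema(connection)
ensure_schema(connection)  # calling it again does nothing instead of raising

# PRAGMA index_list rows are (seq, name, unique, origin, partial).
index_names = [row[1] for row in connection.execute("PRAGMA index_list('articlestable')")]
```

If the existing table already contains duplicate links, creating the unique index should fail with an integrity error, so the duplicates would need to be deleted (or the database file recreated) before this can work.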

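The second and final change is in the loop, where I insert the variables into the database. Literally all you have to do is write INSERT OR IGNORE instead of plain INSERT. It looks like this: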

cursor.execute('''INSERT OR IGNORE INTO articlestable VALUES(?,?,?,?,?)''', (article_page, article_time, article_title, article_link, article_description))

Voilà! Done.
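To see the two changes working together without hitting the website, here is a minimal self-contained sketch. It assumes an in-memory database and made-up sample rows with example.com links; `store_articles` and `sample_articles` are hypothetical names for illustration and stand in for the scraping loop. It also uses `cursor.rowcount` to count how many rows were actually new, which the original answer does not do.

```python
import sqlite3

connection = sqlite3.connect(':memory:')  # illustration only; the question uses StoredArticles.db
cursor = connection.cursor()
cursor.execute('''CREATE TABLE articlestable
    (article_page INT, article_time TEXT, article_title TEXT, article_link TEXT, article_description TEXT)''')
cursor.execute('''CREATE UNIQUE INDEX index_article_link ON articlestable(article_link)''')

# Made-up rows standing in for scraped articles.
sample_articles = [
    (1, '10:00 ET', 'First title', 'https://example.com/a', 'First description'),
    (1, '10:05 ET', 'Second title', 'https://example.com/b', 'Second description'),
]


def store_articles(cursor, articles):
    """Insert articles, skipping links that are already stored; return the number of new rows."""
    new_rows = 0
    for article in articles:
        cursor.execute('INSERT OR IGNORE INTO articlestable VALUES(?,?,?,?,?)', article)
        new_rows += cursor.rowcount  # 1 when the row was inserted, 0 when it was ignored
    return new_rows


first_run = store_articles(cursor, sample_articles)   # simulates the first scrape
second_run = store_articles(cursor, sample_articles)  # simulates re-running the program
connection.commit()

total_rows = cursor.execute('SELECT COUNT(*) FROM articlestable').fetchone()[0]
print(first_run, second_run, total_rows)
```

Note that the uniqueness check is on article_link alone, so an article whose title or description changes later will still be ignored as long as its URL is the same.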

【Comments】: