How to check if the value on a website has changed

【Question title】: How to check if the value on a website has changed
【Posted】: 2012-06-28 20:42:16
【Question】:

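Basically, I'm trying to run some code (Python 3.2) if a value on a website changes, and otherwise wait a bit and check again later.

At first I thought I could save the value in a variable and compare it with the new value fetched the next time the script runs. That quickly ran into problems, because the value is overwritten when the script runs again and initializes that variable.

Then I tried saving the HTML of the web page to a file and comparing it with the HTML fetched on the next run. No luck there either: the comparison kept coming up False even when nothing had changed.

Next I tried pickling the web page and comparing it with the HTML. Interestingly, that didn't work inside the script either. But if I type file = pickle.load( open( 'D:\Download\htmlString.p', 'rb')) after the script has run and then file == html, it shows True when there haven't been any changes.

I'm a bit confused as to why it doesn't work while the script runs, yet gives the correct answer when I do the above.

Edit: Thanks for the responses so far. My question wasn't really about other ways to approach this (although it's always good to learn more ways to accomplish a task!) but about why the code below doesn't work when run as a script, whereas if I reload the pickle object at the prompt after the script has run and test it against the html, it returns True when nothing has changed.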

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if pickle.load( open( 'D:\\Download\\htmlString.p', 'rb')) == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('ERROR')

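【Question comments】:

What is the content/mimetype of the remote and local content?
Saving and comparing the whole page would be very inefficient. You could compute a hash such as md5 and save that instead. If the hash matches on a later run, the page has not changed.
I've updated my answer to address your edit. Is that what you were looking for?

Tags: python compare

【Solution 1】:

Edit: I hadn't realized you were only looking for the problem with your script. Here is what I think the problem is, followed by my original answer, which covers a different approach to the bigger problem you're trying to solve.

Your script is a great example of the danger of a blanket except statement: you catch everything. In this case that includes your sys.exit(0), which works by raising SystemExit. So after printing "Values haven't changed!", the script falls into the except branch, rewrites the pickle and prints ERROR.

I assume your try block is there to handle the case where D:\Download\htmlString.p doesn't exist yet. That error is an IOError, and you can catch it specifically with except IOError: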
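To see the effect in isolation, here is a minimal Python 3 sketch. The two function names are made up for this illustration; only sys.exit and the exception types come from the standard library. It shows that a bare except intercepts the SystemExit raised by sys.exit(), while except IOError lets it through.

```python
import sys


def run_with_bare_except():
    """Call sys.exit() inside a try block guarded by a bare except."""
    try:
        sys.exit(0)              # raises SystemExit
    except:                      # bare except: catches SystemExit too
        return "except branch ran"
    return "unreachable"


def run_with_specific_except():
    """Same body, but only IOError is caught, so SystemExit escapes."""
    try:
        sys.exit(0)
    except IOError:
        return "except branch ran"
    return "unreachable"


if __name__ == "__main__":
    print(run_with_bare_except())          # the exit request was swallowed
    try:
        run_with_specific_except()
    except SystemExit as exc:              # caught here only to report it
        print("SystemExit propagated with code", exc.code)
```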

Here is your script, with a bit of code in front to make it run and with the except problem fixed:

import sys
import pickle
import urllib2

request = urllib2.Request('http://www.iana.org/domains/example/')
response = urllib2.urlopen(request) # Make the request
htmlString = response.read()

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')

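As a side note, you might consider using os.path for your file paths: it will help anyone who later wants to use your script on another platform, and it saves you the ugly double backslashes.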
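A small sketch of that suggestion, assuming a helper named pickle_path (hypothetical, not part of the original script) and a base directory chosen by the caller:

```python
import os.path


def pickle_path(base_dir, filename="htmlString.p"):
    """Join directory and file name with the separator of the current platform."""
    return os.path.join(base_dir, filename)


if __name__ == "__main__":
    # "Download" under the user's home directory is only an example location.
    print(pickle_path(os.path.join(os.path.expanduser("~"), "Download")))
```

The result can be passed to open() wherever the script currently uses the hard-coded 'D:\\Download\\htmlString.p'.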

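Edit 2: Adapted for your specific URL.

The ad on that page contains a dynamically generated number that changes on every page load. It sits near the end, after all the content, so we can split the HTML string at that point and keep the first part, discarding the part with the dynamic number.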

import sys
import pickle
import urllib2

request = urllib2.Request('http://ecal.forexpros.com/e_cal.php?duration=weekly')
response = urllib2.urlopen(request) # Make the request
# Grab everything before the dynamic double-click link
htmlString = response.read().split('<iframe src="http://fls.doubleclick')[0]

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')

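In case it matters, your string is no longer a valid HTML document. If that is a problem, you could remove just that line or something similar. There is probably a more elegant way to do this, perhaps deleting the number with a regular expression, but this at least answers your question.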
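The regular-expression idea could look roughly like the Python 3 sketch below. The markup of the ad is an assumption here: the pattern supposes the changing number appears as ord=<digits> inside the doubleclick URL, and both sample strings are invented. Inspect the real page and adjust the pattern before relying on it. In Python 3, response.read() returns bytes, so decode it (or use a bytes pattern) first.

```python
import re

# ASSUMPTION: the per-request number shows up as "ord=<digits>" in the
# doubleclick URL. This is a guess for illustration, not the real markup.
DYNAMIC_NUMBER = re.compile(r'(fls\.doubleclick[^"]*?ord=)\d+')


def normalize(html):
    """Replace the changing number with a fixed placeholder so pages compare equal."""
    return DYNAMIC_NUMBER.sub(r'\g<1>0', html)


# Two made-up responses that differ only in the ad number
first = '<p>Rates</p><iframe src="http://fls.doubleclick.net/x;ord=12345"></iframe>'
second = '<p>Rates</p><iframe src="http://fls.doubleclick.net/x;ord=67890"></iframe>'

if __name__ == "__main__":
    print(normalize(first) == normalize(second))
```

Unlike splitting the string, this keeps the document intact and only neutralizes the part that changes.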

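Original answer: a different approach to the problem.

What do the response headers from the web server look like? HTTP specifies a Last-Modified header that you can use to check whether the content has changed (assuming the server tells the truth). Use it with a HEAD request, as Uku shows in his answer, if you want to save bandwidth and be nice to the server you are polling.

There is also an If-Modified-Since header, which sounds like what you might be looking for.

If we combine them, you might come up with something like this: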

import sys
import os.path
import urllib2

url = 'http://www.iana.org/domains/example/'
saved_time_file = 'last time check.txt'

request = urllib2.Request(url)
if os.path.exists(saved_time_file):
    # If we've previously stored a time, get it and add it to the request
    last_time = open(saved_time_file, 'r').read()
    request.add_header("If-Modified-Since", last_time)

try:
    response = urllib2.urlopen(request) # Make the request
except urllib2.HTTPError as err:
    if err.code == 304:
        print("Nothing new.")
        sys.exit(0)
    raise   # some other http error (like 404 not found etc); re-raise it.

last_modified = response.info().get('Last-Modified', False)
if last_modified:
    open(saved_time_file, 'w').write(last_modified)
else:
    print("Server did not provide a last-modified property. Continuing...")
    """
    Alternately, you could save the current time in HTTP-date format here:
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3
    This might work for some servers that don't provide Last-Modified, but do
    respect If-Modified-Since.
    """

"""
You should get here if the server won't confirm the content is old.
Hopefully, that means it's new.
HTML should be in response.read().
"""

Also check out this blog post by Stii, which may provide some inspiration. I don't know enough about ETags to put them in my example, but his code checks for them as well.
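For reference, an ETag check follows the same pattern as If-Modified-Since: store the ETag response header and send it back in If-None-Match on the next request. The following is a Python 3 sketch under stated assumptions, not Stii's code: the function names and the last_etag.txt file name are made up, and it only helps if the server actually sends an ETag.

```python
import os.path
import urllib.error
import urllib.request


def build_request(url, etag=None):
    """Build a request that lets the server skip the body if the ETag still matches."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, etag_file="last_etag.txt"):
    """Return the body as bytes, or None when the server answers 304 Not Modified."""
    etag = None
    if os.path.exists(etag_file):
        with open(etag_file) as f:
            etag = f.read().strip()
    try:
        response = urllib.request.urlopen(build_request(url, etag))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None
        raise  # some other HTTP error; re-raise it
    new_etag = response.headers.get("ETag")
    if new_etag:
        # Remember the ETag for the next run
        with open(etag_file, "w") as f:
            f.write(new_etag)
    return response.read()
```

If the server sends no ETag, fetch_if_changed simply downloads the page every time, so it can be combined with the Last-Modified check above.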

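【Comments】:

I also missed the edit while writing this... Answer #2 coming up.
Hey Phil, thanks for pointing out the tidbit about sys.exit, as I didn't know that exiting the script raises an exception. As for my original question, this didn't solve it. For some unknown reason it still never prints True even when it should, unless I reload the pickle object and then test for equality. Thanks though!
Hmm, that's odd. It seems to work fine for me: it says Created new file the first time it runs, and then correctly shows Values haven't changed! or Saving. I tested it against a server I control. What URL are you using? Is it your own or someone else's? Maybe this is somehow platform-specific. I'm running linux here.
It seems it must be the URL: I tried yours and it works fine. The weird part is that it doesn't work in the script, but testing it manually works. Here is what I'm using for the url part: url = 'ecal.forexpros.com/e_cal.php?duration=weekly' headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1 '} data = bytes('data=None', 'utf-8') req = urllib.request.Request(url, data, headers) response = urllib.request.urlopen(req) htmlString = response.read()
Hmm. That URL doesn't work for me either. But holy cow, I get a 1.8 million character response from that URL!

【Solution 2】:

Performing a HEAD request and checking the document's Content-Length would be more efficient.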

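import os.path
import urllib2

# Hypothetical file that stores the Content-Length seen on the previous run
length_file = 'last_length.txt'

# Read the old length from the file into a variable (None on the first run)
old_length = None
if os.path.exists(length_file):
    old_length = open(length_file, 'r').read()

request = urllib2.Request('http://www.yahoo.com')
request.get_method = lambda : 'HEAD'  # Fetch the headers only, not the body

response = urllib2.urlopen(request)
new_length = response.info().get("Content-Length")
if old_length != new_length:
    print("something has changed")
    if new_length is not None:
        open(length_file, 'w').write(new_length)  # Remember it for the next run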

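Note that a changed page is unlikely to have exactly the same content length, although that is not guaranteed; in exchange, this is the most efficient approach. Whether it is suitable depends on the kind of changes you expect.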
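The case where this check misses a change is easy to reproduce. In the Python 3 sketch below, both page bodies are invented: a single digit changes, so the length stays the same while a hash of the content does not.

```python
import hashlib

# Two made-up page bodies that differ in one digit only
old_page = b"<span id='rate'>1.2345</span>"
new_page = b"<span id='rate'>1.2346</span>"

same_length = len(old_page) == len(new_page)
same_hash = hashlib.md5(old_page).hexdigest() == hashlib.md5(new_page).hexdigest()

if __name__ == "__main__":
    print("same length:", same_length)   # a Content-Length check sees no change
    print("same hash:", same_hash)       # a content hash does
```

So Content-Length works as a cheap first filter, but a value that keeps its width (a price, a counter) needs a content comparison.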

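【Comments】:

Nice. Although the question title seems to imply he is checking a specific value on the page, so if it's just an integer or something similar, the chance that the content length doesn't change is much higher.

【Solution 3】:

You can always detect any change between a locally stored file and the remote one by hashing the contents of both. This is commonly used to verify the integrity of downloaded data. For a continuous check you will need a while loop.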

import hashlib
import time
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    local_data = open('localfilepath').read()
    local_hash = hashlib.md5(local_data).hexdigest()
    if remote_hash == local_hash:
        print('right now, we match!')
    else:
        print('right now, we are different')

    last_check += 1   # count this check so the loop terminates
    time.sleep(60)    # wait before polling again; pick an interval that suits the site

If the actual data never needs to be saved locally, I would store only the md5 hash and compute the remote one on the fly when checking.
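A minimal Python 3 sketch of that idea, with a made-up helper name (has_changed) and a caller-supplied file that holds the last hash. It takes the page body as bytes, so the download step is left to the caller (urllib.urlopen is Python 2; in Python 3 the equivalent is urllib.request.urlopen).

```python
import hashlib
import os.path


def has_changed(data, hash_file):
    """Compare the md5 of `data` (bytes) with the stored hash, then store the new one."""
    new_hash = hashlib.md5(data).hexdigest()
    old_hash = None
    if os.path.exists(hash_file):
        with open(hash_file) as f:
            old_hash = f.read().strip()
    # Mode 'w' overwrites, so the file always holds exactly one hash
    with open(hash_file, "w") as f:
        f.write(new_hash)
    return old_hash != new_hash
```

The first call reports a change because nothing is stored yet; after that it returns True only when the content differs from the previous call. A typical call would be has_changed(urllib.request.urlopen(url).read(), 'page.md5'), where the file name is again just an example.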

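【Comments】:

【Solution 4】:

This answer extends @DeaconDesperado's answer.

For simplicity and faster execution, you can first create a local hash (instead of storing a copy of the page) and compare it with the newly obtained hash.

To create the locally stored hash initially, you can use this code: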

import hashlib
import urllib

remote_data = urllib.urlopen('http://remoteurl').read()
remote_hash = hashlib.md5(remote_data).hexdigest()

# Open the file with access mode 'w' so it holds exactly one hash
# (mode 'a' would append a new hash after the old one on every run)
file_object = open('localhash.txt', 'w')
file_object.write(remote_hash)
# Close the file
file_object.close()

Then replace the two lines that read and hash the local file with local_hash = open('local/file/path/localhash.txt').read()

That is:

import hashlib
import time
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    # Forward slashes avoid escapes such as '\f' in the path string
    local_hash = open('local/file/path/localhash.txt').read()

    if remote_hash == local_hash:
        print('right now, we match!')
    else:
        print('right now, we are different')

    last_check += 1   # count this check so the loop terminates
    time.sleep(60)    # wait before polling again

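Source: https://thispointer.com/how-to-append-text-or-lines-to-a-file-in-python/

DeaconDesperado's answer

【Comments】:

【Solution 5】:

It's not entirely clear to me whether you just want to see if the website has changed, or whether you intend to do more with its data. If it's the former, definitely use a hash, as mentioned above. Here is a working example (python 2.6.1 on a mac) that compares the complete old html with the new html; it should be easy to modify so that it uses hashes or only a specific part of the site, as needed. Hopefully the comments and docstrings make everything clear.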

import urllib2

def getFilename(url):
    '''
    Input: url
    Return: a (string) filename to be used later for storing the urls contents
    '''
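    name = str(url)
    if name.startswith('http://'):
        # lstrip('http://') would strip any of those characters, not the prefix
        name = name[len('http://'):]
    return name.replace("/", ":") + '.OLD'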


def getOld(url):
    '''
    Input: url- a string containing a url
    Return: a string containing the old html, or None if there is no old file
    (checks if there already is a url.OLD file, and make an empty one if there isn't to handle the case that this is the first run)
    Note: the file created with the old html is the format url(with : for /).OLD
    '''
    oldFilename = getFilename(url)
    oldHTML = ""
    try:
        oldHTMLfile = open(oldFilename,'r')
    except IOError:
        # file doesn't exist! so make it
        with open(oldFilename,'w') as oldHTMLfile:
            oldHTMLfile.write("")
        return None
    else:
        oldHTML = oldHTMLfile.read()
        oldHTMLfile.close()

    return oldHTML

class ConnectionError(Exception):
    def __init__(self, value):
        if type(value) != type(''):
            self.value = str(value)
        else:
            self.value = value
    def __str__(self):
        return 'ConnectionError: ' + self.value       


def htmlHasChanged(url):
    '''
    Input: url- a string containing a url
    Return: a boolean stating whether the website at url has changed
    '''

    try:
        fileRecvd = urllib2.urlopen(url).read()
    except:
        print 'Could not connect to %s, sorry!' % url
        #handle bad connection error...
        raise ConnectionError("urlopen() failed to open " + str(url))
    else:
        oldHTML = getOld(url)
        if oldHTML == fileRecvd:
            hasChanged = False
        else:
            hasChanged = True

        # rewrite file
        with open(getFilename(url),'w') as f:
            f.write(fileRecvd)

        return hasChanged

if __name__ == '__main__':
    # test it out with whatismyip.com
    try:
        print htmlHasChanged("http://automation.whatismyip.com/n09230945.asp")
    except ConnectionError,e:
        print e

【Comments】: