【Question title】: Instagram story scraper: What would the process be?
【Posted】: 2019-03-31 15:42:52
【Question】:

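I'm trying to write a web-scraping Python program that, using your login, can fetch the stories of a given user. I thought it would be fun to see whether I could get it working, since 4K Stogram charges money for its extra features.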

I've managed to log in, but I don't know where to go from here.

from bs4 import BeautifulSoup
import json, random, re, requests, urllib.request
# urllib2 exists only in Python 2; urllib.request (imported above) replaces it in Python 3

USERNAME = '*****'
PASSWD = '****'
account_purging = '****'

BASE_URL = 'https://www.instagram.com/accounts/login/'
LOGIN_URL = BASE_URL + 'ajax/'

headers_list = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.6.01001)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.7.01001)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.5.01003)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
    "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0",
    "Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)",
    "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.01",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0.1) Gecko/20100101 Firefox/5.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]"
    ]

# random.choice avoids the IndexError that randrange(0, len(headers_list) + 1) could trigger
USER_AGENT = random.choice(headers_list)

session = requests.Session()
# update() keeps the session's default headers instead of replacing them
session.headers.update({'User-Agent': USER_AGENT})
session.headers.update({'Referer': BASE_URL})
req = session.get(BASE_URL)
soup = BeautifulSoup(req.content, 'html.parser')
body = soup.find('body')

pattern = re.compile('window._sharedData')
script = body.find("script", text=pattern)

script = script.get_text().replace('window._sharedData = ', '')[:-1]
data = json.loads(script)

csrf = data['config'].get('csrf_token')
login_data = {'username': USERNAME, 'password': PASSWD}
session.headers.update({'X-CSRFToken': csrf})
login = session.post(LOGIN_URL, data=login_data, allow_redirects=True)

story_page = "https://www.instagram.com/stories" + "/" + account_purging

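# Headers copied from the browser's request for a story video.
# The Accept value is truncated ("…") in the original post; paste the full value before running.
# "Host" is left out so that requests can set it to match story_page.
request_headers_story = {
    "Accept" : "video/webm,video/ogg,video/*;q…q=0.7,audio/*;q=0.6,*/*;q=0.5",
    "Accept-Language" : "en-US,en;q=0.5",
    "Connection" : "keep-alive",
    "DNT" : "1",
    "Range" : "bytes=0-",
    "Referer" : story_page,
    "TE" : "Trailers",
    "User-Agent" : USER_AGENT
}

# Send the dict as request headers on a GET rather than as a POST form body.
response = session.get(story_page, headers=request_headers_story, allow_redirects=True)
print(BeautifulSoup(response.content, 'html.parser'))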

I'm trying to get the mp4 and jpg links and store them in an array or something similar so I can download them later. I'd appreciate any pointers.
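One way to approach the collection step, as a minimal sketch: once a response has been decoded into Python objects with `json.loads`, walk it recursively and keep every string that looks like an `.mp4` or `.jpg` URL. The function names and the sample payload are hypothetical, and nothing here assumes the real layout of Instagram's story data; only the standard library is used.

```python
import os
import urllib.request
from urllib.parse import urlparse

MEDIA_EXTENSIONS = ('.mp4', '.jpg')


def collect_media_urls(node, found=None):
    """Recursively gather http(s) URLs whose path ends in .mp4 or .jpg."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_media_urls(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_media_urls(item, found)
    elif isinstance(node, str) and node.startswith(('http://', 'https://')):
        # Compare against the URL path so query strings do not hide the extension.
        path = urlparse(node).path.lower()
        if path.endswith(MEDIA_EXTENSIONS) and node not in found:
            found.append(node)
    return found


def filename_for(url):
    """Derive a local file name from the last segment of the URL path."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, target_dir='.'):
    """Save each URL into target_dir (performs network requests)."""
    for url in urls:
        urllib.request.urlretrieve(url, os.path.join(target_dir, filename_for(url)))


# Hypothetical payload, only to show the traversal; not Instagram's real schema.
sample = {
    'items': [
        {'video': 'https://example.com/media/a.mp4?token=1', 'caption': 'hi'},
        {'image': {'src': 'https://example.com/media/b.jpg'}},
        {'image': {'src': 'https://example.com/media/b.jpg'}},
    ]
}
media_urls = collect_media_urls(sample)
print(media_urls)
```

Keeping the collection step separate from `download_all` means the list of links can be inspected or saved before any file is fetched.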

I'm also trying to avoid using the API, because that just feels boring.

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup instagram


    【Solution 1】:

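    A simpler way to avoid the API is to use Selenium. Because Selenium drives a real browser, you can log in more quickly and reliably and then grab the images and videos you need.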

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    
    driver = webdriver.Firefox()
    #or driver = webdriver.Chrome()
    

    Note: to grab an image, you need to find the image's id or name and do the following:

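    from selenium.webdriver.common.by import By

    # Selenium 4 style; the find_element_by_id / find_element_by_name helpers were removed in later releases
    driver.find_element(By.ID, "image_id")
    

    driver.find_element(By.NAME, "image_name")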
    

    If you need more information or clarification, see https://selenium-python.readthedocs.io/
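When the id or name of an element is not known in advance, a broader option is to collect the `src` attribute of every `<img>` and `<video>` element on the page. The sketch below is an illustration, not part of the original answer: `collect_media_sources` is a hypothetical helper, and it passes the locator as the plain string `'tag name'` (the value of `By.TAG_NAME`) so it needs no Selenium import of its own. Story videos may expose their URL on a nested `<source>` element or behind a `blob:` URL instead, so treat this as a starting point.

```python
def collect_media_sources(driver):
    """Return the src attributes of all <img> and <video> elements on the current page."""
    sources = []
    for tag in ('img', 'video'):
        # 'tag name' is the string value of selenium's By.TAG_NAME locator.
        for element in driver.find_elements('tag name', tag):
            src = element.get_attribute('src')
            if src and src not in sources:
                sources.append(src)
    return sources


# Typical use with a real browser (requires selenium and a driver binary):
#     from selenium import webdriver
#     driver = webdriver.Firefox()
#     driver.get(story_page)  # story_page as built in the question
#     print(collect_media_sources(driver))
#     driver.quit()
```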

    Let me know if this helps!

    【Comments】:
