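【Posted】: 2019-11-25 09:16:44
【Question】:
I currently have one script that generates a list of URLs and names from a base URL, and a second script that gives me the data I need from one of the links in that list. What I'm struggling with is how to start building a loop that takes the URL from the first row, runs the second script against it, and then saves the result to a file named after columns 2 and 3 of the first file (the one containing the URLs).
Here are my two scripts and the data they output.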
import csv
from bs4 import BeautifulSoup
from requests import get

url = 'http://arizonascaleracers.liverc.com/results'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

# Take the third matching table on the page (index 2).
race_tables = html_soup.find_all('table', {'class': 'table table-hover entry_list_data'})[2]
output_rows = []
for row in race_tables.find('tbody').find_all("tr")[1:]:
    col = row.find_all("td")
    output_row = []
    for td in col:
        if td.find(racename=''):
            racename = ''.join('blank')
        if td.find(suburl=''):
            suburl = ''.join('blank')
        if td.find(time=''):
            time = ''.join('blank')
        else:
            suburl = row.find("a")['href']
            racename = col[0].text
            time = col[1].text
        # Note: suburl, racename and time are computed but never written;
        # the raw <td> element is what ends up in the CSV.
        output_row.append(td)
    output_rows.append(output_row)

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerows(output_rows)
The output of this script is...
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227665""><i class=""fa fa-trophy""></i> Race 6: Modified Touring Car (Modified Touring Car A-Main)</a></td>","<td>Nov 23, 2019 at 2:39pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227664""><i class=""fa fa-trophy""></i> Race 5: 17.5 Super Stock Touring Car (17.5 Super Stock Touring Car A-Main)</a></td>","<td>Nov 23, 2019 at 2:31pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227666""><i class=""fa fa-trophy""></i> Race 4: Pro GT (Pro GT A-Main)</a></td>","<td>Nov 23, 2019 at 2:17pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227662""><i class=""fa fa-trophy""></i> Race 3: USGT 21.5 (USGT 21.5 A-Main)</a></td>","<td>Nov 23, 2019 at 2:10pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227663""><i class=""fa fa-trophy""></i> Race 2: 1-12 Open Modified (1-12 Open Modified A-Main)</a></td>","<td>Nov 23, 2019 at 2:03pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227661""><i class=""fa fa-trophy""></i> Race 1: VTA (VTA A-Main)</a></td>","<td>Nov 23, 2019 at 1:52pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227120""><i class=""fa fa-trophy""></i> Race 6: Modified Touring Car (Modified Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 1:22pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227119""><i class=""fa fa-trophy""></i> Race 5: 17.5 Super Stock Touring Car (17.5 Super Stock Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 1:14pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227118""><i class=""fa fa-trophy""></i> Race 4: Pro GT (Pro GT (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 1:06pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227117""><i class=""fa fa-trophy""></i> Race 3: USGT 21.5 (USGT 21.5 (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:58pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227116""><i class=""fa fa-trophy""></i> Race 2: 1-12 Open Modified (1-12 Open Modified (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:51pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227115""><i class=""fa fa-trophy""></i> Race 1: VTA (VTA (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:40pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226732""><i class=""fa fa-trophy""></i> Race 6: Modified Touring Car (Modified Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:18pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226731""><i class=""fa fa-trophy""></i> Race 5: 17.5 Super Stock Touring Car (17.5 Super Stock Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:11pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226730""><i class=""fa fa-trophy""></i> Race 4: Pro GT (Pro GT (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 12:03pm</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226729""><i class=""fa fa-trophy""></i> Race 3: USGT 21.5 (USGT 21.5 (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:55am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226728""><i class=""fa fa-trophy""></i> Race 2: 1-12 Open Modified (1-12 Open Modified (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:47am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226727""><i class=""fa fa-trophy""></i> Race 1: VTA (VTA (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:37am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226250""><i class=""fa fa-trophy""></i> Race 6: Modified Touring Car (Modified Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:16am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226249""><i class=""fa fa-trophy""></i> Race 5: 17.5 Super Stock Touring Car (17.5 Super Stock Touring Car (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:08am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226251""><i class=""fa fa-trophy""></i> Race 4: Pro GT (Pro GT (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 11:00am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226247""><i class=""fa fa-trophy""></i> Race 3: USGT 21.5 (USGT 21.5 (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 10:52am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226248""><i class=""fa fa-trophy""></i> Race 2: 1-12 Open Modified (1-12 Open Modified (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 10:42am</td>"
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2226246""><i class=""fa fa-trophy""></i> Race 1: VTA (VTA (Heat 1/1))</a></td>","<td>Nov 23, 2019 at 10:31am</td>"
Then there is my second script...
from bs4 import BeautifulSoup
import urllib.request as urllib2

html = urllib2.urlopen('http://arizonascaleracers.liverc.com/results/?p=view_race_result&id=2227665')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {"class": "table table-striped race_result"})
for row in table.find('tbody').find_all("tr", recursive=False):
    col = row.find_all("td")
    FinishPos = col[0].text.strip()
    DriverInfo = col[1].text.strip()
    DI = row.find('span', class_='driver_name')
    Qual = col[2].text.strip()
    test = col[3].text
    LapsTime = col[4].text
    Behind = col[5].text
    Fastest = col[6].text
    Avg_Lap = col[7].text
    Avg_Top_5 = col[8].text
    Avg_Top_10 = col[9].text
    Avg_Top_15 = col[10].text
    Top_3_Consecutive = col[11].text
    DI = DI.text
    print(FinishPos, Qual, DI, test, LapsTime, Behind, Fastest, Avg_Lap)
The output here is...
1 1 GABE HARVELL 24/5:04.408 --- 12.481 12.83912.839 12.59312.593
2 3 JOHNATHAN LEE 24/5:09.287 4.879 12.583 13.01113.011 12.65512.655
3 4 BRAD TOFFELMIRE 24/5:12.110 2.823 12.520 13.11813.118 12.71012.710
4 2 JACK KLOEBER 23/5:09.212 1 Lap 13.028 13.61013.610 13.09713.097
5 5 BILL CLINE 22/5:02.867 1 Lap 13.177 13.89813.898 13.28713.287
6 6 TIMOTHY SCHMUCK 22/5:03.815 0.948 12.919 13.92713.927 13.13913.139
7 7 CRAIG NELSON 21/5:08.304 1 Lap 13.713 14.82414.824 13.95713.957
Now I think I need to create a loop and iterate over these by calling the input and output columns...
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

contents = []
with open('output.csv','r') as csvf:
    urls = csv.reader(csvf)[:0]
    for url in urls:
        html = urllib2.urlopen('http://arizonascaleracers.liverc.com'+(url))
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', {"class":"table table-striped race_result"})
        for row in table.find('tbody').find_all("tr", recursive=False):
            col = row.find_all("td")
            FinishPos = col[0].text.strip()
            DriverInfo = col[1].text.strip()
            DI = row.find('span', class_ = 'driver_name')
            Qual = col[2].text.strip()
            test = col[3].text
            LapsTime = col[4].text
            Behind = col[5].text
            Fastest = col[6].text
            Avg_Lap = col[7].text
            Avg_Top_5 = col[8].text
            Avg_Top_10 = col[9].text
            Avg_Top_15 = col[10].text
            Top_3_Consecutive = col[11].text
            DI = DI.text
            output_rows.append(output_row)
        with open('col[1]+[2].csv', 'w') as csvfile:
            writer = csv.writer(csvfile, delimiter=',')
            writer.writerows(output_rows)
Obviously this last block doesn't work, and I can't seem to find any answers on how to get this done.
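A few things in the loop above fail before any scraping happens: a `csv.reader` object is an iterator and cannot be sliced, `urllib2` is never imported in that script (only `urlopen` is), and `output_rows` and `output_row` are not defined there. The following is a minimal sketch of just the iteration step, under the assumption that `output.csv` holds three plain-text columns (relative URL, race name, time) rather than raw `<td>` markup. `iter_race_urls` and the in-memory sample are hypothetical and only illustrate the idea.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "http://arizonascaleracers.liverc.com"


def iter_race_urls(csv_file, base_url=BASE_URL):
    """Yield (full_url, race_name, race_time) for each usable CSV row."""
    # csv.reader is an iterator, so it is looped over instead of sliced.
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        path, race_name, race_time = row[0], row[1], row[2]
        # urljoin combines the site root with the relative link.
        yield urljoin(base_url, path), race_name, race_time


# Hypothetical sample standing in for a plain-text output.csv.
sample = io.StringIO(
    '/results/?p=view_race_result&id=2227665,'
    '"Race 6: Modified Touring Car (Modified Touring Car A-Main)",'
    '"Nov 23, 2019 at 2:39pm"\n'
)

jobs = list(iter_race_urls(sample))
for full_url, race_name, race_time in jobs:
    print(full_url, "|", race_name, "|", race_time)
```

Each `full_url` could then be passed to `urlopen` and parsed with the table code from the second script, collecting the rows into a fresh list per URL before writing that URL's own CSV file.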
What I'm trying to end up with is this data.
1 1 GABE HARVELL 24/5:04.408 --- 12.481 12.83912.839 12.59312.593
2 3 JOHNATHAN LEE 24/5:09.287 4.879 12.583 13.01113.011 12.65512.655
3 4 BRAD TOFFELMIRE 24/5:12.110 2.823 12.520 13.11813.118 12.71012.710
4 2 JACK KLOEBER 23/5:09.212 1 Lap 13.028 13.61013.610 13.09713.097
5 5 BILL CLINE 22/5:02.867 1 Lap 13.177 13.89813.898 13.28713.287
6 6 TIMOTHY SCHMUCK 22/5:03.815 0.948 12.919 13.92713.927 13.13913.139
7 7 CRAIG NELSON 21/5:08.304 1 Lap 13.713 14.82414.824 13.95713.957
But for every URL. The data above is for a single URL:
"<td><a class=""block"" href=""/results/?p=view_race_result&id=2227665""><i class=""fa fa-trophy""></i> Race 6: Modified Touring Car (Modified Touring Car A-Main)</a></td>","<td>Nov 23, 2019 at 2:39pm</td>"
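My end goal is to run this on Sundays (we race on Saturdays): scrape the base URL for the sub-URLs, then scrape each individual sub-URL for data like the above, and save each result to a file named after the race name and time, like: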
Race 6: Modified Touring Car (Modified Touring Car A-Main)Nov 23, 2019.csv
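One practical wrinkle with a file name like the one above: race names contain `:` and, for the heats, `/` (as in "Heat 1/1"). A colon is not allowed in Windows file names, and `/` is a path separator on every platform. Below is a minimal sketch of building a safe name from the race name and timestamp. `result_filename` is a hypothetical helper, and replacing the offending characters with `_` and adding a space before the date are arbitrary choices made for illustration.

```python
import re


def result_filename(race_name, timestamp):
    """Build a CSV file name from a race name and a 'date at time' string."""
    # Keep only the date: "Nov 23, 2019 at 2:39pm" -> "Nov 23, 2019".
    date_part = timestamp.split(" at ")[0].strip()
    raw = f"{race_name.strip()} {date_part}"
    # Replace characters that are invalid in Windows file names
    # (this set also covers "/", which is a path separator everywhere).
    safe = re.sub(r'[<>:"/\\|?*]', "_", raw)
    return safe + ".csv"


print(result_filename(
    "Race 6: Modified Touring Car (Modified Touring Car A-Main)",
    "Nov 23, 2019 at 2:39pm",
))
print(result_filename("Race 1: VTA (VTA (Heat 1/1))", "Nov 23, 2019 at 12:40pm"))
```

Note that several heats on the same day share a race name, so a name built from the date alone would collide; keeping the time of day or the result `id` in the name would avoid overwriting files.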
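Thanks for your help!
【Comments】:
-
Could you describe the data you need more clearly? What is the expected result? Thanks.
-
I went ahead and added some extra details; let me know if this makes more sense. Thanks!
-
It seems script 2 produces this output, right?
-
Script 2 does, yes, but I want to loop through the URLs automatically and generate the data. I need to pull the data from script 1 and feed it into script 2 as both the input and the output names.
Tags: javascript python csv web scrape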