【Question title】: Facing an issue with "wget" in Python
【Posted】: 2018-01-26 14:59:24
【Question】:

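I am very new to Python. I am facing an issue with "wget" and with "urllib.urlretrieve(str(myurl),tail)".

When I run the script it does download the files, but the file names end with "?".

My complete code: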

import os
import wget
import urllib
import subprocess
with open('/var/log/na/na.access.log') as infile, open('/tmp/reddy_log.txt', 'w') as outfile:
    results = set()
    for line in infile:
        if ' 200 ' in line:
            tokens = line.split()
            results.add(tokens[6]) # 7th token
    for result in sorted(results):
        print >>outfile, result
with open ('/tmp/reddy_log.txt') as infile:
     results = set()
     for line in infile:
     head, tail = os.path.split(line)
                print tail
                myurl = "http://data.xyz.com" + str(line)
                print myurl
                wget.download(str(myurl))
                #  urllib.urlretrieve(str(myurl),tail)

Output:

# python last.py
0011400026_recap.xml

http://data.na.com/feeds/mobile/android/v2.0/video/games/high/0011400026_recap.xml

latest_1.xml

http://data.na.com/feeds/mobile/iphone/article/league/news/latest_1.xml

currenttime.js

File listing:

# ls
0011400026_recap.xml?                   currenttime.js?  latest_1.xml?      today.xml?

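【Comments】:

  • It looks like a newline character, since an extra line is printed every time. It is hard to be sure without seeing line.
  • @CoryMadden What other information should I provide?
  • line, for starters.
  • myurl = 'data.na.com' + str(line) print myurl # wgproc = subprocess.Popen(['wget', '-r', '--tries=10', 'str( url)', '-o', 'log'], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) # (standardout, junk) = wgproc.communicate() wget.download(str(myurl)) #urllib.urlretrieve(str(myurl),tail)
  • The code you show cannot possibly produce the output you show. Also, the indentation is wrong. Not to mention posting code in comments. There is no need for a temporary file either. Matching 200 against the whole line will sooner or later produce a false match. That said, my crystal ball tells me that myurl = "http://data.xyz.com" + str(line.strip()) is what you really want.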

Tags: python wget


【Solution 1】:

A possible explanation of the behavior you are experiencing is that you do not sanitize your input line

with open ('/tmp/reddy_log.txt') as infile:
     ...
     for line in infile:
         ...
         myurl = "http://data.xyz.com" + str(line)
         wget.download(str(myurl))

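When you iterate over a file object (for line in infile:), each string you get is terminated by a newline ('\n') character. If you do not remove the newline before using line, then the newline is still present in whatever you build from line.

As an illustration of this concept, have a look at the transcript of a test that I ran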
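The effect can be reproduced without touching the network. The following is a minimal Python 3 sketch (the question itself uses Python 2), assuming an in-memory io.StringIO as a stand-in for /tmp/reddy_log.txt and two made-up paths; it only compares the file name computed from the raw line with the one computed from the stripped line.

```python
import io
import os

# Stand-in for the intermediate file; both paths are made up for illustration.
infile = io.StringIO(
    "/feeds/video/0011400026_recap.xml\n"
    "/feeds/news/latest_1.xml\n"
)

raw_names = []
clean_names = []
for line in infile:
    # The raw line still ends with '\n', so the last path component does too.
    raw_names.append(os.path.split(line)[1])
    # strip() removes leading/trailing whitespace, including the newline.
    clean_names.append(os.path.split(line.strip())[1])

# repr() makes the otherwise invisible newline visible.
for raw, clean in zip(raw_names, clean_names):
    print(repr(raw), "->", repr(clean))
```

Printing with repr() is the useful part here: a plain print shows only an extra blank line, which is the symptom visible in the question's output.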

08:28 $ cat > a_file
a
b
c
08:29 $ cat > test.py
data = open('a_file')
for line in data:
    new_file = open(line, 'w')
    new_file.close() 
08:31 $ ls
a_file  test.py
08:31 $ python test.py
08:31 $ ls
a?  a_file  b?  c?  test.py
08:31 $ ls -b
a\n  a_file  b\n  c\n  test.py
08:31 $

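As you can see, I read lines from a file and created files using line as the file name and, guess what, the file names listed by ls have a ? at the end. But we can do better, as explained in the fine manual page of ls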

  -b, --escape
         print C-style escapes for nongraphic characters

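And, as you can see in the output of ls -b, the file names are not terminated by a question mark (that is just the placeholder that the ls program uses by default) but by a newline character.

While I am at it, I have to say that you should avoid using temporary files to store intermediate results of your computation.

A nice feature of Python is generator expressions; if you want, you can write your code as follows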
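The same check can be made from Python instead of ls -b. This is a small Python 3 sketch, assuming a POSIX file system (file names containing a newline are generally not allowed on Windows) and a throwaway temporary directory; it creates two such files and lists them.

```python
import os
import tempfile

# Work in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("a\n", "b\n"):
        # Create an empty file whose name ends with a newline.
        open(os.path.join(workdir, name), "w").close()
    listed = sorted(os.listdir(workdir))

# repr() shows the escape sequence, much like ls -b does.
print([repr(name) for name in listed])
```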

import wget

# you matched on a '200' on the whole line, I assume that what
# you really want is to match a specific column, the 'error_column'
# that I symbolically load from an external resource
from my_constants import error_column, payload_column

# here it is a sequence of generator expressions, each one relying
# on the previous one

# 1. the lines in the file, stripped from the white space
#    on the right (the newline is considered white space)
#    === not strictly necessary, just convenient because
#    === below we want to test for non-empty lines
lines = (line.rstrip() for line in open('whatever.csv'))

# 2. the lines are converted to a list of 'tokens' 
all_tokens = (line.split() for line in lines if line)

# 3. for each 'tokens' in the 'all_tokens' generator expression, we
#    check for the code '200' and possibly generate a new target
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

# eventually, use the 'targets' generator to proceed with the downloads
for target in targets: wget.download(target)

Don't be fooled by the number of comments; without the comments my code is just

import wget
from my_constants import error_column, payload_column

lines = (line.rstrip() for line in open('whatever.csv'))
all_tokens = (line.split() for line in lines if line)
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

for target in targets: wget.download(target)
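For readers who want to try the pipeline without a my_constants module, here is a self-contained Python 3 sketch. The column indices (6 for the request path, 8 for the status code), the sample log lines and the helper name successful_urls are assumptions made for illustration; they are not taken from the real log. The download step is left out so that the example runs offline.

```python
BASE_URL = "http://data.xyz.com"  # host used in the question
PAYLOAD_COLUMN = 6  # assumption: request path is the 7th token
STATUS_COLUMN = 8   # assumption: HTTP status code is the 9th token


def successful_urls(lines, base_url=BASE_URL):
    """Return sorted, de-duplicated URLs of the requests answered with 200."""
    stripped = (line.strip() for line in lines)
    all_tokens = (line.split() for line in stripped if line)
    paths = {
        tokens[PAYLOAD_COLUMN]
        for tokens in all_tokens
        # Skip short lines and compare only the status column.
        if len(tokens) > STATUS_COLUMN and tokens[STATUS_COLUMN] == "200"
    }
    return [base_url + path for path in sorted(paths)]


# Made-up sample lines; the second one contains ' 200 ' only as a byte count.
sample_log = [
    '1.2.3.4 - - [26/Jan/2018:14:59:24 +0000] "GET /feeds/a.xml HTTP/1.1" 200 512 "-"\n',
    '1.2.3.4 - - [26/Jan/2018:14:59:25 +0000] "GET /feeds/b.xml HTTP/1.1" 404 200 "-"\n',
    '\n',
    '1.2.3.4 - - [26/Jan/2018:14:59:26 +0000] "GET /feeds/a.xml HTTP/1.1" 200 512 "-"\n',
]

urls = successful_urls(sample_log)
for url in urls:
    print(url)
```

The second sample line is a 404 whose size field happens to be 200: a test such as ' 200 ' in line would accept it, while the comparison on the status column rejects it. Each resulting URL could then be handed to wget.download, as in the code above.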

【Comments】:

  • wget is now working with strip(): myurl = "data.ba.com" + str(line.strip()) print myurl filename = wget.download(myurl) print filename