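One problem is that print() statements do not make it into the log file, so you need to use Spark's logging functionality. In pyspark, I created a utility function that sends output to the log file and also prints it to the notebook when the notebook is run manually: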
# utility method for logging
log4jLogger = sc._jvm.org.apache.log4j
# give a meaningful name to your logger (mine is CloudantRecommender)
LOGGER = log4jLogger.LogManager.getLogger("CloudantRecommender")
def info(*args):
    print(args) # sends output to notebook
    LOGGER.info(args) # sends output to kernel log file
def error(*args):
    print(args) # sends output to notebook
    LOGGER.error(args) # sends output to kernel log file
Using the function in my notebook looks like this:
info("some log output")
If I check the log files, I can see that my log output is being written:
! grep 'CloudantRecommender' $HOME/logs/notebook/*pyspark*
kernel-pyspark-20170105_164844.log:17/01/05 10:49:08 INFO CloudantRecommender: [Starting load from Cloudant: , 2017-01-05 10:49:08]
kernel-pyspark-20170105_164844.log:17/01/05 10:53:21 INFO CloudantRecommender: [Finished load from Cloudant: , 2017-01-05 10:53:21]
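The bracketed, comma-separated messages in the log above appear to come from handing the whole args tuple to the logger. If a plain single-line message is preferred, one option is to join the arguments into a string first. This is a minimal sketch, not part of the original utility: `format_message` and `make_info` are hypothetical names, and `make_info` accepts any object with an `.info(str)` method so it can be tried without a Spark context.

```python
def format_message(*args):
    """Join all arguments into one space-separated string (hypothetical helper)."""
    return " ".join(str(a) for a in args)


def make_info(logger):
    """Build an info() function around any object exposing .info(str).

    Assumption: in the notebook, `logger` would be the log4j LOGGER object.
    """
    def info(*args):
        message = format_message(*args)
        print(message)        # sends output to notebook
        logger.info(message)  # sends output to the supplied logger
    return info
```

In the notebook this would presumably be wired up as `info = make_info(LOGGER)`, after which calls like `info("some log output")` work as before.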
Exceptions don't appear to be sent to the log file either, so you will need to wrap your code in a try block and log the error, e.g.
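import traceback
try:
    # your spark code that may throw an exception
    pass
except Exception as e:
    # send the exception to the spark logger
    # (ts() is not defined in this snippet; presumably a timestamp helper)
    error(str(e), traceback.format_exc(), ts())
    raise # re-raise with the original traceback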
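If several notebook functions need the same treatment, the try/except pattern above could be packaged as a decorator. The following is only a sketch under assumptions: `log_exceptions` is a hypothetical name, and `error_fn` stands for any callable that accepts the message and the stack trace, such as the `error` utility defined earlier.

```python
import functools
import traceback


def log_exceptions(error_fn):
    """Decorator factory: report any exception through error_fn, then re-raise it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                # pass the message and the full stack trace to the reporter
                error_fn(str(e), traceback.format_exc())
                raise  # re-raise so the failure is still visible to the caller
        return wrapper
    return decorator
```

A function decorated with `@log_exceptions(error)` would then log its failure before the exception propagates, without repeating the try block at each call site.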
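Note: Another gotcha I ran into while debugging is that a scheduled job runs a specific version of the notebook. Check that you update the scheduled job whenever you save a new version of your notebook.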