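[Posted]: 2021-04-06 21:38:35
[Question]:
I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook that I am working on from my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0 and Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar could be done with EMR. The issue I am having is with passing the correct values in the config.json file. How should I attach an EMR cluster?
I can work in the EMR notebooks native to AWS, but I thought I could go the local development route and have hit a roadblock.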
{
"kernel_python_credentials" : {
"username": "{IAM ACCESS KEY ID}", # not sure about the username for the cluster
"password": "{IAM SECRET ACCESS KEY}", # I use putty to ssh into the cluster with the pem key, so again not sure about the password for the cluster
"url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com", # as per the AWS blog When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
"auth": "None"
},
"kernel_scala_credentials" : {
"username": "{IAM ACCESS KEY ID}",
"password": "{IAM SECRET ACCESS KEY}",
"url": "{Master public DNS}",
"auth": "None"
},
"kernel_r_credentials": {
"username": "{}",
"password": "{}",
"url": "{}"
},
Update 1/4/2021
On 4/1, I got sparkmagic working on my local Jupyter notebook. I used these documents as reference (ref-1, ref-2 and ref-3) to set up local port forwarding (if possible, avoid using sudo).
sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
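Once the tunnel is running, a quick way to confirm that the local end is actually listening is a plain TCP connect. This is a minimal sketch using only the Python standard library; the helper name is_port_open is made up for illustration, and 8998 is simply the local port from the ssh command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

Calling is_port_open("localhost", 8998) while the ssh session is up should return True. A False result suggests the tunnel is not established, which would have to be fixed before sparkmagic can reach Livy.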
Configuration details: Release label: emr-5.32.0; Hadoop distribution: Amazon 2.10.1; Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2
Updated config file
{
"kernel_python_credentials" : {
"username": "",
"password": "",
"url": "http://localhost:8998"
},
"kernel_scala_credentials" : {
"username": "",
"password": "",
"url": "http://localhost:8998",
"auth": "None"
},
"kernel_r_credentials": {
"username": "",
"password": "",
"url": "http://localhost:8998"
},
"logging_config": {
"version": 1,
"formatters": {
"magicsFormatter": {
"format": "%(asctime)s\t%(levelname)s\t%(message)s",
"datefmt": ""
}
},
"handlers": {
"magicsHandler": {
"class": "hdijupyterutils.filehandler.MagicsFileHandler",
"formatter": "magicsFormatter",
"home_path": "~/.sparkmagic"
}
},
"loggers": {
"magicsLogger": {
"handlers": ["magicsHandler"],
"level": "DEBUG",
"propagate": 0
}
}
},
"authenticators": {
"Kerberos": "sparkmagic.auth.kerberos.Kerberos",
"None": "sparkmagic.auth.customauth.Authenticator",
"Basic_Access": "sparkmagic.auth.basic.Basic"
},
"wait_for_idle_timeout_seconds": 15,
"livy_session_startup_timeout_seconds": 60,
"fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
"ignore_ssl_errors": false,
"session_configs": {
"driverMemory": "1000M",
"executorCores": 2
},
"use_auto_viz": true,
"coerce_dataframe": true,
"max_results_sql": 2500,
"pyspark_dataframe_encoding": "utf-8",
"heartbeat_refresh_seconds": 5,
"livy_server_heartbeat_timeout_seconds": 60,
"heartbeat_retry_seconds": 1,
"server_extension_default_kernel_name": "pysparkkernel",
"custom_headers": {},
"retry_policy": "configurable",
"retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
"configurable_retry_policy_max_retries": 8
}
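One visible difference between the two config files is the shape of the "url" values: the first passes a bare hostname, the second a full http://localhost:8998. The sketch below flags credential sections whose URL lacks a scheme or an explicit port. It is an illustrative check, not part of sparkmagic; the function name find_url_problems is hypothetical, and treating a missing port as a problem is an assumption based on Livy listening on 8998 by default.

```python
from urllib.parse import urlparse


def find_url_problems(config: dict) -> dict:
    """Map each kernel_*_credentials section to a list of issues with its url."""
    problems = {}
    for section, creds in config.items():
        # Only the kernel_*_credentials sections carry a Livy endpoint.
        if not (section.startswith("kernel_") and section.endswith("_credentials")):
            continue
        parsed = urlparse(creds.get("url", ""))
        issues = []
        if parsed.scheme not in ("http", "https"):
            issues.append("missing http:// or https:// scheme")
        try:
            port = parsed.port
        except ValueError:
            # urlparse raises ValueError when the port is not a valid number.
            port = None
        if port is None:
            issues.append("no explicit port (Livy listens on 8998 by default)")
        if issues:
            problems[section] = issues
    return problems
```

To run it against a real file, load the JSON first, for example with json.load on ~/.sparkmagic/config.json, and pass the resulting dict. An empty result only means the URLs are well-formed, not that the endpoint is reachable.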
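Second update 1/9
Back to square one. I keep getting this error and have spent days trying to debug it. I am not sure what I did previously to get things working. I also checked my security group configuration and it looks fine, with SSH on port 22.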
An error was encountered:
Error sending http request and maximum retry encountered.
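The error text does not say whether the request never left the machine or whether Livy rejected it. One way to narrow that down outside of sparkmagic is to call Livy's REST endpoint GET /sessions directly. This is a rough sketch using only the standard library; probe_livy is a hypothetical helper, and the base URL is assumed to be the one from the config file.

```python
import json
import urllib.error
import urllib.request


def probe_livy(base_url: str, timeout: float = 5.0) -> str:
    """Call GET <base_url>/sessions and describe the outcome in one line."""
    url = base_url.rstrip("/") + "/sessions"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # A server answered, but with an error status.
        return f"reached a server, but it answered HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Nothing is listening, the tunnel is down, or the request timed out.
        return f"could not connect: {exc}"
    except ValueError:
        # The body could not be decoded or parsed as JSON.
        return "connected, but the response was not JSON"
    if not isinstance(body, dict):
        return "connected, but the JSON did not look like a Livy session list"
    return f"Livy reachable, {len(body.get('sessions', []))} session(s)"
```

For example, print(probe_livy("http://localhost:8998")) reports a "could not connect" line when the port forward is not active, which points at the tunnel rather than at the sparkmagic configuration.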
[Comments]:
-
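The Microsoft document mentioned here is about attaching an HDInsight cluster to a local Jupyter notebook. I suggest checking the AWS documentation to see whether an EMR cluster can be connected to a local notebook. You can refer to stackoverflow.com/questions/44800857/… christo-lagali.medium.com/…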
-
It is possible to attach a local notebook to a remote EMR cluster. towardsdatascience.com/…
Tags: python-3.x apache-spark pyspark amazon-emr jupyter-lab