Transforming Java.io.BufferedReader into a Python object: answers

【Question title】: Transforming Java.io.BufferedReader into Python object
【Posted】: 2021-02-21 15:52:11
【Question description】:

I am using the following code (source: https://docs.microsoft.com/en-us/azure/databricks/kb/python/hdfs-to-read-files):

URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()

conf.set(
  "fs.azure.account.key.<account-name>.blob.core.windows.net",
  "<account-access-key>")

fs = Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>/').getFileSystem(sc._jsc.hadoopConfiguration())
istream = fs.open(Path('wasbs://<container-name>@<account-name>.blob.core.windows.net/<file-path>/'))

reader = sc._gateway.jvm.java.io.BufferedReader(sc._jvm.java.io.InputStreamReader(istream))

while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

istream.close()

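This gives me an object, reader, of type java.io.BufferedReader. I would like to read it with pandas, geopandas, or another library, rather than reading and printing it line by line as in the example.

Can you help me?

Thanks, Lukas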

【Comments】:

Tags: java python azure hdfs

【Solution 1】:

I would try reading the BufferedReader contents into a string and then parsing that string with pd.read_csv(StringIO(string)):

import pandas as pd
from io import StringIO
string = reader.lines().collect(sc._jvm.java.util.stream.Collectors.joining())
df = pd.read_csv(StringIO(string))
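
One caveat with the snippet above: Collectors.joining() called without a delimiter concatenates the lines with nothing between them, and BufferedReader.readLine() strips line terminators, so a multi-line CSV would arrive as a single line. Passing a delimiter, as in Collectors.joining("\n"), is one way around this. Another is to drain the reader on the Python side, as in the minimal sketch below. The helper name reader_to_text_buffer is hypothetical, and the sketch assumes only that the object exposes readLine() and returns None at end of input, which is how Py4J surfaces a Java null.

```python
import io


def reader_to_text_buffer(reader):
    """Drain a BufferedReader-like object into an in-memory text buffer.

    Assumption: `reader` exposes readLine() and returns None at end of
    input (Py4J maps a Java null to Python None).
    """
    lines = []
    while True:
        line = reader.readLine()
        if line is None:
            break
        lines.append(line)
    # readLine() drops line terminators, so restore them explicitly.
    return io.StringIO("\n".join(lines))
```

The returned buffer could then be handed to a text parser, for example pd.read_csv(reader_to_text_buffer(reader)). This only suits text data; binary formats need a byte-oriented approach.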

【Discussion】:

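Hi Alexandra, thanks for your answer. I had tried this option before posting here. Unfortunately, the joining collector returns the lines without line endings, and this approach does not work for binary data (e.g. shapefile, xlsx, etc.). Do you know a way to convert the reader into a BytesIO?
Which Java version does your environment use?
In Java, readers generally operate on character streams, whereas input streams operate on byte streams. Since Java 9, java.io.InputStream has a readAllBytes() method. It can be called on the istream variable and returns a byte array, which can be converted to a BytesIO. If your environment runs Java 8, you need some boilerplate code to read all the bytes from the input stream.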
Hi Alexandra, thanks for your answer. The solution you proposed is generic and it works!
bytes_object = io.BytesIO(istream.readAllBytes())
data = xr.open_dataset(bytes_object, engine='h5netcdf')
print(data)
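
For a Java 8 environment, where readAllBytes() is not available, the boilerplate mentioned in the discussion might look roughly like the sketch below. It is not from the thread: the helper name input_stream_to_bytesio is hypothetical, and it assumes only that the stream's no-argument read() returns an int in the range 0 to 255, or -1 at end of stream, as java.io.InputStream.read() does.

```python
import io


def input_stream_to_bytesio(istream):
    """Read an InputStream-like object to the end, one byte at a time.

    Assumption: istream.read() returns an int in 0..255, or -1 once the
    stream is exhausted (the contract of java.io.InputStream.read()).
    """
    data = bytearray()
    while True:
        b = istream.read()
        if b == -1:
            break
        data.append(b)
    return io.BytesIO(bytes(data))
```

Used as bytes_object = input_stream_to_bytesio(istream), the result can be passed to any library that accepts a file-like binary object. Each read() call is a separate round trip to the JVM, so this is likely to be slow for large files; where Java 9 or later is available, readAllBytes() is the simpler choice.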