【Question title】: Best way to convert an online CSV to a DataFrame in Scala
【Posted】: 2017-07-20 03:07:52
【Question】:

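I'm trying to figure out the most efficient way to load this online CSV file into a DataFrame in Scala.

To save you the download, the CSV file referenced in the code looks like this:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"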
"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
....

Based on my research, I first download the CSV and put it into a ListBuffer (a List is immutable, so you can't append to it in place):

import scala.collection.mutable.ListBuffer
import scala.io.Source

val sc = new SparkContext(conf)

// A ListBuffer is mutable, so lines can be appended as they are read.
val stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()

val bufferedSource =
    Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")

for (line <- bufferedSource.getLines) {
    // Split on commas and trim each column.
    val cols = line.split(",").map(_.trim)

    // Keep the first nine columns, re-joined as one comma-separated string.
    stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
}
bufferedSource.close

val stockInfoNYSE_List = stockInfoNYSE_ListBuffer.toList

So now we have a list. You can get at each value like this:

// SYMBOL : stockInfoNYSE_List(1).split(",")(0)
// COMPANY NAME : stockInfoNYSE_List(1).split(",")(1)
// IPOYear : stockInfoNYSE_List(1).split(",")(5)
// Sector : stockInfoNYSE_List(1).split(",")(6)
// Industry : stockInfoNYSE_List(1).split(",")(7)

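This is where I'm stuck: how do I get this into a DataFrame? Below is the wrong approach I took. I haven't put all the values in yet; it's just a simple test.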

case class StockMap(Symbol: String, Name: String)
// One hard-coded row, built from the second line of the list.
val caseClassDS = Seq(StockMap(stockInfoNYSE_List(1).split(",")(0),
  stockInfoNYSE_List(1).split(",")(1))).toDS()

caseClassDS.show()

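The problem with the approach above: I can only figure out how to add a single row to the Seq by hard-coding it. I want every line in the list.

My second attempt, which also failed: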

val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val test = stockInfoNYSE_List.toDF

This only gives you an array of whole lines, whereas I want the values split into columns.

Array(["Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"], ["DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"], ["MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"],....... 

【Comments】:

    Tags: scala apache-spark


    【Solution 1】:
    // All nine columns are kept as strings.
    case class TestClass(Symbol: String, Name: String, LastSale: String, MarketCap: String, ADR_TSO: String, IPOyear: String, Sector: String, Industry: String, Summary_Quote: String)
    
    var stockDF= stockInfoNYSE_ListBuffer.drop(1)
    
    val demoDS = stockDF.map(line => {
      val fields = line.replace("\"","").split(",")
      TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
    })
    
    scala> demoDS.toDS.show
    
    +------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
    |Symbol|                Name|LastSale|      MarketCap|      ADR_TSO|IPOyear|           Sector|            Industry|       Summary_Quote|
    +------+--------------------+--------+---------------+-------------+-------+-----------------+--------------------+--------------------+
    |   DDD|3D Systems Corpor...|   18.09|  2058834640.41|          n/a|    n/a|       Technology|Computer Software...|http://www.nasdaq...|
    |   MMM|          3M Company|  211.68|126423673447.68|          n/a|    n/a|      Health Care|Medical/Dental In...|http://www.nasdaq...|
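
One caveat with the `split(",")` used above: it also splits inside quoted fields, so a company name that contains a comma would shift every later column. A minimal sketch of a quote-aware split follows, assuming fields never contain escaped (doubled) quotes. `CsvLine` and `parse` are hypothetical names, not part of the original answer.

```scala
// Hypothetical helper (not from the original answer): split one CSV line on
// commas that sit outside double quotes, then strip the surrounding quotes.
object CsvLine {
  // A comma counts as a separator only when an even number of quote
  // characters follows it, i.e. when it is not inside a quoted field.
  private val Separator = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

  def parse(line: String): Vector[String] =
    line
      .split(Separator, -1) // -1 keeps trailing empty fields
      .toVector
      .map(_.trim.stripPrefix("\"").stripSuffix("\""))
}
```

For a made-up row such as `"ABC","Acme, Inc.","1.00"`, `CsvLine.parse` returns three fields with the name kept whole. It is best applied to the raw downloaded lines rather than to the re-joined strings in `stockInfoNYSE_ListBuffer`, which keep only the first nine naively split pieces.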
    

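    【Discussion】:

    • I see, you basically map each line onto a defined structure (a class), and demoDS is a collection of instances of that class. One correction: TestPerson(fields(0), fields(1)... should be TestClass(fields(0), fields(1)... instead, since TestPerson was never defined.
    • Thanks for pointing that out, I've corrected it. A case class is used here.
    【Solution 2】: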

    In case anyone is trying to get this example working, here is the complete code using the solution above:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    import scala.collection.mutable.ListBuffer
    import sqlContext.implicits._
    
    var stockInfoNYSE_ListBuffer = new ListBuffer[java.lang.String]()
    
    import scala.io.Source
    val bufferedSource =
        Source.fromURL("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NYSE&render=download")
    
    for (line <- bufferedSource.getLines) {
        val cols = line.split(",").map(_.trim)
    
        stockInfoNYSE_ListBuffer += s"${cols(0)},${cols(1)},${cols(2)},${cols(3)},${cols(4)},${cols(5)},${cols(6)},${cols(7)},${cols(8)}"
    
    }
    bufferedSource.close
    
    
    
    case class TestClass(Symbol:String,Name:String,LastSale:String,MarketCap :String,ADR_TSO:String,IPOyear:String,Sector: String,Industry:String,Summary_Quote:String )
    
    var stockDF= stockInfoNYSE_ListBuffer.drop(1)
    
    val demoDS = stockDF.map(line => {
      val fields = line.replace("\"","").split(",")
      TestClass(fields(0), fields(1), fields(2),fields(3), fields(4), fields(5),fields(6), fields(7), fields(8))
    })
    
    demoDS.toDF().show
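
As an alternative to splitting each line by hand, Spark's own CSV reader can parse lines that are already in memory. The sketch below is written for spark-shell style execution and assumes Spark 2.2 or later, where `DataFrameReader.csv` accepts a `Dataset[String]`. The session settings and the name `rawLines` are illustrative assumptions, and the sample lines are the ones quoted in the question, standing in for the downloaded file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: Spark 2.2+; the app name and master below are placeholders.
val spark = SparkSession.builder()
  .appName("csv-lines-to-df")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for the downloaded lines (header plus the two rows shown in the
// question). With the real feed this would be bufferedSource.getLines().toList.
val rawLines: List[String] =
  """|"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote"
     |"DDD","3D Systems Corporation","18.09","2058834640.41","n/a","n/a","Technology","Computer Software: Prepackaged Software","http://www.nasdaq.com/symbol/ddd"
     |"MMM","3M Company","211.68","126423673447.68","n/a","n/a","Health Care","Medical/Dental Instruments","http://www.nasdaq.com/symbol/mmm"
     |""".stripMargin.trim.split("\n").toList

// Let Spark's CSV parser handle the quoting and take the column names
// from the first line. Every column is read as a string.
val stockDF: DataFrame = spark.read
  .option("header", "true")
  .csv(rawLines.toDS())

stockDF.show()
```

With this approach no case class is needed and the column names come straight from the header, so they keep their spaces (`ADR TSO`, `Summary Quote`) instead of the underscores used in `TestClass`.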
    
