【Question title】: Python Pandas vs R: Transformation Code Conciseness
【Posted】: 2014-01-23 18:02:57
【Question】:

I converted this R code:

# Raw data
data <- data.frame(
    metalname=c('Al','Cd','Cr','Co','Cu','Au','Fe','Pb','Mo','Ni','Pt','Au','Ta','Ti','W','Zn'),
    radius=c(0.1431,0.1490,0.1249,0.1253,0.1278,0.1442,0.1241,0.1750,0.1363,0.1246,0.1387,0.1445,0.1430,0.1445,0.1371,0.1332),
    crystal=c('FCC','HCP','BCC','HCP','FCC','FCC','BCC','FCC','BCC','FCC','FCC','FCC','BCC','HCP','BCC','HCP'))

# Calc lattice parameters (nm)
data <- rbind(
    transform(subset(data, crystal=='BCC'), N=2, latticea=4*radius/sqrt(3), latticec=0),
    transform(subset(data, crystal=='FCC'), N=4, latticea=2*radius*sqrt(2), latticec=0),
    transform(subset(data, crystal=='HCP'), N=6, latticea=2*radius, latticec=4*radius*sqrt(2/3))
)

to this Pandas code:

import pandas as pd
import numpy as np
import math
from pandas import DataFrame

# Raw data
data = DataFrame({
    'metalname': ['Al','Cd','Cr','Co','Cu','Au','Fe','Pb','Mo','Ni','Pt','Au','Ta','Ti','W','Zn'],
    'radius': [0.1431,0.1490,0.1249,0.1253,0.1278,0.1442,0.1241,0.1750,0.1363,0.1246,0.1387,0.1445,0.1430,0.1445,0.1371,0.1332],
    'crystal': ['FCC','HCP','BCC','HCP','FCC','FCC','BCC','FCC','BCC','FCC','FCC','FCC','BCC','HCP','BCC','HCP']
})

# Calc lattice parameters (nm)
databcc = data[data.crystal=='BCC']
databcc['N'] = 2
databcc['latticea'] = 4*databcc.radius/math.sqrt(3)
datafcc = data[data.crystal=='FCC']
datafcc['N'] = 4
datafcc['latticea'] = 2*datafcc.radius*math.sqrt(2)
datahcp = data[data.crystal=='HCP']
datahcp['N'] = 6
datahcp['latticea'] = 2*datahcp.radius
datahcp['latticec'] = 4*datahcp.radius*math.sqrt(2/3)
data = databcc.append(datafcc).append(datahcp)

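The code works, but is there a way to make the Python version more concise? Ideally, I could do the multi-column calculation in a single step without temporary variables, as in the R code. Is that possible?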

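【Comments on the question】:

  • In Python 2.7, math.sqrt(2/3) is 0 because 2/3 is integer division; you need to write 2.0/3.0 to get floating-point division.
  • I'm using Python 3, but you are absolutely right about Python 2.x.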
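
The integer-division pitfall raised in these comments can be reproduced in Python 3 with the floor-division operator `//`, which on two integers behaves like Python 2's `/`. A minimal sketch (the variable names are only illustrative):

```python
import math

# Python 3: `/` is true division, so 2 / 3 is a float.
true_div = math.sqrt(2 / 3)

# `//` is floor division; on two ints it mimics Python 2's `/`,
# so 2 // 3 == 0 and the square root collapses to zero.
floor_div = math.sqrt(2 // 3)

# Writing the operands as floats gives the same result under Python 2 and 3.
portable = math.sqrt(2.0 / 3.0)

print(true_div, floor_div, portable)
```

Writing `2.0/3.0` costs nothing in Python 3 and keeps the formula correct if the code is ever run under Python 2.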

Tags: r python-3.x pandas


【Solution 1】:

This is a use case for the new query/eval functionality in pandas 0.13.

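from math import sqrt

databcc = data.query('crystal == "BCC"')
sqrt3 = sqrt(3)
# local variables are referenced with the @ prefix; eval returns a new frame
databcc = databcc.eval('latticea = 4 * radius / @sqrt3')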

# ...

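Currently (as of pandas 0.13) you cannot call functions inside the expression string, so you have to define a local variable and use that instead.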
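
With more recent pandas releases, the same idea might look like the sketch below. It assumes a pandas version in which `DataFrame.eval` returns a new frame for an assignment expression and in which local variables are referenced with an `@` prefix. The helper name `add_bcc_lattice` and the three-row sample are hypothetical and only there to make the example self-contained.

```python
import math
import pandas as pd

def add_bcc_lattice(data):
    """Return the BCC rows of `data` with N and latticea added (hypothetical helper)."""
    sqrt3 = math.sqrt(3)  # precomputed constant, referenced as @sqrt3 below
    bcc = data.query('crystal == "BCC"')
    # eval with an assignment returns a new DataFrame; the input is not modified
    bcc = bcc.eval('latticea = 4 * radius / @sqrt3')
    return bcc.assign(N=2)

# Small illustrative sample, not the full table from the question
sample = pd.DataFrame({
    'metalname': ['Al', 'Cr', 'Fe'],
    'radius': [0.1431, 0.1249, 0.1241],
    'crystal': ['FCC', 'BCC', 'BCC'],
})
bcc = add_bcc_lattice(sample)
print(bcc)
```

The FCC and HCP branches would follow the same pattern with their own constants, and the three results could then be combined with `pd.concat`.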

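【Discussion】:

  • Python syntax can't accommodate this scenario, so we have to escape into a separate expression-string language? If the premise of pandas is to port R functionality to a cleaner language platform, Python seems like the wrong choice.
  • So, just to understand you: you think typing two extra single-quote characters is reason enough to rethink the motivation for pandas?
  • You are trivializing it as just typing two quote characters, but those quotes actually escape out of Python into a completely separate language, which is a poor solution. If Python syntax is not sufficient for the product's needs, then yes, it should definitely use another platform. There is no shortage of options.
【Solution 2】:

This will be very fast because it is fully vectorized.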

In [65]: data.join(
              pd.concat([
                DataFrame(dict(N=2, latticea=4*data.loc[data.crystal=='BCC','radius']/np.sqrt(3))),
                DataFrame(dict(N=4, latticea=2*data.loc[data.crystal=='FCC','radius']*np.sqrt(2))),
                DataFrame(dict(N=6, latticea=2*data.loc[data.crystal=='HCP','radius'],
                                    latticec=4*data.loc[data.crystal=='HCP','radius']*np.sqrt(2/3.0)))
                    ]))
Out[65]: 
   crystal metalname  radius  N  latticea  latticec
0      FCC        Al  0.1431  4  0.404748       NaN
1      HCP        Cd  0.1490  6  0.298000  0.486632
2      BCC        Cr  0.1249  2  0.288444       NaN
3      HCP        Co  0.1253  6  0.250600  0.409228
4      FCC        Cu  0.1278  4  0.361473       NaN
5      FCC        Au  0.1442  4  0.407859       NaN
6      BCC        Fe  0.1241  2  0.286597       NaN
7      FCC        Pb  0.1750  4  0.494975       NaN
8      BCC        Mo  0.1363  2  0.314771       NaN
9      FCC        Ni  0.1246  4  0.352422       NaN
10     FCC        Pt  0.1387  4  0.392303       NaN
11     FCC        Au  0.1445  4  0.408708       NaN
12     BCC        Ta  0.1430  2  0.330244       NaN
13     HCP        Ti  0.1445  6  0.289000  0.471935
14     BCC         W  0.1371  2  0.316619       NaN
15     HCP        Zn  0.1332  6  0.266400  0.435029

[16 rows x 6 columns]
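
As a further sketch in the same vectorised spirit, `numpy.select` can pick a per-row formula from boolean masks so that all three columns are computed in a single `assign` call. This is not the answer's code: the formulas follow the R version in the question (so BCC and FCC rows get `latticec = 0`), and the four-row sample is only for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative sample, not the full table from the question
data = pd.DataFrame({
    'metalname': ['Al', 'Cd', 'Cr', 'Zn'],
    'radius': [0.1431, 0.1490, 0.1249, 0.1332],
    'crystal': ['FCC', 'HCP', 'BCC', 'HCP'],
})

r = data['radius'].to_numpy()
# One boolean mask per crystal structure, in a fixed order: BCC, FCC, HCP.
masks = [(data['crystal'] == c).to_numpy() for c in ('BCC', 'FCC', 'HCP')]

result = data.assign(
    N=np.select(masks, [2, 4, 6]),
    latticea=np.select(masks, [4 * r / np.sqrt(3), 2 * r * np.sqrt(2), 2 * r]),
    # Only HCP has a non-zero c parameter; every other row falls back to 0.0.
    latticec=np.where(masks[2], 4 * r * np.sqrt(2 / 3), 0.0),
)
print(result)
```

No intermediate frames are created, and the original row order is kept because nothing is split or re-appended.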

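【Discussion】:

  • This is almost exactly (a subset of) what I had in the original post. Do you have anything different to suggest? This is much longer and much less readable than the R version, and it still uses temporary variables.
  • Where exactly does this use temporary variables? It returns a new frame.
  • Your code only performs one of the steps. In the full operation, I split the data frame into three, add three new columns using different equations, and then append everything back into a final data frame. In that case, databcc would be a temporary intermediate variable.
  • Wow! With your new code, it is concise, uses no intermediate temporary variables, and doesn't require hacking a separate expression language into strings. That's perfect. It is fully comparable to the R version. Thanks! By the way, I made one fix and a small change.
【Solution 3】:

Does this equivalent of the original question's code look any better than the original R masterpiece?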

import pandas as pd
import numpy as np
import math
from pandas import DataFrame

# Raw data
data = DataFrame({
    'metalname': ['Al','Cd','Cr','Co','Cu','Au','Fe','Pb','Mo','Ni','Pt','Au','Ta','Ti','W','Zn'],
    'radius': [0.1431,0.1490,0.1249,0.1253,0.1278,0.1442,0.1241,0.1750,0.1363,0.1246,0.1387,0.1445,0.1430,0.1445,0.1371,0.1332],
    'crystal': ['FCC','HCP','BCC','HCP','FCC','FCC','BCC','FCC','BCC','FCC','FCC','FCC','BCC','HCP','BCC','HCP']
})

def calc_lattice_params(x):
    # Defaults stay None (NaN in the result) when a parameter does not apply
    N = None
    l = None
    lc = None
    if x['crystal'] == 'BCC':
        N = 2
        l = 4 * x['radius'] / math.sqrt(3)
    elif x['crystal'] == 'FCC':
        N = 4
        l = 2 * x['radius'] * math.sqrt(2)
    elif x['crystal'] == 'HCP':
        N = 6
        l = 2 * x['radius']
        lc = 4 * x['radius'] * math.sqrt(2.0 / 3.0)

    return pd.Series({'N': N, 'latticea': l, 'latticec': lc})

# Apply row by row and attach the three new columns to the original frame
data = pd.concat([data, data.apply(calc_lattice_params, axis=1)], axis=1)

【Discussion】:

  • This is the best Python solution so far. R is still better. Thanks!
【Solution 4】:

In case anyone is interested, here is an Incanter (Lisp-based) version for comparison:

(use '(incanter core stats charts))

; Raw data
(def data (dataset [:metalname :radius :crystal] [
    ["Al" 0.1431 "FCC"]
    ["Cd" 0.1490 "HCP"]
    ["Cr" 0.1249 "BCC"]
    ["Co" 0.1253 "HCP"]
    ["Cu" 0.1278 "FCC"]
    ["Au" 0.1442 "FCC"]
    ["Fe" 0.1241 "BCC"]
    ["Pb" 0.1750 "FCC"]
    ["Mo" 0.1363 "BCC"]
    ["Ni" 0.1246 "FCC"]
    ["Pt" 0.1387 "FCC"]
    ["Au" 0.1445 "FCC"]
    ["Ta" 0.1430 "BCC"]
    ["Ti" 0.1445 "HCP"]
    ["W" 0.1371 "BCC"]
    ["Zn" 0.1332 "HCP"]
]))

; Calc lattice parameters (nm)
(conj-rows
    (add-derived-column :latticec [] (fn [] 0)
    (add-derived-column :latticea [:radius] (fn [r] (/ (* 4 r) (sqrt 3)))
    (add-derived-column :n [] (fn [] 2)
        ($where {:crystal "BCC"} data))))
    (add-derived-column :latticec [] (fn [] 0)
    (add-derived-column :latticea [:radius] (fn [r] (* 2 r (sqrt 2)))
    (add-derived-column :n [] (fn [] 4)
        ($where {:crystal "FCC"} data))))
    (add-derived-column :latticec [:radius] (fn [r] (* 4 r (sqrt (/ 2 3))))
    (add-derived-column :latticea [:radius] (fn [r] (* 2 r))
    (add-derived-column :n [] (fn [] 6)
        ($where {:crystal "HCP"} data)))))

【Discussion】:

    【Solution 5】:

    Since the question compares code conciseness (Python vs R), here is an R solution using data.table:

    library(data.table)
    dt <- data.table(data, key="crystal")
    data_transformed_dt <- rbind(dt["BCC", .(metalname, radius, crystal, N=2, latticea=4*radius/sqrt(3), latticec=0)],
                                 dt['FCC', .(metalname, radius, crystal, N=4, latticea=2*radius*sqrt(2), latticec=0)],
                                 dt['HCP', .(metalname, radius, crystal, N=6, latticea=2*radius, latticec=4*radius*sqrt(2/3))])
    

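    The advantage of setting key="crystal" is that it indexes the crystal column, much like an index in a database. If the dataset is very large, this greatly reduces the lookup time (for "BCC", "FCC", ...).


    Another approach is to create a separate key data.table, as follows: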

    # v1.9.5+, for new feature "on = ", See github project page
    require(data.table) 
    key = data.table(crystal = c("BCC", "FCC", "HCP"), 
                     latticea = c(4/sqrt(3), 2*sqrt(2), 2),
                     latticec=c(0,0,4*sqrt(2/3)), 
                     N = c(2,4,6))
    

    Then we can update the original data while joining, as follows:

    setDT(data)[key , c("latticea", "latticec", "N") := 
                      .(radius * latticea, radius * latticec, N), 
                  on = "crystal"]
    #     metalname radius crystal  latticea  latticec N
    #  1:        Al 0.1431     FCC 0.4047479 0.0000000 4
    #  2:        Cd 0.1490     HCP 0.2980000 0.4866320 6
    #  3:        Cr 0.1249     BCC 0.2884442 0.0000000 2
    #  4:        Co 0.1253     HCP 0.2506000 0.4092281 6
    #  5:        Cu 0.1278     FCC 0.3614730 0.0000000 4
    #  6:        Au 0.1442     FCC 0.4078592 0.0000000 4
    #  7:        Fe 0.1241     BCC 0.2865967 0.0000000 2
    #  8:        Pb 0.1750     FCC 0.4949747 0.0000000 4
    #  9:        Mo 0.1363     BCC 0.3147714 0.0000000 2
    # 10:        Ni 0.1246     FCC 0.3524220 0.0000000 4
    # 11:        Pt 0.1387     FCC 0.3923028 0.0000000 4
    # 12:        Au 0.1445     FCC 0.4087077 0.0000000 4
    # 13:        Ta 0.1430     BCC 0.3302444 0.0000000 2
    # 14:        Ti 0.1445     HCP 0.2890000 0.4719350 6
    # 15:         W 0.1371     BCC 0.3166189 0.0000000 2
    # 16:        Zn 0.1332     HCP 0.2664000 0.4350294 6
    

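    This should be very memory-efficient and fast, because we update by reference (and the full join is never materialised). on = "crystal" performs the join on that column and finds the row indices in data that match each row of key; on those matching rows we update or create the necessary columns in one go.

    Note that the original row order of data is also preserved in the result.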
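
For comparison, a rough pandas counterpart of this lookup-table join is sketched below. It uses `DataFrame.merge` on a small coefficient table and then scales by `radius`. Unlike the data.table version, it builds a new frame rather than updating by reference. The `coef_a` and `coef_c` column names and the four-row sample are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Small illustrative sample, not the full table from the question
data = pd.DataFrame({
    'metalname': ['Al', 'Cd', 'Cr', 'Zn'],
    'radius': [0.1431, 0.1490, 0.1249, 0.1332],
    'crystal': ['FCC', 'HCP', 'BCC', 'HCP'],
})

# Per-structure coefficients: lattice parameter = coefficient * radius.
key = pd.DataFrame({
    'crystal': ['BCC', 'FCC', 'HCP'],
    'coef_a': [4 / np.sqrt(3), 2 * np.sqrt(2), 2.0],
    'coef_c': [0.0, 0.0, 4 * np.sqrt(2 / 3)],
    'N': [2, 4, 6],
})

# A left merge keeps the row order of `data`.
merged = data.merge(key, on='crystal', how='left')
result = merged.assign(
    latticea=merged['radius'] * merged['coef_a'],
    latticec=merged['radius'] * merged['coef_c'],
).drop(columns=['coef_a', 'coef_c'])
print(result)
```

Keeping the formulas in a table means a new crystal structure only needs one extra row in `key`, with no change to the transformation code.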

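    【Discussion】:

    • @Arun Thanks for the improvement, and thanks for making data.table such an excellent package! data.table is what made me decide to learn R. (By the way: although I was using v1.9.5, I had to reinstall data.table to run your code.)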