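1. merge() joins two DataFrames on a join key, and the key should be specified explicitly (on=, or left_on=/right_on=). If no key is given, pandas joins on the columns the two frames have in common.
See: https://blog.csdn.net/starter_____/article/details/79198137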
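As a minimal sketch of specifying the join key, the example below merges two small tables. The tables and their column names (user_id, amount, name, uid) are made up for illustration and do not come from the linked article.

```python
import pandas as pd

# Hypothetical example tables; the column names are invented for illustration.
orders = pd.DataFrame({"user_id": [1, 2, 2, 3], "amount": [10.0, 5.5, 7.0, 3.0]})
users = pd.DataFrame({"user_id": [1, 2, 4], "name": ["Ann", "Bob", "Dan"]})

# Name the join key explicitly with on=; how="inner" keeps only keys found in both tables.
inner = orders.merge(users, on="user_id", how="inner")

# When the key columns have different names, use left_on= / right_on= instead.
users_renamed = users.rename(columns={"user_id": "uid"})
left = orders.merge(users_renamed, left_on="user_id", right_on="uid", how="left")

print(inner)
print(left)
```

With how="left", every row of orders is kept, and rows without a match in users get NaN in the columns that come from users.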
2. Using DataFrame in the pandas library
For details, see: https://blog.csdn.net/cymy001/article/details/78275886#infodescribeheadtail_261
Creating a DataFrame:
1) From a dict whose values are lists
2) From Series
3) From a list of dicts
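The three construction routes above can be sketched as follows. The column names and values are placeholders chosen for illustration; all three calls are meant to produce the same table.

```python
import pandas as pd

# 1) A dict whose values are lists: keys become column names.
df_from_dict = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 85, 70]})

# 2) Series: here a dict of Series, which pandas aligns on their index.
names = pd.Series(["a", "b", "c"])
scores = pd.Series([90, 85, 70])
df_from_series = pd.DataFrame({"name": names, "score": scores})

# 3) A list of dicts: each dict is one row, keys become column names.
records = [
    {"name": "a", "score": 90},
    {"name": "b", "score": 85},
    {"name": "c", "score": 70},
]
df_from_records = pd.DataFrame(records)

print(df_from_dict)
```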
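Locating elements in a DataFrame
1) With a boolean expression
2) With the loc, iloc and ix indexers; anything you can locate this way can also be assigned to
loc: selects rows by the actual values in the row index (for example, the row whose index is "A": data.loc['A'])
iloc: selects rows by integer position, counting from 0 (for example, the second row: data.iloc[1])
ix: selects rows by label or by position (a mix of loc and iloc); it was deprecated and then removed in pandas 1.0, so use loc or iloc instead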
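A small sketch of the lookups and assignments described above, using a made-up frame with the index labels "A", "B", "C" (the column names x and y are assumptions for illustration):

```python
import pandas as pd

data = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["A", "B", "C"])

row_a = data.loc["A"]        # row whose index label is "A" -> Series
second_row = data.iloc[1]    # second row by position (0-based) -> Series
big = data[data["x"] > 1]    # boolean expression -> DataFrame with rows "B" and "C"

# Anything that can be located can also be assigned to.
data.loc["A", "y"] = 99            # by label
data.iloc[2, 0] = -3               # by position: third row, first column
data.loc[data["x"] < 0, "y"] = 0   # by boolean expression

print(data)
```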
info(), describe(), head(), tail()
Reading and writing CSV files: read_csv / to_csv
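import pandas
# file_name and n are placeholders for your own file path and row count
food_info = pandas.read_csv(file_name)  # returns a DataFrame object
n_rows = food_info.head(n)  # first n rows; the result is still a DataFrame
column_names = food_info.columns  # all column names
dimensions = food_info.shape  # shape of the data as (rows, columns)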
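By default, pandas.read_csv() treats the first line of the CSV file as the column labels and also gives every row a row label (0, 1, 2, ...). These labels can be used to access the data in the DataFrame.
How to select a row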
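To see these defaults without a file on disk, the sketch below reads CSV text from an in-memory buffer and writes it back out. The CSV content is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file.
csv_text = "name,calories\napple,52\nbanana,89\n"
food = pd.read_csv(io.StringIO(csv_text))

print(food.columns.tolist())  # the first line became the column labels
print(food.index.tolist())    # rows received integer labels starting at 0

# to_csv writes the frame back out; index=False leaves out the row labels.
buffer = io.StringIO()
food.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
```

Passing index=False matters for round trips: without it, the row labels are written as an extra unnamed column.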
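data = food_info.loc[0]  # loc[n] gets the row labeled n; a single row comes back as a Series
# Selecting several rows uses slice syntax that looks like numpy's
datas = food_info.loc[1:2]  # returns a DataFrame with the rows labeled 1 and 2; unlike numpy the end label is included, and loc[] does not accept -n to count from the end
data = food_info.iloc[i, j]  # i-th row, j-th column, both by position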
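Use DataFrame.dtypes to get the data type of each column
Use DataFrame[indices] to get column data. Note that to get several columns you pass a list of names, so don't drop the inner pair of brackets:
selenium_thiamin = food_info[["Selenium_(mcg)", "Thiamin_(mg)"]]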
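The difference between single and double brackets can be sketched like this. The frame is a small stand-in for food_info; the values and the "Water_(g)" column are assumptions for illustration.

```python
import pandas as pd

# Stand-in for food_info with invented values.
food = pd.DataFrame(
    {
        "Selenium_(mcg)": [0.1, 0.2],
        "Thiamin_(mg)": [0.01, 0.02],
        "Water_(g)": [15.9, 16.2],
    }
)

one = food["Selenium_(mcg)"]                    # one name -> Series
two = food[["Selenium_(mcg)", "Thiamin_(mg)"]]  # list of names -> DataFrame
single_as_frame = food[["Selenium_(mcg)"]]      # one-element list -> one-column DataFrame

print(type(one).__name__, type(two).__name__, single_as_frame.shape)
```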
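len(data_frame)  # number of rows in data_frame
len(data_frame.loc[0])  # number of columns in data_frame
set(data_frame)  # set of the column names
set(data_frame["column1"])  # distinct values in the column named "column1"
set(data_frame.loc[0])  # distinct values in the row labeled 0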
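Iterating over the data in a DataFrame
# A plain for loop does not iterate over rows: it yields the column names
for data in data_frame:
    pass
# Use len and iloc to walk the rows by position
for i in range(len(data_frame)):
    cur_data = data_frame.iloc[i]
# Use .items() (named .iteritems() before pandas 2.0) to iterate over (column name, Series) pairs
for i, series in df.items():
    print(i, ":", type(series))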
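For comparison, here is a short sketch of the common iteration styles side by side. iterrows() and itertuples() are not mentioned in the notes above; they are added here as the usual row-wise alternatives. The frame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# A plain for loop yields the column names, not the rows.
column_names = [name for name in df]

# iterrows() yields (row label, row as Series) pairs.
row_sums = []
for label, row in df.iterrows():
    row_sums.append(row["a"] + row["b"])

# itertuples() yields one namedtuple per row and is usually faster than iterrows().
pairs = [(t.a, t.b) for t in df.itertuples(index=False)]

# items() yields (column name, column as Series) pairs.
column_totals = {name: int(col.sum()) for name, col in df.items()}

print(column_names, row_sums, pairs, column_totals)
```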
Source: https://blog.csdn.net/u012436149/article/details/67109953
Official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
3. Plotting with DataFrame.plot
DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False, sharex=None, sharey=False, layout=None, figsize=None, use_index=True, title=None, grid=None, legend=True, style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, colormap=None, table=False, yerr=None, xerr=None, secondary_y=False, sort_columns=False, **kwds)
Make plots of DataFrame using matplotlib / pylab.
Parameters:
data : DataFrame
x : label or position, default None
y : label, position or list of label, positions, default None
    Allows plotting of one column versus another
kind : str
ax : matplotlib axes object, default None
subplots : boolean, default False
    Make separate subplots for each column
sharex : boolean, default True if ax is None else False
    In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure!
sharey : boolean, default False
    In case subplots=True, share y axis and set some y axis labels to invisible
layout : tuple (optional)
    (rows, columns) for the layout of subplots
figsize : a tuple (width, height) in inches
use_index : boolean, default True
    Use index as ticks for x axis
title : string or list
    Title to use for the plot. If a string is passed, print the string at the top of the figure. If a list is passed and subplots is True, print each item in the list above the corresponding subplot.
grid : boolean, default None (matlab style default)
    Axis grid lines
legend : False/True/'reverse'
    Place legend on axis subplots
style : list or dict
    matplotlib line style per column
logx : boolean, default False
    Use log scaling on x axis
logy : boolean, default False
    Use log scaling on y axis
loglog : boolean, default False
    Use log scaling on both x and y axes
xticks : sequence
    Values to use for the xticks
yticks : sequence
    Values to use for the yticks
xlim : 2-tuple/list
ylim : 2-tuple/list
rot : int, default None
    Rotation for ticks (xticks for vertical, yticks for horizontal plots)
fontsize : int, default None
    Font size for xticks and yticks
colormap : str or matplotlib colormap object, default None
    Colormap to select colors from. If string, load colormap with that name from matplotlib.
colorbar : boolean, optional
    If True, plot colorbar (only relevant for 'scatter' and 'hexbin' plots)
position : float
    Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)
table : boolean, Series or DataFrame, default False
    If True, draw a table using the data in the DataFrame and the data will be transposed to meet matplotlib's default layout. If a Series or DataFrame is passed, use passed data to draw a table.
yerr : DataFrame, Series, array-like, dict and str
    See Plotting with Error Bars for detail.
xerr : same types as yerr.
stacked : boolean, default False in line and bar plots, and True in area plot
    If True, create stacked plot.
sort_columns : boolean, default False
    Sort column names to determine plot ordering
secondary_y : boolean or sequence, default False
    Whether to plot on the secondary y-axis. If a list/tuple, which columns to plot on secondary y-axis
mark_right : boolean, default True
    When using a secondary_y axis, automatically mark the column labels with "(right)" in the legend
**kwds : keywords
    Options to pass to matplotlib plotting method

Returns:
axes :
If kind = 'scatter' and the argument c is the name of a DataFrame column, the values of that column are used to color each point.
alpha: sets the alpha value used for blending (not supported on all backends). It ranges from 0 to 1 (0.0 is fully transparent, 1.0 is fully opaque).
For more, see: http://pandas.pydata.org/pandas-docs/stable/reference/frame.html#
For example:
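The two options just described, c as a column name and alpha, can be sketched together. This assumes matplotlib is installed; the "Agg" backend, the "viridis" colormap and the alpha value 0.6 are choices made here for illustration, not part of the notes.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1], [6.4, 3.2, 1], [5.9, 3.0, 2]],
    columns=["length", "width", "species"],
)

# c="species" names a column, so each point is colored by that column's value;
# alpha=0.6 makes the markers partly transparent.
ax = df.plot(
    kind="scatter",
    x="length",
    y="width",
    c="species",
    colormap="viridis",
    alpha=0.6,
)
print(ax.get_xlabel(), ax.get_ylabel())
```

The call returns a matplotlib Axes object, so the figure can be saved afterwards with ax.figure.savefig(...).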
import pandas as pd

df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
                   [6.4, 3.2, 1], [5.9, 3.0, 2]],
                  columns=['length', 'width', 'species'])
# Scatter plot of width against length, with every point drawn in dark blue
ax1 = df.plot.scatter(x='length',
                      y='width',
                      c='DarkBlue')
This can also be written as:
# Same plot through the generic plot() entry point; s=1 sets the marker size
ax1 = df.plot(kind="scatter", x='length',
              y='width', s=1,
              c='DarkBlue')
Result: