Sort strings in column and print graph: answers

【Question Title】: Sort strings in column and print graph
【Posted】: 2016-07-06 20:42:55
【Question】:

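I have a DataFrame in which every pair of websites appears twice, once in each order, so when I plot the graph it contains duplicate bars. I tried removing the duplicates, but then my graph was plotted incorrectly. My CSV is here.

DataFrame common_users: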

     used_at  common users                     pair of websites
0       2014          1364                   avito.ru and e1.ru
1       2014          1364                   e1.ru and avito.ru
2       2014          1716                 avito.ru and drom.ru
3       2014          1716                 drom.ru and avito.ru
4       2014          1602                 avito.ru and auto.ru
5       2014          1602                 auto.ru and avito.ru
6       2014           299           avito.ru and avtomarket.ru
7       2014           299           avtomarket.ru and avito.ru
8       2014           579                   avito.ru and am.ru
9       2014           579                   am.ru and avito.ru
10      2014           602             avito.ru and irr.ru/cars
11      2014           602             irr.ru/cars and avito.ru
12      2014           424       avito.ru and cars.mail.ru/sale
13      2014           424       cars.mail.ru/sale and avito.ru
14      2014           634                    e1.ru and drom.ru
15      2014           634                    drom.ru and e1.ru
16      2014           475                    e1.ru and auto.ru
17      2014           475                    auto.ru and e1.ru
.....

You can see that the website names are reversed. I tried to sort it by pair of websites, but I got a KeyError. I use this code:

import itertools

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("avito_trend.csv", parse_dates=[2])

def f(df):
    dfs = []
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:

        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#sort by column 2014
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]


rects = ax.patches 
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] 
for rect, label in zip(rects, labels): 
    height = rect.get_height() 
    ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8)

plt.show()

My graph looks like this:

【Question Comments】:

Could you provide a list of the expected labels? It is not clear what you want to achieve.
Now I have another problem. I pass an array and get: rects = ax1.patches labels = ["%d" % i for i in time['time online'].round()] for rect, label in zip(rects, labels): print rect, label height = rect.get_height() ax1.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom') I described my problem in this question.

Tags: python pandas matplotlib

【Solution 1】:

You can first add a new column sort in function f, then sort the values by column pair of websites, and finally call drop_duplicates on columns used_at and sort:

import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv", 
                      parse_dates=[2])


def f(df):
    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:
        i += 1
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()

common_users = common_users.sort_values('pair of websites')
common_users = common_users.drop_duplicates(subset=['used_at','sort']) 
#print common_users

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#print graph_by_common_users

#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]

rects = ax.patches 
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index] 
for rect, label in zip(rects, labels): 
    height = rect.get_height() 
    ax.text(rect.get_width() + 20, rect.get_y() - 0.25 + rect.get_height(), label, fontsize=8) 

#sorting values of legend
handles, labels = ax.get_legend_handles_labels()
# sort both labels and handles by labels
labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
ax.legend(handles, labels)
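
As a side note, the reversed duplicates can also be collapsed without a helper counter column, by giving each label a canonical form. The following is only a minimal sketch on plain tuples, not the code used above: the helper name canonical_pair and the sample rows are assumptions for illustration (the rows are copied from the question's table).

```python
def canonical_pair(label, sep=" and "):
    """Return the label with its two site names in alphabetical order."""
    return sep.join(sorted(label.split(sep)))


# (used_at, common users, pair of websites), taken from the question
rows = [
    (2014, 1364, "avito.ru and e1.ru"),
    (2014, 1364, "e1.ru and avito.ru"),
    (2014, 1716, "avito.ru and drom.ru"),
    (2014, 1716, "drom.ru and avito.ru"),
]

seen = set()
deduplicated = []
for year, users, label in rows:
    # Both orderings of a pair map to the same (year, label) key.
    key = (year, canonical_pair(label))
    if key not in seen:
        seen.add(key)
        deduplicated.append((year, users, key[1]))

print(deduplicated)
```

With pandas, the same helper could be applied as common_users['pair of websites'].map(canonical_pair), followed by drop_duplicates(subset=['used_at', 'pair of websites']). This assumes that no site name itself contains the separator " and ".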

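My graph:

Edit:

The comment is:

Why do you create c_invert and x[1],x[0] = x[0],x[1]?

Because the combinations for years 2014 and 2015 were different: 4 values were missing in the first column and 4 values in the second: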

used_at                                2015    2014
pair of websites                                   
avito.ru and drom.ru                 1491.0  1716.0
avito.ru and auto.ru                 1473.0  1602.0
avito.ru and e1.ru                   1153.0  1364.0
drom.ru and auto.ru                     NaN   874.0
e1.ru and drom.ru                     539.0   634.0
avito.ru and irr.ru/cars              403.0   602.0
avito.ru and am.ru                    262.0   579.0
e1.ru and auto.ru                     451.0   475.0
avito.ru and cars.mail.ru/sale        256.0   424.0
drom.ru and irr.ru/cars               277.0   423.0
auto.ru and irr.ru/cars               288.0   409.0
auto.ru and am.ru                     224.0   408.0
drom.ru and am.ru                     187.0   394.0
auto.ru and cars.mail.ru/sale         195.0   330.0
avito.ru and avtomarket.ru            205.0   299.0
drom.ru and cars.mail.ru/sale         189.0   292.0
drom.ru and avtomarket.ru             175.0   247.0
auto.ru and avtomarket.ru             162.0   243.0
e1.ru and irr.ru/cars                 148.0   235.0
e1.ru and am.ru                        99.0   224.0
am.ru and irr.ru/cars                   NaN   223.0
irr.ru/cars and cars.mail.ru/sale      94.0   197.0
am.ru and cars.mail.ru/sale             NaN   166.0
e1.ru and cars.mail.ru/sale           105.0   154.0
e1.ru and avtomarket.ru               105.0   139.0
avtomarket.ru and irr.ru/cars           NaN   139.0
avtomarket.ru and am.ru                72.0   133.0
avtomarket.ru and cars.mail.ru/sale    48.0   105.0
auto.ru and drom.ru                   799.0     NaN
cars.mail.ru/sale and am.ru            73.0     NaN
irr.ru/cars and am.ru                 102.0     NaN
irr.ru/cars and avtomarket.ru          73.0     NaN

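So I created all inverted combinations, and the problem was solved. But why are there NaN values? Why are the combinations for 2014 and 2015 different?

I added a print to function f: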

def f(df):
    print(df['address'].unique())

    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations((df['address'].unique()), 2)]:
...
...

The output is (why the first group is printed twice is described in the warning here):

['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
 'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
 'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'auto.ru' 'drom.ru' 'irr.ru/cars' 'avtomarket.ru'
 'cars.mail.ru/sale' 'am.ru']

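So the lists are in a different order for each year, the combinations therefore differ as well, and I got some NaN values.

The solution is to sort the list before building the combinations.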

def f(df):
    #print (sorted(df['address'].unique()))   
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        ...
        ...
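
To see why sorting helps, here is a minimal sketch using only itertools. The helper name pair_labels and the three-site lists are assumptions for illustration; they mimic the situation above, where drom.ru and auto.ru first appear in a different order in each year.

```python
import itertools


def pair_labels(sites, sort_first=False):
    """Return 'a and b' labels for every 2-combination of sites."""
    if sort_first:
        sites = sorted(sites)
    return [" and ".join(pair) for pair in itertools.combinations(sites, 2)]


# Same three sites, but in a different order of first appearance per year.
sites_2014 = ["avito.ru", "drom.ru", "auto.ru"]
sites_2015 = ["avito.ru", "auto.ru", "drom.ru"]

# Labels that exist in only one year; after a pivot they show up as NaN.
mismatched = set(pair_labels(sites_2014)) ^ set(pair_labels(sites_2015))
print(mismatched)

# After sorting, both years produce identical labels in identical order.
sorted_2014 = pair_labels(sites_2014, sort_first=True)
sorted_2015 = pair_labels(sites_2015, sort_first=True)
print(sorted_2014 == sorted_2015)
```

itertools.combinations keeps the order of its input, so the label of a pair depends on which site comes first in the list. Sorting the list first makes the label independent of the order in which the sites appear in each year's group.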

The complete code is:

import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv", 
                      parse_dates=[2])

def f(df):
    #print (sorted(df['address'].unique()))   
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
#print common_users

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')

#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)
#print graph_by_common_users

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]

rects = ax.patches 
labels = [int(round(graph_by_common_users.loc[i, y])) \
for y in graph_by_common_users.columns.tolist() \
for i in graph_by_common_users.index]

for rect, label in zip(rects, labels): 
    height = rect.get_height() 
    ax.text(rect.get_width()+20, rect.get_y() - 0.25 + rect.get_height(), label, fontsize=8)

#sorting values of legend (done once, after the bar labels are drawn)
handles, labels = ax.get_legend_handles_labels()
# sort both labels and handles by labels
labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
ax.legend(handles, labels)

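And the graph:

【Comments】:

Is it possible to move the numbers a little lower? Some of them are bunched together.
And to print 2014 above 2015?
OK, give me a moment. But the first problem is solved, see the edit.
Can you change the order of the years in the upper right corner? First 2014 and then 2015.
Thank you very much. That is what I wanted. If I have questions about the code, may I ask you?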

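【Solution 2】:

DataFrame setup issue

It looks like your DataFrame is not structured the way you expect. Your DataFrame contains 2014 and 2015 as column header names, not as row values under a used_at index. Also, used_at is the name of the index, not the index label of the first row.

You can verify this by running: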

import pandas as pd
from cStringIO import StringIO

text_data = '''
used_at            2014  2015
address                      
am.ru               621   273
auto.ru            1752  1595
avito.ru           5460  4631
avtomarket.ru       314   215
cars.mail.ru/sale   457   271
drom.ru            1934  1623
e1.ru              1654  1359
irr.ru/cars         619   426
'''

# Read in tabular data with used_at row as header
df = pd.read_table(StringIO(text_data), sep='\s+', index_col=0)
print 'DataFrame created with used_at row as header:'
print df
print 

# print df.used_at would cause AttributeError: 'DataFrame' object has no attribute 'used_at'
print 'df columns    :', df.columns
print 'df index name :', df.index.name
print

DataFrame created with used_at row as header:
                   2014  2015
used_at                      
address             NaN   NaN
am.ru               621   273
auto.ru            1752  1595
avito.ru           5460  4631
avtomarket.ru       314   215
cars.mail.ru/sale   457   271
drom.ru            1934  1623
e1.ru              1654  1359
irr.ru/cars         619   426

df columns    : Index([u'2014', u'2015'], dtype='object')
df index name : used_at
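
If the years are needed as row values rather than column headers, one option is to reshape the table from wide to long form. This is only a sketch under assumptions: the two sample rows are copied from the table above, and the value column name count is hypothetical, since the original table does not name it.

```python
import pandas as pd

# Wide table: one column per year, addresses as the index.
wide = pd.DataFrame(
    {"2014": [621, 1752], "2015": [273, 1595]},
    index=pd.Index(["am.ru", "auto.ru"], name="address"),
)
wide.columns.name = "used_at"

# Long table: one row per (address, year), with the year as a value.
long_form = wide.reset_index().melt(
    id_vars="address", var_name="used_at", value_name="count"
)
print(long_form)
```

In the long form, used_at is an ordinary column, so long_form['used_at'] works, whereas attribute access like df.used_at fails on the wide table.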

【Comments】: