
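Machine Learning: Data Science Libraries, Day 5

2022-02-06 16:34:18  Source: Internet

Tags: index 1.0 df 0.0 Day5 NaN learning print machine

Day5

For this movie dataset, how should we process the data if we want to count how many movies fall into each genre (Genre)?

Approach: build a new all-zeros array whose column names are the genres; whenever a genre appears in a row, change that 0 to 1.

# set() creates an unordered collection of unique elements; it supports membership tests, deduplication, and intersection, difference and union.

# shape[0] is the number of rows, shape[1] is the number of columns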
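
As a quick check of the two notes above, the sketch below flattens a nested list into a set of unique values and reads the shape of a zeros array. The three-row genre list is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical nested list, shaped like the result of splitting the Genre column
nested = [["Action", "Sci-Fi"], ["Drama"], ["Action", "Drama"]]

# set() removes duplicates; sorted() gives a stable order for display
unique_genres = sorted(set(g for row in nested for g in row))
print(unique_genres)

# One row per movie, one column per unique genre
zeros = np.zeros((len(nested), len(unique_genres)))
print(zeros.shape[0], zeros.shape[1])
```

Because a set is unordered, the column order in the real script can differ from run to run; sorting is only one way to make it stable.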

[Practice]

# coding=utf-8
# For this movie dataset, how should we process the data
# if we want to count how many movies fall into each genre?
# Approach: build an all-zeros array whose column names are the genres;
# whenever a genre appears in a row, change that 0 to 1
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
file_path = "./IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)

# print(df["Genre"])
# ---------------------- Key part ----------------------
# Build the list of genres
# Splitting each row gives a list of lists:
# [[],[],[]]
temp_list = df['Genre'].str.split(",").tolist()
# set() gives an unordered collection of unique elements
genre_list = list(set([i for j in temp_list for i in j]))
# print(genre_list)
# Build the all-zeros frame (one column per genre in genre_list)
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)
# print(zeros_df)

# Set 1 at the position of every genre each movie belongs to
# shape[0] is the number of rows, shape[1] is the number of columns
for i in range(df.shape[0]):
    # df.loc indexes rows and columns by label
    # e.g. zeros_df.loc[0,["Sci-Fi","Musical"]]=1
    zeros_df.loc[i,temp_list[i]] = 1
print(zeros_df)
# print(zeros_df.head(3))
# Sum each column to get the number of movies per genre
genre_count = zeros_df.sum(axis=0)
print(genre_count)
# Sort
genre_count = genre_count.sort_values()
# ------------------------------------------------------
_x = genre_count.index
_y = genre_count.values
# Create the figure for the bar chart
plt.figure(figsize=(20,8),dpi=80)
# Draw the bars
plt.bar(range(len(_x)),_y)
# x-axis ticks
plt.xticks(range(len(_x)),_x)
plt.show()

Result:

Biography Thriller Horror Fantasy ... Comedy Sport Animation Musical

0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

2 0.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0

3 0.0 0.0 0.0 0.0 ... 1.0 0.0 1.0 0.0

4 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0

.. ... ... ... ... ... ... ... ... ...

995 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

996 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0

997 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

998 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0

999 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0

[1000 rows x 20 columns]

Biography 81.0

Thriller 195.0

Horror 119.0

Fantasy 101.0

Mystery 106.0

Sci-Fi 120.0

Western 7.0

Family 51.0

History 29.0

Drama 513.0

Adventure 259.0

Action 303.0

Crime 150.0

War 13.0

Music 16.0

Romance 141.0

Comedy 279.0

Sport 18.0

Animation 49.0

Musical 5.0

dtype: float64
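
The same 0/1 table can probably be built without the explicit loop. The sketch below uses pandas' `Series.str.get_dummies`, which splits each string on a separator and returns one indicator column per distinct value. The three-movie frame is a made-up stand-in for IMDB-Movie-Data.csv, and this is an alternative to the approach above, not the method used in the original script.

```python
import pandas as pd

# Hypothetical three-movie sample standing in for IMDB-Movie-Data.csv
df = pd.DataFrame({"Genre": ["Action,Sci-Fi", "Drama", "Action,Drama,Thriller"]})

# One 0/1 column per genre, split on the comma separator
dummies = df["Genre"].str.get_dummies(sep=",")
print(dummies)

# Column sums give the number of movies per genre
genre_count = dummies.sum(axis=0).sort_values()
print(genre_count)
```

The result holds integers rather than floats, so the counts print as 2 instead of 2.0.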

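Merging data with join

join: by default it merges rows that share the same row index. The default is a left join, so the result keeps the rows of the frame that join() is called on.

(For example, if both frames have rows A, B and C, the rows with matching index labels are combined. With join(), the two frames must not share column names unless lsuffix/rsuffix is given.)

[Example] Overlapping row indexes (A, B versus A, B, C)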

#coding=utf-8
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.ones((2,4)),columns=list('abcd'),index=["A","B"])
print(t1)
t2 = pd.DataFrame(np.zeros((3,3)),index=["A","B","C"],columns=["X","Y","Z"])
print(t2)
print("t1.join(t2):\n",t1.join(t2))
print("t2.join(t1):\n",t2.join(t1))

Result:

     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0
     X    Y    Z
A  0.0  0.0  0.0
B  0.0  0.0  0.0
C  0.0  0.0  0.0
t1.join(t2):
      a    b    c    d    X    Y    Z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0
t2.join(t1):
      X    Y    Z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN

[Example] Disjoint row indexes (A, B versus Q, W, E)

#coding=utf-8
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.ones((2,4)),columns=list('abcd'),index=["A","B"])
print(t1)
t2 = pd.DataFrame(np.zeros((3,3)),index=["Q","W","E"],columns=["X","Y","Z"])
print(t2)
print("t1.join(t2):\n",t1.join(t2))
print("t2.join(t1):\n",t2.join(t1))

Result:

     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0
     X    Y    Z
Q  0.0  0.0  0.0
W  0.0  0.0  0.0
E  0.0  0.0  0.0
t1.join(t2):
      a    b    c    d   X   Y   Z
A  1.0  1.0  1.0  1.0 NaN NaN NaN
B  1.0  1.0  1.0  1.0 NaN NaN NaN
t2.join(t1):
      X    Y    Z   a   b   c   d
Q  0.0  0.0  0.0 NaN NaN NaN NaN
W  0.0  0.0  0.0 NaN NaN NaN NaN
E  0.0  0.0  0.0 NaN NaN NaN NaN
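
The note about column names can be checked with a small sketch. The frames below are made up so that both contain a column named b; `lsuffix`, `rsuffix` and `how` are parameters of `DataFrame.join`, while the suffix strings themselves are arbitrary choices for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that share the column name "b"
t1 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"], index=["A", "B"])
t2 = pd.DataFrame(np.zeros((3, 2)), columns=["b", "c"], index=["A", "B", "C"])

# With overlapping column names and no suffix, join raises ValueError
try:
    t1.join(t2)
    overlap_error = False
except ValueError:
    overlap_error = True
print("overlap raises:", overlap_error)

# Suffixes rename only the overlapping columns
joined = t1.join(t2, lsuffix="_left", rsuffix="_right")
print(joined)

# how="outer" keeps the union of both row indexes instead of only t1's rows
outer = t1.join(t2, how="outer", lsuffix="_left", rsuffix="_right")
print(outer)
```

The outer join here gains row C, with NaN in the columns that come from t1.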

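Merging data with merge

merge: combines data on the specified column(s) using the chosen join type

The default join type is inner: the intersection of the keys

merge outer: the union of the keys, with missing values filled with NaN

merge left: the left frame is the reference; every left row is kept, with NaN where the right frame has no match

merge right: the right frame is the reference; every right row is kept, with NaN where the left frame has no match

[Practice]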

# coding=utf-8
import pandas as pd
import numpy as np
t1 = pd.DataFrame(np.ones((2, 4)), columns=list('abcd'), index=["A", "B"])
t1.loc["A","a"] = 100
print(t1)
t3 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list("fax"))
print(t3)
# Inner join (only the values of a present in both frames)
print("t1.merge inner join\n",t1.merge(t3,on="a")) # the only shared value of a is 1, so only one row is merged
# Outer join
print("t1.merge outer join\n",t1.merge(t3,on="a",how="outer"))
# Left join (t1 is the reference)
print("t1.merge left join\n",t1.merge(t3,on="a",how="left"))
# Right join (t3 is the reference)
print("t1.merge right join\n",t1.merge(t3,on="a",how="right"))

Result:

       a    b    c    d
A  100.0  1.0  1.0  1.0
B    1.0  1.0  1.0  1.0
   f  a  x
0  0  1  2
1  3  4  5
2  6  7  8
t1.merge inner join  # intersection: only a == 1 is in both frames
      a    b    c    d  f  x
0  1.0  1.0  1.0  1.0  0  2
t1.merge outer join  # union: the rest is filled with NaN
        a    b    c    d    f    x
0  100.0  1.0  1.0  1.0  NaN  NaN
1    1.0  1.0  1.0  1.0  0.0  2.0
2    4.0  NaN  NaN  NaN  3.0  5.0
3    7.0  NaN  NaN  NaN  6.0  8.0
t1.merge left join  # the left frame is the reference
        a    b    c    d    f    x
0  100.0  1.0  1.0  1.0  NaN  NaN
1    1.0  1.0  1.0  1.0  0.0  2.0
t1.merge right join  # the right frame is the reference
      a    b    c    d  f  x
0  1.0  1.0  1.0  1.0  0  2
1  4.0  NaN  NaN  NaN  3  5
2  7.0  NaN  NaN  NaN  6  8
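
When the key columns have different names in the two frames, `merge` accepts `left_on` and `right_on` instead of `on`. The sketch below compares the row counts of the four join types on two made-up frames; the names key, k, x and y are hypothetical. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical frames: keys b and c are shared, a is left-only, d is right-only
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"k": ["b", "c", "d"], "y": [20, 30, 40]})

# Number of rows produced by each join type
row_counts = {}
for how in ("inner", "outer", "left", "right"):
    merged = left.merge(right, left_on="key", right_on="k", how=how)
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True labels each row as left_only, right_only or both
labelled = left.merge(right, left_on="key", right_on="k", how="outer", indicator=True)
origin_counts = labelled["_merge"].value_counts().to_dict()
print(origin_counts)
```

Inner keeps the two shared keys, left and right keep three rows each, and outer keeps all four distinct keys.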

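Now we have statistics on Starbucks stores worldwide. Suppose I want to know whether the US or China has more Starbucks stores, or how many stores each Chinese province has. What should we do?

Approach: loop over every row and add 1 each time???

Grouping and aggregation

pandas has a very simple way to do this kind of grouping:

df.groupby(by="columns_name")

So the question is: what does a call to groupby return?

grouped = df.groupby(by="columns_name")

grouped is a DataFrameGroupBy object, and it is iterable

Each element of grouped is a tuple

The tuple is (index (the group key), the DataFrame of that group)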
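
A minimal sketch of that iteration, assuming a tiny made-up frame in place of starbucks_store_worldwide.csv (only the column names Country and Brand are taken from the dataset used later):

```python
import pandas as pd

# Hypothetical five-store sample
df = pd.DataFrame({
    "Country": ["US", "CN", "US", "CN", "US"],
    "Brand": ["Starbucks"] * 5,
})

grouped = df.groupby(by="Country")

# Each element is (group key, DataFrame holding the rows of that group)
sizes = {}
for country, group_df in grouped:
    sizes[country] = len(group_df)
    print(country, type(group_df).__name__, len(group_df))
print(sizes)
```

Taking `len` of each group is the "length" idea; the aggregation methods do the same work in one call.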

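So, back to the earlier question:

To count the Starbucks stores in the US and in China, what should we do?

Take the length of each DataFrame after grouping?

Length is one option, but there are more methods (aggregation methods) for solving this problem

[Practice] Number of Starbucks stores in China and the US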

# coding=utf-8
# Does the US or China have more Starbucks stores?
# And how many stores does each Chinese province have?
import pandas as pd
import numpy as np

file_path="./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())
grouped = df.groupby(by="Country")
print(grouped)

# DataFrameGroupBy
# It can be iterated:
# for i,j in grouped:
#     print(i)
#     print("-"*10)
#     print(j)
#     print('******************')
# df[df["Country"]=="US"]
# Call an aggregation method
country_count = grouped["Brand"].count()
print("US:",country_count["US"])
print("CN",country_count["CN"])

Result:
US: 13608

CN 2734
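
One detail worth keeping in mind: `count()` counts only the non-missing values of the selected column, while `size()` counts rows. The sketch below shows the difference on a made-up frame with one missing Brand; whether the real dataset has missing values in Brand is not stated in the source.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing Brand value
df = pd.DataFrame({
    "Country": ["US", "US", "CN"],
    "Brand": ["Starbucks", np.nan, "Starbucks"],
})

by_count = df.groupby("Country")["Brand"].count()  # non-missing Brand values per group
by_size = df.groupby("Country").size()              # rows per group
by_value_counts = df["Country"].value_counts()      # rows per country, no groupby needed

print(by_count)
print(by_size)
print(by_value_counts)
```

If every store has a Brand, the three results agree; otherwise `size()` or `value_counts()` gives the number of stores.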

DataFrameGroupBy objects have many optimized methods

If we need to group and count by country and province, how should we do it?

[Practice]

#coding=utf-8

import pandas as pd
import numpy as np

file_path="./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)

china_data = df[df["Country"]=="CN"]
grouped = china_data.groupby(by="State/Province").count()["Brand"]
print(grouped)

Result:

State/Province

11 236

12 58

13 24

14 8

15 8

21 57

22 13

23 16

31 551

32 354

33 315

34 26

35 75

36 13

37 75

41 21

42 76

43 35

44 333

45 21

46 16

50 41

51 104

52 9

53 24

61 42

62 3

63 3

64 2

91 162

92 13

Name: Brand, dtype: int64

grouped = df.groupby(by=[df["Country"],df["State/Province"]])

Often we only want part of the data after grouping, or we only want to group certain columns. What should we do then?

Getting part of the data after grouping:

df.groupby(by=["Country","State/Province"])["Country"].count()

[Example]

# coding=utf-8
import pandas as pd
import numpy as np

file_path="./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)

# Group by several keys; the result is a Series
grouped = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
print(grouped)
print(type(grouped)) # two index levels; only the last column holds values

# Group by several keys: grouped1 is a Series, grouped2 and grouped3 are DataFrames
# grouped1 = df["Brand"].groupby(by=[df["Country"],df["State/Province"]]).count()
# grouped2 = df.groupby(by=[df["Country"],df["State/Province"]])[["Brand"]].count()
# grouped3 = df.groupby(by=[df["Country"],df["State/Province"]]).count()[["Brand"]]
# print(grouped1,type(grouped1))
# print('*****')
# print(grouped2,type(grouped2))
# print('*****')
# print(grouped3,type(grouped3))

Result:

Country State/Province

AD 7 1

AE AJ 2

AZ 48

DU 82

FU 2

US WV 25

WY 23

VN HN 6

SG 19

ZA GT 3

Name: Brand, Length: 545, dtype: int64

Grouping only some columns:

df["Country"].groupby(by=[df["Country"],df["State/Province"]]).count()

Looking at the result: because only one column was selected, the result is a Series

What if I want a DataFrame back?

t1=df[["Country"]].groupby(by=[df["Country"],df["State/Province"]]).count()
t2 = df.groupby(by=["Country","State/Province"])[["Country"]].count()

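The two commands above give the same result

The difference from the earlier result is that this one is a DataFrame

So the question is:

Compared with grouping by a single key, what are the first two columns of this result?

They are two index levels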
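
The Series/DataFrame distinction and the two index levels can be seen on a small made-up frame. The rows below are hypothetical; only the column names follow the Starbucks dataset. `reset_index()` is shown as one way to turn the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical five-store sample
df = pd.DataFrame({
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "31", "31", "11"],
    "Brand": ["Starbucks"] * 5,
})

# Single brackets select one column: the result is a Series
as_series = df.groupby(by=["Country", "State/Province"])["Brand"].count()

# Double brackets select a list of columns: the result is a DataFrame
as_frame = df.groupby(by=["Country", "State/Province"])[["Brand"]].count()

print(type(as_series).__name__, as_series.index.nlevels)
print(type(as_frame).__name__, as_frame.index.nlevels)

# Move both index levels back into regular columns
flat = as_frame.reset_index()
print(flat)
```

Both results are indexed by the pair (Country, State/Province), which is why two label columns appear before the values when they are printed.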

Indexes and MultiIndex

Simple index operations:

Get the index: df.index

Assign the index: df.index = ['x','y']

Reindex: df.reindex(list("abcedf"))

Use a column as the index: df.set_index("Country",drop=False)

Unique values of the index: df.set_index("Country").index.unique()

[Example]

# coding=utf-8
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((2,4)),columns=['A','B','C','D'],index=['a','b'])
df2 = pd.DataFrame(np.ones((2,4)),columns=['A','B','C','D'],index=['a','b'])
df1.loc["a","A"] = 100
df2.loc["a","A"] = 100
print(df1)
# Get the index
print(df1.index)
# Assign the index
df1.index=["x","y"]
print(df1)
# Reindex: labels missing from the original index produce NaN rows
df3 = df2.reindex(list('af'))
print(df3)
# Use a column as the index; drop=False keeps column A in the frame
print(df2.set_index("A",drop=False))
# Unique values (here of column B); unique is a method, so it must be called with ()
t= df1["B"].unique()
print(t)
# Convert the index to a list
print(list(df1.set_index("A").index))

Result:

       A    B    C    D
a  100.0  1.0  1.0  1.0
b    1.0  1.0  1.0  1.0
Index(['a', 'b'], dtype='object')
       A    B    C    D
x  100.0  1.0  1.0  1.0
y    1.0  1.0  1.0  1.0
       A    B    C    D
a  100.0  1.0  1.0  1.0
f    NaN  NaN  NaN  NaN
           A    B    C    D
100.0  100.0  1.0  1.0  1.0
1.0      1.0  1.0  1.0  1.0
[1.]
[100.0, 1.0]

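Suppose a is a DataFrame. What does the result look like after a.set_index(["c","d"]), that is, with two index levels?

a=pd.DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two', 'two'],'d': list("hjklmno")})

So the question is: what if I only want the values for index h?

And how do we select values from such a DataFrame?

Note (here b = a.set_index(["c","d"])):

# 1. Get the values for 'one','h' in b
print(b.loc["one"].loc["h"])
# 2. Swap the inner and outer levels (one was the outer level; now h, j, k become the outer level and can be selected directly)
print(b.swaplevel().loc["h"])

[Example]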

# coding=utf-8
import pandas as pd
import numpy as np

a=pd.DataFrame({'a':range(7),'b':range(7,0,-1),
'c':['one','one','one','two','two','two', 'two'],
'd': list("hjklmno")})
print(a)
b = a.set_index(["c","d"])
print(b)
print(b.index)
print('********')
c = b["a"] # c is a Series
print(c)
print('*******')
print(c["one"])
# 1. Get the values for 'one','j' in b
print(b.loc["one"].loc["j"])
# 2. Swap the inner and outer levels (one was the outer level; now h, j, k become the outer level and can be selected directly)
print(b.swaplevel().loc["j"])

Result:

   a  b    c  d
0  0  7  one  h
1  1  6  one  j
2  2  5  one  k
3  3  4  two  l
4  4  3  two  m
5  5  2  two  n
6  6  1  two  o
       a  b
c   d
one h  0  7
    j  1  6
    k  2  5
two l  3  4
    m  4  3
    n  5  2
    o  6  1
MultiIndex([('one', 'h'),
            ('one', 'j'),
            ('one', 'k'),
            ('two', 'l'),
            ('two', 'm'),
            ('two', 'n'),
            ('two', 'o')],
           names=['c', 'd'])
********
c    d
one  h    0
     j    1
     k    2
two  l    3
     m    4
     n    5
     o    6
Name: a, dtype: int64
*******
h    0
j    1
k    2
Name: a, dtype: int64
a    1
b    6
Name: j, dtype: int64
     a  b
one  1  6
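
Two other ways to reach the same values are sketched below: passing a tuple to `loc` selects on both levels at once, and `DataFrame.xs` selects on a named level without swapping. These are alternatives to the chained `loc` and `swaplevel` calls above, not what the original example uses.

```python
import pandas as pd

a = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": list("hjklmno"),
})
b = a.set_index(["c", "d"])

# A tuple key addresses the outer and inner level together
row = b.loc[("one", "j")]
print(row)

# xs picks every row whose inner level d equals "j"
by_inner = b.xs("j", level="d")
print(by_inner)
```

The tuple form returns a single row as a Series, while `xs` returns a DataFrame indexed by the remaining level c.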

Exercise:

Use matplotlib to show the 10 countries with the most stores

# coding=utf-8
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Use matplotlib to show the 10 countries with the most stores

file_path="./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
# First group by country and count
# Prepare the data (ascending=False sorts in descending order) and take the Brand column
data1 = df.groupby(by="Country").count()["Brand"].sort_values(ascending=False)[:10]
print(data1)
_x = data1.index # Index(['US', 'CN', 'CA', ..., 'PH'])
_y = data1.values # [13608 2734 1468 ... 298]

# Draw the bar chart
plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()

Result:

Country

US 13608

CN 2734

CA 1468

JP 1237

KR 993

GB 901

MX 579

TW 394

TR 326

PH 298

Name: Brand, dtype: int64
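
The sort-then-slice step can also be written with `Series.nlargest`, which returns the n largest values already in descending order. The sketch below uses a made-up list of countries and takes the top 3 instead of the top 10.

```python
import pandas as pd

# Hypothetical store list: 4 in US, 3 in CN, 2 in CA, 1 in JP
df = pd.DataFrame({"Country": ["US"] * 4 + ["CN"] * 3 + ["CA"] * 2 + ["JP"]})

# Rows per country, then the three largest groups
top3 = df.groupby("Country").size().nlargest(3)
print(top3)
```

The index and values of `top3` can be passed to `plt.bar` and `plt.xticks` in the same way as `data1` above.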

Use matplotlib to show the number of stores in each city in China (the code below plots the 20 cities with the most stores)

# coding=utf-8
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import font_manager

# Set a font that can render Chinese (Windows path to SimHei; adjust for your system)
my_font = font_manager.FontProperties(fname='C:/Windows/Fonts/simhei.ttf')
# Use matplotlib to show the number of stores in each city in China
file_path="./starbucks_store_worldwide.csv"

df = pd.read_csv(file_path)
df = df[df["Country"]=="CN"]
# Prepare the data: the 20 cities with the most stores
data1 = df.groupby(by="City").count()["Brand"].sort_values(ascending=False)[:20]
print(data1)
_x = data1.index
_y = data1.values

# Draw a horizontal bar chart
plt.figure(figsize=(20,8),dpi=80)
# plt.bar(range(len(_x)),_y,width=0.3,color="orange")
plt.barh(range(len(_x)),_y,height=0.3,color="orange")
plt.yticks(range(len(_x)),_x,fontproperties=my_font)
plt.show()

Result:

City

上海市 542

北京市 234

杭州市 117

深圳市 113

广州市 106

Hong Kong 104

成都市 98

苏州市 90

南京市 73

武汉市 67

宁波市 59

天津市 58

重庆市 41

西安市 40

无锡市 40

佛山市 33

东莞市 31

厦门市 31

青岛市 28

长沙市 26

Name: Brand, dtype: int64

Exercise:

We now have data on the 10,000 top-ranked books worldwide. Please answer the following questions:

Number of books by year

# coding=utf-8
# We have data on the 10,000 top-ranked books worldwide; count
# the number of books by year
import pandas as pd
from matplotlib import pyplot as plt

file_path = "./books.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())
# --------- Number of books by year --------
# Keep the rows where original_publication_year is not NaN (pd.notnull)
data1 = df[pd.notnull(df["original_publication_year"])]
grouped = data1.groupby(by="original_publication_year").count()["title"]
print(grouped)

Result:

original_publication_year

-1750.0 1

-762.0 1

-750.0 2

-720.0 1

-560.0 1

...

2013.0 518

2014.0 437

2015.0 306

2016.0 198

2017.0 11

Name: title, Length: 293, dtype: int64

Average rating of books by year

# coding=utf-8
# We have data on the 10,000 top-ranked books worldwide; compute
# the number of books by year
# the average rating of books by year
import pandas as pd
from matplotlib import pyplot as plt

file_path = "./books.csv"

df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())
# --------- Average rating of books by year --------
# Drop the rows where original_publication_year is NaN
data = df[pd.notnull(df["original_publication_year"])]
# Group by year and take the mean rating
grouped = data["average_rating"].groupby(by=data["original_publication_year"]).mean()
print(grouped)

_x = grouped.index
_y = grouped.values
# Draw a line chart
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
# Show every 100th year as an x tick
plt.xticks(list(range(len(_x)))[::100],_x[::100],rotation=45)
plt.show()

Result:

original_publication_year

-1750.0 3.630000

-762.0 4.030000

-750.0 4.005000

-720.0 3.730000

-560.0 4.050000

...

2013.0 4.012297

2014.0 3.985378

2015.0 3.954641

2016.0 4.027576

2017.0 4.100909

Name: average_rating, Length: 293, dtype: float64
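
The two questions (number of books and average rating per year) can probably be answered in one pass with `agg`, which applies several aggregation functions to the same groups. The four-book frame below is made up; only the column names follow books.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last book has no publication year
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "original_publication_year": [2000.0, 2000.0, 2001.0, np.nan],
    "average_rating": [4.0, 3.0, 3.5, 5.0],
})

# Drop rows without a year, then compute count and mean per year
valid = books.dropna(subset=["original_publication_year"])
summary = valid.groupby("original_publication_year")["average_rating"].agg(["count", "mean"])
print(summary)
```

The result is a DataFrame with one row per year and one column per aggregation, so both series can be plotted from the same object.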

Day 05 summary

Source: https://blog.csdn.net/birdooo/article/details/122799111
