Pandas 資料分析 (下) - Python 基本使用 - AI 相關 | Zrn Code = 為了夢想、永不停歇

Ch.4 Grouping and Sorting
Ch.5 Data Types and Missing Values
Ch.6 Renaming and Combining

# Chapter 4

# Groupwise analysis

使用 groupby() 方法可以將資料依照自己要的 column 分組，它會將類別相同的數值進行合併並一同分析。
這邊我們使用 count() 統計各個 points 出現的數量:

reviews.groupby('points').points.count()

運算結果

points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

除此之外，我們也可以使用 describe() 中用過的所有函式對數值進行分析。
例如：我想知道各項的 points 中，最便宜的價格為多少。

reviews.groupby('points').price.min() #運算結果見註解

運算結果

points
80      5.0 #所有80分的酒中，最低價的是5.0元
81      5.0
       ... 
99     44.0
100    80.0 #所有100分的酒中，最低價的是80.0元
Name: price, Length: 21, dtype: float64

使用 groupby() 建立的表格數值本身就已經有對應上，因此，我們可以直接使用 apply() 對資料做映射處理:

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

運算結果

winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

groupby() 也可以同時對多 column 做分組分析 (也就是 Multi-indexes )
例如：列出每一個國家與其省份分數最高的酒類資料。

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()]).head()

idmax() 會回傳第一個最大值的索引值

運算結果

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
('Argentina', 'Mendoza Province')	Argentina	If the color doesn't tell the full story, the nose will: it's impeccably perfumed, round and sweet, with amazing fruit aromas and a toasty background. The feel is as lush and rich as they come, and there's outstanding purity to the lusty black fruit flavors. A masher with elegance. Maybe the best Malbec Argentina has ever produced. Drink from 2011–15.	Nicasia Vineyard	97	120	Mendoza Province	Mendoza	nan	Michael Schachner	@wineschach	Bodega Catena Zapata 2006 Nicasia Vineyard Malbec (Mendoza)	Malbec	Bodega Catena Zapata
('Argentina', 'Other')	Argentina	Take note, this could be the best wine Colomé has ever bottled. Cedary oak aromas grace ripe, almost desiccated wild berry notes that are slightly herbal and sauvage. A well-layered mouthfeel is ideal in power and precision. Requisite blackberry, cassis, toast and peppery flavors finish dry and long, with little heat or burn. Drink through 2021.	Reserva	95	90	Other	Salta	nan	Michael Schachner	@wineschach	Colomé 2010 Reserva Malbec (Salta)	Malbec	Colomé
('Armenia', 'Armenia')	Armenia	Deep salmon in color, this wine offers a bouquet of cantaloupe and nectarine. It is bright on entry, with full mouthfeel and flavors of pineapple, grapefruit and white peach that culminate in a bright citrus finish.	Estate Bottled	88	15	Armenia	nan	nan	Mike DeSimone	@worldwineguys	Van Ardi 2015 Estate Bottled Rosé (Armenia)	Rosé	Van Ardi
('Australia', 'Australia Other')	Australia	Writes the book on how to make a wine filled with gigantic jammy fruit that's nonetheless dry, balanced and harmonious. The Cabernet contributes blackcurrants and structural tannins. The Syrah brings a peppery richness, while Merlot adds soft fruitiness. You can actually taste each varietal, yet the wine hangs together, dry and rich. Best Buy.	Sarah's Blend	93	15	Australia Other	South Eastern Australia	nan	nan	nan	Marquis Philips 2000 Sarah's Blend Red (South Eastern Australia)	Red Blend	Marquis Philips
('Australia', 'New South Wales')	Australia	De Bortoli's Noble One is as good as ever in 2007, offering intense botrytis, dried apricots, honey and bergamot in a sweetly unctuous style. Think of a Sauternes, but with more sugar and less alcohol. It should age well for 10 or more years.	Noble One Bortytis	94	32	New South Wales	New South Wales	nan	Joe Czerwinski	@JoeCz	De Bortoli 2007 Noble One Bortytis Semillon (New South Wales)	Sémillon	De Bortoli

我們可以使用 agg() ，同時對一組數據做不同的函式比較:

	# 同時對不同 country 的 price 分別做分析 (資料數，最小值，最大值)
	reviews.groupby(['country']).price.agg([len, min, max]).head()

運算結果

country	len	min	max
Argentina	3800	4	230
Armenia	2	14	15
Australia	2329	5	850
Austria	3345	7	1100
Bosnia and Herzegovina	2	12	13

# Multi-indexes

之前我們就稍有提到，我們建立的群組也可以對多個 column 進行描述與運算。
這裡舉個簡單的例子：我想知道各國家的各省份各有多少筆資料:

	countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
	countries_reviewed.head()

運算結果

	len
('Argentina', 'Mendoza Province')	3264
('Argentina', 'Other')	536
('Armenia', 'Armenia')	2
('Australia', 'Australia Other')	245
('Australia', 'New South Wales')	85

使用多個 column 資料所建立的表格，其索引值並非傳統的 Index ，而是 padas 內建的 MultiIndex 。
pandas 有特定的方法可以對 MutiIndex 做出更複雜的操作。

	mi = countries_reviewed.index
	type(mi) # 運算結果為: pandas.core.indexes.multi.MultiIndex

但通常我會直接使用 reset_index() 這方法對現有的表格重製為比較單純的索引值。

countries_reviewed.reset_index()

運算結果

	country	province	len
0	Argentina	Mendoza Province	3264
1	Argentina	Other	536
2	Armenia	Armenia	2
3	Australia	Australia Other	245
4	Australia	New South Wales	85

# Sorting

使用 sort_values() 可針對特定數值進行排列。(預設由小到大)

	countries_reviewed = countries_reviewed.reset_index()
	countries_reviewed.sort_values(by='len').head(n=10)

運算結果

	index	country	province	len
179	179	Greece	Muscat of Kefallonian	1
192	192	Greece	Sterea Ellada	1
194	194	Greece	Thraki	1
354	354	South Africa	Paardeberg	1
40	40	Brazil	Serra do Sudeste	1
114	114	Egypt	Egypt	1
316	316	Serbia	Pocerina	1
112	112	Cyprus	Pitsilia Mountains	1
110	110	Cyprus	Lemesos	1
301	301	Portugal	Vinho da Mesa	1

將 sort_value() 中的 ascending 調整為 False ，數值即可有大排到小。

countries_reviewed.sort_values(by='len', ascending=False).head(n=10)

運算結果

	level_0	index	country	province	len
392	392	392	US	California	36247
415	415	415	US	Washington	8639
118	118	118	France	Bordeaux	5941
227	227	227	Italy	Tuscany	5897
409	409	409	US	Oregon	5373
119	119	119	France	Burgundy	3980
375	375	375	Spain	Northern Spain	3851
224	224	224	Italy	Piedmont	3729
0	0	0	Argentina	Mendoza Province	3264
228	228	228	Italy	Veneto	2716

sort_index() 可依照 index 做排序 (非常白話)

countries_reviewed.sort_index()

運算結果

	level_0	index	country	province	len
0	0	0	Argentina	Mendoza Province	3264
1	1	1	Argentina	Other	536
2	2	2	Armenia	Armenia	2
3	3	3	Australia	Australia Other	245
4	4	4	Australia	New South Wales	85
5	5	5	Australia	South Australia	1349
6	6	6	Australia	Tasmania	42
7	7	7	Australia	Victoria	322
8	8	8	Australia	Western Australia	286
9	9	9	Austria	Austria	26

排序數值也能對字串做排序，其順序參照 ASCII Code，並且排序的參照也可以不只一個。
如下所示：我們先依照國家名稱進行 (由小到大的) 排序，若國家名稱相同，則依照資料總數 len (由小到大的) 排序

countries_reviewed.sort_values(by=['country', 'len'])

運算結果

	level_0	index	country	province	len
1	1	1	Argentina	Other	536
0	0	0	Argentina	Mendoza Province	3264
2	2	2	Armenia	Armenia	2
6	6	6	Australia	Tasmania	42
4	4	4	Australia	New South Wales	85
3	3	3	Australia	Australia Other	245
8	8	8	Australia	Western Australia	286
7	7	7	Australia	Victoria	322
5	5	5	Australia	South Australia	1349
20	20	20	Austria	Südburgenland	2

# Chapter 5

# Dtypes

dtype 可以取得特定 column 的資料型態

	#查看 price 的資料型態
	reviews.price.dtype

運算結果

dtype('float64')

當然的， dtype 也可以取得該 DataFrame 中所有 column 的資料型態

reviews.dtypes

運算結果

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

字串組成的那整欄會被視為 object ，而不是 stringㄅ

使用 astype() 可以強制轉換型態。
例如：將 points 原有的型態 int64 轉成 float64 。

reviews.points.astype('float64')

運算結果

0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

# Missing data

NaN ，全名: Not a Number ，表示該筆資料有缺失，在資料分析時經常會見到此情形。
isnull() 可以判斷該筆資料是否為空資料 (回傳布林值)，搭配上 [] 就能過濾出有空資料的該列。

reviews[pd.isnull(reviews.country)].head()

運算結果

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
913	nan	Amber in color, this wine has aromas of peach jam and orange blossom. Flavors are a mélange of mellow fruit and savory notes, in particular apricot, lemon curd, quince and bitter almond. Despite strong tannins, there are moments of brightness across the entire palate that linger into the finish.	Asureti Valley	87	30	nan	nan	nan	Mike DeSimone	@worldwineguys	Gotsa Family Wines 2014 Asureti Valley Chinuri	Chinuri	Gotsa Family Wines
3131	nan	Soft, fruity and juicy, this is a pleasant, simple wine that is a blend from different European countries. It is balanced and very drinkable for parties or barbecues. Screwcap.	Partager	83	nan	nan	nan	nan	Roger Voss	@vossroger	Barton & Guestier NV Partager Red	Red Blend	Barton & Guestier
4243	nan	Violet-red in color, this semisweet wine has a nose of blackberry and vanilla. It is soft on entry, with flavors of black cherry, blueberry and creme brûlée. It is smooth on the palate, while the blueberry pie finish is brightened up by a brief but not unwelcome tart note.	Red Naturally Semi-Sweet	88	18	nan	nan	nan	Mike DeSimone	@worldwineguys	Kakhetia Traditional Winemaking 2012 Red Naturally Semi-Sweet Ojaleshi	Ojaleshi	Kakhetia Traditional Winemaking
9509	nan	This mouthwatering blend starts with a nose of fresh mint, pineapple and mango, followed by flavors of orange peel, peach and lemon. Fresh and fruity with a mineral backbone, the wine finishes on an earthy spin. Pair with grilled sardines or poultry dishes.	Theopetra Malagouzia-Assyrtiko	92	28	nan	nan	nan	Susan Kostrzewa	@suskostrzewa	Tsililis 2015 Theopetra Malagouzia-Assyrtiko White	White Blend	Tsililis
9750	nan	This orange-style wine has a cloudy yellow-gold appearance because it spent 30 days on the skins and 100 days on the lees. It has aromas and flavors of aged apple cider and tart apple pie with an astringent finish. This style is not for everyone, but for those who appreciate orange wine made from Chardonnay, this is a solid example.	Orange Nikolaevo Vineyard	89	28	nan	nan	nan	Jeff Jenssen	@worldwineguys	Ross-idi 2015 Orange Nikolaevo Vineyard Chardonnay	Chardonnay	Ross-idi

fillna() 可以將空資料 NaN 改成指定的字串或數值，更方便我們進行操作:

reviews.region_2.fillna("Unknown")

運算結果

0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

replace(A,B) 將所有的 A 取代為 B (就字面上的意思):

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

運算結果

0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

# Chapter 6

# Renaming

column 的名稱可以透過 rename() 這方法進行改變:

reviews.rename(columns={'points': 'score'}).head(n=1)

運算結果

	country	description	designation	score	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.	Vulkà Bianco	87	nan	Sicily & Sardinia	Etna	nan	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia

index 的名稱也可以使用 rename() 改動:

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}).head(n=3)

運算結果

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
firstEntry	Italy	Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.	Vulkà Bianco	87	nan	Sicily & Sardinia	Etna	nan	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
secondEntry	Portugal	This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.	Avidagos	87	15	Douro	nan	nan	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.	nan	87	14	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm

大多情況下，我會使用 set_index() 就是了

每一行與列的索引值也可以有他們各自的名稱，可以透過 rename_axis() 改動:

	df = pd.DataFrame({"num_legs": [4, 4, 2], #運算結果 1
	"num_arms": [0, 0, 2]},
	["dog", "cat", "monkey"])
	df = df.rename_axis("animal", axis="rows") #運算結果 2
	df = df.rename_axis("limbs", axis="columns") #運算結果 3

運算結果1

        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2

運算結果2

        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

運算結果3

limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

# Combining

如果我們想要將兩個表格合併在一起，使用 concat() 就可以簡單達成：(此資料會接續在一起，無論是否重複)

	df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
	df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
	pd.concat([df1, df2])

運算結果

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

倘若兩個表格在同一欄中有數個相同的數值時，我們可以使用 join() 合併表格，此表格會更加清晰。

	df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
	'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

	other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
	'B': ['B0', 'B1', 'B2']})
	#lsuffix 代表 df 原有的數值後綴，rsuffix 代表 other 的數值後綴
	result = df.join(other, lsuffix='_caller', rsuffix='_other')

運算結果 - df

  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

運算結果 - other

  key   B
0  K0  B0
1  K1  B1
2  K2  B2

運算結果 - result

  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

如果想要使用特定的索引值，可以對兩者皆使用 set_index() 函式再 join() 即可。

df.set_index('key').join(other.set_index('key'))

運算結果

      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

# Chapter 4

# Groupwise analysis

# Multi-indexes

# Sorting

# Chapter 5

# Dtypes

# Missing data

# Chapter 6

# Renaming

# Combining

Pandas 資料分析(上)

Chapter 5 Techniques of Integrals(1/3)