• Ch.4 Grouping and Sorting
  • Ch.5 Data Types and Missing Values
  • Ch.6 Renaming and Combining

# Chapter 4

# Groupwise analysis

使用 groupby() 方法可以將資料依照自己要的 column 分組,它會將類別相同的數值進行合併並一同分析。
這邊我們使用 count() 統計各個 points 出現的數量:

reviews.groupby('points').points.count()
運算結果
points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

除此之外,我們也可以使用 describe() 中用過的所有函式對數值進行分析。
例如:我想知道各項的 points 中,最便宜的價格為多少。

reviews.groupby('points').price.min() #運算結果見註解
運算結果
points
80      5.0 #所有80分的酒中,最低價的是5.0元
81      5.0
       ... 
99     44.0
100    80.0 #所有100分的酒中,最低價的是80.0元
Name: price, Length: 21, dtype: float64

使用 groupby() 建立的表格數值本身就已經有對應上,因此,我們可以直接使用 apply() 對資料做映射處理:

reviews.groupby('winery').apply(lambda df: df.title.iloc[0])
運算結果
winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object

groupby() 也可以同時對多 column 做分組分析 (也就是 Multi-indexes )
例如:列出每一個國家與其省份分數最高的酒類資料。

reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()]).head()

idmax() 會回傳第一個最大值的索引值

運算結果
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
('Argentina', 'Mendoza Province')ArgentinaIf the color doesn't tell the full story, the nose will: it's impeccably perfumed, round and sweet, with amazing fruit aromas and a toasty background. The feel is as lush and rich as they come, and there's outstanding purity to the lusty black fruit flavors. A masher with elegance. Maybe the best Malbec Argentina has ever produced. Drink from 2011–15.Nicasia Vineyard97120Mendoza ProvinceMendozananMichael Schachner@wineschachBodega Catena Zapata 2006 Nicasia Vineyard Malbec (Mendoza)MalbecBodega Catena Zapata
('Argentina', 'Other')ArgentinaTake note, this could be the best wine Colomé has ever bottled. Cedary oak aromas grace ripe, almost desiccated wild berry notes that are slightly herbal and sauvage. A well-layered mouthfeel is ideal in power and precision. Requisite blackberry, cassis, toast and peppery flavors finish dry and long, with little heat or burn. Drink through 2021.Reserva9590OtherSaltananMichael Schachner@wineschachColomé 2010 Reserva Malbec (Salta)MalbecColomé
('Armenia', 'Armenia')ArmeniaDeep salmon in color, this wine offers a bouquet of cantaloupe and nectarine. It is bright on entry, with full mouthfeel and flavors of pineapple, grapefruit and white peach that culminate in a bright citrus finish.Estate Bottled8815ArmenianannanMike DeSimone@worldwineguysVan Ardi 2015 Estate Bottled Rosé (Armenia)RoséVan Ardi
('Australia', 'Australia Other')AustraliaWrites the book on how to make a wine filled with gigantic jammy fruit that's nonetheless dry, balanced and harmonious. The Cabernet contributes blackcurrants and structural tannins. The Syrah brings a peppery richness, while Merlot adds soft fruitiness. You can actually taste each varietal, yet the wine hangs together, dry and rich. Best Buy.Sarah's Blend9315Australia OtherSouth Eastern AustralianannannanMarquis Philips 2000 Sarah's Blend Red (South Eastern Australia)Red BlendMarquis Philips
('Australia', 'New South Wales')AustraliaDe Bortoli's Noble One is as good as ever in 2007, offering intense botrytis, dried apricots, honey and bergamot in a sweetly unctuous style. Think of a Sauternes, but with more sugar and less alcohol. It should age well for 10 or more years.Noble One Bortytis9432New South WalesNew South WalesnanJoe Czerwinski@JoeCzDe Bortoli 2007 Noble One Bortytis Semillon (New South Wales)SémillonDe Bortoli

我們可以使用 agg() ,同時對一組數據做不同的函式比較:

# 同時對不同 country 的 price 分別做分析 (資料數,最小值,最大值)
reviews.groupby(['country']).price.agg([len, min, max]).head()
運算結果
countrylenminmax
Argentina38004230
Armenia21415
Australia23295850
Austria334571100
Bosnia and Herzegovina21213

# Multi-indexes

之前我們就稍有提到,我們建立的群組也可以對多個 column 進行描述與運算。
這裡舉個簡單的例子:我想知道各國家的各省份各有多少筆資料:

countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.head()
運算結果
len
('Argentina', 'Mendoza Province')3264
('Argentina', 'Other')536
('Armenia', 'Armenia')2
('Australia', 'Australia Other')245
('Australia', 'New South Wales')85

使用多個 column 資料所建立的表格,其索引值並非傳統的 Index ,而是 padas 內建的 MultiIndex
pandas 有特定的方法可以對 MutiIndex 做出更複雜的操作。

mi = countries_reviewed.index
type(mi) # 運算結果為: pandas.core.indexes.multi.MultiIndex

但通常我會直接使用 reset_index() 這方法對現有的表格重製為比較單純的索引值。

countries_reviewed.reset_index()
運算結果
countryprovincelen
0ArgentinaMendoza Province3264
1ArgentinaOther536
2ArmeniaArmenia2
3AustraliaAustralia Other245
4AustraliaNew South Wales85

# Sorting

使用 sort_values() 可針對特定數值進行排列。(預設由小到大)

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len').head(n=10)
運算結果
indexcountryprovincelen
179179GreeceMuscat of Kefallonian1
192192GreeceSterea Ellada1
194194GreeceThraki1
354354South AfricaPaardeberg1
4040BrazilSerra do Sudeste1
114114EgyptEgypt1
316316SerbiaPocerina1
112112CyprusPitsilia Mountains1
110110CyprusLemesos1
301301PortugalVinho da Mesa1

sort_value() 中的 ascending 調整為 False ,數值即可有大排到小。

countries_reviewed.sort_values(by='len', ascending=False).head(n=10)
運算結果
level_0indexcountryprovincelen
392392392USCalifornia36247
415415415USWashington8639
118118118FranceBordeaux5941
227227227ItalyTuscany5897
409409409USOregon5373
119119119FranceBurgundy3980
375375375SpainNorthern Spain3851
224224224ItalyPiedmont3729
000ArgentinaMendoza Province3264
228228228ItalyVeneto2716

sort_index() 可依照 index 做排序 (非常白話)

countries_reviewed.sort_index()
運算結果
level_0indexcountryprovincelen
000ArgentinaMendoza Province3264
111ArgentinaOther536
222ArmeniaArmenia2
333AustraliaAustralia Other245
444AustraliaNew South Wales85
555AustraliaSouth Australia1349
666AustraliaTasmania42
777AustraliaVictoria322
888AustraliaWestern Australia286
999AustriaAustria26

排序數值也能對字串做排序,其順序參照 ASCII Code,並且排序的參照也可以不只一個。
如下所示:我們先依照國家名稱進行 (由小到大的) 排序,若國家名稱相同,則依照資料總數 len (由小到大的) 排序

countries_reviewed.sort_values(by=['country', 'len'])
運算結果
level_0indexcountryprovincelen
111ArgentinaOther536
000ArgentinaMendoza Province3264
222ArmeniaArmenia2
666AustraliaTasmania42
444AustraliaNew South Wales85
333AustraliaAustralia Other245
888AustraliaWestern Australia286
777AustraliaVictoria322
555AustraliaSouth Australia1349
202020AustriaSüdburgenland2

# Chapter 5

# Dtypes

dtype 可以取得特定 column 的資料型態

#查看 price 的資料型態
reviews.price.dtype
運算結果
dtype('float64')

當然的, dtype 也可以取得該 DataFrame 中所有 column 的資料型態

reviews.dtypes
運算結果
country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

字串組成的那整欄會被視為 object ,而不是 stringㄅ

使用 astype() 可以強制轉換型態。
例如:將 points 原有的型態 int64 轉成 float64

reviews.points.astype('float64')
運算結果
0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

# Missing data

NaN ,全名: Not a Number ,表示該筆資料有缺失,在資料分析時經常會見到此情形。
isnull() 可以判斷該筆資料是否為空資料 (回傳布林值),搭配上 [] 就能過濾出有空資料的該列。

reviews[pd.isnull(reviews.country)].head()
運算結果
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
913nanAmber in color, this wine has aromas of peach jam and orange blossom. Flavors are a mélange of mellow fruit and savory notes, in particular apricot, lemon curd, quince and bitter almond. Despite strong tannins, there are moments of brightness across the entire palate that linger into the finish.Asureti Valley8730nannannanMike DeSimone@worldwineguysGotsa Family Wines 2014 Asureti Valley ChinuriChinuriGotsa Family Wines
3131nanSoft, fruity and juicy, this is a pleasant, simple wine that is a blend from different European countries. It is balanced and very drinkable for parties or barbecues. Screwcap.Partager83nannannannanRoger Voss@vossrogerBarton & Guestier NV Partager RedRed BlendBarton & Guestier
4243nanViolet-red in color, this semisweet wine has a nose of blackberry and vanilla. It is soft on entry, with flavors of black cherry, blueberry and creme brûlée. It is smooth on the palate, while the blueberry pie finish is brightened up by a brief but not unwelcome tart note.Red Naturally Semi-Sweet8818nannannanMike DeSimone@worldwineguysKakhetia Traditional Winemaking 2012 Red Naturally Semi-Sweet OjaleshiOjaleshiKakhetia Traditional Winemaking
9509nanThis mouthwatering blend starts with a nose of fresh mint, pineapple and mango, followed by flavors of orange peel, peach and lemon. Fresh and fruity with a mineral backbone, the wine finishes on an earthy spin. Pair with grilled sardines or poultry dishes.Theopetra Malagouzia-Assyrtiko9228nannannanSusan Kostrzewa@suskostrzewaTsililis 2015 Theopetra Malagouzia-Assyrtiko WhiteWhite BlendTsililis
9750nanThis orange-style wine has a cloudy yellow-gold appearance because it spent 30 days on the skins and 100 days on the lees. It has aromas and flavors of aged apple cider and tart apple pie with an astringent finish. This style is not for everyone, but for those who appreciate orange wine made from Chardonnay, this is a solid example.Orange Nikolaevo Vineyard8928nannannanJeff Jenssen@worldwineguysRoss-idi 2015 Orange Nikolaevo Vineyard ChardonnayChardonnayRoss-idi

fillna() 可以將空資料 NaN 改成指定的字串或數值,更方便我們進行操作:

reviews.region_2.fillna("Unknown")
運算結果
0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

replace(A,B) 將所有的 A 取代為 B (就字面上的意思):

reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
運算結果
0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

# Chapter 6

# Renaming

column 的名稱可以透過 rename() 這方法進行改變:

reviews.rename(columns={'points': 'score'}).head(n=1)
運算結果
countrydescriptiondesignationscorepriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
0ItalyAromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.Vulkà Bianco87nanSicily & SardiniaEtnananKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia

index 的名稱也可以使用 rename() 改動:

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}).head(n=3)
運算結果
countrydescriptiondesignationpointspriceprovinceregion_1region_2taster_nametaster_twitter_handletitlevarietywinery
firstEntryItalyAromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.Vulkà Bianco87nanSicily & SardiniaEtnananKerin O’Keefe@kerinokeefeNicosia 2013 Vulkà Bianco (Etna)White BlendNicosia
secondEntryPortugalThis is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.Avidagos8715DouronannanRoger Voss@vossrogerQuinta dos Avidagos 2011 Avidagos Red (Douro)Portuguese RedQuinta dos Avidagos
2USTart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented.nan8714OregonWillamette ValleyWillamette ValleyPaul Gregutt@paulgwineRainstorm 2013 Pinot Gris (Willamette Valley)Pinot GrisRainstorm

大多情況下,我會使用 set_index() 就是了

每一行與列的索引值也可以有他們各自的名稱,可以透過 rename_axis() 改動:

df = pd.DataFrame({"num_legs": [4, 4, 2], #運算結果 1
                   "num_arms": [0, 0, 2]},
                  ["dog", "cat", "monkey"])
df = df.rename_axis("animal", axis="rows") #運算結果 2
df = df.rename_axis("limbs", axis="columns") #運算結果 3
運算結果1
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
運算結果2
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
運算結果3
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

# Combining

如果我們想要將兩個表格合併在一起,使用 concat() 就可以簡單達成:(此資料會接續在一起,無論是否重複)

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
pd.concat([df1, df2])
運算結果
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

倘若兩個表格在同一欄中有數個相同的數值時,我們可以使用 join() 合併表格,此表格會更加清晰。

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
#lsuffix 代表 df 原有的數值後綴,rsuffix 代表 other 的數值後綴
result = df.join(other, lsuffix='_caller', rsuffix='_other')
運算結果 - df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
運算結果 - other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
運算結果 - result
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

如果想要使用特定的索引值,可以對兩者皆使用 set_index() 函式再 join() 即可。

df.set_index('key').join(other.set_index('key'))
運算結果
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN
更新於 閱讀次數

用實際行動犒賞爆肝的我😀

Zrn Ye LinePay

LinePay