Pandas Data Analysis (Part 1) - Python Basics - AI | Zrn Code = For the dream, never stop

Ch.1 Creating, Reading and Writing
Ch.2 Indexing, Selecting & Assigning
Ch.3 Summary Functions and Maps

# Intro

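pandas combines the strengths of NumPy with the data-manipulation capabilities of spreadsheets and relational databases (SQL). It lets you reshape, slice, and aggregate data, so you can quickly uncover the information in a dataset and what it means.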

Using pandas in Python:

import pandas as pd

Install pandas with pip: https://pypi.org/project/pandas/

# Chapter 1

# Creating data

pandas has two core objects: DataFrame and Series.

# DataFrame

A DataFrame is a two-dimensional table of data. We can create one from a dictionary:

pd.DataFrame({"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})

Output

	animal_1	animal_2
0	elk	dog
1	pig	quetzal

	computers = {
		'notebook': [5, 6, 3],
		'desktop': [1, 2, 4]
	}
	pd.DataFrame(computers, index=["Tom", "Bob", "Elise"]) # use custom index labels instead of 0, 1, 2

Output

	notebook	desktop
Tom	5	1
Bob	6	2
Elise	3	4
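As a small sketch of how the dictionary maps onto the table (the variable name `df` is only illustrative), each key becomes a column and the `index` argument supplies the row labels:

```python
import pandas as pd

computers = {"notebook": [5, 6, 3], "desktop": [1, 2, 4]}
df = pd.DataFrame(computers, index=["Tom", "Bob", "Elise"])

# Dictionary keys become column names; the index argument becomes the row labels.
print(df.columns.tolist())  # ['notebook', 'desktop']
print(df.index.tolist())    # ['Tom', 'Bob', 'Elise']
print(df.shape)             # (3, 2)
```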

# Series

A Series is a simple one-dimensional data structure, similar to a single column in a spreadsheet. We can build a Series from a list, a NumPy array, or a dictionary:

	# Build a Series from a list
	pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

	# Build a Series from a dictionary (the keys become the index)
	pd.Series({'2015 Sales': 30,'2016 Sales': 35,'2017 Sales': 40}, name='Product A')

Output

	Product A
2015 Sales	30
2016 Sales	35
2017 Sales	40
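The NumPy array case is not shown above; a minimal sketch, assuming the same sales figures, could look like this:

```python
import numpy as np
import pandas as pd

# Build the same Series from a NumPy array instead of a list or dictionary.
sales = pd.Series(
    np.array([30, 35, 40]),
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
    name="Product A",
)

print(sales["2016 Sales"])  # 35
print(sales.name)           # Product A
```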

# Reading data files

We can read existing data by importing a .csv file:

wine_reviews = pd.read_csv("../input/wine/wine_reviews.csv")

What is a CSV file?

CSV stands for Comma-Separated Values: a plain-text file format that stores data in a structured, tabular form.
See Wikipedia for more details.

	wine_reviews.shape # (rows, columns)
	wine_reviews.size # total number of values (rows x columns)

	wine_reviews.head() # shows the first 5 rows by default
	wine_reviews.tail(n=3) # shows the last 3 rows
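To try these calls without downloading the dataset, here is a self-contained sketch that reads a tiny, made-up CSV from an in-memory string (the three rows and the name `sample` are assumptions for illustration, not the real wine data):

```python
import io
import pandas as pd

# A made-up three-row CSV; the empty price field becomes NaN when parsed.
csv_text = "country,points,price\nItaly,87,\nPortugal,87,15.0\nUS,90,14.0\n"
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)      # (3, 3)
print(sample.size)       # 9
print(sample.head(2))    # first 2 rows
print(sample.tail(n=1))  # last row
```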

# Saving DataFrame

	mid_term_marks = {"Student": ["Kamal", "Arun", "David", "Thomas"],
	"Economics": [10, 8, 6, 5],
	"Fine Arts": [7, 8, 5, 9],
	"Mathematics": [7, 3, 5, 8]}

	mid_term_marks_df = pd.DataFrame(mid_term_marks)
	mid_term_marks_df.to_csv("midterm.csv")
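Note that `to_csv()` writes the index as an extra first column by default. The sketch below, using a shortened version of the marks table, compares the default with `index=False` and reads the result back; calling `to_csv()` without a path returns the CSV text as a string, which keeps the example free of files:

```python
import io
import pandas as pd

marks = pd.DataFrame({"Student": ["Kamal", "Arun"], "Economics": [10, 8]})

with_index = marks.to_csv()                # index written as an unnamed first column
without_index = marks.to_csv(index=False)  # only the real columns

print(with_index.splitlines()[0])     # ,Student,Economics
print(without_index.splitlines()[0])  # Student,Economics

# Round trip: read the index-free CSV back into a DataFrame.
restored = pd.read_csv(io.StringIO(without_index))
```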

# Chapter 2

Data source for wine_reviews: https://www.kaggle.com/datasets/zynicide/wine-reviews (the examples below refer to this DataFrame as reviews)

# Indexing in pandas

# `Index-based` selection

.iloc selects data from a DataFrame by the numeric position of its rows and columns.

reviews.iloc[0] # all data in row 0 (the first row)

Output

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

Both .iloc and .loc follow the [row-first, column-second] rule:

	reviews.iloc[:, 0] # every row, column 0
	reviews.iloc[0:3, 0] # rows 0 to 2, column 0
	reviews.iloc[[0, 1, 2], 0] # rows 0, 1 and 2, column 0
	reviews.iloc[-5:] # all columns of the last 5 rows

# `Label-based` selection

Here we use .loc to look up data by label:

reviews.loc[0, 'country'] # row labelled 0, 'country' column

Output

'Italy'

.iloc is often the more convenient of the two, but when the dataset has meaningful labels, .loc tends to be clearer and easier to read:

	# Show the 'taster_name', 'taster_twitter_handle' and 'points' columns
	reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Output

	taster_name	taster_twitter_handle	points
0	Kerin O’Keefe	@kerinokeefe	87
1	Roger Voss	@vossroger	87
...	...	...	...
129969	Roger Voss	@vossroger	90
129970	Roger Voss	@vossroger	90
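One difference between the two indexers is easy to miss: .iloc slices exclude the end position, as Python slices do, while .loc slices include the end label. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

df = pd.DataFrame({"points": [87, 87, 90, 91]}, index=[0, 1, 2, 3])

by_position = df.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[0:2]      # labels 0, 1 and 2; the end label is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```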

# Manipulating the index

We can use .set_index() to make one of the DataFrame's columns its index:

reviews.set_index("title")
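A short sketch of the effect, using a made-up two-row frame (`wines` and the titles are placeholders): `set_index()` returns a new DataFrame whose rows can be looked up by the chosen column, and leaves the original unchanged.

```python
import pandas as pd

wines = pd.DataFrame({"title": ["Wine A", "Wine B"], "points": [87, 90]})
indexed = wines.set_index("title")  # returns a new DataFrame

# Rows can now be selected by title with .loc.
print(indexed.loc["Wine B", "points"])  # 90

# The original keeps 'title' as an ordinary column.
print(list(wines.columns))    # ['title', 'points']
print(list(indexed.columns))  # ['points']
```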

# Conditional selection

We can use .loc to filter the data by a condition:

	# Show every row whose 'country' column is 'Italy'
	reviews.loc[reviews.country == 'Italy']

Output

country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here's a bright, informal red that opens with ...	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129961	Italy	Intense aromas of wild cherry, baking spice, t...	NaN	90	30.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	COS 2013 Frappato (Sicilia)	Frappato	COS
129962	Italy	Blackberry, cassis, grilled herb and toasted a...	Sàgana Tenuta San Giacomo	90	40.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...	Nero d'Avola	Cusumano

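Besides ==, other comparisons can be combined with the boolean operators & (and) and | (or):

	reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)] # Italy AND 90 points or more
	reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)] # Italy OR 90 points or more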

We can also use .isin() to filter on a list of values:

	reviews.loc[reviews.country.isin(['Italy', 'France'])]
	# equivalent to
	reviews.loc[(reviews.country == 'Italy') | (reviews.country == 'France')]

.notnull() filters out missing values (for example, NaN):

reviews.loc[reviews.price.notnull()] # rows whose 'price' is not empty
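The filters above all work the same way: the condition produces a boolean Series, and .loc keeps the rows where it is True. The following sketch runs them on a made-up four-row frame (`wines` and its values are assumptions for illustration):

```python
import pandas as pd

wines = pd.DataFrame({
    "country": ["Italy", "France", "Italy", "US"],
    "points": [87, 92, 91, 95],
    "price": [None, 30.0, 45.0, None],
})

is_italy = wines.country == "Italy"  # boolean Series, one value per row
is_high = wines.points >= 90

both = wines.loc[is_italy & is_high]     # Italy AND 90 points or more
either = wines.loc[is_italy | is_high]   # Italy OR 90 points or more
listed = wines.loc[wines.country.isin(["Italy", "France"])]
priced = wines.loc[wines.price.notnull()]  # rows with a known price

print(both.index.tolist())    # [2]
print(either.index.tolist())  # [0, 1, 2, 3]
print(listed.index.tolist())  # [0, 1, 2]
print(priced.index.tolist())  # [1, 2]
```

The parentheses around each comparison in the combined form matter, because & and | bind more tightly than == and >= in Python.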

# Assigning data

Assigning data to a DataFrame is straightforward. A single constant value is copied to every row:

	reviews['critic'] = 'everyone'
	reviews['critic']

Output

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

We can also assign an iterable of values, one per row:

	reviews['index_backwards'] = range(len(reviews), 0, -1)
	reviews['index_backwards']

Output

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
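A small sketch of both kinds of assignment on a made-up three-row frame. It also shows that a list whose length does not match the number of rows is rejected with a ValueError (the column name `bad` is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"points": [87, 90, 91]})

df["critic"] = "everyone"                      # scalar: copied to every row
df["index_backwards"] = range(len(df), 0, -1)  # iterable: one value per row

try:
    df["bad"] = [1, 2]  # only 2 values for 3 rows
except ValueError as err:
    print("rejected:", err)

print(df["index_backwards"].tolist())  # [3, 2, 1]
```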

# Chapter 3

# Summary functions

describe() produces a summary of statistics for a column, such as the mean, the median (shown as 50%), and the quartiles, ignoring all null values:

reviews.points.describe()

Output

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

	# Minimum
	reviews.points.min()
	# Maximum
	reviews.points.max()

	# Mean
	reviews.points.mean()
	# Quartiles (plus the maximum at quantile 1)
	reviews.points.quantile([0.25, 0.5, 0.75, 1])
	# Median
	reviews.points.median()

	# Variance
	reviews.points.var()
	# Standard deviation
	reviews.points.std()

reviews.taster_name.describe()
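The summary that describe() returns depends on the column's data type. The sketch below, on a made-up four-row frame, compares a numeric column with a text column that contains one missing value:

```python
import pandas as pd

df = pd.DataFrame({
    "points": [87, 90, 91, 90],
    "taster_name": ["Roger Voss", "Kerin O'Keefe", "Roger Voss", None],
})

numeric_summary = df.points.describe()
text_summary = df.taster_name.describe()

print(numeric_summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(text_summary.index.tolist())
# ['count', 'unique', 'top', 'freq']

print(text_summary["count"])  # 3, the missing name is not counted
print(text_summary["top"])    # Roger Voss
```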

# Maps

There are two common mapping methods: map() and apply().

# `map()`

Suppose we want to compute, for every value in reviews.points, its deviation from the mean:

	review_points_mean = reviews.points.mean()
	reviews.points.map(lambda p: p - review_points_mean)

# `apply()`

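We can also use apply() to do the mapping. Unlike map(), which transforms the values of a single Series, apply() works on a whole DataFrame:

	def remean_points(row): # called once per row when axis='columns'
		row.points = row.points - review_points_mean
		return row

	reviews.apply(remean_points, axis='columns') # subtracts the mean from the 'points' column

With axis='index', the function is instead called once per column.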

apply() and map() return new, transformed data; they do not modify the original data.
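The sketch below checks this on a made-up two-row frame (`reviews_sample` is an assumed stand-in for the real dataset). The function copies the row before changing it, which is an extra precaution added here rather than something the text above requires:

```python
import pandas as pd

reviews_sample = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 91]})
mean_points = reviews_sample.points.mean()  # 89.0

# map(): transform each value of one Series.
centered = reviews_sample.points.map(lambda p: p - mean_points)

# apply(): transform each row of the DataFrame.
def remean_points(row):
    row = row.copy()  # work on a copy of the row
    row["points"] = row["points"] - mean_points
    return row

centered_df = reviews_sample.apply(remean_points, axis="columns")

print(centered.tolist())               # [-2.0, 2.0]
print(centered_df["points"].tolist())  # [-2.0, 2.0]
print(reviews_sample.points.tolist())  # [87, 91], the original is unchanged
```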

# mapping operations

pandas also lets you use basic operators directly as mapping operations:

	review_points_mean = reviews.points.mean()
	reviews.points - review_points_mean

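If the two Series have the same length and the operation is simple, pandas also understands what you mean.
For example, to express each wine's origin in the form country - region:

reviews.country + " - " + reviews.region_1

Output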

0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object
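The NaN in the output above comes from a missing region: when either side of the operation is missing, the result for that row is missing too. A minimal sketch with two made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "region_1": ["Etna", None],  # the second row has no region
})

combined = df.country + " - " + df.region_1

print(combined[0])           # Italy - Etna
print(pd.isna(combined[1]))  # True, the missing region propagates
```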
