
Stock Fraud Detection Using Peer Group Analysis
Financial Engineering Networking Paper Reading: Stock fraud detection using peer group analysis¶
These are the slides and code from my presentation at a financial engineering networking paper-reading session. I wrote the code myself while working through the paper.
- Paper: https://www.sciencedirect.com/science/article/pii/S0957417412002692
- Slides: https://www2.slideshare.net/ParkJunPyo1/stock-fraud-detection-using-peer-group-analysis
by JunPyo Park
Import Modules¶
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams["font.family"] = 'GULIM'
plt.rcParams["axes.grid"] = True
plt.rcParams["figure.figsize"] = (12,6)
plt.rcParams["axes.formatter.useoffset"] = False
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams["axes.formatter.limits"] = -10000, 10000
Read Data¶
# If the data is not available locally, install FinanceDataReader and download it with the code below
import FinanceDataReader as fdr
stocks = fdr.StockListing('KOSDAQ') # KOSDAQ listings
name_code = stocks[['Name','Symbol']]
for name, code in name_code.head().values:
    print(name, code)
df_list = [fdr.DataReader(code, '2016-05-01', '2019-09-30')['Close'] for name, code in name_code.values]
df = pd.concat(df_list, axis=1)
df.columns = name_code['Name']
df = df.dropna(axis=1) # drop columns with missing values
df = pd.read_pickle('kosdaq.pkl') # load the cached data instead of downloading again
df.head()
Preprocessing¶
The paper filters out, as global outliers, stocks whose minimum closing price over the analysis period is 5,000 won or more. Honestly I am not sure about this choice... why the closing price rather than market cap?? For now, let us similarly drop stocks whose minimum price is 10,000 won or more.
classify a company as a global outlier if the minimum daily closing price since 2004 is greater than m=5000
(df.min() >= 10000).sum()
df = df[df.columns[df.min() < 10000]]
df.head()
Train/Test Split¶
As in the paper, roughly 40 months of KOSDAQ data are used, with the first 18 months (2016-05-01 ~ 2017-10-31) as the training set.
train_df = df[:'2017-10-31']
len(train_df)
train_df = train_df.iloc[3:] # drop 3 rows so the length becomes a multiple of 5
len(train_df)
Test set from November 2017 onward
test_df = df['2017-10-31':]
len(test_df)
Smoothing the dataFrame¶
Smooth by averaging over non-overlapping 5-day windows.
def smooth_df(df):
    D = len(df) # 365
    N = int(D/5) # 73
    window_length = int(D/N) # 5
    print(D, N, window_length)
    df_smoothed = pd.DataFrame(columns=df.columns, index=range(N))
    for i in range(N):
        start_iloc = i * window_length
        end_iloc = start_iloc + window_length
        y_i = df.iloc[start_iloc:end_iloc].mean(axis=0)
        df_smoothed.iloc[i] = y_i
    return df_smoothed
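As a cross-check, the same non-overlapping window average can be written with a pandas `groupby`. This is a sketch under the assumption that row positions (not dates) define the windows; `smooth_df_groupby` is a hypothetical name, not part of the notebook:

```python
import numpy as np
import pandas as pd

def smooth_df_groupby(df, window_length=5):
    """Average non-overlapping blocks of `window_length` rows.

    Trailing rows that do not fill a whole window are dropped,
    matching the int(D/5) truncation in smooth_df above.
    """
    n_windows = len(df) // window_length
    trimmed = df.iloc[:n_windows * window_length]
    groups = np.arange(len(trimmed)) // window_length
    return trimmed.groupby(groups).mean()

# tiny check: windows [0..4] and [5..9] average to 2.0 and 7.0
demo = pd.DataFrame({'A': range(10)})
smoothed = smooth_df_groupby(demo)
```

On the training frame this should agree with `smooth_df` up to the integer index on the result.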
train_df
train_df_smoothed = smooth_df(train_df)
train_df_smoothed
Normalize¶
def normalize_df(df):
    return (df - df.mean())/df.std()
train_df_normalized = normalize_df(train_df_smoothed)
train_df_normalized
Target Selection¶
target1 : Nature Cell (네이처셀)
There was a stock price manipulation issue during the test period
https://www.hankyung.com/finance/article/2018061279391
df['네이처셀'].plot(title='네이처셀 종가'); # target1
span_start = test_df.index[0]
span_end = test_df.index[-1]
plt.axvspan(span_start, span_end, facecolor='gray', alpha=0.3);
target2 : Jeil Jesteel (제일제강)
There was a "treasure ship"-related stock price manipulation issue during the test period
https://news.joins.com/article/22919304
df['제일제강'].plot(title='제일제강 종가'); # target2
span_start = test_df.index[0]
span_end = test_df.index[-1]
plt.axvspan(span_start, span_end, facecolor='gray', alpha=0.3);
Proceeding with Nature Cell (네이처셀)
target_series = train_df_normalized['네이처셀'] # X_i
Peer Group Selection over the Training Period¶
def calc_euc_dist(series): # Euclidean distance to the target
    return np.sqrt(((target_series - series) ** 2).sum())
dist = train_df_normalized.apply(lambda x : calc_euc_dist(x), axis=0)
dist.head()
dist_sorted = dist.sort_values()[1:] # exclude the target itself
dist_sorted
dist_sorted.describe()
Select the k(=50) Nearest Peers¶
k = 50
top_k_peers_distance = dist_sorted.head(k)
top_k_peers_distance
# target
target_series.plot();
# Peer Group
top_k_peers = top_k_peers_distance.index
train_df_normalized[top_k_peers].plot(figsize=(12,7));
Computing the Peer Group Summary¶
- Simple Mean : plain average of the peers
- Weighted Mean : uses the weighting scheme newly proposed in the paper
top_k_values = train_df_normalized[top_k_peers] # y_i_pi(j)
top_k_values
gamma = 10
prox = np.exp(-gamma * top_k_peers_distance)
weight = prox / prox.sum()
weight
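Viewed another way, the prox-then-normalize step is exactly a softmax over the negated, scaled distances. A minimal numeric sketch (the toy `distances` array is a stand-in for `top_k_peers_distance`, not real data):

```python
import numpy as np

gamma = 10
distances = np.array([0.5, 1.0, 2.0])  # stand-in for top_k_peers_distance

prox = np.exp(-gamma * distances)
weight = prox / prox.sum()  # softmax of -gamma * distances

# the weights form a probability distribution, and nearer peers dominate sharply
assert np.isclose(weight.sum(), 1.0)
assert weight[0] > weight[1] > weight[2]
```

Large gamma concentrates the summary on the closest peers; gamma → 0 recovers the simple mean.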
p_i_simplemean = top_k_values.apply(lambda x : x.mean(), axis=1) # simple mean
p_i_weighted = top_k_values.apply(lambda x : x.dot(weight.T), axis=1) # weighted mean
p_i_weighted
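The row-wise `apply` above computes a plain matrix-vector product, so it can be replaced by a single `DataFrame.dot`. A sketch with toy stand-ins for `top_k_values` and `weight` (the data here is made up):

```python
import numpy as np
import pandas as pd

values = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])
weight = pd.Series([0.5, 0.3, 0.2], index=['A', 'B', 'C'])

p_apply = values.apply(lambda x: x.dot(weight), axis=1)  # notebook style
p_dot = values.dot(weight)                               # one matmul, aligned on column names

assert np.allclose(p_apply, p_dot)
```

`DataFrame.dot` aligns on column names, so it gives the same weighted mean in one call and avoids the Python-level loop hidden in `apply`.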
Peer group summary plot¶
p_i_weighted.plot(label='weighted');
p_i_simplemean.plot(label='simple');
target_series.plot();
plt.legend();
weight
Peer Group Summary Update¶
The centroid must be updated step by step as we move through the test period.
Test Set Smoothing and Normalizing¶
len(test_df)
test_df_smoothed = smooth_df(test_df) # smoothing
test_df_smoothed
# normalize using the train set's mean and std
mean = train_df_smoothed.mean()
std = train_df_smoothed.std()
test_df_normalized = (test_df_smoothed - mean)/std
test_df_normalized
test_target_series = test_df_normalized[target_series.name]
test_target_series
test_df_normalized[top_k_peers]
Update Weight¶
param_lambda = 0.2
T = len(test_df_normalized)
weight_df = pd.DataFrame(columns=top_k_peers, index=range(T))
weight_df
for t in range(T):
    sum_prox_j = 0
    x = test_target_series[t]
    prox_j = {}
    for j in top_k_peers: # j : company name
        x_pi = test_df_normalized[j][t]
        d_pi = np.abs(x - x_pi)
        prox_j[j] = np.exp(-gamma * d_pi)
        sum_prox_j += prox_j[j]
    for j in top_k_peers:
        weight[j] = (1 - param_lambda) * weight[j] + param_lambda * (prox_j[j] / sum_prox_j)
    weight_df.iloc[t] = weight
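The inner loops over peers can be collapsed into vectorized pandas operations; the recursion over t remains, since each step's weights depend on the previous step's. A sketch on toy data (variable names mirror the notebook's, but the series themselves are made up):

```python
import numpy as np
import pandas as pd

gamma, param_lambda = 10, 0.2

# toy stand-ins: 4 test periods, 3 peers
test_target = pd.Series([0.0, 0.5, 1.0, 1.5])
peers = pd.DataFrame({'A': [0.1, 0.4, 1.1, 1.4],
                      'B': [1.0, 1.0, 1.0, 1.0],
                      'C': [0.0, 0.6, 0.9, 1.6]})
weight = pd.Series(1/3, index=peers.columns)  # start from equal weights

rows = []
for t in range(len(test_target)):
    # proximity of every peer to the target at time t, in one shot
    prox = np.exp(-gamma * (test_target.iloc[t] - peers.iloc[t]).abs())
    # exponential update toward the normalized proximities
    weight = (1 - param_lambda) * weight + param_lambda * prox / prox.sum()
    rows.append(weight.copy())
weight_df = pd.DataFrame(rows).reset_index(drop=True)
```

Because both the old weights and the normalized proximities sum to 1, the updated weights stay a valid distribution at every step.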
Weight Plot¶
The weights adapt so that peers whose recent movements track the target receive larger weights, while dissimilar peers are assigned smaller ones.
weight_df.plot();
plt.legend().set_visible(False);
Update Peer Group Summary¶
Update the centroid using the computed time-varying weights.
test_p_i = (weight_df * test_df_normalized[top_k_peers]).sum(axis=1)
test_p_i
# closing price
test_df_smoothed[target_series.name].plot();
test_target_series.plot(label = 'target');
test_p_i.plot(label = 'centroid');
plt.title(target_series.name)
plt.legend();
Evaluation¶
A stock is flagged as fraudulent when it strays from the peer group centroid by more than a threshold.
The paper has labeled data on fraudulent stocks, so it evaluates the model and runs sensitivity analyses over the various parameters (I do not have such data...).
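The flagging rule described above can be sketched as follows; `flag_anomalies` and the threshold value are hypothetical choices of mine, since the notebook stops before implementing evaluation:

```python
import pandas as pd

def flag_anomalies(target, centroid, threshold=1.0):
    """Flag periods where the target strays from the peer-group
    centroid by more than `threshold` (in normalized-price units).

    `threshold` is a placeholder; the paper tunes it against
    labeled fraud cases via sensitivity analysis.
    """
    deviation = (target - centroid).abs()
    return deviation > threshold

# toy stand-ins for test_target_series and test_p_i
target = pd.Series([0.1, 0.2, 2.5, 3.0])
centroid = pd.Series([0.0, 0.1, 0.2, 0.3])
flags = flag_anomalies(target, centroid)
```

In the notebook's terms, this would be applied to `test_target_series` against `test_p_i`, flagging the smoothed periods where the two diverge.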
Results when run on other stocks