An Explanation of LightFM

01 Jun 2020 | Machine_Learning Recommendation System Paper_Review

This post covers LightFM, the hybrid matrix factorization model published by Lyst in 2015, in the following order.

1) A summary review of the paper
2) How to use the LightFM library
3) Hyperparameter tuning with HyperOpt

1. Paper Review: Metadata Embeddings for User and Item Cold-start Recommendations

1.1. Introduction

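Building a recommender system in a cold-start setting is still far from easy. Standard matrix factorization (MF) methods perform poorly here, because user and item latent vectors are very hard to estimate effectively when collaborative interaction data is sparse.

Content-based (CB) methods represent items or users through their metadata. Because this information is known in advance, recommendations can be computed even when no collaborative data exists. However, no transfer learning takes place in these models, because each user's model is estimated in isolation. As a result, CB models perform worse than MF models when collaborative data is available, and they need a large amount of data on each user.

Solving this problem mattered a great deal at Lyst, an online fashion retailer: tens of thousands of products were added every day, and more than 8 million fashion items were listed on the site. Under three difficult conditions, namely a very large catalogue, frequent arrival of new products (item cold-start), and a customer base made up largely of new users (user cold-start), the paper proposes a hybrid model called LightFM.

The model combines the strengths of content-based and collaborative filtering approaches. Its most important properties are as follows.

1) Training uses both the collaborative data and the user/item features.
2) The embedding vectors produced by LightFM carry meaningful semantic information about the features, which can be put to use in tasks such as tag recommendation.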

1.2. LightFM

The model itself is not complicated. Its most distinctive trait is that, unlike classic matrix factorization models, its structure is designed to incorporate user features and item features into training.

First, a brief note on notation.

Symbol	Description
$U$	the set of users
$I$	the set of items
$F^U$	the set of user features
$F^I$	the set of item features
$f_u$	the features of user $u$, $f_u \subset F^U$
$f_i$	the features of item $i$, $f_i \subset F^I$
$e_f^U$	the $d$-dimensional embedding vector of user feature $f$
$e_f^I$	the $d$-dimensional embedding vector of item feature $f$
$b_f^U$	the scalar bias term of user feature $f$
$b_f^I$	the scalar bias term of item feature $f$

The latent vector of user $u$ is the sum of the latent vectors of that user's features, and items are handled the same way. The bias terms are computed likewise.

$q_u = \sum_{j \in f_u}e_j^U$ $p_i = \sum_{j \in f_i}e_j^I$ $b_u = \sum_{j \in f_u}b_j^U$ $b_i = \sum_{j \in f_i}b_j^I$

The model's prediction for user $u$ and item $i$ is the dot product of their representations (latent vectors), shifted by the user and item biases and passed through a sigmoid.

$$\hat{r}_{ui} = \mathrm{sigmoid}(q_u \cdot p_i + b_u + b_i)$$
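As a rough illustration of the formulas above, the sketch below sums per-feature embeddings and biases into $q_u$, $p_i$, $b_u$, $b_i$ and then applies the sigmoid. The feature names, the 2-dimensional embeddings, and all numeric values are made up for the example; this is not LightFM's own implementation.

```python
import math

# Hypothetical 2-dimensional feature embeddings and biases (all values are made up).
user_feature_embeddings = {"country:KR": [0.1, 0.3], "gender:F": [0.2, -0.1]}
user_feature_biases = {"country:KR": 0.05, "gender:F": -0.02}
item_feature_embeddings = {"brand:A": [0.4, 0.0], "category:dress": [-0.1, 0.5]}
item_feature_biases = {"brand:A": 0.1, "category:dress": 0.0}


def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(components) for components in zip(*vectors)]


def represent(features, embeddings, biases):
    """Latent vector and bias of a user or item: sums over its features."""
    latent = sum_vectors([embeddings[f] for f in features])
    bias = sum(biases[f] for f in features)
    return latent, bias


def predict(user_features, item_features):
    """sigmoid(q_u . p_i + b_u + b_i)"""
    q_u, b_u = represent(user_features, user_feature_embeddings, user_feature_biases)
    p_i, b_i = represent(item_features, item_feature_embeddings, item_feature_biases)
    score = sum(q * p for q, p in zip(q_u, p_i)) + b_u + b_i
    return 1.0 / (1.0 + math.exp(-score))


r_hat = predict(["country:KR", "gender:F"], ["brand:A", "category:dress"])
print(round(r_hat, 4))
```

If every user and every item has exactly one feature, its own indicator, the sums collapse to a single embedding each and the model reduces to an ordinary matrix factorization model.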

The optimization objective is to maximize the likelihood of the data given the parameters:

$$L(e^U, e^I, b^U, b^I) = \prod_{(u,i) \in S^+} \hat{r}_{ui} \times \prod_{(u,i) \in S^-} (1 - \hat{r}_{ui})$$

Here $S^+$ denotes the positive interactions and $S^-$ the negative interactions.
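It is usually more convenient to work with the logarithm of this likelihood, which turns the products into sums. The rewriting below is a standard step added here for illustration; it is not quoted from the paper.

```latex
\log L(e^U, e^I, b^U, b^I)
  = \sum_{(u,i) \in S^+} \log \hat{r}_{ui}
  + \sum_{(u,i) \in S^-} \log\left(1 - \hat{r}_{ui}\right)
```

Maximizing this quantity is the same as minimizing the logistic (binary cross-entropy) loss over the positive and negative interactions.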

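The formulas alone may not make the structure of the model fully clear. The figure below should help.

The figure above uses user features as the example; the same logic applies to item features. $m$ is the number of users.

That covers the model introduced in the paper. The Experiment section is left for you to read directly; from here on we move to code.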

2. Training LightFM and Bayesian Optimization with HyperOpt

2.1. Data Preparation

The data used for training is the Goodbooks dataset, in which readers (users) rate books (items). Strictly speaking this is explicit rather than implicit feedback, which may make learning easier, but we set that aside for now. The data can be downloaded directly from here.

The files used for training are ratings.csv and books.csv, which look like this.

# ratings.csv
   user_id  book_id  rating
      1      258       5
      2     4081       4
      2      260       5
      2     9296       5
      2     2318       3

# books.csv
   book_id                      authors  average_rating        original_title
      1              Suzanne Collins            4.34      The Hunger Games
      2  J.K. Rowling, Mary GrandPré            4.44      Harry Potter ...
      3              Stephenie Meyer            3.57              Twilight
      4                   Harper Lee            4.25 To Kill a Mockingbird
      5          F. Scott Fitzgerald            3.89      The Great Gatsby

This data cannot be fed into LightFM as it is. It has to go through a somewhat tedious preprocessing step.

import pandas as pd
from lightfm.data import Dataset
from scipy.io import mmwrite

# Data Load
# ratings_source: input for build_interactions, a list of tuples
# --> [(user1, item1), (user2, item5), ... ]
# item_features_source: input for build_item_features
# --> [(item1, [feature, feature, ...]), (item2, [feature, feature, ...])]
ratings = pd.read_csv('data/ratings.csv')
ratings_source = list(zip(ratings['user_id'], ratings['book_id']))

item_meta = pd.read_csv('data/books.csv')
item_meta = item_meta[['book_id', 'authors', 'average_rating', 'original_title']]

# Each item is described by two features: its authors string and its average rating
item_features_source = [(book_id, [authors, average_rating])
                        for book_id, authors, average_rating
                        in zip(item_meta['book_id'], item_meta['authors'], item_meta['average_rating'])]

As the code shows, two iterables are needed: ratings_source and item_features_source. The former is the input to the build_interactions method of the LightFM Dataset class, and the latter is the input to build_item_features. User features are not used in this example, but they work the same way as item features, so keep that in mind.

Once these inputs are ready, create an instance of LightFM's Dataset class and call fit.

dataset = Dataset()
dataset.fit(users=ratings['user_id'].unique(),
            items=ratings['book_id'].unique(),
            item_features=item_meta[item_meta.columns[1:]].values.flatten()
            )

What matters here is that the objects passed as arguments must not contain missing values.
After that, calling the build methods completes the dataset.

interactions, weights = dataset.build_interactions(ratings_source)
item_features = dataset.build_item_features(item_features_source)

# Save
mmwrite('data/interactions.mtx', interactions)
mmwrite('data/item_features.mtx', item_features)
mmwrite('data/weights.mtx', weights)

# Split Train, Test data
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(interactions, test_percentage=0.1)
train, test = train.tocsr().tocoo(), test.tocsr().tocoo()
train_weights = train.multiply(weights).tocoo()

2.2. Hyper Parameter Optimization with HyperOpt

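hyperopt is a hyperparameter optimization library that has been in use for quite a long time. skopt is also widely used, but it is unclear whether it will continue to be updated, so this post introduces hyperopt.

First, the search space has to be defined.

from hyperopt import fmin, hp, tpe, Trials

# Define Search Space
trials = Trials()
space = [hp.choice('no_components', list(range(10, 50, 10))),
         hp.uniform('learning_rate', 0.01, 0.05)]

More details are available here. space is passed as the argument of the objective function introduced below. It does not have to be a list; a dictionary or an OrderedDict can be used where convenient.

Next, define the objective function.

# Define Objective Function
import numpy as np
from lightfm import LightFM
from lightfm.evaluation import precision_at_k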
def objective(params):
    no_components, learning_rate = params

    model = LightFM(no_components=no_components,
                    learning_schedule='adagrad',
                    loss='warp',
                    learning_rate=learning_rate,
                    random_state=0)

    model.fit(interactions=train,
              item_features=item_features,
              sample_weight=train_weights,
              epochs=3,
              verbose=False)

    test_precision = precision_at_k(model, test, k=5, item_features=item_features).mean()
    print("no_comp: {}, lrn_rate: {:.5f}, precision: {:.5f}".format(
      no_components, learning_rate, test_precision))
    # test_auc = auc_score(model, test, item_features=item_features).mean()
    output = -test_precision

    if np.abs(output+1) < 0.01 or output < -1.0:
        output = 0.0

    return output

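Normally the return value of such a function is a loss. LightFM has no method that returns the loss directly, so an evaluation metric is computed instead and negated.

Now call fmin to run the optimization.
max_evals sets the maximum number of model fits, and passing timeout additionally limits the total search time. best_params is a dictionary describing the best hyperparameter combination found. For a parameter defined with hp.choice it holds the index of the chosen option rather than the value itself; hyperopt.space_eval(space, best_params) converts it back to actual values.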

best_params = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)

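2.3. Checking the Results

Training alone is not the end of the story. Let's use a trained model to find similar books (items). Similarity is measured with cosine similarity.

# Retrain a final model with the best hyperparameters found above
# (the model created inside objective() is not available here).
from hyperopt import space_eval

no_components, learning_rate = space_eval(space, best_params)
model = LightFM(no_components=no_components, learning_schedule='adagrad',
                loss='warp', learning_rate=learning_rate, random_state=0)
model.fit(interactions=train, item_features=item_features,
          sample_weight=train_weights, epochs=3, verbose=False)

# Find Similar Items
item_biases, item_embeddings = model.get_item_representations(features=item_features)

# Dataset assigns internal item indices in the order the ids were passed to fit(),
# so book ids are translated through its mapping rather than assumed to be book_id - 1.
_, _, item_id_map, _ = dataset.mapping()
index_to_book_id = {index: book_id for book_id, index in item_id_map.items()}

def make_best_items_report(item_embeddings, book_id, num_search_items=10):
    item_id = item_id_map[book_id]

    # Cosine similarity
    scores = item_embeddings.dot(item_embeddings[item_id])  # (10000, )
    item_norms = np.linalg.norm(item_embeddings, axis=1)    # (10000, )
    item_norms[item_norms == 0] = 1e-10
    scores /= item_norms

    # best: indices of the num_search_items items with the highest scores
    best = np.argpartition(scores, -num_search_items)[-num_search_items:]
    similar_item_id_and_scores = sorted(zip(best, scores[best] / item_norms[item_id]),
                                        key=lambda x: -x[1])

    # Collect the report rows first, then build the pandas dataframe once
    rows = []
    for similar_item_id, score in similar_item_id_and_scores:
        similar_book_id = index_to_book_id[similar_item_id]
        meta = item_meta[item_meta['book_id'] == similar_book_id].iloc[0]
        rows.append([similar_book_id, meta['original_title'], meta['authors'], score])

    return pd.DataFrame(rows, columns=['book_id', 'title', 'author', 'score'])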


# book_id 2: Harry Potter and the Philosopher's Stone by J.K. Rowling, Mary GrandPré
# book_id 9: Angels & Demons by Dan Brown
report01 = make_best_items_report(item_embeddings, 2, 10)
report02 = make_best_items_report(item_embeddings, 9, 10)

Let's look at the books reported as similar to these two books, Harry Potter and the Philosopher's Stone and Angels & Demons.

# Harry Potter and the Philosopher's Stone
book_id                                              title                        author     score
         Harry Potter and the Philosopher's Stone   J.K. Rowling, Mary GrandPré  1.000000
                                       Blue Smoke                  Nora Roberts  0.768227
                                 Prince of Thorns                Mark  Lawrence  0.767087
                                   The Ugly Truth                   Jeff Kinney  0.761519
                                     Spirit Bound                 Richelle Mead  0.760111
Being Mortal: Medicine and What Matters in the...                  Atul Gawande  0.755845
                               The Black Cauldron               Lloyd Alexander  0.739562
                                       Frog Music                 Emma Donoghue  0.739197
                                The Darkest Night                Gena Showalter  0.735191
                                 Children of Dune                 Frank Herbert  0.735112

# Angels & Demons
book_id                                 title                                            author     score
                    Angels & Demons                                          Dan Brown  1.000000
                         Anansi Boys                                       Neil Gaiman  0.876268
                     Lord of Misrule                                      Rachel Caine  0.869406
                                 NaN                                   Francine Rivers  0.859091
              Can You Keep a Secret?                                   Sophie Kinsella  0.847986
                                 NaN                   Marcus Pfister, J. Alison James  0.847010
                  The Scarlet Letter Nathaniel Hawthorne, Thomas E. Connolly, Nina ...  0.840049
                          The Rescue                                   Nicholas Sparks  0.834288
The Immortal Life of Henrietta Lacks                                    Rebecca Skloot  0.834270
               2001: A Space Odyssey                                  Arthur C. Clarke  0.812411

The judgment of these results is left to the reader.

Reference

1) LightFM official documentation 2) A blog post about LightFM 3) Hyperopt GitHub repository


GitHub Usage - 09. Overall (Git Command Summary, How to Use Git)

27 May 2020 | GitHub usage

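The previous post covered conflicts.
This post goes through how to use Git commands as a whole.

General rules that apply to the commands:

Tokens such as <blabla> in this post are placeholders; replace them with appropriate text yourself.
Each command has various options. ex) git log has options such as --oneline, -<number>, and -p.
Many options have a short form. The long form starts with two dashes (--), and the short form starts with one dash (-) and usually takes just the first letter of the long form. ex) --patch = -p. The short and long forms have the same effect.
The order of options generally does not matter, and in most cases options may come before or after the command's required arguments.
A detailed description of each command is available through git help <command-name>.
Think of a ticket branch as an experimental branch created from a parent branch in order to add one specific feature.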

Creating a Working Tree

git init

Read this part if you want to turn an empty directory or an existing project into a git repository.

To turn an ordinary directory (one that is not yet a git repository) into a git working tree, enter the following in a command prompt (cmd / terminal).

git init

# Example output
Initialized empty Git repository in blabla/sample_directory/.git/

This creates a hidden directory named .git inside that directory. Do not modify its contents by hand.

Note) git init alone does not connect the repository to any remote repository on the internet. See the section on connecting a remote repository below.

git clone

To bring a working tree that already exists on the internet to your own computer (local), copy the repository's address of the form https://github.com/blabla.git and enter the following command.

git clone <git-address>

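# Example command
git clone https://github.com/greeksharifa/git_tutorial.git

# Example output
Cloning into 'git_tutorial'...
remote: Enumerating objects: 56, done.
remote: Total 56 (delta 0), reused 0 (delta 0), pack-reused 56
Unpacking objects: 100% (56/56), done.

This creates a subdirectory named after the project in the current folder, containing everything that has been pushed to the remote (including the .git directory).
Note that only the default branch is checked out as a local branch. The other branches are downloaded as remote-tracking branches (origin/<branch-name>), and using them locally requires an extra step such as checking them out.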

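Connecting a Git Repository

This step is unnecessary if you created the local copy with git clone.

First create a remote repository on GitHub or a similar service.

A local repository is connected to a remote repository as follows.

git remote add <remote-name> <git-address>

# Example command
git remote add origin https://github.com/greeksharifa/git_tutorial.git

<remote-name> is a kind of alias for the remote repository; origin is the usual choice. A large project may use several remotes.

This alone does not complete the connection. Your changes reach the remote repository only after you push, typically with git push -u, which also sets the upstream.

Checking the connected remote repositories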

git remote --verbose
git remote -v

# Example output
origin  https://github.com/greeksharifa/git_tutorial.git (fetch)
origin  https://github.com/greeksharifa/git_tutorial.git (push)

The output of git remote -v has the form <remote-name> <git-address> <fetch/push>.
(fetch) is where new work is downloaded from, and (push) is where new work is uploaded to.

To list only the remote names, use git remote show; for detailed information about one remote, use git remote show <remote-name>.

git remote show
---
git remote show origin

# Example output
origin
---
* remote origin
  Fetch URL: https://github.com/greeksharifa/git_tutorial.git
  Push  URL: https://github.com/greeksharifa/git_tutorial.git
  HEAD branch: main
  Remote branches:
    2nd-branch    tracked
    3rd-branch    tracked
    fourth-branch tracked
    main          tracked
  Local branches configured for 'git pull':
    2nd-branch merges with remote 2nd-branch
    main       merges with remote main
  Local refs configured for 'git push':
    2nd-branch pushes to 2nd-branch (up to date)
    main       pushes to main     (local out of date)

This shows the remote's URL, which branches it has, and which remote branch each local branch is linked to.

Renaming a remote

git remote rename <old-remote-name> <new-remote-name>

# Example command
git remote rename origin official

Removing a remote

git remote remove <remote-name>

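Configuring Git

Git settings cover things like setting or changing your account information. The two levels of settings you will use most often are:

Global settings, which apply to every git project on the computer
- On Linux they are stored in ~/.gitconfig. On Windows the file is C:/Users/<user-name>/.gitconfig.
Local settings, which apply to one specific project only
- They are stored in .git/config under the project's root directory.

Unless you share the computer with others, you will mostly deal with global settings.

Viewing a setting:

git config --get <setting-name>
git config --get user.name

# View all settings
git config --list

Setting a value: you will usually set your user name and email address. A login prompt may appear the first time.

git config --global <setting-name> <value>

# Example commands
git config --global user.name 'greeksharifa'
git config --global user.email 'greeksharifa@gmail.com'

To apply a setting to the current project only instead of globally, use --local instead of --global.

Changing git's default editor

Unless configured otherwise, git usually opens Vim (or vi) as its editor, and this can be changed to another editor. A command with arguments has to be quoted so that it is passed as a single value.

# Example commands
git config --global core.editor "mate -w"
git config --global core.editor "subl -n -w"
git config --global core.editor '"C:\Program Files\Vim\gvim.exe" --nofork'

For more settings, see git help config.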

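Storing Credentials

With the SSH protocol you do not have to type a password every time you access a remote repository, but with the HTTP protocol you have to enter your credentials each time.
Git can, however, store these credentials for you.

To keep credentials temporarily (cache), use the following. They are cached for 15 minutes by default, and the timeout can be changed. The second command below sets it to 1 hour (3600 seconds).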

git config --global credential.helper cache
git config --global credential.helper 'cache --timeout=3600'

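To store them permanently rather than temporarily, use store instead of cache. Note that store writes the credentials to a file on disk in plain text. You can also specify the file to store them in.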

git config --global credential.helper store
git config --global credential.helper 'store --file <file-path>'

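Adding Files to the Git Staging Area (index)

Changes in a local repository are published in three steps.

git add puts the changed files into the staging area (this is called staging)
git commit bundles the staged changes into a single commit
git push uploads the local commits to the remote repository so that the changes take effect there

git add is the first of these steps: it adds files to the staging area.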

git add <filename1> [<filename2>, ...]
git add <directory-name>
git add *
git add --all
git add .

# Example commands
git add third.py fourth.py
git add temp_dir/*

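* is a wildcard. Used on its own (git add *), it adds the files and directories in the current directory that the shell matches, which normally leaves out names starting with a dot. After a directory name it adds every file in that directory, and written as *.py it adds every file with the .py extension.
git add . adds all changes under the current directory (.). When run from the repository root it has the same effect as git add --all, which always covers the whole working tree.

If you run git add and then modify a file that is already staged, git status may list the same file under both Changes to be committed and Changes not staged for commit. This is not an error; if you want the newer change in the next commit, simply run git add again.

Staging only part of the changes in a file

Suppose fourth.py is changed as follows.

# Before
print('hello')

print(1)

print('bye')

# After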
print('hello')
print('git')

print('bye')
print('20000')

Suppose you want to stage every change except print('bye'); print('20000'). Add the --patch option to git add <filename> as follows.

git add --patch fourth.py
git add fourth.py -p

# Example output
diff --git a/fourth.py b/fourth.py
index 13cc618..4c8cfb6 100644
--- a/fourth.py
+++ b/fourth.py
@@ -1,5 +1,5 @@
 print('hello')
+print('git')

-print(1)
-
-print('bye')
\ No newline at end of file
+print('bye')
+print('20000')
\ No newline at end of file
stage this hunk [y,n,q,a,d,s,e,?]? 

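Git then asks, for each changed chunk of code (hunk), whether to stage it. Adjacent added (green, +) lines or adjacent removed (red, -) lines form one hunk.

The options are described below. Entering ? also prints the help.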

Option	Description
y	stage this hunk
n	do not stage this hunk
q	quit; do not stage this hunk or any of the remaining ones
a	stage this hunk and all later hunks in the file
d	do not stage this hunk or any of the later hunks in the file
s	split the current hunk into smaller hunks
e	manually edit the current hunk
?	print help

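Here, once the change has been split into separate hunks, answering y, y, n in turn stages exactly the parts you want and leaves the rest unstaged. (As the English help text suggests, to stage simply means to add to the staging area.)

Because -p groups adjacent added and removed lines into a single hunk, choose s to split the hunk, or e to edit it by hand when you need finer control.

After staging only part of a file's changes with git add -p, the same file appears under both Changes to be committed and Changes not staged for commit.

Committing

A commit bundles the changes of the staged files into one unit. In Git, the commit is the basic unit in which changes are applied.

git commit [-m "message"] [--amend]

By default, a commit is made with the following command.

git commit

# Example output: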
All text in first line will be showed at --oneline

Maximum length is 50 characters.
Below, is for detailed message.

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# On branch main
# Your branch is up to date with 'origin/main'.
#
# Changes to be committed:
#       modified:   .gitignore
#       new file:   third.py
#
~
~

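Entering git commit opens the vim editor, where you can edit the commit message. The steps are:

Press i, which stands for insert.
You can now edit the message freely. A few rules apply:
- The first line is the summary shown by the --oneline option of git log. By convention it is kept to about 50 characters; git itself does not enforce this limit.
- The text written below it holds the detailed description of the commit.
- Lines beginning with # are treated as comments and are not included in the commit message.
When you are done, press the following in order: ESC, :wq, Enter.
- ESC returns vim to command mode, and :wq means write (save) and quit. If this is unclear, just follow the steps.
The tildes (~) at the bottom mark lines past the end of the file; they are not blank lines.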
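The message rules above can be checked mechanically. The sketch below is a hypothetical helper, not part of git: it drops comment lines starting with #, then warns when the first line exceeds a 50-character limit or when the second line is not blank. Both checks reflect common conventions rather than anything git enforces.

```python
def check_commit_message(message, subject_limit=50):
    """Return a list of warnings for a commit message (hypothetical helper)."""
    warnings = []
    # Drop comment lines, which git removes from the final message.
    lines = [line for line in message.splitlines() if not line.startswith("#")]
    if not lines or not lines[0].strip():
        return ["empty subject line"]
    if len(lines[0]) > subject_limit:
        warnings.append("subject line longer than %d characters" % subject_limit)
    # Convention: a blank line separates the subject from the detailed body.
    if len(lines) > 1 and lines[1].strip():
        warnings.append("second line should be blank")
    return warnings


example = "Fix typo in README\n\nExplain what was wrong.\n# Please enter the commit message"
print(check_commit_message(example))
```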

If writing a detailed message feels like too much work (not a great habit), you can give just a short message:

git commit -m "<message>"

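# Example command:
git commit -m "hotfix for typr error"

A commit message that has already been written can of course be changed.

git commit --amend

This opens the message of the most recent commit in the vim editor for editing.

Normally you run git add and then git commit, but to stage every modified or deleted tracked file and commit in one step, add the -a option. New, untracked files are not included and still need git add.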

git commit -a -m "<commit-message>"

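Publishing Changes to the Remote Repository: git push

Setting the upstream

Once a remote repository has been connected with git remote add, git push <git-address> applies the commits of the local repository to the remote. In other words, this is the final step.

git push <git-address>
git push https://github.com/greeksharifa/gitgitgit.git

# Example output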
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Writing objects: 100% (3/3), 200 bytes | 200.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/greeksharifa/gitgitgit.git
 * [new branch]      main -> main

Passing the git address every time is tedious, however, so you can set an upstream with the commands below. This assumes the remote has already been given a name with git remote add.

If you are using git on this machine for the first time, or working with someone else's repository for the first time, you may be asked for your GitHub id/pw.

git push --set-upstream <remote-name> <branch-name>
git push -u <remote-name> <branch-name>

# Example commands
git push --set-upstream origin main
git push -u origin main

# Example output
Everything up-to-date
Branch 'main' set up to track remote branch 'main' from 'origin'.

git push --set-upstream <remote-name> <branch-name> pushes <branch-name> to the remote <remote-name> and records that remote branch as the upstream of the local branch, so that later git push and git pull commands no longer need <remote-name> and <branch-name>. From then on, git push alone is enough to upload commits to the remote repository.

To push to a branch or remote that has not been set up this way, use git push <remote-name> <branch-name>.

# Example command
git push origin ticket-branch

Deleting a remote branch

Use the following command to delete a remote branch that is no longer needed.

git push --delete <remote-name> <remote-branch-name>

# Example commands
git push --delete origin ticket-branch
git push -d origin ticket-branch

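Publishing changes

In general, commits of the local repository are uploaded to the remote with the following command.

git push <remote-name> <branch-name>

# Example command
git push origin main

If the upload branch and destination were set with --set-upstream above, git push alone is enough to upload to the remote repository.

git push

With the forms above, the branch on the remote gets the same name as the local branch (<branch-name>). To upload under a different name, write it as follows.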

git push <remote-name> <local-branch-name>:<remote-branch-name>

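# Example command
git push origin fourth:ticket

If the target branch on the remote contains commits that the local repository does not have, the push is rejected. The remote changes must first be brought into the local repository, which is done with git pull, described below.

Publishing changes of all branches

git push --all <remote-name>

Because every branch is pushed, <branch-name> does not need to be given.

Bringing Remote Changes into the Local Repository: git pull

By default, git pull is equivalent to git fetch followed by git merge FETCH_HEAD. It first downloads the changes from the remote repository and then merges the commits missing from the local branch into it.

Consider the following situation: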

          A---B---C main on origin
         /
    D---E---F---G main
        ^
        origin/main in your repository

The local main branch does not yet contain commits A, B, and C. Enter git pull to bring them in. If no source is configured, enter git pull <remote-name> <remote-branch-name>.

          A---B---C origin/main
         /         \
    D---E---F---G---H main

If the changes do not conflict, the merge proceeds automatically. If a conflict occurs, resolve it first and then go through add/commit/push.

Checking the State of the Git Directory

git status

To check the current state of the git repository, enter the following command.

git status

# Example output 1:
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

# Example output 2:

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   first.py

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   .gitignore
        deleted:    second.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        third.py

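git status groups the files that have changed in the local repository into three broad categories.

Changes to be committed
- Tracked files whose changes are in the staging area. Only these changes are included in the next commit when you run git commit (hence "to be committed").
- Files added to the staging area with git add since the last commit.
Changes not staged for commit:
- Tracked files with changes that are not yet in the staging area for the next commit.
- Files modified or deleted since they were last staged or committed, without git add having been run on the new changes.
Untracked files:
- Files that are not tracked.
- Files that have never been passed to git add since they were created.

The first level of classification is thus whether a file is staged or tracked; the second level is whether the file is newly created (ex. third.py), modified, or deleted.

For a more compact view of the changed files, use the --short option.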

git status --short
git status -s

# Example output
 M .gitignore
A  doonggoos.py
D  first.py
 M fourth.py
R  third.py -> what.py

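Added files are marked A, modified files M, deleted files D, and renamed files R. The first column shows the state of the staging area and the second column the state of the working tree.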
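To make the two-column layout concrete, here is a small sketch that parses the short-format lines shown above. The helper name and the status-name table are assumptions made for this example; only a handful of status letters are covered.

```python
# Meaning of the status letters used in this example (not a complete list).
STATUS_NAMES = {"M": "modified", "A": "added", "D": "deleted", "R": "renamed", "?": "untracked"}


def parse_short_status(output):
    """Split each `git status --short` line into staged / unstaged state and path."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        # Column 1: staging area, column 2: working tree, then a space, then the path.
        staged, unstaged, path = line[0], line[1], line[3:]
        entries.append({
            "path": path,
            "staged": STATUS_NAMES.get(staged),      # None when the column is blank
            "unstaged": STATUS_NAMES.get(unstaged),
        })
    return entries


sample = "\n".join([
    " M .gitignore",
    "A  doonggoos.py",
    "D  first.py",
    " M fourth.py",
    "R  third.py -> what.py",
])

for entry in parse_short_status(sample):
    print(entry)
```

In the sample, .gitignore and fourth.py are modified but not staged, while the other three changes are already staged.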

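Ignoring Specific Files/Directories: .gitignore

Create a file named .gitignore in the project's top-level directory. On Windows you can type copy con .gitignore, enter the contents, and then press Ctrl + Z followed by Enter to save and create the file.

With .gitignore open, enter the file names, directory names, and so on that you want ignored. From then on, git add will not stage files of those kinds in the project.

An example follows. A # starts a comment only at the beginning of a line, so each comment goes on its own line.

# Ignore the file named dum_file.py.
dum_file.py
# Ignore every file with the .zip extension.
*.zip
# Ignore everything inside the data/ directory ...
data/*
# ... except data/regression.csv. This line must come after data/*;
# a file cannot be re-included when its parent directory itself (data/) is excluded.
!data/regression.csv
# Ignore *.json files in every directory.
**/*.json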

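Once .gitignore is saved, matching files are no longer picked up for tracking; git add skips them unless forced.
Files that are already tracked are not affected, however. They have to be removed from tracking with git rm --cached.

Applying a .gitignore to all projects

To apply ignore rules to every project rather than one, enter the following command.

git config --global core.excludesfile <.gitignore-file-path>

# Example commands
git config --global core.excludesfile ~/.gitignore
git config --global core.excludesfile C:\.gitignore

Git then applies the rules in that file to every project. The command does not create the file, so create it yourself if it does not exist. In general, anything set with git config --global affects every project on the machine rather than one specific project. See the section on configuring Git above.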

Reviewing History

Reviewing existing commits: git log

Shows the full history of the repository's commit messages in reverse chronological order, so the most recent commit appears first.

git log

# Example output
commit da446019230a010bf333db9d60529e30bfa3d4e3 (HEAD -> main, origin/main, origin/HEAD)
Merge: 4a521c5 2eae048
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sun Aug 19 20:59:24 2018 +0900

    Merge branch '3rd-branch'

commit 2eae048f725c1d843cad359d655c193d9fd632b4
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sun Aug 19 20:29:48 2018 +0900

    Unwanted commit from 2nd-branch

...
:

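When there are many commits, the output pauses and a cursor blinks waiting for input. Pressing the space bar shows the next commits, and (END) is displayed at the end (once the repository's first commit is reached).
When you reach the end or do not need to see earlier commits, press q to quit the log view.

git log options: --patch(-p), --max-count(-<number>), --oneline(--pretty=oneline), --graph

To see each commit's diff (the detailed changes of the commit, i.e. the changed parts of the changed files), enter the following.

git log --patch

# Example output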
commit 2eae048f725c1d843cad359d655c193d9fd632b4
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sun Aug 19 20:29:48 2018 +0900

    Unwanted commit from 2nd-branch

diff --git a/first.py b/first.py
index 2d61b9f..c73f054 100644
--- a/first.py
+++ b/first.py
@@ -9,3 +9,5 @@ print("This is the 1st sentence written in 3rd-branch.")
 print('2nd')

 print('test git add .')
+
+print("Unwanted sentence in 2nd-branch")

To see the log of a branch other than the current one, add <branch-name>.

git log -p origin/main

# Example output
commit 2eae048f725c1d843cad359d655c193d9fd632b4
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sun Aug 19 20:29:48 2018 +0900

    Unwanted commit from 2nd-branch

diff --git a/first.py b/first.py
index 2d61b9f..c73f054 100644
--- a/first.py
+++ b/first.py
@@ -9,3 +9,5 @@ print("This is the 1st sentence written in 3rd-branch.")
 print('2nd')

 print('test git add .')
+
+print("Unwanted sentence in 2nd-branch")

To see only the three most recent commits, enter the following.

git log -3

To see only the essentials, such as each commit's summary message, enter the following.

git log --oneline

# Example output
da44601 (HEAD -> main, origin/main, origin/HEAD) Merge branch '3rd-branch'
2eae048 Unwanted commit from 2nd-branch
4a521c5 Desired commit from 2nd-branch

For reference, the following prints each commit's full id.

git log --pretty=oneline

# Example output
da446019230a010bf333db9d60529e30bfa3d4e3 (HEAD -> main, origin/main, origin/HEAD) Merge branch '3rd-branch'
2eae048f725c1d843cad359d655c193d9fd632b4 Unwanted commit from 2nd-branch
4a521c56a6c2e50ffa379a7f2737b5e90e9e6df3 Desired commit from 2nd-branch

Options can be combined.

git log --oneline -5

The --graph option draws a graph showing where branches diverged and where they were merged. If the history never branched, it appears as a single line.

git log --graph

# Example output
* commit e8a20c960cfcd3f444d93b735f6bed7bd40ed7c5 (HEAD -> main, origin/main, origin/HEAD)
| Author: greeksharifa <greeksharifa@gmail.com>
| Date:   Fri May 29 23:25:35 2020 +0900
|
|     accelerate page load speed
|
* commit abbe725235f3144ef6df02c4b1b34cd1804ccd50
| Author: greeksharifa <greeksharifa@gmail.com>
| Date:   Fri May 29 22:22:49 2020 +0900
|
|     permalink test
|
...

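For the --merges and --no-merges options, see git help log.

Searching commits

The -S option finds commits whose changes add or remove an occurrence of the given string, that is, commits that change how many times the string appears in a file. It does not search commit messages; use --grep=<pattern> for that.
The -G option is similar to -S but matches the added or removed lines against a regular expression.

git log -S <string>
git log -G <regex-expression>

Viewing only some commits

To view the log without the most recent commit, use git log HEAD^.
To leave out the two most recent commits, use git log HEAD~2.
To view a specific range of commits, use git log <commit-1>..<commit-2>, which lists the commits reachable from <commit-2> but not from <commit-1>.
To see what one branch has that another lacks, use git log <branch-name-1>..<branch-name-2>. Branches of a remote repository can be used as well.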
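The meaning of a two-dot range can be modelled on a toy history. The sketch below is plain Python, not a call into git, and the commit names and parent links are made up to mirror the D-E-F-G / A-B-C picture used in the git pull section: `X..Y` selects every commit reachable from Y that is not reachable from X.

```python
# Toy history: each commit maps to its list of parents (names are made up).
parents = {
    "D": [], "E": ["D"], "F": ["E"], "G": ["F"],   # main
    "A": ["E"], "B": ["A"], "C": ["B"],            # topic, branched off at E
}


def reachable(commit):
    """All commits reachable from `commit` by following parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def commit_range(excluded, included):
    """Commits selected by `excluded..included`."""
    return reachable(included) - reachable(excluded)


print(sorted(commit_range("G", "C")))  # what the topic branch has that main lacks
print(sorted(commit_range("C", "G")))  # what main has that the topic branch lacks
```

Swapping the two sides of the range therefore answers the opposite question, which is why the order of the branch names matters.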

Reviewing commits and the whole history of how they changed: git reflog

git reflog

# Example output:
87ab51e (HEAD -> main, tag: specific_tag) HEAD@{0}: commit: All text in first line will be showed at --oneline
da44601 (origin/main, origin/HEAD) HEAD@{1}: clone: from https://github.com/greeksharifa/git_tutorial.git

The output above shows two changes, HEAD@{0}: commit and HEAD@{1}: clone. git reflog records not only commits but every local movement of HEAD, such as commits being removed or rearranged and operations like clone or rebase.

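Viewing the change history of a specific file: git blame

Use it as git blame <filename>. It shows the file's history line by line:
for each line, the id of the commit that last changed it, who changed it, when, the line number, and the content.

Despite the name, blame is not about blaming anyone.

git blame fourth.py

# Example output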
8506cef2 (greeksharifa      2020-05-27 21:42:19 +0900 1) print('hello')
dd65e051 (greeksharifa      2020-05-28 23:21:01 +0900 2) print('git')
8506cef2 (greeksharifa      2020-05-27 21:42:19 +0900 3)
dd65e051 (greeksharifa      2020-05-28 23:21:01 +0900 4) print('bye')
00000000 (Not Committed Yet 2020-05-30 14:26:53 +0900 5) print('20000')
00000000 (Not Committed Yet 2020-05-30 14:26:53 +0900 6)
00000000 (Not Committed Yet 2020-05-30 14:26:53 +0900 7) print('for test')
00000000 (Not Committed Yet 2020-05-30 14:26:53 +0900 8) print('for test 2')
00000000 (Not Committed Yet 2020-05-30 14:26:53 +0900 9) print('repeating test')

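Note that it does not group related changes together.

Detailed differences from another commit / branch: git diff

git diff can show the differences between branches or between commits. Consider the following examples.

With no arguments, git diff compares the working tree with the staging area, that is, it shows changes that have not been staged yet (use git diff HEAD to compare with the latest commit). If files have been modified their changes are shown; otherwise nothing is printed.

git diff

# Example output 1
(empty)

# Example output 2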
diff --git a/fourth.py b/fourth.py
index 4c8cfb6..e69de29 100644
--- a/fourth.py
+++ b/fourth.py
@@ -1,5 +0,0 @@
-print('hello')
-print('git')
-
-print('bye')
-print('20000')
\ No newline at end of file

git diff <commit> shows how the current working tree differs from that commit.

git diff <branch-name-1> <branch-name-2> shows all differences between the two branches. Swapping the order of the branches swaps the added and removed lines, so take care.
The + and - signs describe the change when going from <branch-name-1> to <branch-name-2>: code that is absent from <branch-name-1> but present in <branch-name-2> is marked with +.
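The effect of the argument order can be reproduced without git. The sketch below uses Python's difflib on two made-up versions of a file; it only illustrates how a unified diff assigns + and -, and is not how git computes its diffs.

```python
import difflib

# Two made-up versions of the same file.
old = ["print('hello')", "print(1)"]
new = ["print('hello')", "print('git')"]


def changed_lines(a, b):
    """Return only the added/removed lines of a unified diff going from a to b."""
    diff = difflib.unified_diff(a, b, fromfile="a", tofile="b", lineterm="")
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]


forward = changed_lines(old, new)    # old -> new
backward = changed_lines(new, old)   # new -> old: the signs are swapped
print(forward)
print(backward)
```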

git diff main 2nd-branch

# Example output
diff --git a/.gitignore b/.gitignore
index 15c8c56..8d16a4b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,3 @@
-
+third.py
 .idea/
 *dummy*
diff --git a/first.py b/first.py
index baba21f..2d61b9f 100644
--- a/first.py
+++ b/first.py
@@ -1 +1,11 @@
-print("Hello, git!") 
+print("Hello, git!") # instead of "Hello, World!"
...

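<branch-name-2> can also be omitted, in which case the given branch is compared with the current working tree. In the example below the + and - signs are the reverse of the result above.

git diff 2nd-branch

# Example output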
diff --git a/.gitignore b/.gitignore
index 8d16a4b..15c8c56 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,3 @@
-third.py
+
 .idea/
 *dummy*
diff --git a/first.py b/first.py
index 2d61b9f..baba21f 100644
--- a/first.py
+++ b/first.py
@@ -1,11 +1 @@
-print("Hello, git!") # instead of "Hello, World!"
-print("Hi, git!!")
...

difftool

To view the result of a diff in an external diff tool (the one configured through diff.tool, for example vimdiff) instead of as plain text in the terminal, use difftool.

git difftool <branch-name-1>..<branch-name-2>
git difftool <commit-1>..<commit-2>

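HEAD: the tip of a branch

HEAD refers to the very end of the current branch's history. The end here means the most recent commit (not the starting point).
Put differently, it is the commit that is currently checked out, the one you are working on.

HEAD@{0}, HEAD@{1}, ... are reflog entries. HEAD@{0} is the commit HEAD points to now, and HEAD@{1} is the commit HEAD pointed to one move earlier. As in many programming languages, the index starts at 0.

HEAD^ refers to the commit right before HEAD, that is, the (first) parent of the most recent commit.

~ walks back several generations. For example, HEAD~3 is the single commit three steps before the most recent one (the parent of the parent of the parent); it is not a range.

HEAD~2^ goes two commits back from HEAD and then one more, so it names the same commit as HEAD~3. To show just the three most recent commits, use git log -3; to name that range itself, use HEAD~3..HEAD.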
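How ~ and ^ pick a single commit can be modelled on a toy history. The sketch below is plain Python with made-up commit names, and it handles only the ~N and ^N suffixes: ~N follows the first parent N times, while ^N selects the N-th parent once, which only differs from ~ at merge commits.

```python
# Toy history: commit -> list of parents, first parent first (names are made up).
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "side": ["c2"],
    "merge": ["c3", "side"],   # a merge commit with two parents
}


def resolve(start, suffix):
    """Resolve a suffix made of ~N and ^N steps, e.g. '~2', '^2' or '~2^'."""
    commit, i = start, 0
    while i < len(suffix):
        op = suffix[i]
        i += 1
        digits = ""
        while i < len(suffix) and suffix[i].isdigit():
            digits += suffix[i]
            i += 1
        n = int(digits) if digits else 1   # a bare ~ or ^ means 1
        if op == "~":
            for _ in range(n):             # follow the first parent n times
                commit = parents[commit][0]
        elif op == "^":
            if n:                          # ^0 names the commit itself
                commit = parents[commit][n - 1]
        else:
            raise ValueError("unsupported operator: " + op)
    return commit


head = "merge"
for suffix in ["^", "~1", "^2", "~2", "~3", "~2^"]:
    print("HEAD" + suffix, "->", resolve(head, suffix))
```

Each expression resolves to exactly one commit; a range always needs the two-dot form, for example HEAD~3..HEAD.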

Tag 붙이기

태그는 특정한 commit을 찾아내기 위해 사용된다. 즐겨찾기와 같은 개념이기 때문에, 여러 commit에 동일한 태그를 붙이지 않도록 한다.

우선 태그를 붙이고 싶은 commit을 찾자.

# 명령어 예시 1
git log --oneline -3

# 결과 예시 1
87ab51e (HEAD -> main) All text in first line will be showed at --oneline
da44601 (origin/main, origin/HEAD) Merge branch '3rd-branch'
2eae048 Unwanted commit from 2nd-branch

# 명령어 예시 2
git log 87ab51e --max-count=1
git show 87ab51e

# 결과 예시 2
commit 87ab51eecef1a526cb504846ddcaed0459f685c8 (HEAD -> main)
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Thu May 28 14:49:13 2020 +0900

    All text in first line will be showed at --oneline

    Maximum length is 50 characters.
    Below, is for detailed message.

git tag

이제 태그를 commit에 붙여보자.

git tag <tag-name> 87ab51e

# 명령어 예시
git tag specific_tag 87ab51e

지금까지 붙인 태그 목록을 보려면 다음 명령을 입력한다.

git tag

# 결과 예시
specific_tag

해당 태그가 추가된 commit을 보려면 여기를 참조한다.

특정 commit 보기

git show

commit id를 사용해서 특정 commit을 보고자 하면 다음과 같이 쓴다.

git log 87ab51e --max-count=1
git show 87ab51e

# 결과 예시
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Thu May 28 14:49:13 2020 +0900

    All text in first line will be showed at --oneline

    Maximum length is 50 characters.
    Below, is for detailed message.

git show <tag-name>

git show <tag-name>

# 명령어 예시
git show specific_tag

# 결과 예시
commit 87ab51eecef1a526cb504846ddcaed0459f685c8 (HEAD -> main, tag: specific_tag)
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Thu May 28 14:49:13 2020 +0900

    All text in first line will be showed at --oneline

    Maximum length is 50 characters.
    Below, is for detailed message.

diff --git a/.gitignore b/.gitignore
index 8d16a4b..6ec8ec8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,2 @@
-third.py
 .idea/
 *dummy*
diff --git a/third.py b/third.py
new file mode 100644
index 0000000..0360dad
--- /dev/null
+++ b/third.py
@@ -0,0 +1 @@
+print('hello 3!')

Git Branch

branch 목록 업데이트하기

git fetch --all

Unlike git branch, git fetch has no -a shorthand for --all: in git fetch, -a is short for --append.

특정 원격 저장소의 것만을 업데이트하려면 다음과 같이 한다.

git fetch <remote-name>

branch 목록 보기

로컬 branch 목록을 보려면 다음을 입력한다.

git branch
git branch --list
git branch -l

# 결과 예시
* main

branch 목록을 보여주는 모든 명령에서, 현재 branch(작업 중인 branch)는 맨 앞에 asterisk(*)가 붙는다.

모든 branch 목록 보기:

git branch --all
git branch -a

# 결과 예시
* main
  remotes/origin/2nd-branch
  remotes/origin/3rd-branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/main

remotes/가 붙은 것은 원격 branch라는 뜻이며, branch의 실제 이름에는 remotes/가 포함되지 않는다.

--verbose 옵션을 붙이면 최신 commit까지 출력해 준다.

git branch --all --verbose

# 결과 예시
  2nd-branch                   1be03c8 Remove files that were uploaded incorrectly
* main                         94d511c [ahead 3] fourth ticket
  remotes/origin/2nd-branch    1be03c8 Remove files that were uploaded incorrectly
  remotes/origin/3rd-branch    90ce4f2 Merge branch '3rd-branch'
  remotes/origin/HEAD          -> origin/main
  remotes/origin/fourth-branch 94d511c fourth ticket
  remotes/origin/main          da44601 Merge branch '3rd-branch'

The [ahead 3] note next to the main branch means that the local branch has 3 commits that have not yet been pushed to the remote repository.

원격 branch 목록만 보기:

git branch --remotes
git branch -r

# 결과 예시
  origin/2nd-branch
  origin/3rd-branch
  origin/HEAD -> origin/main
  origin/main

branch 이름 변경

먼저 현재 branch의 이름 변경하는 방법은 다음과 같다.

git checkout <old-branch-name>
git branch -m <new-branch-name>

A branch can also be renamed without checking it out first. The following works from whichever branch you are currently on.

git branch -m <old-branch-name> <new-branch-name>

When a branch was renamed on the remote: renaming the local branch too

The example below is for the case where master was renamed to main.

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a

원격 branch 목록 업데이트

로컬 저장소와 원격 저장소는 실시간 동기화가 이루어지는 것이 아니기 때문에(일부 git 명령을 내릴 때에만 통신이 이루어짐), 원격 branch 목록은 자동으로 최신으로 유지되지 않는다. 목록을 새로 확인하려면 다음을 입력한다.

git fetch

별다른 변경점이 없으면 아무 것도 표시되지 않는다.

branch 전환

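Git refuses to switch branches when an uncommitted change would be overwritten by the switch. Changes that do not conflict are carried over to the new branch (they are the M and D lines in the output below).
To start from a clean state, commit the changes first or put them aside temporarily with stash.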

단순히 branch 간 전환을 하고 싶으면 다음 명령어를 입력한다.

git checkout <branch-name>

# 명령어 예시
git checkout main

# 결과 예시
Switched to branch 'main'
M       .gitignore
D       second.py
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

전환을 수행하면,

변경된 파일의 목록과
현재 로컬 브랜치가 연결되어 있는 원격 브랜치 사이에 얼마만큼의 commit 차이가 있는지

도 알려준다.

로컬에 새 branch를 생성하되, 그 내용을 원격 저장소에 있는 어떤 branch의 내용으로 하고자 하면 다음 명령을 사용한다.

git checkout --track -b <local-branch-name> <remote-branch-name>

# 명령어 예시
git checkout --track -b 2nd-branch origin/2nd-branch

# 결과 예시
Switched to a new branch '2nd-branch'
M       .gitignore
D       second.py
Branch '2nd-branch' set up to track remote branch '2nd-branch' from 'origin'.

출력에서는 2nd-branch라는 이름의 새 branch로 전환하였고, 파일의 현재 수정 사항을 간략히 보여주며, 로컬 branch 2nd-branch가 origin의 원격 branch 2nd-branch를 추적하게 되었음을 알려준다.
즉 원격 branch의 로컬 사본이 생성되었음을 알 수 있다.

새 branch 생성

git branch <new-branch-name>

# 명령어 예시
git branch fourth-branch

위 명령은 branch를 생성만 한다. 생성한 브랜치에서 작업을 시작하려면 checkout 과정을 거쳐야 한다.

branch 생성과 같이 checkout하기

git checkout -b <new-branch-name> <parent-branch-name>

# 명령어 예시
git checkout -b fourth-branch main

# 결과 예시
Switched to a new branch 'fourth-branch'

새로운 branch는 생성 시점에서 parent branch와 같은 history(commit 기록들)을 갖는다.

원격 저장소의 branch를 로컬 저장소에 복사하며 checkout하기

git checkout -b <local-branch-name> --track <remote-branch-name>

# 명령어 예시
git branch -a
git checkout -b 3rd-branch --track remotes/origin/3rd-branch
git branch

# 결과 예시
  2nd-branch
* main
  remotes/origin/2nd-branch
  remotes/origin/3rd-branch
  remotes/origin/HEAD -> origin/main
  remotes/origin/fourth-branch
  remotes/origin/main


Switched to a new branch '3rd-branch'
Branch '3rd-branch' set up to track remote branch '3rd-branch' from 'origin'.


  2nd-branch
* 3rd-branch
  main

branch 병합: git merge

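Use git merge <branch-name>. It brings the commits of <branch-name> into the current branch. When both branches have new commits since they diverged, git creates a merge commit (a true merge). When the current branch has no commits of its own since the fork point, git only moves the branch pointer forward, as in the example below.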

git merge <branch-name>

# 명령어 예시
git merge ticket-branch

# 결과 예시
Updating 96c99dc..94d511c
Fast-forward
 .gitignore | 2 +-
 fourth.py  | 5 +++++
 second.py  | 9 ---------
 third.py   | 0
 4 files changed, 6 insertions(+), 10 deletions(-)
 create mode 100644 fourth.py
 delete mode 100644 second.py
 create mode 100644 third.py

이와 같은 방법을 history fast-forward라 한다(히스토리 빨리 감기).

병합할 때 ticket branch의 모든 commit들을 하나의 commit으로 합쳐서 parent branch에 병합하고자 할 때는 --squash 옵션을 사용한다.

# 현재 branch가 parent branch일 때
git merge ticket-branch --squash

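The --squash option fits the case where the branch should not have been split off in the first place: after the merge, the parent branch looks as if a single commit had been applied. Note that git merge --squash only stages the combined changes; run git commit afterwards to record them.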

위와 같이 처리했을 때는 ticket branch가 더 이상 필요 없으니 삭제하도록 하자.

병합 시 현 branch의 작업만을 최우선으로 남겨둔다면 다음 옵션을 사용한다.

git merge -X ours <branch-name>

반대로 가져오고자 하는 branch의 작업을 최우선으로 남긴다면 다음을 쓴다.

git merge -X theirs <branch-name>

branch 삭제

git branch --delete <branch-name>
git branch -d <branch-name>

# 명령어 예시
git branch --delete ticket-branch

# 결과 예시
Deleted branch ticket-branch (was 94d511c).

branch 삭제는 해당 branch의 수정사항들이 다른 branch에 병합되어서, 더 이상 필요없음이 확실할 때에만 문제없이 실행된다.
아직 수정사항이 남아 있음에도 그냥 해당 branch 자체를 폐기처분하고 싶으면 --delete 대신 -D 옵션을 사용한다.

이미 원격 저장소에 올라간 branch를 삭제하려면 여기를 참조한다.

작업 취소하기

먼저 가능한 작업 취소 명령들을 살펴보자.

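What you want	Command
Discard the changes to a specific file	`git checkout -- <filename>`
Discard all uncommitted changes to tracked files	`git reset --hard`
Discard every commit after `<commit>` together with all staged and unstaged changes	`git reset --hard <commit>`
Combine several commits	`git reset <commit>`
Edit, combine, or split earlier commits	`git rebase --interactive <commit>`
Delete untracked files and directories (tracked files are not touched)	`git clean -fd`
Undo an earlier commit while keeping the history intact	`git revert <commit>`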

아래는 Git for Teams라는 책에서 가져온 flowchart이다. 뭔가 잘못되었을 때 사용해보도록 하자.

여러 명이 협업하는 프로젝트에서 이미 원격 저장소에 잘못된 수정사항이 올라갔을 때, 이를 강제로 되돌리는 것은 금물이다. ‘잘못된 수정사항을 삭제하는’ 새로운 commit을 만들어 반영시키는 쪽이 훨씬 낫다.

물론 branch를 잘 만들고, pull request 시스템을 적극 활용해서 그러한 일이 일어나지 않도록 하는 것이 최선이다.
혹시나 그런 일이 발생했다면, revert를 사용하라. 다른 명령들은 아직 원격 저장소에 push하지 않았을 때 쓰는 명령들이다.

특정 파일의 수정사항 되돌리기: checkout, reset

특정 파일을 지워 버렸거나 수정을 잘못했다고 하자. 이 때에는 다음 전제조건이 있다.

수정사항을 commit하지 않았을 때

commit하지 않았다면, 다음 두 가지 경우가 있다. git status를 입력하면 친절히 알려준다.

git status

#결과 예시
On branch main

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   third.py

no changes added to commit (use "git add" and/or "git commit -a")

The last line (no changes added to commit) tells you that nothing has been staged yet.

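If the change has not been staged (git add has not been run)
- git checkout -- <filename>
- The file is restored to its committed state.
If the change has been staged (git add has been run)
- In this case git status does not print the no changes added to commit ... line shown above. Run the following two commands.
- git reset HEAD <filename>
- git checkout -- <filename>
- This restores the file as it is stored in the newest (HEAD) commit, which is why it only works for changes that have not been committed.
- Alternatively, a single command does both steps. (git reset --hard cannot be combined with a file path.)
- git checkout HEAD -- <filename>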

git reset <filename>은 git add <filename>의 역방향이라고 보면 된다. 물론 git reset <commit> <filename>은 파일을 여러 commit 이전으로 되돌릴 수 있기 때문에 상황에 따라서는 다른 작업일 수 있다.

비슷하게, git reset -p <filename>은 git add -p <filename>의 역 작업이다.

git reset의 옵션은 여러 개가 있다.

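git reset [-q | -p] [--] <paths>: <paths> may be file names or directories. This form is the reverse of git add [-p].
git reset [--soft | --mixed [-N] | --hard | --merge | --keep] [-q] [<commit>]
- --hard: moves the branch to <commit> and discards every change in the staging area and in the working tree.
- --soft: moves the branch to <commit> but leaves the staging area and the files untouched, so everything committed after <commit> shows up as Changes to be committed.
- --mixed: moves the branch to <commit> and resets the staging area, but leaves the files untouched, so the changes show up as unstaged. This is the default.
- --merge: resets the staging area and updates the files that differ between <commit> and HEAD, while keeping unstaged changes in other files. The reset is aborted if one of the differing files also has unstaged changes.
- --keep: similar to --merge, but the reset is aborted if a file that differs between <commit> and HEAD has local changes.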
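The three most common modes can be summarised with a toy model of the three areas git tracks. This is only a sketch under simplifying assumptions (one file, dictionaries instead of real snapshots, a hypothetical reset function); it is not git's implementation and it leaves out --merge and --keep.

```python
def reset(repo, target, mode="mixed"):
    """Toy git reset. `repo` holds 'commits', 'head', 'index' and 'worktree'."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unsupported mode: {mode}")
    snapshot = repo["commits"][target]
    new = dict(repo, head=target)           # every mode moves the branch tip
    if mode in ("mixed", "hard"):
        new["index"] = dict(snapshot)       # staging area reset to <commit>
    if mode == "hard":
        new["worktree"] = dict(snapshot)    # files on disk reset as well
    return new

repo = {
    "commits": {"c1": {"a.py": "v1"}, "c2": {"a.py": "v2"}},
    "head": "c2",
    "index": {"a.py": "v2"},
    "worktree": {"a.py": "v2"},
}

for mode in ("soft", "mixed", "hard"):
    r = reset(repo, "c1", mode)
    print(mode, r["head"], r["index"]["a.py"], r["worktree"]["a.py"])
```

With --soft only the branch tip moves, with --mixed the staging area follows, and with --hard the files follow too, which is why --hard is the only one of the three that can lose work.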

모든 파일의 수정사항 되돌리기:

git reset --hard HEAD

branch 병합 취소하기

먼저 다음 flowchart를 살펴보자.

바로 직전에 한 병합(merge)를 취소하려면 다음 명령어를 입력한다.

git reset --merge ORIG_HEAD

병합 후 추가한 commit이 있으면 해당 지점의 commit을 지정해야 한다.

git reset <commit>

어디인지 잘 모르겠으면 reflog를 사용해보자.

이미 원격 저장소에 공유된 branch 병합을 취소하는 방법은 여기를 참고한다.

커밋 합치기: git reset <commit>

기본적으로, git reset은 branch tip을 <commit>으로 옮기는 과정이다. 그래서, git reset <option> HEAD는 마지막 commit의 상태로 준비 영역 또는 파일 내용을 되돌리는(reset) 작업이다.
또한, 바로 위에서 살펴봤듯이, git reset은 기본 옵션이 --mixed이며, 이는 옵션을 따로 명시하지 않으면 git reset은 파일의 수정사항은 그대로 둔 채 준비 영역에는 추가된 수정사항이 없는 상태로 만든다.

그래서 특정 이전 commit을 지정하여 git reset <commit>을 수행하면 해당 <commit>부터 HEAD까지의 파일의 수정사항은 작업트리(=프로젝트 디렉토리 전체)에 그대로 남아 있지만, 준비 영역에는 아무런 변화도 기록되어 있지 않다.
먼저 어떤 커밋들을 합칠지 git log --oneline으로 확인해보자.

# 결과 예시
c8c731b (HEAD -> main, origin/main, origin/HEAD) doong commit
87ab51e (tag: specific_tag) All text in first line will be showed at --oneline
da44601 Merge branch '3rd-branch'
2eae048 Unwanted commit from 2nd-branch
4a521c5 Desired commit from 2nd-branch

이제 가장 최신 2개의 commit을 합치고 싶으면, 현재 branch의 HEAD를 c8c731b에서 da44601로 옮기면 된다.

git reset da44601

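The changes of the last 2 commits then remain in the files, but they disappear from the staging area and from the commit history. Staging, committing, and pushing again finally turns the 2 commits into 1. Because c8c731b has already been pushed in this example, that last push would have to be a forced push, which should be avoided on a branch that others share.

If specifying the commit id is confusing, run git reset HEAD~2 instead. HEAD~2 is the commit two steps before HEAD (da44601 in the log above), so resetting to it undoes the two newest commits.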

git rebase

rebase는 일반적으로 history rearrange의 역할을 한다. 즉, 여러 commit들의 순서를 재배치하는 작업이라 할 수 있다. 혹은 parent branch의 수정사항을 가져오면서 자신의 commit은 그 이후에 추가된 것처럼 하는, 마치 분기된 시점을 뒤로 미룬 듯한 작업을 수행할 수도 있다.

그러나 rebase와 같은 기존 작업을 취소 또는 변경하는 명령은 일반적으로 충돌(conflict)이 일어나는 경우가 많다. 충돌이 발생하면 git은 작업을 일시 중지하고 사용자에게 충돌을 처리하라고 한다.

main branch의 commit을 topic branch로 가져오기

다음과 같은 상황을 가정하자. 각 알파벳은 하나의 commit이며, 각 이름은 branch의 이름을 나타낸다.
아래 각 예시는 git help에 나오는 도움말을 이용하였다.

          A---B---C topic
         /
    D---E---F---G main

commit F, G를 topic branch에 반영(포함)시키려 한다면,

                  A'--B'--C' topic
                 /
    D---E---F---G main

commit A’와 A는 프로젝트에 동일한 수정사항을 적용시키지만, 16진수로 된 commit의 고유 id(da44601 같은)는 다르다. 즉, 엄밀히는 다른 commit이다.

commit을 재배열하는 명령어는 다음과 같다. 현재 branch는 topic이라 가정한다.

git rebase main
git rebase main topic

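As long as commits A, B, C do not modify the same parts of the same files as F and G, this rebase completes automatically.

If main already contains a commit that introduces the same change as one of the topic commits (A' below is the same change as A), that commit is skipped when the topic commits are replayed.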

          A---B---C topic
         /
    D---E---A'---F main

에서

                   B'---C' topic
                  /
    D---E---A'---F main

로 바뀐다.

Changing the parent of a branch: --onto

topic을 next가 아닌 main에서 분기된 것처럼 바꾸고자 한다. 즉,

    o---A---B---o---C  main
         \
          D---o---o---o---E  next
                           \
                            o---o---o  topic

이걸 아래와 같이 바꿔보자.

    o---A---B---o---C  main
        |            \
        |             o'--o'--o'  topic
         \
          D---o---o---o---E  next

The history of the topic branch no longer contains the commits D through E of next. Its three commits now sit on top of main, right after commit C.

이는 다음과 같은 명령어로 수행할 수 있다:

git rebase --onto main next topic

다른 예시는:

                            H---I---J topicB
                           /
                  E---F---G  topicA
                 /
    A---B---C---D  main

git rebase --onto main topicA topicB

                 H'--I'--J'  topicB
                /
                | E---F---G  topicA
                |/
    A---B---C---D  main

특정 범위의 commit들 제거하기

    E---F---G---H---I---J  topic

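Suppose you want to drop the commits from the 5th newest up to, but not including, the 3rd newest commit of topic (F and G in the graph above, counting J as the 1st). This can be done with the following command.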

git rebase --onto <branch-name>~<start-number> <branch-name>~<end-number> <branch-name>

# 명령어 예시
git rebase --onto topic~5 topic~3 topic

    E---H'---I'---J'  topic

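Note that the 5th newest commit (F) is removed while the 3rd newest commit (H) is kept. In terms of the arguments, topic~5 is E, the new base, and topic~3 is G, the last commit that is left out; everything after G (H, I, J) is replayed on top of E. Because this is a rebase, the commit ids change (H -> H').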
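The bookkeeping of git rebase --onto on a linear branch can be traced with a short list-based sketch. The function names rebase_onto and ancestor are made up for this illustration, and the model ignores conflicts and branching entirely.

```python
def ancestor(history, n):
    """Name of <tip>~n on a linear history listed oldest -> newest."""
    return history[len(history) - 1 - n]

def rebase_onto(history, newbase, upstream):
    """Replay every commit after `upstream` on top of `newbase`.

    Replayed commits get a trailing ' to show that their ids change.
    """
    base_idx = history.index(newbase)
    up_idx = history.index(upstream)
    replayed = [name + "'" for name in history[up_idx + 1:]]
    return history[:base_idx + 1] + replayed

topic = ["E", "F", "G", "H", "I", "J"]
new_base = ancestor(topic, 5)   # topic~5
upstream = ancestor(topic, 3)   # topic~3
result = rebase_onto(topic, new_base, upstream)
print(new_base, upstream)
print(result)
```

The commits between the new base and the upstream argument (F and G here) are the ones that fall out of the history.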

충돌 시 해결법

일반적으로 rebase에서 수정하는 2개 이상의 commit이 같은 파일을 수정하면 충돌이 발생한다.

보통은 다음 과정을 거치면 해결된다.

Take the appropriate action on each conflicted file: keep it, delete it, or pick the parts of it to keep. Somewhere in the file there will be a block like the following. Remove the markers and leave only the code you want.

<<<<<<< HEAD
<current-code>
=======
<incoming-code>
>>>>>>> da446019230a010bf333db9d60529e30bfa3d4e3

git add <conflict-resolved-filename>
git rebase --continue
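To make the marker layout concrete, the following sketch splits a conflicted text into its two sides. It is an illustration only, not a git feature: the split_conflict helper and the sample text are assumptions, and real conflicts should still be resolved by reading both sides.

```python
def split_conflict(text):
    """Split text containing conflict blocks into (current, incoming) versions."""
    current, incoming = [], []
    side = None  # None: outside a conflict block, otherwise "current" / "incoming"
    for line in text.splitlines():
        if line.startswith("<<<<<<< "):
            side = "current"
        elif line.startswith("=======") and side == "current":
            side = "incoming"
        elif line.startswith(">>>>>>> ") and side == "incoming":
            side = None
        elif side == "current":
            current.append(line)
        elif side == "incoming":
            incoming.append(line)
        else:                       # ordinary line: belongs to both versions
            current.append(line)
            incoming.append(line)
    return "\n".join(current), "\n".join(incoming)

conflicted = """print('start')
<<<<<<< HEAD
print('current')
=======
print('incoming')
>>>>>>> da44601
print('end')"""

current_side, incoming_side = split_conflict(conflicted)
print(current_side)
print(incoming_side)
```

Everything between <<<<<<< and ======= is one side, everything between ======= and >>>>>>> is the other, and the lines outside the block are common to both.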

그냥 다 모르겠고(?) rebase 작업을 취소하고자 하면 다음을 입력한다.

git rebase --abort

rebase로 commit 합치거나 수정하기

다음과 같은 history가 있다고 하자.

c3eace0 (HEAD -> main, origin/main, origin/HEAD) git checkout, reset, rebase
f6c56ef what igt
bd80626 github hem
b7801a2 github overall
608a518 highlighter theme change

여러 개의 commit들을 합치거나, commit message를 수정하거나 하는 작업은 모두 rebase로 가능하다.
실행하면, vim 에디터가 열릴 것이다(ubuntu의 경우 nano일 수 있다). vim을 쓰는 방법은 여기를 참고한다.

rebase하는 부분에서는 다른 git command들과는 달리 수정할 commit 중 가장 오래된 commit이 가장 위에 온다.

git rebase --interactive <commit>
git rebase -i <commit>

# 명령 예시
git rebase --interactive 608a518
git rebase -i HEAD~4

# 결과 예시

pick b7801a2 github overall
pick bd80626 github hem
pick f6c56ef what igt
pick c3eace0 git checkout, reset, rebase
# Rebase 608a518..c3eace0 onto 608a518
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

설명을 잘 살펴보면 다음을 알 수 있다:

pick = p는 수정 사항과 commit을 그대로 둔다. 각 commit의 맨 앞에는 기본적으로 pick으로 설정되어 있다. 이 상태에서 아무 것도 안 하고 나간다면 이번 rebase는 아무 효과도 없다.
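reword = r is almost the same as pick, except that it lets you change the commit message. Change pick to reword (or r), save, and close the list; git then opens an editor for that commit's message. Editing the message text inside this list has no effect. Using r on the newest commit has the same effect as git commit --amend.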
edit = e는 해당 commit을 수정할 수 있다. reset 등의 작업이 가능하다.
squash = s는 해당 commit이 바로 이전 commit에 흡수되며, commit message 또한 합쳐져서 하나로 된다. 합친 메시지들이 존재하는 에디터가 다시 열린다.
fixup = f는 squash와 비슷하지만, 해당 commit의 message는 삭제된다.
exec = x는 commit들 아래 줄에 명령어를 추가하여 실행하게 할 수 있다.

수정한 예시는 다음과 같다. 약어를 써도 되고 안 써도 된다.

pick b7801a2 github overall
f bd80626 github hem
f f6c56ef what igt
fixup c3eace0 git checkout, reset, rebase
...(the help comments below may be deleted or left in place; git ignores them either way)

하나의 commit을 2개로 분리하기

가장 최신 commit이라면 git reset HEAD~1을 사용하여 직전 commit 상태로 되돌린 뒤 stage-commit을 2번 수행하면 되고, 그 이전 commit이라면 rebase에서 해당 commit을 edit으로 두고 같은 과정을 반복하면 된다.

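# Example commands
git rebase -i HEAD~4
# change pick -> edit for the commit to split, then save and close the list
git reset HEAD^    # undo that commit but keep its changes in the files
git add -p <filename>
git commit -m <1st-commit-message>
git add -p <filename1> <filename2>
git commit -m <2nd-commit-message>
git rebase --continue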

commit을 되돌리는 commit: git revert

예를 들어, 4a521c5이라는 commit이 코드 3줄을 수정하고, 2줄을 제거하는 commit이라고 하자. 나중에, 이 commit이 완전히 잘못된 내용임을 알았으나, 이미 원격 저장소에 push되었다고 하자. 이럴 때 해당 commit을 취소하는 작업을 git revert로 수행할 수 있다.
More precisely, it creates a new commit that reverses the changes made by the given commit.

git revert <commit>

# 명령어 예시
git revert 4a521c5

# 결과 예시
[main <new-commit-id>] Revert "specific_commit_description"

공유된 branch 병합 취소하기

먼저 어디서 병합이 일어났는지를 살펴본다. git log --merges를 쓰면 병합 commit만을 볼 수 있다. 반대로 --no-merges는 병합 commit은 제외하고 log를 보여준다.

git log --merges

# 결과 예시
commit da446019230a010bf333db9d60529e30bfa3d4e3 (origin/main, origin/HEAD)
Merge: 4a521c5 2eae048
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sun Aug 19 20:59:24 2018 +0900

    Merge branch '3rd-branch'

commit 90ce4f2ec8b5cd26af51e03401fb4541abfffbc2 (tag: v0.5, origin/3rd-branch)
Merge: e934e3e 317200f
Author: greeksharifa <greeksharifa.gmail.com>
Date:   Sun Aug 12 15:42:06 2018 +0900

    Merge branch '3rd-branch'

아니면 git log --graph나 git reflog를 활용한다.

이제 다음 그림을 참고하자.

완전 병합인 경우 다음 명령을 사용한다.

git revert --mainline <branch-number> <commit>

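# Example command
git revert --mainline 1 da44601

<commit> must be the merge commit itself (da44601 in the log above). <branch-number> is the number of the parent whose side should be kept. Parent 1 is the branch that was checked out when the merge was made (the leftmost line in git log --graph), and it is usually the one to keep.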

병합 commit이 따로 없다면 잘못된 commit들을 개별적으로 처리해야 한다.

특정 commit을 포함하는 모든 branch의 목록을 보자.

git branch --contains <commit>

취소할 commit들이 인접해 있다면 다음 명령으로 하나의 취소 commit을 생성할 수 있다.

git revert --no-commit <last commit to keep>..<newest commit to reject>

# Example command
git revert --no-commit 4a521c5..2eae048

변경 사항을 검토하고 취소 과정을 끝내자.

git revert --continue

인접해 있지 않다면 각 commit을 하나씩 취소 작업을 해야 한다. 심심한 위로의 말을 전한다.

git revert <commit-1>
git revert <commit-2>
...

history 완전 삭제하기: 완전범죄?

혹시나 비밀번호 같은 걸 원격 저장소에 올려버렸다면, 다른 팀원들이 봤든 안 봤든 최대한 흔적도 없이 날려버려야 한다. 이 때는 다음 명령들을 실행한다. 삭제할 파일이 password.crypt라고 하자.

git filter-branch --index-filter 'git rm --cached --ignore-unmatch password.crypt' HEAD
git reflog expire --expire=now --all
git gc --prune=now
git push origin --force --all
git push origin --force --tags

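In order, these commands rewrite every commit reachable from HEAD so that the file is no longer part of it, expire all reflog entries that still point at the old commits, delete the now unreachable objects from the local repository, and overwrite the branches and tags on the remote. Since the password was exposed in the meantime, it should also be changed.

Tell the other team members to rebase their work onto the rewritten history, or to wipe their local repository and clone it again. On recent Git versions the preserve mode used below is no longer available, and --rebase=merges takes its place.

git pull --rebase=preserve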

수정사항 임시 저장하기: git stash

지금 당장 branch를 전환해서 다른 branch의 내용을 봐야 하는데 commit할 만큼은 안 되는 수정사항이 작업트리에 남아 있을 때가 있다. 그럴 때는 잠시 넣어 두는 명령이 필요하다.

git stash
git stash save
git stash save "stash message"

# 결과 예시
Saved working directory and index state WIP on main: 94d511c fourth ticket

commit message처럼 간략한 메시지를 적고 싶다면 git stash save "<stash-message>"로 사용한다.

However, git stash [save] does not store untracked files. To stash those as well, use the following.

git stash save --include-untracked
git stash -u

반대로 stage된 파일을 stash하지 않으려면 git stash --keep-index로 사용한다.

git stash도 git add와 비슷하게 --patch 옵션을 지원한다. 남길 부분을 파일 내에서 선택하고 싶다면 해당 옵션을 사용하라.

stash로 저장한 목록을 보려면 다음 명령을 입력한다.

git stash list

#결과 예시
stash@{0}: WIP on main: 94d511c fourth ticket
stash@{1}: WIP on main: 94d511c fourth ticket

If you do not remember what a stash contains, inspect it with git show stash@{<number>}. (git stash show -p stash@{<number>} prints only the patch.)

git show stash@{1}

# 결과 예시
Merge: 94d511c 7060e4d f4a6d7f
Author: greeksharifa <greeksharifa@gmail.com>
Date:   Sat May 30 13:51:23 2020 +0900

    WIP on main: 94d511c fourth ticket

diff --cc .gitignore
index 15c8c56,15c8c56,0000000..f6f1686
mode 100644,100644,000000..100644
--- a/.gitignore
+++ b/.gitignore
@@@@ -1,3 -1,3 -1,0 +1,5 @@@@
  +
  +.idea/
  +*dummy*
+++
+++*.txt
diff --cc doonggoos.py
...

잠시 넣어 둔 stash를 다시 작업트리로 꺼내오려면 git stash apply stash@{<number>}를 사용한다.

git stash apply stash@{0}

# 결과 예시
On branch main
Your branch and 'origin/main' have diverged,
and have 3 and 2 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   doonggoos.py

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   .gitignore
        modified:   fourth.py

어떤 파일들이 변경되었는지 알려준다.

더 이상 안 쓸 stash를 제거하려면 git stash drop stash@{<number>}를 사용한다.

git stash drop stash@{0}

#결과 예시
Dropped stash@{0} (9f700348f8688c3cbc21c17e4bc3d231b3abd0c3)

작업트리 청소하기: git clean

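git clean deletes untracked files; add -d to include untracked directories as well.

With the default configuration (clean.requireForce), git refuses to delete anything unless -f, -n, or -i is given, so the usual form for removing everything that is not tracked is git clean -f -d. As the name says, this is forced (-f, force).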

그냥 지워버려도 되는지 확인하고 싶다면 -n 옵션을 붙여서 실행시키면 된다. 그러면 어떤 파일들이 영향을 받는지 알려준다.

git clean -d -n

Ignored files, such as those listed in .gitignore, are not deleted by git clean. To delete those as well, add the -x option.
To run it interactively, add the -i option.

최초의 오류 commit 찾기: git bisect

git bisect는 일종의 디버깅 툴이다. 코드에 어떤 버그가 있지만 그게 언제 추가됐는지 정확히 모를 때 쓴다.
bisect를 쓰려면 우선 다음 조건이 필요하다.

어떤 문제가 있는 시점을 알고(보통은 현재일 것이다)
해당 문제가 없는 과거의 어떤 commit 시점을 알고 있을 때

그러면 git bisect를 통해 이분탐색을 수행하여 잘못된 코드가 어떤 commit에서 나타났는지 찾는다. 이분 탐색하며 중간 지점의 commit에서 다시 build해 보고,

문제가 있으면 git bisect bad 입력, 해당 commit 이전을 탐색하고,
문제가 없으면 git bisect good 입력, 해당 commit 이후를 탐색한다.

# 명령어 및 결과 예시
git bisect start                        # 시작
git bisect bad [<commit>]               # 어떤 시점(<commit>을 안 쓰면 현재)에 문제가 있고
git bisect good <commit>                # 어떤 시점에는 문제가 없음을 git에 알리기

Bisecting: 675 revisions left to test after this (roughly 10 steps)
# 그러면 675개의 수정 사항 중 이분 탐색을 수행한다. 2^10 = 1024이니 10단계만 테스트하면 된다.

git bisect good

Bisecting: 337 revisions left to test after this (roughly 9 steps)

git bisect <bad/good>
...

bisect 세션을 끝내고 원래 상태로 돌아가려면 git bisect reset을 입력한다.
만약 중간 지점으로 선택된 commit이 테스트할 수 없다면 bad / good 대신 git bisect skip을 입력해서 잠시 패스하고 근처의 다른 commit을 테스트 대상으로 할 수 있다.
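The search that git bisect performs is an ordinary binary search over the commit history. The sketch below shows the idea on a list of integers standing in for commits; first_bad and the sample data are assumptions made for this example, and skip is not modelled.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    commits: oldest -> newest; commits[0] is known good, commits[-1] known bad.
    Returns (index of the first bad commit, number of commits tested).
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1
        if is_bad(commits[mid]):
            bad = mid       # like `git bisect bad`: keep searching older commits
        else:
            good = mid      # like `git bisect good`: keep searching newer commits
    return bad, tested

commits = list(range(1000))   # stand-ins for 1000 commits
culprit = 637                 # hypothetical commit that introduced the bug
index, tested = first_bad(commits, lambda c: c >= culprit)
print(index, tested)
```

Each answer halves the remaining range, so about 1000 candidate commits need at most 10 tests, in line with the "roughly 10 steps" that git reports above.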

branch에서 특정 commit만 다른 branch로 적용하기: git cherry-pick

git cherry-pick <commit> 명령은 branch의 병합 없이도 다른 branch의 특정 commit을 가져올 수 있다. ticket branch에 있는 96c99dc라는 commit을 main branch로 가져오고자 한다.

# 명령어 예시
git checkout main
git cherry-pick 96c99dc

# 결과 예시
[main 32d6b93] example commit message
 Date: Sat May 30 18:51:51 2020 +0900
 1 file changed, 2 insertions(+), 3 deletions(-)

명령어 마음대로 설정하기: Git Alias

alias는 단축만 가능한 것은 아니지만, 단축할 때 많이 쓴다.

git reset HEAD -- <filename>이 입력하기 귀찮거나 자주 실수한다면, 직관적인 명령어로 바꿔 줄 수 있다.
git config alias.<another-name> '<original-command>' 형식으로 쓴다.

git config --global alias.unstage 'reset HEAD --'

이제 아래 두 명령은 동일한 효과를 갖는다.

git reset HEAD -- <filename>
git unstage <filename>

Resolving conflicts automatically: Reuse Recorded Resolution (git rerere)

정확히는 전부 자동으로 해 주는 것은 아니고, 예전에 비슷한 충돌을 해결한 적이 있다면 같은 방식으로 자동으로 해결하도록 설정할 수 있다.

다음 설정으로 활성화한다.

git config --global rerere.enabled true

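When a conflict occurs for the first time, check the conflicted files with git rerere status. git rerere diff shows the current state of the resolution while you fix the conflict by hand.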
이후 처리 과정은 일반 충돌 처리 과정과 같다.
- commit하고 나면 Recorded resolution for <filename>이라는 메시지를 볼 수 있다.
다음으로 비슷한 충돌이 났을 때에는 다음 메시지를 확인할 수 있다.
- Resolved <filename> using previous resolution. : 이미 충돌을 해결했다는 뜻이다.
- 충돌 파일을 확인해봐도 충돌된 부분을 찾을 수 없다. 그냥 commit하면 된다.


Attentional Factorization Machines (AFM) 논문 리뷰 및 Tensorflow 구현

01 May 2020 | Machine_Learning Recommendation System AFM

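The first half of this post reviews the paper Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks and explains the model. The second half walks through implementing and training it with Tensorflow. The full text of the paper can be found here.

1. Review of Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks

1.0. Abstract

FM is a supervised learning algorithm that improves on linear regression by incorporating second-order feature interactions. Although effective, it models every feature interaction with the same weight, which holds it back, because not all interactions are equally useful for prediction. Interactions with useless features can even introduce noise and degrade the model's performance. The paper therefore introduces a new model, the Attentional Factorization Machine (AFM), which learns the importance of each feature interaction.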

1.1. Introduction

(전략)

FM은 피쳐 상호작용의 중요성을 구분하는 능력이 부족하기 때문에(피쳐의 중요성을 파악하는 능력) suboptimal 문제에 빠질 수 있다. AFM은 이러한 문제를 해결하기 위해 도입한 모델이다.

1.2. Factorization Machines

FM 모델에 대한 설명은 이곳을 참조하길 바란다. 기호에 대해서만 설명을 추가하면, $v_i$는 피쳐 $i$에 대한 임베딩 벡터이며, $k$는 임베딩 크기를 의미한다.

1.3. Attentional Factorization Machines

1.3.1. Model

위 그림은 AFM의 구조를 보여준다. 선명히 보여주기 위해 그림에서는 선형 회귀 부분을 생략하였다. Input Layer와 Embedding Layer의 경우 FM과 같은 구조를 지니는데, Input 피쳐들은 sparse하게 이루어져있고 이들은 dense vector로 임베딩된다. 지금부터는 본 모델의 핵심인 pair-wise interaction layer과 attention-based pooling layer를 설명할 것이다.

Pair-wise Interaction Layer
상호작용을 포착하기 위해 내적을 사용하는 FM을 참고하여, 본 논문에서는 신경망 모델링에서 새로운 Pair-wise Interaction Layer를 제시한다. $m$개의 벡터를 $\frac{m(m-1)}{2}$개의 interacted 벡터로 만드는데, 이 때 각 interacted 벡터는 상호작용을 포착하기 위해 2개의 다른 벡터들의 원소곱으로 계산된다.

More precisely, let $\chi$ be the set of non-zero features in the feature vector $x$, and let $\epsilon = \{ v_i x_i \}_{i \in \chi}$ be the output of the Embedding Layer. The output of the Pair-wise Interaction Layer can then be written as the following set of vectors.

$$f_{PI}(\epsilon) = \{ (v_i \odot v_j) x_i x_j \}_{(i, j) \in R_x}$$

$\odot$: element-wise product
$R_x = \{ (i, j) \}_{i \in \chi, j \in \chi, j > i}$
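As a small numeric illustration of the definition above, the following sketch builds the set of interacted vectors with plain Python lists. The values of x and V are made up, and the function name pairwise_interactions is introduced only for this example.

```python
from itertools import combinations

def pairwise_interactions(x, V):
    """Return {(i, j): (v_i * v_j) x_i x_j} over non-zero features with j > i.

    x: feature values, V: one embedding vector per feature.
    """
    nonzero = [i for i, value in enumerate(x) if value != 0]
    result = {}
    for i, j in combinations(nonzero, 2):
        result[(i, j)] = [vi * vj * x[i] * x[j] for vi, vj in zip(V[i], V[j])]
    return result

x = [1.0, 0.0, 0.5, 2.0]                                   # feature 1 is zero
V = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]      # embedding size k = 2
interactions = pairwise_interactions(x, V)
for pair, vector in interactions.items():
    print(pair, vector)
```

With three non-zero features there are 3 * 2 / 2 = 3 interacted vectors, each of the same dimension as the embeddings.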

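Defining this layer lets us express FM as a neural network. First $f_{PI}(\epsilon)$ is compressed with sum pooling, and then a Fully Connected Layer projects it to the prediction score.

$$\hat{y} = p^T \sum_{(i, j) \in R_x} (v_i \odot v_j) x_i x_j + b$$

$p \in \mathbb{R}^k$
$b \in \mathbb{R}$

Here $p$ and $b$ are the weight and bias of the Prediction Layer. If they are fixed to $p = \mathbf{1}$ (the all-ones vector) and $b = 0$, the expression reduces to the second-order interaction term of FM, since $\mathbf{1}^T (v_i \odot v_j) = \langle v_i, v_j \rangle$.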
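The reduction to FM can be checked numerically. The sketch below compares the sum-pooled score with p set to all ones and b = 0 against FM's pairwise inner products; the input values and both function names are assumptions made for this check, and the linear terms of FM are left out.

```python
from itertools import combinations

def neural_fm_score(x, V, p, b):
    """p^T * (sum over pairs of (v_i * v_j) x_i x_j) + b."""
    k = len(p)
    pooled = [0.0] * k
    for i, j in combinations(range(len(x)), 2):
        for d in range(k):
            pooled[d] += V[i][d] * V[j][d] * x[i] * x[j]
    return sum(p[d] * pooled[d] for d in range(k)) + b

def fm_second_order(x, V):
    """Sum over pairs of <v_i, v_j> x_i x_j (second-order FM term only)."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * c for a, c in zip(V[i], V[j]))
        total += dot * x[i] * x[j]
    return total

x = [1.0, 0.0, 0.5, 2.0]
V = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0], [2.0, 0.0]]
score_nn = neural_fm_score(x, V, p=[1.0, 1.0], b=0.0)
score_fm = fm_second_order(x, V)
print(score_nn, score_fm)
```

Pairs that involve a zero feature contribute nothing, so looping over all pairs gives the same result as looping over $R_x$.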

Attention-based Pooling Layer
Attention의 기본 아이디어는, 여러 개의 부분이 압축 과정에 있어서 각각 다르게 기여하여 하나로 표현되게 만드는 것이다. interacted 벡터들의 가중 합을 수행하여 피쳐 상호작용에 대해 Attention 메커니즘을 적용하였다.

$$f_{Att}(f_{PI}(\epsilon)) = \sum_{(i, j) \in R_x} a_{ij} (v_i \odot v_j) x_i x_j$$

여기서 $a_{i, j}$는 피쳐 상호작용 $\hat{w}_{ij}$의 Attention Score이다.

Prediction Loss를 최소화하여 직접적으로 학습을 진행하여 $a_{i,j}$를 추정하는 것이 기술적으로는 맞게 느껴지지만, 학습 데이터에서 한 번도 동시에 등장한 적이 없는 피쳐들의 경우, 이들의 상호작용에 대한 Attention Score는 추정될 수 없다.

이러한 일반화 문제를 해결하기 위해 MLP를 통해 Attention Score를 파라미터화 하는 Attention Network를 추가하였다. 이 네트워크의 Input은 2개의 피쳐의 interacted 벡터인데, 이들의 상호작용 정보는 임베딩 공간에 인코딩된다.

$e_{ij} = h^T \mathrm{ReLU}(W (v_i \odot v_j) x_i x_j + b)$
$a_{ij} = \frac{\exp(e_{ij})}{\sum_{(i', j') \in R_x} \exp(e_{i'j'})}$

$W \in \mathbb{R}^{t \times k}, b \in \mathbb{R}^t, h \in \mathbb{R}^t$
$t$: size of the hidden layer of the Attention Network (Attention Factor)
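The two formulas above can be traced with a few lines of plain Python. This is a sketch with hand-picked values for W, b, and h (in the model they are learned), and the helper names attention_scores and attention_pooling are introduced only here.

```python
import math

def attention_scores(interactions, W, b, h):
    """Softmax-normalised attention scores for a list of interacted vectors.

    interactions: list of k-dim vectors, W: t x k, b: length t, h: length t.
    """
    logits = []
    for vec in interactions:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec)) + bias)
                  for row, bias in zip(W, b)]                 # ReLU(W vec + b)
        logits.append(sum(hj * aj for hj, aj in zip(h, hidden)))  # h^T hidden
    shift = max(logits)                  # subtract the max for numerical stability
    exps = [math.exp(e - shift) for e in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pooling(interactions, scores):
    """Weighted sum of the interacted vectors: a k-dim vector."""
    k = len(interactions[0])
    return [sum(a * vec[d] for a, vec in zip(scores, interactions))
            for d in range(k)]

interactions = [[0.25, -1.0], [4.0, 0.0], [1.0, 0.0]]   # three pairs, k = 2
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                # t = 3
b = [0.0, 0.0, 0.0]
h = [1.0, 1.0, 1.0]

scores = attention_scores(interactions, W, b, h)
pooled = attention_pooling(interactions, scores)
print(scores)
print(pooled)
```

The scores sum to 1 across all pairs, so the pooled vector is a convex combination of the interacted vectors rather than their plain sum.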

Attention Score는 softmax 함수를 통해 정규화된다. 이 Attention-based Pooling Layer의 결과물은 k 차원의 벡터로, 중요성을 구별하여 임베딩 공간에서의 모든 피쳐 상호작용을 압축한 것이다. 요약하자면, AFM 모델의 최종 공식은 아래와 같다.

$$\hat{y}_{AFM}(x) = w_0 + \sum_{i=1}^n w_i x_i + p^T \sum_{i=1}^n \sum_{j=i+1}^n a_{ij} (v_i \odot v_j) x_i x_j$$

모델 파라미터들은 $ w_0, w, v, p, W, b, h $이다.

1.3.2. Learning

AFM이 데이터 모델링의 관점에서 FM을 개선함에 따라 본 모델은 예측, 회귀, 분류, 랭킹 문제 등에 다양하게 적용될 수 있다. 목적 함수를 최적화하기 위해 SGD를 사용하였다. SGD 알고리즘 적용의 핵심은, 각 파라미터를 기준으로 예측 모델 AFM의 derivative를 구하는 것이다.

과적합 문제
FM보다 표현력이 뛰어난 AFM이기에 더욱 과적합 문제에 민감할 수 있다. 따라서 본 모델에서는 dropout과 L2 Regularization 테크닉이 사용되었다.

(후략)

2. Tensorflow를 활용한 구현

2.1. 데이터 준비

본 모델의 경우 Dataset에 대한 Domain 지식이 필요하다고 볼 수는 없지만, 학습을 진행하기에 앞서 기본적으로 직접 전처리를 해주어야 하는 부분들이 있다. One-Hot 인코딩 외에도, 본 모델은 앞서 논문 리뷰에서도 확인하였듯이 0이 아닌 값에 대해서만 Lookup을 수행하여 실제 학습 데이터를 사용하기 때문에 이에 대한 정보를 저장해야할 필요가 있다. 아래 예시를 잠시 살펴보면,

만약 연속형 변수 중에 0.0이라는 값이 존재하더라도 사실 이 값은 중요한 특성을 나타낼 수도 있다. 그러나 논문의 기본 논조대로라면, 0인 값이기 때문에 학습에서 제외되게 된다. 이렇게 0이라고 해서 중요한 값이 학습에서 제외되는 현상을 막기 위해 본 구현에서는 One-Hot 인코딩 이후의 데이터에 대하여 중요한 정보의 위치를 저장하는 masking 작업을 진행하게 된다.

데이터는 DeepFM 구현글에서 사용한 것과 동일하다. 데이터 전처리는 연속형 변수에 대해서는 MinMaxScale, 범주형 변수에 대해서는 One-Hot 인코딩만을 진행하게 된다.

2.2. Layer 정의

AFM 모델에서는 크게 3개의 Layer가 필요하다. Embedding Layer, Pairwise Interaction Layer, Attention Pooling Layer가 바로 그 3가지이다. Embedding Layer 부분은 이전 글(논문)들을 읽었다면, 굉장히 익숙하게 받아들여 질 것이다. 다만 이전 DeepFM 구현글에서는 하나의 Field에 대해 하나의 Embedding Row가 학습되었다면, 본 글에서는 하나의 Feature에 대해 하나의 Embedding Row가 학습되도록 코드를 수정하였다.

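As mentioned above, features that are 0 as a result of One-Hot encoding are excluded, and only the remaining features are actually used in training. For example, if a row becomes 0.2, 7.4, 0, 1, …, 0, 1 after One-Hot encoding, the data actually used in training is 0.2, 7.4, 1, …, 1.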

위와 같은 논리를 구현하는 방법에는 여러가지가 있을 수 있겠지만 본 구현에서는 다음과 같은 논리를 따랐다.

1) 연속형 변수들은 모두 앞쪽에 배치한 후, 이들에게는 무조건 True Mask를 씌워 학습 데이터로 활용한다.  
2) 범주형 변수들에 대해서는 0이 아닌 값들에 대해서 True Mask를 씌워 학습 데이터로 활용한다.  

논리 자체는 간단하며, 아래 call 메서드에서 그 논리가 구현되어 있다.

import tensorflow as tf
import numpy as np
import config


class Embedding_layer(tf.keras.layers.Layer):
    def __init__(self, num_field, num_feature, num_cont, embedding_size):
        super(Embedding_layer, self).__init__()
        self.embedding_size = embedding_size    # k: 임베딩 벡터의 차원(크기)
        self.num_field = num_field              # m: 인코딩 이전 feature 수
        self.num_feature = num_feature          # p: 인코딩 이후 feature 수, m <= p
        self.num_cont = num_cont                # 연속형 field 수
        self.num_cat  = num_field - num_cont    # 범주형 field 수

        # Parameters
        self.V = tf.Variable(tf.random.normal(shape=(num_feature, embedding_size),
                                              mean=0.0, stddev=0.01), name='V')

    def call(self, inputs):
        # inputs: (None, p), return value: (None, m, k)
        # NOTE: relies on eager execution (.numpy()) and a known batch size
        batch_size = inputs.shape[0]

        # mask (np.array, (None, p)): True for every continuous column and for
        # the non-zero (active) column of each One-Hot encoded field
        cont_mask = np.full(shape=(batch_size, self.num_cont), fill_value=True)
        cat_mask = tf.not_equal(inputs[:, self.num_cont:], 0.0).numpy()
        mask = np.concatenate([cont_mask, cat_mask], axis=1)

        # indices: column positions of the True entries, m per row
        _, flatten_indices = np.where(mask)
        indices = flatten_indices.reshape((batch_size, self.num_field))

        # embedding_matrix: (None, m, k)
        embedding_matrix = tf.nn.embedding_lookup(params=self.V, ids=indices.tolist())

        # masked_inputs: (None, m, 1)
        masked_inputs = tf.reshape(tf.boolean_mask(inputs, mask),
                                   [batch_size, self.num_field, 1])

        masked_inputs = tf.multiply(masked_inputs, embedding_matrix)    # (None, m, k)

        return masked_inputs
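The masking rule can be traced on a single row without TensorFlow. In the sketch below the row and the helper active_indices are made up for illustration: the first two columns are continuous, followed by two One-Hot encoded fields of three columns each.

```python
def active_indices(row, num_cont):
    """Columns used for one row: every continuous column, plus the non-zero
    column of each One-Hot encoded field."""
    return [j for j, value in enumerate(row) if j < num_cont or value != 0]

row = [0.0, 7.4, 0, 1, 0, 0, 0, 1]   # the first continuous value happens to be 0.0
indices = active_indices(row, num_cont=2)
values = [row[j] for j in indices]
print(indices)
print(values)
```

The continuous value 0.0 in column 0 is kept even though it is zero, while the zeros produced by One-Hot encoding are dropped, so every row yields exactly num_field entries.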

다음은 Pairwise Interaction Layer에 대한 설명이다. 만약 14개의 Row가 존재한다면 이에 대한 모든 조합을 구하여 91 = $14\choose2$ 개의 Row를 생성하는 Layer인데, 간단하게 생각해보면 아래와 같이 코드를 짜고 싶을 것이다.

from itertools import combinations

interactions = []
comb_list = list(range(0, num_field, 1))

for b in range(batch_size):
    for i, j in combinations(comb_list, 2):
        interactions.append(tf.multiply(inputs[b, i, :], inputs[b, j, :]))

pairwise_interactions = tf.reshape(tf.stack(interactions),
                                   (batch_size, -1, embedding_size))

하지만 위와 같이 loop를 돌리게 되면, 속도가 현저하게 느려져서 실 사용이 불가능하다. 따라서 이 때는 Trick이 필요한데, 그림으로 설명하면 아래와 같다.

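In the figure above, 14 is an example value of num_field and 5 an example value of embedding_size. The leftmost picture stacks the input matrix that came out of the Embedding Layer, as is, num_field times; the picture to its right stacks each individual row num_field times. Taking the element-wise product of these two stacks gives the same products as enumerating every ordered pair of rows. Keeping only the needed rows through masking then yields the result shown on the far right.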

이를 코드를 구현한 것이 아래이다. tf.tile, tf.expand_dims 함수를 잘 이용하면 이 Trick을 코드로 구현할 수 있다. 직접 해보길 바란다.

class Pairwise_Interaction_Layer(tf.keras.layers.Layer):
    def __init__(self, num_field, num_feature, embedding_size):
        super(Pairwise_Interaction_Layer, self).__init__()
        self.embedding_size = embedding_size    # k: 임베딩 벡터의 차원(크기)
        self.num_field = num_field              # m: 인코딩 이전 feature 수
        self.num_feature = num_feature          # p: 인코딩 이후 feature 수, m <= p

        masks = tf.convert_to_tensor(config.MASKS)    # (num_field**2)
        masks = tf.expand_dims(masks, -1)             # (num_field**2, 1)
        masks = tf.tile(masks, [1, embedding_size])   # (num_field**2, embedding_size)
        self.masks = tf.expand_dims(masks, 0)         # (1, num_field**2, embedding_size)


    def call(self, inputs):
        batch_size = inputs.shape[0]

        # a, b shape: (batch_size, num_field^2, embedding_size)
        a = tf.expand_dims(inputs, 2)
        a = tf.tile(a, [1, 1, self.num_field, 1])
        a = tf.reshape(a, [batch_size, self.num_field**2, self.embedding_size])
        b = tf.tile(inputs, [1, self.num_field, 1])

        # ab, mask_tensor: (batch_size, num_field^2, embedding_size)
        ab = tf.multiply(a, b)
        mask_tensor = tf.tile(self.masks, [batch_size, 1, 1])

        # pairwise_interactions: (batch_size, num_field C 2, embedding_size)
        pairwise_interactions = tf.reshape(tf.boolean_mask(ab, mask_tensor),
                                           [batch_size, -1, self.embedding_size])

        return pairwise_interactions

config.MASKS is built as follows.

MASKS = []
for i in range(NUM_FIELD):
    flag = 1 + i

    MASKS.extend([False]*(flag))
    MASKS.extend([True]*(NUM_FIELD - flag))
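
The stack-multiply-mask trick can be checked on a single sample without TensorFlow. The sketch below is a NumPy illustration under assumed toy sizes (num_field = 4, embedding_size = 3, both chosen only for this example). It rebuilds the same upper-triangle mask and compares the masked rows with an explicit loop over itertools.combinations.

```python
import itertools
import numpy as np

num_field, embedding_size = 4, 3          # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=(num_field, embedding_size))   # embedded fields of one sample

# Stack 1: each row repeated num_field times -> rows 0,0,0,0,1,1,1,1,...
a = np.repeat(x, num_field, axis=0)
# Stack 2: the whole matrix tiled num_field times -> rows 0,1,2,3,0,1,2,3,...
b = np.tile(x, (num_field, 1))
ab = a * b                                # row i*num_field + j holds x[i] * x[j]

# Same construction as MASKS: keep only the strict upper triangle (i < j)
mask = []
for i in range(num_field):
    mask.extend([False] * (i + 1))
    mask.extend([True] * (num_field - i - 1))
mask = np.array(mask)

pairwise = ab[mask]                       # (num_field choose 2, embedding_size)

# Reference result from an explicit loop over all unordered pairs
expected = np.stack([x[i] * x[j]
                     for i, j in itertools.combinations(range(num_field), 2)])
```

Both arrays hold the same rows in the same order, which is why the mask can replace the loop.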

Next is the final piece, the Attention Pooling Layer. Its structure is simple and needs little explanation.

class Attention_Pooling_Layer(tf.keras.layers.Layer):
    def __init__(self, embedding_size, hidden_size):
        super(Attention_Pooling_Layer, self).__init__()
        self.embedding_size = embedding_size    # k: dimension of each embedding vector

        # Parameters
        self.h = tf.Variable(tf.random.normal(shape=(1, hidden_size),
                                              mean=0.0, stddev=0.1), name='h')
        self.W = tf.Variable(tf.random.normal(shape=(hidden_size, embedding_size),
                                              mean=0.0, stddev=0.1), name='W_attention')
        self.b = tf.Variable(tf.zeros(shape=(hidden_size, 1)))


    def call(self, inputs):
        # num_pairs = combinations(num_field, 2)
        # inputs: (None, num_pairs, embedding_size)
        # --> after transpose: (None, embedding_size, num_pairs)
        inputs = tf.transpose(inputs, [0, 2, 1])

        # e: (None, num_pairs, 1)
        e = tf.matmul(self.h, tf.nn.relu(tf.matmul(self.W, inputs) + self.b))
        e = tf.transpose(e, [0, 2, 1])

        # Attention scores: normalize across the pairs (axis=1).
        # The default axis=-1 has size 1 here and would make every score 1.0.
        attention_score = tf.nn.softmax(e, axis=1)

        return attention_score
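
As a shape check for the attention pooling step, here is a minimal NumPy sketch for one sample. The sizes (6 pairs, embedding size 5, hidden size 4) and the helper name attention_pool are assumptions made for this illustration, not part of the original project. The point it shows is that the softmax runs across the pairs, so the weights of one sample sum to 1.

```python
import numpy as np

def attention_pool(pairs, W, b, h):
    """pairs: (num_pairs, k), W: (t, k), b: (t,), h: (t,).

    Returns the attention weights (num_pairs,) and the pooled vector (k,).
    """
    hidden = np.maximum(pairs @ W.T + b, 0.0)          # ReLU, (num_pairs, t)
    scores = hidden @ h                                 # (num_pairs,)
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax across pairs
    pooled = (weights[:, None] * pairs).sum(axis=0)     # weighted sum, (k,)
    return weights, pooled

rng = np.random.default_rng(0)
pairs = rng.normal(size=(6, 5))            # 6 pairwise interactions, k = 5
W = rng.normal(scale=0.1, size=(4, 5))     # hidden size t = 4
b = np.zeros(4)
h = rng.normal(scale=0.1, size=4)

weights, pooled = attention_pool(pairs, W, b, h)
```

If the softmax were taken over an axis of length 1 instead, every weight would be exactly 1 and the layer would reduce to plain sum pooling.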

2.3. Model Build

Chaining all of the layers described above completes the AFM model.

# Model definition
import tensorflow as tf

import config
from layers import *
tf.keras.backend.set_floatx('float32')

class AFM(tf.keras.Model):

    def __init__(self, num_field, num_feature, num_cont, embedding_size, hidden_size):
        super(AFM, self).__init__()
        self.embedding_size = embedding_size    # k: dimension of each embedding vector
        self.num_field = num_field              # m: number of fields (columns before encoding)
        self.num_feature = num_feature          # p: number of features (columns after encoding), m <= p
        self.num_cont = num_cont                # number of continuous fields
        self.hidden_size = hidden_size          # number of hidden units in the Attention Pooling Layer

        self.embedding_layer = Embedding_layer(num_field, num_feature,
                                               num_cont, embedding_size)
        self.pairwise_interaction_layer = Pairwise_Interaction_Layer(
            num_field, num_feature, embedding_size)
        self.attention_pooling_layer = Attention_Pooling_Layer(embedding_size, hidden_size)

        # Parameters
        self.w_0 = tf.Variable(tf.zeros([1]))
        self.w = tf.Variable(tf.zeros([num_feature]))
        self.p = tf.Variable(tf.random.normal(shape=(embedding_size, 1),
                                              mean=0.0, stddev=0.1))

        self.dropout = tf.keras.layers.Dropout(rate=config.DROPOUT_RATE)


    def __repr__(self):
        return "AFM Model: embedding{}, hidden{}".format(self.embedding_size, self.hidden_size)


    def call(self, inputs):
        # 1) Linear Term: (None, )
        linear_terms = self.w_0 + tf.reduce_sum(tf.multiply(self.w, inputs), 1)

        # 2) Interaction Term
        masked_inputs = self.embedding_layer(inputs)
        pairwise_interactions = self.pairwise_interaction_layer(masked_inputs)

        # Dropout and Attention Score
        pairwise_interactions = self.dropout(pairwise_interactions)
        attention_score = self.attention_pooling_layer(pairwise_interactions)

        # (None, num_pairs, embedding_size)
        attention_interactions = tf.multiply(pairwise_interactions, attention_score)

        # (None, embedding_size)
        final_interactions = tf.reduce_sum(attention_interactions, 1)

        # 3) Final: (None, )
        y_pred = linear_terms + tf.squeeze(tf.matmul(final_interactions, self.p), 1)
        y_pred = tf.nn.sigmoid(y_pred)

        return y_pred

2.4. Full Code

The full code is available on GitHub.


DeepFM Paper Review and Tensorflow Implementation

07 Apr 2020 | Machine_Learning Recommendation System DeepFM

The first half of this post reviews the paper DeepFM: A Factorization-Machine based Neural Network for CTR Prediction and explains the model. The second half walks through implementing and training it with Tensorflow. The full text of the paper is linked here.

1. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction Paper Review

1.0. Abstract

In recommender systems, learning the sophisticated feature interactions behind user behavior is critical to maximizing CTR. The paper presents a model that is trained end-to-end while emphasizing both low-order and high-order feature interactions. This model, DeepFM, combines FM and deep learning. Compared with the Wide & Deep model that Google had recently released (as of 2017), it needs no feature engineering, and its wide and deep parts share the same input.

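1.1. Introduction

CTR is central to recommender systems. In many cases the goal of the system is to maximize the number of clicks, so the items (articles, movies, etc.) shown to a user can be ranked by their estimated CTR. In online advertising, where increasing revenue matters most, the ranking can instead be based on CTR * bid, where bid is the revenue the system receives when the user clicks the item. In either case, estimating CTR accurately is essential.

The key to CTR prediction is learning the implicit feature interactions hidden behind users' click behavior.

For example, if people tend to download food-delivery apps at mealtimes, the order-2 interaction between app category and time is a signal for the click. If teenage boys like RPG games, the order-3 interaction among app category, user gender, and user age becomes a factor that decides the click. The interactions behind user clicks are therefore highly complex, and capturing both low-order and high-order interactions well is important.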

(omitted)

As a way of learning feature representations, deep neural networks have the potential to learn sophisticated feature interactions. However, CNN-based models are biased toward interactions between neighboring features, and RNN-based models are better suited to click data with sequential dependency. Models such as FNN, PNN, and Wide & Deep were proposed afterwards. The paper presents a new model that addresses the shortcomings of these models.

1) DeepFM can be trained end-to-end without feature engineering. Low-order interactions are modeled by the FM structure and high-order interactions by the DNN.
2) DeepFM trains efficiently because its two parts share the same input and the same embedding vectors.
3) The paper evaluates DeepFM on both benchmark data and commercial data.

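1.2. Our Approach

Suppose the training dataset $(\chi, y)$ has $n$ instances. Here $\chi$ has $m$ fields, and $y$ takes the value 0 or 1 (1 = clicked).

$\chi$ may contain categorical and continuous variables. A categorical variable is represented as a one-hot encoded vector. A continuous variable is represented either by its value itself or, after discretization, as a one-hot encoded vector.

The data can then be written as $(x, y)$, where $x$ has the structure $[x_{field_1}, x_{field_2}, …, x_{field_m}]$ and each $x_{field_j}$ is the vector representation of the j-th field of $\chi$. In general $x$ is very high-dimensional and sparse. The goal of CTR prediction is to estimate accurately the probability that a user clicks a specific app in a given context.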

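1.2.1. DeepFM

As the figure above shows, DeepFM consists of two components, the FM component and the deep component, which share the same input.

Scalar $w_i$ for the $i$-th feature: weighs its order-1 importance
Latent vector $V_i$: measures the impact of its interactions with other features

$V_i$ models order-2 interactions in the FM component and high-order feature interactions in the deep component. All parameters are trained jointly in the combined prediction model. Put very simply, the model is:

$$\hat{y} = sigmoid(y_{FM} + y_{DNN})$$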

FM Component

The FM component is a Factorization Machine. An explanation of the FM model can be found in this post.

Deep Component
The raw data used for CTR prediction is generally very sparse, high-dimensional, a mix of categorical and continuous variables, and grouped into fields (gender, location, age, etc.). An Embedding Layer is therefore needed to compress this information into low-dimensional, dense real-valued vectors before it is fed to the network.

The figure below highlights the sub-network from the Input Layer to the Embedding Layer. Two points are worth noting. First, although the input field vectors can have different lengths, their embeddings all have the same size (k). Second, the latent vectors $V$ of the FM model serve here as the network weights that are learned and used to compress the input field vectors into embedding vectors.

The output of the Embedding Layer is as follows.

$$a^{(0)} = [e_1, e_2, \dots, e_m]$$

$e_i$: embedding of the i-th field
$m$: number of fields

$a^{(0)}$ is fed into the DNN, and the forward process is as follows.

$$a^{(l+1)} = \sigma(W^{(l)}a^{(l)} + b^{(l)})$$

$l$: layer depth

Once this dense real-valued feature vector is produced, it is finally fed into the sigmoid function for CTR prediction.

$$y_{DNN} = \sigma(W^{|H|+1} a^{|H|} + b^{|H|+1})$$

$|H|$: number of hidden layers
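
The forward pass described by the formulas above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: m = 3 fields, k = 2, two hidden layers of sizes 4 and 3, ReLU as the hidden activation $\sigma$, and a sigmoid on the output. The name deep_component and all sizes are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_component(a0, hidden_layers, output_layer):
    """a0: concatenated field embeddings, shape (m * k,)."""
    a = a0
    for W, b in hidden_layers:            # a^(l+1) = sigma(W^(l) a^(l) + b^(l))
        a = np.maximum(W @ a + b, 0.0)    # ReLU assumed as the hidden activation
    W_out, b_out = output_layer
    return float(sigmoid(W_out @ a + b_out)[0])

rng = np.random.default_rng(0)
m, k = 3, 2                               # assumed toy sizes
embeddings = rng.normal(size=(m, k))      # e_1 ... e_m, all of the same size k
a0 = embeddings.reshape(-1)               # a^(0) = [e_1, e_2, ..., e_m]

sizes = [m * k, 4, 3]
hidden_layers = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
                 for n_in, n_out in zip(sizes[:-1], sizes[1:])]
output_layer = (rng.normal(scale=0.1, size=(1, sizes[-1])), np.zeros(1))

y_dnn = deep_component(a0, hidden_layers, output_layer)
```

Because every field embedding has the same size k, the input width of the first dense layer is always m * k, regardless of how many one-hot columns each field expands to.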

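(omitted)

1.5. Conclusions

DeepFM trains the FM Component and the Deep Component jointly. This approach has the following advantages.
1) No pre-training is needed.
2) It learns both low-order and high-order feature interactions.
3) Sharing the feature embedding removes the need for feature engineering.

The experimental results show that DeepFM outperforms the state-of-the-art models it was compared against while remaining efficient.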

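2. Tensorflow Implementation

2.1. Data Description and Transformation

The key to the implementation is understanding the shapes of the parameters $w$ and $V$ and how they are used. From an implementer's point of view the paper is not especially friendly: some of its wording and figures are ambiguous enough to confuse the reader. Thinking it through carefully is nevertheless enough to build the model.

The training data is a dataset for predicting whether annual income exceeds 50,000 dollars, and it can be downloaded here.

The data consists of 48,842 instances with 14 features, 6 of which are continuous. The task is naturally binary classification. Label 0 means an income of 50,000 dollars or less. Label 1 means an income above 50,000 dollars and accounts for roughly 25% of the data.

The explanation below uses this dataset as the running example. It has 14 variables in total, and these 14 are the fields. After One-Hot encoding the categorical variables (continuous variables can also be binned into categorical ones if needed), the data has 108 columns in total, and these 108 are the features. In short, the number of columns after One-Hot encoding is the number of features, and the number of columns before encoding is the number of fields. The paper uses embeddings, and the number of columns of the embedding matrix $V$ is a hyperparameter.

The project consists of the following five py files.

First, the config file. It stores the lists of column names, split into continuous and categorical, along with the hyperparameters.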

# config.py
ALL_FIELDS = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
             'marital-status', 'occupation', 'relationship', 'race',
             'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'country']
CONT_FIELDS = ['age', 'fnlwgt', 'education-num',
               'capital-gain', 'capital-loss', 'hours-per-week']
CAT_FIELDS = list(set(ALL_FIELDS).difference(CONT_FIELDS))

# Hyper-parameters for Experiment
NUM_BIN = 10
BATCH_SIZE = 256
EMBEDDING_SIZE = 5

Now it is time to process the data. (If the data is very large and has to be pulled from a server, writing the code below in pyspark would be a good idea.) The task here is to build field_index and field_dict, which in short amounts to the following.

field_index and field_dict record, for each column of the encoded data, which field it belonged to before encoding.

# Preprocess
import config
from itertools import repeat
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def get_modified_data(X, all_fields, continuous_fields, categorical_fields, is_bin=False):
    field_dict = dict()
    field_index = []
    X_modified = pd.DataFrame()

    for index, col in enumerate(X.columns):
        if col not in all_fields:
            raise ValueError("{} not included: Check your column list".format(col))

        if col in continuous_fields:
            scaler = MinMaxScaler()

            # Optionally discretize the continuous field into NUM_BIN bins
            if is_bin:
                X_bin = pd.cut(scaler.fit_transform(X[[col]]).reshape(-1, ), config.NUM_BIN, labels=False)
                X_bin = pd.Series(X_bin).astype('str')

                X_bin_col = pd.get_dummies(X_bin, prefix=col, prefix_sep='-', dtype=float)
                field_dict[index] = list(X_bin_col.columns)
                field_index.extend(repeat(index, X_bin_col.shape[1]))
                X_modified = pd.concat([X_modified, X_bin_col], axis=1)

            else:
                X_cont_col = pd.DataFrame(scaler.fit_transform(X[[col]]), columns=[col])
                field_dict[index] = col
                field_index.append(index)
                X_modified = pd.concat([X_modified, X_cont_col], axis=1)

        if col in categorical_fields:
            # dtype=float keeps the frame numeric; recent pandas defaults to bool dummies
            X_cat_col = pd.get_dummies(X[col], prefix=col, prefix_sep='-', dtype=float)
            field_dict[index] = list(X_cat_col.columns)
            field_index.extend(repeat(index, X_cat_col.shape[1]))
            X_modified = pd.concat([X_modified, X_cat_col], axis=1)

    print('Data Prepared...')
    print('X shape: {}'.format(X_modified.shape))
    print('# of Feature: {}'.format(len(field_index)))
    print('# of Field: {}'.format(len(field_dict)))

    return field_dict, field_index, X_modified
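
To make the field versus feature distinction concrete, the following sketch builds field_index for a tiny hand-made table. The toy DataFrame, its column values, and the variable names are assumptions for illustration only. It mirrors the idea of the function above without calling it.

```python
import pandas as pd

# Hypothetical toy data: 3 fields (1 continuous, 2 categorical)
toy = pd.DataFrame({
    "age": [0.2, 0.5, 0.9],              # continuous, assumed already scaled
    "sex": ["Male", "Female", "Male"],
    "race": ["A", "B", "C"],
})
continuous = ["age"]

field_index, columns = [], []
for idx, col in enumerate(toy.columns):
    if col in continuous:
        # A continuous field stays as one column, so it maps to one feature
        columns.append(col)
        field_index.append(idx)
    else:
        # A categorical field expands into one feature per category
        dummies = pd.get_dummies(toy[col], prefix=col, prefix_sep="-", dtype=float)
        columns.extend(dummies.columns)
        field_index.extend([idx] * dummies.shape[1])

print(columns)      # ['age', 'sex-Female', 'sex-Male', 'race-A', 'race-B', 'race-C']
print(field_index)  # [0, 1, 1, 2, 2, 2]
```

Here 3 fields become 6 features, and field_index tells each feature which field, and therefore which row of $V$, it belongs to.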

2.2. Model Build

Start with the FM Component. The shape in which the call function returns y_fm can be adapted to the task. In the code below it is returned as (None, 2) and concatenated with the (None, 2) output of the Deep Component into a (None, 4) tensor, which the final Dense layer then maps to a single prediction. These widths can be changed to improve performance.

Parameter $w$ has length num_feature (108), and Parameter $V$ has shape (num_field, embedding_size) = (14, 5). As the call function below shows, $V$ is multiplied against the One-Hot encoded data, so its rows are replicated with tf.nn.embedding_lookup. In other words, using the field_index built earlier, features that come from the same field share the same embedding row (row of $V$).

new_inputs is the object used as the input of the Deep Component. The code shows that the matrix $V$ is used in the FM Component and, by producing new_inputs, also affects the Deep Component.

class FM_layer(tf.keras.layers.Layer):
    def __init__(self, num_feature, num_field, embedding_size, field_index):
        super(FM_layer, self).__init__()
        self.embedding_size = embedding_size    # k: dimension of each embedding vector
        self.num_feature = num_feature          # f: number of features (columns after encoding)
        self.num_field = num_field              # m: number of fields (columns before encoding)
        self.field_index = field_index          # field that each encoded column of X came from

        # Parameters of FM Layer
        # w: capture 1st order interactions
        # V: capture 2nd order interactions
        self.w = tf.Variable(tf.random.normal(shape=[num_feature],
                                              mean=0.0, stddev=1.0), name='w')
        self.V = tf.Variable(tf.random.normal(shape=(num_field, embedding_size),
                                              mean=0.0, stddev=0.01), name='V')

    def call(self, inputs):
        x_batch = tf.reshape(inputs, [-1, self.num_feature, 1])
        # Replicate the rows of V according to field_index: (num_feature, embedding_size)
        embeds = tf.nn.embedding_lookup(params=self.V, ids=self.field_index)

        # Input for the Deep Component
        # (batch_size, num_feature, embedding_size)
        new_inputs = tf.math.multiply(x_batch, embeds)

        # (batch_size, )
        linear_terms = tf.reduce_sum(
            tf.math.multiply(self.w, inputs), axis=1, keepdims=False)

        # Second-order term: 0.5 * sum_k[(sum_i v_ik x_i)^2 - sum_i (v_ik x_i)^2]
        # Sum over features (axis 1) first, then over the embedding axis: (batch_size, )
        interactions = 0.5 * tf.reduce_sum(
            tf.square(tf.reduce_sum(new_inputs, 1)) - tf.reduce_sum(tf.square(new_inputs), 1),
            axis=1
        )

        linear_terms = tf.reshape(linear_terms, [-1, 1])
        interactions = tf.reshape(interactions, [-1, 1])

        y_fm = tf.concat([linear_terms, interactions], 1)

        return y_fm, new_inputs
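
The second-order FM term is usually computed with the linear-time identity rather than an explicit loop over pairs. The NumPy sketch below checks that identity against the brute-force pairwise sum on random data. The sizes (6 features, k = 3) are assumed for illustration, and each feature is given its own row of V for simplicity.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
num_feature, k = 6, 3                     # assumed toy sizes
x = rng.normal(size=num_feature)          # one input row
V = rng.normal(size=(num_feature, k))     # one latent vector per feature

xv = x[:, None] * V                       # (num_feature, k), entries v_ik * x_i

# Fast form: 0.5 * sum_k[(sum_i v_ik x_i)^2 - sum_i (v_ik x_i)^2]
fast = 0.5 * np.sum(np.square(xv.sum(axis=0)) - np.square(xv).sum(axis=0))

# Brute force: sum over all pairs i < j of <v_i, v_j> x_i x_j
slow = sum(np.dot(V[i], V[j]) * x[i] * x[j]
           for i, j in itertools.combinations(range(num_feature), 2))
```

The square has to be taken per embedding dimension, after summing over features only. Summing over both axes before squaring would add cross-dimension products that are not part of the FM model.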

Below is the code for the main model. How the Deep Component is modified to improve performance is up to the researcher. Depending on the task it can be designed to be light or complex. Here it is kept fairly light, with only Dropout added.

import tensorflow as tf
from layers import FM_layer

tf.keras.backend.set_floatx('float32')

class DeepFM(tf.keras.Model):

    def __init__(self, num_feature, num_field, embedding_size, field_index):
        super(DeepFM, self).__init__()
        self.embedding_size = embedding_size    # k: dimension of each embedding vector
        self.num_feature = num_feature          # f: number of features (columns after encoding)
        self.num_field = num_field              # m: number of fields (columns before encoding)
        self.field_index = field_index          # field that each encoded column of X came from

        self.fm_layer = FM_layer(num_feature, num_field, embedding_size, field_index)

        self.layers1 = tf.keras.layers.Dense(units=64, activation='relu')
        self.dropout1 = tf.keras.layers.Dropout(rate=0.2)
        self.layers2 = tf.keras.layers.Dense(units=16, activation='relu')
        self.dropout2 = tf.keras.layers.Dropout(rate=0.2)
        self.layers3 = tf.keras.layers.Dense(units=2, activation='relu')

        self.final = tf.keras.layers.Dense(units=1, activation='sigmoid')

    def __repr__(self):
        return "DeepFM Model: #Field: {}, #Feature: {}, ES: {}".format(
            self.num_field, self.num_feature, self.embedding_size)

    def call(self, inputs):
        # 1) FM Component: (num_batch, 2)
        y_fm, new_inputs = self.fm_layer(inputs)

        # retrieve Dense Vectors: (num_batch, num_feature*embedding_size)
        new_inputs = tf.reshape(new_inputs, [-1, self.num_feature*self.embedding_size])

        # 2) Deep Component
        y_deep = self.layers1(new_inputs)
        y_deep = self.dropout1(y_deep)
        y_deep = self.layers2(y_deep)
        y_deep = self.dropout2(y_deep)
        y_deep = self.layers3(y_deep)

        # Concatenation
        y_pred = tf.concat([y_fm, y_deep], 1)
        y_pred = self.final(y_pred)
        y_pred = tf.reshape(y_pred, [-1, ])

        return y_pred

2.3. Training

The training code is below. Since the model is not heavy, Autograph was not used.

import config
from preprocess import get_modified_data
from DeepFM import DeepFM

import numpy as np
import pandas as pd
from time import perf_counter
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.metrics import BinaryAccuracy, AUC


def get_data():
    file = pd.read_csv('data/adult.data', header=None)
    X = file.loc[:, 0:13]
    Y = file.loc[:, 14].map({' <=50K': 0, ' >50K': 1})

    X.columns = config.ALL_FIELDS
    field_dict, field_index, X_modified = \
        get_modified_data(X, config.ALL_FIELDS, config.CONT_FIELDS, config.CAT_FIELDS, False)

    X_train, X_test, Y_train, Y_test = train_test_split(X_modified, Y, test_size=0.2, stratify=Y)

    train_ds = tf.data.Dataset.from_tensor_slices(
        (tf.cast(X_train.values, tf.float32), tf.cast(Y_train, tf.float32))) \
        .shuffle(30000).batch(config.BATCH_SIZE)

    test_ds = tf.data.Dataset.from_tensor_slices(
        (tf.cast(X_test.values, tf.float32), tf.cast(Y_test, tf.float32))) \
        .shuffle(10000).batch(config.BATCH_SIZE)

    return train_ds, test_ds, field_dict, field_index


# Train on a single batch
def train_on_batch(model, optimizer, acc, auc, inputs, targets):
    with tf.GradientTape() as tape:
        y_pred = model(inputs)
        loss = tf.keras.losses.binary_crossentropy(from_logits=False, y_true=targets, y_pred=y_pred)

    grads = tape.gradient(target=loss, sources=model.trainable_variables)

    # Apply the computed gradients with apply_gradients()
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # accuracy & auc
    acc.update_state(targets, y_pred)
    auc.update_state(targets, y_pred)

    return loss


# Full training loop
def train(epochs):
    train_ds, test_ds, field_dict, field_index = get_data()

    model = DeepFM(embedding_size=config.EMBEDDING_SIZE, num_feature=len(field_index),
                   num_field=len(field_dict), field_index=field_index)

    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

    print("Start Training: Batch Size: {}, Embedding Size: {}".format(config.BATCH_SIZE, config.EMBEDDING_SIZE))
    start = perf_counter()
    for i in range(epochs):
        acc = BinaryAccuracy(threshold=0.5)
        auc = AUC()
        loss_history = []

        for x, y in train_ds:
            loss = train_on_batch(model, optimizer, acc, auc, x, y)
            loss_history.append(loss)

        print("Epoch {:03d}: Mean Loss: {:.4f}, Acc: {:.4f}, AUC: {:.4f}".format(
            i, np.mean(loss_history), acc.result().numpy(), auc.result().numpy()))

    test_acc = BinaryAccuracy(threshold=0.5)
    test_auc = AUC()
    for x, y in test_ds:
        y_pred = model(x)
        test_acc.update_state(y, y_pred)
        test_auc.update_state(y, y_pred)

    print("Test ACC: {:.4f}, AUC: {:.4f}".format(test_acc.result().numpy(), test_auc.result().numpy()))
    print("Batch Size: {}, Embedding Size: {}".format(config.BATCH_SIZE, config.EMBEDDING_SIZE))
    print("Elapsed time: {:.3f}".format(perf_counter() - start))
    model.save_weights('weights/weights-epoch({})-batch({})-embedding({}).h5'.format(
        epochs, config.BATCH_SIZE, config.EMBEDDING_SIZE))


if __name__ == '__main__':
    train(epochs=100)

The results of tests run while varying the Embedding Size are as follows. (Epoch: 100)

Embedding Size	Mean Loss	Train ACC	Train AUC	Test ACC	Test AUC	Time
10	0.3243	0.8485	0.9038	0.8464	0.8991	4 min 0.78 s
9	0.3386	0.8382	0.8954	0.8402	0.8975	4 min 3.64 s
8	0.3704	0.8240	0.8729	0.8260	0.8745	4 min 2.79 s
7	0.3248	0.8471	0.9033	0.8424	0.9013	4 min 0.84 s
6	0.3305	0.8433	0.9001	0.8416	0.9041	4 min 1.28 s
5	0.3945	0.8169	0.8512	0.8190	0.8576	4 min 8.10 s

Reference

https://github.com/ChenglongChen/tensorflow-DeepFM


Field-aware Factorization Machines (FFM) Explained, with xlearn Practice

05 Apr 2020 | Machine_Learning Recommendation System Field-aware Factorization Machines

The first half of this post reviews the paper Field-aware Factorization Machines for CTR prediction and explains the model. The second half introduces some simple xlearn code. The full text of the paper is linked here.

1. Field-aware Factorization Machines for CTR prediction Paper Review

1.0. Abstract

FFM is an effective method for large, sparse datasets such as those found in CTR prediction. The paper presents an effective implementation for training FFM, analyzes the model comprehensively, and compares it with competing models. The experiments show that FFM is a very strong approach for certain classification problems. Finally, the authors release an FFM package.

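1.1. Introduction

What matters greatly in CTR prediction is understanding the conjunctions of features. Simple models such as Simple Logistic Regression do not capture these conjunctions well. The FM model captures feature conjunctions by factorizing each one into the product of two latent vectors.

A variant of FM called pairwise interaction tensor factorization (PITF) was proposed for personalized tag recommendation. Later, at KDD Cup 2012, Team Opera Solutions proposed a generalized version of this model under the name "factor model". Because that term is rather generic and easily confused with factorization machines, the paper refers to it as FFM.

The main points of that team's approach are as follows.

1) Stochastic Gradient is used to solve the optimization problem. To avoid overfitting, the model is trained for only 1 epoch.
2) FFM performed best among the 6 models the team compared.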

1.2. POLY2 and FM

(omitted)

1.3. FFM

The central idea of FFM comes from PITF, which was proposed for personalized tag recommendation. PITF assumes three available fields, User, Item, and Tag, and factorizes (User, Item), (User, Tag), and (Item, Tag) in separate latent spaces. That definition is tailored to recommender systems and lacks detail for CTR prediction, so a more general discussion follows.

Given a data table like the one below, features can be grouped into fields.

For example, Espn, Vogue, and NBC all belong to the field Publisher. FFM is a variant of FM that exploits this information. To explain how FFM works, consider the following new example.

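The FM interaction term $\phi_{FM}(w, x)$ for an instance with the features Espn, Nike, and Male can be written as follows.

$$\phi_{FM}(w, x) = w_{Espn} \cdot w_{Nike} + w_{Espn} \cdot w_{Male} + w_{Nike} \cdot w_{Male}$$

In FM, every feature has exactly one latent vector for learning its latent effect with all other features. Taking Espn as an example, $w_{Espn}$ is used to learn the latent effect with both Nike and Male. But Nike and Male belong to different fields, so the latent effects of (Espn, Nike) and (Espn, Male) are likely to differ. A single vector struggles to represent both relationships.

In FFM, each feature has several latent vectors, one for each field it can be paired with. Writing P, A, and G for the fields Publisher, Advertiser, and Gender of Espn, Nike, and Male, the FFM interaction term $\phi_{FFM}(w, x)$ is as follows.

$$\phi_{FFM}(w, x) = w_{Espn, A} \cdot w_{Nike, P} + w_{Espn, G} \cdot w_{Male, P} + w_{Nike, G} \cdot w_{Male, A}$$

Written in general form:

$$\phi_{FFM}(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1, f_2} \cdot w_{j_2, f_1}) x_{j_1} x_{j_2}$$

Here $f_1$ and $f_2$ are the fields of $j_1$ and $j_2$, and the $j$ index features such as Espn and Nike. With $f$ the number of fields, FFM has $nfk$ parameters, and its computational complexity is $O(\overline{n}^2 k)$.

Here n, f, and k are the number of features (often called p), the number of fields, and the number of latent factors, and $\overline{n}$ is the average number of non-zero features per instance.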
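
The field-aware interaction can be written out as a short loop. The sketch below is an illustration under assumptions: three features Espn, Nike, and Male placed in fields 0, 1, and 2 (standing for Publisher, Advertiser, and Gender, names taken from the paper's example), k = 2, and random latent vectors. The function name ffm_interaction is hypothetical.

```python
import itertools
import numpy as np

def ffm_interaction(x, field_of, W):
    """Sum over pairs j1 < j2 of <W[j1, f2], W[j2, f1]> * x[j1] * x[j2].

    W has shape (n, f, k): W[j, f] is the latent vector that feature j
    uses when it is paired with a feature from field f.
    """
    total = 0.0
    nonzero = np.flatnonzero(x)           # only non-zero features contribute
    for j1, j2 in itertools.combinations(nonzero, 2):
        f1, f2 = field_of[j1], field_of[j2]
        total += np.dot(W[j1, f2], W[j2, f1]) * x[j1] * x[j2]
    return total

# Assumed toy setup: feature 0 = Espn, 1 = Nike, 2 = Male
field_of = np.array([0, 1, 2])            # 0 = Publisher, 1 = Advertiser, 2 = Gender
x = np.array([1.0, 1.0, 1.0])
n, f, k = 3, 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(n, f, k))            # n * f * k parameters in total

phi = ffm_interaction(x, field_of, W)

# Same three terms written out by hand
expected = W[0, 1] @ W[1, 0] + W[0, 2] @ W[2, 0] + W[1, 2] @ W[2, 1]
```

Compared with FM, the only change is the extra field index on each latent vector, which is also where the $nfk$ parameter count comes from.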

Because each latent vector in FFM only has to learn the effect with one specific field, the number of latent factors $k$ is usually smaller than in FM.

$$k_{FFM} < k_{FM}$$

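1.3.1. Solving the Optimization Problem

The FFM optimization problem is the same as that of Simple Logistic Regression except that $\phi_{LM}(w, x)$ is replaced by $\phi_{FFM}(w, x)$.

For reasons given in the experiments, the paper uses AdaGrad, a Stochastic Gradient method known to work well for matrix factorization. At each SG step a data point $(y, x)$ is sampled to update $w_{j_1, f_2}$ and $w_{j_2, f_1}$ in $\phi_{FFM}(w, x)$. Recall that $x$ is extremely sparse in problems like CTR prediction, so in practice only the dimensions with non-zero values are updated.

The sub-gradients are as follows, where $\lambda$ is the regularization parameter.

$$g_{j_1, f_2} = \lambda \cdot w_{j_1, f_2} + \kappa \cdot w_{j_2, f_1} x_{j_1} x_{j_2}$$

$$g_{j_2, f_1} = \lambda \cdot w_{j_2, f_1} + \kappa \cdot w_{j_1, f_2} x_{j_1} x_{j_2}$$

$$\kappa = \frac{-y}{1 + \exp(y \phi_{FFM}(w, x))}$$

For d=1…k, the sum of squared gradients is accumulated as follows.

$$(G_{j_1, f_2})_d \leftarrow (G_{j_1, f_2})_d + (g_{j_1, f_2})_d^2$$

$$(G_{j_2, f_1})_d \leftarrow (G_{j_2, f_1})_d + (g_{j_2, f_1})_d^2$$

Finally, $(w_{j_1, f_2})_d$ and $(w_{j_2, f_1})_d$ are updated as follows.

$$(w_{j_1, f_2})_d \leftarrow (w_{j_1, f_2})_d - \frac{\eta}{\sqrt{(G_{j_1, f_2})_d}} (g_{j_1, f_2})_d$$

$$(w_{j_2, f_1})_d \leftarrow (w_{j_2, f_1})_d - \frac{\eta}{\sqrt{(G_{j_2, f_1})_d}} (g_{j_2, f_1})_d$$

Here $\eta$ is a user-specified learning rate. The initial values of $w$ are sampled from a Uniform Distribution on $[0, 1/\sqrt{k}]$. The initial values of $G$ are all set to 1 to keep $(G_{j_1, f_2})_d^{-\frac{1}{2}}$ from becoming very large. The overall procedure is shown below, and the paper notes that normalizing each instance helped performance.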
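
A single AdaGrad step for one feature pair can be sketched as below. This follows the update as described in the paper: the gradient of each latent vector is the regularization term plus $\kappa$ times the partner vector, with $\kappa = -y / (1 + \exp(y\phi))$ for logistic loss. The function name, the toy sizes, and the values of eta and lam are assumptions for illustration.

```python
import numpy as np

def adagrad_pair_update(w_a, w_b, G_a, G_b, x_a, x_b, kappa, eta=0.2, lam=2e-5):
    """In-place update of the two latent vectors involved in one feature pair.

    w_a plays the role of w_{j1,f2}, w_b the role of w_{j2,f1}.
    """
    # Sub-gradients (both computed before either vector is changed)
    g_a = lam * w_a + kappa * w_b * x_a * x_b
    g_b = lam * w_b + kappa * w_a * x_a * x_b
    # Accumulate squared gradients per dimension
    G_a += g_a ** 2
    G_b += g_b ** 2
    # Per-dimension step size eta / sqrt(G)
    w_a -= eta / np.sqrt(G_a) * g_a
    w_b -= eta / np.sqrt(G_b) * g_b

k = 4
rng = np.random.default_rng(0)
w_a = rng.uniform(0.0, 1.0 / np.sqrt(k), size=k)   # init on [0, 1/sqrt(k)]
w_b = rng.uniform(0.0, 1.0 / np.sqrt(k), size=k)
G_a, G_b = np.ones(k), np.ones(k)                  # G starts at 1

y, x_a, x_b = 1.0, 1.0, 1.0
phi_before = float(w_a @ w_b) * x_a * x_b
kappa = -y / (1.0 + np.exp(y * phi_before))        # derivative of the logistic loss

adagrad_pair_update(w_a, w_b, G_a, G_b, x_a, x_b, kappa)
phi_after = float(w_a @ w_b) * x_a * x_b
```

For a positive example (y = 1) the step pushes the interaction score up, which is the direction that lowers the logistic loss.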

1.3.2. Parallelization on Shared-memory Systems

The paper uses the HOGWILD! parallelization scheme.

1.3.3. Adding Field Information

The widely used LIBSVM data format is as follows.

label feat1:val1 feat2:val2 …

Each (feat, val) pair gives a feature index and its value. For FFM this format can be extended as follows.

label field1:feat1:val1 field2:feat2:val2 …

This means every feature must be assigned a suitable field. The assignment is easy for some kinds of features but not for others. The discussion below covers three classes of features.

Categorical Features
In linear models a categorical feature is usually converted into several binary features. A data instance can be transformed as follows.

Because zero values are not stored in the LIBSVM format, every categorical feature can be turned into binary features this way. The data above finally takes the following form.

Numerical Features
Suppose we have data on whether a paper is accepted at a conference. The columns mean the following.

AR: accept rate of the conference
Hidx: h-index of the author
Cite: # citations of the author

Each feature could be treated as a dummy field to produce the form below, but the dummy fields merely duplicate the features and carry little information.

Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3

Another option is to put the feature name in the field, discretize the real value to form the feature, and set the value to a binary 1.

Yes AR:45:1 Hidx:2:1 Cite:3:1

There are many ways to discretize. Whichever is used, some loss of information has to be accepted.
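
The two encodings of numerical features can be reproduced with a couple of string-formatting helpers. This is only a sketch: the function names are hypothetical, discretization is assumed to be simple truncation to an integer, and readable names are used in place of the integer indices that actual FFM tools generally expect.

```python
def to_ffm_dummy(label, row):
    """Each numerical feature is its own dummy field: field:feature:value."""
    return " ".join([label] + [f"{name}:{name}:{value}" for name, value in row.items()])

def to_ffm_discretized(label, row):
    """Feature name becomes the field, the truncated value becomes the feature."""
    return " ".join([label] + [f"{name}:{int(value)}:1" for name, value in row.items()])

row = {"AR": 45.73, "Hidx": 2, "Cite": 3}

dummy_line = to_ffm_dummy("Yes", row)
disc_line = to_ffm_discretized("Yes", row)

print(dummy_line)  # Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3
print(disc_line)   # Yes AR:45:1 Hidx:2:1 Cite:3:1
```

Truncation is the crudest possible binning. Coarser or quantile-based bins trade more information loss for fewer distinct features.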

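Single-field Features
In some datasets all features belong to a single field, so assigning a field to each feature is meaningless. This is especially common in areas such as NLP.

In the case above the only field would be “sentence”. One might ask whether dummy fields could be created as with numerical features, but then the number of fields f equals n (the number of features), which is usually huge, so the approach is very inefficient.

(Recall that the FFM model size is $O(nfk)$. In this case $f=n$, that is, the number of fields equals the number of features.)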

1.4. Experiments

(rest omitted)

2. xlearn

2.1. Installation

There are several ways to install xlearn, but installing from the whl file available here is the simplest.

2.2. Code

import math


def _convert_to_ffm(path, df, type, target, numerics, categories, features, encoder):
    # Flagging categorical and numerical fields
    print('convert_to_ffm - START')
    for x in numerics:
        if(x not in encoder['catdict']):
            print(f'UPDATING CATDICT: numeric field - {x}')
            encoder['catdict'][x] = 0
    for x in categories:
        if(x not in encoder['catdict']):
            print(f'UPDATING CATDICT: categorical field - {x}')
            encoder['catdict'][x] = 1

    nrows = df.shape[0]
    with open(path + str(type) + "_ffm.txt", "w") as text_file:

        # Looping over rows to convert each row to libffm format
        for n, r in enumerate(range(nrows)):
            datastring = ""
            datarow = df.iloc[r].to_dict()
            datastring += str(int(datarow[target]))  # Set Target Variable here

            # For numerical fields, we are creating a dummy field here
            for i, x in enumerate(encoder['catdict'].keys()):
                if(encoder['catdict'][x] == 0):
                    # Not adding numerical values that are nan
                    if math.isnan(datarow[x]) is not True:
                        datastring = datastring + " "+str(i)+":" + str(i)+":" + str(datarow[x])
                else:

                    # For a new field appearing in a training example
                    if(x not in encoder['catcodes']):
                        print(f'UPDATING CATCODES: categorical field - {x}')
                        encoder['catcodes'][x] = {}
                        encoder['currentcode'] += 1
                        print(f'UPDATING CATCODES: categorical value for field {x} - {datarow[x]}')
                        encoder['catcodes'][x][datarow[x]] = encoder['currentcode']  # encoding the feature

                    # For already encoded fields
                    elif(datarow[x] not in encoder['catcodes'][x]):
                        encoder['currentcode'] += 1
                        print(f'UPDATING CATCODES: categorical value for field {x} - {datarow[x]}')
                        encoder['catcodes'][x][datarow[x]] = encoder['currentcode']  # encoding the feature

                    code = encoder['catcodes'][x][datarow[x]]
                    datastring = datastring + " "+str(i)+":" + str(int(code))+":1"

            datastring += '\n'
            text_file.write(datastring)

    # print('Encoder Summary:')
    # print(json.dumps(encoder, indent=4))
    return encoder

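After converting the data into the libffm format (the field-aware extension of the LIBSVM format) as above, train the model as follows.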

import xlearn as xl

model = xl.create_ffm()

# Set the paths to the training and validation data
model.setTrain("data/train_ffm.txt")
model.setValidate("data/test_ffm.txt")

# Disable early stopping
model.disableEarlyStop()

# Declare the parameters
param = {'task': 'binary', 'lr': 0.2, 'lambda': 0.00002,
         'k': 3, 'epoch': 100, 'metric': 'auc', 'opt': 'adagrad',
         'num_threads': 4}

# Train and write the model file that predict() reads below
model.fit(param, "model/model.out")

# Optional: cross-validation on the training data
# model.cv(param)

# Predict
model.setTest("data/test_ffm.txt")
model.setSigmoid()
model.predict("model/model.out", "output/predictions.txt")

Training proceeds as shown above. It is simple.

Reference

https://wngaw.github.io/field-aware-factorization-machines-with-xlearn/
