
[LM metric] BLEU(Bilingual Evaluation Understudy)

The two most widely used metrics for evaluating a generated sentence are BLEU and ROUGE.

How much of the Reference Sentence's words appear in the Generated Sentence → ROUGE
How much of the Generated Sentence's words appear in the Reference Sentence → BLEU

(Generated Sentence: the sentence produced by the model. Reference Sentence: the ground-truth sentence.)

The ROUGE score is mainly used for Text Summarization
Based on n-gram recall
The BLEU score is generally used for Machine Translation
Based on n-gram precision
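The difference between the two viewpoints is only the denominator. A minimal sketch, assuming whitespace tokenization and unigrams only (the helper name `unigram_overlap` is made up for illustration, and the sentence pair is borrowed from the Brevity Penalty example later in this post):

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Number of candidate tokens that also occur in the reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Counter & Counter keeps the minimum count of each token.
    return sum((cand_counts & ref_counts).values())

candidate = "빛이 쐬는 노인은 완벽한 어두운곳에서 잠듬".split()
reference = "빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다".split()

overlap = unigram_overlap(candidate, reference)
precision = overlap / len(candidate)  # BLEU viewpoint: divide by the generated length
recall = overlap / len(reference)     # ROUGE viewpoint: divide by the reference length
print(f"overlap={overlap}, precision={precision:.3f}, recall={recall:.3f}")
```

The short prediction gets a precision of 3/6 but a recall of only 3/14, which is why a precision-based metric needs a separate length correction.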

BLEU(Bilingual Evaluation Understudy)

BLEU measures how much the predicted sentence overlaps with the reference sentence, taking sentence length and repeated words into account.

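Measures how many n-grams overlap (Precision)
Penalizes predictions that are too short, which would otherwise be over-credited (Brevity Penalty)
Caps the credit given to a word that is repeated in the prediction (Clipping)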

1. Precision: measures how many n-grams (n = 1 to 4) of the predicted sentence also appear in the true sentence

Predicted sentence: 빛이 쐬는 노인은 완벽한 어두운곳에서 잠든 사람과 비교할 때 강박증이 심해질 기회가 훨씬 높았다
True sentence: 빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다
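The n-gram precisions for this pair can be counted with a few lines of Python. This is a rough sketch, assuming whitespace tokenization; `ngrams` and `ngram_precision` are illustrative helper names, not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Return (matched, total) over the candidate's n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    # The intersection also caps repeated n-grams; no n-gram repeats
    # in this example, so it does not change the counts here.
    matched = sum((cand & ref).values())
    total = sum(cand.values())
    return matched, total

predicted = "빛이 쐬는 노인은 완벽한 어두운곳에서 잠든 사람과 비교할 때 강박증이 심해질 기회가 훨씬 높았다".split()
true_sentence = "빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다".split()

precisions = {n: ngram_precision(predicted, true_sentence, n) for n in range(1, 5)}
for n, (matched, total) in precisions.items():
    print(f"{n}-gram precision: {matched}/{total}")
```

Under these assumptions the counts come out as 10/14, 5/13, 2/12, and 1/11 for n = 1 to 4.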

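2. Clipping: caps the count of a word that is repeated in the predicted sentence so that it is not over-credited

The predicted sentence below contains repeated words (the: 3, more: 2).

To correct for this, each word's count is clipped to the maximum number of times it occurs in the true sentence (the: 2, more: 1).

Higher-order n-grams are handled the same way.

Predicted sentence: The more decomposition the more flavor the food has
True sentence: The more the merrier I always say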
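A small sketch of the clipped unigram count for this pair. It assumes whitespace tokenization and lowercasing, so that "The" and "the" count as the same word, which is what the counts quoted above imply.

```python
from collections import Counter

predicted = "The more decomposition the more flavor the food has".lower().split()
true_sentence = "The more the merrier I always say".lower().split()

pred_counts = Counter(predicted)
true_counts = Counter(true_sentence)

# Without clipping: every occurrence of a word found in the true sentence counts.
unclipped = sum(count for word, count in pred_counts.items() if word in true_counts)

# With clipping: each word counts at most as often as it occurs in the true sentence.
clipped = sum(min(count, true_counts[word]) for word, count in pred_counts.items())

total = len(predicted)
print(f"unclipped precision: {unclipped}/{total}")
print(f"clipped precision:   {clipped}/{total}")
```

Clipping lowers the unigram precision from 5/9 to 3/9 because only two occurrences of "the" and one of "more" are credited.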

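3. Brevity Penalty: penalizes predicted sentences that are shorter than the true sentence

The length correction factor is computed from the lengths of the two sentences below.

Predicted sentence: 빛이 쐬는 노인은 완벽한 어두운곳에서 잠듬
True sentence: 빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다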
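As a sketch, assuming the definition from the original BLEU paper (BP = 1 if c > r, otherwise exp(1 - r/c), where c is the predicted length and r the reference length) and whitespace tokenization, the penalty for this pair could be computed as follows. Some tutorials use the simpler ratio min(1, c/r) instead, so treat the exact form as an assumption.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Brevity penalty as defined in the original BLEU paper (assumed here)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

predicted = "빛이 쐬는 노인은 완벽한 어두운곳에서 잠듬".split()
true_sentence = "빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다".split()

bp = brevity_penalty(len(predicted), len(true_sentence))
print(f"c={len(predicted)}, r={len(true_sentence)}, BP={bp:.4f}")
```

With 6 predicted tokens against 14 reference tokens the penalty is about 0.26, so even a perfect precision would be scaled down heavily.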

BLEU score

Predicted sentence: 빛이 쐬는 노인은 완벽한 어두운곳에서 잠든 사람과 비교할 때 강박증이 심해질 기회가 훨씬 높았다
True sentence: 빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다
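Putting the three pieces together, BLEU is the brevity penalty times the weighted geometric mean of the clipped n-gram precisions: BLEU = BP · exp(Σ w_n · log p_n). The following is a minimal single-sentence, single-reference sketch under the same assumptions as before (whitespace tokens, uniform weights, paper-style brevity penalty); `sentence_bleu` here is an illustrative function written for this post, not a library call.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4, weights=None):
    """BLEU for one candidate against one reference (illustrative sketch)."""
    weights = weights or [1 / max_n] * max_n
    log_precision_sum = 0.0
    for n, weight in zip(range(1, max_n + 1), weights):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        matched = sum((cand & ref).values())  # clipped matches
        total = sum(cand.values())
        if matched == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision_sum += weight * math.log(matched / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_precision_sum)

predicted = "빛이 쐬는 노인은 완벽한 어두운곳에서 잠든 사람과 비교할 때 강박증이 심해질 기회가 훨씬 높았다".split()
true_sentence = "빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다".split()

score = sentence_bleu(predicted, true_sentence)
print(f"BLEU = {score:.3f}")
```

Both sentences have 14 tokens, so the brevity penalty is 1 and the score reduces to (10/14 × 5/13 × 2/12 × 1/11)^(1/4), roughly 0.254.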

TORCHTEXT.DATA.METRICS

from torchtext.data.metrics import bleu_score

# Call signature, shown with the default values for max_n and weights
bleu_score(candidate_corpus,
           references_corpus,
           max_n=4,
           weights=[0.25, 0.25, 0.25, 0.25])

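candidate_corpus – the generated sentences: an iterable of candidate sentences, each of which is an iterable of tokens
references_corpus – the reference sentences: an iterable with one entry per candidate, where each entry is an iterable of reference sentences and each reference sentence is an iterable of tokens
max_n – the largest n-gram order to use (e.g. max_n=3 uses unigrams, bigrams, and trigrams)
weights – a list holding the weight for each n-gram order (default: [0.25, 0.25, 0.25, 0.25])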
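The nesting of the two arguments is easy to get wrong, so here is a hedged usage sketch with the sentence pair from the BLEU score section. It assumes whitespace tokenization and that torchtext is installed; the import is guarded so the input construction can still be inspected without it.

```python
predicted = "빛이 쐬는 노인은 완벽한 어두운곳에서 잠든 사람과 비교할 때 강박증이 심해질 기회가 훨씬 높았다".split()
true_sentence = "빛이 쐬는 사람은 완벽한 어둠에서 잠든 사람과 비교할 때 우울증이 심해질 가능성이 훨씬 높았다".split()

# One candidate sentence -> a list containing one token list.
candidate_corpus = [predicted]
# One list of references per candidate; here each candidate has a single reference.
references_corpus = [[true_sentence]]

try:
    from torchtext.data.metrics import bleu_score
except (ImportError, OSError):
    # torchtext is missing or could not be loaded in this environment.
    bleu_score = None

if bleu_score is not None:
    score = bleu_score(candidate_corpus, references_corpus,
                       max_n=4, weights=[0.25, 0.25, 0.25, 0.25])
    print(f"torchtext BLEU = {score:.3f}")
```

Note that `references_corpus` is one level deeper than `candidate_corpus`, because each candidate may be compared against several references.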

References

https://jrc-park.tistory.com/273

https://supkoon.tistory.com/18

https://donghwa-kim.github.io/BLEU.html

https://pytorch.org/text/stable/data_metrics.html
