NLP_Preprocessing : +)Detokenization

A tokenized sentence has to be restored (detokenization) into a form that is easy for people to read.

Detokenization is therefore a necessary step in natural language generation tasks.

Tokenization

1. Original English sentence

There's currently over a thousand TED Talks on the TED website.

2. After tokenization, insert ▁ (U+2581) at the position of each original space, so that the original spaces can be told apart from the spaces introduced by tokenization.

▁There 's ▁currently ▁over ▁a ▁thousand ▁TED ▁Talks ▁on ▁the ▁TED ▁website .

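3. After subword segmentation, insert another ▁ so that the spaces present up to the previous step can be told apart from the spaces introduced by subword segmentation. As a result, ▁▁ marks an original space, a single ▁ marks a space added by tokenization, and a plain space marks a subword boundary.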

▁▁There ▁'s ▁▁currently ▁▁over ▁▁a ▁▁thous and ▁▁TED ▁▁T al ks ▁▁on ▁▁the ▁▁TED ▁▁we b site ▁.
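The marker rule used in steps 2 and 3 can be sketched as below. This is only an illustration, not the tokenizer or BPE model actually used for this post: `mark_original_spaces` is a hypothetical helper, and the token lists are typed in by hand. It assumes the tokens, concatenated, reproduce the text they were split from.

```python
SPACE_MARK = '▁'  # U+2581


def mark_original_spaces(original, tokens):
    """Prefix ▁ to every token that starts a whitespace-delimited word of `original`.

    Hypothetical helper for illustration; assumes the tokens concatenate
    back to the words of `original` in order.
    """
    words = original.split()
    marked = []
    word_idx = 0
    consumed = 0  # characters of words[word_idx] already covered by tokens
    for tok in tokens:
        # A token at the start of a word was preceded by a pre-existing space.
        marked.append(SPACE_MARK + tok if consumed == 0 else tok)
        consumed += len(tok)
        if consumed >= len(words[word_idx]):
            word_idx += 1
            consumed = 0
    return ' '.join(marked)


sentence = "There's currently over a thousand TED Talks on the TED website."

# Step 2: hand-written tokenizer output for the sentence.
tokens = ["There", "'s", "currently", "over", "a", "thousand", "TED",
          "Talks", "on", "the", "TED", "website", "."]
tokenized = mark_original_spaces(sentence, tokens)

# Step 3: hand-written subword segmentation of the step 2 result.
subwords = ["▁There", "'s", "▁currently", "▁over", "▁a", "▁thous", "and",
            "▁TED", "▁T", "al", "ks", "▁on", "▁the", "▁TED", "▁we", "b",
            "site", "."]
segmented = mark_original_spaces(tokenized, subwords)

print(tokenized)
print(segmented)
```

Applying the same function twice is what produces the doubled marker: in step 3 the "original" text is the output of step 2, so spaces that were already there gain a second ▁.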

Detokenization

1. Remove all whitespace

▁▁There▁'s▁▁currently▁▁over▁▁a▁▁thousand▁▁TED▁▁Talks▁▁on▁▁the▁▁TED▁▁website▁.

2. Replace ▁▁ with a whitespace

There▁'s currently over a thousand TED Talks on the TED website▁.

3. Remove the remaining ▁

There's currently over a thousand TED Talks on the TED website.
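The three steps above can be traced one at a time. The following is a minimal sketch that keeps each intermediate string; `detokenize_steps` is a name made up for this illustration.

```python
def detokenize_steps(line):
    """Return the intermediate result of each detokenization step (illustrative helper)."""
    step1 = line.strip().replace(' ', '')      # 1. remove all whitespace
    step2 = step1.replace('▁▁', ' ').strip()   # 2. ▁▁ -> whitespace (original spaces)
    step3 = step2.replace('▁', '')             # 3. drop the remaining single ▁
    return step1, step2, step3


segmented = ("▁▁There ▁'s ▁▁currently ▁▁over ▁▁a ▁▁thous and ▁▁TED "
             "▁▁T al ks ▁▁on ▁▁the ▁▁TED ▁▁we b site ▁.")

for result in detokenize_steps(segmented):
    print(result)
```

The order matters: if the single ▁ were removed before ▁▁ is replaced, the original spaces would be lost as well.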

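import sys

STR = '▁'       # single marker (U+2581)
TWO_STR = '▁▁'  # doubled marker: an original space after subword segmentation


def detokenization(line):
    if TWO_STR in line:
        # Subword-segmented input: drop spaces, turn ▁▁ back into a space,
        # then remove the single ▁ left over from tokenization.
        line = line.strip().replace(' ', '').replace(TWO_STR, ' ').replace(STR, '').strip()
    else:
        # Tokenized-only input: a single ▁ marks an original space.
        line = line.strip().replace(' ', '').replace(STR, ' ').strip()

    return line


if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip() != "":
            # Detokenize each tab-separated column independently.
            buf = []
            for token in line.strip().split('\t'):
                buf.append(detokenization(token))

            sys.stdout.write('\t'.join(buf) + '\n')
        else:
            # Preserve empty lines.
            sys.stdout.write('\n')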

python ./detokenizer.py < ./review.sorted.uniq.refined.tok.bpe.tsv > review.sorted.uniq.tok.bpe.detok.tsv
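Because the script splits every line on tabs and chooses the rule per column, one row may mix a tokenized-only column with a subword-segmented one. The snippet below is a small sketch of that behavior: `detok_field` restates the rule of `detokenization()` under a hypothetical name so that it runs on its own, and the sample row is made up for the example.

```python
MARK = '▁'  # U+2581


def detok_field(field):
    """Same rule as detokenization() in the script, restated for a standalone demo."""
    field = field.strip()
    if MARK * 2 in field:
        # Subword-segmented column: ▁▁ is an original space.
        return field.replace(' ', '').replace(MARK * 2, ' ').replace(MARK, '').strip()
    # Tokenized-only column: a single ▁ is an original space.
    return field.replace(' ', '').replace(MARK, ' ').strip()


# Made-up row: first column tokenized only, second column subword-segmented.
row = "▁There 's ▁currently\t▁▁T al ks ▁▁on ▁▁the ▁▁we b site ▁."
fields = [detok_field(f) for f in row.split('\t')]
print('\t'.join(fields))
```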
