[pytorch] Integer Encoding | Integer Encoding Code Using Counter and FreqDist


독립성이 강한 ISFP 2024. 11. 14. 21:00


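Integer encoding is a key step in natural language processing that assigns an integer index to each word.

It converts text into numbers that a computer can work with, and it is the basis for later steps such as one-hot encoding and word embedding.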

What Is Integer Encoding?

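Integer encoding assigns a unique number to each word in a text.

The most common approach gives lower numbers to words that appear more often. The text is first analyzed to build a vocabulary in which the words are ordered from most to least frequent.

In this vocabulary, the most frequent word receives the lowest number, and the less frequent a word is, the higher its number.

For example, once the word frequencies of a text have been counted and integers assigned in frequency order, every word maps to an integer and can be managed efficiently. Turning text into numbers this way lets a computer handle it effectively, which is why integer encoding is used at many stages of natural language processing.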
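Before the full walkthrough, here is a minimal sketch of the idea on a toy token list. The tokens are made up for illustration and are not part of the example text used later in this post.

```python
from collections import Counter

# Hypothetical toy tokens, used only to illustrate the idea.
tokens = ["tea", "milk", "tea", "sugar", "tea", "milk"]

# Count how often each token occurs.
counts = Counter(tokens)

# The most frequent token gets 1, the next one gets 2, and so on.
word_to_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}
print(word_to_index)  # {'tea': 1, 'milk': 2, 'sugar': 3}

# Replace every token with its integer.
encoded = [word_to_index[token] for token in tokens]
print(encoded)  # [1, 2, 1, 3, 1, 2]
```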

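Integer Encoding with a Dictionary

First, split the text into sentences, tokenize each sentence into words, and build the vocabulary after removing stop words and words at or below a certain length. The vocabulary records each word together with its frequency. Then sort the words by frequency in descending order and assign integers so that more frequent words get lower numbers.

1. Define the example text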

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import nltk
nltk.download('punkt')      # newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('stopwords')

# Example text data
raw_text = """
Every morning, I wake up early to watch the sunrise. I enjoy the calm and stillness of the dawn. The air is fresh, and there’s a quiet energy that helps me prepare for the day ahead. I start my day with a warm cup of coffee, savoring the rich aroma as I plan out my tasks. Sometimes, I read a few pages from a book or write in my journal, setting intentions for the day.

After breakfast, I take a walk around the neighborhood, greeting familiar faces and observing the subtle changes in the scenery. The sound of birds chirping and the sight of dew on the grass bring a sense of peace. The walk refreshes my mind and energizes me.

As the day progresses, I focus on my work and handle various tasks that require my attention. There are moments of challenge and focus, balanced by breaks where I step outside, stretch, or enjoy a snack. In the afternoon, I like to check my to-do list, reflecting on what I have accomplished so far and adjusting my goals if necessary.

When evening arrives, I wind down with some relaxing activities. I might cook a simple meal, listen to music, or watch a favorite show. Before going to bed, I take a moment to express gratitude for the day and set positive thoughts for the next. With a calm mind, I prepare for sleep, looking forward to another new day.
"""

2. Apply sentence tokenization to the example text above.

sentences = sent_tokenize(raw_text)
sentences

> ['\nEvery morning, I wake up early to watch the sunrise.',
 'I enjoy the calm and stillness of the dawn.',
 'The air is fresh, and there’s a quiet energy that helps me prepare for the day ahead.',
 'I start my day with a warm cup of coffee, savoring the rich aroma as I plan out my tasks.',
 'Sometimes, I read a few pages from a book or write in my journal, setting intentions for the day.',
 'After breakfast, I take a walk around the neighborhood, greeting familiar faces and observing the subtle changes in the scenery.',
 'The sound of birds chirping and the sight of dew on the grass bring a sense of peace.',
 'The walk refreshes my mind and energizes me.',
 'As the day progresses, I focus on my work and handle various tasks that require my attention.',
 'There are moments of challenge and focus, balanced by breaks where I step outside, stretch, or enjoy a snack.',
 'In the afternoon, I like to check my to-do list, reflecting on what I have accomplished so far and adjusting my goals if necessary.',
 'When evening arrives, I wind down with some relaxing activities.',
 'I might cook a simple meal, listen to music, or watch a favorite show.',
 'Before going to bed, I take a moment to express gratitude for the day and set positive thoughts for the next.',
 'With a calm mind, I prepare for sleep, looking forward to another new day.']

2. Split each sentence into words, remove stop words and very short words, and compute the vocabulary and word frequencies.

stop_words = set(stopwords.words('english'))  # English stop words
vocab = {}                                    # word -> frequency
preprocessed_sentences = []                   # cleaned token list for each sentence

# Tokenize, clean, and normalize each sentence
# (the sentences list holds the text split into sentences)
for sentence in sentences:
    tokenized_sentence = word_tokenize(sentence)  # split into word tokens
    result = []

    for word in tokenized_sentence:
        word = word.lower()  # lowercase every word
        if word not in stop_words and len(word) > 2:  # drop stop words and words of length <= 2
            result.append(word)
            if word not in vocab:  # first occurrence: add the word to the vocabulary
                vocab[word] = 0
            vocab[word] += 1
    preprocessed_sentences.append(result)  # store the cleaned sentence

print("Preprocessed Sentences:", preprocessed_sentences)
print("Vocabulary:", vocab)

> Preprocessed Sentences: [['every', 'morning', 'wake', 'early', 'watch', 'sunrise'], ['enjoy', 'calm', 'stillness', 'dawn'], ['air', 'fresh', 'quiet', 'energy', 'helps', 'prepare', 'day', 'ahead'], ['start', 'day', 'warm', 'cup', 'coffee', 'savoring', 'rich', 'aroma', 'plan', 'tasks'], ['sometimes', 'read', 'pages', 'book', 'write', 'journal', 'setting', 'intentions', 'day'], ['breakfast', 'take', 'walk', 'around', 'neighborhood', 'greeting', 'familiar', 'faces', 'observing', 'subtle', 'changes', 'scenery'], ['sound', 'birds', 'chirping', 'sight', 'dew', 'grass', 'bring', 'sense', 'peace'], ['walk', 'refreshes', 'mind', 'energizes'], ['day', 'progresses', 'focus', 'work', 'handle', 'various', 'tasks', 'require', 'attention'], ['moments', 'challenge', 'focus', 'balanced', 'breaks', 'step', 'outside', 'stretch', 'enjoy', 'snack'], ['afternoon', 'like', 'check', 'to-do', 'list', 'reflecting', 'accomplished', 'far', 'adjusting', 'goals', 'necessary'], ['evening', 'arrives', 'wind', 'relaxing', 'activities'], ['might', 'cook', 'simple', 'meal', 'listen', 'music', 'watch', 'favorite', 'show'], ['going', 'bed', 'take', 'moment', 'express', 'gratitude', 'day', 'set', 'positive', 'thoughts', 'next'], ['calm', 'mind', 'prepare', 'sleep', 'looking', 'forward', 'another', 'new', 'day']]
Vocabulary: {'every': 1, 'morning': 1, 'wake': 1, 'early': 1, 'watch': 2, 'sunrise': 1, 'enjoy': 2, 'calm': 2, 'stillness': 1, 'dawn': 1, 'air': 1, 'fresh': 1, 'quiet': 1, 'energy': 1, 'helps': 1, 'prepare': 2, 'day': 6, 'ahead': 1, 'start': 1, 'warm': 1, 'cup': 1, 'coffee': 1, 'savoring': 1, 'rich': 1, 'aroma': 1, 'plan': 1, 'tasks': 2, 'sometimes': 1, 'read': 1, 'pages': 1, 'book': 1, 'write': 1, 'journal': 1, 'setting': 1, 'intentions': 1, 'breakfast': 1, 'take': 2, 'walk': 2, 'around': 1, 'neighborhood': 1, 'greeting': 1, 'familiar': 1, 'faces': 1, 'observing': 1, 'subtle': 1, 'changes': 1, 'scenery': 1, 'sound': 1, 'birds': 1, 'chirping': 1, 'sight': 1, 'dew': 1, 'grass': 1, 'bring': 1, 'sense': 1, 'peace': 1, 'refreshes': 1, 'mind': 2, 'energizes': 1, 'progresses': 1, 'focus': 2, 'work': 1, 'handle': 1, 'various': 1, 'require': 1, 'attention': 1, 'moments': 1, 'challenge': 1, 'balanced': 1, 'breaks': 1, 'step': 1, 'outside': 1, 'stretch': 1, 'snack': 1, 'afternoon': 1, 'like': 1, 'check': 1, 'to-do': 1, 'list': 1, 'reflecting': 1, 'accomplished': 1, 'far': 1, 'adjusting': 1, 'goals': 1, 'necessary': 1, 'evening': 1, 'arrives': 1, 'wind': 1, 'relaxing': 1, 'activities': 1, 'might': 1, 'cook': 1, 'simple': 1, 'meal': 1, 'listen': 1, 'music': 1, 'favorite': 1, 'show': 1, 'going': 1, 'bed': 1, 'moment': 1, 'express': 1, 'gratitude': 1, 'set': 1, 'positive': 1, 'thoughts': 1, 'next': 1, 'sleep': 1, 'looking': 1, 'forward': 1, 'another': 1, 'new': 1}

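- word.lower(): converts every word to lowercase.

- if word not in stop_words and len(word) > 2: removes stop words and applies a length limit, so that words that carry little meaning or are too short are excluded. For example, ‘I’ and ‘a’ are excluded.

- vocab[word] += 1: the word is added to the vocab dictionary with a count of 0 on its first occurrence, and its count is then increased by 1 every time it appears.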

Preprocessed Sentences

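List of cleaned sentences: the result of tokenizing the original text into words after stop-word removal and cleaning. Each sentence is stored as a list of word tokens.

For example, ['every', 'morning', 'wake', 'early', 'watch', 'sunrise'] is the first sentence of the original text after cleaning and splitting into words.

Stop-word removal: stop words such as “is” and “the” have been removed.

Word length limit: words of length 2 or less are also removed, so short words such as “I” and “a” are not included.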
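The filtering rule can be tried on a single sentence without NLTK. The sketch below assumes a tiny hand-written stop-word set and a pre-split token list. Both are hypothetical stand-ins for stopwords.words('english') and word_tokenize.

```python
# Hypothetical, tiny stop-word set for illustration only; the post uses
# NLTK's full English list via stopwords.words('english').
stop_words = {"i", "the", "and", "of", "to", "up"}

# Hand-split tokens for the second sentence of the example text.
tokens = ["I", "enjoy", "the", "calm", "and", "stillness", "of", "the", "dawn", "."]

kept = []
for token in tokens:
    word = token.lower()
    # Keep a word only if it is not a stop word and is longer than 2 characters.
    if word not in stop_words and len(word) > 2:
        kept.append(word)

print(kept)  # ['enjoy', 'calm', 'stillness', 'dawn']
```

The length check also discards punctuation tokens such as ".", which are one character long.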

Vocabulary

Vocabulary: a dictionary that stores each word in the text together with its frequency.

The word is the key, and the number of times it appears in the text is the value.

For example, {'day': 6, 'watch': 2, 'morning': 1, 'prepare': 2,...}

“day” appears 6 times in total, “watch” twice, and “morning” once.

3. Sort the words by frequency

vocab_sorted = sorted(vocab.items(), key=lambda x: x[1], reverse=True)

> [('day', 6), ('watch', 2), ('enjoy', 2), ('calm', 2), ('prepare', 2), ('tasks', 2), ('take', 2), ('walk', 2), ('mind', 2), ('focus', 2), ('every', 1), ('morning', 1), ('wake', 1), ('early', 1), ('sunrise', 1), ('stillness', 1), ('dawn', 1), ('air', 1), ('fresh', 1), ('quiet', 1), ('energy', 1), ('helps', 1), ('ahead', 1), ('start', 1), ('warm', 1), ('cup', 1), ('coffee', 1), ('savoring', 1), ('rich', 1), ('aroma', 1), ('plan', 1), ('sometimes', 1), ('read', 1), ('pages', 1), ('book', 1), ('write', 1), ('journal', 1), ('setting', 1), ('intentions', 1), ('breakfast', 1),...

vocab.items() returns the (word, frequency) pairs of the vocab dictionary. The sorted function orders them by frequency (x[1]) in descending order (reverse=True), so the most frequent words come first.
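Many words share the same frequency, so it helps to know how ties are ordered. Python's sort is stable, which means words with equal counts keep their dictionary (first-seen) order. The sketch below uses a small made-up vocab to show this. The alphabetical tie-break is an optional variation that this post does not use.

```python
# Small hypothetical vocabulary with several ties at frequency 2.
vocab = {"watch": 2, "day": 6, "enjoy": 2, "calm": 2, "air": 1}

# Stable sort: words with equal counts stay in insertion order.
by_freq = sorted(vocab.items(), key=lambda x: x[1], reverse=True)
print(by_freq)
# [('day', 6), ('watch', 2), ('enjoy', 2), ('calm', 2), ('air', 1)]

# Variation: sort by descending frequency, then alphabetically,
# so the result does not depend on the order words were first seen.
by_freq_then_alpha = sorted(vocab.items(), key=lambda x: (-x[1], x[0]))
print(by_freq_then_alpha)
# [('day', 6), ('calm', 2), ('enjoy', 2), ('watch', 2), ('air', 1)]
```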

4. Create a dictionary called word_to_index that gives lower indices to more frequent words. Words with a frequency of 1 or less are excluded, so only words that appear at least twice receive an index.

# Build the dictionary, giving lower indices to more frequent words
word_to_index = {}
index = 1  # indices start at 1
for word, frequency in vocab_sorted:
    if frequency > 1:  # skip words that appear only once
        word_to_index[word] = index
        index += 1

print("Word to Index:", word_to_index)

> Word to Index: {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5, 'tasks': 6, 'take': 7, 'walk': 8, 'mind': 9, 'focus': 10}

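The condition if frequency > 1 excludes words with a frequency of 1 or less, so only words that appear at least twice receive an index.

word_to_index stores the mapping from word to index, and the more frequent a word is, the lower its index.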
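The mapping can also be inverted when encoded integers need to be turned back into words. This is a small sketch and not part of the original walkthrough. The name index_to_word is an assumption chosen for illustration.

```python
# The mapping produced in the step above.
word_to_index = {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5,
                 'tasks': 6, 'take': 7, 'walk': 8, 'mind': 9, 'focus': 10}

# Swap keys and values to get an index -> word lookup.
index_to_word = {index: word for word, index in word_to_index.items()}

# Decode a list of integers back into words.
decoded = [index_to_word[i] for i in [1, 8, 9]]
print(decoded)  # ['day', 'walk', 'mind']
```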

5. Limit the vocabulary size: keep only the top N most frequent words and remove the rest.

vocab_size = 5

# Collect the words whose index is greater than vocab_size (5)
words_frequency = [word for word, index in word_to_index.items() if index > vocab_size]

# Delete the index entries for those words
for w in words_frequency:
    del word_to_index[w]
print(word_to_index)

> {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5}

word_to_index now contains only the five most frequent words.

Suppose the first sentence in preprocessed_sentences is ['every', 'morning', 'wake', 'early', 'watch', 'sunrise']. Each word is looked up in the word_to_index dictionary and converted to its integer.

For example, suppose word_to_index holds only the top five words plus an OOV entry, with the following indices.
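The same five-word mapping can be built in one pass, without creating the larger dictionary and deleting entries afterwards. The sketch below assumes a shortened copy of vocab_sorted. The name min_count is hypothetical and stands in for the frequency > 1 rule.

```python
# Shortened, hand-copied prefix of vocab_sorted for illustration.
vocab_sorted = [('day', 6), ('watch', 2), ('enjoy', 2), ('calm', 2),
                ('prepare', 2), ('tasks', 2), ('take', 2), ('every', 1)]

vocab_size = 5
min_count = 2  # hypothetical name; equivalent to the `frequency > 1` check

# Keep words that are frequent enough, in their sorted order.
candidates = [word for word, freq in vocab_sorted if freq >= min_count]

# Slice to the top vocab_size words and number them from 1.
word_to_index = {
    word: i for i, word in enumerate(candidates[:vocab_size], start=1)
}
print(word_to_index)
# {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5}
```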

word_to_index = {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5, 'OOV': 6}

In this case, each word is looked up in word_to_index, and any word that is not found is treated as OOV (Out-Of-Vocabulary). In ['every', 'morning', 'wake', 'early', 'watch', 'sunrise'], watch is in word_to_index, so it becomes index 2, and the other words are not, so they become 6, the index of OOV.

This sentence is therefore converted as follows.

['every', 'morning', 'wake', 'early', 'watch', 'sunrise'] → [6, 6, 6, 6, 2, 6]

In code, this looks as follows.

word_to_index['OOV'] = len(word_to_index) + 1
print(word_to_index)

> {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5, 'OOV': 6}

encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            # If the word is in the vocabulary, use its integer.
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            # If the word is not in the vocabulary, use the integer for 'OOV'.
            encoded_sentence.append(word_to_index['OOV'])
    encoded_sentences.append(encoded_sentence)
print(encoded_sentences)

> [[6, 6, 6, 6, 2, 6], [3, 4, 6, 6], [6, 6, 6, 6, 6, 5, 1, 6], [6, 1, 6, 6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 6, 6, 1], [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6], [1, 6, 6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 6, 6, 3, 6], [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 2, 6, 6], [6, 6, 6, 6, 6, 6, 1, 6, 6, 6, 6], [4, 6, 5, 6, 6, 6, 6, 6, 1]]
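The try/except lookup can also be written with dict.get, which returns a default value when the key is missing. This is a sketch of an equivalent alternative. It uses only the first two preprocessed sentences to keep the example short.

```python
word_to_index = {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5, 'OOV': 6}

# First two preprocessed sentences from the example.
sample_sentences = [
    ['every', 'morning', 'wake', 'early', 'watch', 'sunrise'],
    ['enjoy', 'calm', 'stillness', 'dawn'],
]

oov_index = word_to_index['OOV']

# dict.get falls back to the OOV index for words outside the vocabulary.
encoded = [
    [word_to_index.get(word, oov_index) for word in sentence]
    for sentence in sample_sentences
]
print(encoded)  # [[6, 6, 6, 6, 2, 6], [3, 4, 6, 6]]
```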

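Integer Encoding with Counter

Python's Counter class (from the collections module) makes it easy to count word frequencies. Pass all the words to a Counter to compute their frequencies, then take only the n most frequent words and encode them as integers.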

from collections import Counter

print(preprocessed_sentences)

> [['every', 'morning', 'wake', 'early', 'watch', 'sunrise'], ['enjoy', 'calm', 'stillness', 'dawn'], ['air', 'fresh', 'quiet', 'energy', 'helps', 'prepare', 'day', 'ahead'], ['start', 'day', 'warm', 'cup', 'coffee', 'savoring', 'rich', 'aroma', 'plan', 'tasks'], ['sometimes', 'read', 'pages', 'book', 'write', 'journal', 'setting', 'intentions', 'day'], ['breakfast', 'take', 'walk', 'around', 'neighborhood', 'greeting', 'familiar', 'faces', 'observing', 'subtle', 'changes', 'scenery'], ['sound', 'birds', 'chirping', 'sight', 'dew', 'grass', 'bring', 'sense', 'peace'], ['walk', 'refreshes', 'mind', 'energizes'], ['day', 'progresses', 'focus', 'work', 'handle', 'various', 'tasks', 'require', 'attention'], ['moments', 'challenge', 'focus', 'balanced', 'breaks', 'step', 'outside', 'stretch', 'enjoy', 'snack'], ['afternoon', 'like', 'check', 'to-do', 'list', 'reflecting', 'accomplished', 'far', 'adjusting', 'goals', 'necessary'], ['evening', 'arrives', 'wind', 'relaxing', 'activities'], ['might', 'cook', 'simple', 'meal', 'listen', 'music', 'watch', 'favorite', 'show'], ['going', 'bed', 'take', 'moment', 'express', 'gratitude', 'day', 'set', 'positive', 'thoughts', 'next'], ['calm', 'mind', 'prepare', 'sleep', 'looking', 'forward', 'another', 'new', 'day']]

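1. Flatten the 2D list into a 1D list

Counter counts the hashable items of the iterable it receives. The inner lists of a nested list are not hashable, so the sentences must first be flattened into a single list of words.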

words = [word for sentence in preprocessed_sentences for word in sentence]
print(words)

> ['every', 'morning', 'wake', 'early', 'watch', 'sunrise', 'enjoy', 'calm', 'stillness', 'dawn', 'air', 'fresh', 'quiet', 'energy', 'helps', 'prepare', 'day', 'ahead', 'start', 'day', 'warm', 'cup', 'coffee', 'savoring', 'rich', 'aroma', 'plan', 'tasks', 'sometimes', 'read', 'pages', 'book', 'write', 'journal', 'setting', 'intentions', 'day', 'breakfast', 'take', 'walk', 'around', 'neighborhood', 'greeting', 'familiar', 'faces', 'observing', 'subtle', 'changes', 'scenery', 'sound', 'birds', 'chirping', 'sight', 'dew', 'grass', 'bring', 'sense', 'peace', 'walk', 'refreshes', 'mind', 'energizes', 'day', 'progresses', 'focus', 'work', 'handle', 'various', 'tasks', 'require', 'attention', 'moments', 'challenge', 'focus', 'balanced', 'breaks', 'step', 'outside', 'stretch', 'enjoy', 'snack', 'afternoon', 'like', 'check', 'to-do', 'list', 'reflecting', 'accomplished', 'far', 'adjusting', 'goals', 'necessary', 'evening', 'arrives', 'wind', 'relaxing', 'activities', 'might', 'cook', 'simple', 'meal', 'listen', 'music', 'watch', 'favorite', 'show', 'going', 'bed', 'take', 'moment', 'express', 'gratitude', 'day', 'set', 'positive', 'thoughts', 'next', 'calm', 'mind', 'prepare', 'sleep', 'looking', 'forward', 'another', 'new', 'day']
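To see why flattening is needed, and two other ways to do it, here is a short sketch on a made-up nested list. itertools.chain.from_iterable and Counter.update are standard-library alternatives to the list comprehension above. The post itself does not use them.

```python
from collections import Counter
from itertools import chain

# Hypothetical nested token list for illustration.
nested = [['day', 'walk'], ['day', 'mind', 'walk'], ['day']]

# Passing the nested list directly fails because lists are unhashable.
try:
    Counter(nested)
except TypeError as err:
    print("TypeError:", err)

# Alternative 1: flatten with itertools.chain.from_iterable.
flat = list(chain.from_iterable(nested))
counts = Counter(flat)
print(counts.most_common())  # [('day', 3), ('walk', 2), ('mind', 1)]

# Alternative 2: skip the flat list and update the Counter per sentence.
counts_by_update = Counter()
for sentence in nested:
    counts_by_update.update(sentence)
print(counts_by_update == counts)  # True
```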

2. Use Counter to count the word frequencies.

vocab = Counter(words)
print(vocab)

> Counter({'day': 6, 'watch': 2, 'enjoy': 2, 'calm': 2, 'prepare': 2, 'tasks': 2, 'take': 2, 'walk': 2, 'mind': 2, 'focus': 2, 'every': 1, 'morning': 1, 'wake': 1, 'early': 1, 'sunrise': 1, 'stillness': 1, 'dawn': 1, 'air': 1, 'fresh': 1, 'quiet': 1, 'energy': 1, 'helps': 1, 'ahead': 1, 'start': 1, 'warm': 1, 'cup': 1, 'coffee': 1, 'savoring': 1, 'rich': 1, 'aroma': 1, 'plan': 1, 'sometimes': 1, 'read': 1, 'pages': 1, 'book': 1, 'write': 1, 'journal': 1, 'setting': 1, 'intentions': 1, 'breakfast': 1, 'around': 1, 'neighborhood': 1, 'greeting': 1, 'familiar': 1, 'faces': 1, 'observing': 1, 'subtle': 1, 'changes': 1, 'scenery': 1, 'sound': 1, 'birds': 1, 'chirping': 1, 'sight': 1, 'dew': 1, 'grass': 1, 'bring': 1, 'sense': 1, 'peace': 1, 'refreshes': 1, 'energizes': 1, 'progresses': 1, 'work': 1, 'handle': 1, 'various': 1, 'require': 1, 'attention': 1, 'moments': 1, 'challenge': 1, 'balanced': 1, 'breaks': 1, 'step': 1, 'outside': 1, 'stretch': 1, 'snack': 1, 'afternoon': 1, 'like': 1, 'check': 1, 'to-do': 1, 'list': 1, 'reflecting': 1, 'accomplished': 1, 'far': 1, 'adjusting': 1, 'goals': 1, 'necessary': 1, 'evening': 1, 'arrives': 1, 'wind': 1, 'relaxing': 1, 'activities': 1, 'might': 1, 'cook': 1, 'simple': 1, 'meal': 1, 'listen': 1, 'music': 1, 'favorite': 1, 'show': 1, 'going': 1, 'bed': 1, 'moment': 1, 'express': 1, 'gratitude': 1, 'set': 1, 'positive': 1, 'thoughts': 1, 'next': 1, 'sleep': 1, 'looking': 1, 'forward': 1, 'another': 1, 'new': 1})

3. Keep only the five most frequent words.

vocab_size = 5
vocab = vocab.most_common(vocab_size)  # keep only the 5 most frequent words
vocab

> [('day', 6), ('watch', 2), ('enjoy', 2), ('calm', 2), ('prepare', 2)]

4. Assign lower integer indices to more frequent words.

word_to_index = {word: index + 1 for index, (word, frequency) in enumerate(vocab)}

print(word_to_index)

> {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5}

Integer Encoding with NLTK's FreqDist

NLTK's FreqDist can also compute word frequencies. As with Counter, you can build the vocabulary by selecting the top n words and assigning integers to them.

1. Compute word frequencies with FreqDist.

from nltk import FreqDist

flat_words = [word for sentence in preprocessed_sentences for word in sentence]  # flatten the 2D list into 1D
freq_dist = FreqDist(flat_words)
freq_dist

> FreqDist({'day': 6, 'watch': 2, 'enjoy': 2, 'calm': 2, 'prepare': 2, 'tasks': 2, 'take': 2, 'walk': 2, 'mind': 2, 'focus': 2, ...})

# np.hstack gives the same counts (requires `import numpy as np`)
# freq_dist = FreqDist(np.hstack(preprocessed_sentences))
# freq_dist

2. Select the top five words and assign lower integer indices to more frequent words

vocab_size = 5
most_common_words = freq_dist.most_common(vocab_size)

word_to_index = {word: index + 1 for index, (word, frequency) in enumerate(most_common_words)}
print("Word to Index:", word_to_index)

> Word to Index: {'day': 1, 'watch': 2, 'enjoy': 3, 'calm': 4, 'prepare': 5}
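All three approaches end with the same mapping, so the steps can be wrapped in a pair of small helpers. The sketch below is one possible arrangement using Counter. The function names build_word_to_index and encode, the oov_token parameter, and the sample data are assumptions for illustration and do not come from NLTK or PyTorch.

```python
from collections import Counter


def build_word_to_index(tokenized_sentences, vocab_size, oov_token="OOV"):
    """Map the vocab_size most frequent words to 1..vocab_size, plus an OOV index."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    word_to_index = {
        word: i
        for i, (word, _) in enumerate(counts.most_common(vocab_size), start=1)
    }
    # The OOV token takes the next free index.
    word_to_index[oov_token] = len(word_to_index) + 1
    return word_to_index


def encode(tokenized_sentences, word_to_index, oov_token="OOV"):
    """Replace each word with its index, using the OOV index for unknown words."""
    oov_index = word_to_index[oov_token]
    return [
        [word_to_index.get(word, oov_index) for word in sentence]
        for sentence in tokenized_sentences
    ]


# Hypothetical sample data.
sample = [['day', 'walk', 'mind'], ['day', 'walk'], ['day', 'sleep']]
w2i = build_word_to_index(sample, vocab_size=2)
print(w2i)                  # {'day': 1, 'walk': 2, 'OOV': 3}
print(encode(sample, w2i))  # [[1, 2, 3], [1, 2], [1, 3]]
```

Calling build_word_to_index(preprocessed_sentences, vocab_size=5) on the data from this post should give the same five words as above, with OOV at index 6.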
