
[Error] Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource:

독립성이 강한 ISFP 2024. 11. 22. 12:24

While studying text preprocessing, I ran into an error when trying to run the tokenizer locally.

sent_text = sent_tokenize(content_text)

{
	"name": "LookupError",
	"message": "
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/Users/song/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/share/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
",
	"stack": "---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Cell In[8], line 2
      1 # Perform sentence tokenization on the input corpus with NLTK.
----> 2 sent_text = sent_tokenize(content_text)
      4 # # Strip punctuation from each sentence and lowercase it.
      5 # normalized_text = []
      6 # for string in sent_text:
   (...)
     10 # # Perform word tokenization on each sentence with NLTK.
     11 # result = [word_tokenize(sentence) for sentence in normalized_text]

File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/__init__.py:119, in sent_tokenize(text, language)
    109 def sent_tokenize(text, language=\"english\"):
    110     \"\"\"
    111     Return a sentence-tokenized copy of *text*,
    112     using NLTK's recommended sentence tokenizer
   (...)
    117     :param language: the model name in the Punkt corpus
    118     \"\"\"
--> 119     tokenizer = _get_punkt_tokenizer(language)
    120     return tokenizer.tokenize(text)

File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/__init__.py:105, in _get_punkt_tokenizer(language)
     96 @functools.lru_cache
     97 def _get_punkt_tokenizer(language=\"english\"):
     98     \"\"\"
     99     A constructor for the PunktTokenizer that utilizes
    100     a lru cache for performance.
   (...)
    103     :type language: str
    104     \"\"\"
--> 105     return PunktTokenizer(language)

File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/punkt.py:1744, in PunktTokenizer.__init__(self, lang)
   1742 def __init__(self, lang=\"english\"):
   1743     PunktSentenceTokenizer.__init__(self)
-> 1744     self.load_lang(lang)

File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/punkt.py:1749, in PunktTokenizer.load_lang(self, lang)
   1746 def load_lang(self, lang=\"english\"):
   1747     from nltk.data import find
-> 1749     lang_dir = find(f\"tokenizers/punkt_tab/{lang}/\")
   1750     self._params = load_punkt_params(lang_dir)
   1751     self._lang = lang

File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/data.py:579, in find(resource_name, paths)
    577 sep = \"*\" * 70
    578 resource_not_found = f\"\
{sep}\
{msg}\
{sep}\
\"
--> 579 raise LookupError(resource_not_found)

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/Users/song/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/share/nltk_data'
    - '/Users/song/opt/anaconda3/envs/song38/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
"
}

 

The error message shows that the problem is the nltk library failing to find the punkt_tab resource.
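The "Searched in:" list in the traceback is NLTK's data search path. You can inspect it, or prepend a directory of your own so NLTK looks there first (the `/tmp/my_nltk_data` path below is just an example; adjust it to your machine):

```python
import nltk

# NLTK searches these directories, in order, for resources like punkt_tab.
for p in nltk.data.path:
    print(p)

# Prepending a directory of your own makes NLTK check it first.
# (Example path; you can then download resources into it with
# nltk.download("punkt_tab", download_dir="/tmp/my_nltk_data").)
nltk.data.path.insert(0, "/tmp/my_nltk_data")
```

This is handy on servers where the default per-user `~/nltk_data` directory is not writable.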

punkt_tab is the data NLTK's sentence tokenizer needs (newer NLTK releases load punkt_tab in place of the older punkt resource), so it has to be installed with the nltk.download() function.

 

Solution

import nltk

# Download the punkt_tab data, exactly as the error message instructs
nltk.download('punkt_tab')

 

 
