While studying text preprocessing, I hit an error when I tried to run the tokenizer locally.
from nltk.tokenize import sent_tokenize
sent_text = sent_tokenize(content_text)
{
"name": "LookupError",
"message": "
**********************************************************************
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt_tab')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/Users/song/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/share/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
",
"stack": "---------------------------------------------------------------------------
LookupError Traceback (most recent call last)
Cell In[8], line 2
1 # Sentence-tokenize the input corpus with NLTK.
----> 2 sent_text = sent_tokenize(content_text)
4 # # Remove punctuation from each sentence and convert to lowercase.
5 # normalized_text = []
6 # for string in sent_text:
(...)
10 # # Word-tokenize each sentence with NLTK.
11 # result = [word_tokenize(sentence) for sentence in normalized_text]
File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/__init__.py:119, in sent_tokenize(text, language)
109 def sent_tokenize(text, language="english"):
110 """
111 Return a sentence-tokenized copy of *text*,
112 using NLTK's recommended sentence tokenizer
(...)
117 :param language: the model name in the Punkt corpus
118 """
--> 119 tokenizer = _get_punkt_tokenizer(language)
120 return tokenizer.tokenize(text)
File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/__init__.py:105, in _get_punkt_tokenizer(language)
96 @functools.lru_cache
97 def _get_punkt_tokenizer(language="english"):
98 """
99 A constructor for the PunktTokenizer that utilizes
100 a lru cache for performance.
(...)
103 :type language: str
104 """
--> 105 return PunktTokenizer(language)
File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/punkt.py:1744, in PunktTokenizer.__init__(self, lang)
1742 def __init__(self, lang="english"):
1743 PunktSentenceTokenizer.__init__(self)
-> 1744 self.load_lang(lang)
File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/tokenize/punkt.py:1749, in PunktTokenizer.load_lang(self, lang)
1746 def load_lang(self, lang="english"):
1747 from nltk.data import find
-> 1749 lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
1750 self._params = load_punkt_params(lang_dir)
1751 self._lang = lang
File ~/opt/anaconda3/envs/song38/lib/python3.8/site-packages/nltk/data.py:579, in find(resource_name, paths)
577 sep = \"*\" * 70
578 resource_not_found = f\"\
{sep}\
{msg}\
{sep}\
\"
--> 579 raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt_tab')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt_tab/english/
Searched in:
- '/Users/song/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/share/nltk_data'
- '/Users/song/opt/anaconda3/envs/song38/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
"
}
Looking at the error message, the problem is that the nltk library cannot find the punkt_tab resource.
punkt_tab is the data that NLTK's Punkt sentence tokenizer loads (recent NLTK versions load punkt_tab instead of the older pickled punkt), so the missing data has to be installed with nltk.download().
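For reference, the directories in the "Searched in:" list above come from nltk.data.path. A quick way to inspect them, and, if needed, point NLTK at a non-default folder (the path below is just an example):
import nltk
# The directories NLTK searches for data, in order; matches the "Searched in:" list above
print(nltk.data.path)
# To install into a custom location instead (example path; adjust to your setup):
# nltk.download('punkt_tab', download_dir='/Users/song/my_nltk_data')
# nltk.data.path.append('/Users/song/my_nltk_data')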
Solution
import nltk
# Download the punkt_tab data (the resource named in the error message)
nltk.download('punkt_tab')
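After the download completes, sent_tokenize runs without the LookupError. A quick sanity check (the sample text here is just illustrative):
from nltk.tokenize import sent_tokenize
sample = "NLTK is a Python library. Tokenizer models are shipped as downloadable data."
print(sent_tokenize(sample))
# ['NLTK is a Python library.', 'Tokenizer models are shipped as downloadable data.']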