Error on AWS Lambda when using pinecone-text BM25Encoder

benb · July 23, 2023, 12:39pm

I have code running in AWS Lambda, but when I try to import BM25Encoder from pinecone-text there, I get this error in CloudWatch:

[ERROR] OSError: [Errno 30] Read-only file system: '/home/sbx_user1051'
Traceback (most recent call last):
  File "/var/lang/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/var/task/receiver.py", line 23, in <module>
    from pinecone_text.sparse import BM25Encoder
  File "/var/task/pinecone_text/sparse/__init__.py", line 30, in <module>
    from pinecone_text.sparse.bm25_encoder import BM25Encoder
  File "/var/task/pinecone_text/sparse/bm25_encoder.py", line 13, in <module>
    from pinecone_text.sparse.bm25_tokenizer import BM25Tokenizer
  File "/var/task/pinecone_text/sparse/bm25_tokenizer.py", line 11, in <module>
    nltk.download("punkt")
  File "/var/task/nltk/downloader.py", line 777, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/var/task/nltk/downloader.py", line 642, in incr_download
    yield from self._download_package(info, download_dir, force)
  File "/var/task/nltk/downloader.py", line 699, in _download_package
    os.makedirs(download_dir)
  File "/var/lang/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/var/lang/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)

I can’t change the default download path used by BM25Encoder because the download happens in the pinecone-text package’s internal code, and I don’t want to use the temp directory and download the data every time someone executes the Lambda.

Amnon · July 23, 2023, 3:13pm

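Hi, it looks like the failure occurs when nltk.download("punkt") is called. According to the traceback, that call runs when pinecone_text/sparse/bm25_tokenizer.py is imported, and it tries to create a download directory under the home directory, which is read-only in Lambda. There are several solutions you might consider: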

1. Download this module in the Dockerfile, as described here.
2. As you mentioned, it is possible to download the module to the “temp” dir, at the cost of extra latency on each Lambda call. Depending on your use case, that might work well enough. You can find a code example here.
3. The package is open source, so you can clone the repo and change the word-splitting function to use something else, such as re or a simple split() call. However, this is likely to produce lower-quality results.
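
For the temp-dir option, a minimal sketch (not an official pinecone-text recipe) might look like the following. It assumes that NLTK reads the NLTK_DATA environment variable when it is first imported and that nltk.download() picks the first existing, writable directory on its data path; please verify both against your NLTK version. The /tmp/nltk_data location and the helper names prepare_nltk_dir and load_bm25_encoder are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical location: /tmp is the writable scratch directory in Lambda.
NLTK_DIR = "/tmp/nltk_data"


def prepare_nltk_dir(path: str = NLTK_DIR) -> str:
    """Create a writable data directory and point NLTK at it via NLTK_DATA."""
    os.makedirs(path, exist_ok=True)
    # Must be set before nltk is imported for the first time (assumption:
    # NLTK builds its data search path from NLTK_DATA at import time).
    os.environ["NLTK_DATA"] = path
    return path


def load_bm25_encoder():
    """Import BM25Encoder only after the writable directory is in place."""
    prepare_nltk_dir()
    # This import triggers nltk.download("punkt"), as in the traceback above.
    from pinecone_text.sparse import BM25Encoder
    return BM25Encoder


def handler(event, context):
    # Repeated imports are served from sys.modules, so the import-time
    # download code runs once per execution environment, not per request.
    encoder_cls = load_bm25_encoder()
    return {"encoder": encoder_cls.__name__}
```

Because Lambda can reuse an execution environment for later invocations, the download in this sketch should mostly affect cold starts rather than every request, but it is worth measuring in your own setup.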

Please let me know if one of these solutions works for you,
Amnon