Import Data Feature

Does the Import Data feature “vectorize” S3 files that are in Parquet format? In other words, can we build automated pipelines that simply drop files into S3 and have them added to the desired namespace within an index?
We are scaling at the moment and need efficient ways of duplicating our indexes or re-running the entire doc-processing pipeline (> 1 GB of docs) after updates to chunking/embedding. Currently we are unable to do so and have to rely on local jobs (GitHub Actions workflows). What are some options we can use without impacting our production indexes?
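For context, here is a minimal sketch of the kind of job we currently run as a local step and would like to trigger automatically from an S3 drop. The bucket path, column names, embedding model, and the Pinecone-style client are all placeholders for illustration:

```python
# Minimal sketch of our current "local job": read pre-chunked docs from a
# Parquet file in S3, embed them, and upsert into one namespace of an index.
# Bucket, columns (chunk_id, doc_id, text), model, index, and namespace
# names are placeholders.
import os

import pandas as pd                      # plus s3fs for s3:// paths
from sentence_transformers import SentenceTransformer
from pinecone import Pinecone            # illustrative client; swap in your own

df = pd.read_parquet("s3://example-bucket/chunks/batch-001.parquet")

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(df["text"].tolist()).tolist()

vectors = [
    {
        "id": str(row.chunk_id),
        "values": emb,
        "metadata": {"doc_id": row.doc_id},
    }
    for row, emb in zip(df.itertuples(), embeddings)
]

index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("docs-staging")
index.upsert(vectors=vectors, namespace="example-namespace")
```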


Hi @kavita1 - thanks for reaching out and sorry for the delay getting back to you.

The short answer here is “no”; we don’t currently have this functionality. However, we’re starting to look at how we can better address these kinds of workflows, so your feedback is very much appreciated.

There are also a few third-party products, such as AWS Bedrock, as well as lower-level tools like Kafka and Spark, that could help here.
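For example, a re-embedding pass over a large corpus could run as a Spark job that reads chunked text from S3, embeds it, and writes fresh vectors back to S3 as Parquet, entirely outside your production index. This is only a rough sketch, with placeholder paths and an assumed sentence-transformers model:

```python
# Sketch of a Spark re-embedding job: read chunked text from S3, embed it with
# a pandas UDF, and write the new vectors back to S3 as Parquet. Paths and the
# embedding model are placeholders.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("reembed-corpus").getOrCreate()

chunks = spark.read.parquet("s3a://example-bucket/chunks/")

@pandas_udf(ArrayType(FloatType()))
def embed(texts: pd.Series) -> pd.Series:
    # Model is loaded per batch here for simplicity; cache it per executor
    # in a real job.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return pd.Series(model.encode(texts.tolist()).tolist())

chunks.withColumn("values", embed("text")) \
      .write.mode("overwrite") \
      .parquet("s3a://example-bucket/vectors-v2/")
```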

@perry Yes, thank you for sharing. I evaluated AWS Bedrock, and it isn’t there yet. For large organizations that already have a big-data stack built on Kafka and Spark, this may be relatively easy; however, AI-first startups that have built their own data pipelines, or that use Unstructured.io for RAG, and are now looking to scale are running into some of the traditional challenges. For example, I have to build observability to track how a doc is being vectorized, because our evals are now showing what needs to be tweaked. This is just one example, and we are evaluating other open-source options as well.
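To make the observability point concrete, the minimum we are after is per-chunk provenance attached to every vector, so an eval failure can be traced back to the chunker and embedding model that produced it. A rough sketch, with illustrative field names and versions:

```python
# Rough sketch: attach pipeline provenance to each chunk's metadata so eval
# failures can be traced back to a specific chunking/embedding configuration.
# Field names and versions are illustrative.
from datetime import datetime, timezone

def provenance(doc_id: str, chunk_index: int) -> dict:
    return {
        "doc_id": doc_id,
        "chunk_index": chunk_index,
        "chunker": "recursive-splitter@1.3",
        "embedding_model": "all-MiniLM-L6-v2",
        "pipeline_run": datetime.now(timezone.utc).isoformat(),
    }
```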

Thanks again for the feedback; it definitely makes sense. I’m making sure our product team hears this as well.

Airflow with S3 sensors may be helpful here as well; you can also incorporate Spark or Kafka in the same pipeline.
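Something along these lines, just as a sketch; the bucket, key pattern, connection id, and task body are placeholders:

```python
# Sketch of an Airflow DAG that waits for new Parquet drops in S3 and then
# runs the embedding/upsert step. Bucket, key pattern, and connection id
# are placeholders.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def ingest_parquet_drops():
    wait_for_files = S3KeySensor(
        task_id="wait_for_parquet",
        bucket_name="example-bucket",
        bucket_key="chunks/*.parquet",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    @task
    def embed_and_upsert():
        # call the existing embedding/upsert job here (a Spark job, a
        # container, or a plain Python step)
        ...

    wait_for_files >> embed_and_upsert()

ingest_parquet_drops()
```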
