[Ray Data] with_column API support Actor #56529

@codingl2k1

Description

Daft's with_column API supports using an actor as a UDF:

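import daft
import torch
from daft import col

# ENCODING_DIM, NUM_GPU_NODES, BATCH_SIZE, EMBEDDING_MODEL_NAME and
# SENTENCE_TRANSFORMER_BATCH_SIZE are assumed to be defined elsewhere.

# Define the return type for embeddings
embedding_type = daft.DataType.embedding(daft.DataType.float32(), ENCODING_DIM)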

@daft.udf(
    return_dtype=embedding_type,
    concurrency=NUM_GPU_NODES,
    num_gpus=1,
    batch_size=BATCH_SIZE
)
class EncodingUDF:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        self.model.compile()

    def __call__(self, text_col):
        embeddings = self.model.encode(
            text_col.to_pylist(),
            batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE,
            convert_to_tensor=True,
            torch_dtype=torch.bfloat16,
        )
        return embeddings.cpu().numpy()


# Parentheses are needed so the method chain can span multiple lines
df = (
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .with_column("sentences", EncodingUDF(col("text")))
    .explode("sentences")
)

Use case

For simple inference, I can currently use map_batches instead, but it would be good if Ray Data's with_column API could support Actors.
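
For reference, the map_batches workaround mentioned above might look roughly like the sketch below. It is only an illustration: LengthEncoder, its dim parameter, and the fake "embedding" it produces are hypothetical stand-ins for a real model-backed encoder, and run_with_ray assumes a local Ray installation.

```python
import numpy as np


class LengthEncoder:
    """Hypothetical stand-in for a model-backed encoder (not part of Ray)."""

    def __init__(self, dim: int = 4):
        # Expensive one-time setup (e.g. loading a model) would go here;
        # with an actor pool it runs once per actor, not once per batch.
        self.dim = dim

    def __call__(self, batch: dict) -> dict:
        # Ray Data passes batches as dict[str, np.ndarray] by default.
        lengths = np.array([len(t) for t in batch["text"]], dtype=np.float32)
        # Fake "embedding": the text length repeated `dim` times per row.
        batch["embedding"] = np.repeat(lengths[:, None], self.dim, axis=1)
        return batch


def run_with_ray():
    # Imported lazily so the class above can be exercised without Ray.
    import ray

    ds = ray.data.from_items([{"text": "hello"}, {"text": "ray data"}])
    # Passing a class together with `concurrency` runs it in an actor pool.
    ds = ds.map_batches(
        LengthEncoder,
        fn_constructor_kwargs={"dim": 4},
        concurrency=2,
        batch_size=2,
    )
    return ds.take_all()
```

This works, but the whole batch has to be passed through and returned by hand, whereas a with_column-style call would only need to name the input column and the output column.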

Metadata

Labels

P1: Issue that should be fixed within a few weeks
community-backlog
data: Ray Data-related issues
enhancement: Request for new feature and/or capability
good-first-issue: Great starter issue for someone just starting to contribute to Ray
performance
usability
