Closed
Labels
P1 (issue that should be fixed within a few weeks), community-backlog, data (Ray Data-related issues), enhancement (request for new feature and/or capability), good-first-issue (great starter issue for someone just starting to contribute to Ray), performance, usability
Description
Daft's with_column API supports using an actor (a stateful, class-based UDF) as the UDF:
import daft
import torch
from daft import col

# Define the return type for embeddings
embedding_type = daft.DataType.embedding(daft.DataType.float32(), ENCODING_DIM)

@daft.udf(
    return_dtype=embedding_type,
    concurrency=NUM_GPU_NODES,
    num_gpus=1,
    batch_size=BATCH_SIZE,
)
class EncodingUDF:
    def __init__(self):
        # Load the model once per actor rather than once per batch
        from sentence_transformers import SentenceTransformer
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        self.model.compile()

    def __call__(self, text_col):
        # Encode one batch of the text column
        embeddings = self.model.encode(
            text_col.to_pylist(),
            batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE,
            convert_to_tensor=True,
            torch_dtype=torch.bfloat16,
        )
        return embeddings.cpu().numpy()

# Parentheses allow the method chain to span multiple lines
df = (
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .with_column("sentences", EncodingUDF(col("text")))
    .explode("sentences")
)

Use case
For simple inference, I can currently use map_batches instead, but it would be good if Ray Data's with_column API could also accept an actor (a stateful, class-based UDF).
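For reference, the map_batches workaround mentioned above might look roughly like the sketch below. It is a minimal illustration, not the requested with_column feature: `ToyEmbedder`, its `dim` parameter, the byte-sum "embedding", and the `embed_with_ray_data` helper are all hypothetical stand-ins for a real model such as the SentenceTransformer in the Daft example. The Ray-specific part assumes a recent Ray release in which `map_batches` accepts a callable class together with `concurrency` and `fn_constructor_kwargs`.

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical stateful UDF: setup in __init__, per-batch work in __call__."""

    def __init__(self, dim: int = 4):
        # Stand-in for expensive one-time setup (e.g. loading a model onto a GPU).
        self.dim = dim

    def _embed(self, text: str) -> np.ndarray:
        # Toy "embedding": sum the UTF-8 byte values into `dim` buckets.
        vec = np.zeros(self.dim, dtype=np.float32)
        for i, byte in enumerate(text.encode("utf-8")):
            vec[i % self.dim] += byte
        return vec

    def __call__(self, batch: dict) -> dict:
        # The batch is assumed to be a dict of column name -> NumPy array.
        batch["embedding"] = np.stack([self._embed(str(t)) for t in batch["text"]])
        return batch


def embed_with_ray_data(items, num_actors: int = 2):
    """Run ToyEmbedder on a pool of actors via map_batches (requires Ray)."""
    import ray  # imported lazily so the class above can be tried without Ray

    ds = ray.data.from_items(items)
    ds = ds.map_batches(
        ToyEmbedder,                       # pass the class, not an instance
        fn_constructor_kwargs={"dim": 4},  # forwarded to ToyEmbedder.__init__
        concurrency=num_actors,            # fixed-size actor pool
        batch_size=2,
        # num_gpus=1,                      # assumption: one GPU per actor for a real model
    )
    return ds.take_all()


# The class can be exercised locally on a single batch, without Ray:
local_out = ToyEmbedder(dim=4)({"text": np.array(["ab", "abcde"])})
```

With Ray installed, `embed_with_ray_data([{"text": "hello"}, {"text": "world"}])` should return the rows with an added `embedding` column. The request in this issue is for with_column to accept the same kind of class directly, so that a single column can be added without rewriting the whole batch.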