Optimizing ML pipeline #11409
-
Hi,

My pipeline setup:

Is there a way to make this process faster? To skip some steps?

Potential ideas:

- I would try to avoid pandas in step 3 and use …
- Is there a way to load data from Spark memory or HDFS directly to …?
Replies: 2 comments
-
Not a Spark expert. But here are a few things that might be useful:

- Use `QuantileDMatrix` for the `dtrain` (training dataset) instead of `DMatrix` when using the `hist` tree method (the default).
- Instead of using `DMatrix` to load files, use pandas to load parquet or numpy to load its data, then pass them to XGBoost (see the sketch below).
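For illustration, a minimal sketch of both suggestions, assuming a local parquet file `train.parquet` with a `label` column (both names are placeholders, not from the original post):

```python
import pandas as pd
import xgboost as xgb

# Load the parquet file with pandas instead of pointing DMatrix at a text file.
df = pd.read_parquet("train.parquet")   # placeholder path
X = df.drop(columns=["label"])          # placeholder label column
y = df["label"]

# QuantileDMatrix stores only the quantised representation used by the
# hist tree method, so building it needs less memory than a regular DMatrix.
dtrain = xgb.QuantileDMatrix(X, label=y)

booster = xgb.train(
    {"tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
```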
-
In addition to the above comment, you can consider using the (py)spark interface of XGBoost. However, getting rid of the text-based inputs should be the first step before trying anything else.
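For reference, a hedged sketch of what training through the `xgboost.spark` estimator interface (available since XGBoost 1.7) can look like; the toy DataFrame and column names below are assumptions, not taken from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the output of the existing Spark preprocessing steps.
train_df = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 0.5, 20.0), (3.0, 1.5, 30.0)],
    ["f0", "f1", "label"],
)
train_df = VectorAssembler(
    inputCols=["f0", "f1"], outputCol="features"
).transform(train_df)

regressor = SparkXGBRegressor(
    features_col="features",
    label_col="label",
    num_workers=2,          # Spark tasks used for distributed training
    tree_method="hist",
)
model = regressor.fit(train_df)
predictions = model.transform(train_df)
```

This keeps the data in Spark memory end to end, so no intermediate text files are needed.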