Optimizing ML pipeline #11409

jankogasic · 2025-04-16T14:42:01Z

Hi,

My pipeline setup:

1. Query the remote Spark servers.
2. Save the data to HDFS in libsvm format.
3. Fetch the libsvm data into host memory with pandas.sql() (really slow).
4. Use sklearn.datasets.load_svmlight_file() to get a CSR matrix (really slow).
5. Use xgb.DMatrix(CSR_matrix) to finally start training.

Is there a way to make this process faster? To skip some steps?

Potential ideas:

I would try to avoid pandas in step 3 by using webhdfs, and to avoid sklearn in step 4 by loading the libsvm file directly into xgb.DMatrix().

Is there a way to load data from Spark memory or HDFS directly into an xgb.DMatrix() object? That way I could avoid the host's disk, which might improve pipeline speed.

Answered by trivialfis

trivialfis (Maintainer) · 2025-04-16T15:05:35Z

Not a Spark expert. But here are a few things that might be useful:

Don't use text-based formats like libsvm or CSV. Use a binary format such as Parquet or the NumPy format instead.
Use QuantileDMatrix instead of DMatrix for dtrain (the training dataset) when using the hist tree method (the default).
Don't use DMatrix to load files. Load Parquet with pandas, or NumPy files with numpy, then pass the result to XGBoost.
If your machine has enough storage to hold the data during training, fetch it from the remote server once and reuse it.

trivialfis (Maintainer) · 2025-04-16T19:23:29Z

In addition to the above comment, you can consider using the (py)spark interface of XGBoost. However, getting rid of the text-based inputs should be the first step before trying anything else.
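
For reference, a rough sketch of what the PySpark interface could look like, assuming XGBoost is installed with PySpark support and the training data is a Spark DataFrame with numeric feature columns. The helper names, the `label` column, the worker count, and the choice of a regressor are illustrative assumptions; `SparkXGBClassifier` follows the same pattern.

```python
def build_spark_estimator(feature_cols, label_col="label", num_workers=2):
    """Create an XGBoost estimator that trains on a Spark DataFrame."""
    # Imported lazily because xgboost.spark requires pyspark to be installed.
    from xgboost.spark import SparkXGBRegressor

    return SparkXGBRegressor(
        # A list of column names means each feature column must be numeric.
        features_col=list(feature_cols),
        label_col=label_col,
        # Number of Spark tasks used for distributed training (assumed value).
        num_workers=num_workers,
    )


def train_on_spark(train_df, feature_cols, label_col="label", num_workers=2):
    """Fit the estimator directly on a Spark DataFrame."""
    estimator = build_spark_estimator(feature_cols, label_col, num_workers)
    return estimator.fit(train_df)
```

With this approach the data stays in Spark, so the libsvm export and the pandas and sklearn loading steps are not needed. The fitted model's `transform()` method adds a prediction column to a Spark DataFrame.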
