Memory Issues with Sparse Vectors in XGBoost4j-Spark: Disabling Sparse-to-Dense Conversion #11467
Replies: 4 comments 2 replies
-
cc @wbo4958
-
Hey @trivialfis and @wbo4958, the lack of proper handling for sparse vectors is blocking me here. Can we expect this issue to be fixed in the near future, or should I consider giving up on distributed training for now?
-
Before 3.0, the XGBoost JVM packages accepted sparse vectors, but that could result in an inaccurate model because XGBoost itself lacks proper sparse-vector support. So starting from 3.0, we no longer use sparse vectors until sparse vectors are supported in XGBoost.
-
There's a mismatch between a traditional sparse matrix and decision trees: for decision trees, 0 is a valid value, not a missing one.
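To make that distinction concrete, here is a minimal sketch (assuming only Spark's `org.apache.spark.ml.linalg` on the classpath, with a made-up 5-dimensional vector) of why an implicit zero in a `SparseVector` is not the same thing as a missing value to a tree booster:

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vectors}

object SparseVsMissing {
  def main(args: Array[String]): Unit = {
    // 5-dimensional vector with non-zero entries only at indices 1 and 3.
    val sv = Vectors.sparse(5, Array(1, 3), Array(2.0, 7.0)).asInstanceOf[SparseVector]

    // Materialised, the same point is (0.0, 2.0, 0.0, 7.0, 0.0): the omitted
    // entries are real zeros, not absent measurements.
    println(Vectors.dense(sv.toArray))

    // A training path that only iterates the stored entries would see just
    // indices 1 and 3; the implicit zeros would be treated as *missing* and
    // routed down each tree node's default branch instead of being compared
    // against the split threshold as the value 0.0.
    println(sv.indices.mkString(", "))
  }
}
```

So silently treating the omitted entries as missing can change which branch every row takes, which is the accuracy problem mentioned above.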
-
Hi everyone,
I’m using sparse vectors with about 10 non-zero features out of a possible 50 million. However, the conversion to dense vectors is causing heap exhaustion. Is there a way to disable the sparse-to-dense conversion?
Right now, I can’t even train on a small batch of vectors without running into memory issues, but I ultimately need to train on 200 million rows.
Any help would be greatly appreciated. I’m using XGBoost4j-Spark version 3.0.0 with the Java API.
Thanks!
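For scale, a back-of-the-envelope sketch (plain Scala, using the 50-million-feature figure from the question and a hypothetical batch of 100 rows) of why densifying these vectors exhausts the heap:

```scala
object DensifyCost {
  def main(args: Array[String]): Unit = {
    val dim            = 50000000L   // feature-space size from the question
    val bytesPerDouble = 8L          // a dense double[] stores every slot
    val rowBytes       = dim * bytesPerDouble

    // One densified row already needs ~381 MB of contiguous heap,
    // even though only ~10 of its 50 million entries are non-zero.
    println(f"one dense row ≈ ${rowBytes / (1024.0 * 1024.0)}%.0f MB")

    // A hypothetical "small batch" of 100 rows is already ~37 GB,
    // which is why training fails long before reaching 200 million rows.
    val batch = 100L
    println(f"$batch rows ≈ ${batch * rowBytes / (1024.0 * 1024.0 * 1024.0)}%.1f GB")
  }
}
```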