Optimizing ML pipeline #11409

Answered by trivialfis
jankogasic asked this question in Q&A

Not a Spark expert. But here are a few things that might be useful:

  • Don't use text-based formats like LIBSVM or CSV. Use a binary format such as Parquet or NumPy's .npy instead.
  • Use QuantileDMatrix instead of DMatrix for dtrain (the training dataset) when using the hist tree method (the default).
  • Don't use DMatrix to load files. Load Parquet with pandas, or NumPy files with numpy, then pass the in-memory data to XGBoost.
  • If your machine has enough storage to store the data during training, you can fetch it from the remote server once and reuse it.

Answer selected by jankogasic