[MRG] Add experimental.ColumnTransformer #9012
Conversation
feel free to squash my commits
force-pushed from 2111a57 to 30bea38
force-pushed from 30bea38 to 2333e61
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),
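For orientation, a minimal sketch of the kind of pipeline this snippet comes from, assuming the (name, transformer, column) tuple specification and the sklearn.compose import path where the class ultimately landed (the PR title still says experimental); the column names and estimators are illustrative, not the exact example from the diff:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Each entry is (name, transformer, column): the transformer is fitted on the
# selected column only, and the outputs are concatenated side by side.
preprocess = ColumnTransformer([
    ('subject_tfidf', TfidfVectorizer(), 'subject'),
    ('body_svd', Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('best', TruncatedSVD(n_components=50)),
    ]), 'body'),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression()),
])
```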
vene (Member) commented on Jun 6, 2017
Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say `column_name='body'` somehow)
vene (Member) commented on Jun 6, 2017
While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?
Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be
vene (Member) commented on Jun 6, 2017
they=?
I'd remove the whole sentence and just say ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."
.. note::
   :class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
   For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
vene (Member) commented on Jun 6, 2017
Do you mean "will give all feature values for the selected sample, e.g. (X_array[1].shape == (n_features,))"?
jorisvandenbossche (Author) commented on Jun 6, 2017
Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).
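As the review comment above points out, the numpy case should read ``(n_features,)``; a small illustration of the two indexing conventions the note contrasts, with made-up data:

```python
import numpy as np
import pandas as pd

X_array = np.array([[0.0, 1.0, 2.0],
                    [3.0, 4.0, 5.0]])
# Integer indexing on a 2-D numpy array selects a row: one sample, all features.
print(X_array[1].shape)    # (3,) == (n_features,)

X_columns = pd.DataFrame(X_array, columns=[0, 1, 2])
# Item access on a DataFrame (or a dict of arrays) selects a column:
# one feature, all samples.
print(X_columns[1].shape)  # (2,) == (n_samples,)
```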
.. note::
   :class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
   For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
   For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
vene (Member) commented on Jun 6, 2017
-> a pandas DataFrame
Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.
So the current way to specify a transformer is like this:
(where ...) There was some discussion and back-and-forth about this in the original PR, and other options mentioned are (as far as I read it correctly):
or
BTW, when using dicts, I would actually find this interface more logical:
which switches the place of column and transformer name, giving you a (name, trans) tuple similar to the Pipeline interface, and uses the dict key to select the column (which mimics, via getitem, how the values are selected from the input data).
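The code blocks from this comment were lost in the page capture. Purely as a hypothetical illustration of the two dict layouts being compared (not the options actually listed in the PR, and not the API that was merged):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Layout keyed by transformer name; the value carries the column to select.
spec_by_name = {
    'body_tfidf': ('body', TfidfVectorizer()),
    'scale': (['height', 'weight'], StandardScaler()),
}

# Layout proposed in the comment: keyed by the column to select (mimicking
# __getitem__ on the input data), with a (name, transformer) tuple that
# mirrors the Pipeline steps interface.
spec_by_column = {
    'body': ('body_tfidf', TfidfVectorizer()),
    'height': ('scale', StandardScaler()),
}
```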
The column is a numpy array, right, so it's not hashable. I think we could use either the list or dict thing here, and have a helper
Oh, I didn't fully read your comment. I think multiple columns are essential, so we can't do that...
Maybe I'm a little bit confused, but then: does this overlap in scope with FeatureUnion? If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?
Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.
transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.
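A quick, self-contained illustration of the behaviour described in this paragraph (the estimators and shapes here are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

X = np.random.RandomState(0).rand(10, 5)

# Both transformers see the full input; their outputs are stacked side by side.
union = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('svd', TruncatedSVD(n_components=3)),
])
print(union.fit_transform(X).shape)  # (10, 2 + 3) == (10, 5)
```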
vene (Member) commented on Jun 7, 2017
Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.
Additional problem that we encountered: in the meantime (since the original PR was made), transformers need 2D ... But, by ensuring this, the example using a ... So possible options:
Another option would be:
+1 with the policy that transformers should return 2D arrays.
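For readers following along, a small sketch of the 1D-versus-2D distinction this policy concerns; the column names are made up:

```python
import pandas as pd

df = pd.DataFrame({'height': [1.70, 1.62], 'weight': [65.0, 58.0]})

# Selecting with a scalar key returns a 1-D Series ...
print(df['height'].shape)    # (2,)

# ... while selecting with a list of keys returns a 2-D DataFrame, which is
# the shape downstream scikit-learn estimators expect for X.
print(df[['height']].shape)  # (2, 1)
```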
Those changes look good
Some things I noticed
@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.

.. _column_transformer:
jnothman (Member) commented on May 27, 2018
This should be in compose.rst, but perhaps noted at the top of this file
jorisvandenbossche (Author) commented on May 28, 2018
Yes, I know, but also (related to what I mentioned here: #9012 (comment)):
- when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
- we should reference this in preprocessing.rst
- we should add a better 'typical data science use case' example for the example gallery
- I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.
I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do a minimal version here and simply move the current docs addition to compose.rst without any of the other mentioned improvements.)
feature_names : list of strings
    Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')
jorisvandenbossche (Author) commented on May 28, 2018
Ideally, yes. I am only not fully sure what to do here currently, given that get_feature_names is in general not really well supported.
I think ideally I would add the names of the passed-through columns to feature_names, but then use the actual string column names in case of pandas DataFrames. And in case of numpy arrays, return the indices into that array as strings (['0', '1', ..])?
I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (but only a removal of the error).
jnothman (Member) commented on May 28, 2018
Can just raise NotImplementedError in case of remainder != 'drop' for now... Or you can tack the remainder transformer onto the end of _iter.
I agree get_feature_names is not quite the right design.
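A rough sketch of the guard being suggested, using a hypothetical helper name and assuming a fitted transformers_ list of (name, transformer, columns) triples; this is not the code that was actually committed:

```python
def column_transformer_feature_names(ct):
    """Hypothetical helper: feature names for a fitted ColumnTransformer-like
    object `ct` exposing `remainder` and a fitted `transformers_` list of
    (name, transformer, columns) triples."""
    if ct.remainder != 'drop':
        # Passed-through columns have no obvious names yet, so fail loudly
        # rather than silently returning an incomplete list.
        raise NotImplementedError(
            "get_feature_names is not supported when remainder != 'drop'")
    names = []
    for name, trans, _ in ct.transformers_:
        names.extend('%s__%s' % (name, f) for f in trans.get_feature_names())
    return names
```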
Yes, just move it. The rest can happen after release or at the hands of other contributors.
Last update: added an additional error for get_feature_names and moved the docs to compose.rst. As far as I am concerned, somebody can push the green button ;-)
Indeed: let's see how this flies!
Merged 0b6308c into scikit-learn:master
Thanks Joris for some great work on finally making this happen!
Woohoo, thanks for merging!
Great work, congrats!
Yes, I am jumping up and down with excitement.
Really nice!!! Let's stack those columns then ;)
eyadsibai commented on May 29, 2018
Looking forward to the next release.
armgilles commented on May 31, 2018
The next release will be amazing!
OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche
Thank you @jorisvandenbossche!! Great stuff. Is there going to be an effort (I would like to contribute) to implement ...? I find that one of the big advantages that ... What do you guys think? Thank you in advance.
@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main use case will be with OneHotEncoder/CategoricalEncoder, which will provide a ...
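For context, the kind of workflow being referred to here; this sketch assumes OneHotEncoder.get_feature_names from scikit-learn 0.20-era releases (later renamed get_feature_names_out) and uses made-up column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'city': ['Paris', 'London', 'Paris'],
                   'age': [30, 25, 41]})

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['city']),
    ('scale', StandardScaler(), ['age']),
])
ct.fit(df)

# Names of the dummy columns generated for 'city', e.g. ['x0_London', 'x0_Paris'].
print(ct.named_transformers_['onehot'].get_feature_names())
```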
@amueller thank you for the links. For instance, if we want to use ...
You might want to look at the eli5 library, where more feature names have been implemented in a single-dispatch framework that is easily extended.
Also: #5523 suggests pandas in/out. I think it might be a good idea to have something like df_out on transformers, but this is a harder change to make.
jorisvandenbossche commented on Jun 6, 2017 · edited by amueller
Continuation of @amueller's PR #3886 (for now just rebased and updated for changes in sklearn)
Fixes #2034.
Closes #2034, closes #3886, closes #8540, closes #8539