[MRG] Add experimental.ColumnTransformer #9012
Conversation
feel free to squash my commits
force-pushed from 2111a57 to 30bea38
force-pushed from 30bea38 to 2333e61
('tfidf', TfidfVectorizer()),
('best', TruncatedSVD(n_components=50)),
])),
]), 'body'),
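For orientation, a minimal sketch of the kind of pipeline this snippet comes from, assuming the (name, transformer, column) tuple specification and the sklearn.compose import path where the class ultimately landed (the PR title still says experimental); the column names and estimators are illustrative, not the exact example from the diff:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Each entry is (name, transformer, column): the transformer is fitted on the
# selected column only, and the outputs are concatenated side by side.
preprocess = ColumnTransformer([
    ('subject_tfidf', TfidfVectorizer(), 'subject'),
    ('body_svd', Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('best', TruncatedSVD(n_components=50)),
    ]), 'body'),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression()),
])
```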
vene (Member) commented on Jun 6, 2017
Are we sold on the tuple-based API here? I'd like it if this were a bit more explicit... (I'd like it to say `column_name='body'` somehow)
vene (Member) commented on Jun 6, 2017
While we're at it, why is the outer structure a dict and not an ordered list of tuples like FeatureUnion?
Often it is easiest to preprocess data before applying scikit-learn methods, for example using
pandas.
If the preprocessing has parameters that you want to adjust within a
grid-search, however, they need to be inside a transformer. This can be
vene (Member) commented on Jun 6, 2017
they=?
I'd remove the whole sentence and just say ":class:`ColumnTransformer` is a convenient way to perform heterogeneous preprocessing on data columns within a pipeline."
.. note::
   :class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
   For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
vene (Member) commented on Jun 6, 2017
Do you mean "will give all feature values for the selected sample, e.g. (X_array[1].shape == (n_features,))"?
jorisvandenbossche (Author) commented on Jun 6, 2017
Yeah, I didn't update the rst docs yet, and I also saw there are still some errors like these
:class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
``1`` for each sample (``X_columns[1].shape == (n_samples,)``).
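As the review comment above points out, the numpy case should read ``(n_features,)``; a small illustration of the two indexing conventions the note contrasts, with made-up data:

```python
import numpy as np
import pandas as pd

X_array = np.array([[0.0, 1.0, 2.0],
                    [3.0, 4.0, 5.0]])
# Integer indexing on a 2-D numpy array selects a row: one sample, all features.
print(X_array[1].shape)    # (3,) == (n_features,)

X_columns = pd.DataFrame(X_array, columns=[0, 1, 2])
# Item access on a DataFrame (or a dict of arrays) selects a column:
# one feature, all samples.
print(X_columns[1].shape)  # (2,) == (n_samples,)
```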
.. note::
   :class:`ColumnTransformer` expects a very different data format from the numpy arrays usually used in scikit-learn.
   For a numpy array ``X_array``, ``X_array[1]`` will give a single sample (``X_array[1].shape == (n_samples.)``), but all features.
   For columnar data like a dict or pandas dataframe ``X_columns``, ``X_columns[1]`` is expected to give a feature called
vene (Member) commented on Jun 6, 2017
-> a pandas DataFrame
Also this chapter should probably have actual links to the pandas website or something, for readers who might have no idea what we're talking about.
So the current way to specify a transformer is like this:
(where ...) There was some discussion and back-and-forth about this in the original PR, and other options mentioned are (as far as I read it correctly):
or
BTW, when using dicts, I would actually find this interface more logical:
which switches the place of column and transformer name, giving you a (name, trans) tuple similar to the Pipeline interface, and uses the dict key to select the column (which mimics, via getitem, how the values are selected from the input data).
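The code blocks from this comment were lost in the page capture. Purely as a hypothetical illustration of the two dict layouts being compared (not the options actually listed in the PR, and not the API that was merged):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Layout keyed by transformer name; the value carries the column to select.
spec_by_name = {
    'body_tfidf': ('body', TfidfVectorizer()),
    'scale': (['height', 'weight'], StandardScaler()),
}

# Layout proposed in the comment: keyed by the column to select (mimicking
# __getitem__ on the input data), with a (name, transformer) tuple that
# mirrors the Pipeline steps interface.
spec_by_column = {
    'body': ('body_tfidf', TfidfVectorizer()),
    'height': ('scale', StandardScaler()),
}
```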
The column is a numpy array, right, so it's not hashable. I think we could use either the list or dict thing here, and have a helper
Oh, I didn't fully read your comment. I think multiple columns are essential, so we can't do that...
Maybe I'm a little bit confused, but then: does this overlap in scope with FeatureUnion? If there is more than one transformer, we need to know what order to use for column-stacking their output, right? So if we use a dict with transformers as keys, can we guarantee a consistent order?
Let's try to discuss this with @amueller before proceeding. I think I'll help on this today.
transformations to each field of the data, producing a homogeneous feature
matrix from a heterogeneous data source.
The transformers are applied in parallel, and the feature matrices they output
are concatenated side-by-side into a larger matrix.
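A quick, self-contained illustration of the behaviour described in this paragraph (the estimators and shapes here are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import FeatureUnion

X = np.random.RandomState(0).rand(10, 5)

# Both transformers see the full input; their outputs are stacked side by side.
union = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('svd', TruncatedSVD(n_components=3)),
])
print(union.fit_transform(X).shape)  # (10, 2 + 3) == (10, 5)
```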
vene (Member) commented on Jun 7, 2017
Since this PR adds ColumnTransformer, we can say here something like: for data organized in fields with heterogeneous types, see the related class :class:`ColumnTransformer`.
Additional problem that we encountered: in the meantime (since the original PR was made), transformers need 2D ... But, by ensuring this, the example using a ... So possible options:
Another option would be:
+1 with the policy that transformers should return 2D arrays.
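For readers following along, a small sketch of the 1D-versus-2D distinction this policy concerns; the column names are made up:

```python
import pandas as pd

df = pd.DataFrame({'height': [1.70, 1.62], 'weight': [65.0, 58.0]})

# Selecting with a scalar key returns a 1-D Series ...
print(df['height'].shape)    # (2,)

# ... while selecting with a list of keys returns a 2-D DataFrame, which is
# the shape downstream scikit-learn estimators expect for X.
print(df[['height']].shape)  # (2, 1)
```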
Those changes look good
Some things I noticed
@@ -101,6 +101,105 @@ memory the ``DictVectorizer`` class uses a ``scipy.sparse`` matrix by
default instead of a ``numpy.ndarray``.

.. _column_transformer:
jnothman (Member) commented on May 27, 2018
This should be in compose.rst, but perhaps noted at the top of this file
jorisvandenbossche (Author) commented on May 28, 2018
Yes, I know, but also (related to what I mentioned here: #9012 (comment)):
- when moving to compose.rst, I think we should use a different example (e.g. using transformers from the preprocessing module, as I think that is a more typical use case)
- we should reference this in preprocessing.rst
- we should add a better 'typical data science use case' example for the example gallery
- I would maybe keep the explanation currently in feature_extraction.rst (the example), but shorten it by referring to compose.rst for the general explanation.
I can work on the above this week. But in light of getting this merged sooner rather than later, I would prefer doing it as a follow-up PR, if that is fine? (I can also do a minimal version here and simply move the current docs addition to compose.rst without any of the other mentioned improvements.)
feature_names : list of strings
    Names of the features produced by transform.
"""
check_is_fitted(self, 'transformers_')
jorisvandenbossche (Author) commented on May 28, 2018
Ideally, yes. I am only not fully sure what to do here currently, given that get_feature_names is in general not really well supported.
I think ideally I would add the names of the passed-through columns to feature_names, but then use the actual string column names in case of pandas DataFrames. And in case of numpy arrays, return the indices into that array as strings (['0', '1', ..])?
I can also raise an error for now if there are columns passed through, just to make sure that if we improve get_feature_names in the future, it does not lead to a change in behaviour (but only a removal of the error).
jnothman (Member) commented on May 28, 2018
Can just raise NotImplementedError in case of remainder != 'drop' for now... Or you can tack the remainder transformer onto the end of _iter.
I agree get_feature_names is not quite the right design.
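A rough sketch of the guard being suggested, using a hypothetical helper name and assuming a fitted transformers_ list of (name, transformer, columns) triples; this is not the code that was actually committed:

```python
def column_transformer_feature_names(ct):
    """Hypothetical helper: feature names for a fitted ColumnTransformer-like
    object `ct` exposing `remainder` and a fitted `transformers_` list of
    (name, transformer, columns) triples."""
    if ct.remainder != 'drop':
        # Passed-through columns have no obvious names yet, so fail loudly
        # rather than silently returning an incomplete list.
        raise NotImplementedError(
            "get_feature_names is not supported when remainder != 'drop'")
    names = []
    for name, trans, _ in ct.transformers_:
        names.extend('%s__%s' % (name, f) for f in trans.get_feature_names())
    return names
```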
Yes, just move it. The rest can happen after release or at the hands of other contributors.
Last update: added an additional error for get_feature_names and moved the docs to compose.rst. As far as I am concerned, somebody can push the green button ;-)
Indeed: let's see how this flies!
Merged 0b6308c into scikit-learn:master
Thanks Joris for some great work on finally making this happen!
Woohoo, thanks for merging!
Great work, congrats!
Yes, I am jumping up and down with excitement.
Really nice!!! Let's stack those columns then ;)
eyadsibai commented on May 29, 2018
Looking forward to the next release.
armgilles commented on May 31, 2018
The next release will be amazing!
OMG this is great! Thank you so much for your work (and patience) on this one @jorisvandenbossche
Thank you @jorisvandenbossche!! Great stuff. Is there going to be an effort (I would like to contribute) to implement ...? I find that one of the big advantages that ... What do you guys think? Thank you in advance.
@partmor See #9606 and #6425. What exactly was working with DataFrameMapper that's not currently working? I feel like the main use case will be with OneHotEncoder/CategoricalEncoder, which will provide a ...
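For context, the kind of workflow being referred to here; this sketch assumes OneHotEncoder.get_feature_names from scikit-learn 0.20-era releases (later renamed get_feature_names_out) and uses made-up column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'city': ['Paris', 'London', 'Paris'],
                   'age': [30, 25, 41]})

ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['city']),
    ('scale', StandardScaler(), ['age']),
])
ct.fit(df)

# Names of the dummy columns generated for 'city', e.g. ['x0_London', 'x0_Paris'].
print(ct.named_transformers_['onehot'].get_feature_names())
```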
@amueller thank you for the links. For instance, if we want to use ...
You might want to look at the eli5 library, where more feature names have been implemented in a single-dispatch framework that is easily extended.
Also: #5523 suggests pandas in/out. I think it might be a good idea to have something like df_out on transformers, but this is a harder change to make.
jorisvandenbossche commented on Jun 6, 2017 · edited by amueller
Continuation of @amueller's PR #3886 (for now just rebased and updated for changes in sklearn)
Fixes #2034.
Closes #2034, closes #3886, closes #8540, closes #8539