--- tags: - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - dataset_size:13734 - loss:MultipleNegativesRankingLoss base_model: intfloat/e5-small-v2 widget: - source_sentence: predict sentences: - " def _compute_score_samples(self, X, subsample_features):\n \"\"\"\n\ \ Compute the score of each samples in X going through the extra trees.\n\ \n Parameters\n ----------\n X : array-like or sparse matrix\n\ \ Data matrix.\n\n subsample_features : bool\n Whether\ \ features should be subsampled.\n\n Returns\n -------\n \ \ scores : ndarray of shape (n_samples,)\n The score of each sample\ \ in X.\n \"\"\"\n n_samples = X.shape[0]\n\n depths = np.zeros(n_samples,\ \ order=\"f\")\n\n average_path_length_max_samples = _average_path_length([self._max_samples])\n\ \n # Note: we use default n_jobs value, i.e. sequential computation, which\n\ \ # we expect to be more performant that parallelizing for small number\n\ \ # of samples, e.g. < 1k samples. Default n_jobs value can be overridden\n\ \ # by using joblib.parallel_backend context manager around\n #\ \ ._compute_score_samples. Using a higher n_jobs may speed up the\n # computation\ \ of the scores, e.g. for > 1k samples. See\n # https://github.com/scikit-learn/scikit-learn/pull/28622\ \ for more\n # details.\n lock = threading.Lock()\n Parallel(\n\ \ verbose=self.verbose,\n require=\"sharedmem\",\n \ \ )(\n delayed(_parallel_compute_tree_depths)(\n tree,\n\ \ X,\n features if subsample_features else None,\n\ \ self._decision_path_lengths[tree_idx],\n self._average_path_length_per_tree[tree_idx],\n\ \ depths,\n lock,\n )\n for\ \ tree_idx, (tree, features) in enumerate(\n zip(self.estimators_,\ \ self.estimators_features_)\n )\n )\n\n denominator\ \ = len(self.estimators_) * average_path_length_max_samples\n scores =\ \ 2 ** (\n # For a single training sample, denominator and depth are\ \ 0.\n # Therefore, we set the score manually to 1.\n -np.divide(\n\ \ depths, denominator, out=np.ones_like(depths), where=denominator\ \ != 0\n )\n )\n return scores" - " def predict(self, X):\n return np.zeros(X.shape[0])" - "def test_dist_threshold_invalid_parameters():\n X = [[0], [1]]\n with pytest.raises(ValueError,\ \ match=\"Exactly one of \"):\n AgglomerativeClustering(n_clusters=None,\ \ distance_threshold=None).fit(X)\n\n with pytest.raises(ValueError, match=\"\ Exactly one of \"):\n AgglomerativeClustering(n_clusters=2, distance_threshold=1).fit(X)\n\ \n X = [[0], [1]]\n with pytest.raises(ValueError, match=\"compute_full_tree\ \ must be True if\"):\n AgglomerativeClustering(\n n_clusters=None,\ \ distance_threshold=1, compute_full_tree=False\n ).fit(X)" - source_sentence: sklearn tags sentences: - " def __sklearn_tags__(self):\n tags = super().__sklearn_tags__()\n\ \ tags.input_tags.sparse = True\n return tags" - "class SelectFdr(_BaseFilter):\n \"\"\"Filter: Select the p-values for an estimated\ \ false discovery rate.\n\n This uses the Benjamini-Hochberg procedure. ``alpha``\ \ is an upper bound\n on the expected false discovery rate.\n\n Read more\ \ in the :ref:`User Guide `.\n\n Parameters\n\ \ ----------\n score_func : callable, default=f_classif\n Function\ \ taking two arrays X and y, and returning a pair of arrays\n (scores,\ \ pvalues).\n Default is f_classif (see below \"See Also\"). 
The default\ \ function only\n works with classification tasks.\n\n alpha : float,\ \ default=5e-2\n The highest uncorrected p-value for features to keep.\n\ \n Attributes\n ----------\n scores_ : array-like of shape (n_features,)\n\ \ Scores of features.\n\n pvalues_ : array-like of shape (n_features,)\n\ \ p-values of feature scores.\n\n n_features_in_ : int\n Number\ \ of features seen during :term:`fit`.\n\n .. versionadded:: 0.24\n\n \ \ feature_names_in_ : ndarray of shape (`n_features_in_`,)\n Names of\ \ features seen during :term:`fit`. Defined only when `X`\n has feature\ \ names that are all strings.\n\n .. versionadded:: 1.0\n\n See Also\n\ \ --------\n f_classif : ANOVA F-value between label/feature for classification\ \ tasks.\n mutual_info_classif : Mutual information for a discrete target.\n\ \ chi2 : Chi-squared stats of non-negative features for classification tasks.\n\ \ f_regression : F-value between label/feature for regression tasks.\n mutual_info_regression\ \ : Mutual information for a continuous target.\n SelectPercentile : Select\ \ features based on percentile of the highest\n scores.\n SelectKBest\ \ : Select features based on the k highest scores.\n SelectFpr : Select features\ \ based on a false positive rate test.\n SelectFwe : Select features based\ \ on family-wise error rate.\n GenericUnivariateSelect : Univariate feature\ \ selector with configurable\n mode.\n\n References\n ----------\n\ \ https://en.wikipedia.org/wiki/False_discovery_rate\n\n Examples\n --------\n\ \ >>> from sklearn.datasets import load_breast_cancer\n >>> from sklearn.feature_selection\ \ import SelectFdr, chi2\n >>> X, y = load_breast_cancer(return_X_y=True)\n\ \ >>> X.shape\n (569, 30)\n >>> X_new = SelectFdr(chi2, alpha=0.01).fit_transform(X,\ \ y)\n >>> X_new.shape\n (569, 16)\n \"\"\"\n\n _parameter_constraints:\ \ dict = {\n **_BaseFilter._parameter_constraints,\n \"alpha\":\ \ [Interval(Real, 0, 1, closed=\"both\")],\n }\n\n def __init__(self, score_func=f_classif,\ \ *, alpha=5e-2):\n super().__init__(score_func=score_func)\n self.alpha\ \ = alpha\n\n def _get_support_mask(self):\n check_is_fitted(self)\n\ \n n_features = len(self.pvalues_)\n sv = np.sort(self.pvalues_)\n\ \ selected = sv[\n sv <= float(self.alpha) / n_features * np.arange(1,\ \ n_features + 1)\n ]\n if selected.size == 0:\n return\ \ np.zeros_like(self.pvalues_, dtype=bool)\n return self.pvalues_ <= selected.max()" - "def test_absolute_error():\n # For coverage only.\n X, y = make_regression(n_samples=500,\ \ random_state=0)\n gbdt = HistGradientBoostingRegressor(loss=\"absolute_error\"\ , random_state=0)\n gbdt.fit(X, y)\n assert gbdt.score(X, y) > 0.9" - source_sentence: test lsvc intercept scaling zero sentences: - "class BaggingClassifier(ClassifierMixin, BaseBagging):\n \"\"\"A Bagging classifier.\n\ \n A Bagging classifier is an ensemble meta-estimator that fits base\n classifiers\ \ each on random subsets of the original dataset and then\n aggregate their\ \ individual predictions (either by voting or by averaging)\n to form a final\ \ prediction. Such a meta-estimator can typically be used as\n a way to reduce\ \ the variance of a black-box estimator (e.g., a decision\n tree), by introducing\ \ randomization into its construction procedure and\n then making an ensemble\ \ out of it.\n\n This algorithm encompasses several works from the literature.\ \ When random\n subsets of the dataset are drawn as random subsets of the samples,\ \ then\n this algorithm is known as Pasting [1]_. 
If samples are drawn with\n\ \ replacement, then the method is known as Bagging [2]_. When random subsets\n\ \ of the dataset are drawn as random subsets of the features, then the method\n\ \ is known as Random Subspaces [3]_. Finally, when base estimators are built\n\ \ on subsets of both samples and features, then the method is known as\n \ \ Random Patches [4]_.\n\n Read more in the :ref:`User Guide `.\n\ \n .. versionadded:: 0.15\n\n Parameters\n ----------\n estimator\ \ : object, default=None\n The base estimator to fit on random subsets\ \ of the dataset.\n If None, then the base estimator is a\n :class:`~sklearn.tree.DecisionTreeClassifier`.\n\ \n .. versionadded:: 1.2\n `base_estimator` was renamed to `estimator`.\n\ \n n_estimators : int, default=10\n The number of base estimators in\ \ the ensemble.\n\n max_samples : int or float, default=1.0\n The number\ \ of samples to draw from X to train each base estimator (with\n replacement\ \ by default, see `bootstrap` for more details).\n\n - If int, then draw\ \ `max_samples` samples.\n - If float, then draw `max_samples * X.shape[0]`\ \ samples.\n\n max_features : int or float, default=1.0\n The number\ \ of features to draw from X to train each base estimator (\n without replacement\ \ by default, see `bootstrap_features` for more\n details).\n\n \ \ - If int, then draw `max_features` features.\n - If float, then draw\ \ `max(1, int(max_features * n_features_in_))` features.\n\n bootstrap : bool,\ \ default=True\n Whether samples are drawn with replacement. If False,\ \ sampling\n without replacement is performed.\n\n bootstrap_features\ \ : bool, default=False\n Whether features are drawn with replacement.\n\ \n oob_score : bool, default=False\n Whether to use out-of-bag samples\ \ to estimate\n the generalization error. Only available if bootstrap=True.\n\ \n warm_start : bool, default=False\n When set to True, reuse the solution\ \ of the previous call to fit\n and add more estimators to the ensemble,\ \ otherwise, just fit\n a whole new ensemble. See :term:`the Glossary `.\n\ \n .. versionadded:: 0.17\n *warm_start* constructor parameter.\n\ \n n_jobs : int, default=None\n The number of jobs to run in parallel\ \ for both :meth:`fit` and\n :meth:`predict`. ``None`` means 1 unless in\ \ a\n :obj:`joblib.parallel_backend` context. ``-1`` means using all\n\ \ processors. See :term:`Glossary ` for more details.\n\n random_state\ \ : int, RandomState instance or None, default=None\n Controls the random\ \ resampling of the original dataset\n (sample wise and feature wise).\n\ \ If the base estimator accepts a `random_state` attribute, a different\n\ \ seed is generated for each instance in the ensemble.\n Pass an\ \ int for reproducible output across multiple function calls.\n See :term:`Glossary\ \ `.\n\n verbose : int, default=0\n Controls the verbosity\ \ when fitting and predicting.\n\n Attributes\n ----------\n estimator_\ \ : estimator\n The base estimator from which the ensemble is grown.\n\n\ \ .. versionadded:: 1.2\n `base_estimator_` was renamed to `estimator_`.\n\ \n n_features_in_ : int\n Number of features seen during :term:`fit`.\n\ \n .. versionadded:: 0.24\n\n feature_names_in_ : ndarray of shape (`n_features_in_`,)\n\ \ Names of features seen during :term:`fit`. Defined only when `X`\n \ \ has feature names that are all strings.\n\n .. 
versionadded:: 1.0\n\ \n estimators_ : list of estimators\n The collection of fitted base\ \ estimators.\n\n estimators_samples_ : list of arrays\n The subset\ \ of drawn samples (i.e., the in-bag samples) for each base\n estimator.\ \ Each subset is defined by an array of the indices selected.\n\n estimators_features_\ \ : list of arrays\n The subset of drawn features for each base estimator.\n\ \n classes_ : ndarray of shape (n_classes,)\n The classes labels.\n\n\ \ n_classes_ : int or list\n The number of classes.\n\n oob_score_\ \ : float\n Score of the training dataset obtained using an out-of-bag\ \ estimate.\n This attribute exists only when ``oob_score`` is True.\n\n\ \ oob_decision_function_ : ndarray of shape (n_samples, n_classes)\n \ \ Decision function computed with out-of-bag estimate on the training\n \ \ set. If n_estimators is small it might be possible that a data point\n \ \ was never left out during the bootstrap. In this case,\n `oob_decision_function_`\ \ might contain NaN. This attribute exists\n only when ``oob_score`` is\ \ True.\n\n See Also\n --------\n BaggingRegressor : A Bagging regressor.\n\ \n References\n ----------\n\n .. [1] L. Breiman, \"Pasting small votes\ \ for classification in large\n databases and on-line\", Machine Learning,\ \ 36(1), 85-103, 1999.\n\n .. [2] L. Breiman, \"Bagging predictors\", Machine\ \ Learning, 24(2), 123-140,\n 1996.\n\n .. [3] T. Ho, \"The random\ \ subspace method for constructing decision\n forests\", Pattern Analysis\ \ and Machine Intelligence, 20(8), 832-844,\n 1998.\n\n .. [4] G.\ \ Louppe and P. Geurts, \"Ensembles on Random Patches\", Machine\n Learning\ \ and Knowledge Discovery in Databases, 346-361, 2012.\n\n Examples\n --------\n\ \ >>> from sklearn.svm import SVC\n >>> from sklearn.ensemble import BaggingClassifier\n\ \ >>> from sklearn.datasets import make_classification\n >>> X, y = make_classification(n_samples=100,\ \ n_features=4,\n ... n_informative=2, n_redundant=0,\n\ \ ... random_state=0, shuffle=False)\n >>> clf\ \ = BaggingClassifier(estimator=SVC(),\n ... 
n_estimators=10,\ \ random_state=0).fit(X, y)\n >>> clf.predict([[0, 0, 0, 0]])\n array([1])\n\ \ \"\"\"\n\n def __init__(\n self,\n estimator=None,\n \ \ n_estimators=10,\n *,\n max_samples=1.0,\n max_features=1.0,\n\ \ bootstrap=True,\n bootstrap_features=False,\n oob_score=False,\n\ \ warm_start=False,\n n_jobs=None,\n random_state=None,\n\ \ verbose=0,\n ):\n super().__init__(\n estimator=estimator,\n\ \ n_estimators=n_estimators,\n max_samples=max_samples,\n\ \ max_features=max_features,\n bootstrap=bootstrap,\n \ \ bootstrap_features=bootstrap_features,\n oob_score=oob_score,\n\ \ warm_start=warm_start,\n n_jobs=n_jobs,\n random_state=random_state,\n\ \ verbose=verbose,\n )\n\n def _get_estimator(self):\n \ \ \"\"\"Resolve which estimator to return (default is DecisionTreeClassifier)\"\ \"\"\n if self.estimator is None:\n return DecisionTreeClassifier()\n\ \ return self.estimator\n\n def _set_oob_score(self, X, y):\n \ \ n_samples = y.shape[0]\n n_classes_ = self.n_classes_\n\n predictions\ \ = np.zeros((n_samples, n_classes_))\n\n for estimator, samples, features\ \ in zip(\n self.estimators_, self.estimators_samples_, self.estimators_features_\n\ \ ):\n # Create mask for OOB samples\n mask = ~indices_to_mask(samples,\ \ n_samples)\n\n if hasattr(estimator, \"predict_proba\"):\n \ \ predictions[mask, :] += estimator.predict_proba(\n \ \ (X[mask, :])[:, features]\n )\n\n else:\n \ \ p = estimator.predict((X[mask, :])[:, features])\n \ \ j = 0\n\n for i in range(n_samples):\n if\ \ mask[i]:\n predictions[i, p[j]] += 1\n \ \ j += 1\n\n if (predictions.sum(axis=1) == 0).any():\n \ \ warn(\n \"Some inputs do not have OOB scores. \"\n \ \ \"This probably means too few estimators were used \"\n \ \ \"to compute any reliable oob estimates.\"\n )\n\n oob_decision_function\ \ = predictions / predictions.sum(axis=1)[:, np.newaxis]\n oob_score =\ \ accuracy_score(y, np.argmax(predictions, axis=1))\n\n self.oob_decision_function_\ \ = oob_decision_function\n self.oob_score_ = oob_score\n\n def _validate_y(self,\ \ y):\n y = column_or_1d(y, warn=True)\n check_classification_targets(y)\n\ \ self.classes_, y = np.unique(y, return_inverse=True)\n self.n_classes_\ \ = len(self.classes_)\n\n return y\n\n def predict(self, X, **params):\n\ \ \"\"\"Predict class for X.\n\n The predicted class of an input\ \ sample is computed as the class with\n the highest mean predicted probability.\ \ If base estimators do not\n implement a ``predict_proba`` method, then\ \ it resorts to voting.\n\n Parameters\n ----------\n X :\ \ {array-like, sparse matrix} of shape (n_samples, n_features)\n The\ \ training input samples. Sparse matrices are accepted only if\n they\ \ are supported by the base estimator.\n\n **params : dict\n \ \ Parameters routed to the `predict_proba` (if available) or the `predict`\n\ \ method (otherwise) of the sub-estimators via the metadata routing\ \ API.\n\n .. 
versionadded:: 1.7\n\n Only available\ \ if\n `sklearn.set_config(enable_metadata_routing=True)` is set.\ \ See\n :ref:`Metadata Routing User Guide ` for\ \ more\n details.\n\n Returns\n -------\n \ \ y : ndarray of shape (n_samples,)\n The predicted classes.\n \ \ \"\"\"\n _raise_for_params(params, self, \"predict\")\n\n predicted_probabilitiy\ \ = self.predict_proba(X, **params)\n return self.classes_.take((np.argmax(predicted_probabilitiy,\ \ axis=1)), axis=0)\n\n def predict_proba(self, X, **params):\n \"\"\ \"Predict class probabilities for X.\n\n The predicted class probabilities\ \ of an input sample is computed as\n the mean predicted class probabilities\ \ of the base estimators in the\n ensemble. If base estimators do not implement\ \ a ``predict_proba``\n method, then it resorts to voting and the predicted\ \ class probabilities\n of an input sample represents the proportion of\ \ estimators predicting\n each class.\n\n Parameters\n ----------\n\ \ X : {array-like, sparse matrix} of shape (n_samples, n_features)\n \ \ The training input samples. Sparse matrices are accepted only if\n\ \ they are supported by the base estimator.\n\n **params : dict\n\ \ Parameters routed to the `predict_proba` (if available) or the `predict`\n\ \ method (otherwise) of the sub-estimators via the metadata routing\ \ API.\n\n .. versionadded:: 1.7\n\n Only available\ \ if\n `sklearn.set_config(enable_metadata_routing=True)` is set.\ \ See\n :ref:`Metadata Routing User Guide ` for\ \ more\n details.\n\n Returns\n -------\n \ \ p : ndarray of shape (n_samples, n_classes)\n The class probabilities\ \ of the input samples. The order of the\n classes corresponds to that\ \ in the attribute :term:`classes_`.\n \"\"\"\n _raise_for_params(params,\ \ self, \"predict_proba\")\n\n check_is_fitted(self)\n # Check data\n\ \ X = validate_data(\n self,\n X,\n accept_sparse=[\"\ csr\", \"csc\"],\n dtype=None,\n ensure_all_finite=False,\n\ \ reset=False,\n )\n\n if _routing_enabled():\n \ \ routed_params = process_routing(self, \"predict_proba\", **params)\n \ \ else:\n routed_params = Bunch()\n routed_params.estimator\ \ = Bunch(predict_proba=Bunch())\n\n # Parallel loop\n n_jobs, _,\ \ starts = _partition_estimators(self.n_estimators, self.n_jobs)\n\n all_proba\ \ = Parallel(\n n_jobs=n_jobs, verbose=self.verbose, **self._parallel_args()\n\ \ )(\n delayed(_parallel_predict_proba)(\n self.estimators_[starts[i]\ \ : starts[i + 1]],\n self.estimators_features_[starts[i] : starts[i\ \ + 1]],\n X,\n self.n_classes_,\n \ \ predict_params=routed_params.estimator.get(\"predict\", None),\n \ \ predict_proba_params=routed_params.estimator.get(\"predict_proba\", None),\n\ \ )\n for i in range(n_jobs)\n )\n\n # Reduce\n\ \ proba = sum(all_proba) / self.n_estimators\n\n return proba\n\n\ \ def predict_log_proba(self, X, **params):\n \"\"\"Predict class log-probabilities\ \ for X.\n\n The predicted class log-probabilities of an input sample is\ \ computed as\n the log of the mean predicted class probabilities of the\ \ base\n estimators in the ensemble.\n\n Parameters\n ----------\n\ \ X : {array-like, sparse matrix} of shape (n_samples, n_features)\n \ \ The training input samples. Sparse matrices are accepted only if\n\ \ they are supported by the base estimator.\n\n **params : dict\n\ \ Parameters routed to the `predict_log_proba`, the `predict_proba`\ \ or the\n `proba` method of the sub-estimators via the metadata routing\ \ API. 
The\n routing is tried in the mentioned order depending on whether\ \ this method is\n available on the sub-estimator.\n\n ..\ \ versionadded:: 1.7\n\n Only available if\n `sklearn.set_config(enable_metadata_routing=True)`\ \ is set. See\n :ref:`Metadata Routing User Guide `\ \ for more\n details.\n\n Returns\n -------\n \ \ p : ndarray of shape (n_samples, n_classes)\n The class log-probabilities\ \ of the input samples. The order of the\n classes corresponds to that\ \ in the attribute :term:`classes_`.\n \"\"\"\n _raise_for_params(params,\ \ self, \"predict_log_proba\")\n\n check_is_fitted(self)\n\n if\ \ hasattr(self.estimator_, \"predict_log_proba\"):\n # Check data\n\ \ X = validate_data(\n self,\n X,\n \ \ accept_sparse=[\"csr\", \"csc\"],\n dtype=None,\n\ \ ensure_all_finite=False,\n reset=False,\n \ \ )\n\n if _routing_enabled():\n routed_params\ \ = process_routing(self, \"predict_log_proba\", **params)\n else:\n\ \ routed_params = Bunch()\n routed_params.estimator\ \ = Bunch(predict_log_proba=Bunch())\n\n # Parallel loop\n \ \ n_jobs, _, starts = _partition_estimators(self.n_estimators, self.n_jobs)\n\ \n all_log_proba = Parallel(n_jobs=n_jobs, verbose=self.verbose)(\n\ \ delayed(_parallel_predict_log_proba)(\n self.estimators_[starts[i]\ \ : starts[i + 1]],\n self.estimators_features_[starts[i] :\ \ starts[i + 1]],\n X,\n self.n_classes_,\n\ \ params=routed_params.estimator.predict_log_proba,\n \ \ )\n for i in range(n_jobs)\n )\n\n \ \ # Reduce\n log_proba = all_log_proba[0]\n\n for\ \ j in range(1, len(all_log_proba)):\n log_proba = np.logaddexp(log_proba,\ \ all_log_proba[j])\n\n log_proba -= np.log(self.n_estimators)\n\n\ \ else:\n log_proba = np.log(self.predict_proba(X, **params))\n\ \n return log_proba\n\n @available_if(\n _estimator_has(\"decision_function\"\ , delegates=(\"estimators_\", \"estimator\"))\n )\n def decision_function(self,\ \ X, **params):\n \"\"\"Average of the decision functions of the base classifiers.\n\ \n Parameters\n ----------\n X : {array-like, sparse matrix}\ \ of shape (n_samples, n_features)\n The training input samples. Sparse\ \ matrices are accepted only if\n they are supported by the base estimator.\n\ \n **params : dict\n Parameters routed to the `decision_function`\ \ method of the sub-estimators\n via the metadata routing API.\n\n\ \ .. versionadded:: 1.7\n\n Only available if\n \ \ `sklearn.set_config(enable_metadata_routing=True)` is set. See\n\ \ :ref:`Metadata Routing User Guide ` for more\n\ \ details.\n\n Returns\n -------\n score :\ \ ndarray of shape (n_samples, k)\n The decision function of the input\ \ samples. The columns correspond\n to the classes in sorted order,\ \ as they appear in the attribute\n ``classes_``. 
Regression and binary\ \ classification are special\n cases with ``k == 1``, otherwise ``k==n_classes``.\n\ \ \"\"\"\n _raise_for_params(params, self, \"decision_function\"\ )\n\n check_is_fitted(self)\n\n # Check data\n X = validate_data(\n\ \ self,\n X,\n accept_sparse=[\"csr\", \"csc\"\ ],\n dtype=None,\n ensure_all_finite=False,\n \ \ reset=False,\n )\n\n if _routing_enabled():\n routed_params\ \ = process_routing(self, \"decision_function\", **params)\n else:\n \ \ routed_params = Bunch()\n routed_params.estimator = Bunch(decision_function=Bunch())\n\ \n # Parallel loop\n n_jobs, _, starts = _partition_estimators(self.n_estimators,\ \ self.n_jobs)\n\n all_decisions = Parallel(n_jobs=n_jobs, verbose=self.verbose)(\n\ \ delayed(_parallel_decision_function)(\n self.estimators_[starts[i]\ \ : starts[i + 1]],\n self.estimators_features_[starts[i] : starts[i\ \ + 1]],\n X,\n params=routed_params.estimator.decision_function,\n\ \ )\n for i in range(n_jobs)\n )\n\n # Reduce\n\ \ decisions = sum(all_decisions) / self.n_estimators\n\n return\ \ decisions" - " def get_n_splits(self, X=None, y=None, groups=None):\n return self.n_splits" - "def test_lsvc_intercept_scaling_zero():\n # Test that intercept_scaling is\ \ ignored when fit_intercept is False\n\n lsvc = svm.LinearSVC(fit_intercept=False)\n\ \ lsvc.fit(X, Y)\n assert lsvc.intercept_ == 0.0" - source_sentence: test power transformer 1d sentences: - "def test_power_transformer_1d():\n X = np.abs(X_1col)\n\n for standardize\ \ in [True, False]:\n pt = PowerTransformer(method=\"box-cox\", standardize=standardize)\n\ \n X_trans = pt.fit_transform(X)\n X_trans_func = power_transform(X,\ \ method=\"box-cox\", standardize=standardize)\n\n X_expected, lambda_expected\ \ = stats.boxcox(X.flatten())\n\n if standardize:\n X_expected\ \ = scale(X_expected)\n\n assert_almost_equal(X_expected.reshape(-1, 1),\ \ X_trans)\n assert_almost_equal(X_expected.reshape(-1, 1), X_trans_func)\n\ \n assert_almost_equal(X, pt.inverse_transform(X_trans))\n assert_almost_equal(lambda_expected,\ \ pt.lambdas_[0])\n\n assert len(pt.lambdas_) == X.shape[1]\n assert\ \ isinstance(pt.lambdas_, np.ndarray)" - "def test_hdbscan_feature_array():\n \"\"\"\n Tests that HDBSCAN works with\ \ feature array, including an arbitrary\n goodness of fit check. Note that\ \ the check is a simple heuristic.\n \"\"\"\n labels = HDBSCAN().fit_predict(X)\n\ \n # Check that clustering is arbitrarily good\n # This is a heuristic to\ \ guard against regression\n check_label_quality(labels)" - "def test_pca_initialization_not_compatible_with_sparse_input(csr_container):\n\ \ # Sparse input matrices cannot use PCA initialization.\n tsne = TSNE(init=\"\ pca\", learning_rate=100.0, perplexity=1)\n with pytest.raises(TypeError, match=\"\ PCA initialization.*\"):\n tsne.fit_transform(csr_container([[0, 5], [5,\ \ 0]]))" - source_sentence: Evaluate predicted target values for X relative to y_true sentences: - "def test_hdbscan_usable_inputs(X, kwargs):\n \"\"\"\n Tests that HDBSCAN\ \ works correctly for array-likes and precomputed inputs\n with non-finite\ \ points.\n \"\"\"\n HDBSCAN(min_samples=1, **kwargs).fit(X)" - " def __call__(self, estimator, X, y_true, sample_weight=None, **kwargs):\n\ \ \"\"\"Evaluate predicted target values for X relative to y_true.\n\n\ \ Parameters\n ----------\n estimator : object\n \ \ Trained estimator to use for scoring. 
Must have a predict_proba\n \ \ method; the output of that is used to compute the score.\n\n X :\ \ {array-like, sparse matrix}\n Test data that will be fed to estimator.predict.\n\ \n y_true : array-like\n Gold standard target values for X.\n\ \n sample_weight : array-like of shape (n_samples,), default=None\n \ \ Sample weights.\n\n **kwargs : dict\n Other parameters\ \ passed to the scorer. Refer to\n :func:`set_score_request` for more\ \ details.\n\n Only available if `enable_metadata_routing=True`. See\ \ the\n :ref:`User Guide `.\n\n .. versionadded::\ \ 1.3\n\n Returns\n -------\n score : float\n \ \ Score function applied to prediction of estimator on X.\n \"\"\"\n \ \ # TODO (1.8): remove in 1.8 (scoring=\"max_error\" has been deprecated\ \ in 1.6)\n if self._deprecation_msg is not None:\n warnings.warn(\n\ \ self._deprecation_msg, category=DeprecationWarning, stacklevel=2\n\ \ )\n\n _raise_for_params(kwargs, self, None)\n\n _kwargs\ \ = copy.deepcopy(kwargs)\n if sample_weight is not None:\n \ \ _kwargs[\"sample_weight\"] = sample_weight\n\n return self._score(partial(_cached_call,\ \ None), estimator, X, y_true, **_kwargs)" - ' def set_inverse_transform_request(self, **kwargs): pass' pipeline_tag: sentence-similarity library_name: sentence-transformers --- # SentenceTransformer based on intfloat/e5-small-v2 This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2). It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. ## Model Details ### Model Description - **Model Type:** Sentence Transformer - **Base model:** [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) - **Maximum Sequence Length:** 512 tokens - **Output Dimensionality:** 384 dimensions - **Similarity Function:** Cosine Similarity ### Model Sources - **Documentation:** [Sentence Transformers Documentation](https://sbert.net) - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) ### Full Model Architecture ``` SentenceTransformer( (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: PeftModelForFeatureExtraction (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) (2): Normalize() ) ``` ## Usage ### Direct Usage (Sentence Transformers) First install the Sentence Transformers library: ```bash pip install -U sentence-transformers ``` Then you can load this model and run inference. ```python from sentence_transformers import SentenceTransformer # Download from the 🤗 Hub model = SentenceTransformer("sentence_transformers_model_id") # Run inference sentences = [ 'Evaluate predicted target values for X relative to y_true', ' def __call__(self, estimator, X, y_true, sample_weight=None, **kwargs):\n """Evaluate predicted target values for X relative to y_true.\n\n Parameters\n ----------\n estimator : object\n Trained estimator to use for scoring. 
Must have a predict_proba\n method; the output of that is used to compute the score.\n\n X : {array-like, sparse matrix}\n Test data that will be fed to estimator.predict.\n\n y_true : array-like\n Gold standard target values for X.\n\n sample_weight : array-like of shape (n_samples,), default=None\n Sample weights.\n\n **kwargs : dict\n Other parameters passed to the scorer. Refer to\n :func:`set_score_request` for more details.\n\n Only available if `enable_metadata_routing=True`. See the\n :ref:`User Guide `.\n\n .. versionadded:: 1.3\n\n Returns\n -------\n score : float\n Score function applied to prediction of estimator on X.\n """\n # TODO (1.8): remove in 1.8 (scoring="max_error" has been deprecated in 1.6)\n if self._deprecation_msg is not None:\n warnings.warn(\n self._deprecation_msg, category=DeprecationWarning, stacklevel=2\n )\n\n _raise_for_params(kwargs, self, None)\n\n _kwargs = copy.deepcopy(kwargs)\n if sample_weight is not None:\n _kwargs["sample_weight"] = sample_weight\n\n return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)', 'def test_hdbscan_usable_inputs(X, kwargs):\n """\n Tests that HDBSCAN works correctly for array-likes and precomputed inputs\n with non-finite points.\n """\n HDBSCAN(min_samples=1, **kwargs).fit(X)', ] embeddings = model.encode(sentences) print(embeddings.shape) # [3, 384] # Get the similarity scores for the embeddings similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [3, 3] ``` ## Training Details ### Training Dataset #### Unnamed Dataset * Size: 13,734 training samples * Columns: sentence_0 and sentence_1 * Approximate statistics based on the first 1000 samples: | | sentence_0 | sentence_1 | |:--------|:---------------------------------------------------------------------------------|:------------------------------------------------------------------------------------| | type | string | string | | details |
min: 3 tokens, mean: 8.78 tokens, max: 63 tokens | min: 9 tokens, mean: 233.15 tokens, max: 512 tokens
| * Samples: | sentence_0 | sentence_1 | |:-----------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Get the estimator | def _get_estimator(self):
"""Get the estimator.

Returns
-------
estimator_ : estimator object
The cloned estimator object.
"""
# TODO(1.8): remove and only keep clone(self.estimator)
if self.estimator is None and self.base_estimator != "deprecated":
estimator_ = clone(self.base_estimator)

warn(
(
"`base_estimator` has been deprecated in 1.6 and will be removed"
" in 1.8. Please use `estimator` instead."
),
FutureWarning,
)
# TODO(1.8) remove
elif self.estimator is None and self.base_estimator == "deprecated":
raise ValueError(
"You must pass an estimator to SelfTrainingClassifier. Use `estimator`."
)
elif self.estimator is not None and self.base_estimator != "deprecated":
raise ValueError(
"You must p...
| | Gaussian Naive Bayes (GaussianNB) | class GaussianNB(_BaseNB):
    """
    Gaussian Naive Bayes (GaussianNB).

    Can perform online updates to model parameters via :meth:`partial_fit`.
    For details on algorithm used to update feature means and variance online,
    see `Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque
    `_.

    Read more in the :ref:`User Guide `.

    Parameters
    ----------
    priors : array-like of shape (n_classes,), default=None
        Prior probabilities of the classes. If specified, the priors are not
        adjusted according to the data.

    var_smoothing : float, default=1e-9
        Portion of the largest variance of all features that is added to
        variances for calculation stability.

        .. versionadded:: 0.20

    Attributes
    ----------
    class_count_ : ndarray of shape (n_classes,)
        number of training samples observed in each class.

    class_pri...
| | test rfe cv n jobs | def test_rfe_cv_n_jobs(global_random_seed):
    generator = check_random_state(global_random_seed)
    iris = load_iris()
    X = np.c_[iris.data, generator.normal(size=(len(iris.data), 6))]
    y = iris.target

    rfecv = RFECV(estimator=SVC(kernel="linear"))
    rfecv.fit(X, y)
    rfecv_ranking = rfecv.ranking_

    rfecv_cv_results_ = rfecv.cv_results_

    rfecv.set_params(n_jobs=2)
    rfecv.fit(X, y)
    assert_array_almost_equal(rfecv.ranking_, rfecv_ranking)

    assert rfecv_cv_results_.keys() == rfecv.cv_results_.keys()
    for key in rfecv_cv_results_.keys():
        assert rfecv_cv_results_[key] == pytest.approx(rfecv.cv_results_[key])
| * Loss: [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters: ```json { "scale": 20.0, "similarity_fct": "cos_sim" } ``` ### Training Hyperparameters #### Non-Default Hyperparameters - `per_device_train_batch_size`: 16 - `per_device_eval_batch_size`: 16 - `num_train_epochs`: 1 - `fp16`: True - `multi_dataset_batch_sampler`: round_robin #### All Hyperparameters
Click to expand - `overwrite_output_dir`: False - `do_predict`: False - `eval_strategy`: no - `prediction_loss_only`: True - `per_device_train_batch_size`: 16 - `per_device_eval_batch_size`: 16 - `per_gpu_train_batch_size`: None - `per_gpu_eval_batch_size`: None - `gradient_accumulation_steps`: 1 - `eval_accumulation_steps`: None - `torch_empty_cache_steps`: None - `learning_rate`: 5e-05 - `weight_decay`: 0.0 - `adam_beta1`: 0.9 - `adam_beta2`: 0.999 - `adam_epsilon`: 1e-08 - `max_grad_norm`: 1 - `num_train_epochs`: 1 - `max_steps`: -1 - `lr_scheduler_type`: linear - `lr_scheduler_kwargs`: {} - `warmup_ratio`: 0.0 - `warmup_steps`: 0 - `log_level`: passive - `log_level_replica`: warning - `log_on_each_node`: True - `logging_nan_inf_filter`: True - `save_safetensors`: True - `save_on_each_node`: False - `save_only_model`: False - `restore_callback_states_from_checkpoint`: False - `no_cuda`: False - `use_cpu`: False - `use_mps_device`: False - `seed`: 42 - `data_seed`: None - `jit_mode_eval`: False - `use_ipex`: False - `bf16`: False - `fp16`: True - `fp16_opt_level`: O1 - `half_precision_backend`: auto - `bf16_full_eval`: False - `fp16_full_eval`: False - `tf32`: None - `local_rank`: 0 - `ddp_backend`: None - `tpu_num_cores`: None - `tpu_metrics_debug`: False - `debug`: [] - `dataloader_drop_last`: False - `dataloader_num_workers`: 0 - `dataloader_prefetch_factor`: None - `past_index`: -1 - `disable_tqdm`: False - `remove_unused_columns`: True - `label_names`: None - `load_best_model_at_end`: False - `ignore_data_skip`: False - `fsdp`: [] - `fsdp_min_num_params`: 0 - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} - `tp_size`: 0 - `fsdp_transformer_layer_cls_to_wrap`: None - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} - `deepspeed`: None - `label_smoothing_factor`: 0.0 - `optim`: adamw_torch - `optim_args`: None - `adafactor`: False - `group_by_length`: False - `length_column_name`: length - `ddp_find_unused_parameters`: None - `ddp_bucket_cap_mb`: None - `ddp_broadcast_buffers`: False - `dataloader_pin_memory`: True - `dataloader_persistent_workers`: False - `skip_memory_metrics`: True - `use_legacy_prediction_loop`: False - `push_to_hub`: False - `resume_from_checkpoint`: None - `hub_model_id`: None - `hub_strategy`: every_save - `hub_private_repo`: None - `hub_always_push`: False - `gradient_checkpointing`: False - `gradient_checkpointing_kwargs`: None - `include_inputs_for_metrics`: False - `include_for_metrics`: [] - `eval_do_concat_batches`: True - `fp16_backend`: auto - `push_to_hub_model_id`: None - `push_to_hub_organization`: None - `mp_parameters`: - `auto_find_batch_size`: False - `full_determinism`: False - `torchdynamo`: None - `ray_scope`: last - `ddp_timeout`: 1800 - `torch_compile`: False - `torch_compile_backend`: None - `torch_compile_mode`: None - `include_tokens_per_second`: False - `include_num_input_tokens_seen`: False - `neftune_noise_alpha`: None - `optim_target_modules`: None - `batch_eval_metrics`: False - `eval_on_start`: False - `use_liger_kernel`: False - `eval_use_gather_object`: False - `average_tokens_across_devices`: False - `prompts`: None - `batch_sampler`: batch_sampler - `multi_dataset_batch_sampler`: round_robin
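The non-default hyperparameters and loss configuration above map directly onto the Sentence Transformers v3 training API. Below is a minimal, illustrative sketch of how such a fine-tune could be reproduced; the example pairs, the `output_dir` name, and the omission of the PEFT/LoRA adapter shown in the model architecture are assumptions, not the exact training script.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Illustrative (query, code) pairs; the actual dataset contains 13,734 such rows.
train_dataset = Dataset.from_dict(
    {
        "sentence_0": ["predict", "sklearn tags"],
        "sentence_1": [
            "def predict(self, X):\n    return np.zeros(X.shape[0])",
            "def __sklearn_tags__(self):\n    tags = super().__sklearn_tags__()\n    return tags",
        ],
    }
)

model = SentenceTransformer("intfloat/e5-small-v2")

# scale=20.0 with cosine similarity matches the loss parameters listed above.
loss = MultipleNegativesRankingLoss(model, scale=20.0)

# Only the non-default hyperparameters from above are set explicitly.
args = SentenceTransformerTrainingArguments(
    output_dir="e5-small-v2-code-retrieval",  # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```

Because MultipleNegativesRankingLoss uses in-batch negatives, every other `sentence_1` in a batch of 16 acts as a negative for a given query, which is why the training pairs need no explicit negative column.
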
### Training Logs | Epoch | Step | Training Loss | |:------:|:----:|:-------------:| | 0.5821 | 500 | 0.6129 | ### Framework Versions - Python: 3.11.12 - Sentence Transformers: 3.4.1 - Transformers: 4.51.3 - PyTorch: 2.6.0+cu124 - Accelerate: 1.6.0 - Datasets: 3.5.1 - Tokenizers: 0.21.1 ## Citation ### BibTeX #### Sentence Transformers ```bibtex @inproceedings{reimers-2019-sentence-bert, title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", author = "Reimers, Nils and Gurevych, Iryna", booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", month = "11", year = "2019", publisher = "Association for Computational Linguistics", url = "https://arxiv.org/abs/1908.10084", } ``` #### MultipleNegativesRankingLoss ```bibtex @misc{henderson2017efficient, title={Efficient Natural Language Response Suggestion for Smart Reply}, author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, year={2017}, eprint={1705.00652}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```