# 

# Licensed to the Apache Software Foundation (ASF) under one or more 

# contributor license agreements. See the NOTICE file distributed with 

# this work for additional information regarding copyright ownership. 

# The ASF licenses this file to You under the Apache License, Version 2.0 

# (the "License"); you may not use this file except in compliance with 

# the License. You may obtain a copy of the License at 

# 

# http://www.apache.org/licenses/LICENSE-2.0 

# 

# Unless required by applicable law or agreed to in writing, software 

# distributed under the License is distributed on an "AS IS" BASIS, 

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 

# See the License for the specific language governing permissions and 

# limitations under the License. 

# 

import sys 

import warnings 

 

from pyspark.rdd import PythonEvalType 

from pyspark.sql.column import Column 

from pyspark.sql.dataframe import DataFrame 

 

 

class PandasGroupedOpsMixin(object): 

""" 

    Mix-in for pandas grouped operations. Currently, only :class:`GroupedData`

can use this class. 

""" 

 

def apply(self, udf): 

""" 

It is an alias of :meth:`pyspark.sql.GroupedData.applyInPandas`; however, it takes a 

        :func:`pyspark.sql.functions.pandas_udf` whereas

:meth:`pyspark.sql.GroupedData.applyInPandas` takes a Python native function. 

 

.. versionadded:: 2.3.0 

 

Parameters 

---------- 

udf : :func:`pyspark.sql.functions.pandas_udf` 

a grouped map user-defined function returned by 

:func:`pyspark.sql.functions.pandas_udf`. 

 

Notes 

----- 

It is preferred to use :meth:`pyspark.sql.GroupedData.applyInPandas` over this 

        API. This API will be deprecated in future releases.

 

Examples 

-------- 

>>> from pyspark.sql.functions import pandas_udf, PandasUDFType 

>>> df = spark.createDataFrame( 

... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], 

... ("id", "v")) 

>>> @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP) # doctest: +SKIP 

... def normalize(pdf): 

... v = pdf.v 

... return pdf.assign(v=(v - v.mean()) / v.std()) 

>>> df.groupby("id").apply(normalize).show() # doctest: +SKIP 

+---+-------------------+ 

| id| v| 

+---+-------------------+ 

| 1|-0.7071067811865475| 

| 1| 0.7071067811865475| 

| 2|-0.8320502943378437| 

| 2|-0.2773500981126146| 

| 2| 1.1094003924504583| 

+---+-------------------+ 

 

See Also 

-------- 

pyspark.sql.functions.pandas_udf 

""" 

        # Columns are special because hasattr always returns True

if isinstance(udf, Column) or not hasattr(udf, 'func') \ 

or udf.evalType != PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF: 

raise ValueError("Invalid udf: the udf argument must be a pandas_udf of type " 

"GROUPED_MAP.") 

 

warnings.warn( 

"It is preferred to use 'applyInPandas' over this " 

"API. This API will be deprecated in the future releases. See SPARK-28264 for " 

"more details.", UserWarning) 

 

return self.applyInPandas(udf.func, schema=udf.returnType) 

 

def applyInPandas(self, func, schema): 

""" 

Maps each group of the current :class:`DataFrame` using a pandas udf and returns the result 

as a `DataFrame`. 

 

The function should take a `pandas.DataFrame` and return another 

`pandas.DataFrame`. For each group, all columns are passed together as a `pandas.DataFrame` 

        to the user function, and the returned `pandas.DataFrame`\\s are combined into a

:class:`DataFrame`. 

 

The `schema` should be a :class:`StructType` describing the schema of the returned 

`pandas.DataFrame`. The column labels of the returned `pandas.DataFrame` must either match 

the field names in the defined schema if specified as strings, or match the 

field data types by position if not strings, e.g. integer indices. 

The length of the returned `pandas.DataFrame` can be arbitrary. 

 

.. versionadded:: 3.0.0 

 

Parameters 

---------- 

func : function 

a Python native function that takes a `pandas.DataFrame`, and outputs a 

`pandas.DataFrame`. 

schema : :class:`pyspark.sql.types.DataType` or str 

the return type of the `func` in PySpark. The value can be either a 

:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string. 

 

Examples 

-------- 

>>> import pandas as pd # doctest: +SKIP 

>>> from pyspark.sql.functions import pandas_udf, ceil 

>>> df = spark.createDataFrame( 

... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], 

... ("id", "v")) # doctest: +SKIP 

>>> def normalize(pdf): 

... v = pdf.v 

... return pdf.assign(v=(v - v.mean()) / v.std()) 

>>> df.groupby("id").applyInPandas( 

... normalize, schema="id long, v double").show() # doctest: +SKIP 

+---+-------------------+ 

| id| v| 

+---+-------------------+ 

| 1|-0.7071067811865475| 

| 1| 0.7071067811865475| 

| 2|-0.8320502943378437| 

| 2|-0.2773500981126146| 

| 2| 1.1094003924504583| 

+---+-------------------+ 

 

Alternatively, the user can pass a function that takes two arguments. 

In this case, the grouping key(s) will be passed as the first argument and the data will 

be passed as the second argument. The grouping key(s) will be passed as a tuple of numpy 

        scalar values, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in

as a `pandas.DataFrame` containing all columns from the original Spark DataFrame. 

This is useful when the user does not want to hardcode grouping key(s) in the function. 

 

>>> df = spark.createDataFrame( 

... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], 

... ("id", "v")) # doctest: +SKIP 

>>> def mean_func(key, pdf): 

... # key is a tuple of one numpy.int64, which is the value 

... # of 'id' for the current group 

... return pd.DataFrame([key + (pdf.v.mean(),)]) 

>>> df.groupby('id').applyInPandas( 

... mean_func, schema="id long, v double").show() # doctest: +SKIP 

+---+---+ 

| id| v| 

+---+---+ 

| 1|1.5| 

| 2|6.0| 

+---+---+ 

 

>>> def sum_func(key, pdf): 

        ...     # key is a tuple of two numpy.int64s, which are the values

... # of 'id' and 'ceil(df.v / 2)' for the current group 

... return pd.DataFrame([key + (pdf.v.sum(),)]) 

>>> df.groupby(df.id, ceil(df.v / 2)).applyInPandas( 

... sum_func, schema="id long, `ceil(v / 2)` long, v double").show() # doctest: +SKIP 

+---+-----------+----+ 

| id|ceil(v / 2)| v| 

+---+-----------+----+ 

| 2| 5|10.0| 

| 1| 1| 3.0| 

| 2| 3| 5.0| 

| 2| 2| 3.0| 

+---+-----------+----+ 

 

Notes 

----- 

This function requires a full shuffle. All the data of a group will be loaded 

into memory, so the user should be aware of the potential OOM risk if data is skewed 

and certain groups are too large to fit in memory. 

 

If returning a new `pandas.DataFrame` constructed with a dictionary, it is 

recommended to explicitly index the columns by name to ensure the positions are correct, 

or alternatively use an `OrderedDict`. 

For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or 

`pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`. 

 

This API is experimental. 

 

See Also 

-------- 

pyspark.sql.functions.pandas_udf 

""" 

from pyspark.sql import GroupedData 

from pyspark.sql.functions import pandas_udf, PandasUDFType 

 

assert isinstance(self, GroupedData) 
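
        # Wrap the user's function as a grouped-map pandas UDF: each group's

        # rows reach it as a single pandas.DataFrame, and its output rows are

        # stitched back into a Spark DataFrame with the given schema.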

 

udf = pandas_udf( 

func, returnType=schema, functionType=PandasUDFType.GROUPED_MAP) 

df = self._df 

udf_column = udf(*[df[col] for col in df.columns]) 
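
        # flatMapGroupsInPandas on the JVM side applies the UDF once per group.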

jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr()) 

return DataFrame(jdf, self.sql_ctx) 

 

def cogroup(self, other): 

""" 

Cogroups this group with another group so that we can run cogrouped operations. 

 

.. versionadded:: 3.0.0 

 

See :class:`PandasCogroupedOps` for the operations that can be run. 
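
        Examples

        --------

        A minimal sketch of the intended call pattern (``df1``, ``df2`` and

        ``merge_func`` are placeholders, not defined here):

        >>> df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(

        ...     merge_func, schema="id long, v double")  # doctest: +SKIP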

""" 

from pyspark.sql import GroupedData 

 

assert isinstance(self, GroupedData) 

 

return PandasCogroupedOps(self, other) 

 

 

class PandasCogroupedOps(object): 

""" 

A logical grouping of two :class:`GroupedData`, 

created by :func:`GroupedData.cogroup`. 

 

.. versionadded:: 3.0.0 

 

Notes 

----- 

This API is experimental. 

""" 

 

def __init__(self, gd1, gd2): 

self._gd1 = gd1 

self._gd2 = gd2 

self.sql_ctx = gd1.sql_ctx 

 

def applyInPandas(self, func, schema): 

""" 

Applies a function to each cogroup using pandas and returns the result 

as a `DataFrame`. 

 

The function should take two `pandas.DataFrame`\\s and return another 

`pandas.DataFrame`. For each side of the cogroup, all columns are passed together as a 

        `pandas.DataFrame` to the user function, and the returned `pandas.DataFrame`\\s are combined as

a :class:`DataFrame`. 

 

The `schema` should be a :class:`StructType` describing the schema of the returned 

`pandas.DataFrame`. The column labels of the returned `pandas.DataFrame` must either match 

the field names in the defined schema if specified as strings, or match the 

field data types by position if not strings, e.g. integer indices. 

The length of the returned `pandas.DataFrame` can be arbitrary. 

 

.. versionadded:: 3.0.0 

 

Parameters 

---------- 

func : function 

a Python native function that takes two `pandas.DataFrame`\\s, and 

outputs a `pandas.DataFrame`, or that takes one tuple (grouping keys) and two 

            `pandas.DataFrame`\\s, and outputs a `pandas.DataFrame`.

schema : :class:`pyspark.sql.types.DataType` or str 

the return type of the `func` in PySpark. The value can be either a 

:class:`pyspark.sql.types.DataType` object or a DDL-formatted type string. 

 

Examples 

-------- 

        >>> import pandas as pd  # doctest: +SKIP

        >>> from pyspark.sql.functions import pandas_udf

>>> df1 = spark.createDataFrame( 

... [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], 

... ("time", "id", "v1")) 

>>> df2 = spark.createDataFrame( 

... [(20000101, 1, "x"), (20000101, 2, "y")], 

... ("time", "id", "v2")) 

>>> def asof_join(l, r): 

... return pd.merge_asof(l, r, on="time", by="id") 

>>> df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( 

... asof_join, schema="time int, id int, v1 double, v2 string" 

... ).show() # doctest: +SKIP 

+--------+---+---+---+ 

| time| id| v1| v2| 

+--------+---+---+---+ 

|20000101| 1|1.0| x| 

|20000102| 1|3.0| x| 

|20000101| 2|2.0| y| 

|20000102| 2|4.0| y| 

+--------+---+---+---+ 

 

Alternatively, the user can define a function that takes three arguments. In this case, 

the grouping key(s) will be passed as the first argument and the data will be passed as the 

        second and third arguments. The grouping key(s) will be passed as a tuple of numpy scalar

        values, e.g., `numpy.int32` and `numpy.float64`. The data will still be passed in as two

        `pandas.DataFrame`\\s containing all columns from the original Spark DataFrames.

 

>>> def asof_join(k, l, r): 

... if k == (1,): 

... return pd.merge_asof(l, r, on="time", by="id") 

... else: 

... return pd.DataFrame(columns=['time', 'id', 'v1', 'v2']) 

>>> df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas( 

... asof_join, "time int, id int, v1 double, v2 string").show() # doctest: +SKIP 

+--------+---+---+---+ 

| time| id| v1| v2| 

+--------+---+---+---+ 

|20000101| 1|1.0| x| 

|20000102| 1|3.0| x| 

+--------+---+---+---+ 

 

Notes 

----- 

This function requires a full shuffle. All the data of a cogroup will be loaded 

into memory, so the user should be aware of the potential OOM risk if data is skewed 

and certain groups are too large to fit in memory. 

 

If returning a new `pandas.DataFrame` constructed with a dictionary, it is 

recommended to explicitly index the columns by name to ensure the positions are correct, 

or alternatively use an `OrderedDict`. 

For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or 

`pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`. 

 

This API is experimental. 

 

See Also 

-------- 

pyspark.sql.functions.pandas_udf 

""" 

from pyspark.sql.pandas.functions import pandas_udf 
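
        # Wrap the user's function as a cogrouped-map pandas UDF: it runs once

        # per pair of matching groups, receiving one pandas.DataFrame per side.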

 

udf = pandas_udf( 

func, returnType=schema, functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF) 
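
        # Pass every column from both grouped DataFrames so the UDF sees full

        # rows; flatMapCoGroupsInPandas on the JVM side pairs up the groups.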

all_cols = self._extract_cols(self._gd1) + self._extract_cols(self._gd2) 

udf_column = udf(*all_cols) 

jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, udf_column._jc.expr()) 

return DataFrame(jdf, self.sql_ctx) 

 

@staticmethod 

def _extract_cols(gd): 
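        # One Column reference per field of the underlying DataFrame.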

df = gd._df 

return [df[col] for col in df.columns] 

 

 

def _test(): 

import doctest 

from pyspark.sql import SparkSession 

import pyspark.sql.pandas.group_ops 

globs = pyspark.sql.pandas.group_ops.__dict__.copy() 
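
    # The module's doctests expect a live 'spark' session in scope; provide a

    # local one and stop it once the doctests have run.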

spark = SparkSession.builder\ 

.master("local[4]")\ 

.appName("sql.pandas.group tests")\ 

.getOrCreate() 

globs['spark'] = spark 

(failure_count, test_count) = doctest.testmod( 

pyspark.sql.pandas.group_ops, globs=globs, 

optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE | doctest.REPORT_NDIFF) 

spark.stop() 

    if failure_count:

sys.exit(-1) 

 

 

if __name__ == "__main__": 

_test()