PySpark Advanced DataFrame — A practical approach, part 5

Deepanshu tyagi
2 min read · Sep 25, 2022

Hello learners! In the previous blogs we learned about some basic functions of the PySpark DataFrame. In this blog, we will learn about some advanced functions of the PySpark DataFrame and also work through some practical examples.

Topics Covered

1. Dropping Columns
2. Dropping Rows
3. Various Parameters in Dropping Functionalities
4. Handling Missing Values by Mean, Median, and Mode
5. Filter Operation
6. GroupBy

First, we have to import SparkSession, define the appName, and then read our data.

Reading the data
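A minimal sketch of this step, assuming the data lives in a hypothetical CSV file named test1.csv with columns such as Name, Age, Experience, and Salary:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession with an application name
spark = SparkSession.builder.appName("Practice").getOrCreate()

# Read the data; the file name and schema are assumptions for illustration
df = spark.read.csv("test1.csv", header=True, inferSchema=True)
df.show()
df.printSchema()
```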

Dropping Columns

.drop() : PySpark drop() is used to remove columns from the DataFrame.
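A minimal sketch, assuming the DataFrame read above has columns named Experience and Salary:

```python
# Drop a single column; drop() returns a new DataFrame
df.drop("Experience").show()

# Several columns can be dropped at once
df.drop("Experience", "Salary").show()
```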

.na.drop() : PySpark .na.drop() is used to drop rows that contain null values. Its parameters are listed below, followed by a short sketch.

Parameters

  1. how — This takes the value ‘any’ or ‘all’. With ‘any’, a row is dropped if it contains a null in any column. With ‘all’, a row is dropped only if all of its columns are null. Default is ‘any’.
  2. thresh — This takes an int value; rows with fewer than thresh non-null values are dropped. Default is None.
  3. subset — Use this to select the columns to check for null values. Default is None.
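A minimal sketch of these parameters in use, assuming the same df as above:

```python
# Drop rows that have a null in any column (the default behaviour)
df.na.drop(how="any").show()

# Drop rows only when every column is null
df.na.drop(how="all").show()

# Keep only rows that have at least 2 non-null values
df.na.drop(thresh=2).show()

# Check for nulls only in the Age column
df.na.drop(how="any", subset=["Age"]).show()
```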

Filling the Null values

.na.fill() : It is used to fill null values with a specified replacement.
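A minimal sketch, assuming we want to replace nulls in the assumed columns:

```python
# Replace nulls with 'Missing' in columns whose type matches (string columns here)
df.na.fill("Missing").show()

# Replace nulls with 0, but only in specific numeric columns
df.na.fill(0, subset=["Age", "Salary"]).show()
```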

Handling Missing values by Mean, Median, and Mode

PySpark's Imputer is an imputation estimator for completing missing values, using the mean, median, or mode of the columns in which the missing values are located.
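A minimal sketch using pyspark.ml.feature.Imputer, assuming numeric columns Age and Salary (the mode strategy requires Spark 3.1 or later):

```python
from pyspark.ml.feature import Imputer

# Impute missing values with the column mean; strategy can also be "median" or "mode"
imputer = Imputer(
    inputCols=["Age", "Salary"],
    outputCols=["Age_imputed", "Salary_imputed"],
    strategy="mean",
)

# Fit on the data, then add the imputed columns
df_imputed = imputer.fit(df).transform(df)
df_imputed.show()
```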

Filter Operations

.filter() : It is used to filter the rows of the DataFrame based on a condition.
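A minimal sketch, assuming the Salary and Age columns from above:

```python
from pyspark.sql.functions import col

# Rows where Salary is at most 20000
df.filter(col("Salary") <= 20000).show()

# Conditions can be combined with & (and) and | (or)
df.filter((col("Salary") <= 20000) & (col("Age") >= 25)).show()

# A SQL-style string expression also works
df.filter("Salary <= 20000").show()
```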

GroupBy Function

The PySpark groupBy() function is used to collect identical data into groups on a DataFrame and run aggregations such as count, sum, avg, min, and max on the grouped data.
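A minimal sketch, assuming a hypothetical Departments column alongside Salary and Age:

```python
# Total salary per department
df.groupBy("Departments").sum("Salary").show()

# Multiple aggregations at once
df.groupBy("Departments").agg({"Salary": "avg", "Age": "max"}).show()

# Simple row counts per group
df.groupBy("Departments").count().show()
```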

Thanks, guys, for reading this blog. In this blog, we covered the advanced functions of the DataFrame and also coded them. In the next part, we are going to learn about Machine Learning using PySpark.

Follow me for the next part, and clap if you like this blog.
