Introduction to Apache PySpark Part 1

Deepanshu Tyagi
4 min read · Sep 17, 2022


RDDs, DataFrames, and Datasets


In this series of blog posts, we will start learning about PySpark, which is widely used today for machine learning and big data engineering.

History of PySpark

Apache Spark started as a research project at UC Berkeley's AMP Lab in 2009 and became open source in 2010. Spark then grew a large developer community and moved to the Apache Software Foundation in 2013. Today, the project is used by many organizations and communities.

PySpark API

  1. RDD
  2. DataFrame
  3. Datasets

What is RDD?

RDD stands for Resilient Distributed Dataset. An RDD is a read-only, partitioned collection of records.

  1. It is the fundamental data structure of Spark.
  • Resilient: Fault-tolerant and capable of rebuilding data on failure.
  • Distributed: Data is distributed among the multiple nodes in a cluster.
  • Dataset: A collection of partitioned data with values.

2. It is immutable and follows lazy evaluation.

3. You may apply multiple operations to these RDDs to accomplish a specific task.

4. It is also possible to cache and partition the RDD manually.
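Here is a minimal sketch of creating an RDD in PySpark; the app name and sample data are illustrative, and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; its SparkContext drives RDD operations.
spark = SparkSession.builder.master("local[*]").appName("rdd-intro").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local Python collection, split into two partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Transformations never modify the original RDD; they return a new one.
squares = numbers.map(lambda x: x * x)

print(numbers.getNumPartitions())  # 2
print(squares.collect())           # [1, 4, 9, 16, 25]
```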

RDD Operations

  1. Transformations: create a new RDD from an existing one (e.g. filter, groupBy, and map).
  2. Actions: instruct Spark to perform the computation and send the result back to the driver (e.g. collect and count), as the sketch below shows.
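As a rough illustration, reusing the SparkContext from the sketch above (the data is made up):

```python
# Transformations (map, filter, groupBy, ...) are lazy: they only record lineage.
words = sc.parallelize(["spark", "python", "pyspark", "scala"])
long_words = words.filter(lambda w: len(w) > 5)   # nothing is computed yet
lengths = long_words.map(lambda w: (w, len(w)))   # still nothing is computed

# Actions trigger the actual computation and return results to the driver.
print(lengths.collect())   # [('python', 6), ('pyspark', 7)]
print(long_words.count())  # 2
```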

Features of RDD

  1. In-memory computation: Spark stores intermediate results in distributed memory (RAM) rather than on stable storage (disk).
  2. Lazy evaluation: All transformations in Apache Spark are lazy; they do not compute their results immediately, but only when an action requires them.
  3. Fault tolerance: Spark RDDs are fault-tolerant because they track lineage information, so lost data can be rebuilt automatically in the event of a failure.
  4. Immutability: An RDD cannot be changed once created, which makes it safe to cache, share, and replicate, and keeps computations consistent.
  5. Partitioning: Partitioning is the basic unit of parallelism in a Spark RDD. Every partition is a logical division of the data, and new partitions can be created by applying transformations to existing ones.
  6. Persistence: Users can mark an RDD for reuse and choose a storage level for it (in memory or on disk), as the sketch below shows.
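A short sketch of manual persistence; the input path data/events.log is hypothetical:

```python
from pyspark import StorageLevel

logs = sc.textFile("data/events.log")               # hypothetical input file
errors = logs.filter(lambda line: "ERROR" in line)

# Persist the filtered RDD so repeated actions reuse the in-memory copy
# instead of re-reading and re-filtering the file each time.
errors.persist(StorageLevel.MEMORY_ONLY)            # errors.cache() is equivalent for RDDs

print(errors.count())   # first action: computes and caches the RDD
print(errors.take(5))   # second action: served from memory
```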

DataFrame

In Apache Spark, a DataFrame is a distributed collection of rows organized into named columns. Simply put, it is the same idea as a table in a relational database or an Excel spreadsheet with column headings.

It also shares certain characteristics with the RDD:

  1. Immutable in nature: We can create a DataFrame/RDD once, but we cannot change it; applying a transformation produces a new DataFrame/RDD.
  2. Lazy evaluation: A task is not executed until an action is performed, as the sketch below shows.
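A minimal sketch, reusing the SparkSession from earlier (the names and ages are made up):

```python
# A DataFrame is a distributed collection of rows organized into named columns.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

adults = df.filter(df.age > 30)  # lazy transformation: nothing runs yet
adults.show()                    # action: triggers the computation
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 34|
# |  Bob| 45|
# +-----+---+
```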

How to build a DataFrame

There are several ways to create a DataFrame in Apache Spark:

  1. From various data formats, such as loading data from JSON or CSV files.
  2. From an existing RDD.
  3. By programmatically specifying the schema.
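Each approach, sketched with hypothetical file paths and made-up sample data:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. From data files (the paths are hypothetical).
json_df = spark.read.json("data/people.json")
csv_df = spark.read.csv("data/people.csv", header=True, inferSchema=True)

# 2. From an existing RDD of tuples.
rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])
rdd_df = rdd.toDF(["name", "age"])

# 3. By programmatically specifying the schema.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
schema_df = spark.createDataFrame(rdd, schema)
```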

DataSets

The Apache Spark Dataset API provides a type-safe, object-oriented programming interface.

A DataFrame is an alias for an untyped Dataset[Row]. Datasets offer compile-time type safety, which means that production applications can be checked for errors before they are run, and they allow direct operations over user-defined classes.

The Dataset API also provides domain-specific language operations such as sum(), avg(), join(), select(), and groupBy(), making code much easier to express, read, and write.
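The typed Dataset API is available in Scala and Java; PySpark exposes the same operations on DataFrames. A rough sketch with made-up data:

```python
from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("books", 12.0), ("books", 8.0), ("games", 30.0)],
    ["category", "amount"],
)
departments = spark.createDataFrame(
    [("books", "media"), ("games", "entertainment")],
    ["category", "department"],
)

# join, groupBy, sum, avg, and select expressed through the DataFrame DSL.
result = (
    sales.join(departments, on="category")
         .groupBy("department")
         .agg(F.sum("amount").alias("total"), F.avg("amount").alias("average"))
         .select("department", "total", "average")
)
result.show()
```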

RDDs vs. DataFrames vs. Datasets (Bonus part)

Thanks for reading! In the next part, we will get some hands-on practice with RDDs and DataFrames, so follow me for updates.
