Quick Guide to PySpark

Jed Lee
6 min read · Aug 25, 2023

I am sharing what I have learnt about PySpark over the past month.


Content of this Article

  1. Introduction to PySpark
  2. Differences between PySpark and Python
  3. Defining Characteristics of PySpark
  4. Best Practices & Watchouts
  5. Conclusion

Introduction to PySpark

Born in the labs of UC Berkeley in 2009, Apache Spark was designed as a response to the growing limitations of the Hadoop MapReduce computing model. Hadoop MapReduce, a pioneer in its own right, enabled distributed data processing across vast datasets but was hindered by its inflexible two-stage dataflow. This structure often faltered with iterative algorithms, a staple in Machine Learning.

Apache Spark’s innovative response was an in-memory engine, ensuring swifter computations, especially for iterative tasks. Central to this innovation was the introduction of Resilient Distributed Datasets (RDDs). These fault-tolerant, parallel-processable datasets live in memory, significantly cutting down disk read-write operations and supercharging processing speeds.

However, while Spark was revolutionary, it primarily catered to those proficient in Scala, the language in which Spark was written. It remained a challenge for the vast Python community to harness its power directly.

Enter PySpark

PySpark is essentially a Python wrapper for Apache Spark.

PySpark allows users to interact with the core functionalities of Spark using Python. Under the hood, when you execute PySpark code, it invokes Spark operations written in Scala. The PySpark API wraps around these core Spark functionalities, making them accessible and usable within a Python environment.

Essentially, you write Python code that calls into the Spark engine through PySpark’s API, making big data processing much more accessible.
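As a minimal sketch of what that looks like in practice (the application name and sample data below are just placeholders):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession - the entry point to PySpark
spark = SparkSession.builder \
    .appName("QuickGuideToPySpark") \
    .getOrCreate()

# A plain Python list becomes a distributed DataFrame
df = spark.createDataFrame([("Ali", 25), ("Bob", 30)], ["Name", "Age"])

# This call looks like Python, but executes on Spark's JVM engine
df.filter(df.Age > 26).show()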

Differences between PySpark and Python

  • Scope and Purpose: Python is the jack-of-all-trades in the programming world, finding use in web development, automation, data science, etc. PySpark, contrastingly, wears a specialized hat. It is tailored specifically for big data processing and analytics.
  • Scale: Python, even when enhanced by libraries like Pandas, is confined to the memory limits of a single machine. PySpark excels at handling large datasets, enabling distributed processing across a multitude of nodes.
  • Underlying Architecture: PySpark operates on Spark’s architecture, converting data to Resilient Distributed Datasets (RDDs) for cluster processing, a capability standard Python lacks.

Defining Characteristics of PySpark

Distributed Nature:

  • Description: PySpark excels in distributing data across cluster nodes, processing it in parallel via Resilient Distributed Datasets (RDDs).
  • Advantage: This distribution ensures faster processing and fault tolerance. If a node fails, tasks can be reassigned, avoiding data loss and ensuring task continuity.
  • Purpose & Application: In the realm of big data, single-node operations often do not cut it. Businesses routinely face datasets that would overload a single machine, and PySpark can distribute that workload across a cluster for concurrent processing. This allows for quicker insights, and large-scale data operations that once took hours can now be completed in minutes.
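As a rough sketch of this idea (reusing the spark session from earlier; the numbers are arbitrary), you can see the parallelism by inspecting partitions:

# Spread one million numbers across 8 partitions
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# Each partition can be processed by a different executor in parallel
print(rdd.getNumPartitions())          # 8
print(rdd.map(lambda x: x * 2).sum())  # computed across all partitions

# DataFrames are partitioned the same way under the hood
print(spark.range(1_000_000).rdd.getNumPartitions())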

Lazy Evaluation:

  • Description: Unlike some eager computation systems, PySpark holds off on immediate computations. Operations are registered and only executed upon an action command.
  • Advantage: This approach is purposed for optimization. By waiting, PySpark can evaluate the entire pipeline of operations and optimize their execution, reducing redundancies.
  • Purpose & Application: For iterative data processing tasks, particularly in Machine Learning with multiple transformations, lazy evaluation avoids unnecessary computations. In a data exploration phase, a data scientist might set up numerous filters and transformations. With lazy evaluation, PySpark batches these operations, executing them only when an action like ‘collect’ or ‘save’ is called.
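A small illustration, reusing the sample DataFrame from earlier: the transformations below return immediately because nothing is computed until the action at the end.

# Transformations: only recorded in the query plan, not executed yet
adults = df.filter(df.Age > 21)
renamed = adults.withColumnRenamed("Name", "name")

# Still nothing has run; you can inspect the optimized plan instead
renamed.explain()

# Action: only now does Spark execute the whole pipeline at once
renamed.show()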

In-memory Computation:

  • Description: PySpark’s strength lies in its in-memory data storage and processing.
  • Advantage: Compared with traditional disk-based storage, which involves time-consuming read-write operations, in-memory computation ensures lightning-fast data access and processing.
  • Purpose & Application: Iterative tasks, prevalent in Machine Learning algorithms, stand to gain immensely. Instead of reading data from the disk for every iteration, having it in memory speeds up the entire process multifold. Algorithm training time can be reduced from perhaps hours to mere minutes.
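A sketch of the idea, with the small sample DataFrame standing in for a large dataset: cache it once, then iterate over it without re-reading the source.

# Keep the DataFrame in memory after it is first computed
df.cache()
df.count()  # the first action materializes and caches the data

# Later iterations (e.g. of a training loop) read from memory, not disk
for _ in range(5):
    avg_age = df.groupBy().avg("Age").first()[0]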

Best Practices & Watchouts


When you are using PySpark, consider these best practices:

  • Avoid collect() Unless Absolutely Necessary: When working with large datasets, be wary of operations that pull excessive data into the driver node, which is often resource-limited. Calling collect() on a large DataFrame can exhaust driver memory; alternatives such as limit().show(), take(n), first(), and count() usually suffice, as shown below.
# Sample data
data = [("Ali", 25),
("Bob", 30),
("Clo", 22),
("David", 28),
("Eve", 35)]

# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Limit and display the top 3 rows of data
df.limit(3).show(truncate=False, vertical=True)
# Result
-RECORD 0----
Name | Ali
Age | 25
-RECORD 1----
Name | Bob
Age | 30
-RECORD 2----
Name | Clo
Age | 22

# Using take(n)
rows = df.take(3)
# Result
[Row(Name='Ali', Age=25), Row(Name='Bob', Age=30), Row(Name='Clo', Age=22)]

# Using first()
first_row = df.first()
# Result
Row(Name='Ali', Age=25)

# Using count function
row_count = df.limit(5).count()
# Result
5
  • Prioritize Built-in Functions Over UDFs (User Defined Functions): Spark’s built-in functions are optimized for distributed processing, making them faster and more efficient. In contrast, UDFs, especially those written in Python, undergo a serialization and deserialization overhead that can significantly hamper performance.
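For example, a simple string operation can be written both ways; the built-in version stays inside Spark’s optimized engine, while the equivalent Python UDF ships every value to a Python worker and back:

from pyspark.sql.functions import upper, udf
from pyspark.sql.types import StringType

# Preferred: built-in function, optimized by Spark's Catalyst engine
df.select(upper(df.Name)).show()

# Works, but slower: values are serialized to Python and deserialized back
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf(df.Name)).show()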
  • Leverage Broadcast Variables: Joining a large DataFrame with a smaller one can trigger extensive shuffling of the larger DataFrame. Broadcasting the smaller DataFrame ensures each worker node has a local copy, preventing the need to shuffle the larger DataFrame.
from pyspark.sql.functions import broadcast
# large_df and small_df are placeholder DataFrames joined on a shared "id" column;
# broadcasting small_df gives each worker a local copy, so large_df is not shuffled
large_df.join(broadcast(small_df), "id").show()
  • Persist or Cache, but with Prudence: Persisting (or caching) a DataFrame in memory accelerates repeated operations on it. However, over-caching can lead to memory issues.
    Caching involves storing the data in memory, providing fast access for subsequent operations. Caching is ideal when you have sufficient memory. On the other hand, persisting offers more flexibility by allowing you to specify different storage levels, including memory-only, memory-and-disk, or disk-only. Persisting is suitable when memory is limited.
    It’s crucial to unpersist() DataFrames when they’re no longer required.
# Either cache the DataFrame in memory ...
df.cache()

# ... or persist it, optionally with an explicit storage level
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

# Reuse the cached/persisted DataFrame across several actions
# (country and gender are assumed columns of this hypothetical df)
result = df.filter(df.Age > 25)
result.groupBy("country").count().show()
result.groupBy("gender").count().show()

# Unpersist the DataFrame when it's no longer needed
df.unpersist()  # works for both cache() and persist()
  • Start with Reduced Data Volume for Faster Processing: Reducing data volume at the beginning of the processing chain improves the performance of subsequent operations. This is achieved by filtering out unessential rows and selecting only necessary columns. After filtering or selecting the necessary data, consider using persist() or cache() to store the reduced dataset in memory.
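A hedged sketch of this pattern, using a hypothetical events.parquet file with more columns than the analysis needs:

# Cut the data down as early as possible in the pipeline
events = (spark.read.parquet("events.parquet")        # hypothetical input path
          .select("user_id", "event_type", "ts")      # keep only needed columns
          .filter("event_type = 'purchase'"))         # keep only needed rows

# Cache the reduced dataset for the heavier operations that follow
events.cache()
events.count()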
  • Minimize Data Shuffling: Data shuffling, the process of redistributing data across the partitions, can be a performance bottleneck due to its resource-intensive nature. Operations like groupBy, certain types of join, and repartition can trigger shuffling.
    It is essential to be aware of which operations induce shuffling and try to minimize its impact by:
    1. Reducing the volume of data before such operations using filter or select.
    2. Strategically repartitioning data based on expected operations.
    3. Leveraging broadcast variables during joins.
  • Avoid Wide Transformations Where Possible: Wide transformations, such as groupByKey on RDDs or certain groupBy operations on DataFrames, cause data shuffling across partitions and can be performance-intensive. Narrow transformations, on the other hand, can be performed within a single partition. Consider using narrow transformations like map or filter as much as possible.
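As a small illustration of the last two points, on an RDD of (word, 1) pairs: reduceByKey pre-aggregates within each partition so only partial sums are shuffled, whereas groupByKey ships every individual record across the network before summing.

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Better: combines values locally per partition, shuffling only partial sums
counts = pairs.reduceByKey(lambda x, y: x + y)

# Heavier: shuffles every individual record before aggregating
counts_slow = pairs.groupByKey().mapValues(sum)

print(counts.collect())  # the result here is tiny, so collect() is safe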

Conclusion

In this article, I attempted to draw a contrast between PySpark and traditional Python by highlighting PySpark’s distinctive features. As we navigated the expansive landscape of distributed data processing, I shared some of my personal best practices and potential pitfalls.

Thanks so much for reading my article!!! Feel free to drop me any comments, suggestions, and follow me on LinkedIn!
