Why Spark, Coming from Pandas
Pandas optimizes for interactive exploration — fast iteration, rich indexing, tight NumPy integration. Spark optimizes for scale and fault tolerance — distributed execution, lazy evaluation, lineage-based recovery.
The tradeoff is real: Spark has more ceremony, slower feedback loops, and a steeper debugging curve. But once your data exceeds single-machine memory, or your pipeline needs to process thousands of partitions in parallel, or you need deterministic retries on terabyte-scale jobs — Pandas can't follow.
If you've ever hit MemoryError on a 20GB CSV, or tried to parallelize a Pandas pipeline with multiprocessing and watched it serialize/deserialize your entire DataFrame per worker, or needed to restart a 3-hour job from scratch because of one bad record — Spark solves those problems architecturally.
Spark doesn't make your code faster on small data. It makes large data possible at all.
Mental Model Shift
Pandas executes eagerly: every line runs immediately and returns a concrete result in memory. Spark executes lazily: operations build a logical plan (a DAG), and nothing actually computes until you trigger an action. This is the single most important difference.
Transformations vs Actions
In Spark, operations split into two categories. Transformations (lazy) define what to do. Actions (eager) trigger the computation.
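A minimal sketch of the split, using a hypothetical events.parquet file with made-up status and event_date columns. Every transformation returns instantly because it only extends the plan; nothing is read until the final action.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Transformations (lazy): each call just adds a step to the logical plan
df = spark.read.parquet("events.parquet")            # plan: scan
errors = df.filter(F.col("status") == "error")       # plan: scan -> filter
by_day = errors.groupBy("event_date").count()        # plan: scan -> filter -> aggregate

# Action (eager): only now does Spark read the data and execute the plan
by_day.show()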
This laziness is what enables Spark's optimizer (Catalyst) to reorder, combine, and prune operations before touching any data. A .filter() after a .join() in your code can be pushed before the join in the physical plan — Spark rewrites your logic for efficiency.
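You can watch this happen with .explain(). A sketch, reusing the hypothetical session above with made-up orders and customers datasets: even though the filter is written after the join, the physical plan will often show it applied to orders before the join.

orders = spark.read.parquet("orders.parquet")
customers = spark.read.parquet("customers.parquet")

result = (
    orders.join(customers, "customer_id")          # written first
          .filter(F.col("order_total") > 100)      # written second
)

# Catalyst may push the filter below the join; inspect the physical plan to check
result.explain()   # explain("formatted") gives a more readable layout in Spark 3+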
Common Actions
| Action | Returns | Triggers |
|---|---|---|
| .show() | Nothing (prints) | Computes and prints the top rows |
| .collect() | list[Row] | Pulls all data to the driver |
| .count() | int | Full scan to count rows |
| .write.save() | Nothing (writes files) | Computes and writes to storage |
| .toPandas() | pd.DataFrame | Collects all data to the driver as a Pandas DataFrame |
| .first() | Row | Computes just enough to return one row |
Calling .collect() or .toPandas() on a large DataFrame pulls the entire dataset to a single machine. This defeats the purpose of Spark and will OOM your driver. Only collect small, aggregated results.
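A sketch of the safe pattern, reusing the hypothetical df from the earlier example: do the heavy aggregation in Spark, and only collect the small result.

# Safe: aggregate down to a handful of rows before pulling data to the driver
daily_counts = df.groupBy("event_date").count().toPandas()

# Risky: tries to materialize every row of df on the driver
# all_rows = df.toPandas()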
Environment & Setup
Locally, PySpark runs in local mode: the driver and executor threads share a single JVM on your machine. In production, it runs on a cluster manager — YARN, Kubernetes, or a managed Spark platform like Databricks. The API is identical in both cases.
# Install PySpark
pip install pyspark
# Or Databricks Connect, to run against a Databricks cluster (a common production environment)
pip install databricks-connect
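A quick sanity check that the install worked; this just imports the package and prints its version.

# Verify the local install
import pyspark
print(pyspark.__version__)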
SparkSession
Every Spark program begins by creating a SparkSession. This is your entry point — equivalent to import pandas as pd but with configuration.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("my_batch_job") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()
In Databricks, spark is pre-initialized. You never create your own SparkSession. Just start using spark.read and spark.sql() directly.
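For example, in a Databricks notebook you can read and query straight away (the table and column names here are made up):

# spark is already provided by the notebook; no builder, no getOrCreate()
df = spark.read.table("sales.orders")
df.createOrReplaceTempView("orders")
spark.sql("SELECT customer_id, SUM(total) AS revenue FROM orders GROUP BY customer_id").show()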
The Execution Model
The driver is a single JVM process that builds your logical plan. Catalyst optimizes it into a physical plan. Tungsten compiles that plan to JVM bytecode via whole-stage code generation. Executors run the bytecode across partitions in parallel. Your Python code never touches the actual data unless you use a UDF.
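A sketch of that last point, reusing the hypothetical df with its status column: the built-in function executes inside the JVM, while the Python UDF forces every row through a Python worker process, which is usually much slower.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Built-in function: Catalyst-optimized, runs entirely inside the JVM executors
with_upper = df.withColumn("status_upper", F.upper(F.col("status")))

# Python UDF: rows are serialized out to Python workers, transformed, and sent back
@F.udf(returnType=StringType())
def shout(s):
    return s.upper() + "!" if s is not None else None

with_shout = df.withColumn("status_shout", shout(F.col("status")))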