Section 07

GroupBy & Aggregations

Same concept, different syntax. Pandas aggregations execute eagerly and return results immediately. Spark aggregations are only added to the plan; when an action runs, the groupBy triggers a shuffle, the most expensive operation in distributed computing.

Pandas
# Single aggregation
df.groupby("store")["revenue"].sum()

# Multiple aggregations
df.groupby("store").agg(
    total_rev=("revenue", "sum"),
    avg_price=("price", "mean"),
    num_txns=("id", "count")
)
PySpark
from pyspark.sql import functions as F

# Single aggregation
df.groupBy("store").agg(F.sum("revenue"))

# Multiple aggregations
df.groupBy("store").agg(
    F.sum("revenue").alias("total_rev"),
    F.avg("price").alias("avg_price"),
    F.count("id").alias("num_txns")
)
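
To see the shuffle before anything runs, call .explain() on the aggregated DataFrame; the physical plan contains an Exchange (shuffle) node for the groupBy. A quick check, reusing df and F from the examples above:

# Inspect the physical plan without running the job; Exchange is the shuffle
df.groupBy("store").agg(F.sum("revenue").alias("total_rev")).explain()
# Expect a line like: Exchange hashpartitioning(store#..., 200)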

Pivot Tables

Pandas
df.pivot_table(
    values="revenue",
    index="store",
    columns="quarter",
    aggfunc="sum"
)
PySpark
df.groupBy("store") \
  .pivot("quarter", ["Q1", "Q2", "Q3", "Q4"]) \
  .agg(F.sum("revenue"))
Always Pass Pivot Values

Without an explicit list of values, .pivot() launches an extra job to discover the distinct values, which means scanning the entire column before the real aggregation can run. Always pass the list of pivot values explicitly.
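
If the pivot values are not known up front, one workaround is to collect the distinct values once yourself and pass them in. A sketch, reusing df and F from above:

# Pay the distinct scan once, explicitly, then reuse the result
quarters = [row["quarter"] for row in df.select("quarter").distinct().collect()]
df.groupBy("store").pivot("quarter", quarters).agg(F.sum("revenue"))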

Section 08

Null Handling

Pandas uses NaN (a float) for missing numeric values and None for object columns, which causes type coercion and inconsistent behavior. Spark uses SQL null semantics uniformly across all types, and null propagation is well-defined: arithmetic and comparisons involving null return null rather than raising an error or coercing the type.
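
A minimal sketch of that propagation, assuming an active SparkSession named spark:

from pyspark.sql import functions as F

toy = spark.createDataFrame([(1, None), (2, 5)], ["a", "b"])
toy.select(
    (F.col("a") + F.col("b")).alias("a_plus_b"),  # null when b is null
    (F.col("b") > 3).alias("b_gt_3"),             # null, not False, when b is null
    F.col("b").isNull().alias("b_missing")        # the explicit null check
).show()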

Pandas
df["col"].isna()
df["col"].fillna(0)
df.dropna(subset=["col"])
PySpark
F.col("col").isNull()
F.coalesce(F.col("col"), F.lit(0))
df.dropna(subset=["col"])  # same API here
Null != NaN in Spark

Spark distinguishes between null (missing) and NaN (IEEE 754 not-a-number). F.isnan() checks for NaN; .isNull() checks for null. They are not interchangeable. Also note that NaN == NaN evaluates to true in Spark, which differs from IEEE 754 semantics and from Pandas.
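
A short sketch of the distinction, assuming an active SparkSession named spark:

mixed = spark.createDataFrame([(1.0,), (float("nan"),), (None,)], ["x"])
mixed.select(
    F.col("x").isNull().alias("is_null"),  # True only for the None row
    F.isnan("x").alias("is_nan")           # True only for the NaN row
).show()

# Treat both as missing
mixed.filter(F.col("x").isNull() | F.isnan("x"))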

# Fill nulls with different strategies per column
df.fillna({"price": 0.0, "name": "unknown"})

# Null-safe equality (== returns null when either side is null)
df.filter(F.col("a").eqNullSafe(F.col("b")))

# Coalesce — first non-null value across columns
df.withColumn("val", F.coalesce("primary", "fallback", F.lit(0)))
Section 09

Joins

Same concept, different execution. Pandas joins happen in memory on one machine. Spark joins are distributed, and the strategy the planner chooses dramatically affects performance.

Pandas
pd.merge(left, right,
    on="id",
    how="left")

# Multi-key
pd.merge(left, right,
    on=["id", "date"],
    how="inner")
PySpark
left.join(right,
    on="id",
    how="left")

# Multi-key
left.join(right,
    on=["id", "date"],
    how="inner")

Join Strategies

Strategy            | When                                            | Mechanism
Broadcast Hash Join | One side < 10MB (default threshold)             | Small table sent to all executors. No shuffle.
Sort-Merge Join     | Both sides large                                | Both sides shuffled and sorted by key. Default for large-large joins.
Shuffle Hash Join   | One side much smaller, but too big to broadcast | Both sides shuffled by key; hash table built on the smaller side.
# Force broadcast when you know the right side is small
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), on="key")

# Avoid duplicate columns after join with same-name keys
result = left.join(right, left["id"] == right["id"], "inner") \
    .drop(right["id"])
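
To confirm which strategy the planner actually chose, call .explain() on the joined DataFrame; the physical plan names the join node directly. A quick check, reusing large_df and small_df from above:

# The plan shows BroadcastHashJoin, SortMergeJoin, or ShuffledHashJoin
large_df.join(broadcast(small_df), on="key").explain()
# Drop the broadcast hint and a large right side typically falls back to SortMergeJoin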
Skew Kills Joins

If one join key has 10M rows and all others have 100, that partition takes 100,000x longer. Solutions: salting the key, isolating the skewed value, or using spark.sql.adaptive.skewJoin.enabled=true (AQE handles this automatically in Spark 3+).
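
A rough salting sketch under assumed names (large_df, small_df, and a join key "key"), plus the AQE switches; this is the shape of the technique, not a drop-in fix:

# Spread the hot key across N salt buckets on the large side,
# and replicate the small side once per bucket so every salted key still matches
N = 16
salted_large = large_df.withColumn("salt", (F.rand() * N).cast("int"))
salted_small = small_df.crossJoin(
    spark.range(N).select(F.col("id").cast("int").alias("salt"))
)
deskewed = salted_large.join(salted_small, on=["key", "salt"]).drop("salt")

# Or let Adaptive Query Execution split skewed partitions automatically (Spark 3+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")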