Section 04

DataFrames — Spark vs Pandas

Both are called "DataFrame." They share almost nothing under the hood. A Pandas DataFrame is a single in-memory data structure with a rich index. A Spark DataFrame is a distributed, immutable description of a computation — no index, no row order guarantee, no in-place mutation.
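
To make "description of a computation" concrete, here is a minimal sketch (assuming an active SparkSession named spark, as in the examples below):

import pandas as pd
from pyspark.sql import functions as F

# Pandas: the multiplication happens immediately, in local memory
pdf = pd.DataFrame({"price": [10, 20], "qty": [3, 4]})
pdf["total"] = pdf["price"] * pdf["qty"]       # result is already materialized

# Spark: each transformation returns a new DataFrame that only describes a plan
sdf = spark.createDataFrame([(10, 3), (20, 4)], ["price", "qty"])
planned = sdf.withColumn("total", F.col("price") * F.col("qty"))  # nothing runs yet
planned.show()                                 # an action; only now is the plan executed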

Creation

Pandas
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "score": [95, 87]
})

# From file
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")

PySpark
df = spark.createDataFrame([
    ("Alice", 95),
    ("Bob", 87)
], ["name", "score"])

# From file
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("data.parquet")
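
A side note, not from the comparison above: inferSchema costs an extra pass over the file and can guess types wrong, so in practice the schema is often declared explicitly. A sketch, reusing the name/score columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("score", IntegerType(), True),
])
df = spark.read.csv("data.csv", header=True, schema=schema)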

Key Structural Differences

Property       Pandas                                   Spark
Storage        Single machine, in memory                Distributed across cluster
Index          Row index (integer, datetime, multi)     No index
Row Order      Guaranteed, stable                       Not guaranteed
Mutability     In-place ops (inplace=True)              Immutable — all ops return new DFs
Execution      Eager                                    Lazy (plan until action)
Type System    NumPy dtypes                             Spark SQL types (separate type system)
Null           NaN / None (messy)                       null (SQL semantics)

No Index, No Row Position

Spark DataFrames have no .iloc, no .loc, no .index. There is no concept of "row 5." Data is an unordered collection of partitions. If you need row numbers, you must create them explicitly: monotonically_increasing_id() gives unique but non-contiguous values, while a row_number() window function gives contiguous numbering at the cost of an explicit ordering and a shuffle.
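
A sketch of both options (the ts ordering column is illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Option 1: unique but non-contiguous ids, computed per partition, no shuffle
df = df.withColumn("row_id", F.monotonically_increasing_id())

# Option 2: contiguous 1..N numbering; "ts" is an illustrative ordering column.
# A global orderBy with no partitionBy pulls all rows into a single partition,
# so only do this on reasonably small data.
w = Window.orderBy(F.col("ts"))
df = df.withColumn("row_num", F.row_number().over(w))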

Section 05

Column Expressions

In Pandas, you operate on concrete Series objects. In Spark, you build column expressions — symbolic descriptions of computation that compile into Catalyst plan nodes. There are three ways to reference a column.

Referencing Columns

from pyspark.sql import functions as F

# These are equivalent
df.select(df.price)              # attribute access
df.select(df["price"])           # bracket notation
df.select(F.col("price"))        # col() function — safest, most flexible

# col() works without a DataFrame reference
# Use this inside joins and when building reusable expressions
expr = F.col("price") * F.col("quantity")
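
Because expr is an unbound column expression, it can be reused against any DataFrame that has price and quantity columns, for example:

# the same expression works in select, withColumn, and filter
df.select(expr.alias("total"))
df = df.withColumn("total", expr)
df.filter(expr > 100)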

Building Expressions

Pandas
# Direct computation on Series
df["total"] = df["price"] * df["qty"]
df["label"] = df["score"].apply(
    lambda x: "high" if x > 90 else "low"
)

PySpark
# Column expression — no data touched yet
df = df.withColumn("total", F.col("price") * F.col("qty"))
df = df.withColumn("label",
    F.when(F.col("score") > 90, "high")
     .otherwise("low")
)

Never Use Python Lambdas for Column Logic

F.when() becomes a Catalyst expression that executes natively on the JVM executors. A Python lambda wrapped in a UDF is shipped to every executor, runs in a separate Python worker process, and pays row-by-row serialization across the JVM/Python boundary in both directions. That is often orders of magnitude slower. Always prefer Spark's built-in functions.
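
For contrast, a sketch of the anti-pattern next to the preferred form (same logic as the label column above):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Slow: Python UDF; every row is serialized to a Python worker and back
label_udf = F.udf(lambda x: "high" if x is not None and x > 90 else "low", StringType())
df = df.withColumn("label", label_udf(F.col("score")))

# Fast: pure column expression, evaluated inside the JVM
df = df.withColumn("label",
    F.when(F.col("score") > 90, "high").otherwise("low"))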

Common Column Functions

Pandas                           PySpark
df["col"].str.upper()            F.upper(F.col("col"))
df["col"].str.contains("x")      F.col("col").contains("x")
df["col"].fillna(0)              F.coalesce(F.col("col"), F.lit(0))
df["col"].astype(int)            F.col("col").cast("int")
df["col"].dt.date                F.to_date(F.col("col"))
pd.to_datetime(df["col"])        F.to_timestamp(F.col("col"))
df["col"].isin([1, 2])           F.col("col").isin([1, 2])
np.where(cond, a, b)             F.when(cond, a).otherwise(b)

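Putting a few of these together in one chained pass (the signup_ts column name is illustrative):

df = (
    df
    .withColumn("name_upper", F.upper(F.col("name")))
    .withColumn("score", F.col("score").cast("int"))
    .withColumn("signup_date", F.to_date(F.col("signup_ts")))   # "signup_ts" is illustrative
    .withColumn("bucket", F.when(F.col("score") > 90, "high").otherwise("low"))
)
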
Section 06

Filtering & Selection

Selecting Columns

Pandas
# Select columns
df[["name", "score"]]

# Drop columns
df.drop(columns=["temp"])

# Rename
df.rename(columns={"old": "new"})

PySpark
# Select columns
df.select("name", "score")

# Drop columns
df.drop("temp")

# Rename
df.withColumnRenamed("old", "new")
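
A related Spark idiom, not shown above: alias() renames a column during select, which often replaces a separate withColumnRenamed call.

df.select(
    F.col("name").alias("student_name"),   # rename while selecting
    "score"
)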

Filtering Rows

Pandas
# Boolean mask
df[df["score"] > 90]

# Multiple conditions
df[(df["score"] > 90) & (df["name"] != "Bob")]

# String match
df[df["name"].str.startswith("A")]

PySpark
# filter() and where() are identical
df.filter(F.col("score") > 90)

# Multiple conditions
df.filter((F.col("score") > 90) & (F.col("name") != "Bob"))

# String match
df.filter(F.col("name").startswith("A"))

Deduplication

Pandas
df.drop_duplicates(subset=["id"], keep="first")

PySpark
# Simple dedupe (keeps arbitrary row)
df.dropDuplicates(["id"])

# Keep specific row (e.g. latest) — use window
from pyspark.sql.window import Window
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
df.withColumn("rn", F.row_number().over(w)) \
  .filter(F.col("rn") == 1).drop("rn")