DataFrames — Spark vs Pandas
Both are called "DataFrame." They share almost nothing under the hood. A Pandas DataFrame is a single in-memory data structure with a rich index. A Spark DataFrame is a distributed, immutable description of a computation — no index, no row order guarantee, no in-place mutation.
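The contrast in execution and mutability is easy to see in a minimal sketch (assuming pandas is imported as pd and an active SparkSession is available as spark):
import pandas as pd
# Pandas: executes immediately, mutates the object in place
pdf = pd.DataFrame({"x": [1, 2, 3]})
pdf["y"] = pdf["x"] * 2                      # computed right now
# PySpark: builds a plan, returns a new immutable DataFrame
sdf = spark.range(3)                         # column "id": 0, 1, 2
sdf2 = sdf.withColumn("y", sdf["id"] * 2)    # nothing computed yet
sdf2.count()                                 # action: the plan executes now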
Creation
# Pandas
import pandas as pd
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "score": [95, 87]
})
# From file
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")
# PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ("Alice", 95),
    ("Bob", 87)
], ["name", "score"])
# From file
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("data.parquet")
Key Structural Differences
| Property | Pandas | Spark |
|---|---|---|
| Storage | Single machine, in memory | Distributed across cluster |
| Index | Row index (integer, datetime, multi) | No index |
| Row Order | Guaranteed, stable | Not guaranteed |
| Mutability | In-place ops (inplace=True) | Immutable — all ops return new DFs |
| Execution | Eager | Lazy (plan until action) |
| Type System | NumPy dtypes | Spark SQL types (separate type system) |
| Null | NaN / None (messy) | null (SQL semantics) |
Spark DataFrames have no .iloc, no .loc, no .index. There is no concept of "row 5." Data is an unordered set of partitions. If you need row numbers, you must create them explicitly: monotonically_increasing_id() gives unique but non-contiguous values, while row_number() over a window gives contiguous numbering at the cost of specifying an explicit ordering.
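A minimal sketch of both options, assuming a hypothetical ts column to order by:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Unique but non-contiguous IDs, assigned per partition
df = df.withColumn("row_id", F.monotonically_increasing_id())
# Contiguous numbering; note that a global window with no partitionBy
# pulls all rows into a single partition
w = Window.orderBy("ts")
df = df.withColumn("row_num", F.row_number().over(w))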
Column Expressions
In Pandas, you operate on concrete Series objects. In Spark, you build column expressions — symbolic descriptions of computation that compile into Catalyst plan nodes. Three syntaxes exist for referencing columns.
Referencing Columns
from pyspark.sql import functions as F
from pyspark.sql.column import Column
# These are equivalent
df.select(df.price) # attribute access
df.select(df["price"]) # bracket notation
df.select(F.col("price")) # col() function — safest, most flexible
# col() works without a DataFrame reference
# Use this inside joins and when building reusable expressions
expr = F.col("price") * F.col("quantity")
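Nothing has been computed at this point: expr is a Column object describing price * quantity, and it can be handed to select() or withColumn() on any DataFrame that has those columns. A small sketch, reusing the df assumed in the snippet above:
print(type(expr))                     # <class 'pyspark.sql.column.Column'>
df = df.withColumn("revenue", expr)   # still lazy; only the plan grows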
Building Expressions
# Pandas: direct computation on Series
df["total"] = df["price"] * df["qty"]
df["label"] = df["score"].apply(
    lambda x: "high" if x > 90 else "low"
)
# PySpark: column expression — no data touched yet
df = df.withColumn("total", F.col("price") * F.col("qty"))
df = df.withColumn("label",
    F.when(F.col("score") > 90, "high")
     .otherwise("low")
)
F.when() compiles into a Catalyst expression that runs inside the JVM, distributed across executors. A Python lambda wrapped in a UDF must be serialized to each executor, run in a Python worker process, and have its results serialized back row by row, which is typically orders of magnitude slower. Always prefer Spark's built-in functions.
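For contrast, this is roughly what the UDF route looks like for the same labelling logic; a sketch of the pattern to avoid, not a recommendation:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Each row is shipped to a Python worker, evaluated, and shipped back
label_udf = F.udf(lambda x: "high" if x > 90 else "low", StringType())
df = df.withColumn("label_slow", label_udf(F.col("score")))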
Common Column Functions
| Pandas | PySpark |
|---|---|
df["col"].str.upper() | F.upper(F.col("col")) |
df["col"].str.contains("x") | F.col("col").contains("x") |
df["col"].fillna(0) | F.coalesce(F.col("col"), F.lit(0)) |
df["col"].astype(int) | F.col("col").cast("int") |
df["col"].dt.date | F.to_date(F.col("col")) |
pd.to_datetime(df["col"]) | F.to_timestamp(F.col("col")) |
df["col"].isin([1, 2]) | F.col("col").isin([1, 2]) |
np.where(cond, a, b) | F.when(cond, a).otherwise(b) |
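A few of the mappings above combined into one chain, assuming hypothetical name, amount, and created_at columns:
df = (
    df.withColumn("name", F.upper(F.col("name")))
      .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0)).cast("int"))
      .withColumn("created_date", F.to_date(F.col("created_at")))
)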
Filtering & Selection
Selecting Columns
# Pandas: select columns
df[["name", "score"]]
# Drop columns
df.drop(columns=["temp"])
# Rename
df.rename(columns={"old": "new"})
# PySpark: select columns
df.select("name", "score")
# Drop columns
df.drop("temp")
# Rename
df.withColumnRenamed("old", "new")
Filtering Rows
# Pandas: boolean mask
df[df["score"] > 90]
# Multiple conditions
df[(df["score"] > 90) & (df["name"] != "Bob")]
# String match
df[df["name"].str.startswith("A")]
# PySpark: filter() and where() are identical
df.filter(F.col("score") > 90)
# Multiple conditions
df.filter((F.col("score") > 90) & (F.col("name") != "Bob"))
# String match
df.filter(F.col("name").startswith("A"))
Deduplication
# Pandas: keep="first" relies on stable row order
df.drop_duplicates(subset=["id"], keep="first")
# PySpark: simple dedupe (keeps an arbitrary row)
df.dropDuplicates(["id"])
# Keep specific row (e.g. latest) — use window
from pyspark.sql.window import Window
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
df.withColumn("rn", F.row_number().over(w)) \
    .filter(F.col("rn") == 1).drop("rn")