DataFrames — Spark vs Pandas
Both are called "DataFrame." They share almost nothing under the hood. A Pandas DataFrame is a single in-memory data structure with a rich index. A Spark DataFrame is a distributed, immutable description of a computation — no index, no row order guarantee, no in-place mutation.
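The contrast in execution and mutability is easy to see in a minimal sketch (assuming pandas is imported as pd and an active SparkSession is available as spark):
import pandas as pd
# Pandas: executes immediately, mutates the object in place
pdf = pd.DataFrame({"x": [1, 2, 3]})
pdf["y"] = pdf["x"] * 2                      # computed right now
# PySpark: builds a plan, returns a new immutable DataFrame
sdf = spark.range(3)                         # column "id": 0, 1, 2
sdf2 = sdf.withColumn("y", sdf["id"] * 2)    # nothing computed yet
sdf2.count()                                 # action: the plan executes now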
Creation
# Pandas
import pandas as pd
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "score": [95, 87]
})
# From file
df = pd.read_csv("data.csv")
df = pd.read_parquet("data.parquet")
# PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ("Alice", 95),
    ("Bob", 87)
], ["name", "score"])
# From file
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = spark.read.parquet("data.parquet")
Key Structural Differences
| Property | Pandas | Spark |
|---|---|---|
| Storage | Single machine, in memory | Distributed across cluster |
| Index | Row index (integer, datetime, multi) | No index |
| Row Order | Guaranteed, stable | Not guaranteed |
| Mutability | In-place ops (inplace=True) | Immutable — all ops return new DFs |
| Execution | Eager | Lazy (plan until action) |
| Type System | NumPy dtypes | Spark SQL types (separate type system) |
| Null | NaN / None (messy) | null (SQL semantics) |
Spark DataFrames have no .iloc, no .loc, no .index. There is no concept of "row 5." Data is an unordered set of partitions. If you need row numbers, you must create them explicitly: monotonically_increasing_id() gives unique but non-contiguous values, while row_number() over a window gives contiguous numbering at the cost of specifying an explicit ordering.
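A minimal sketch of both options, assuming a hypothetical ts column to order by:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Unique but non-contiguous IDs, assigned per partition
df = df.withColumn("row_id", F.monotonically_increasing_id())
# Contiguous numbering; note that a global window with no partitionBy
# pulls all rows into a single partition
w = Window.orderBy("ts")
df = df.withColumn("row_num", F.row_number().over(w))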
Column Expressions
In Pandas, you operate on concrete Series objects. In Spark, you build column expressions — symbolic descriptions of computation that compile into Catalyst plan nodes. Three syntaxes exist for referencing columns.
Referencing Columns
from pyspark.sql import functions as F
from pyspark.sql.column import Column
# These are equivalent
df.select(df.price) # attribute access
df.select(df["price"]) # bracket notation
df.select(F.col("price")) # col() function — safest, most flexible
# col() works without a DataFrame reference
# Use this inside joins and when building reusable expressions
expr = F.col("price") * F.col("quantity")
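Nothing has been computed at this point: expr is a Column object describing price * quantity, and it can be handed to select() or withColumn() on any DataFrame that has those columns. A small sketch, reusing the df assumed in the snippet above:
print(type(expr))                     # <class 'pyspark.sql.column.Column'>
df = df.withColumn("revenue", expr)   # still lazy; only the plan grows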
Building Expressions
# Pandas: direct computation on Series
df["total"] = df["price"] * df["qty"]
df["label"] = df["score"].apply(
    lambda x: "high" if x > 90 else "low"
)
# PySpark: column expression — no data touched yet
df = df.withColumn("total", F.col("price") * F.col("qty"))
df = df.withColumn("label",
    F.when(F.col("score") > 90, "high")
     .otherwise("low")
)
F.when() compiles into a Catalyst expression that runs inside the JVM, distributed across executors. A Python lambda wrapped in a UDF must be serialized to each executor, run in a Python worker process, and have its results serialized back row by row, which is typically orders of magnitude slower. Always prefer Spark's built-in functions.
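For contrast, this is roughly what the UDF route looks like for the same labelling logic; a sketch of the pattern to avoid, not a recommendation:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Each row is shipped to a Python worker, evaluated, and shipped back
label_udf = F.udf(lambda x: "high" if x > 90 else "low", StringType())
df = df.withColumn("label_slow", label_udf(F.col("score")))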
Common Column Functions
| Pandas | PySpark |
|---|---|
df["col"].str.upper() | F.upper(F.col("col")) |
df["col"].str.contains("x") | F.col("col").contains("x") |
df["col"].fillna(0) | F.coalesce(F.col("col"), F.lit(0)) |
df["col"].astype(int) | F.col("col").cast("int") |
df["col"].dt.date | F.to_date(F.col("col")) |
pd.to_datetime(df["col"]) | F.to_timestamp(F.col("col")) |
df["col"].isin([1, 2]) | F.col("col").isin([1, 2]) |
np.where(cond, a, b) | F.when(cond, a).otherwise(b) |
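A few of the mappings above combined into one chain, assuming hypothetical name, amount, and created_at columns:
df = (
    df.withColumn("name", F.upper(F.col("name")))
      .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0)).cast("int"))
      .withColumn("created_date", F.to_date(F.col("created_at")))
)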
Filtering & Selection
Selecting Columns
# Pandas: select columns
df[["name", "score"]]
# Drop columns
df.drop(columns=["temp"])
# Rename
df.rename(columns={"old": "new"})
# PySpark: select columns
df.select("name", "score")
# Drop columns
df.drop("temp")
# Rename
df.withColumnRenamed("old", "new")
Filtering Rows
# Pandas: boolean mask
df[df["score"] > 90]
# Multiple conditions
df[(df["score"] > 90) & (df["name"] != "Bob")]
# String match
df[df["name"].str.startswith("A")]
# PySpark: filter() and where() are identical
df.filter(F.col("score") > 90)
# Multiple conditions
df.filter((F.col("score") > 90) & (F.col("name") != "Bob"))
# String match
df.filter(F.col("name").startswith("A"))
Deduplication
# Pandas: keep="first" relies on stable row order
df.drop_duplicates(subset=["id"], keep="first")
# PySpark: simple dedupe (keeps an arbitrary row)
df.dropDuplicates(["id"])
# Keep specific row (e.g. latest) — use window
from pyspark.sql.window import Window
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
df.withColumn("rn", F.row_number().over(w)) \
    .filter(F.col("rn") == 1).drop("rn")