I/O & File Formats
Pandas reads one file at a time into one machine's memory. Spark reads directories of files distributed across executors. The file format choice determines whether Spark can use predicate pushdown (skipping data it doesn't need) and column pruning (reading only requested columns).
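As a quick sanity check (path and column names here are illustrative), select a few columns and filter a Parquet source, then inspect the physical plan: the scan node lists the pruned ReadSchema and the PushedFilters it will apply at the file level.
df = spark.read.parquet("s3://bucket/data/")
subset = df.select("store_id", "amount").filter(df.amount > 100)
subset.explain()  # look for ReadSchema (only the two selected columns) and PushedFilters in the FileScan node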
Format Comparison
| Format | Columnar | Predicate Pushdown | Splittable | Use Case |
|---|---|---|---|---|
| Parquet | Yes | Yes | Yes | Default for everything |
| Delta | Yes | Yes | Yes | ACID, time travel, merge |
| ORC | Yes | Yes | Yes | Hive ecosystems |
| CSV | No | No | Yes* | Ingestion only, never intermediate |
| JSON | No | No | Yes* | Semi-structured ingestion |
| Avro | No (row-based) | No | Yes | Schema evolution, Kafka |

*Splittable only when uncompressed or compressed with a splittable codec; a gzip-compressed CSV or JSON file must be read by a single task.
# Reading
df = spark.read.parquet("s3://bucket/data/")           # reads the entire directory
df = spark.read.parquet("s3://bucket/data/year=2024/") # reads a single partition directory
# Writing
df.write.mode("overwrite") \
.partitionBy("year", "month") \
.parquet("s3://bucket/output/")
# Delta (Databricks / open-source delta-spark)
df.write.format("delta").mode("overwrite") \
.saveAsTable("catalog.schema.table")
Using CSV or JSON as an intermediate format wastes compute and storage and prevents predicate pushdown and column pruning. Read CSV once, write Parquet, and never look back. Delta adds ACID transactions, schema enforcement, and time travel on top of Parquet.
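A minimal sketch of that one-time conversion (paths, schema, and column names are assumptions for illustration):
from pyspark.sql import types as T

sales_schema = T.StructType([
    T.StructField("store_id", T.IntegerType()),
    T.StructField("sale_date", T.DateType()),
    T.StructField("amount", T.DoubleType()),
])

raw = (spark.read
       .option("header", True)
       .schema(sales_schema)   # explicit schema avoids a full inference pass over the CSV
       .csv("s3://bucket/raw/sales/"))

raw.write.mode("overwrite").parquet("s3://bucket/curated/sales/")

# every downstream job reads the columnar copy, not the CSV
sales = spark.read.parquet("s3://bucket/curated/sales/")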
Partitioning & Bucketing
Partitioning is the concept with no Pandas equivalent. In Pandas, your data is one contiguous block in memory. In Spark, it's split into partitions: chunks that live on different executors and are processed in parallel. How data is partitioned determines most of a job's performance.
Disk Partitioning (Hive-style)
When you write .partitionBy("year", "month"), Spark creates a directory structure: year=2024/month=01/. On read, filters on these columns skip entire directories — no data scanned. This is partition pruning.
from pyspark.sql import functions as F

# Write with partitioning
df.write.partitionBy("region", "date").parquet("output/")

# This read only touches region=US partitions
spark.read.parquet("output/").filter(F.col("region") == "US")
In-Memory Partitions
# Check current partition count
df.rdd.getNumPartitions()
# Repartition — triggers full shuffle
df = df.repartition(200) # random, 200 partitions
df = df.repartition(200, "store_id") # hash by store_id into 200 partitions
# Coalesce — reduces partitions WITHOUT shuffle (merges only)
df = df.coalesce(10) # merge down to 10 partitions
repartition(n) shuffles all data across the cluster: expensive, but it produces evenly sized partitions. coalesce(n) only reduces the partition count by merging existing partitions without a full shuffle: cheap, but it can leave partitions unevenly sized. Use coalesce just before writing to reduce the output file count (see the sketch below); use repartition when you need even distribution or to partition by a key.
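For example, the write-side pattern looks like this (the target of 16 partitions is an arbitrary illustration; pick a count that yields reasonably sized files):
# Fewer, larger output files: merge partitions without a shuffle, then write
df.coalesce(16).write.mode("overwrite").parquet("s3://bucket/output/")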
Bucketing
Bucketing hashes rows into a fixed number of buckets by a column when the table is written, and can pre-sort within each bucket. Subsequent joins or aggregations on that column avoid shuffles entirely.
# Write bucketed table (persisted in metastore)
df.write.bucketBy(256, "store_id") \
.sortBy("store_id") \
.saveAsTable("bucketed_sales")
# Joins on store_id now avoid shuffle
sales = spark.table("bucketed_sales")
stores = spark.table("bucketed_stores") # also bucketed by store_id, same count
result = sales.join(stores, "store_id") # no shuffle
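To confirm the shuffle actually disappears (assuming both tables were bucketed on store_id with the same bucket count), inspect the physical plan:
result.explain()  # a bucketed sort-merge join shows no Exchange (shuffle) step on either side of the join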
Caching & Persistence
In Pandas, your DataFrame lives in memory until you delete it. In Spark, DataFrames are recomputed from the DAG every time an action is called — unless you explicitly cache them.
# Cache — stores in executor memory, spilling to disk if needed
df.cache()  # equivalent to df.persist(StorageLevel.MEMORY_AND_DISK)

# Persist with a specific storage level
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_ONLY)  # memory only; partitions that don't fit are recomputed
df.persist(StorageLevel.DISK_ONLY)    # disk only, frees executor memory
# Trigger materialization and cache
df.cache()
df.count() # action forces the cache to fill
# Release
df.unpersist()
When to Cache
| Situation | Cache? | Why |
|---|---|---|
| DataFrame used in multiple actions | Yes | Avoids recomputing the DAG |
| After expensive join or aggregation | Yes | Prevents re-shuffling |
| Inside iterative algorithms (ML) | Yes | Each iteration would re-scan source |
| Used once then discarded | No | Cache overhead exceeds recompute cost |
| Dataset is very large, cluster memory tight | Maybe | Use DISK_ONLY or skip — eviction thrashing is worse |
df.cache() doesn't compute anything; it only marks the DataFrame for caching. The cache fills on the next action: that first action computes the DAG and populates the cache, and every later action on the DataFrame reads from the cached partitions instead. Common pattern: df.cache().count() to force materialization up front.
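Putting it together, a sketch of the reuse pattern (table and column names are illustrative):
from pyspark.sql import functions as F

# expensive aggregation reused by several downstream actions
daily = df.groupBy("store_id", "sale_date").agg(F.sum("amount").alias("revenue"))

daily.cache()
daily.count()                                    # first action computes the DAG and fills the cache

daily.filter(F.col("revenue") > 10_000).show()   # served from the cache
daily.write.mode("overwrite").parquet("s3://bucket/daily_revenue/")  # also served from the cache

daily.unpersist()                                # release executor memory when finished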