How to Use Spark Connect on EMR from Local Environment
Spark Connect allows you to run Spark jobs remotely, enabling local development against an EMR cluster. This guide covers setup, configuration, and common issues.
Prerequisites
- AWS EMR cluster with Spark 3.4.0 or later
- SSH access to EMR master node
- Python environment on your local machine
- Network access to EMR cluster (VPN or direct)
Reference
[Spark Connect Official Documentation](https://spark.apache.org/docs/latest/spark-connect-overview.html)
Setting Up Spark Connect Server on EMR
Spark Connect is available from Spark 3.4.0 onward. Start the Connect server on the EMR master node:
sudo /usr/lib/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:{your-spark-version}
Note: Replace {your-spark-version} with your actual Spark version (e.g., 3.4.1).
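To confirm the server actually came up, check that something is listening on the Spark Connect default port, 15002. A minimal client-side probe, assuming the default port and direct network access to the master:

import socket

# Probe the Spark Connect port on the EMR master (default: 15002).
# Use "localhost" instead if you run this on the master node itself.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(5)
    result = s.connect_ex(("{emr-cluster-master-ip}", 15002))
print("port 15002 is open" if result == 0 else f"port 15002 unreachable (errno {result})")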
Configuring Local Environment
Version compatibility is critical: the local PySpark version must match the Spark version running on the EMR cluster, and mismatched gRPC or protobuf versions cause confusing runtime errors. For Spark 3.4.1, for example:
pip install pyspark==3.4.1
pip install grpcio-status==1.64.0
pip install grpcio==1.64.0
pip install protobuf==5.27.0
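A quick way to confirm what actually got installed (the versions printed here should match the pins above), using only the standard library:

from importlib.metadata import version

# Report the client-side versions that matter for Spark Connect
for pkg in ("pyspark", "grpcio", "grpcio-status", "protobuf"):
    print(pkg, version(pkg))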
Connecting to Spark Connect
Using the PySpark Shell
pyspark --remote "sc://{emr-cluster-master-ip}"
If no port is specified, the client uses the Spark Connect default, 15002.
Using Python Script
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Spark Connect Example") \
    .remote("sc://{emr-cluster-master-ip}") \
    .getOrCreate()
# Now you can use spark as usual
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "id"])
df.show()
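Because everything executes on the cluster, the session can also read data that only the EMR nodes can reach. A sketch, where the S3 path is a hypothetical placeholder for any dataset your cluster can access:

# s3://your-bucket/events/ is a hypothetical path; substitute your own
df = spark.read.parquet("s3://your-bucket/events/")
df.groupBy("event_type").count().orderBy("count", ascending=False).show(10)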
SparkContext Limitations
Spark Connect does not expose a SparkContext on the client, so the following `sc` functions are unavailable:
| Function | Status | Alternative |
|---|---|---|
| `sc.setCheckpointDir` | Not available | Use `spark.sparkContext.setCheckpointDir()` on the server |
| `sc.addPyFile` | Not available | Pre-install packages on the cluster |
| `sc.install_pypi_package` | Not available | Pre-install packages on the cluster |
| `sc.parallelize` | Not available | Use `spark.createDataFrame()` |
| `sc.setLogLevel` | Not available | Configure logging on the server side |
| `sc.broadcast` | Not available | Use DataFrame operations |
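For example, a map-side lookup built on sc.broadcast can usually be rewritten as a broadcast join. A minimal sketch using the DataFrame broadcast hint, reusing the spark session from above:

from pyspark.sql.functions import broadcast

# Small lookup table that would previously have been sc.broadcast-ed
lookup = spark.createDataFrame([("US", "United States"), ("DE", "Germany")], ["code", "country"])
events = spark.createDataFrame([("e1", "US"), ("e2", "DE")], ["event_id", "code"])

# The broadcast() hint ships the small table to the executors for the join
events.join(broadcast(lookup), "code").show()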
Important: Only a single connect server can run at a time on the cluster.
Troubleshooting
Error: [NOT_ITERABLE] Column is not iterable
pyspark.errors.exceptions.base.PySparkTypeError: [NOT_ITERABLE] Column is not iterable.
Cause: Protobuf version incompatibility
Solution: Ensure protobuf version matches the server:
pip install protobuf==5.27.0
Connection Refused
Cause: Firewall or security group blocking port 15002
Solution:
- Add inbound rule for port 15002 in EMR security group
- Or use SSH tunnel:
ssh -L 15002:localhost:15002 hadoop@{emr-master-ip}
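With the tunnel open, point the client at localhost rather than the master's address:

from pyspark.sql import SparkSession

# localhost:15002 is forwarded through the SSH tunnel to the EMR master
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()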
Version Mismatch Errors
Cause: Local PySpark version doesn’t match EMR Spark version
Solution: Install the exact same version:
# Check EMR Spark version
spark-submit --version
# Install matching local version
pip install pyspark=={same-version}
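Once connected, you can also verify the match at runtime. In Spark Connect, spark.version is resolved by the server, while pyspark.__version__ is the local client, so the two should be identical:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://{emr-cluster-master-ip}").getOrCreate()

# Client (local install) vs. server (EMR cluster) Spark versions
print("client:", pyspark.__version__, "| server:", spark.version)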
Best Practices
- Use virtual environments: Isolate Spark Connect dependencies
- Match versions exactly: Minor version differences can cause issues
- Use SSH tunneling: More secure than opening ports
- Monitor server resources: Connect server adds overhead to master node
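Tying these together, a small helper that keeps the connection target out of your scripts makes switching between a tunnel and direct access painless. SPARK_CONNECT_HOST is a naming convention for this sketch, not a Spark setting:

import os
from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    # "localhost" when using an SSH tunnel; the master's IP for direct access
    host = os.environ.get("SPARK_CONNECT_HOST", "localhost")
    return SparkSession.builder.remote(f"sc://{host}:15002").getOrCreate()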