Scaling LightGBM Model Training with SynapseML on EMR
A Deep Dive into Spark Tuning for Huge Datasets

When working with large-scale tabular datasets, LightGBM (LGBM) continues to be a top choice due to its speed and accuracy. But when you're training multiple models or predicting on massive volumes of data, you need infrastructure that scales. This is where Apache Spark with SynapseML on Amazon EMR shines.
In this post, we’ll focus on how to fine-tune Spark configurations to get the best performance for training and predicting LightGBM models with SynapseML, especially when multiple Spark jobs are submitted simultaneously. We’ll also contrast this setup with Amazon SageMaker’s LightGBM training, comparing cost and performance trade-offs.
Why SynapseML + Spark for LightGBM?
SynapseML (formerly MMLSpark) offers a distributed version of LightGBM that integrates directly with Spark ML pipelines. Unlike single-node implementations, SynapseML lets you:
• Train and predict on massive datasets in parallel across many nodes
• Handle complex workflows (feature engineering, transformation, evaluation) inside Spark
• Scale horizontally with fewer memory limitations
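To make this concrete, here is a minimal training sketch (not a production recipe): the S3 path and column names are placeholders, and it assumes the SynapseML package is already attached to your Spark session (for example via the --packages flag at submit time).

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

spark = SparkSession.builder.appName("lgbm-training").getOrCreate()

# Load training data (path is a placeholder)
df = spark.read.parquet("s3://your-bucket/training-data/")

# Spark ML expects a single vector column of features
feature_cols = [c for c in df.columns if c != "label"]
train_df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

# Distributed LightGBM: each executor trains on its local partitions and
# the workers synchronize histograms over the network
lgbm = LightGBMClassifier(labelCol="label", featuresCol="features",
                          numIterations=200, numLeaves=63, learningRate=0.05)
model = lgbm.fit(train_df)

# Inference is also distributed across the cluster
predictions = model.transform(train_df)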
Deep Dive: Spark Configuration for Efficient SynapseML LightGBM Training
Tuning Spark for ML workloads is not just about spinning up more nodes—you need to dial in executor memory, CPU allocation, garbage collection, and scheduling policies. Here's how to set up your EMR cluster for maximum efficiency.
1. Cluster Sizing and Instance Selection
When working with Apache Spark on EMR, one of the most important decisions you’ll make is around cluster sizing and the selection of EC2 instance types. These choices have a direct impact on cost, performance, and scalability. While it might be tempting to over-provision resources to “be safe,” smart tuning ensures cost-effectiveness without sacrificing performance.
Why cluster sizing matters: Spark distributes computation across multiple executors, each running on a worker node. Each executor gets a slice of CPU, memory, and I/O resources. If your instance types are underpowered or mismatched with your workload type (e.g., memory-intensive vs. compute-intensive), Spark will not perform optimally.
Types of EC2 Instances for EMR
1. Compute-optimized (C5):
• These are best suited for CPU-bound tasks or for workloads where high throughput matters more than RAM.
• Recommended for scoring and smaller datasets or when running multiple lightweight jobs concurrently.
2. Memory-optimized (R5):
• Ideal for LightGBM training, especially with large feature sets.
• Distributed training with SynapseML benefits greatly from high memory capacity as Spark keeps large amounts of intermediate data in memory (especially during shuffle and aggregation stages).
3. General-purpose (M5):
• Good for balanced workloads that require a bit of both compute and memory but don’t lean heavily in either direction.
• Suitable for dev/test workloads or training smaller models.
EMR Nodes:
• Master Node: Manages cluster metadata, job scheduling, and driver program coordination. Needs moderate resources, but stability is more important than performance. Always use On-Demand pricing for this node.
• Core Nodes: These run the main Spark executors and store HDFS data if used. You should provision enough of these nodes to handle your largest expected parallel workload.
• Task Nodes (optional): For bursty compute loads. These don’t store data but run Spark tasks, useful when you're dynamically scaling the cluster.
Example Cluster for SynapseML LightGBM Workload:
• 1 Master node (m5.xlarge)
• 4–10 Core nodes (r5.4xlarge or r5.2xlarge)
• Optional 5–20 Task nodes (c5.4xlarge if running concurrent scoring)
Spot Instances for Cost Efficiency
Amazon EC2 Spot Instances let you leverage unused EC2 capacity at up to 90% lower cost. For EMR, Spot Instances can be highly effective for Core and Task nodes — especially for workloads that can checkpoint or be retried. LightGBM training jobs usually tolerate occasional interruptions because Spark retries failed tasks and stages, though losing a node mid-training can force a training stage to restart.
Best practice: Use Spot Instances for 80% of the cluster and On-Demand for critical nodes (like the Master or one Core node).
Storage Considerations
Attach EBS volumes of sufficient size (100–500 GB) depending on your dataset size. Ensure your volumes are of type gp3 for balanced IOPS and throughput, or io1 if you're doing I/O-heavy data transformation before training.
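Putting the sizing, Spot, and storage guidance together, here is a hedged sketch of launching such a cluster with boto3. The release label, subnet, IAM roles, instance counts, and volume sizes are illustrative placeholders to adapt to your account and workload.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

gp3_volumes = {
    "EbsBlockDeviceConfigs": [
        {"VolumeSpecification": {"VolumeType": "gp3", "SizeInGB": 200},
         "VolumesPerInstance": 1}
    ]
}

response = emr.run_job_flow(
    Name="synapseml-lightgbm",
    ReleaseLabel="emr-6.15.0",  # any recent release with Spark 3.x
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            # Master: On-Demand for stability
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            # Core: memory-optimized for LightGBM training
            {"InstanceRole": "CORE", "InstanceType": "r5.4xlarge",
             "InstanceCount": 4, "Market": "ON_DEMAND",
             "EbsConfiguration": gp3_volumes},
            # Task: Spot capacity for bursty, concurrent scoring
            {"InstanceRole": "TASK", "InstanceType": "c5.4xlarge",
             "InstanceCount": 10, "Market": "SPOT",
             "EbsConfiguration": gp3_volumes},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2SubnetId": "subnet-xxxxxxxx",  # placeholder
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])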
2. Executor and Driver Configuration
Proper executor tuning is the backbone of performance optimization in Spark. When training machine learning models—especially memory-heavy models like LightGBM—efficient use of executors ensures your jobs don’t crash due to OOM errors, and that you're not wasting cluster resources due to underutilization.
Executors: The Workhorses of Spark
Each executor is a JVM process responsible for running your tasks. It handles data caching, shuffles, and model computations. If you allocate too much memory, Spark may over-commit, leading to YARN killing containers. Too little memory leads to garbage collection (GC) overhead and poor performance.
Here’s how to plan:
• Determine total memory and vCPUs on each node (e.g., r5.4xlarge = 128 GB RAM, 16 vCPUs).
• Reserve memory for OS and YARN (e.g., 10–15%).
• Divide remaining memory among 2–4 executors per node, depending on your workload.
Example Calculation for Executor Memory
For an r5.4xlarge node:
• Total memory = 128 GB
• Reserve ~10% = ~12 GB
• Available = 116 GB
• Choose 4 executors per node → each gets ~29 GB
Configure:
--executor-memory 25G
--executor-cores 4
--conf spark.executor.memoryOverhead=4G
Overhead memory is especially important for LightGBM training. The native LightGBM code can use off-heap memory, and Spark must account for this.
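To make this sizing arithmetic repeatable across instance types, here is a small helper sketch. The OS/YARN reservation and overhead fraction are rules of thumb taken from the example above, not hard limits.

def size_executors(node_memory_gb, node_vcpus, executors_per_node,
                   reserve_gb=12, overhead_fraction=0.14):
    # Rough sizing: reserve memory for the OS and YARN, split the rest across
    # executors, and carve out part of each slice as off-heap overhead.
    usable_gb = node_memory_gb - reserve_gb            # e.g., 128 - 12 = 116 GB
    per_executor_gb = usable_gb // executors_per_node  # e.g., 116 // 4 = 29 GB
    overhead_gb = max(2, round(per_executor_gb * overhead_fraction))
    heap_gb = per_executor_gb - overhead_gb            # value for --executor-memory
    return {
        "executor_memory": f"{heap_gb}G",                    # -> "25G" on r5.4xlarge
        "executor_cores": node_vcpus // executors_per_node,  # -> 4
        "memory_overhead": f"{overhead_gb}G",                # -> "4G"
    }

# r5.4xlarge: 128 GB RAM, 16 vCPUs, 4 executors per node
print(size_executors(128, 16, 4))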
Driver Configuration
The driver is the Spark application coordinator. In many training workflows, the driver doesn’t handle large volumes of data directly, but if you're aggregating training metrics, collecting predictions, or using broadcast variables, the driver memory needs attention.
Use:
--driver-memory 16G
--driver-cores 4
This is enough for most scenarios unless you're managing a large number of parallel models from the driver.
3. Dynamic Resource Allocation
Dynamic Resource Allocation (DRA) is a powerful feature that allows Spark to automatically scale the number of executors based on workload needs. When training or predicting multiple models on the same EMR cluster, DRA helps improve resource utilization and avoids idle containers.
Why Use DRA in Model Training?
In ML pipelines, some stages (e.g., reading data or shuffling) require lots of parallelism, while others (e.g., the final model fit or metric aggregation) run with far fewer tasks. Without DRA, Spark holds a fixed number of executors for the entire application, leaving many of them idle during the low-parallelism stages.
With DRA, Spark:
• Requests more executors when task backlog is high
• Releases idle executors to free up YARN memory
• Enables multi-tenancy: multiple Spark jobs can co-exist without starving each other
How to Configure DRA
Set the following Spark configurations:
--conf spark.dynamicAllocation.enabled=true
--conf spark.shuffle.service.enabled=true
--conf spark.dynamicAllocation.initialExecutors=4
--conf spark.dynamicAllocation.minExecutors=2
--conf spark.dynamicAllocation.maxExecutors=50
The external shuffle service (spark.shuffle.service.enabled=true) is required for dynamic allocation to release executors without losing shuffle data. EMR enables it by default, but confirm it if you're using a custom bootstrap or Spark version.
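If you create the SparkSession in your own driver script rather than relying solely on spark-submit flags, the same settings can be applied programmatically before the context starts. A minimal sketch, with the executor bounds as placeholders to tune:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lgbm-training-dra")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")   # external shuffle service, required for DRA
    .config("spark.dynamicAllocation.initialExecutors", "4")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()
)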
Benefits for SynapseML
LightGBM training using SynapseML is stage-based and highly parallel. When a demanding stage starts, DRA requests additional executors; once the stage finishes, idle executors are released and the application scales back down. This leads to:
• Faster training with minimal resource wastage
• Multiple jobs sharing the same EMR cluster with less contention
• Lower EC2 cost, as you can run a lean cluster with burst scaling
4. Fair Scheduling for Parallel Workloads
In multi-tenant environments or workloads involving many concurrent training and scoring jobs, a fair and efficient resource-sharing mechanism is critical. This is where Spark's Fair Scheduler comes into play. By default, Spark schedules jobs within an application FIFO (First-In, First-Out), executing them in the order they're submitted. While simple, this approach isn't ideal when you're training or predicting multiple LightGBM models in parallel.
The Fair Scheduler lets concurrent Spark jobs (or stages) within a single application share that application's executors fairly, which is especially valuable when one EMR application drives many parallel tasks — like multiple SynapseML LGBM training jobs launched from different users, notebooks, or Airflow pipelines. (Resource sharing across separate Spark applications on the same cluster is handled by YARN's own scheduler, which EMR also lets you configure.)
Why Fair Scheduling Matters
Let's say you're running 5 LightGBM training jobs concurrently, either as parallel jobs inside one Spark application or as separate applications on the same YARN cluster. Without a fair scheduling policy at the appropriate level (Spark's Fair Scheduler within an application, YARN's scheduler across applications), one job could monopolize all available executors, starving the others and causing long wait times or timeout failures.
Using fair scheduling:
• Spark allocates resources in pools, distributing executors equitably
• Each job (or task within a job) gets at least its fair share of the cluster’s resources
• High-priority jobs (like prediction pipelines for real-time scoring) can be given greater weight
How to Enable Spark Fair Scheduler on EMR
1. Enable the Fair Scheduler in your Spark configs:
--conf spark.scheduler.mode=FAIR
2. Create a fair scheduler configuration file (XML format): This file defines the resource pools and sets rules like weights, min shares, and scheduling modes.
Example: fairscheduler.xml
<allocations>
  <pool name="training">
    <schedulingMode>FAIR</schedulingMode>
    <minShare>2</minShare>
    <weight>1</weight>
  </pool>
  <pool name="prediction">
    <schedulingMode>FAIR</schedulingMode>
    <minShare>2</minShare>
    <weight>3</weight>
  </pool>
</allocations>
This configuration ensures that jobs assigned to the “prediction” pool get a higher scheduling weight, meaning they can access more resources if available — ideal for latency-sensitive tasks.
3. Reference the scheduler config in spark-submit:
--conf spark.scheduler.allocation.file=/etc/spark/conf/fairscheduler.xml
Using Pools in Your Code
In your Spark code or notebooks, assign pools using:
sc.setLocalProperty("spark.scheduler.pool", "training")
This way, you can dynamically assign different parts of your workflow (e.g., training vs. prediction) to specific resource pools with different scheduling priorities.
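For example, within a single application you can run training and scoring from separate threads, each pinned to its own pool. This is a sketch that assumes Spark 3.2+ (where each Python thread maps to its own JVM scheduling context) and reuses placeholder names (spark, lgbm, train_df, existing_model, test_df) from the earlier sketches.

import threading

def run_training():
    # Jobs submitted from this thread land in the lower-weight "training" pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "training")
    lgbm.fit(train_df).write().overwrite().save("s3://your-bucket/models/new")

def run_scoring():
    # Jobs submitted from this thread land in the higher-weight "prediction" pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "prediction")
    existing_model.transform(test_df) \
        .write.mode("overwrite").parquet("s3://your-bucket/predictions/")

threads = [threading.Thread(target=run_training), threading.Thread(target=run_scoring)]
for t in threads:
    t.start()
for t in threads:
    t.join()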
Benefits for Multi-Model LightGBM Workloads
When you're using SynapseML to train dozens of LightGBM models (e.g., per user segment or product category), using fair scheduling ensures:
• All models train concurrently rather than waiting in line
• Important pipelines (like fraud detection or recommendation models) can be prioritized
• The EMR cluster is utilized efficiently without bottlenecks or starvation
This is especially powerful when combined with Dynamic Resource Allocation, as Spark will now grow and shrink executors per job within their fair share.
5. Partitioning and Parallelism
One of the most overlooked yet critical aspects of Spark performance tuning is how data is partitioned. In Spark, data is broken into partitions, and each partition is processed by one task (on one CPU core). For LightGBM training in SynapseML, poor partitioning can lead to straggler tasks, OOM errors, and underutilization of your EMR cluster.
Understanding Partitioning in Spark
Each RDD or DataFrame in Spark is made up of a number of partitions. If partitions are:
• Too small: You create a massive number of small tasks, overwhelming the scheduler.
• Too large: You overload a single executor, potentially causing it to run out of memory.
LightGBM models benefit from uniformly distributed partitions since the training phase involves multiple parallel stages (e.g., data loading, histogram building, tree construction) that need to be balanced across executors.
Best Practices for Partitioning in SynapseML LGBM
1. Initial Data Load:
• Use repartition(n) to control partition count when loading data.
• If reading from S3 or HDFS, the input split size is controlled by spark.sql.files.maxPartitionBytes (128 MB by default); lowering it produces more, smaller input partitions:
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
2. Training Phase:
• Repartition based on executor count:
num_partitions = num_executors * executor_cores
df = df.repartition(num_partitions)
• Example: On a cluster with 40 executors and 4 cores each, aim for 160 partitions.
3. Shuffle Optimization:
• Set the following Spark configs:
--conf spark.sql.shuffle.partitions=200
--conf spark.default.parallelism=200
These values should be tuned based on the number of concurrent jobs and cluster size. Having too many shuffle partitions increases memory usage; too few limits parallelism.
4. Avoid Skew:
• Make sure your partitioning key is not skewed (e.g., avoid partitioning by country if 90% of your users are from one country).
• Use .sample() or .approxQuantile() to analyze distributions before repartitioning (a short partition-balance check sketch follows this list).
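As a quick check, the sketch below repartitions a DataFrame to match the cluster's task slots and inspects how evenly rows are spread across partitions. The executor counts are placeholders for your actual configuration.

from pyspark.sql.functions import spark_partition_id

num_executors, executor_cores = 40, 4
num_partitions = num_executors * executor_cores   # 160 tasks can run fully in parallel

df = df.repartition(num_partitions)

# Rows per partition; roughly uniform counts mean balanced work per task
partition_sizes = df.groupBy(spark_partition_id().alias("partition")).count()
partition_sizes.summary("min", "max", "mean").show()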
Benefits During Prediction
Partitioning is just as important during the prediction phase. If you load a large test set (e.g., 1B rows) and don’t partition it properly, Spark may overload a few executors. Ensure test data is:
• Repartitioned before .transform():
testData = testData.repartition(200)
predictions = model.transform(testData)
• If scoring is done in batches (e.g., daily segments), each job should still aim for balanced partition counts.
Why Partitioning Matters for Cost and Stability
• Poor partitioning leads to job retries, long tail tasks, and wasted executor hours
• Good partitioning reduces executor idle time and ensures balanced CPU utilization
• It’s one of the easiest and cheapest ways to boost performance without touching hardware
When you're submitting dozens of training/prediction jobs, even a 10–15% efficiency gain from proper partitioning translates to significant dollar savings on EMR billing.
6. Training Multiple Models Concurrently
One of the greatest strengths of using Spark on EMR is the ability to scale horizontally and handle multiple training jobs at once. In real-world scenarios, it’s common to train dozens or hundreds of LightGBM models—one for each customer segment, time window, or product category. This strategy is often referred to as multi-model training, and it’s especially powerful when paired with SynapseML LightGBM.
Why Train Multiple Models in Parallel?
• Model specialization: You often get better performance by training separate models for each cohort or feature group rather than a single global model.
• Throughput: Running jobs serially is slow and inefficient.
• Architecture scaling: EMR supports multi-app YARN scheduling, enabling large parallel jobs.
Ways to Achieve Multi-Model Training
1. Submit separate Spark jobs: Each job trains one model using its own executor pool.
• This gives isolation and fault-tolerance.
• Ideal for production environments with job orchestration (e.g., Airflow, Step Functions).
2. Run models in parallel within one job: Use parallel threads or asynchronous execution (e.g., with Python's concurrent.futures) to kick off model training on different DataFrames within the same application (see the sketch after this list).
3. Use parameter maps for parallel pipelines:
• If using Spark ML’s CrossValidator or TrainValidationSplit, you can parallelize over hyperparameters and models simultaneously.
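As a sketch of the second approach, the snippet below trains one LightGBM model per segment from a thread pool inside a single Spark application. Segment values, paths, and the train_df DataFrame are placeholders; combine this with the Fair Scheduler pools described earlier if you need per-thread prioritization.

from concurrent.futures import ThreadPoolExecutor
from synapse.ml.lightgbm import LightGBMClassifier

def train_segment(segment_id):
    # Each thread filters its own segment and fits an independent model
    segment_df = train_df.filter(train_df.segment == segment_id)
    lgbm = LightGBMClassifier(labelCol="label", featuresCol="features")
    model = lgbm.fit(segment_df)
    model.write().overwrite().save(f"s3://your-bucket/models/segment={segment_id}")
    return segment_id

segments = [1, 2, 3, 4, 5]
with ThreadPoolExecutor(max_workers=4) as pool:
    for finished in pool.map(train_segment, segments):
        print(f"Trained model for segment {finished}")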
Best Practices for Parallel Training
• Use Dynamic Resource Allocation and Fair Scheduler to ensure resource fairness across concurrent jobs.
• Tag logs or job names with model IDs for observability.
• Use distributed logging (e.g., CloudWatch, S3 logs) to monitor job-level metrics.
Job Submission Tips
To submit multiple jobs from the CLI:
spark-submit --master yarn --deploy-mode cluster train_model.py --segment=1
spark-submit --master yarn --deploy-mode cluster train_model.py --segment=2
...
Each job will get its own YARN application and use resources independently.
Handling Prediction Jobs in Parallel
Just like training, LightGBM predictions can be distributed:
• Pre-partition the dataset by the model segment (e.g., customer_group)
• Route each partition to the appropriate model
• Predict in parallel across segments
This parallel prediction workflow is useful when scoring billions of rows daily for real-time or near real-time applications (ads, recommendations, etc.).
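A hedged sketch of that routing pattern: each segment of the test set is scored with its own saved model, and the per-segment predictions are unioned back together. The model paths, the customer_group column, and the SynapseML model class used for loading are assumptions to adapt.

from functools import reduce
from synapse.ml.lightgbm import LightGBMClassificationModel

segments = [1, 2, 3]
scored_parts = []
for seg in segments:
    # Load the per-segment model and score only that segment's rows
    seg_model = LightGBMClassificationModel.load(f"s3://your-bucket/models/segment={seg}")
    seg_df = test_df.filter(test_df.customer_group == seg).repartition(200)
    scored_parts.append(seg_model.transform(seg_df))

# Combine per-segment predictions into a single DataFrame
all_predictions = reduce(lambda a, b: a.unionByName(b), scored_parts)
all_predictions.write.mode("overwrite").parquet("s3://your-bucket/predictions/")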
Distributed Prediction with LightGBM on Spark
Unlike SageMaker, where batch or real-time prediction typically runs on one or a few dedicated inference instances, SynapseML performs inference in-cluster, distributed across your Spark executors:
predictions = model.transform(largeTestData)
This is ideal for huge datasets, scoring billions of records without bottlenecks.
EMR vs. SageMaker: Cost and Efficiency
| Feature | EMR + SynapseML | SageMaker LightGBM |
| --- | --- | --- |
| Training Speed | Fast (due to Spark parallelism) | Fast for small-to-medium jobs |
| Cost | Low (Spot pricing + flexible scaling) | High (per-instance pricing, managed overhead) |
| Concurrency | Excellent (multiple applications supported) | Limited unless orchestrated manually |
| Customization | Full control over Spark stack | Limited to built-in parameters |
| Inference | Distributed, scalable | Limited to batch or endpoint |
Conclusion: For teams handling multiple pipelines, large datasets, and tight budgets, EMR + SynapseML offers better scalability and cost control.
SynapseML LightGBM vs SageMaker LightGBM
| Aspect | SynapseML LightGBM | SageMaker LightGBM |
| --- | --- | --- |
| Platform | Spark-based (distributed) | Native (scikit-learn style) |
| Scaling | Native Spark parallelism | Manual or limited multi-instance support |
| Data Handling | In-cluster (Spark DataFrames) | Requires dataset upload or integration |
| Inference | Distributed across cluster | Typically single-node batch or endpoint |
Despite these operational differences, model outputs and accuracy are comparable, since both wrap the same LightGBM core engine. The key differences lie in scalability, parallel processing, and ecosystem integration.
Final Thoughts
If you're building ML systems that need to process terabytes of data, run multiple model training jobs, and predict at scale, SynapseML + Spark on EMR is a compelling choice. With proper tuning:
• You gain fine-grained control over CPU, memory, and application scheduling
• You can drastically reduce training and inference times
• You save significantly compared to SageMaker, especially at scale
Whether you're handling clickstream logs, financial modeling, or recommendation engines, this architecture provides speed, cost-efficiency, and flexibility.
Need a working template, code samples, or help setting this up? Drop a comment or reach out at [email protected] —we’re happy to help! 🚀
Alagar A is a Senior Software Development Engineer at Amazon with over a decade of experience building scalable, high-performance systems. His work spans E-Reader (Kindle), Smart Devices, and Grocery Tech, where he has led initiatives that built ML model platforms to improve forecast accuracy, reduce operational waste, and enhance user experiences on platforms like Alexa, Fire TV, and Kindle. With expertise in Python, AWS, Java, C++, and embedded systems, Alagarsamy combines deep technical skill with a passion for solving real-world problems. Based in Texas, he enjoys exploring the intersection of software engineering and impactful innovation.