Recently, the disaggregation of storage and computation has become a clear trend in infrastructure development. Among distributed computation frameworks, Hadoop/Spark is the de facto standard for big-data processing and analytics. Following the idea of layer separation, the final data is persisted in a distributed filesystem or cloud storage (Azure Blob or AWS S3); the intermediate data, however, still requires local storage/state because of the shuffle operation, which brings several hard-to-control impacts:
- Long-tail resource consumption. For example, the above diagram shows one Spark job’s CPU utilization on the Hadoop platform: CPU utilization decreases over time because different stages of a job have different resource requirements. Under this circumstance, reserving compute at a large scale causes unavoidable waste, yet it is difficult to release nodes before the whole job completes, because dependent intermediate data may be spread across all of them.
- Complex local disk volume sizing. Different applications have different disk requirements, and those requirements also fluctuate with traffic. Provisioning a unified disk volume sized for the largest application’s needs is suboptimal for cost.
- Hard to leverage the wide spectrum of compute options in the cloud. Cloud vendors provide compute in preemptible, economical forms, such as AWS Spot Instances and GCP Preemptible VM instances. In addition, AWS EKS Fargate provides a more compact resource unit, aimed at reducing the cost caused by resource fragmentation. If an application can choose among these kinds of resources, it can achieve better flexibility and cost-efficiency.
The solution is straightforward: a remote shuffle service (RSS) that manages the shuffle data of all kinds of applications. The additional network IO is unlikely to be a significant bottleneck given the current state of big-data distributed systems and hardware, and its impact is almost trivial because shuffle write is a streaming, pipelined operation. The following diagram shows the functional difference after adopting a Remote Shuffle Service.
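In practice, plugging a remote shuffle service into Spark usually comes down to replacing the shuffle manager and pointing the client at the shuffle servers. The sketch below is hypothetical: `spark.shuffle.manager` is a built-in Spark setting, but the manager class and server addresses are placeholders, since each RSS implementation ships its own client jar and property names.

```shell
# Hypothetical configuration sketch -- the RssShuffleManager class and the
# rss-host addresses are placeholders, not a real implementation.
# Only spark.shuffle.manager is a built-in Spark setting; the rest are
# implementation-specific and differ between Magnet, Uber's RSS, etc.
spark-submit \
  --conf spark.shuffle.manager=org.example.rss.RssShuffleManager \
  --conf spark.rss.servers=rss-host-1:9099,rss-host-2:9099 \
  your-app.jar
```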
What are the benefits of RSS in a cloud environment?
- Optimal instances for disk IO/network IO. For example, AWS i3en instances provide high-speed NVMe SSD instance storage and highly optimized network IO, which suits the shuffle service’s workload well.
- Horizontal scaling based on workload. In a cloud environment it is easy to add instances for extra capacity. The service can scale out/in holistically instead of being controlled per application, which makes it easier to improve resource utilization in a shared pool.
- Reduced MTTR when preemptible instances are lost. Losing an instance usually requires regenerating the intermediate data from the previous stage’s tasks, which makes these economical instances unusable for applications with tight SLA requirements. RSS helps such jobs achieve comparably stable execution times even when some instances are reclaimed unexpectedly.
- Enabling Spark dynamic allocation. Once the shuffle data has been migrated out of the executor, it becomes easier for a Spark application to decide how to shrink unused executors, allowing a more graceful scale-in even with one executor per instance.
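For comparison, vanilla Spark can only reclaim executors safely when their shuffle data is tracked or held by an external shuffle service. The built-in settings below sketch that baseline; the executor counts are illustrative, not a recommendation.

```shell
# Built-in Spark dynamic allocation settings (min/max values are illustrative).
# With an external or remote shuffle service holding the shuffle data,
# idle executors can be released without losing intermediate results.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  --conf spark.shuffle.service.enabled=true \
  your-app.jar
```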
What are the pitfalls of adopting RSS in a Spark application?
- Speculative execution cannot work. In most use cases this feature provides good fault tolerance in environments with uneven resource utilization. Current cloud environments, however, provide homogeneous and dedicated compute resources for Spark applications, so disabling this feature does not make a big difference.
- The maintenance of RSS. Introducing this component obviously brings an additional maintenance burden; in particular, the volume of intermediate shuffle data and its connection model are hard to predict. However, these issues exist with the previous shuffle method as well, and connection instability and performance penalties are more easily exposed in a centralized remote shuffle service.
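If speculation has been enabled elsewhere (for example in shared cluster defaults), it can be switched off explicitly when running on RSS. `spark.speculation` is a standard Spark setting, and it is false by default:

```shell
# Disable speculative task re-execution, which conflicts with a remote
# shuffle service because two attempts of the same task would push
# duplicate shuffle data to the shared shuffle servers.
spark-submit \
  --conf spark.speculation=false \
  your-app.jar
```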
- Magnet: A Scalable and Performant Shuffle Architecture for Apache Spark (LinkedIn’s solution): https://engineering.linkedin.com/blog/2020/introducing-magnet
- Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service (Uber’s solution): https://databricks.com/session_na20/zeus-ubers-highly-scalable-and-distributed-shuffle-as-a-service
- Use remote storage for persisting shuffle data (Spark community): https://issues.apache.org/jira/browse/SPARK-25299
- Google Dremel’s architecture: http://www.vldb.org/pvldb/vol13/p3461-melnik.pdf