Overview: A Priori Data Unification

Apache Spark is an open source cluster computing engine for large-scale data processing. Using an advanced DAG (Directed Acyclic Graph) execution engine that supports acyclic data flow and in-memory computing, Spark is reported to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Beyond its so-called “lightning-speed” programming model, Apache Spark has emerged as the de facto framework for big data analytics. With its extensive library of machine learning algorithms, Spark enables data scientists to run their pipelines with greater performance and scalability than was previously possible.

However, complaints about Spark’s out-of-memory errors have continued to surface, especially when dealing with disparate data sources and multiple join conditions. Under such circumstances, Spark’s memory utilization, which depends on how the data is partitioned, rises sharply, and the platform quickly runs out of memory.

SparkPLUS offers a simple solution to this problem. By essentially unifying data a priori, SparkPLUS frees up the Spark platform for the high speed computational and analytical functions that it was designed to fulfill.

How does Apache Spark work?
Spark, developed by Matei Zaharia at UC Berkeley’s AMPLab in 2009, was open sourced in 2010. The project was donated to the Apache Software Foundation in 2013.
Since then, Apache Spark has quickly gained recognition from the data community. While as fault-tolerant and scalable as Hadoop, the Spark platform uses a multi-stage model that is less rigid than the sequential map and reduce processes that define Hadoop. As a result, Spark is faster and, arguably, easier to use.

Figure 1: Spark RDD Operations

Out-Of-Memory Problems

As more enterprises have sought to introduce Spark into their data environments, frustration over jobs failing for lack of memory has grown correspondingly. An often-repeated complaint is that applications run out of heap space because the executors processing Spark partitions require more memory than they were assigned.

It is not unusual to find analysts complaining about out-of-memory failures for file sizes of just 250 MB on Spark clusters containing more than 20 worker nodes. The reality is that Spark can be a memory guzzler, because both execution (shuffles, joins, sorts and aggregations) and storage (cached data) consume memory. The two share a common memory pool, so the less memory is used for execution, the more is left over for storage.
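As a rough illustration of the shared pool, the sketch below estimates how an executor heap is divided. It assumes the documented defaults for spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5) and a reserved allowance of 300 MB; the function name and the 4 GB heap are illustrative assumptions, not measurements from any particular cluster.

```python
# Rough sketch of how an executor heap is divided under Spark's unified
# memory manager. The defaults mirror the documented defaults of
# spark.memory.fraction and spark.memory.storageFraction; the function
# name and the 4096 MB heap are illustrative assumptions.

RESERVED_MB = 300  # memory set aside before the shared pool is sized


def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return (pool, storage_share, execution_share) in MB."""
    usable = heap_mb - RESERVED_MB
    pool = usable * memory_fraction        # shared by execution and storage
    storage = pool * storage_fraction      # portion where cached blocks are protected
    execution = pool - storage             # remainder initially available to execution
    return pool, storage, execution


pool, storage, execution = unified_memory_mb(4096)
print(f"pool={pool:.0f} MB, storage={storage:.0f} MB, execution={execution:.0f} MB")
```

Under these assumptions, well under 60% of a 4 GB heap ends up in the shared pool, which helps explain why jobs can fail even when the executor appears generously sized.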

To reduce execution memory drain, it is critical to partition optimally, because Java and Scala features, especially collection classes and wrapper objects, can add significant overhead to RDDs. A single 10-character Java string can consume as much as 60 bytes of memory. It is also important to avoid a skewed distribution of data across partitions.
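One way to check for skew before running a job is to simulate how keys would spread across partitions. The sketch below is a plain Python approximation: it uses a CRC32 hash as a stand-in for Spark's own partitioner, and the key names, record counts and partition count are hypothetical.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning.

    zlib.crc32 is a deterministic stand-in for Spark's partitioner hash.
    """
    counts = Counter(zlib.crc32(k.encode("utf-8")) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical data: one "hot" customer accounts for most of the records.
keys = ["cust-0"] * 9000 + [f"cust-{i}" for i in range(1, 1001)]
sizes = partition_sizes(keys, num_partitions=8)
print(sizes, round(skew_ratio(sizes), 2))
```

A ratio far above 1.0 suggests that one executor would receive most of the data and is the likeliest to exhaust its memory.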

While troubleshooting guides provide a variety of workarounds to overcome high memory consumption, one of the easiest is to pre-aggregate data before it is processed in Spark. In this way, it is possible to avoid aggregation operations like cogroup and join, which trigger highly memory-intensive shuffles. Other suggestions have drawbacks of their own. Accumulators, for example, enable Spark programs to aggregate the results of a distributed execution, but they require data to be passed to the driver node, which increases the memory pressure on the driver.
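The effect of pre-aggregation can be sketched without Spark at all. In the plain Python example below, the table contents and function names are hypothetical; the point is only that collapsing fact rows to one row per key shrinks the input to any later join.

```python
from collections import defaultdict

# Hypothetical raw fact rows: (customer_id, amount)
transactions = [
    ("c1", 10.0), ("c1", 5.0), ("c2", 7.5),
    ("c1", 2.5), ("c3", 1.0), ("c2", 2.5),
]
# Hypothetical dimension rows: customer_id -> segment
customers = {"c1": "retail", "c2": "corporate", "c3": "retail"}


def pre_aggregate(rows):
    """Collapse the fact rows to one total per key before any join."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


def join_with_dimension(totals, dimension):
    """Join the already-reduced rows to the dimension table."""
    return [(key, dimension[key], total)
            for key, total in sorted(totals.items()) if key in dimension]


totals = pre_aggregate(transactions)
joined = join_with_dimension(totals, customers)
print(f"{len(transactions)} raw rows reduced to {len(totals)} rows before the join")
print(joined)
```

On real data the reduction is typically far larger than in this toy example, since many fact rows usually share each key.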

SparkPLUS: Separating Spark’s data aggregation and analytical functions

Ultimately, there are few options as elegant as SparkPLUS. The SparkPLUS solution leverages the integration of Percipient’s UniConnect platform with Apache Spark. Using a proprietary in-memory data access layer, disparate data sources can be pre-joined via standard ANSI SQL. The aggregated data is then delivered to the Spark cluster via a JDBC connector. This essentially bypasses the technical challenges of partitioning, so no memory configuration tuning is required. And by allowing the UniConnect platform to perform the “heavy lifting” of data aggregation, the SparkPLUS solution frees up the Spark platform for the high-speed computational functions that it was designed to fulfill.
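The general pattern of joining in SQL and handing over only the unified result can be sketched as follows. This is not UniConnect's interface, which is not described here: an in-memory SQLite database stands in for the external data access layer, and the table and column names are hypothetical. In a real deployment the result set would be read by Spark over JDBC rather than fetched into a Python list.

```python
import sqlite3

# Stand-in only: an in-memory SQLite database plays the role of the
# external data access layer. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, branch TEXT);
    CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, account_id INTEGER, notional REAL);
    INSERT INTO accounts VALUES (1, 'SG'), (2, 'HK');
    INSERT INTO trades VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 25.0);
""")

# The join and aggregation run in SQL, outside the analytics engine.
PRE_JOIN_SQL = """
    SELECT a.branch, COUNT(*) AS trade_count, SUM(t.notional) AS total_notional
    FROM trades AS t
    JOIN accounts AS a ON a.account_id = t.account_id
    GROUP BY a.branch
    ORDER BY a.branch
"""

# Only the small, already-joined result is passed on for analysis.
unified_rows = conn.execute(PRE_JOIN_SQL).fetchall()
conn.close()
print(unified_rows)
```

Because the analytics engine receives one flat, pre-joined table, it has no join of its own to shuffle.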

The SparkPLUS solution has several other advantages. With more memory available, concurrency across a large number of users also becomes possible; limited concurrency remains a stumbling block when putting Spark to use at an enterprise level. Because it accesses multiple data sources directly, SparkPLUS also reduces the need to persist the resulting aggregated data in a separate store before making it available to the Spark engine.

With the memory issues addressed, Spark’s machine learning library, MLlib, can be leveraged to solve advanced analytical problems at high speed and scale. But SparkPLUS takes Spark to a new level by ensuring an efficient and seamless end-to-end solution, from data ingestion to unification, analytics and visualisation. Despite these heightened capabilities, the learning curve is short. With SparkR and UniConnect, users can continue to use familiar R and SQL syntax without the pain of learning Scala, the language in which Spark itself is written.

SparkPLUS opens the door to Spark for every enterprise, big and tech-savvy, or small but intelligence-hungry.

Percipient Partners Pte. Ltd. #09-04, LATTICE80, 80 Robinson Road, Singapore 068898


© 2017, Percipient Partners Pte. Ltd., All Rights Reserved. Reproduction Prohibited.