1. Overview

    GridGain is Java-based middleware for in-memory processing of big data in a distributed environment. It is based on high performance in-memory data platform that integrates fast In-Memory MapReduce implementation with In-Memory Data Grid technology delivering easy to use and easy to scale software. Using GridGain you can process terabytes of data, on 1000s of nodes in under a second. 

    GridGain typically resides between business, analytics, transactional or BI applications and long term data storage such as RDBMS, ERP or Hadoop HDFS, and provides in-memory data platform for high performance, low latency data storage and processing.

    Both, GridGain and Hadoop, are designed for parallel processing of distributed data. However, both products serve very different goals and in most cases are very complementary to each other. Hadoop is mostly geared towards batch-oriented offline processing of historical and analytics payloads where latencies and transactions don't really matter, while GridGain is meant for real-time in-memory processing of both transactional and non-transactional live data with very low latencies. To better understand where each product really fits, let us compare some main concepts of each product.

    GridGain In-Memory Compute Grid vs Hadoop MapReduce

    MapReduce is a programming model developed by Google for processing large data sets of data stored on disks. Hadoop MapReduce is an implementation of such model. The model is based on the fact that data in a single file can be distributed across multiple nodes and hence the processing of those files has to be co-located on the same nodes to avoid moving data around. The processing is based on scanning files record by record in parallel on multiple nodes and then reducing the results in parallel on multiple nodes as well. Because of that, standard disk-based MapReduce is good for problem sets which require analyzing every single record in a file and does not fit for cases when direct access to a certain data record is required. Furthermore, due to offline batch orientation of Hadoop it is not suited for low-latency applications. 

    GridGain In-Memory Compute Grid (IMCG) on the other hand is geared towards in-memory computations and very low latencies. GridGain IMCG has its own implementation of MapReduce which is designed specifically for real-time in-memory processing use cases and is very different from Hadoop one. Its main goal is to split a task into multiple sub-tasks, load balance those sub-tasks among available cluster nodes, execute them in parallel, then aggregate the results from those sub-tasks and return them to user.
    Splitting tasks into multiple sub-tasks and assigning them to nodes is the mapping step and aggregating of results is reducing step. However, there is no concept of mandatory data built in into this design and it can work in the absence of any data at all which makes it a good fit for both, stateless and state-full computations, like traditional HPC. In cases when data is present, GridGain IMCG will also automatically colocate computations with the nodes where the data is to avoid redundant data movement.

    It is also worth mentioning, that unlike Hadoop, GridGain IMCG is very well suited for processing of computations which are very short-lived in nature, e.g. below 100 milliseconds and may not require any mapping or reducing.

    Here is a simple Java coding example of GridGain IMCG which counts number of letters in a phrase by splitting it into multiple words, assigning each word to a sub-task for parallel remote execution in the map step, and then adding all lengths receives from remote jobs in reduce step.

        int letterCount = g.reduce(
            BALANCE,
            // Mapper
            new GridClosure<String, Integer>() {
                @Override public Integer apply(String s) {
                    return s.length();
                }
            },
            Arrays.asList("GridGain Letter Count".split(" ")),
            // Reducer
            F.sumIntReducer()
        ));
    

    GridGain In-Memory Data Grid vs Hadoop Distributed File System

    Hadoop Distributed File System (HDFS) is designed for storing large amounts of data in files on disk. Just like any file system, the data is mostly stored in textual or binary formats. To find a single record inside an HDFS file requires a file scan. Also, being distributed in nature, to update a single record within a file in HDFS requires copying of a whole file (file in HDFS can only be appended). This makes HDFS well-suited for cases when data is appended at the end of a file, but not well suited for cases when data needs to be located and/or updated in the middle of a file. With indexing technologies, like HBase or Impala, data access becomes somewhat easier because keys can be indexed, but not being able to index into values (*secondary indexes*) only allow for primitive query execution.

    GridGain In-Memory Data Grid (IMDG) on the other hand is an in-memory key-value data store. The roots of IMDGs came from distributed caching, however GridGain IMDG also adds transactions, data partitioning, and SQL querying to cached data. The main difference with HDFS (or Hadoop ecosystem overall) is the ability to transact and update any data directly in real time. This makes GridGain IMDG well suited for working on operational data sets, the data sets that are currently being updated and queried, while HDFS is suited for working on historical data which is constant and will never change. 

    Unlike a file system, GridGain IMDG works with user domain model by directly caching user application objects. Objects are accessed and updated by key which allows IMDG to work with volatile data which requires direct key-based access.

    GridGain IMDG allows for indexing into keys and values (i.e. primary and secondary indices) and supports native SQL for data querying & processing. One of unique features of GridGain IMDG is support for distributed joins which allow to execute complex SQL queries on the data in-memory without limitations.

    GridGain and Hadoop Working Together

    To summarize:
    Hadoop essentially is a Big Data warehouse which is good for batch processing of historic data that never changes, while GridGain, on the other hand, is an In-Memory Data Platform which works with your current operational data set in transactional fashion with very low latencies. Focusing on very different use cases make GridGain and Hadoop very complementary with each other.

    Up-Stream Integration

    The diagram above shows integration between GridGain and Hadoop. Here we have GridGain In-Memory Compute Grid and Data Grid working directly in real-time with user application by partitioning and caching data within data grid, and executing in-memory computations and SQL queries on it. Every so often, when data becomes historic, it is snapshotted into HDFS where it can be analyzed using Hadoop MapReduce and analytical tools from Hadoop eco-system. 

    Down-Stream Integration

    Another possible way to integrate would be for cases when data is already stored in HDFS but needs to be loaded into IMDG for faster in-memory processing. For cases like that GridGain provides fast loading mechanisms from HDFS into GridGain IMDG where it can be further analyzed using GridGain in-memory Map Reduce and indexed SQL queries.

    Conclusion

    Integration between an in-memory data platform like GridGain and disk based data platform like Hadoop allows businesses to get valuable insights into the whole data set at once, including volatile operational data set cached in memory, as well as historic data set stored in Hadoop. This essentially eliminates any gaps in processing time caused by Extract-Transfer-Load (ETL) process of copying data from operational system of records, like standard databases, into historic data warehouses like Hadoop. Now data can be analyzed and processed at any point of its lifecycle, from the moment when it gets into the system up until it gets put away into a warehouse.


    1

    View comments


  2. GridGain 4.3.1 service release includes several important bug fixes and host of new optimizations. It is 100% backward compatible and it is highly recommended update for anyone running production systems on 4.x code line.

    Details

    DateNovember 10th, 2012
    Version4.3.1e
    Build10112012

    New Features and Enhancements

      • Added remove operation to data loader
      • Significantly improved performance of partition to node mapping
      • Added GridSerializationBenchmark for comparing performance of Java, Kryo, and GridGain serialization
      • Added property-based configuration to remote clients
      • Optimized concurrency for asynchronous methods in C++ client
      • Removed support for Groovy++ DSL Grover

    Core Bug Fixes

      • Unmarshalling of SimpleDateFormat fails with NPE
      • Possible NPE in Indexing Manager when using distributed data structures
      • Swap partition iterator skips entries if off-heap iterator is empty
      • `GridDataLoader` does not allow to cache primitive arrays
      • Excessive memory consumption in indexing SPI
      • Add check on startup that GridOptimizedMarshaller is supported by running JDK version
      • If ordered message is timed out, other messages for the same topic may not be processed
      • ScalarPiCalculationExample does not provide correct estimate for PI

    Client Connectivity Bug Fixes

      • Client router with explicit default configuration leads to NPE.
      • Repair REST client support to make session token and client ID optional
      • Ping does not work properly in C++ client

    Visor Management Bug Fixes

      • Clear and Compact operations in Visor do not account for node selection
      • Move Visor management tasks into a separate thread pool
      • Preload dialog in Visor does not show correct number of keys
      • GC dialog in Visor waits indefinitely for dead nodes
      • Increase tooltip dismiss time in Visor
      • Visor log search does not show nodes table correctly on Windows
    0

    Add a comment


  3. In-memory processing has been a pretty hot topic lately. Many companies that historically would not have considered using in-memory technology because it was cost prohibitive are now changing their core systems’ architectures to take advantage of the low-latency transaction processing that in-memory technology offers.  This is a consequence of the fact that the price of RAM is dropping significantly and rapidly and as a result, it has become economical to load the entire operational dataset into memory with performance improvements of over 1000x faster. In-Memory Compute and Data Grids provide the core capabilities of an in-memory architecture.

    The goal of In-Memory Data Grids (IMDG) is to provide extremely high availability of data by keeping it in memory and in highly distributed (i.e. parallelized) fashion. By loading Terabytes of data into memory IMDGs are able to support most of the Big Data processing requirements today.

    At a very high level, an IMDG is a distributed object store similar in interface to a typical concurrent hash map. You store objects with keys. Unlike traditional systems where keys and values are often limited to byte arrays or strings, with IMDGs you can use any domain object as either value or key. This gives tremendous flexibility by allowing you to keep exactly the same object your business logic is dealing with in the Data Grid without the extra step of marshaling and de-marshaling that alternative technologies would require. It also simplifies the usage of your data grid as you can, in most cases, interface with the distributed data store as you do with a simple hash map. Being able to work with domain objects directly is one of the main differences between IMDGs and In-Memory Databases (IMDB). In the case of the latter, users still need to perform Object-To-Relational Mapping which typically adds significant performance overhead.

    There are also some other features in IMDGs that distinguish them from other products, such as NoSql databases, IMDBs,  or NewSql databases. One of the main differences is truly scalable Data Partitioning across the cluster. Essentially IMDGs in their purest form can be viewed as distributed hash maps with every key cached on a particular cluster node - the bigger the cluster, the more data you can cache. The trick to this architecture is to make sure that you collocate your processing with the cluster nodes where data is cached to make sure that all cache operations become local and that there is no (or minimal) data movement within the cluster. In fact, when using well designed IMDGs, there should be absolutely no data movement on stable topologies. The only time when some of the data is moved is when new nodes join in or some existing nodes leave the cluster, thus causing some data repartitioning within the cluster.

    The picture below shows a classic IMDG with a key set of {k1, k2, k3} where each key belongs to a different node. The external database component is optional. If present, then IMDGs will usually automatically read data from the database or write data to it.

    Another distinguishing characteristic of IMDGs is Transactional ACID support. Generally a 2-phase-commit (2PC) protocol is used to ensure data consistency within cluster. Different IMDGs will have different underlying locking mechanisms, but usually more advanced implementations will provide concurrent locking mechanisms (GridGain, for instance, uses MVCC - multi-version concurrency control) and reduce network chattiness to a minimum, hence guaranteeing transactional ACID consistency with very high performance.

    Data consistency is one of the main differences between IMDGs and NoSQL databases. NoSQL databases are usually designed using an Eventual Consistency (EC) approach where data is allowed to be inconsistent for a period of time as long as it will become consistent *eventually*. Generally, the writes on EC-based systems are somewhat fast, but reads are slow (or to be more precise, as fast as writes are). Latest IMDGs with an *optimized* 2PC should at least match if not outperform EC-based systems on writes, and be significantly faster on reads. It is interesting to note that the industry has made a full circle moving from  a then-slow 2PC approach to the EC approach, and now from EC to a much faster *optimized* 2PC.

    Different products provide different 2PC optimizations, but generally the purpose of all optimizations is to increase concurrency, minimize network overhead, and reduce the number of locks a transaction requires to complete. As an example, Google's distributed global database, Spanner, is based on a transactional 2PC approach simply because 2PC provided a faster and more straightforward way to guarantee data consistency and high throughput compared to MapReduce or EC.

    Even though IMDGs usually share some common basic functionality, there are many features and implementation details that are different between vendors. When evaluating an IMDG product pay attention to eviction policies, (pre)loading techniques, concurrent repartitioning, memory overhead, etc... Also pay attention to the ability to query data at runtime. Some IMDGs, such as GridGain for example, allow users to query in-memory data using standard SQL, including support for distributed joins, which is pretty rare.

    Storing data in an IMDG is only half of the functionality required for an in-memory architecture. This data must also be processed in a high-performance and parallelized manner. The typical in-memory architecture partitions data across the cluster using an IMDG, and then computations are sent to the nodes where the data is for collocated (local) execution. Since computations are usually part of Compute Grids and have to be properly deployed, load-balanced, failed-over, or scheduled, the integration between Compute Grids and IMDGs is very important. It is especially beneficial if both In-Memory Compute and Data Grids are part of the same product and utilize the same APIs which removes the burden of integration from the developer and usually renders the highest performance and reliable in-memory systems.



    IMDGs (together with Compute Grids) are used throughout a wide spectrum of industries in applications as diverse as Risk Analytics, Trading Systems, real time Fraud Detection, Biometrics, eCommerce, or Online Gaming. Essentially every project that struggles with scalability and performance can benefit from In-Memory Processing and IMDG architecture.
    9

    View comments

About me
About me
- Antoine de Saint-Exupery -
- Antoine de Saint-Exupery -
"A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."
Blog Archive
Blogs I frequent
Loading
Dynamic Views theme. Powered by Blogger.