1. We at GridGain recently ran into the following problem. It turns out (this may be old news to some) that java.util.concurrent.ExecutorService in JDK 1.6 is not backward compatible with JDK 1.5. Although backward compatibility is preserved at the binary level, backward *compilability* is broken. This means that if you implement your own ExecutorService against 1.6, the source code won't compile on 1.5, and vice versa.

    What we are implementing in our upcoming 2.0.3 release is a grid-enabled ExecutorService, where users simply submit standard Java Runnable or Callable tasks to execute them on remote grid nodes. To the user this looks no different from using a standard java.util.concurrent.ExecutorService locally, but the grid-enabled service transparently provides all the grid computing features underneath, such as fail-over, load balancing, scheduling, peer class loading, etc.

    However, it turns out that Sun made a mistake with generics in JDK5 and decided to "fix" it in JDK6. So these methods from JDK5

    invokeAll(Collection<Callable<T>> tasks)
    invokeAll(Collection<Callable<T>> tasks, long timeout, TimeUnit unit)
    invokeAny(Collection<Callable<T>> tasks)
    invokeAny(Collection<Callable<T>> tasks, long timeout, TimeUnit unit)

    now look as follows in JDK 6 (note the <? extends ...> clause)

    invokeAll(Collection<? extends Callable<T>> tasks)
    invokeAll(Collection<? extends Callable<T>> tasks, long timeout, TimeUnit unit)
    invokeAny(Collection<? extends Callable<T>> tasks)
    invokeAny(Collection<? extends Callable<T>> tasks, long timeout, TimeUnit unit)

    The weird thing is that Sun acknowledges the mistake (see bug #6267833), but what I don't like is the "JSR-166 expert group" explanation, which goes as follows:

    ...requires minor source code changes for the small set of developers who have implemented ExecutorService without inheriting the default implementations in AbstractExecutorService. The set of affected developers are developers creating sophisticated thread pool applications, putting them into the "concurrency rocket scientist" category. They will generally appreciate this change. The possible compiler error is trivial to fix in the source code.

    Well, I guess the GridGain dev team falls into the "concurrency rocket scientist" category, as we really do need to implement ExecutorService and we cannot use AbstractExecutorService, simply because it was not really designed for reuse in the first place.

    So here is my beef with Sun:

    1) Sun messed up generics so badly that even their own developers don't understand how to use them. On top of that, why in the world do I need to specify <? extends SomeInterface> for generics? What else am I going to do with an interface other than *extend* it? Well... I guess I can stare at it, but I don't think that counts.

    2) The JSR166 expert group decided that they are the smartest bunch in the world, and that people who use the JDK for purposes other than writing simple if statements and for loops are really hard to come by. So they simply decided that it is good enough for 95% of Java users, and the other 5% will appreciate their "rocket science" fix that brilliantly breaks compilation.

    3) They decided to fix it and break source code backward compatibility even though there was a clear workaround described in the same bug. People could still do the following without any warnings in the code:

    service.invokeAll(Collections.<Callable<Object>>singleton(x));

    Now, this is not fatal to us in any way. Since binary compatibility is still preserved, most of our users, who just use GridGain binaries, will never notice. But being an open source company, we ship both binaries and source code, and now our source code will not compile out of the box on both JDK 1.5 and 1.6, which I find really annoying.
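
    To make the signature difference concrete, here is a small sketch. The class and method names (InvokeAllSignatures, Answer, strict, relaxed) are made up for illustration; only invokeAll() itself is the real JDK API. It assumes JDK 6 or later, where a list of a concrete Callable subtype can be passed directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of GridGain or the JDK.
final class InvokeAllSignatures {
    /** A concrete task type, as an application would typically define one. */
    static final class Answer implements Callable<Integer> {
        public Integer call() { return 42; }
    }

    // JDK 5 style parameter: only a Collection<Callable<T>> is accepted.
    static <T> int strict(Collection<Callable<T>> tasks) { return tasks.size(); }

    // JDK 6 style parameter: a collection of any Callable<T> subtype is accepted.
    static <T> int relaxed(Collection<? extends Callable<T>> tasks) { return tasks.size(); }

    /** Runs a List<Answer> through the real invokeAll() (JDK 6 or later). */
    static List<Integer> runAll() {
        List<Answer> tasks = Arrays.asList(new Answer(), new Answer());

        // strict(tasks);  // would not compile: List<Answer> is not a Collection<Callable<T>>
        relaxed(tasks);    // compiles: T is inferred as Integer

        ExecutorService exec = Executors.newFixedThreadPool(2);
        try {
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> f : exec.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            // Wrap checked exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll()); // prints [42, 42]
    }
}
```

    The catch for implementors is that a class implementing ExecutorService must declare exactly the parameter type of the interface it compiles against, so one source file cannot satisfy both the JDK 5 and the JDK 6 declaration.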

     


  2. It is no secret that automatic fail-over in distributed environments is no picnic to implement. Here are some useful pointers if you ever decide to do it on your own:

    1. Make sure to implement some sort of heartbeat protocol. A heartbeat is a message that every node emits to tell others that it's alive. It is usually implemented with IP Multicast; however, the actual communication protocol is not important here. Other nodes will consider a node failed after it misses a certain pre-configured number of heartbeats.
    2. Account for delays in node discovery. There is always a time window between an actual node crash and when other nodes find out about it.
    3. Store all messages on the sender node until they get processed. This way you can fail them over to other nodes if the processing node fails.
    4. Account for the possibility of receiving multiple notification events for the same node failure - you don't want to process the same fail-over event more than once.
    5. Make sure that your message does not get failed-over forever, i.e. keeps jumping between grid nodes indefinitely. After a certain number of fail-over attempts, let the whole processing of the message fail.
    6. Make sure that your message does not get failed-over to the same node it failed on initially - always give preference to other grid nodes.
    7. Make sure that message failure is not limited to node crashes. For example, you may want to fail over a message if it threw an exception on a remote node or returned a bad result.
    8. Avoid sending any messages within synchronization blocks - this is a sure way to introduce deadlocks into your code.
    9. Make sure that fail-over happens automatically at infrastructure level and is transparent to your application logic.
    10. Provide a good interface for your Failover module and make it pluggable - failover logic, such as selecting a new node, may differ based on your application policy, so it is essential to be able to easily switch underlying implementation.
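
    Points 1 and 4 above can be sketched roughly as follows. This is only an illustration, not GridGain code: HeartbeatMonitor, its methods and the use of plain string node IDs are all assumptions, and timestamps are passed in explicitly so the logic does not depend on any transport or clock.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical failure detector; names and structure are illustrative only.
final class HeartbeatMonitor {
    private final long intervalMs;   // expected time between heartbeats
    private final int maxMissed;     // missed heartbeats before a node is considered failed
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();
    private final Set<String> failed = new HashSet<String>();

    HeartbeatMonitor(long intervalMs, int maxMissed) {
        this.intervalMs = intervalMs;
        this.maxMissed = maxMissed;
    }

    /** Records a heartbeat; a node that was marked failed is treated as alive again. */
    synchronized void onHeartbeat(String nodeId, long nowMs) {
        lastSeen.put(nodeId, nowMs);
        failed.remove(nodeId);
    }

    /**
     * Returns nodes that are newly considered failed at the given time.
     * Each failure is reported once, so callers do not process the same
     * fail-over event twice. The list is returned rather than dispatched
     * here, so any messages get sent outside the synchronized block.
     */
    synchronized List<String> detectFailures(long nowMs) {
        List<String> newlyFailed = new ArrayList<String>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            long missed = (nowMs - e.getValue()) / intervalMs;
            if (missed >= maxMissed && failed.add(e.getKey())) {
                newlyFailed.add(e.getKey());
            }
        }
        return newlyFailed;
    }
}
```

    A caller would invoke detectFailures() on a timer and trigger fail-over for each returned node ID.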
    Of course you could always download GridGain and get all of the above right out of the box ;-)
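
    For those who do want to roll their own, points 5 and 6 (bounded retries and preferring nodes the message has not failed on yet) might look like the sketch below. FailoverPolicy and nextNode are hypothetical names, and nodes are again plain strings.

```java
import java.util.List;

// Hypothetical node selection policy; not the GridGain failover SPI.
final class FailoverPolicy {
    private final int maxAttempts;

    FailoverPolicy(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Picks the node a failed message should be sent to next.
     *
     * @param aliveNodes nodes currently believed to be alive
     * @param failedOn   nodes the message has already failed on, one entry per attempt
     * @return the next node, or null if the message should fail for good
     */
    String nextNode(List<String> aliveNodes, List<String> failedOn) {
        // Stop the message from jumping between nodes indefinitely.
        if (failedOn.size() >= maxAttempts || aliveNodes.isEmpty()) {
            return null;
        }
        // Prefer a node the message has not failed on yet.
        for (String node : aliveNodes) {
            if (!failedOn.contains(node)) {
                return node;
            }
        }
        // Every alive node was already tried; fall back to the first one.
        return aliveNodes.get(0);
    }
}
```

    Returning null here stands for "let the whole processing of the message fail"; a real implementation would more likely throw or return a dedicated result type.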

     


  3. What does fail-over in a distributed grid or cluster environment really mean? In the standard notion, users expect their data or logic to automatically fail over to a new available grid node in case of a node crash. But is this really enough? What if, for example, a grid node is still alive but does not have the resources available to process your job? What if I/O on that node is too slow or a database connection is not available? Also, the result of a computation can be application specific. If a computation throws an exception, depending on application logic it may or may not be worthwhile to retry the same computation on another node.

    The correct approach is to allow users to control their fail-over logic whenever custom behavior is needed. In GridGain, in addition to the fairly rich standard fail-over policy provided out of the box, we have two pluggability points where a user can plug in custom fail-over behavior - one for the overall application fail-over policy, and another for every individual computation.

    The application-specific behavior is provided via GridFailoverSpi (GridGain uses SPIs, Service Provider Interfaces, as plugins into any kernel-level functionality). A user simply has to implement the failover method of the failover SPI interface. Here is a very simple example of fail-over logic that picks another available node using the underlying load balancer:

    public class MyFailoverSpi extends GridSpiAdapter
        implements GridFailoverSpi {
        ...
        /**
         * This logic handles fail-over of a computation
         * job from one node to another.
         */
        public GridNode failover(
            GridFailoverContext ctx,
            List topology) {
            GridJobResult failedResult = ctx.getJobResult();

            List newTopology = new ArrayList(topology);

            // Remove failed node from topology to
            // avoid retries on the same node.
            newTopology.remove(failedResult.getNode());

            // Delegate to load balancing.
            return ctx.getBalancedNode(newTopology);
        }
        ...
    }

    The computation-specific behavior is overridden at the computation logic level. In GridGain a computation unit, GridTask, is responsible for splitting your logic into smaller sub-computations, GridJobs, assigning them to remote nodes and then aggregating job results into one task result (GridTask is our main MapReduce abstraction). Here is an example of how a GridTask can decide that a job should be failed over to another node:

    public class MyGridTask
        extends GridTaskSplitAdapter {
        public List split(int gridSize, Object arg) {
            ...
        }

        /**
         * Callback for every job result that came
         * from remote grid nodes.
         */
        public GridJobResultPolicy result(
            GridJobResult result,
            List receivedResults) {
            if (result.getData().equals(someBadResult)) {
                // Delegate to failover SPI to pick
                // another node.
                return GridJobResultPolicy.FAILOVER;
            }

            // Wait for other results to come in, or
            // reduce if all results have arrived.
            return GridJobResultPolicy.WAIT;
        }

        public Object reduce(List allResults) {
            ...
        }
    }

    Enjoy!


  4. Let me ask you - how many grid computing products do you know that provide runtime statistics of the grid via a regular API? I assume not many (if any). Some grid products I know don't even expose their grid topology - they treat the cluster as one black box.

    Well, in GridGain we have a concept of Node Metrics which provide almost real-time information about activity on every grid node. These metrics include current and average values for CPU utilization, heap, thread stats, job execution time (current and average), number of running/rejected/cancelled jobs, size of waiting queue, job wait time, total number of executed jobs and a lot more useful node runtime information.

    So why do we do that, you may ask? The answer is simple - this data is very useful when you really want fine-grained control over how your jobs are distributed across grid nodes. For example, what if you want to segment your grid based on CPU utilization and execute your jobs only on nodes with CPU load under 50%? Or what if you need to adapt to average CPU load or job execution time in order to send more jobs to the nodes that can process your computations faster?

    In fact, our Adaptive Load Balancing SPI does just that. On top of providing several out-of-the-box implementations, we allow users to plug any custom adaptive behavior suitable for their applications. Here is how simple it is to implement a policy that adapts to job processing time and returns a near-real-time node's load score on top of GridGain (note that we use Node Metrics to detect current and average job processing time):

    public class GridAdaptiveProcessingTimeLoadProbe
        implements GridAdaptiveLoadProbe {
        ...
        /**
         * Returns node's load score
         * based on job execution time.
         */
        public double getLoad(
            GridNode node,
            int jobsSentSinceLastUpdate) {
            // Obtain node metrics.
            GridNodeMetrics metrics = node.getMetrics();

            if (useAverageMetrics) {
                // Use average metrics data.
                return
                    metrics.getAverageJobExecuteTime() +
                    metrics.getAverageJobWaitTime();
            }

            // Return current metrics score.
            return
                metrics.getCurrentJobExecuteTime() +
                metrics.getCurrentJobWaitTime();
        }
        ...
    }
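
    To show how such a score could be combined with the "CPU load under 50%" segmentation mentioned earlier, here is a standalone sketch. It does not use the GridGain API: NodeStats and LeastLoadedPicker are hypothetical stand-ins for node metrics and a load balancer, and the score is simply average execution time plus average wait time.

```java
import java.util.List;

// Hypothetical snapshot of per-node metrics; not a GridGain type.
final class NodeStats {
    final String id;
    final double cpuLoad;    // 0.0 to 1.0
    final double avgExecMs;  // average job execution time
    final double avgWaitMs;  // average job wait time

    NodeStats(String id, double cpuLoad, double avgExecMs, double avgWaitMs) {
        this.id = id;
        this.cpuLoad = cpuLoad;
        this.avgExecMs = avgExecMs;
        this.avgWaitMs = avgWaitMs;
    }

    /** Lower score means the node is expected to process jobs faster. */
    double score() {
        return avgExecMs + avgWaitMs;
    }
}

// Hypothetical selection logic built on top of the score.
final class LeastLoadedPicker {
    /** Returns the lowest-scoring node below the CPU threshold, or null if none qualifies. */
    static NodeStats pick(List<NodeStats> nodes, double maxCpuLoad) {
        NodeStats best = null;
        for (NodeStats n : nodes) {
            if (n.cpuLoad >= maxCpuLoad) {
                continue; // segment the grid: skip busy nodes
            }
            if (best == null || n.score() < best.score()) {
                best = n;
            }
        }
        return best;
    }
}
```

    Calling LeastLoadedPicker.pick(nodes, 0.5) restricts the choice to nodes with CPU load under 50% and then prefers the node with the shortest combined execution and wait time.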

    You can download GridGain here. Enjoy grid computing!


  5. There has been a lot of fuss lately about Google App Engine and how it is going to compete with Amazon EC2. Well, the answer is simple - it does not! And judging by its current limitations, I am not sure it ever will.

    Let me say first and foremost that I am a huge fan of on-demand cloud computing. I believe that once cloud computing matures, it will make little sense for businesses to run anything locally or in standard data centers - why pay for boxes while you are not using them? For example, at GridGain we constantly run our builds and JUnit tests using Bamboo. We have 3 boxes deployed in a data center that constantly execute builds and tests in parallel (we use our own Distributed JUnit support for this, which cuts execution time from 1 hour to 17 minutes). Once we migrate to Amazon EC2, our costs will be cut by approximately half.

    So, now about Google App Engine. First of all, it currently supports only Python. I can see how support for Python is important, but why paint the whole world one color? Rumor has it that other languages will be added soon, but I still don't get the point of releasing a beta version only for Python users. Amazon EC2 is also still in beta, yet it supports everything and the kitchen sink too. Google is actually known for announcing their beta releases too soon (perhaps they should call them alphas). Take a look at Android, for example, which is so buggy right now that you can't even run anything modestly serious on it.

    Secondly, external communication with Google App Engine instances is supported only via HTTP or HTTPS. This is a very serious limitation from Google, as they are basically implying that this platform is only good enough for websites. Even if they did support Java right now, the fact that no one can connect to GAE instances over normal TCP can be a show stopper for many enterprises.

    So, to summarize, the Google App Engine beta release is a plain vanilla Python web hosting data center on steroids. Ironically enough, they could have used Amazon EC2 to implement this. Simply create EC2 images with a Python runtime on them and you are good to go.

    As an open source Java grid computing shop, we are forced to wait for the "gamma" release to try GridGain on Google App Engine. But judging by what I am seeing in the beta, I am not overly optimistic. I guess I will have to keep my fingers crossed.

  6. We recently were faced with a problem - how to make our toString() methods refactor-safe. Until recently we were using simple toString() plugins for IDEA and Eclipse (Jutils) which generated the toString() method automatically based on class fields. The developer would then have to tweak the generated code to remove fields that should not be included.

    However, faced with numerous support questions, we noticed that sometimes during refactoring a developer would forget to add a new field to an existing toString() method, or would print out too much and clutter up the log. Surprisingly, there is no open source library that supports this basic functionality (ToStringBuilder from Apache is not even close), so we had to implement our own.

    So, to summarize, the functionality we needed is this:
    • Make sure that new fields are automatically included.
    • Make sure that certain classes, like Object, Collection, Array are automatically excluded.
    • Provide class-level overrides of default rules, which will include auto-excluded fields and vice versa.
    • Provide support for custom ordering of fields in toString() output.
    Here is the design we came up with:

    @ToStringInclude Annotation
    This annotation can be attached to any field in the class to make sure that it is automatically included even if it is excluded by default.

    @ToStringExclude Annotation
    This annotation can be attached to any field in the class to make sure that it is automatically excluded even if it is included by default.

    @ToStringOrder(int) Annotation
    This annotation provides custom ordering of class fields. Fields with a smaller order value come earlier in the toString() output. By default the order is the same as the order of field declarations in the class.

    ToStringBuilder Class
    This class is responsible for reflectively parsing all fields in class hierarchy, caching all annotations for performance reasons, and properly outputting toString() content.

    So, here is an example of a class that uses this simple framework:

    public class MySimpleClass {
        /**
         * This field would be included by
         * default, but is excluded due to
         * @ToStringExclude annotation.
         */
        @ToStringExclude
        private int intField = 1;

        /**
         * This field will be included
         * first for toString() purposes.
         */
        private String strField = "TestString";

        /**
         * This array field would be excluded
         * by default, but is included due to
         * @ToStringInclude annotation.
         */
        @ToStringInclude
        private int[] intArr = new int[] { 1, 2, 3 };

        /**
         * This field is excluded by default.
         */
        private Object obj = new Object();

        /**
         * Generic toString() implementation.
         */
        @Override
        public String toString() {
            return ToStringBuilder.
                toString(MySimpleClass.class, this);
        }
    }

    The toString() output of the class above will look as follows:

    MySimpleClass [strField=TestString, intArr={1,2,3}]
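
    For readers curious how a builder like this can work, here is a much-simplified standalone sketch. It is not the GridGain implementation: the names SketchInclude, SketchExclude, SketchToString and SketchDemo are made up, there is no ordering annotation, no class hierarchy walk and no caching, and arrays are printed with Arrays.toString(). Because reflection does not guarantee the order of declared fields, the sketch sorts fields by name to keep the output deterministic.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical annotations; they only mirror the idea described above.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface SketchInclude {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
@interface SketchExclude {}

final class SketchToString {
    /** Builds "ClassName [field=value, ...]" from the object's own declared fields. */
    static String toString(Object obj) {
        Class<?> cls = obj.getClass();
        List<Field> fields = new ArrayList<Field>(Arrays.asList(cls.getDeclaredFields()));
        // Sort by name, since declaration order is not guaranteed by reflection.
        Collections.sort(fields, new Comparator<Field>() {
            public int compare(Field a, Field b) { return a.getName().compareTo(b.getName()); }
        });
        StringBuilder sb = new StringBuilder(cls.getSimpleName()).append(" [");
        String sep = "";
        for (Field f : fields) {
            if (f.isSynthetic() || Modifier.isStatic(f.getModifiers()) || !isIncluded(f)) {
                continue;
            }
            f.setAccessible(true);
            try {
                sb.append(sep).append(f.getName()).append('=').append(format(f.get(obj)));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
            sep = ", ";
        }
        return sb.append(']').toString();
    }

    /** Annotations win; otherwise Object, array and Collection fields are skipped. */
    private static boolean isIncluded(Field f) {
        if (f.isAnnotationPresent(SketchExclude.class)) return false;
        if (f.isAnnotationPresent(SketchInclude.class)) return true;
        Class<?> t = f.getType();
        return !(t == Object.class || t.isArray() || Collection.class.isAssignableFrom(t));
    }

    private static String format(Object value) {
        return value instanceof int[] ? Arrays.toString((int[]) value) : String.valueOf(value);
    }
}

final class SketchDemo {
    @SketchExclude private int intField = 1;
    private String strField = "TestString";
    @SketchInclude private int[] intArr = { 1, 2, 3 };
    private Object obj = new Object();

    @Override public String toString() { return SketchToString.toString(this); }
}
```

    With this sketch, new SketchDemo().toString() yields "SketchDemo [intArr=[1, 2, 3], strField=TestString]", and a field added to the class later shows up automatically unless the default rules or @SketchExclude filter it out.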

    The complete source code is available in our public WebSVN. When clicking on this link you will be prompted with a login popup. Just enter "guest" for the username and leave the password blank. The source code is in the org.gridgain.grid.utils.toString package.

    Enjoy!



  7. Over the weekend we released GridGain 2.0.2. Apart from multiple bug fixes, the main addition in this release is support for custom class loaders for task deployment. We got this request from the Grails team and G2One, the company behind Grails, which is currently working on grid-enabling Grails applications. Grails uses a custom class loader at runtime to deploy classes, and GridGain can now deploy code through it. The basic approach is to use the default @Gridify annotation to grid-enable Grails closures.


    You can download GridGain 2.0.2 here.


  8. So, how did it all get started?

    The project was started about 3 years ago by Nikita Ivanov and myself. Having significant experience in grid computing, we were to a large extent fed up with the existing grid computing solutions. The majority of them were (and still are) either too expensive or too unusable. Just take a look at Globus, Sun Grid Engine, or Platform, for example. Globus is a command line tool permeated by a whole soup of technologies, from C++ and scripting to Java and Web Services (a good example of design by committee, by the way). Sun Grid Engine is burdened by the same plague of complexity and inflexibility, which is the main reason why Sun's utility computing initiative was dead on arrival. Platform and many other commercial products, besides being overly complex to use, are extremely expensive. Some of these products don't even let you download an evaluation copy without having to contact sales (!) Come on... in this day and age?

    And this is how GridGain was born. From the get-go we decided that the product should be Open Source, simply because we don't believe that paying up front is the right model for middleware. Middleware does not bring any revenue to customers right away. It usually involves a development cycle, then testing, and then a production push. So why should a customer pay while no money is being earned? Professional Open Source allows users of middleware products to jump-start a project at no cost and then purchase professional support when their product goes into production and starts earning. It just makes sense!

    Another major focus of GridGain was Simplicity. That is not to say that we didn't focus on advanced grid computing features, such as scalability, fail-over, scheduling, deployment, load balancing, etc... We did, of course, but we approached it so it would be very simple to use and pluggable where custom behavior was needed. Just take a look at our forums or user testimonials, and you will quickly see that most users are able to start with a quick prototype in a matter of hours.

    So now we already have 2 major releases behind us, a strong R&D team in Saint Petersburg, Russia, an impressive customer list, and a quite large and rapidly growing community that keeps us moving forward with constant feedback and feature requests. As we usually say at GridGain, grid computing should be fun, simple, and productive!
