The Grid

What are Grids?

Grids are collections of interconnected resources harnessed together in order to satisfy various needs of users. The resources may be administered by different organizations and may be distributed, heterogeneous and fault-prone. The manner in which users interact with these resources may vary widely, as may the usage policies for those resources. A grid infrastructure must manage this complexity so that users can interact with resources as easily and smoothly as possible.

Our definition, and indeed a popular definition, is: A grid system is a collection of distributed resources connected by a network. A grid system, also called a grid, gathers resources – desktop and hand-held hosts, devices with embedded processing resources such as digital cameras and phones, and tera-scale supercomputers – and makes them accessible to users and applications in order to reduce overhead and accelerate projects. A grid application can be defined as an application that operates in a grid environment or is "on" a grid system. Grid system software (or middleware) is software that facilitates writing grid applications and manages the underlying grid infrastructure.

A grid enables users to collaborate securely by sharing processing, applications and data across systems with the above characteristics, resulting in faster application execution and easier access to data. More concretely, this means being able to:

  • Find and share data. Access to remote data should be as simple as access to local data. Incidental system boundaries should be invisible to users who have been granted legitimate access.
  • Find and share applications. Many development, engineering and research efforts consist of custom applications – permanent or experimental, new or legacy, public-domain or proprietary – each with its own requirements. Users should be able to share applications with their own data sets.
  • Find and share computing resources. Providers should be able to grant access to their computing cycles to users who need them without compromising the rest of the network.

Grid Resources

The above definitions of a grid and a grid infrastructure are necessarily general. What constitutes a "resource" is a deep question, and the actions performed by a user on a resource can vary widely. For example, a traditional definition of a resource has been "machine", or more specifically "CPU cycles on a machine". The actions users perform on such a resource can be "running a job", "checking availability in terms of load", and so on. These definitions and actions are legitimate, but limiting. Today, resources can be as diverse as "biotechnology application", "stock market database" and "wide-angle telescope", with actions being "run if license is available", "join with user profiles" and "procure data from specified sector" respectively. A grid can encompass all such resources and user actions. Therefore a grid infrastructure must be designed to accommodate these varieties of resources and actions without compromising on some basic principles such as ease of use, security, autonomy, etc.
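
As a minimal sketch of this broader notion of a resource, the following Python fragment models a resource as a name plus a set of named actions. The Resource class, the perform method and the sample resources are hypothetical names chosen for illustration; they are not part of Legion or of any particular grid system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Resource:
    """A grid resource: a name plus the actions users may perform on it."""
    name: str
    actions: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def perform(self, action: str, *args: str) -> str:
        """Invoke a named action, rejecting actions the resource lacks."""
        if action not in self.actions:
            raise ValueError(f"{self.name!r} does not support {action!r}")
        return self.actions[action](*args)


# Hypothetical resources mirroring two of the examples in the text.
machine = Resource("machine", {"run job": lambda job: f"running {job}"})
telescope = Resource(
    "wide-angle telescope",
    {"procure data": lambda sector: f"data from sector {sector}"},
)

for resource, action, arg in [
    (machine, "run job", "simulation"),
    (telescope, "procure data", "7G"),
]:
    print(resource.perform(action, arg))
```

Because the infrastructure sees only names and actions, a new kind of resource can be added without changing the code that dispatches user requests.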

The resources in a grid typically share at least some of the following characteristics.

  • They are numerous;
  • They are owned and managed by different, potentially mutually-distrustful organizations and individuals;
  • They are potentially faulty;
  • They have different security requirements and policies;
  • They are heterogeneous, e.g., they have different CPU architectures, are running different operating systems, and have different amounts of memory and disk;
  • They are connected by heterogeneous, multi-level networks;
  • They have different resource management policies;
  • They are likely to be geographically separated (on a campus, in an enterprise, on a continent).

Requirements for Grids

Clearly, the minimum capability needed to develop grid applications is the ability to transmit bits from one machine to another – all else can be built from that. However, several challenges frequently confront a developer constructing applications for a grid. These challenges lead us to a number of requirements that any complete grid system must address. The designers of Legion believed and continue to believe that all of these requirements must be addressed by the grid infrastructure in order to reduce the burden on the application developer. If the system does not address these issues, then the programmer must – forcing programmers to spend valuable time on basic grid functions, thus needlessly increasing development time and costs. The requirements are:

  • Security. Security covers a gamut of issues, including authentication, data integrity, authorization (access control) and auditing. If grids are to be accepted by corporate and government IT departments, a wide range of security concerns must be addressed. Security mechanisms must be integral to applications and capable of supporting diverse policies. Furthermore, we believe that security must be built in firmly from the beginning. Trying to patch security in as an afterthought (as some systems are attempting today) is a fundamentally flawed approach. We also believe that no single security policy is perfect for all users and organizations. Therefore, a grid system must have mechanisms that allow users and resource owners to select policies that fit particular security and performance needs, as well as meet local administrative requirements.
  • Fault tolerance. Failure in large-scale grid systems is and will be a fact of life. Machines, networks, disks and applications frequently fail, restart, disappear and behave otherwise unexpectedly. Forcing the programmer to predict and handle all of these failures significantly increases the difficulty of writing reliable applications. Fault-tolerant computing is known to be a very difficult problem. Nonetheless it must be addressed or else businesses and researchers will not entrust their data to grid computing.
  • Global name space. The lack of a global name space for accessing data and resources is one of the most significant obstacles to wide-area distributed and parallel processing. The current multitude of disjoint name spaces greatly impedes developing applications that span sites. All grid objects must be able to access (subject to security constraints) any other grid object transparently without regard to location or replication.
  • Accommodating heterogeneity. A grid system must support interoperability between heterogeneous hardware and software platforms. Ideally, a running application should be able to migrate from platform to platform if necessary. At a bare minimum, components running on different platforms must be able to communicate transparently.
  • Binary management and application provisioning. The underlying system should keep track of executables and libraries, knowing which ones are current, which ones are used with which persistent states, where they have been installed and where upgrades should be installed. These tasks reduce the burden on the programmer.
  • Multi-language support. Diverse languages will always be used and legacy applications will need support.
  • Scalability. There are over 500 million computers in the world today and over 100 million network-attached devices (including computers). Scalability is clearly a critical necessity. Any architecture relying on centralized resources is doomed to failure. A successful grid architecture must adhere strictly to the distributed systems principle: the service demanded of any given component must be independent of the number of components in the system. In other words, the service load on any given component must not increase as the number of components increases.
  • Persistence. I/O and the ability to read and write persistent data are critical in order to communicate between applications and to save data. The current files/file libraries paradigm should also be supported, since it is familiar to programmers.
  • Extensibility. Grid systems must be flexible enough to satisfy current user demands and unanticipated future needs. Therefore, we feel that mechanism and policy must be realized by replaceable and extensible components, including (and especially) core system components. This model facilitates development of improved implementations that provide value-added services or site-specific policies while enabling the system to adapt over time to a changing hardware and user environment.
  • Site autonomy. Grid systems will be composed of resources owned by many organizations, each of which desires to retain control over its own resources. The owner of a resource must be able to limit or deny use by particular users, specify when it can be used, etc. Sites must also be able to choose or rewrite an implementation of each Legion component as best suits their needs. If a given site trusts the security mechanisms of a particular implementation it should be able to use that implementation.
  • Complexity management. Finally, but importantly, complexity management is one of the biggest challenges in large-scale grid systems. In the absence of system support, the application programmer is faced with a confusing array of decisions. Complexity exists in multiple dimensions: heterogeneity in policies for resource usage and security, a range of different failure modes and different availability requirements, disjoint namespaces and identity spaces, and the sheer number of components. For example, professionals who are not IT experts should not have to remember the details of five or six different file systems and directory hierarchies (not to mention multiple user names and passwords) in order to access the files they use on a regular basis. Thus, providing the programmer and system administrator with clean abstractions is critical to reducing their cognitive burden.
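
The site autonomy requirement above, under which an owner can deny particular users and restrict when a resource may be used, can be sketched as follows. This is only an illustration under stated assumptions: the SitePolicy class, its fields and the user names are hypothetical, and a real grid system would combine such a check with authentication and auditing.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Set


@dataclass
class SitePolicy:
    """Hypothetical owner-defined policy for one resource."""
    denied_users: Set[str] = field(default_factory=set)
    open_from: time = time(0, 0)        # start of the permitted window
    open_until: time = time(23, 59, 59)  # end of the permitted window

    def permits(self, user: str, at: time) -> bool:
        """Return True if the user may use the resource at the given time."""
        if user in self.denied_users:
            return False
        return self.open_from <= at <= self.open_until


# The owner allows evening use only and denies one specific user.
policy = SitePolicy(
    denied_users={"mallory"},
    open_from=time(18, 0),
    open_until=time(23, 0),
)

print(policy.permits("alice", time(20, 30)))    # inside the window
print(policy.permits("alice", time(9, 0)))      # outside the window
print(policy.permits("mallory", time(20, 30)))  # denied user
```

The policy object belongs to the resource owner rather than to the grid, so each site can replace it with its own implementation without affecting other sites.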

Grid Design Principles

We have developed a set of principles for the architecture and design of grid systems. These include:

  • Provide a single-system view. With today’s operating systems and tools such as LSF, SGE, and PBS we can maintain the illusion that our local area network is a single computing resource. But once we move beyond the local network or cluster to a geographically dispersed group of sites, perhaps consisting of several different types of platforms, the illusion breaks down. Researchers, engineers and product development specialists (most of whom do not want to be experts in computer technology) are forced to request access through the appropriate gatekeepers, manage multiple passwords, remember multiple protocols for interaction, keep track of where everything is located, and be aware of specific platform-dependent limitations (e.g., this file is too big to copy or to transfer to that system; that application runs only on a certain type of computer, etc.). Re-creating the illusion of a single computing resource for heterogeneous, distributed resources reduces the complexity of the overall system and provides a single namespace.
  • Provide transparency as a means of hiding detail. Grid systems should support the traditional distributed system transparencies: access, location, heterogeneity, failure, migration, replication, scaling, concurrency and behavior. For example, users and programmers should not have to know where something is located in order to use it (access, location and migration transparency), nor should they need to know that a component across the country failed – they want the system to recover automatically and complete the desired task (failure transparency). This behavior is the traditional way to mask details of the underlying system.
  • Provide flexible semantics. A grid architecture should be suitable to as many users and purposes as possible. A rigid system design in which policies are limited, trade-off decisions are pre-selected, or all semantics are pre-determined and hard-coded would not achieve this goal. Indeed, if a system dictates a single system-wide solution to almost any of the technical objectives outlined above, it would preclude large classes of potential users and uses. Therefore, a good grid design should allow users and programmers as much flexibility as possible in their applications’ semantics, resisting the temptation to dictate solutions. Whenever possible, users should be able to select both the kind and the level of functionality and choose their own trade-offs between function and cost. This philosophy must be manifested in the system architecture by specifying the functionality but not the implementation of the system’s core features. The core system should, therefore, consist of extensible, replaceable components wherever possible, while its implementation should provide reasonable default implementations of each functional component, which will be useful for average users. The system should encourage users to select or create components that solve their specific needs.
  • Reduce user effort. In general, there are four classes of grid users who are trying to accomplish some mission with the available resources: end-users of applications, applications developers, system administrators and managers. We believe that users want to focus on their jobs, e.g., their applications, and not on the underlying grid plumbing and infrastructure. Thus, for example, running an application should be fairly straightforward, such as “run <my_application> <my_parameters>”. The grid should then take care of all of the messy details such as finding an appropriate host on which to execute the application, moving data and executables around, etc. Of course, the user may optionally specify or override certain behaviors, for example, by specifying the name of the host or the type of host on which to run the job or by specifying a different meta-scheduler than the default.
  • Reduce “activation energy”. One of the typical problems in technology adoption is getting users to use it. If it is difficult to shift to a new technology then users will tend not to make the effort to try it unless their need is immediate and extremely compelling. This is not a problem unique to grids – it is human nature. Therefore, one of the most important goals for grid design is to make using the technology easy. Using an analogy from chemistry, you want to keep the activation energy of adoption as low as possible. Thus, users can easily and readily realize the benefit of using grids – and get the reaction going – creating a self-sustaining spread of grid usage throughout the organization. Some examples of ways to accomplish this for grid users include: requiring no recompilation of user applications when using them in the grid and supporting the ability to map a grid to a local operating system file system and vice versa (mapping portions of the local file system into the grid – subject to security, of course). Another variant of this concept is the motto “no play, no pay”. The basic idea is that if you do not need a feature, e.g., encrypted data streams, fault-resilient files or strong access control, you should not have to pay the overhead of using it.
  • Do no harm. Grid software should not require resource owners to open themselves up to increased risks (or at least these should be minimized). A good example is to design the system such that the grid software can run with the fewest security privileges possible – minimizing the requirement for root privileges in particular – and sandboxing outside access to local resources.
  • Do not change host operating systems. Organizations will not permit their machines to be used if their operating systems must be replaced. Therefore, grid systems must be designed to work with and on top of host operating systems and require as few configuration changes as possible. Our experience with Mentat and Legion indicates that building such a grid on top of host operating systems is a viable approach.
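
The “reduce user effort” and “flexible semantics” principles above can be sketched together: a run call that needs only an application and its parameters, with a default scheduler that the user may override by naming a host or supplying a different scheduler. All names here (run, least_loaded, the host table and its load values) are assumptions made for illustration and do not describe an actual grid interface.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: a scheduler maps {host name: current load} to a chosen host.
Scheduler = Callable[[Dict[str, float]], str]


def least_loaded(hosts: Dict[str, float]) -> str:
    """Default policy (an assumption): pick the host with the lowest load."""
    return min(hosts, key=lambda name: hosts[name])


def run(
    application: str,
    parameters: List[str],
    hosts: Dict[str, float],
    host: Optional[str] = None,
    scheduler: Scheduler = least_loaded,
) -> str:
    """Choose a host for the application and describe the placement."""
    if host is not None and host not in hosts:
        raise ValueError(f"unknown host: {host}")
    # An explicit host overrides the scheduler; otherwise the scheduler decides.
    target = host if host is not None else scheduler(hosts)
    command = " ".join([application, *parameters])
    return f"{command} -> {target}"


hosts = {"alpha": 0.9, "beta": 0.2, "gamma": 0.5}

# Simplest case: the grid picks the host.
print(run("my_application", ["--steps", "10"], hosts))
# Optional override: the user names the host.
print(run("my_application", ["--steps", "10"], hosts, host="alpha"))
# Optional override: the user supplies a different scheduler.
print(run("my_application", [], hosts, scheduler=lambda h: sorted(h)[-1]))
```

The default path asks nothing of the user beyond the application and its parameters, while the optional arguments keep the scheduling component replaceable.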

Overall, the application of these design principles at every level provides a unique, consistent, and extensible framework upon which to create grid applications.