GFFS – Global Federated File System

Main.GFFS History


January 25, 2012, at 12:49 PM by 128.143.136.101 -
Changed lines 7-9 from:
Transparent access to data (and resources more generally) is realized by using OS-specific file system drivers that understand the underlying standard security, directory, and file access protocols employed by the GFFS. These file system drivers map the GFFS global namespace onto a local file system mount. Data and other resources in the GFFS can then be accessed exactly the same way local files and directories are accessed – applications cannot tell the difference.
to:
Transparent access to data (and resources more generally) is realized by using OS-specific file system drivers that understand the underlying standard security, directory, and file access protocols employed by the GFFS. These file system drivers map the GFFS global namespace onto a local file system mount. Data and other resources in the GFFS can then be accessed exactly the same way local files and directories are accessed – applications cannot tell the difference.

PLEASE NOTE: XSEDE is currently under development. Once deployed, it will be implemented as described below.
January 25, 2012, at 12:47 PM by 128.143.136.101 -
November 18, 2011, at 01:12 PM by 128.143.137.203 -
Changed line 15 from:
%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
to:
%lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
November 18, 2011, at 01:12 PM by 128.143.137.203 -
Changed lines 14-15 from:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
to:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory. (See Figure 2)
%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
November 18, 2011, at 01:11 PM by 128.143.137.203 -
Changed lines 13-14 from:
!!![[GFFS access|Access to non file system resources]]%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory
.
to:
!!![[GFFS access|Access to non file system resources]]
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
November 18, 2011, at 01:10 PM by 128.143.137.203 -
Changed lines 13-14 from:
!!![[GFFS access|Access to non file system resources]]
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
to:
!!![[GFFS access|Access to non file system resources]]%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory
.
November 18, 2011, at 01:10 PM by 128.143.137.203 -
Changed lines 10-12 from:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC.
to:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. (See Figure 1)
Changed line 14 from:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.
to:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
November 18, 2011, at 01:08 PM by 128.143.137.203 -
Changed lines 9-10 from:
!!![[Three examples|Three examples of GFFS Typical Use Cases]]
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
to:
!!![[Three examples|Three examples of GFFS Typical Use Cases]]%rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
Three
cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC.
November 18, 2011, at 01:07 PM by 128.143.137.203 -
Changed line 10 from:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC.
to:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
November 14, 2011, at 09:46 AM by 128.143.137.203 -
Changed line 10 from:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
to:
Three cases illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution. For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC.
November 14, 2011, at 09:39 AM by 128.143.137.203 -
Changed line 14 from:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data in much the same manner as Plan 9 [1]; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.
to:
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.
November 14, 2011, at 09:36 AM by 128.143.137.203 -
November 11, 2011, at 04:14 PM by 128.143.137.203 -
Added line 39:
Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner, for example by creating an OGSA-BES resource that proxies a PBS queue.
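For illustration, a sketch of this step using the invocation quoted elsewhere on this page (the container path /containers/UVA/CS/camillus/ and the link target /home/grimshaw/testPBS-resource are that example's site-specific values, not fixed names):
create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource

This creates an OGSA-BES resource on the named container and links it into the GFFS at the given path; access control to the new resource is then at the owner's discretion, and a configuration file describing the local PBS queue and scratch space can also be supplied, as described later on this page.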
November 11, 2011, at 04:13 PM by 128.143.137.203 -
Changed lines 40-50 from:
'''Compute Sharing.''' Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner. For example, to create an OGSA-BES resource that proxies a PBS queue:

create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource

creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^1^' Access control is now at the user’s discretion.


!!!Footnote


'^1^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
to:
November 11, 2011, at 04:11 PM by 128.143.137.203 -
Changed line 38 from:
!!!Compute Resources
to:
!!![[Compute Resources | Compute Resources]]
November 11, 2011, at 04:09 PM by 128.143.137.203 -
Changed line 29 from:
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs.
to:
There are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs.
November 11, 2011, at 04:06 PM by 128.143.137.203 -
Changed lines 36-44 from:
Earlier we said that the new files and directories could be stored in the containers’ own databases and file system resources. What exactly does that mean and how can we use it?

The basic idea is this: not all data is stored in exports. We can associate a new directory with any Genesis II container we want. Each Genesis II container, in turn, will store newly created file and directory resources on its local storage resources. For example, there is a container at the GFFS path of /containers/FutureGrid/IU/india. To create a new directory on that container, Sarah could

grid mkdir –rns-service=/containers/FutureGrid/IU/india /home/Sarah/india-directory

This will create a new directory on India, and link it into Sarah’s GFFS directory. Subsequent file and directory create operations with the new directory will cause new files and directories to be stored at IU on India. Of course those files and directories can still be accessed via the GFFS just as any other file.

A different mkdir command must be used because the Unix mkdir command has no idea about the Grid, and no concept of creating files and directories except in the underlying file system.
to:
New files and directories could be stored in the containers’ own databases and file system resources.
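As a sketch of the pattern just described (the container path /containers/FutureGrid/IU/india and the GFFS directory /home/Sarah/india-directory are the examples from the text above; the client-side mount point $HOME/XSEDE comes from the client-side discussion, and the file name is hypothetical):
grid mkdir –rns-service=/containers/FutureGrid/IU/india /home/Sarah/india-directory
echo "stored on india" > $HOME/XSEDE/home/Sarah/india-directory/notes.txt

The grid mkdir call places the new directory resource on the india container; the write issued through the FUSE mount then stores the new file on that same container, while it remains reachable at the GFFS path /home/Sarah/india-directory/notes.txt.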
November 11, 2011, at 04:03 PM by 128.143.137.203 -
November 11, 2011, at 04:02 PM by 128.143.137.203 -
Changed line 35 from:
!!!Storage Resources
to:
!!![[Storage Resources| Storage Resources]]
November 11, 2011, at 03:56 PM by 128.143.137.203 -
Changed lines 52-58 from:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^3^' Access control is now at the user’s discretion.


!!!Footnotes


'^3^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
to:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^1^' Access control is now at the user’s discretion.


!!!Footnote


'^1^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
November 11, 2011, at 03:55 PM by 128.143.137.203 -
Changed line 19 from:
One of the most common complaints about grid computing, and the national cyberinfrastructure more generally, is that it is not easy to use. We feel strongly that rather than have users adapt to the infrastructure, the infrastructure should adapt to users. In other words, the infrastructure must support interaction modalities and paradigms with which users are already familiar. Towards that end, simplicity and ease-of-use are critical.
to:
November 11, 2011, at 03:54 PM by 128.143.137.203 -
Changed lines 33-37 from:
An export takes the specified rooted directory tree, maps it into the global namespace, and thus provides a means for non-local users to access data in the directory via the GFFS. Local access to the exported directory is unaffected. Existing scripts, cron jobs, and applications can continue to access the data.
grid export /containers/Big-State-U/Sarah-server /development/sources /home/Sarah/dev
Once a GFFS container is running that can “see”'^1^' the directory to be exported, it is quite simple to share data. For example, Sarah could share out using the simple command'^2^'
to:
An export takes the specified rooted directory tree, maps it into the global namespace, and thus provides a means for non-local users to access data in the directory via the GFFS.
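For concreteness, a sketch combining the export command quoted elsewhere on this page with access through a client-side mount (the container path, source directory, GFFS path, and mount point are the running examples of this text):
grid export /containers/Big-State-U/Sarah-server /development/sources /home/Sarah/dev
ls $HOME/XSEDE/home/Sarah/dev

After the export, the directory tree rooted at /development/sources on Sarah-server is visible, subject to access control, at the GFFS path /home/Sarah/dev, while local access on Sarah-server is unaffected.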
November 11, 2011, at 03:54 PM by 128.143.137.203 -
Changed line 37 from:
This exports the directory tree rooted at “/development/sources” from the machine “Sarah-server” and links it into the global namespace at the path “/home/Sarah/dev”. Once exported, the data is accessible (subject to access control) until the export is terminated. The net result is that a user can decide to securely share out a particular directory structure with colleagues anywhere with a network connection, and those colleagues can subsequently access it with no effort.
to:
November 11, 2011, at 03:53 PM by 128.143.137.203 -
Deleted lines 60-62:
'^1^' The host on which the GFFS container is running must have the file system that contains the data mounted, and must have permission to access the file system.

'^2^' There are also GUI mechanisms to do this.
November 11, 2011, at 03:52 PM by 128.143.137.203 -
November 11, 2011, at 03:50 PM by 128.143.137.203 -
Changed line 32 from:
!!!File System Resources a.k.a. exports
to:
!!![[File System Resources| File System Resources – a.k.a. exports]]
November 11, 2011, at 03:43 PM by 128.143.137.203 -
Changed lines 56-58 from:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^6^' Access control is now at the user’s discretion.
to:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^3^' Access control is now at the user’s discretion.
Changed line 65 from:
'^6^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
to:
'^3^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
November 11, 2011, at 03:42 PM by 128.143.137.203 -
Changed lines 35-36 from:
Once a GFFS container is running that can “see”'^4^' the directory to be exported, it is quite simple to share data. For example, Sarah could share out using the simple command'^5^'
to:
Once a GFFS container is running that can “see”'^1^' the directory to be exported, it is quite simple to share data. For example, Sarah could share out using the simple command'^2^'
Changed lines 61-63 from:
'^4^' The host on which the GFFS container is running must have the file system that contains the data mounted, and must have permission to access the file system.

'^5^' There are also GUI mechanisms to do this.
to:
'^1^' The host on which the GFFS container is running must have the file system that contains the data mounted, and must have permission to access the file system.

'^2^' There are also GUI mechanisms to do this.
November 11, 2011, at 03:34 PM by 128.143.137.203 -
Changed lines 25-34 from:
By “client-side”, we mean the users of resources in the GFFS (the data clients in Figure 1). For example, a visualization application Sarah might run on her workstation that accesses files residing at an NSF service provider such as TACC.
Three mechanisms can be used to access data in the GFFS: a command line tool; a graphical user interface; and an operating system specific file system driver (http://genesisii.cs.virginia.edu/docs/Client-usage-v1.0.pdf). The first step in using any of the GFFS access mechanisms is to install the XSEDE Genesis II client. There are client installers for Windows, Linux, and MacOS (http://genesis2.virginia.edu/wiki/Main/Downloads). The installers work like most installers. You download the installer, double-click on it, and follow the directions. It is designed to be as easy to install as TurboTax®. Within two or three minutes, you will be up and ready to go.

On Linux and MacOS, we provide a GFFS-aware FUSE file system driver to map the global namespace into the local file system namespace. FUSE is a user space file system driver that requires no special permission to run. Thus, one does not have to be “root” to mount a FUSE device.

Once the client has been installed and the user is logged in, mounting the GFFS in Linux requires two simple steps: create a mount-point, and mount the file system as shown below.
mkdir XSEDE
nohup grid fuse –mount local:XSEDE &

Once mounted, the XSEDE directory can be used just like any other mounted file system.
to:
By “client-side”, we mean the users of resources in the GFFS. For example, a visualization application Sarah might run on her workstation that accesses files residing at an NSF service provider such as TACC.
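A minimal client-side sketch, repeating the mount commands given above and adding one hypothetical access through the mount (/home/Sarah is the example GFFS directory used throughout this text):
mkdir XSEDE
nohup grid fuse –mount local:XSEDE &
ls XSEDE/home/Sarah

Once the FUSE mount is in place, ordinary tools such as ls, cat, and text editors operate on GFFS paths exactly as they do on local files.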
November 11, 2011, at 03:21 PM by 128.143.137.203 -
Changed line 24 from:
!!!Client Side – Accessing Resources
to:
!!![[Client Side Resources|Client Side – Accessing Resources]]
November 11, 2011, at 03:20 PM by 128.143.137.203 -
Deleted lines 67-71:
'^1^' Access from login nodes is assured. The GFFS is not always accessible from the compute nodes.

'^2^' This capability has been demonstrated, but is not ready for production use.

'^3^' There are as many possibilities as there are different implementations of the RNS and ByteIO specifications. We are using the most typical implementations in the GFFS as of this writing.
November 11, 2011, at 03:18 PM by 128.143.137.203 -
November 11, 2011, at 03:13 PM by 128.143.137.203 -
Deleted lines 80-186:
1. Pike, R., et al. Plan 9 from Bell Labs. in UKUUG Summer 1990 Conference. 1990. London, UK.

2. Campbell, R.H., et al., Designing and implementing Choices: an object-oriented system in C++. Communications of the ACM, 1993. 36(9): p. 117 - 126

3. Campbell, R.H., et al., Principles of Object Oriented Operating System Design. 1989, Department of Computer Science, University of Illinois: Urbana, Illinois.

4. Gropp, W., E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. 1994: MIT Press.

5. Geist, A., et al., PVM: Parallel Virtual Machine. 1994: MIT Press.

6. Lustre, The Lustre File System. 2009.

7. IBM. General Parallel File System. 2005 [cited; Available from: http://www-03.ibm.com/systems/clusters/software/gpfs.html.

8. OGF, Open Grid Forum, Open Grid Forum.

9. Grimshaw, A., D. Snelling, and M. Morgan, WS-Naming Specification. 2007, Open Grid Forum, GFD-109.

10. Antonioletti, M., et al., Web Services Data Access and Integration - The Core (WS-DAI) Specification, Version 1.0 2006, Open Grid Forum.

11. Newhouse, S. and A. Grimshaw, Independent Software Vendors (ISV) Remote Computing Usage Primer, in Grid Forum Document, G. Newby, Editor. 2008, Open Grid Forum. p. 141.

12. Jordan, C. and H. Kishimoto, Defining the Grid: A Roadmap for OGSA® Standards v1.1 [Obsoletes GFD.53] 2008, Open Grid Forum.

13. Merrill, D., Secure Addressing Profile 1.0 2008, Open Grid Forum.

14. Merrill, D., Secure Communication Profile 1.0 2008, Open Grid Forum.

15. Snelling, D., D. Merrill, and A. Savva, OGSA® Basic Security Profile 2.0. 2008, Open Grid Forum.

16. Grimshaw, A., et al., An Open Grid Services Architecture Primer. IEEE Computer, 2009. 42(2): p. 27-34.

17. Allcock, W., GridFTP Protocol Specification Open Grid Forum, 2003. GFD.20.

18. Foster, I., T. Maguire, and D. Snelling, OGSA WSRF Basic Profile 1.0, in Open Grid Forum Documents. 2006. p. 23.

19. Morgan, M., A.S. Grimshaw, and O. Tatebe, RNS Specification 1.1. 2010, Open Grid Forum. p. 23.

20. Morgan, M. and O. Tatebe, RNS 1.1 OGSA WSRF Basic Profile Rendering 1.0. 2010, Open Grid Forum. p. 16.

21. OASIS. Organization for the Advancement of Structured Information Standards. [cited; Available from: http://www.oasis-open.org/.

22. OASIS-SOAPSec. Web Services Security: SOAP Message Security. 2003 [cited August 27 2003]; Working Draft 17.

23. OASIS. WS-Security. 2005 [cited; Available from: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss.

24. Graham, S., et al., Web Services Resource 1.2 (WS-Resource). 2005.

25. Snelling, D., I. Robinson, and T. Banks, WSRF - Web Services Resource Framework. 2006, OASIS.

26. OASIS, Web Services Security X.509 Certificate Token Profile, in OASIS Standard Specification. 2006, OASIS.

27. OASIS, Web Services Security Username Token Profile 1.1, in OASIS Standard Specification. 2006.

28. OASIS, WS-Trust 1.3, in OASIS Standard Specification. 2007.

29. OASIS, Web Services Security Kerberos Token Profile, in OASIS Standard Specification. 2006.

30. Box, D., et al., Web Services Addressing (WS-Addressing). 2004, W3C.

31. Christensen, E., et al. Web Services Description Language (WSDL) 1.1. 2001 [cited; Available from: http://www.w3.org/TR/wsdl.

32. W3C, XML Encryption Syntax and Processing, in W3C Recommendation. 2002, W3C.

33. WS-I, Basic Security Profile 1.0, in WS-I Final Material. 2007.

34. Morgan, M. and A. Grimshaw. Genesis II - Standards Based Grid Computing. in Seventh IEEE International Symposium on Cluster Computing and the Grid 2007. Rio de Janario, Brazil: IEEE Computer Society.

35. Virginia, U.o. The Genesis II Project. 2010 [cited; Available from: http://genesis2.virginia.edu/wiki/Main/HomePage.

36. Group, G.I., Cross Campus Grid (XCG). 2009.

37. Satyanarayanan, M., Scalable, Secure, and Highly Available Distributed File Access. IEEE Computer, 1990. 23(5): p. 9-21.

38. Levy, E. and A. Silberschatz, Distributed File Systems: Concepts and Examples. ACM Computing Surveys, 1990. 22(4): p. 321-374.

39. White, B., et al. LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications. in SC 01. 2001. Denver, CO.

40. Stockinger, H., et al., File and Object Replication in Data Grids. Journal of Cluster Computing, 2002. 5(3): p. 305-314.

41. Huang, H. and A. Grimshaw, Grid-Based File Access: The Avaki I/O Model Performance Profile. 2004, Department of Computer Science, University of Virginia: Charlottesville, VA.

42. Heizer, I., P.J. Leach, and D.C. Naik. A Common Internet File System (CIFS/1.0) Protocol. 1996 [cited; Available from: http://www.tools.ietf.org/html/draft-heizer-cifs-v1-spec-00.

43. Walker, B.e.a. The LOCUS Distributed Operating System. in 9th ACM Symposium on Operating Systems Principles. 1983. Bretton Woods, N. H.: ACM.

44. Adya, A., et al. FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment. 2002.

45. Morris, J.H.e.a., Andrew: A distributed personal computing environment. Communications of the ACM, 1986. 29(3).

46. Shepler, S., et al. Network File System (NFS) version 4 Protocol. 2003 [cited RFC 3530; Available from: http://www.ietf.org/rfc/rfc3530.txt.

47. Kunszt, P., et al. Data storage, access and catalogs in gLite. in Local to Global Data Interoperability - Challenges and Technologies, 2005.

48. White, B.S., A.S. Grimshaw, and A. Nguyen-Tuong. Grid-Based File Access: The Legion I/O Model. in 9th IEEE International Symposium on High Performance Distributed Computing. 2000.

49. Foster, I., et al., Modeling and Managing State in Distributed Systems: The Role of OGSI and WSRF, in Proceedings of the IEEE, 93(3). 2005.

50. Bester, J., et al. GASS: A Data Movement and Access Service for Wide Area Computing Systems. in Sixth Workshop on I/O in Parallel and Distributed Systems. 1999.

51. Chervenak, A., et al., The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 2001. 23: p. 187-200.

52. Fitzgerald, S., et al. A Directory Service for Configuring High-Performance Distributed Computations. in 6th IEEE Symposium on High-Performance Distributed Computing. 1997: IEEE Computer Society Press.

November 11, 2011, at 03:12 PM by 128.143.137.203 -
November 11, 2011, at 03:09 PM by 128.143.137.203 -
Changed line 80 from:
!!!References
to:
!!![[GFFS References|]]References
November 11, 2011, at 03:07 PM by 128.143.137.203 -
Deleted lines 19-24:

When considering ease-of-use, the first and most important observation is that most scientists do not want to become computer hackers. They view the computer as a tool that they use every day for a wide variety of tasks: reading email, saving attachments, opening documents, cruising through the directory/folder structure looking for a file, and so on. Therefore, rather than have scientists learn a whole new paradigm to search for and access data, we believe the paradigm with which they are already familiar should be extended across organizational boundaries and to a wider variety of file types.

Therefore, the core, underlying goal of the GFFS is to empower science and engineering by lowering the barriers to carrying out computationally based research. Specifically, we believe that the mechanisms used must be easy to use and learn, must not require changes to existing infrastructures on campuses and labs, and must support interactions between the centers and campuses, between campuses, and with other international infrastructures. We believe complexity is the major problem that must be addressed.

Ease of use is just one of many quality attributes a system such as the GFFS exhibits. Others are security, performance, availability, reliability, and so on. With respect to performance, we are often asked how GFFS performance compares to parallel file systems such as Lustre [6] or GPFS [7]. For us this is somewhat of a non sequitur. Competing with Lustre and GPFS is not a goal – the GFFS is not designed to be a high-performance parallel file system. It is designed to make it easy to federate across many different organizations and make data easily accessible to users and applications.
November 11, 2011, at 03:05 PM by 128.143.137.203 -
November 11, 2011, at 03:03 PM by 128.143.137.203 -
Changed lines 43-44 from:
to:
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs.
Deleted line 64:
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs.
November 11, 2011, at 03:01 PM by 128.143.137.203 -
Added line 64:
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs.
November 11, 2011, at 02:59 PM by 128.143.137.203 -
Changed line 17 from:
!!!An Aside on GFFS Goals and Non-Goals
to:
!!![[GFFS Aside|An Aside on GFFS Goals and Non-Goals]]
November 11, 2011, at 02:53 PM by 128.143.137.203 -
Changed lines 42-59 from:
!!!Sharing Resources
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs. We will keep our focus here on sharing data, specifically files and directories. For all of the below we will assume the GFFS client is already installed and that we have linked the GFFS into our Unix home directory at $HOME/XSEDE. We will further assume that our user Sarah has a directory in the GFFS at /home/Sarah. Given where the GFFS is mounted, the Unix path to that directory is $HOME/XSEDE/home/Sarah. We will also assume below, unless otherwise noted, that our current working directory is $HOME/XSEDE/home/Sarah.

Before we get to sharing local data resources, let’s look first at how to create a file or directory “somewhere in the GFFS”. To create a file or directory in the GFFS is simple. For example,
mkdir test
echo "This is a test" >> test/newfile

creates a new file in the newly created “test” directory. Once created, both the directory and the file are available throughout the GFFS subject to access control.

However, where is the data actually stored? The short answer is that the “test” directory will be created in the same place where the current working directory is located. Similarly, “newfile” will be placed in the same location as the “test” directory.

So, where is that? A bit of background here is useful. The GFFS uses a standards-based Web Services model. Most Web Services (including those that implement the GFFS) execute inside of a program called a Web Services container. A container is a program that accepts Web Services connections (in our case https connections), parses the request, and calls the appropriate function to handle the request. Web Services containers are often written in Java, and execute on Windows, Linux, or MacOS machines like any other application. The difference is that they listen for http/https connections and respond to them.

In the GFFS, files and directories are stored in different GFFS Web Service containers (just “containers” from here on). There are GFFS containers at the NSF service providers, and there are containers wherever someone wants to share a resource. Therefore, the first step to sharing a resource is to install the GFFS container. The installation process is very similar to installing the client if one chooses to use only the default options. It can be more complicated if, for example, your resource is behind a NAT or firewall. The GFFS container requires no special permissions or privilege, though it is recommended that it be started as its own user with minimum privilege.

To get back to our earlier question, “where is the data actually stored?” The new directory and the new file will be placed on the same container as Sarah’s GFFS home directory. In a moment, we will see how to override this. For now, however, we will assume that is where they will be placed.

Given that the new items will be placed on the same GFFS container as Sarah’s home directory, exactly where will they be placed? There are two possibilities'^3^'. Either Sarah’s home directory is being stored by a container in the container’s own databases and storage space (in other words, the container is acting as a storage service), or her home directory is an export, in other words it is being stored in a local file system somewhere. Below we briefly examine each of these options. Keep in mind that either way a GFFS container is providing access to the data.
to:
!!![[Sharing Resources|Sharing Resources]]
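As a concrete sketch of the conventions described above (the mount point $HOME/XSEDE and the GFFS directory /home/Sarah are the running examples of this text):
cd $HOME/XSEDE/home/Sarah
mkdir test
echo "This is a test" >> test/newfile

Both the new directory and the new file are then available throughout the GFFS, subject to access control, and are stored on whichever container hosts Sarah’s GFFS home directory.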
November 11, 2011, at 02:47 PM by 128.143.137.203 -
Deleted lines 15-23:
'''Compute resources:''' A compute resource, such as a PBS-controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
Recall that in our earlier example Sarah’s group had a local compute cluster. Sarah could also export her compute cluster into the GFFS as a shared compute resource and give Bob’s group access to the cluster. Bob’s group could then use Sarah’s resource without needing special accounts, and without having to log in to Sarah’s machines. If Bob too had a cluster, he could export that cluster into the GFFS. They could then create a shared Grid Queue that includes both of their clusters and load-balances jobs between the two resources – effectively creating a mini-compute grid.

'''RDBMS:'''
Relational databases can similarly be modeled as a folder or directory containing a set of tables'^2^'. Each sub-table is itself a folder that contains sub-tables (created by executing queries against it) and a text file that can be used as a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new sub-folder.

'''Named pipes:''' Often two or more applications need to communicate. Traditionally applications can communicate via files in the file system, e.g., application A writes file A_output and application B reads the file, or via message passing [4, 5] or sockets of some kind, e.g., open a TCP connection to a well-known address and send bytes down the channel. In Unix, pipes are also often used for programs started on the same machine. Unfortunately, in wide-area distributed systems, many resources are behind NATs and firewalls and simply opening a socket is not always an easy option.

To address this problem, the GFFS supports named pipes. GFFS named pipes are analogous to their Unix counterparts; they are buffered streams of bytes. Named pipes appear in the namespace just as any other file, and have access control like any other file. As with Unix named pipes, GFFS named pipes may have many readers and writers, though the same caveats apply. Thus, an application can create a named pipe at a well-known location and then read from it, awaiting another application to write to it.
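A sketch of the job-as-directory model described above, assuming a BES resource has been linked at the hypothetical GFFS path /home/Sarah/pbs-bes and that the GFFS is mounted at $HOME/XSEDE (the JSDL file name is also hypothetical):
cp myjob.jsdl $HOME/XSEDE/home/Sarah/pbs-bes/
ls $HOME/XSEDE/home/Sarah/pbs-bes/

Copying the JSDL file into the resource’s directory starts (or, on a queuing system, schedules) the job; the listing then shows a sub-folder for each job, containing a job-status text file and the job’s working directory.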
November 11, 2011, at 11:10 AM by 128.143.137.203 -
Changed line 13 from:
!!!Access to non file system resources
to:
!!![[GFFS access|Access to non file system resources]]
November 11, 2011, at 11:09 AM by 128.143.137.203 -
Changed lines 9-10 from:
[[Three examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
to:
!!![[Three examples|Three examples of GFFS Typical Use Cases]]
Three cases
illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
November 11, 2011, at 11:08 AM by 128.143.137.203 -
Changed line 2 from:
!!! Intro
to:
November 11, 2011, at 11:02 AM by 128.143.137.203 -
Changed lines 11-21 from:
For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
!!!Accessing data at an NSF center from a home or campus
Using the GFFS, Sarah and her students can export their home directories or scratch directories at TACC into the global namespace. They can then mount the GFFS on their Linux workstations and on their cluster nodes. This permits them to edit, view, and visualize application parameter files, input files, intermediate files, and final output files directly from their desktop. Further, they can start local applications that can monitor application progress (by checking for files in a directory, or scanning an output file) all in real time against the actual data at TACC. There is no need to explicitly transfer (copy) files back and forth, nor is there any need to keep track of which version of which file has been copied – consistency with the data at TACC is assured.

!!!Accessing data on a campus machine from an NSF center
Similarly, Sarah and her students can directly access files on their clusters and desktops at Big State U. from the centers. This means they can keep one set of sources, makefiles, and scripts at Big State U., and compile and execute against them from any of the NSF service providers. For example, suppose that Sarah’s group keeps their sources and scripts in the directory /home/Sarah/sources on her departmental file server. She could export /home/Sarah/sources into the GFFS and access it in scripts or at the command line from any of the service providers. Any changes made to the files, either at Big State U, or at any of the service providers, will be immediately visible to GFFS users, including her own jobs and scripts running at Big State U or any of the service providers'^1^'.

Next, consider the case when Sarah’s lab has an instrument that generates data files from experiments and places them in a local directory. As is so often the case, suppose the instrument comes with a Windows computer onto which the data is dumped. Sarah could export the directory in which the data is placed by the instrument, e.g., c:\labMaster-1000\outfiles into the GFFS. The data will then be directly accessible not only at her home institution, but also at the service providers, without any need to copy the data explicitly.

!!!Sharing data with a collaborator at another institution
Finally, consider the case of a multi-institution collaboration in which Sarah is collaborating with a team led by Bob at Small-State-U. Suppose Bob’s team is developing and maintaining some of the applications used in the workflow. Suppose that Bob’s team also needs to access both Sarah’s instrument data and the data her team has generated at TACC. First, Bob can export his source and binary trees into the GFFS and give Sarah and her team access to the directories. Sarah can similarly give Bob and his team access to the necessary directories in the GFFS. Bob can then directly access Sarah’s data both at Big-State-U and at TACC. An interesting aspect is that Bob accessing Sarah’s data at Big-State-U, and Sarah accessing Bob’s code at Small-State-U, does not necessarily involve XSEDE at all, even though they are using the XSEDE-provided GFFS as a medium.
to:
November 11, 2011, at 11:00 AM by 128.143.137.203 -
Changed line 9 from:
[[3_examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
to:
[[Three examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
November 11, 2011, at 10:59 AM by 128.143.137.203 -
Changed line 9 from:
[[Three examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
to:
[[3_examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
November 11, 2011, at 10:54 AM by 128.143.137.203 -
November 11, 2011, at 10:54 AM by 128.143.137.203 -
Changed line 9 from:
[[Three examples illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.]]
to:
[[Three examples|Three examples]] illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
November 11, 2011, at 10:51 AM by 128.143.137.203 -
Changed line 9 from:
Three examples illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
to:
[[Three examples illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.]]
November 09, 2011, at 04:25 PM by 128.143.137.203 -
Changed line 11 from:
For each of these three examples, suppose that Sarah is an [[http://xsede.org | eXtreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
to:
For each of these three examples, suppose that Sarah is an [[http://xsede.org | Extreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
November 09, 2011, at 04:24 PM by 128.143.137.203 -
Changed line 11 from:
For each of these three examples, suppose that Sarah is an XSEDE user at Big State U and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
to:
For each of these three examples, suppose that Sarah is an [[http://xsede.org | eXtreme Science and Engineering Discovery Environment (XSEDE)]] user at Big State U and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
November 08, 2011, at 01:35 PM by 128.143.137.203 -
Changed lines 103-105 from:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth. Access control is now at the user’s discretion.
to:
creates new OGSA-BES resources on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it information it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth.'^6^' Access control is now at the user’s discretion.
Added lines 116-117:

'^6^' Users can also create OGSA-BES resources that exploit cloud resources that are Amazon EC2 compliant, such as the Amazon cloud, the NSF funded FutureGrid clouds, and Penguin clouds.
November 08, 2011, at 10:29 AM by 128.143.137.203 -
Added line 66:
November 08, 2011, at 10:28 AM by 128.143.137.203 -
Added lines 85-95:
!!!Storage Resources
Earlier we said that the new files and directories could be stored in the containers’ own databases and file system resources. What exactly does that mean and how can we use it?

The basic idea is this: not all data is stored in exports. We can associate a new directory with any Genesis II container we want. Each Genesis II container, in turn, will store newly created file and directory resources on its local storage resources. For example, there is a container at the GFFS path of /containers/FutureGrid/IU/india. To create a new directory on that container, Sarah could

grid mkdir –rns-service=/containers/FutureGrid/IU/india /home/Sarah/india-directory

This will create a new directory on India, and link it into Sarah’s GFFS directory. Subsequent file and directory create operations with the new directory will cause new files and directories to be stored at IU on India. Of course those files and directories can still be accessed via the GFFS just as any other file.

A different mkdir command must be used because the Unix mkdir command has no idea about the Grid, and no concept of creating files and directories except in the underlying file system.
Deleted lines 102-107:

!!!Performance
During the XSEDE proposal preparation process, doubts were raised as to whether a Web Services-based solution, complete with XML processing, message signing, and SSL, could meet the performance requirements of the XSEDE user community. Two different performance scenarios were discussed: the bandwidth and latency of a single client, and the aggregate throughput of a large number of clients.

We have developed and adopted a number of IO benchmarks for the GFFS.
November 08, 2011, at 10:17 AM by 128.143.137.203 -
Added line 104:
Added line 106:
Added line 108:
November 08, 2011, at 10:17 AM by 128.143.137.203 -
Changed lines 98-109 from:
!!!Related Work
Sharing data within a department or research group is easy – data is typically stored on a network-accessible file system such as NFS or CIFS, and sharing is as easy as setting permissions and using the correct file system path to the data. It is much more complicated when collaborators are at different institutions with different identity domains (e.g., Unix UID spaces) and no common shared file system. Current best practice is to copy files around with scp or some other copy-based tool, or to place the data on a web site and copy it around either manually or via scripts. This presents significant challenges, e.g., trust, account management, consistency, etc.

Wide area file systems have been developed and deployed over the years [6, 37-46] to address these problems. However, they have not had much adoption, particularly in multi-organizational settings. We believe that they have not become ubiquitous because: 1) they are not federating file systems, i.e., they require data to be stored a particular way and cannot simply layer on top of local file systems; 2) they often require kernel modifications that require root permission and possibly synchronizing kernel versions across sites; 3) they often require changes to an organization’s authentication infrastructure, an extremely difficult task, particularly if undertaken to satisfy only one user or group’s need to share; and 4) they require a degree of trust among partners that is infeasible in practice due to the requirement that the kernels of different machines in the system trust one another.

The Grid community has developed a plethora of different low-level mechanisms and standards for data management in Grids, including GridFTP [17], EU DG [47], OGSA-ByteIO [48], DIAS [49] and others, none of which provides a seamless, transparent transition for end users from their local environment to a Grid environment. This frustrates potential users and uses of Grid systems.

Globus has GASS [50], RLS [51], and GridFTP [17], and has provided RIO [52] in the past. Our approach differs significantly, in that we mask from applications the fact that they are using the Grid at all – instead, in our approach users view the Grid as an extension of the file system. Similarly, we differ in that the coherence of the data is automatically maintained (in an object-specific fashion). This is significantly different from the Globus approach, where the user must explicitly manage checking file status and copying files back and forth with GridFTP.

dCache (www.dcache.org) is a distributed data management infrastructure gaining increasing use in the high-performance community. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods. dCache shares many of the same goals as the proposed work. The primary difference is the set of standards used: dCache uses SRM (Storage Resource Manager) rather than RNS and ByteIO. As a result, dCache is focused on data only; the namespace provided cannot include resources other than data and directories, whereas in the GFFS pathnames can refer to any type of resource.

Lustre/WAN [6] and GPFS/WAN [7] are wide-area implementations of their local-cluster production versions. The single-site versions of Lustre and GPFS, as well as other cluster file systems such as gluster (www.gluster.org), are in widespread use. The performance of both in the wide area is very good. The problem with both has to do with the need to choose one – and store data in one. This requires everybody to buy into the same file system implementation, and all of the kernel requirements that follow from that decision. It also requires a degree of trust between kernels that, while possible between the centers, is simply not realistic between campuses, the centers, and individual researcher machines.
to:
November 08, 2011, at 10:09 AM by 128.143.137.203 -
Changed lines 87-88 from:
Compute Sharing. Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner. For example, to create an OGSA-BES resource that proxies a PBS queue:
create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource
to:
'''Compute Sharing.''' Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner. For example, to create an OGSA-BES resource that proxies a PBS queue:

create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource
Changed lines 92-99 from:
Security
Signing on in XSEDE is accomplished using your existing XSEDE portal username and password.
As a standards based grid implementation, the GFFS pulls together a number of well adopted and supported security specifications to ensure that GFFS users enjoy a secure and protected grid environment. For starters, all communication in the GFFS happens over SSL encrypted communication channels. In addition, messages are signed using WS-Security SOAP headers and contain signed assertions detailing what permissions the caller has with respect to the message in question. Finally, upon receipt of such a message, GFFS compares the signed credential assertions contained in the message (or alternatively, the X.509 Certificate used on the SSL connection) against that resource’s access control list (ACL) in order to authorize the operation.
Because of the file system nature of the GFFS, access control lists for resources are typically broken up into familiar (at least in terms of Mac OS X, UNIX, and Microsoft Windows) read, write, and execute categorizations. Correspondingly, operations on grid resources are also classified into these categories (e.g., the ability to create a job on a compute resource is considered an execute permission while the ability to manage the compute resource is considered a write privilege). This makes it very easy for users to maintain their seamless view of the grid as a file system. Naturally, in support of this file system interface the GFFS supports a chmod tool (both through the included grid client and through the FUSE file system driver) that allows for easy manipulation of a grid resource’s access control list.
The GFFS supports the notion of grid users and grid groups - both of which can be used in access control lists for any resource. The GFFS represents users and groups with a grid resource that implements the well-known web standard Secure Token Service (STS) interface. A user simply has to log into to these resources using a login tool and prove their identity (usually username/password, but can be certificate based). If the login is successful, the user receives a set of signed SAML assertions that state he/she has the permissions of that user or group. Other tools then automatically use these assertions during subsequent operations and target resources check their validity and compare them against access control lists to do authorization.
Groups are a cornerstone for easy permission maintenance and for forming virtual organizations. Users with proper permission can create a new group and easily maintain its membership. Group credentials can then be used to provide access to grid resources to members of the virtual organization as appropriate. Since The GFFS embodies a wide range of relevant system components as grid resources, permissions to execution resources (BES and grid queues), data (files, exports/file system proxies, directories), and users/groups can all be managed with one simple mechanism.
In the interest of making user login and credential management as simple as possible, The GFFS supports a user being able to acquire both his/her user credentials as well as his/her group credentials in a single login command. A user simply links, “ln”, his desired auto-login groups to his user resource and the login command will attempt to log into each group listed using his/her user credentials. The group already is setup to allow only the proper users to be able to login and acquire the group’s credentials.

Performance
to:

!!!
Performance
Added line 95:
Changed lines 97-98 from:
Related Work
to:

!!!
Related Work
Added line 100:
Changed lines 102-104 from:
The Grid community has developed a plethora of different low-level mechanisms and standards for data management in Grids, including Gridftp [17], EU DG [47], OGSA-ByteIO [48], DIAS [49] and others, none of which provides the seamless transparent transition for end users from their local environment to a Grid environment. This frustrates potential users and uses of Grid systems.
to:

The Grid community has developed a plethora of different low-level mechanisms and standards for data management in Grids, including Gridftp [17], EU DG [47], OGSA-ByteIO [48], DIAS [49] and others, none of which provides the seamless transparent transition for end users from their local environment to a Grid environment. This frustrates potential users and uses of Grid systems.
Added line 106:
Added line 108:
November 08, 2011, at 10:06 AM by 128.143.137.203 -
Changed lines 77-78 from:
File System Resources – a.k.a. exports
to:

!!!
File System Resources – a.k.a. exports
Changed lines 81-82 from:
Once a GFFS container is running that can “see” the directory to be exported, it is quite simple to share data. For example, Sarah could share out using the simple command
to:
Once a GFFS container is running that can “see”'^4^' the directory to be exported, it is quite simple to share data. For example, Sarah could share out using the simple command'^5^'
Changed lines 84-87 from:
Storage Resources
Earlier we said that the new files and directories could be stored in the containers own databases and file system resources. What exactly does that mean and how can we use it?

Compute Resources
to:

!!!Compute
Resources
Added lines 113-114:
'^4^' The host on which the GFFS container is running must have the file system that contains the data mounted, and must have permission to access the file system.
'^5^' There are also GUI mechanisms for doing this.
November 08, 2011, at 10:02 AM by 128.143.137.203 -
Changed line 76 from:
Given that the new items will be placed on the same GFFS container as Sarah’s home directory, exactly where will they be placed? There are two possibilities . Either Sarah’s home directory is being stored a container in the containers own databases and storage space (in other words, the container is acting as a storage service), or her home directory is an export, in other words it is being stored in a local file system somewhere. Below we briefly examine each of these options. Keep in mind that either way a GFFS container is providing access to the data.
to:
Given that the new items will be placed on the same GFFS container as Sarah’s home directory, exactly where will they be placed? There are two possibilities'^3^'. Either Sarah’s home directory is being stored in the container’s own databases and storage space (in other words, the container is acting as a storage service), or her home directory is an export; in other words, it is being stored in a local file system somewhere. Below we briefly examine each of these options. Keep in mind that either way a GFFS container is providing access to the data.
Added line 112:
'^3^' There are as many possibilities as there are different implementations of the RNS and ByteIO specifications. We are using the most typical implementations in the GFFS as of this writing.
November 08, 2011, at 09:58 AM by 128.143.137.203 -
Deleted line 25:
November 08, 2011, at 09:57 AM by 128.143.137.203 -
Added line 111:
November 08, 2011, at 09:56 AM by 128.143.137.203 -
Changed lines 30-31 from:
Relational databases can similarly be modeled as a folder or directory containing a set of tables . Each sub-table is itself a folder that contains sub tables (created by executing queries against it) and a text file that can be used as a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new sub folder.
to:
Relational databases can similarly be modeled as a folder or directory containing a set of tables'^2^' . Each sub-table is itself a folder that contains sub tables (created by executing queries against it) and a text file that can be used as a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new sub folder.
Added line 68:
Added line 70:
Added line 72:
Added line 74:
Added line 76:
Added line 111:
'^2^' This capability has been demonstrated, but is not ready for production use.
November 08, 2011, at 09:52 AM by 128.143.137.203 -
Deleted lines 26-27:
November 08, 2011, at 09:51 AM by 128.143.137.203 -
Changed line 25 from:
'''Computer resources:''' A compute resource, such as a PBS controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
to:
'''Computer resources:''' A compute resource, such as a PBS controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation. %rfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
November 08, 2011, at 09:49 AM by 128.143.137.203 -
Changed lines 30-31 from:
RDBMS: Relational databases can similarly be modeled as a folder or directory containing a set of tables . Each sub-table is itself a folder that contains sub tables (created by executing queries against it) and a text file that can be used as a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new sub folder.
Named pipes: Often two more applications need to communicate. Traditionally applications can communicate via files in the file system, e.g., application A writes file A_output and application B reads the file, or via message passing [4, 5] or sockets of some kind, e.g., open a TCP connection to a well-known address and send bytes down the channel. In Unix for programs started on the same machine, pipes are often also used. Unfortunately, in wide-area distributed systems, many resources are behind NATs and firewalls and simply opening a socket is not always an easy option.
to:

'''
RDBMS:'''
Relational
databases can similarly be modeled as a folder or directory containing a set of tables . Each sub-table is itself a folder that contains sub tables (created by executing queries against it) and a text file that can be used as a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new sub folder.

'''
Named pipes:''' Often two more applications need to communicate. Traditionally applications can communicate via files in the file system, e.g., application A writes file A_output and application B reads the file, or via message passing [4, 5] or sockets of some kind, e.g., open a TCP connection to a well-known address and send bytes down the channel. In Unix for programs started on the same machine, pipes are often also used. Unfortunately, in wide-area distributed systems, many resources are behind NATs and firewalls and simply opening a socket is not always an easy option.
Changed lines 37-38 from:
An Aside on GFFS Goals and Non-Goals
to:

!!!
An Aside on GFFS Goals and Non-Goals
Added line 41:
Added line 43:
Added line 45:
Changed lines 47-48 from:
GFFS Implementation
to:

!!!
GFFS Implementation
Changed lines 50-51 from:
Remainder of paper
Client Side – Accessing Resources
to:

!!!
Client Side – Accessing Resources
Added line 54:
Added line 56:
Changed lines 58-60 from:
mkdir XSEDE
nohup
grid fuse –mount local:XSEDE &
to:
mkdir XSEDE
nohup grid fuse –mount local:XSEDE &
Changed lines 62-63 from:
Sharing Resources
to:

!!!
Sharing Resources
Added line 65:
Changed lines 67-68 from:
mkdir test
echo
“This is a test” >> test/newfile
to:
mkdir test
echo “This is a test” >> test/newfile
November 08, 2011, at 09:43 AM by 128.143.137.203 -
Changed lines 19-20 from:
Sharing data with a collaborator at another institution
to:

!!!
Sharing data with a collaborator at another institution
Changed line 22 from:
Access to non file system resources
to:
!!!Access to non file system resources
Changed lines 24-25 from:
Computer resources: A compute resource, such as a PBS controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation.
to:

'''
Computer resources:''' A compute resource, such as a PBS controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs2.jpeg
Changed line 27 from:
Figure 2 Directory listing of a running job showing the working directory of the job.
to:
November 08, 2011, at 09:31 AM by 128.143.137.203 -
Changed line 92 from:
'^1^' Access from login nodes is assured. The GFFS is not alwasy accesible from the compute nodes.
to:
'^1^' Access from login nodes is assured. The GFFS is not always accessible from the compute nodes.
November 08, 2011, at 09:31 AM by 128.143.137.203 -
Changed line 92 from:
'^Superscript^'1 Access from login nodes is assured. The GFFS is not alwasy accesible from the compute nodes.
to:
'^1^' Access from login nodes is assured. The GFFS is not alwasy accesible from the compute nodes.
November 08, 2011, at 09:30 AM by 128.143.137.203 -
Changed lines 91-93 from:
Summary

References
to:
!!!Footnotes
'^Superscript^'1 Access from login nodes is assured. The GFFS is not alwasy accesible from the compute nodes.

!!!
References
November 08, 2011, at 09:28 AM by 128.143.137.203 -
Changed lines 95-97 from:
2. Campbell, R.H., et al., Designing and implementing Choices: an object-oriented system in C++. Communications of the ACM, 1993. 36(9): p. 117 - 126
to:

2. Campbell, R.H., et al., Designing and implementing Choices: an object-oriented system in C++. Communications of the ACM, 1993. 36(9): p. 117 - 126
Added line 99:
Added line 101:
Added line 103:
Added line 105:
Added line 107:
Added line 109:
Added line 111:
Added line 113:
Added line 115:
Added line 117:
Added line 119:
Added line 121:
Added line 123:
Added line 125:
Added line 127:
Added line 129:
Added line 131:
Added line 133:
Added line 135:
Added line 137:
Added line 139:
Added line 141:
Added line 143:
Added line 145:
Added line 147:
Added line 149:
Added line 151:
Added line 153:
Added line 155:
Added line 157:
Added line 159:
Added line 161:
Added line 163:
Added line 165:
Added line 167:
Added line 169:
Added line 171:
Added line 173:
Added line 175:
Added line 177:
Added line 179:
Added line 181:
Added line 183:
Added line 185:
Added line 188:
Added line 190:
Added line 192:
Added line 194:
Added line 196:
November 08, 2011, at 09:24 AM by 128.143.137.203 -
Changed lines 15-16 from:
Accessing data on a campus machine from an NSF center
Similarly, Sarah and her students can directly access files on their clusters and desktops at Big State U. directly from the centers. This means they can keep one set of sources, makefiles, and scripts at Big State U., and compile and execute against them from any of the NSF service providers. For example, suppose that Sarah’s group keeps their sources and scripts in the directory /home/Sarah/sources on her departmental file server. She, could export /home/Sarah/sources into the GFFS and access it in scripts or at the command line from any of the service providers. Any changes made to the files, either at Big State U, or at any of the service providers, will be immediately visible to GFFS users, including her own jobs and scripts running at Big State U or any of the service providers .
to:
!!!Accessing data on a campus machine from an NSF center
Similarly, Sarah and her students can directly access files on their clusters and desktops at Big State U. directly from the centers. This means they can keep one set of sources, makefiles, and scripts at Big State U., and compile and execute against them from any of the NSF service providers. For example, suppose that Sarah’s group keeps their sources and scripts in the directory /home/Sarah/sources on her departmental file server. She, could export /home/Sarah/sources into the GFFS and access it in scripts or at the command line from any of the service providers. Any changes made to the files, either at Big State U, or at any of the service providers, will be immediately visible to GFFS users, including her own jobs and scripts running at Big State U or any of the service providers'^1^' .
November 08, 2011, at 09:23 AM by 128.143.137.203 -
Added lines 14-146:

Accessing data on a campus machine from an NSF center
Similarly, Sarah and her students can access files on their clusters and desktops at Big State U. directly from the centers. This means they can keep one set of sources, makefiles, and scripts at Big State U., and compile and execute against them from any of the NSF service providers. For example, suppose that Sarah’s group keeps their sources and scripts in the directory /home/Sarah/sources on her departmental file server. She could export /home/Sarah/sources into the GFFS and access it in scripts or at the command line from any of the service providers. Any changes made to the files, either at Big State U or at any of the service providers, will be immediately visible to GFFS users, including her own jobs and scripts running at Big State U or any of the service providers.
Next, consider the case in which Sarah’s lab has an instrument that generates data files from experiments and places them in a local directory. As is so often the case, suppose the instrument comes with a Windows computer onto which the data is dumped. Sarah could export the directory in which the data is placed by the instrument, e.g., c:\labMaster-1000\outfiles, into the GFFS. The data will then be directly accessible not only at her home institution, but also at the service providers, without any need to copy the data explicitly.
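As a minimal sketch of how such an instrument directory might be exported, following the form of the export command shown later in this document; the container name, local path, and GFFS path here are hypothetical:
# export the instrument's output directory into the GFFS (paths are illustrative)
grid export /containers/Big-State-U/lab-pc "c:\labMaster-1000\outfiles" /home/Sarah/instrument-data
Once the export exists, collaborators simply see /home/Sarah/instrument-data as another GFFS directory.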
Sharing data with a collaborator at another institution
Finally, consider the case of a multi-institution collaboration in which Sarah is collaborating with a team led by Bob at Small-State-U. Suppose Bob’s team is developing and maintaining some of the applications used in the workflow. Suppose that Bob’s team also needs to access both Sarah’s instrument data and the data her team has generated at TACC. First, Bob can export his source and binary trees into the GFFS and give Sarah and her team access to the directories. Sarah can similarly give Bob and his team access to the necessary directories in the GFFS. Bob can then directly access Sarah’s data both at Big-State-U and at TACC. An interesting aspect is that Bob accessing Sarah’s data at Big-State-U, and Sarah accessing Bob’s code at Small-State-U, does not necessarily involve XSEDE at all, even though they are using the XSEDE-provided GFFS as a medium.
Access to non file system resources
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file system data in much the same manner as Plan 9 [1]; any resource type can be modeled as a file or directory, compute resources, databases, running jobs, and communications channel.
Computer resources: A compute resource, such as a PBS controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources such as queuing systems it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a subfolder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer computation.

Figure 2 Directory listing of a running job showing the working directory of the job.

Recall that in our earlier example Sarah’s group had a local compute cluster. Sarah could also export her compute cluster into the GFFS as a shared compute resource and give Bob’s group access to the cluster. Bob’s group could then use Sarah’s resource without needing special accounts, and without having to log in to Sarah’s machines. If Bob too had a cluster, he could export that cluster into the GFFS. They could then create a shared Grid Queue that includes both of their clusters and load balances jobs between the two resources – effectively creating a mini-compute grid.
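As a sketch of what this file-system view of a job looks like in practice, assuming the GFFS is mounted at $HOME/XSEDE as described later; the resource path, job name, and file names inside the job folder are hypothetical:
# list the jobs currently "in" the compute resource
ls $HOME/XSEDE/home/Sarah/cluster/
# check the status file of one job
cat $HOME/XSEDE/home/Sarah/cluster/job-0042/status
# watch its stdout while it runs
tail -f $HOME/XSEDE/home/Sarah/cluster/job-0042/working-dir/stdout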
RDBMS: Relational databases can similarly be modeled as a folder or directory containing a set of tables. Each sub-table is itself a folder that contains sub-tables (created by executing queries against it) and a text file that provides a CSV text representation of the table. Queries can be executed by copying or dragging a text file with a SQL query into the folder. The result of the query is itself a new subfolder.
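A sketch of how a query might be run through the mounted namespace, assuming the database has already been linked into the GFFS; every path and file name below is illustrative, and, as noted in a footnote in this document, this capability has been demonstrated but is not production-ready:
# drop a SQL query into the table's folder to execute it
cp top-genes.sql $HOME/XSEDE/data/bio/experimentDB/genes/
# the result appears as a new subfolder containing a CSV representation
cat $HOME/XSEDE/data/bio/experimentDB/genes/top-genes/table.csv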
Named pipes: Often two or more applications need to communicate. Traditionally, applications can communicate via files in the file system, e.g., application A writes file A_output and application B reads the file, or via message passing [4, 5] or sockets of some kind, e.g., open a TCP connection to a well-known address and send bytes down the channel. In Unix, pipes are also often used for programs started on the same machine. Unfortunately, in wide-area distributed systems, many resources are behind NATs and firewalls and simply opening a socket is not always an easy option.
To address this problem the GFFS supports named pipes. GFFS named pipes are analogous to their Unix counterparts; they are buffered streams of bytes. Named pipes appear in the namespace just as any other file, and have access control like any other file. As with Unix named pipes, GFFS named pipes may have many readers and writers, though the same caveats apply. Thus, an application can create a named pipe at a well-known location and then read from it, awaiting another application to write to it.
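A minimal sketch of two applications communicating through a GFFS named pipe via the file system mount; the pipe location and program names are hypothetical, and the pipe is assumed to have been created beforehand:
# machine A: block reading from the pipe until data arrives
cat $HOME/XSEDE/home/Sarah/pipes/results-stream | ./consume-results
# machine B (possibly behind a NAT): write results into the same pipe
./simulation > $HOME/XSEDE/home/Sarah/pipes/results-stream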
An Aside on GFFS Goals and Non-Goals
The complexity of sharing resources between researchers creates a barrier to resource sharing and collaboration – an activation energy if you like. Too often, the energy barrier is too high, and valuable collaborations that could lead to breakthrough science do not happen, or, if they do, take much longer and cost more.
One of the most common complaints about grid computing, and the national cyberinfrastructure more generally, is that it is not easy to use. We feel strongly that rather than have users adapt to the infrastructure, the infrastructure should adapt to users. In other words, the infrastructure must support interaction modalities and paradigms with which users are already familiar. Towards that end, simplicity and ease-of-use is critical.
When considering ease of use, the first and most important observation is that most scientists do not want to become computer hackers. They view the computer as a tool that they use every day for a wide variety of tasks: reading email, saving attachments, opening documents, cruising through the directory/folder structure looking for a file, and so on. Therefore, rather than have scientists learn a whole new paradigm to search for and access data, we believe the paradigm with which they are already familiar should be extended across organizational boundaries and to a wider variety of file types.
Therefore, the core, underlying goal of the GFFS is to empower science and engineering by lowering the barriers to carrying out computationally based research. Specifically, we believe that the mechanisms used must be easy to use and learn, must not require changes to existing infrastructures on campuses and labs, and must support interactions between the centers and campuses, among campuses, and with other international infrastructures. We believe complexity is the major problem that must be addressed.
Ease of use is just one of many quality attributes a system such as the GFFS exhibits. Others are security, performance, availability, reliability, and so on. With respect to performance, we are often asked how GFFS performance compares to parallel file systems such as Lustre [6] or GPFS [7]. For us this is somewhat of a non sequitur. Competing with Lustre and GPFS is not a goal – the GFFS is not designed to be a high-performance parallel file system. It is designed to make it easy to federate across many different organizations and make data easily accessible to users and applications.
GFFS Implementation
The GFFS uses as its foundation standard protocols from the Open Grid Forum [8-20], OASIS [21-29], the W3C [30-32], and others [33]. As an open, standards-based system, any implementation can be used. The first realization of the GFFS at XSEDE is using the Genesis II implementation from the University of Virginia [34, 35]. Genesis II has been in continuous operation at the University of Virginia since 2007 in the Cross Campus Grid (XCG) [36]. In mid-2010, the XCG was extended to include FutureGrid resources at Indiana University, SDSC, and TACC.
Client Side – Accessing Resources
By “client-side”, we mean the users of resources in the GFFS (the data clients in Figure 1). An example is a visualization application Sarah might run on her workstation that accesses files residing at an NSF service provider such as TACC.
Three mechanisms can be used to access data in the GFFS: a command line tool; a graphical user interface; and an operating system specific file system driver. (http://genesisii.cs.virginia.edu/docs/Client-usage-v1.0.pdf). The first step in using any of the GFFS access mechanisms is to install the XSEDE Genesis II client. There are client installers for Windows, Linux, and MacOS (http://genesis2.virginia.edu/wiki/Main/Downloads ). The installers work like most installers. You download the installer, double click on it, and follow the directions. It is designed to be as easy to install as TurboTax®. Within two or three minutes, you will be up and ready to go.
On Linux and MacOS, we provide a GFFS-aware FUSE file system driver to map the global namespace into the local file system namespace. FUSE is a user space file system driver that requires no special permission to run. Thus, one does not have to be “root” to mount a FUSE device.
Once the client has been installed and the user is logged in, mounting the GFFS in Linux requires two simple steps: create a mount-point, and mount the file system as shown below.
mkdir XSEDE
nohup grid fuse --mount local:XSEDE &

Once mounted, the XSEDE directory can be used just like any other mounted file system.
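For example (a sketch; the GFFS paths below are hypothetical), ordinary tools simply work against the mount:
# browse the global namespace
ls XSEDE/home/Sarah
# copy a result file from a service provider to the local disk
cp XSEDE/home/Sarah/results/run42.out .
# the FUSE mount can typically be detached with the standard Linux FUSE tool
# (assumption: no grid-specific unmount step is required)
fusermount -u XSEDE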
Sharing Resources
As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs. We will keep our focus here on sharing data, specifically files and directories. For all of the below we will assume the GFFS client is already installed and that we have linked the GFFS into our Unix home directory at $HOME/XSEDE. We will further assume that our user Sarah has a directory in the GFFS at /home/Sarah. Given where the GFFS is mounted, the Unix path to that directory is $HOME/XSEDE/home/Sarah. We will also assume below, unless otherwise noted, that our current working directory is $HOME/XSEDE/home/Sarah.
Before we get to sharing local data resources, let’s first look at how to create a file or directory “somewhere in the GFFS”. Creating a file or directory in the GFFS is simple. For example,
mkdir test
echo “This is a test” >> test/newfile
creates a new file in the newly created “test” directory. Once created, both the directory and the file are available throughout the GFFS, subject to access control.
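For instance (a sketch, assuming a colleague has been granted read access and has the GFFS mounted at the same relative location), the new file is immediately visible from any other GFFS client:
# on a collaborator's machine, through their own GFFS mount
cat $HOME/XSEDE/home/Sarah/test/newfile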
However, where is the data actually stored? The short answer is that the “test” directory will be created in the same place where the current working directory is located. Similarly, “newfile” will be placed in the same location as the “test” directory.
So, where is that? A bit of background here is useful. The GFFS uses a standards-based Web Services model. Most Web Services (including those that implement the GFFS) execute inside of a program called a Web Services container. A container is a program that accepts Web Services connections (in our case https connections), parses the request, and calls the appropriate function to handle the request. Web Services containers are often written in Java, and execute on Windows, Linux, or MacOS machines like any other application. The difference is that they listen for http/https connections and respond to them.
In the GFFS, files and directories are stored in different GFFS Web Service containers (just “containers” from here on). There are GFFS containers at the NSF service providers, and there are containers wherever someone wants to share a resource. Therefore, the first step to sharing a resource is to install the GFFS container. The installation process is very similar to installing the client if one chooses to use only the default options. It can be more complicated if, for example, your resource is behind a NAT or firewall. The GFFS container requires no special permissions or privilege, though it is recommended that it be started as its own user with minimum privilege.
To get back to our earlier question, “where is the data actually stored?” The new directory and the new file will be placed on the same container as Sarah’s GFFS home directory. In a moment, we will see how to override this. For now, however, we will assume that is where they will be placed.
Given that the new items will be placed on the same GFFS container as Sarah’s home directory, exactly where will they be placed? There are two possibilities. Either Sarah’s home directory is being stored in the container’s own databases and storage space (in other words, the container is acting as a storage service), or her home directory is an export; in other words, it is being stored in a local file system somewhere. Below we briefly examine each of these options. Keep in mind that either way a GFFS container is providing access to the data.
File System Resources – a.k.a. exports
An export takes the specified rooted directory tree, maps it into the global namespace, and thus provides a means for non-local users to access data in the directory via the GFFS. Local access to the exported directory is unaffected. Existing scripts, cron jobs, and applications can continue to access the data.
Once a GFFS container is running that can “see” the directory to be exported, it is quite simple to share data. For example, Sarah could share out her source tree using the simple command
grid export /containers/Big-State-U/Sarah-server /development/sources /home/Sarah/dev

This exports, from the machine “Sarah-server”, the directory tree rooted at “/development/sources”, and links it into the global namespace at the path “/home/Sarah/dev”. Once exported, the data is accessible (subject to access control) until the export is terminated. The net result is that a user can decide to securely share out a particular directory structure with colleagues anywhere with a network connection, and those collaborators can subsequently access it with no effort.
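A sketch of the collaborator's side, assuming Sarah has granted read access and the collaborator has mounted the GFFS at $HOME/XSEDE; the build command is purely illustrative:
# browse and build directly against Sarah's exported source tree
ls $HOME/XSEDE/home/Sarah/dev
make -C $HOME/XSEDE/home/Sarah/dev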
Storage Resources
Earlier we said that the new files and directories could be stored in the container’s own databases and file system resources. What exactly does that mean and how can we use it?
Compute Resources

Compute Sharing. Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner. For example, to create an OGSA-BES resource that proxies a PBS queue, one can run
create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource
creates a new OGSA-BES resource on the host camillus in the computer science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it what it needs to know about the local queuing system, in this case that it is a PBS queue, where shared scratch space is to be found, and so forth. Access control is now at the user’s discretion.
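Continuing the example, a sketch of how a job might then be started on the new resource through the file system view described earlier; the JSDL file name is hypothetical, and the GFFS is assumed to be mounted at $HOME/XSEDE:
# copying a JSDL job description into the resource's directory starts the job
cp myjob.jsdl $HOME/XSEDE/home/grimshaw/testPBS-resource/
# each job then appears as a sub-folder of the resource
ls $HOME/XSEDE/home/grimshaw/testPBS-resource/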
Security
Signing on in XSEDE is accomplished using your existing XSEDE portal username and password.
As a standards-based grid implementation, the GFFS pulls together a number of well-adopted and supported security specifications to ensure that GFFS users enjoy a secure and protected grid environment. For starters, all communication in the GFFS happens over SSL-encrypted communication channels. In addition, messages are signed using WS-Security SOAP headers and contain signed assertions detailing what permissions the caller has with respect to the message in question. Finally, upon receipt of such a message, the GFFS compares the signed credential assertions contained in the message (or alternatively, the X.509 certificate used on the SSL connection) against that resource’s access control list (ACL) in order to authorize the operation.
Because of the file system nature of the GFFS, access control lists for resources are typically broken up into familiar (at least in terms of Mac OS X, UNIX, and Microsoft Windows) read, write, and execute categorizations. Correspondingly, operations on grid resources are also classified into these categories (e.g., the ability to create a job on a compute resource is considered an execute permission while the ability to manage the compute resource is considered a write privilege). This makes it very easy for users to maintain their seamless view of the grid as a file system. Naturally, in support of this file system interface the GFFS supports a chmod tool (both through the included grid client and through the FUSE file system driver) that allows for easy manipulation of a grid resource’s access control list.
The GFFS supports the notion of grid users and grid groups, both of which can be used in access control lists for any resource. The GFFS represents users and groups with a grid resource that implements the well-known web standard Secure Token Service (STS) interface. A user simply has to log in to these resources using a login tool and prove their identity (usually with a username/password, but it can be certificate-based). If the login is successful, the user receives a set of signed SAML assertions that state he/she has the permissions of that user or group. Other tools then automatically use these assertions during subsequent operations, and target resources check their validity and compare them against access control lists to do authorization.
Groups are a cornerstone for easy permission maintenance and for forming virtual organizations. Users with proper permission can create a new group and easily maintain its membership. Group credentials can then be used to provide access to grid resources to members of the virtual organization as appropriate. Since the GFFS embodies a wide range of relevant system components as grid resources, permissions to execution resources (BES and grid queues), data (files, exports/file system proxies, directories), and users/groups can all be managed with one simple mechanism.
In the interest of making user login and credential management as simple as possible, the GFFS allows a user to acquire both his/her user credentials and his/her group credentials in a single login command. A user simply links, “ln”, his/her desired auto-login groups to his/her user resource, and the login command will attempt to log into each group listed using his/her user credentials. The group is already set up to allow only the proper users to log in and acquire the group’s credentials.
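A sketch of how these pieces fit together from the grid client, using the chmod and ln tools mentioned above; the paths are hypothetical and the exact argument syntax shown is an assumption for illustration, not a command reference:
# give Bob's group read/execute rights on a shared directory (illustrative syntax)
grid chmod /home/Sarah/dev +rx /groups/bio-collab
# link the group to Sarah's user resource so a single login picks it up automatically
grid ln /groups/bio-collab /users/Sarah/bio-collab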
Performance
During the XSEDE proposal preparation process, doubts were raised as to whether a Web Services-based solution complete with XML processing, message signing, and SSL could meet the performance requirements of the XSEDE user community. Two different performance scenarios were discussed: the bandwidth and latency of a single client, and the aggregate throughput of a large number of clients.
We have developed and adopted a number of IO benchmarks for the GFFS.
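As a sketch of the single-client scenario only (not one of the project's own benchmarks), a rough bandwidth number can be obtained by streaming data through the FUSE mount with standard tools; the target path is hypothetical:
# write 1 GiB through the mount and report throughput
dd if=/dev/zero of=$HOME/XSEDE/home/Sarah/bench.tmp bs=1M count=1024
# read it back
dd if=$HOME/XSEDE/home/Sarah/bench.tmp of=/dev/null bs=1M
rm $HOME/XSEDE/home/Sarah/bench.tmp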
Related Work
Sharing data within a department or research group is easy – data is typically stored on a network-accessible file system such as NFS or CIFS, and sharing is as easy as setting permissions and using the correct file system path to the data. It is much more complicated when collaborators are at different institutions with different identity domains (e.g., Unix UID spaces) and no common shared file system. Current best practice is to copy files around with scp or some other copy-based tool, or to place the data on a web site and copy it around either manually or via scripts. This presents significant challenges, e.g., trust, account management, consistency, etc.
Wide-area file systems have been developed and deployed over the years [6, 37-46] to address these problems. However, they have not had much adoption, particularly in multi-organizational settings. We believe that they have not become ubiquitous because: 1) they are not federating file systems, i.e., they require data to be stored in a particular way and cannot simply layer on top of local file systems; 2) they often require kernel modifications that require root permission and possibly synchronizing kernel versions across sites; 3) they often require changes to an organization’s authentication infrastructure, an extremely difficult task, particularly if undertaken to satisfy only one user or group’s need to share; and 4) they require a degree of trust among partners that is infeasible in practice due to the requirement that the kernels of different machines in the system trust one another.
The Grid community has developed a plethora of different low-level mechanisms and standards for data management in Grids, including Gridftp [17], EU DG [47], OGSA-ByteIO [48], DIAS [49] and others, none of which provides the seamless transparent transition for end users from their local environment to a Grid environment. This frustrates potential users and uses of Grid systems.
Globus has GASS [50], RLS [51], and GridFTP [17], and has provided RIO [52] in the past. Our approach differs significantly in that we mask from the application the fact that it is using the Grid at all; instead, users view the Grid as an extension of the file system. Similarly, we differ in that the coherence of the data is automatically maintained (in an object-specific fashion). This is significantly different from the Globus approach, where the user must explicitly manage checking file status and copying files back and forth with GridFTP.
dCache (www.dcache.org) is a distributed data management infrastructure gaining increasing use in the high-performance community. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods. dCache shares many of the same goals as the proposed work. The primary difference is the set of standards used: dCache uses SRM (Storage Resource Manager) rather than RNS and ByteIO. As a result, dCache is focused on data only; the namespace provided cannot include resources other than data and directories, whereas in the GFFS pathnames can refer to any type of resource.
Lustre/WAN [6] and GPFS/WAN [7] are wide-area implementations of their local cluster production versions. The single-site versions of Lustre and GPFS, as well as other cluster file systems such as gluster (www.gluster.org), are in widespread use. The performance of both in the wide area is very good. The problem with both is the need to choose one and to store data in it. This requires everybody to buy into the same file system implementation, and all of the kernel requirements that follow from that decision. It also requires a degree of trust between kernels that, while possible between the centers, is simply not realistic between campuses, the centers, and individual researcher machines.

Summary

References
1. Pike, R., et al. Plan 9 from Bell Labs. in UKUUG Summer 1990 Conference. 1990. London, UK.
2. Campbell, R.H., et al., Designing and implementing Choices: an object-oriented system in C++. Communications of the ACM, 1993. 36(9): p. 117 - 126
3. Campbell, R.H., et al., Principles of Object Oriented Operating System Design. 1989, Department of Computer Science, University of Illinois: Urbana, Illinois.
4. Gropp, W., E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. 1994: MIT Press.
5. Geist, A., et al., PVM: Parallel Virtual Machine. 1994: MIT Press.
6. Lustre, The Lustre File System. 2009.
7. IBM. General Parallel File System. 2005 [cited; Available from: http://www-03.ibm.com/systems/clusters/software/gpfs.html.
8. OGF, Open Grid Forum, Open Grid Forum.
9. Grimshaw, A., D. Snelling, and M. Morgan, WS-Naming Specification. 2007, Open Grid Forum, GFD-109.
10. Antonioletti, M., et al., Web Services Data Access and Integration - The Core (WS-DAI) Specification, Version 1.0 2006, Open Grid Forum.
11. Newhouse, S. and A. Grimshaw, Independent Software Vendors (ISV) Remote Computing Usage Primer, in Grid Forum Document, G. Newby, Editor. 2008, Open Grid Forum. p. 141.
12. Jordan, C. and H. Kishimoto, Defining the Grid: A Roadmap for OGSA® Standards v1.1 [Obsoletes GFD.53] 2008, Open Grid Forum.
13. Merrill, D., Secure Addressing Profile 1.0 2008, Open Grid Forum.
14. Merrill, D., Secure Communication Profile 1.0 2008, Open Grid Forum.
15. Snelling, D., D. Merrill, and A. Savva, OGSA® Basic Security Profile 2.0. 2008, Open Grid Forum.
16. Grimshaw, A., et al., An Open Grid Services Architecture Primer. IEEE Computer, 2009. 42(2): p. 27-34.
17. Allcock, W., GridFTP Protocol Specification Open Grid Forum, 2003. GFD.20.
18. Foster, I., T. Maguire, and D. Snelling, OGSA WSRF Basic Profile 1.0, in Open Grid Forum Documents. 2006. p. 23.
19. Morgan, M., A.S. Grimshaw, and O. Tatebe, RNS Specification 1.1. 2010, Open Grid Forum. p. 23.
20. Morgan, M. and O. Tatebe, RNS 1.1 OGSA WSRF Basic Profile Rendering 1.0. 2010, Open Grid Forum. p. 16.
21. OASIS. Organization for the Advancement of Structured Information Standards. [cited; Available from: http://www.oasis-open.org/.
22. OASIS-SOAPSec. Web Services Security: SOAP Message Security. 2003 [cited August 27, 2003]; Working Draft 17.
23. OASIS. WS-Security. 2005 [cited; Available from: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss.
24. Graham, S., et al., Web Services Resource 1.2 2 (WS-Resource). 2005.
25. Snelling, D., I. Robinson, and T. Banks, WSRF - Web Services Resource Framework. 2006, OASIS.
26. OASIS, Web Services Security X.509 Certificate Token Profile, in OASIS Standard Specification. 2006, OASIS.
27. OASIS, Web Services Security Username Token Profile 1.1, in OASIS Standard Specification. 2006.
28. OASIS, WS-Trust 1.3, in OASIS Standard Specification. 2007.
29. OASIS, Web Services Security Kerberos Token Profile, in OASIS Standard Specification. 2006.
30. Box, D., et al., Web Services Addressing (WS-Addressing). 2004, W3C.
31. Christensen, E., et al. Web Services Description Language (WSDL) 1.1. 2001 [cited; Available from: http://www.w3.org/TR/wsdl.
32. W3C, XML Encryption Syntax and Processing, in W3C Recommendation. 2002, W3C.
33. WS-I, Basic Security Profile 1.0, in WS-I Final Material. 2007.
34. Morgan, M. and A. Grimshaw. Genesis II - Standards Based Grid Computing. in Seventh IEEE International Symposium on Cluster Computing and the Grid 2007. Rio de Janario, Brazil: IEEE Computer Society.
35. Virginia, U.o. The Genesis II Project. 2010 [cited; Available from: http://genesis2.virginia.edu/wiki/Main/HomePage.
36. Group, G.I., Cross Campus Grid (XCG). 2009.
37. Satyanarayanan, M., Scalable, Secure, and Highly Available Distributed File Access. IEEE Computer, 1990. 23(5): p. 9-21.
38. Levy, E. and A. Silberschatz, Distributed File Systems: Concepts and Examples. ACM Computing Surveys, 1990. 22(4): p. 321-374.
39. White, B., et al. LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications. in SC 01. 2001. Denver, CO.
40. Stockinger, H., et al., File and Object Replication in Data Grids. Journal of Cluster Computing, 2002. 5(3): p. 305-314.
41. Huang, H. and A. Grimshaw, Grid-Based File Access: The Avaki I/O Model Performance Profile. 2004, Department of Computer Science, University of Virginia: Charlottesville, VA.
42. Heizer, I., P.J. Leach, and D.C. Naik. A Common Internet File System (CIFS/1.0) Protocol. 1996 [cited; Available from: http://www.tools.ietf.org/html/draft-heizer-cifs-v1-spec-00.
43. Walker, B.e.a. The LOCUS Distributed Operating System. in 9th ACM Symposium on Operating Systems Principles. 1983. Bretton Woods, N. H.: ACM.
44. Adya, A., et al. FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment. 2002 [cited.
45. Morris, J.H.e.a., Andrew: A distributed personal computing environment. Communications of the ACM, 1986. 29(3).
46. Shepler, S., et al. Network File System (NFS) version 4 Protocol. 2003 [cited RFC 3530; Available from: http://www.ietf.org/rfc/rfc3530.txt.
47. Kunszt, P., et al. Data storage, access and catalogs in gLite. in Local to Global Data Interoperability - Challenges and Technologies, 2005. 2005.
48. White, B.S., A.S. Grimshaw, and A. Nguyen-Tuong. Grid-Based File Access: The Legion I/O Model. in 9th IEEE International Symposium on High Performance Distributed Computing. 2000.
49. Foster, I., et al., Modeling and Managing State in Distributed Systems: The Role of OGSI and WSRF, in Proceedings of the IEEE, 93(3). 2005.
50. Bester, J., et al. GASS: A Data Movement and Access Service for Wide Area Computing Systems. in Sixth Workshop on I/O in Parallel and Distributed Systems. 1999.
51. Chervenak, A., et al., The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications, 2001. 23: p. 187-200.
52. Fitzgerald, S., et al. A Directory Service for Configuring High-Performance Distributed Computations. in 6th IEEE Symposium on High-Performance Distributed Computing. 1997: IEEE Computer Society Press.
November 08, 2011, at 09:15 AM by 128.143.137.203 -
Added lines 12-14:
!!!Accessing data at an NSF center from a home or campus
Using the GFFS, Sarah and her students can export their home directories or scratch directories at TACC into the global namespace. They can then mount the GFFS on their Linux workstations and on their cluster nodes. This permits them to edit, view, and visualize application parameter files, input files, intermediate files, and final output files directly from their desktop. Further, they can start local applications that monitor application progress (by checking for files in a directory, or scanning an output file), all in real time against the actual data at TACC. There is no need to explicitly transfer (copy) files back and forth, nor is there any need to keep track of which version of which file has been copied – consistency with the data at TACC is assured.
November 08, 2011, at 09:14 AM by 128.143.137.203 -
Added line 4:
Added line 6:
Added line 8:
Added line 10:
November 08, 2011, at 09:12 AM by 128.143.137.203 -
Changed lines 3-7 from:
%lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
to:
The GFFS was born out of a need to access and manipulate remote resources such as file systems in a federated, secure, standardized, scalable, and transparent manner without requiring either data owners or applications developers and users to change how they store and access data in any way.
The GFFS accomplishes this by employing a global path-based namespace, e.g., /data/bio/file1. Data in existing file systems, whether they are Windows file systems, MacOS file systems, AFS, Linux, or Lustre file systems, can then be exported, or linked, into the global namespace. For example, a user could export a local rooted directory structure on their “C” drive, C:\work\collaboration-with-Bob, into the global namespace at /data/bio/project-Bob. Files and directories on the user’s “C” drive in \work\collaboration-with-bob would then, subject to access control, be accessible to users in the GFFS via the /data/bio/project-Bob path.
Transparent access to data (and resources more generally) is realized by using OS-specific file system drivers that understand the underlying standard security, directory, and file access protocols employed by the GFFS. These file system drivers map the GFFS global namespace onto a local file system mount. Data and other resources in the GFFS can then be accessed exactly the same way local files and directories are accessed – applications cannot tell the difference.
Three examples illustrate typical GFFS use cases: accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
For each of these three examples, suppose that Sarah is an XSEDE user at Big State U, and that she and her students regularly run jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline) comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files.
%lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg
November 08, 2011, at 09:11 AM by 128.143.137.203 -
Changed lines 3-99 from:

The GFFS was born out of a need to access and manipulate remote resources such as file systems in a federated, secure, standardized, scalable, and transparent manner without requiring either data owners or applications developers and users to change how they store and access data in any way.

The GFFS accomplishes this by employing a global path-based namespace, e.g., /data/bio/file1. Data in existing file systems, whether they are Windows file systems, MacOS file systems, AFS, Linux, or Lustre file systems can then be exported, or linked into the global namespace. For example, a user could export a local rooted directory structure on their “C” drive, C:\work\collaboration-with-Bob, into the global namespace at /data/bio/project-Phil. Files and directories on the user’s “C” drive in \work\collaboration-with-bob would then, subject to access control, be accessible to users in the GFFS via the /data/bio/project-Bob path.

Transparent access to data (and resources more generally) is realized by using OS-specific file system drivers that understand the underlying standard security, directory, and file access protocols employed by the GFFS. These file system drivers map the GFFS global namespace onto a local file system mount. Data and other resources in the GFFS can then be accessed exactly the same way local files and directories are accessed – applications cannot tell the difference.
Three examples illustrate GFFS typical uses cases, accessing data at an NSF center from a home or campus, accessing data on a campus machine from an NSF center, and directly sharing data with a collaborator at another institution.
For each of these three examples suppose that Sarah is an XSEDE user at Big State U and her students regularly runs jobs on Ranger at TACC. She and her students run many of the same sorts of jobs (though much smaller) on their local cluster, and they do software and script development on their local cluster. The software consists of a workflow (pipeline), comprised of a number of programs that generate intermediate results used in subsequent stages of the pipeline. Further, Sarah and her students frequently need to check on the pipeline as it is executing by examining or visualizing intermediate files. %lfloat% http://genesis2.virginia.edu/wiki/uploads/Main/gffs.jpeg

!!!Accessing data at an NSF center from a home or campus
Using the GFFS Sarah and her students can export their home directories or scratch directories at TACC into the global namespace. They can then mount the GFFS on their Linux workstations and on their cluster nodes. This permits them to directly edit, view, and visualize application parameter files, input files, intermediate files, and final output files directly from their desktop. Further, they can start local applications that can monitor application progress (by checking for files in a directory, or scanning an output file) all in real time against the actual data at TACC. There is no need to explicitly transfer (copy) files back and forth, nor is there any need to keep track of which version of which file has been copied –consistency with the data at TACC is assured.

!!!Accessing data on a campus machine from an NSF center
Similarly, Sarah and her students can directly access files on their clusters and desktops at Big State U. directly from the centers. This means they can keep one set of sources, makefiles, and scripts at Big State U., and compile and execute against them from any of the NSF service providers. For example, suppose that Sarah’s group keeps their sources and scripts in the directory /home/Sarah/sources on her departmental file server. She, could export /home/Sarah/sources into the GFFS and access it in scripts or at the command line from any of the service providers. Any changes made to the files, either at Big State U, or at any of the service providers, will be immediately visible to GFFS users, including her own jobs and scripts running at Big State U or any of the service providers .

Next, consider the case when Sarah’s lab has an instrument that generates data files from experiments and places them in a local directory. As is so often the case, suppose the instrument in-fact comes with a Windows computer onto which the data is dumped. Sarah could export the directory in which the data is placed by the instrument, e.g., c:\labMaster-1000\outfiles into the GFFS. The data will then be directly accessible not only at her home institution, but also at the service providers, without any need to the data explicitly.

!!!Sharing data with a collaborator at another institution
Finally consider the case of a multi-institution collaboration in which Sarah is collaborating with a team led by Bob at Small-State-U. Suppose Bob’s team is developing and maintaining some of the applications used in the workflow. Suppose that Bob’s team also needs to access both Sarah’s instrument data as well as the data her team has generated at TACC. First, Bob can export his source and binary trees into the GFFS and give Sarah and her team access to the directories. Sarah can similarly give Bob and his team access to the necessary directories in the GFFS. Bob can then directly access Sarah’s data both at Big-State-U and at TACC. An interesting aspect is that, Bob accessing Sarah’s data at Big-State-U, and Sarah accessing Bob’s code at Small-State-U does not necessarily involve XSEDE at all though they are using the XSEDE-provided GFFS as a medium.

!!!Access to non file system resources
Not all resources are directories and flat files. The GFFS reflects this by facilitating the inclusion of non-file-system data in much the same manner as Plan 9 [1]; any resource type, including compute resources, databases, running jobs, and communication channels, can be modeled as a file or directory.

!!!Computer resources
A compute resource, such as a PBS-controlled cluster, can be modeled as a directory (folder). To start a job, simply drag or copy a JSDL XML file describing the job into the directory. The job will then logically begin executing. (We say logically because on some resources, such as queuing systems, it is scheduled for execution.) A directory listing of the folder will show sub-folders for each of the jobs “in” the compute resource. (A similar concept was first introduced with Choices [2, 3].) Within each job folder is a text file with the job status, e.g., Running or Finished, and a sub-folder that is the current working directory of the job with all of the intermediate job files, input files, output files, stdout, stderr, etc. The user can interact with the files in the working directory while the job executes, both reading them to monitor execution and writing them to steer the computation.
(Figure: screen shot of a job’s current working directory.)
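
As an illustrative sketch only – the mount point, the compute-resource path /resources/Big-State-U/cluster-queue, and the job-folder and file names (job-0001, status, working-dir) are all hypothetical – submitting and monitoring a job through the mounted GFFS could look like:
cp myjob.jsdl $HOME/XSEDE/resources/Big-State-U/cluster-queue/               # dropping a JSDL file starts the job
ls $HOME/XSEDE/resources/Big-State-U/cluster-queue/                          # one sub-folder per job
cat $HOME/XSEDE/resources/Big-State-U/cluster-queue/job-0001/status          # e.g., Running or Finished
ls $HOME/XSEDE/resources/Big-State-U/cluster-queue/job-0001/working-dir/     # the job's current working directory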

Recall that in our earlier example Sarah’s group had a local compute cluster. Sarah could also export her compute cluster into the GFFS as a shared compute resource and give Bob’s group access to the cluster. Bob’s group could then use Sarah’s resource without needing special accounts, and without having to log in to Sarah’s machines. If Bob too had a cluster, he could export that cluster into the GFFS. They could then create a shared Grid Queue that includes both of their clusters and load balances jobs between the two resources – effectively creating a mini compute grid.

!!!RDBMS
Relational databases can similarly be modeled as a folder or directory containing a set of tables. Each table is itself a folder that contains sub-tables (created by executing queries against it) and a text file that serves as a CSV representation of the table. Queries can be executed by copying or dragging a text file containing a SQL query into the folder. The result of the query is itself a new sub-folder.
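
A hypothetical illustration, assuming the GFFS is mounted at $HOME/XSEDE; the database path /data/labDB, the table name samples, and the column names are invented for this sketch:
echo "SELECT id, value FROM samples WHERE value > 10" > top-samples.sql
cp top-samples.sql $HOME/XSEDE/data/labDB/samples/        # dropping a SQL file runs the query
ls $HOME/XSEDE/data/labDB/samples/                        # the result appears as a new sub-folder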

!!!Named pipes
Often two or more applications need to communicate. Traditionally, applications can communicate via files in the file system, e.g., application A writes file A_output and application B reads the file, or via message passing [4, 5] or sockets of some kind, e.g., opening a TCP connection to a well-known address and sending bytes down the channel. In Unix, pipes are also often used for programs started on the same machine. Unfortunately, in wide-area distributed systems many resources are behind NATs and firewalls, and simply opening a socket is not always an easy option.
To address this problem, the GFFS supports named pipes. Named pipes are analogous to their Unix counterparts; they are buffered streams of bytes. Named pipes appear in the namespace just as any other file, and have access control like any other file. As with Unix named pipes, GFFS named pipes may have many readers and writers, though the same caveats apply. Thus, an application can create a named pipe at a well-known location and then read from it, awaiting another application to write to it.
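
A sketch only, assuming a GFFS named pipe already exists at the hypothetical path /home/Sarah/pipes/results and that the FUSE mount at $HOME/XSEDE exposes it with ordinary pipe read/write semantics; two applications on different machines could then communicate with plain shell redirection:
cat $HOME/XSEDE/home/Sarah/pipes/results                          # reader: blocks until data arrives
echo "stage-1 complete" > $HOME/XSEDE/home/Sarah/pipes/results    # writer, run on another machine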

!!! An Aside on GFFS Goals and Non-Goals
The complexity of sharing resources between researchers creates a barrier to resource sharing and collaboration – an activation energy, if you like. Too often, the energy barrier is too high – and valuable collaborations that could lead to breakthrough science do not happen, or, if they do, take much longer and cost more.

One of the most common complaints about grid computing, and the national cyberinfrastructure more generally, is that it is not easy to use. We feel strongly that rather than have users adapt to the infrastructure, the infrastructure should adapt to users. In other words, the infrastructure must support interaction modalities and paradigms with which users are already familiar. Towards that end, simplicity and ease of use are critical.

When considering ease of use, the first and most important observation is that most scientists do not want to become computer hackers. They view the computer as a tool that they use every day for a wide variety of tasks: reading email, saving attachments, opening documents, cruising through the directory/folder structure looking for a file, and so on. Therefore, rather than have scientists learn a whole new paradigm to search for and access data, we believe the paradigm with which they are already familiar should be extended across organizational boundaries and to a wider variety of file types.

Therefore, the core, underlying goal of the GFFS is to empower science and engineering by lowering the barriers to carrying out computationally based research. Specifically, we believe that the mechanisms used must be easy to use and learn, must not require changes to existing infrastructures on campuses and at labs, and must support interactions between the centers and campuses, among campuses, and with other international infrastructures. We believe complexity is the major problem that must be addressed.

Ease of use is just one of many quality attributes a system such as the GFFS exhibits. Others are security, performance, availability, reliability, and so on. With respect to performance, we are often asked how GFFS performance compares to parallel file systems such as Lustre [6] or GPFS [7]. For us this is somewhat of a non sequitur. Competing with Lustre and GPFS is not a goal – the GFFS is not designed to be a high-performance parallel file system. It is designed to make it easy to federate across many different organizations and to make data easily accessible to users and applications.

!!!GFFS Implementation
The GFFS uses as its foundation standard protocols from the Open Grid Forum [8-20], OASIS [21-29], the W3C [30-32], and others [33]. As an open, standards-based system, any implementation can be used. The first realization of the GFFS at XSEDE uses the Genesis II implementation from the University of Virginia [34, 35]. Genesis II has been in continuous operation at the University of Virginia since 2007 in the Cross Campus Grid (XCG) [36]. In mid-2010, the XCG was extended to include FutureGrid resources at Indiana University, SDSC, and TACC.


!!! Client Side – Accessing Resources
By “client-side”, we mean the users of resources in the GFFS (the data clients in Figure 1) – for example, a visualization application that Sarah might run on her workstation to access files residing at an NSF service provider such as TACC.

Three mechanisms can be used to access data in the GFFS: a command line tool, a graphical user interface, and an operating-system-specific file system driver (http://genesisii.cs.virginia.edu/docs/Client-usage-v1.0.pdf). The first step in using any of the GFFS access mechanisms is to install the XSEDE Genesis II client. There are client installers for Windows, Linux, and MacOS (http://genesis2.virginia.edu/wiki/Main/Downloads). The installers work like most installers: you download the installer, double-click on it, and follow the directions. It is designed to be as easy to install as TurboTax®. Within two or three minutes, you will be up and ready to go.

On Linux and MacOS, we provide a GFFS-aware FUSE file system driver to map the global namespace into the local file system namespace. FUSE is a user space file system driver that requires no special permission to run. Thus, one does not have to be “root” to mount a FUSE device.
Once the client has been installed and the user is logged in, mounting the GFFS in Linux requires two simple steps: create a mount-point, and mount the file system as shown below.
mkdir XSEDE
nohup grid fuse --mount local:XSEDE &

Once mounted, the XSEDE directory can be used just like any other mounted file system.
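
For example (with purely illustrative paths and file names), ordinary tools and scripts now work directly against GFFS paths under the mount point:
ls XSEDE/home/Sarah                                     # browse the global namespace
cp XSEDE/home/Sarah/results/output.dat /tmp/            # copy a remote file with standard tools
grep -c "converged" XSEDE/home/Sarah/results/stage3.log # search a remote log in place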

!!! Sharing Resources

As discussed above, there are many resource types that can be shared: file system resources, storage resources, relational databases, compute clusters, and running jobs. We will keep our focus here on sharing data, specifically files and directories. For all of the below we will assume the GFFS client is already installed and that we have linked the GFFS into our Unix home directory at $HOME/XSEDE. We will further assume that our user Sarah has a directory in the GFFS at /home/Sarah. Given where the GFFS is mounted, the Unix path to that directory is $HOME/XSEDE/home/Sarah. We will also assume below, unless otherwise noted, that our current working directory is $HOME/XSEDE/home/Sarah.

Before we get to sharing local data resources, let’s first look at how to create a file or directory “somewhere in the GFFS”. Creating a file or directory in the GFFS is simple. For example, the script
mkdir test
echo "This is a test" >> test/newfile
creates a new file in the newly created “test” directory. Once created, both the directory and the file are available throughout the GFFS, subject to access control.

However, where is the data actually stored? The short answer is that the “test” directory will be created in the same place where the current working directory is located. Similarly, “newfile” will be placed in the same location as the “test” directory.

So, where is that? A bit of background here is useful. The GFFS uses a standards-based Web Services model. Most Web Services (including those that implement the GFFS) execute inside of a program called a Web Services container. A container is a program that accepts Web Services connections (in our case https connections), parses the request, and calls the appropriate function to handle the request. Web Services containers are often written in Java, and execute on Windows, Linux, or MacOS machines like any other application. The difference is that they listen for http/https connections and respond to them.

In the GFFS, files and directories are stored in different GFFS Web Service containers (just “containers” from here on). There are GFFS containers at the NSF service providers, and there are containers wherever someone wants to share a resource. Therefore, the first step to sharing a resource is to install the GFFS container. The installation process is very similar to installing the client if one chooses to use only the default options. It can be more complicated if, for example, your resource is behind a NAT or firewall. The GFFS container requires no special permissions or privileges, though it is recommended that it be run as its own user with minimum privileges.

To get back to our earlier question: where is the data actually stored? The new directory and the new file will be placed on the same container as Sarah’s GFFS home directory. In a moment, we will see how to override this. For now, however, we will assume that is where they will be placed.

Given that the new items will be placed on the same GFFS container as Sarah’s home directory, exactly where will they be placed? There are two possibilities. Either Sarah’s home directory is stored by a container in the container’s own databases and storage space (in other words, the container is acting as a storage service), or her home directory is an export, in other words it is stored in a local file system somewhere. Below we briefly examine each of these options. Keep in mind that either way a GFFS container is providing access to the data.

!!!File System Resources – a.k.a. exports
An export takes the specified rooted directory tree, maps it into the global namespace, and thus provides a means for non-local users to access data in the directory via the GFFS. Local access to the exported directory is unaffected; existing scripts, cron jobs, and applications can continue to access the data. Once a GFFS container is running that can “see” the directory to be exported, it is quite simple to share data. For example, Sarah could share out her development sources using the simple command
grid export /containers/Big-State-U/Sarah-server /development/sources /home/Sarah/dev


This exports, from the machine “Sarah-server”, the directory tree rooted at “/development/sources”, and links it into the global namespace at the path “/home/Sarah/dev”. Once exported, the data is accessible (subject to access control) until the export is terminated. The net result is that a user can decide to securely share out a particular directory structure with colleagues anywhere with a network connection, and those collaborators can subsequently access it with no effort.
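
For instance – a sketch with illustrative sub-directory and file names, assuming Bob has been granted read access and has the GFFS mounted at his own $HOME/XSEDE – Bob could work against the export directly:
ls $HOME/XSEDE/home/Sarah/dev                        # browse the exported source tree
make -C $HOME/XSEDE/home/Sarah/dev/pipeline          # build directly against the exported sources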

!!!Storage Resources
Earlier we said that the new files and directories could be stored in the container’s own databases and file system resources. What exactly does that mean and how can we use it?

!!!Compute Resources
Compute resources such as clusters, parallel machines, and desktop compute resources can be shared in a similar manner. For example, to create an OGSA-BES resource that proxies a PBS queue, the command
create-resource /containers/UVA/CS/camillus/ /home/grimshaw/testPBS-resource
creates a new OGSA-BES resource on the host camillus in the Computer Science department at UVA. It places a link to the new resource (i.e., how it can be accessed in the future) in the GFFS directory /home/grimshaw. The new resource can be passed a configuration file that tells it what it needs to know about the local queuing system: in this case, that it is a PBS queue, where shared scratch space is to be found, and so forth. Access control is now at the user’s discretion.
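
As a hedged illustration (the JSDL file name and the mount point are hypothetical), the new resource then behaves like the compute-resource folders described earlier: with the GFFS mounted, a job can be started by copying a JSDL job description into it.
cp myjob.jsdl $HOME/XSEDE/home/grimshaw/testPBS-resource/    # logically starts a job on the proxied PBS queue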