<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>

<!-- Copyright (C) The IETF Trust (2014) -->
<!-- Copyright (C) The Internet Society (2014) -->

<!-- XML source for the Requirement Wars internet draft document -->

<!-- To generate text with the xml2rfc tool, run
       tclsh8.3 xml2rfc.tcl xml2rfc this_file.xml that_file.txt
     which puts the formatted text into that_file.txt -->

<!-- processing instructions (for a complete list and description,
     see file http://xml.resource.org/authoring/README.html) -->

<!-- try to enforce the ID-nits conventions and DTD validity -->

<?rfc strict="yes" ?>

<!-- items used when reviewing the document -->

<?rfc comments="yes" ?>  <!-- controls display of <cref> elements -->
<?rfc inline="yes" ?>    <!-- when no, put comments at end in comments section,
                                otherwise, put inline -->
<?rfc editing="no" ?>   <!-- when yes, insert editing marks -->

<!-- create table of contents (set its options).
     Note the table of contents may be omitted
     for very short documents --> 

<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>

<!-- choose the options for the references. Some like
     symbolic tags in the references (and citations)
     and others prefer numbers. --> 

<?rfc symrefs="yes"?>
<?rfc sortrefs="yes" ?>

<!-- these two save paper: start new paragraphs from the same page etc. -->

<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<!-- end of list of processing instructions -->

<rfc
    category="std"
    ipr="trust200902"
    docName="draft-ietf-nfsv4-scsi-layout-06.txt" >


<front>
    <title abbrev="pNFS SCSI Layout">
      Parallel NFS (pNFS) SCSI Layout
    </title>

    <author fullname="Christoph Hellwig"
            initials="C."
            surname="Hellwig">
      <address>
        <email>hch@lst.de</email>
      </address>
    </author>

    <date year="2016" month="June" day="27"/>

    <area>Transport</area>
    <workgroup>NFSv4</workgroup>
    <keyword>NFSv4</keyword>

    <abstract>
      <t>
        The Parallel Network File System (pNFS) allows a separation between
        the metadata (onto a metadata server) and data (onto a storage device)
        for a file.  The SCSI Layout Type is defined in this document as an
        extension to pNFS to allow the use of SCSI-based block storage devices.
      </t>
    </abstract>
</front>
<middle>

<section anchor="sec:intro" title="Introduction">
  <t>
    <xref target="fig:pnfs_system" /> shows the overall
    architecture of a Parallel NFS (pNFS) system:
  </t>

  <figure anchor="fig:pnfs_system">
    <artwork>

    +-----------+
    |+-----------+                                 +-----------+
    ||+-----------+                                |           |
    |||           |       NFSv4.1 + pNFS           |           |
    +||  Clients  |&lt;------------------------------&gt;|   Server  |
     +|           |                                |           |
      +-----------+                                |           |
           |||                                     +-----------+
           |||                                           |
           |||                                           |
           ||| Storage        +-----------+              |
           ||| Protocol       |+-----------+             |
           ||+----------------||+-----------+  Control   |
           |+-----------------|||           |    Protocol|
           +------------------+||  Storage  |------------+
                               +|  Systems  |
                                +-----------+
    </artwork>
  </figure>

  <t>
    The overall approach is that pNFS-enhanced clients obtain
    sufficient information from the server to enable them to access
    the underlying storage (on the storage systems) directly.  See
    Section 12 of <xref target="RFC5661" /> for more details.
    This document is concerned with access from pNFS clients to
    storage devices over block storage protocols based on
    the SCSI Architecture Model (<xref target="SAM-4" />),
    e.g., Fibre Channel Protocol (FCP) for Fibre Channel, Internet SCSI
    (iSCSI), or Serial Attached SCSI (SAS).  The pNFS SCSI layout requires
    block-based SCSI command sets, for example, SCSI Block Commands
    (<xref target="SBC3" />).  While SCSI command sets for non-block-based
    access exist, these are not supported by the SCSI layout type, and
    all future references to SCSI storage devices will imply a
    block-based SCSI command set.
  </t>
  <t>
    The Server to Storage System protocol, called the "Control Protocol",
    is not of concern for interoperability, although it will typically be
    the same SCSI based storage protocol.
  </t>
  <t>
    This document is based on <xref target='RFC5663' /> and makes changes to
    the block layout type to provide a better pNFS layout protocol for
    SCSI based storage devices. Despite these changes,
    <xref target='RFC5663' /> remains the defining document for the existing
    block layout type. <xref target='RFC6688' /> is unnecessary in the context
    of the SCSI layout type because the new layout type provides mandatory
    disk access protection as part of the layout type definition.  In contrast
    to <xref target='RFC5663' />, this document uses SCSI protocol features
    to provide reliable fencing by using SCSI Persistent Reservations, and it
    can provide reliable and efficient device discovery
    by using SCSI device identifiers instead of having to rely on probing all
    devices potentially attached to a client for a signature.  This new layout
    type also optimizes the I/O path by reducing the size of the LAYOUTCOMMIT
    payload.
  </t>

  <section anchor="ssc:intro:conv" title="Conventions Used in This Document">
    <t>
      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref target="RFC2119" />.
    </t>
  </section>

  <section anchor="ssc:intro:defs" title="General Definitions">
    <t>
      The following definitions provide an appropriate context for the
      reader.
    </t>

    <t>
      <list style='hanging'>
        <t hangText="Byte">
          This document defines a byte as an octet, i.e., a datum exactly 8
          bits in length.
        </t>

        <t hangText="Client">
          The "client" is the entity that accesses the NFS server's
          resources.  The client may be an application that contains the
          logic to access the NFS server directly.  The client may also be
          the traditional operating system client that provides remote file
          system services for a set of applications.
        </t>

        <t hangText="Server">
          The "server" is the entity responsible for coordinating client
          access to a set of file systems and is identified by a server
          owner.
        </t>
      </list>
    </t>
  </section>

  <section anchor="ssc:intro:code" title="Code Components Licensing Notice">
    <t>
      The external data representation (XDR) description and scripts
      for extracting the XDR description are Code Components as
      described in Section 4 of <xref target="LEGAL">"Legal Provisions
      Relating to IETF Documents"</xref>.  These Code Components are
      licensed according to the terms of Section 4 of "Legal Provisions
      Relating to IETF Documents".
    </t>
  </section>

  <section anchor="ssc:intro:xdr" title="XDR Description">
    <t>
      This document contains the XDR <xref target='RFC4506' /> description
      of the NFSv4.1 SCSI layout protocol.  The XDR description is
      embedded in this document in a way that makes it simple for the
      reader to extract into a ready-to-compile form.  The reader can
      feed this document into the following shell script to produce
      the machine readable XDR description of the NFSv4.1 SCSI layout:
    </t>

    <figure>
      <artwork>
#!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
      </artwork>
    </figure>

    <t>
      That is, if the above script is stored in a file called "extract.sh", and
      this document is in a file called "spec.txt", then the reader can do:
    </t>

    <figure>
      <artwork>
sh extract.sh &lt; spec.txt &gt; scsi_prot.x
      </artwork>
    </figure>

    <t>
      The effect of the script is to remove leading white space from each
      line, plus a sentinel sequence of "///".
    </t>
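
    <t>
      As a non-normative illustration, the same extraction can be sketched
      in Python.  The function name "extract_xdr" and the sample input are
      hypothetical and are not part of this specification; the shell script
      above remains the reference.
    </t>

    <figure>
      <artwork><![CDATA[
```python
import re

# Hypothetical helper mirroring the grep/sed pipeline shown above.
SENTINEL = re.compile(r"^ *///")

def extract_xdr(lines):
    """Return the XDR lines embedded in an iterable of text lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not SENTINEL.match(line):
            continue  # grep: keep only lines carrying the sentinel
        # first sed: drop leading blanks, the sentinel, and one space
        line = re.sub(r"^ */// ", "", line, count=1)
        # second sed: a line holding only the sentinel becomes empty
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return out

# Made-up input for illustration only.
sample = [
    "Some prose that is not XDR.",
    "   /// struct example4 {",
    "   ///     uint32_t ex_value;",
    "   /// };",
    "   ///",
]
print("\n".join(extract_xdr(sample)))
```
]]></artwork>
    </figure>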

    <t>
      The embedded XDR file header follows.
      Subsequent XDR descriptions, with the sentinel sequence, are
      embedded throughout the document.
    </t>

    <t>
      Note that the XDR code contained in this document depends on
      types from the NFSv4.1 nfs4_prot.x file <xref target='RFC5662' />.
      This includes both NFS types that end with a 4, such as
      offset4, length4, etc., as well as more generic types such as
      uint32_t and uint64_t.
    </t>

    <figure>
      <artwork>
   /// /*
   ///  * This code was derived from RFCTBD10
   ///  * Please reproduce this note if possible.
   ///  */
   /// /*
   ///  * Copyright (c) 2010,2015 IETF Trust and the persons
   ///  * identified as the document authors.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * - Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * - Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * - Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  */
   ///
   /// /*
   ///  *      nfs4_scsi_layout_prot.x
   ///  */
   ///
   /// %#include "nfsv41.h"
   ///
      </artwork>
    </figure>
  </section>
</section>

<section anchor='sec:sld' title='SCSI Layout Description'>
  <section anchor='ssc:back' title='Background and Architecture'>
    <t>
      The fundamental storage model supported by SCSI storage devices
      is a Logical Unit (LU) consisting of a sequential series of fixed-size
      blocks. Logical units used as devices for NFS SCSI layouts,
      and the SCSI initiators used for the pNFS Metadata Server and clients,
      MUST support SCSI persistent reservations.
    </t>

    <t>
      A pNFS layout for this SCSI class of storage is responsible
      for mapping from an NFS file (or portion of a file) to the blocks of
      storage volumes that contain the file.  The blocks are expressed as
      extents with 64-bit offsets and lengths using the existing NFSv4
      offset4 and length4 types.  Clients MUST be able to perform I/O to
      the block extents without affecting additional areas of storage
      (especially important for writes); therefore, extents MUST be aligned
      to 512-byte boundaries.
    </t>
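
    <t>
      A minimal, non-normative sketch of the alignment rule follows.  It
      assumes that the rule applies to the file offset, the length, and the
      storage offset of an extent alike; the function name is hypothetical.
    </t>

    <figure>
      <artwork><![CDATA[
```python
SECTOR = 512  # alignment required by the text above

def is_aligned_extent(file_offset, length, storage_offset):
    """True when every extent boundary is a multiple of 512 bytes."""
    values = (file_offset, length, storage_offset)
    return all(v % SECTOR == 0 for v in values)

print(is_aligned_extent(4096, 8192, 1048576))
print(is_aligned_extent(4096, 1000, 1048576))
```
]]></artwork>
    </figure>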

    <t>
      The pNFS operation for requesting a layout (LAYOUTGET) includes the
      "layoutiomode4 loga_iomode" argument, which indicates whether the
      requested layout is for read-only use or read-write use.  A read-only
      layout may contain holes that are read as zero, whereas a read-write
      layout will contain allocated but uninitialized storage in those
      holes (read as zero, can be written by client).  This document also
      supports client participation in copy-on-write (e.g., for file
      systems with snapshots) by providing both read-only and
      uninitialized storage for the same range in a layout.  Reads are
      initially performed on the read-only storage, with writes going to
      the uninitialized storage.  After the first write that initializes
      the uninitialized storage, all reads are performed to that
      now-initialized writable storage, and the corresponding read-only
      storage is no longer used.
    </t>

    <t>
      The SCSI layout solution expands the security responsibilities of the
      pNFS clients, and there are a number of environments where the
      mandatory-to-implement security properties for NFS cannot be
      satisfied.  The
      additional security responsibilities of the client follow, and a full
      discussion is present in <xref target='sec:security' />,
      "Security Considerations".
    </t>

    <t>
      <list style='symbols'>
        <t>
          Typically, SCSI storage devices provide access control mechanisms
	  (e.g., Logical Unit Number (LUN) mapping and/or masking), which
	  operate at the granularity of individual hosts, not individual
	  blocks.  For this reason, block-based protection must be provided
	  by the client software.
        </t>

        <t>
          Similarly, SCSI storage devices typically are not able to validate
	  NFS locks that apply to file regions.  For instance, if a file is
	  covered by a mandatory read-only lock, the server can ensure that
	  only readable layouts for the file are granted to pNFS clients.
	  However, it is up to each pNFS client to ensure that the readable
	  layout is used only to service read requests, and not to allow
	  writes to the existing parts of the file.
        </t>
      </list>
    </t>

    <t>
      Since SCSI storage devices are generally not capable of
      enforcing such file-based security, in environments where pNFS
      clients cannot be trusted to enforce such policies, pNFS SCSI
      layouts SHOULD NOT be used.
    </t>
  </section>

  <section anchor='ssc:xdr' title='layouttype4'>
    <t>
      The layouttype4 type defined in <xref target="RFC5662" />
      is extended with a new value as follows:
    </t>

    <figure>
      <artwork>
    enum layouttype4 {
        LAYOUT4_NFSV4_1_FILES   = 1,
        LAYOUT4_OSD2_OBJECTS    = 2,
        LAYOUT4_BLOCK_VOLUME    = 3,
        LAYOUT4_SCSI            = 0x80000005
[[RFC Editor: please modify the LAYOUT4_SCSI
  to be the layouttype assigned by IANA]]
    };
      </artwork>
    </figure>

    <t>
      This document defines the structure associated with the layouttype4
      value LAYOUT4_SCSI.  <xref target="RFC5661" /> specifies
      the loc_body structure as an XDR type "opaque".  The opaque
      layout is uninterpreted by the generic pNFS client layers, but
      obviously must be interpreted by the Layout Type implementation.
    </t>
  </section>

  <section anchor='ssc:gets' title='GETDEVICEINFO'>
    <section anchor='ssc:volident' title='Volume Identification'>
      <t>
        SCSI targets implementing <xref target="SPC4" /> export unique LU
	names for each LU through the Device Identification VPD page (page code
	0x83), which can be obtained using the INQUIRY command with the EVPD
	bit set to one. This document uses a subset of this information to
	identify LUs backing pNFS SCSI layouts.  It is similar to the
	"Identification Descriptor Target Descriptor" specified in
	<xref target="SPC4" />, but limits the allowed values to those that
        uniquely identify an LU.  Device Identification VPD page descriptors
	used to identify LUs for use with pNFS SCSI layouts must adhere to
	the following restrictions:
	<list style='numbers'>
	  <t>The "ASSOCIATION" MUST be set to 0 (The DESIGNATOR field is
	     associated with the addressed logical unit).</t>
	  <t>The "DESIGNATOR TYPE" MUST be set to one of four values
	     that are required for the mandatory logical unit name in
	     section 7.7.3 of <xref target="SPC4" />, as explicitly listed
	     in the "pnfs_scsi_designator_type" enumeration:
	     <list style='hanging'>
		<t hangText='PS_DESIGNATOR_T10'>
		   T10 vendor ID based</t>
		<t hangText='PS_DESIGNATOR_EUI64'>
		   EUI-64-based</t>
		<t hangText='PS_DESIGNATOR_NAA'>
		   NAA</t>
		<t hangText='PS_DESIGNATOR_NAME'>
		   SCSI name string</t>
	     </list>
	     Any other association or designator type MUST NOT be used.
	     Use of T10 vendor IDs is discouraged when one of the other types
	     can be used.
	     </t>
	</list>
        
	The "CODE SET" VPD page field is stored in the "sbv_code_set" field of
	the "pnfs_scsi_base_volume_info4" structure, the "DESIGNATOR TYPE" is
	stored in "sbv_designator_type", and the DESIGNATOR is stored in
        "sbv_designator".  Due to the use of an XDR array, the
        "DESIGNATOR LENGTH" field does not need to be set separately.
        Only certain combinations of "sbv_code_set" and
        "sbv_designator_type" are valid; please refer to
        <xref target="SPC4" /> for details, and note that ASCII may be used
	as the code set for UTF-8 text that contains only printable
	ASCII characters.

	Note that a Device Identification VPD page MAY contain multiple
	descriptors with the same association, code set and designator type.
	NFS clients thus MUST check all the descriptors for a possible match
	to "sbv_code_set", "sbv_designator_type" and "sbv_designator".
      </t>
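
      <t>
        The matching rule above can be sketched, non-normatively, as
        follows.  The "Descriptor" tuple, the "lu_matches" function, and the
        sample designator values are hypothetical; the sketch assumes that
        the Device Identification VPD page has already been parsed into
        individual descriptors.
      </t>

      <figure>
        <artwork><![CDATA[
```python
from collections import namedtuple

# Hypothetical in-memory form of one Device Identification VPD page
# descriptor, already parsed out of the INQUIRY response.
Descriptor = namedtuple(
    "Descriptor", "association code_set designator_type designator")

# Values of the pnfs_scsi_designator_type enumeration:
# T10 vendor ID, EUI-64, NAA, SCSI name string.
ALLOWED_TYPES = {1, 2, 3, 8}

def lu_matches(descriptors, sbv_code_set, sbv_designator_type,
               sbv_designator):
    """Check every descriptor; a page may hold several candidates."""
    if sbv_designator_type not in ALLOWED_TYPES:
        return False
    for d in descriptors:
        if d.association != 0:
            continue  # not associated with the addressed logical unit
        if (d.code_set == sbv_code_set
                and d.designator_type == sbv_designator_type
                and d.designator == sbv_designator):
            return True
    return False

# Made-up page contents for illustration only.
page = [
    Descriptor(1, 1, 3, bytes.fromhex("5001405123456789")),
    Descriptor(0, 1, 2, bytes.fromhex("0014051234567890")),
    Descriptor(0, 1, 3, bytes.fromhex("6001405123456780")),
]
wanted = bytes.fromhex("6001405123456780")
print(lu_matches(page, 1, 3, wanted))
```
]]></artwork>
      </figure>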

      <t>
        Storage devices such as storage arrays can have multiple physical
	network ports that need not be connected to a common network,
	resulting in a pNFS client having simultaneous multipath access to
	the same storage volumes via different ports on different networks.
	Selection of one or multiple ports to access the storage device
	is left up to the client.
      </t>

      <t>
	Additionally the server returns a Persistent Reservation key in
	the "sbv_pr_key" field.  See <xref target="ssc:fencing" /> for more
	details on the use of Persistent Reservations.
      </t>
    </section>

    <section anchor='ssc:voltopo' title='Volume Topology'>
      <t>
        The pNFS SCSI layout volume topology is expressed in terms of the
	volume types described below.  The individual components of the
	topology are contained in an array and components may refer to
	other components by using array indices.
      </t>

      <figure>
        <artwork>
 /// enum pnfs_scsi_volume_type4 {
 ///     PNFS_SCSI_VOLUME_SLICE  = 1,  /* volume is a slice of
 ///                                      another volume */
 ///     PNFS_SCSI_VOLUME_CONCAT = 2,  /* volume is a
 ///                                      concatenation of
 ///                                      multiple volumes */
 ///     PNFS_SCSI_VOLUME_STRIPE = 3,  /* volume is striped across
 ///                                      multiple volumes */
 ///     PNFS_SCSI_VOLUME_BASE   = 4   /* volume maps to a single
 ///                                      LU */
 /// };
 ///
        </artwork>
      </figure>

    <figure>
      <artwork>
 /// /*
 ///  * Code sets from SPC-4.
 ///  */
 /// enum pnfs_scsi_code_set {
 ///     PS_CODE_SET_BINARY     = 1,
 ///     PS_CODE_SET_ASCII      = 2,
 ///     PS_CODE_SET_UTF8       = 3
 /// };
 ///
 /// /*
 ///  * Designator types taken from SPC-4.
 ///  *
 ///  * Other values are allocated in SPC-4, but not mandatory to
 ///  * implement or aren't Logical Unit names.
 ///  */
 /// enum pnfs_scsi_designator_type {
 ///     PS_DESIGNATOR_T10      = 1,
 ///     PS_DESIGNATOR_EUI64    = 2,
 ///     PS_DESIGNATOR_NAA      = 3,
 ///     PS_DESIGNATOR_NAME     = 8
 /// };
 ///
 /// /*
 ///  * Logical Unit name + reservation key.
 ///  */
 /// struct pnfs_scsi_base_volume_info4 {
 ///     pnfs_scsi_code_set             sbv_code_set;
 ///     pnfs_scsi_designator_type      sbv_designator_type;
 ///     opaque                         sbv_designator&lt;&gt;;
 ///     uint64_t                       sbv_pr_key;
 /// };
 ///
      </artwork>
    </figure>

      <figure>
        <artwork>
 /// struct pnfs_scsi_slice_volume_info4 {
 ///     offset4  ssv_start;            /* offset of the start of
 ///                                       the slice in bytes */
 ///     length4  ssv_length;           /* length of slice in
 ///                                       bytes */
 ///     uint32_t ssv_volume;           /* array index of sliced
 ///                                       volume */
 /// };
 ///
        </artwork>
      </figure>

      <figure>
        <artwork>
 ///
 /// struct pnfs_scsi_concat_volume_info4 {
 ///     uint32_t  scv_volumes&lt;&gt;;       /* array indices of volumes
 ///                                       which are concatenated */
 /// };
        </artwork>
      </figure>

      <figure>
        <artwork>
 ///
 /// struct pnfs_scsi_stripe_volume_info4 {
 ///     length4  ssv_stripe_unit;      /* size of stripe in bytes */
 ///     uint32_t ssv_volumes&lt;&gt;;        /* array indices of
 ///                                       volumes which are striped
 ///                                       across -- MUST be same
 ///                                       size */
 /// };
        </artwork>
      </figure>

      <figure>
        <artwork>
 ///
 /// union pnfs_scsi_volume4 switch (pnfs_scsi_volume_type4 type) {
 ///     case PNFS_SCSI_VOLUME_BASE:
 ///         pnfs_scsi_base_volume_info4 sv_simple_info;
 ///     case PNFS_SCSI_VOLUME_SLICE:
 ///         pnfs_scsi_slice_volume_info4 sv_slice_info;
 ///     case PNFS_SCSI_VOLUME_CONCAT:
 ///         pnfs_scsi_concat_volume_info4 sv_concat_info;
 ///     case PNFS_SCSI_VOLUME_STRIPE:
 ///         pnfs_scsi_stripe_volume_info4 sv_stripe_info;
 /// };
 ///
        </artwork>
      </figure>

      <figure>
        <artwork>
 /// /* SCSI layout-specific type for da_addr_body */
 /// struct pnfs_scsi_deviceaddr4 {
 ///     pnfs_scsi_volume4 sda_volumes&lt;&gt;; /* array of volumes */
 /// };
 ///
        </artwork>
      </figure>

      <t>
        The "pnfs_scsi_deviceaddr4" data structure allows arbitrarily
        complex nested volume structures to be encoded.
        The types of aggregations that are allowed are stripes,
        concatenations, and slices.  Note that the volume topology expressed
        in the pnfs_scsi_deviceaddr4 data structure will always resolve to a
        set of pnfs_scsi_volume_type4 PNFS_SCSI_VOLUME_BASE.  The array
        of volumes is ordered such that the root of the volume hierarchy is
        the last element of the array.  Concat, slice, and stripe volumes
        MUST refer to volumes defined by lower indexed elements of the array.
      </t>
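
      <t>
        The following non-normative sketch shows one way a client might
        walk such an array to turn a volume-relative offset into an offset
        on a base volume.  The tuple-and-dictionary representation, the
        function names, and the "lu_sizes" table of LU capacities are
        hypothetical.  The stripe arithmetic assumes conventional
        round-robin placement of stripe units, which this section does not
        spell out.
      </t>

      <figure>
        <artwork><![CDATA[
```python
def children_of(kind, info):
    """Array indices of the volumes a component is built from."""
    if kind == "base":
        return []
    if kind == "slice":
        return [info["volume"]]
    return info["volumes"]  # concat or stripe

def volume_size(vols, idx, lu_sizes):
    """Size in bytes of vols[idx]; lu_sizes gives each base LU size."""
    kind, info = vols[idx]
    if kind == "base":
        return lu_sizes[idx]
    if kind == "slice":
        return info["length"]
    sizes = [volume_size(vols, c, lu_sizes) for c in info["volumes"]]
    return sum(sizes)  # concat, or stripe over equal-sized members

def resolve(vols, idx, offset, lu_sizes):
    """Map an offset in vols[idx] to (base volume index, LU offset)."""
    kind, info = vols[idx]
    kids = children_of(kind, info)
    if any(c >= idx for c in kids):
        raise ValueError("component refers to a non-lower index")
    if kind == "base":
        return idx, offset
    if kind == "slice":
        start = info["start"]
        return resolve(vols, kids[0], start + offset, lu_sizes)
    if kind == "concat":
        for c in kids:
            size = volume_size(vols, c, lu_sizes)
            if offset < size:
                return resolve(vols, c, offset, lu_sizes)
            offset -= size
        raise ValueError("offset beyond end of concatenation")
    if kind == "stripe":
        # Assumed round-robin placement of stripe units.
        stripe_no, within = divmod(offset, info["stripe_unit"])
        member = kids[stripe_no % len(kids)]
        row = stripe_no // len(kids)
        member_off = row * info["stripe_unit"] + within
        return resolve(vols, member, member_off, lu_sizes)
    raise ValueError("unknown volume type")

# Hypothetical topology: two LUs striped, then sliced; root is last.
vols = [
    ("base", {}),
    ("base", {}),
    ("stripe", {"stripe_unit": 65536, "volumes": [0, 1]}),
    ("slice", {"start": 131072, "length": 1048576, "volume": 2}),
]
lu_sizes = {0: 2**30, 1: 2**30}
print(resolve(vols, len(vols) - 1, 10000, lu_sizes))
```
]]></artwork>
      </figure>

      <t>
        In this example, offset 10000 of the root slice lands in the third
        stripe unit of the striped volume, which the assumed round-robin
        placement assigns to the first base volume at offset 75536.
      </t>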

      <t>
        The "pnfs_scsi_deviceaddr4" data structure is returned by the
        server as the storage-protocol-specific opaque field da_addr_body in
        the "device_addr4" structure by a successful GETDEVICEINFO operation
        <xref target='RFC5661' />.
      </t>

      <t>
        As noted above, all device_addr4 structures eventually resolve to a
        set of volumes of type PNFS_SCSI_VOLUME_BASE.
        Complicated volume hierarchies may be composed of dozens of volumes,
        each with several components; thus, the device address may
        require several kilobytes.  The client SHOULD be prepared to allocate
        a large buffer to contain the result.  In the case of the server
        returning NFS4ERR_TOOSMALL, the client SHOULD allocate a buffer of at
        least gdir_mincount_bytes to contain the expected result and retry
        the GETDEVICEINFO request.
      </t>
    </section>
  </section>

  <section anchor='ssc:extents' title='Data Structures: Extents and Extent Lists'>
    <t>
      A pNFS SCSI layout is a list of extents within a flat array of data
      blocks in a volume.  The details of the volume topology can
      be determined by using the GETDEVICEINFO operation.  The SCSI layout
      describes the individual block extents on the volume that make up the
      file.  The offsets and length contained in an extent are specified in
      units of bytes.
    </t>

    <figure>
      <artwork>
 /// enum pnfs_scsi_extent_state4 {
 ///     PNFS_SCSI_READ_WRITE_DATA = 0, /* the data located by
 ///                                       this extent is valid
 ///                                       for reading and
 ///                                       writing. */
 ///     PNFS_SCSI_READ_DATA      = 1,  /* the data located by this
 ///                                       extent is valid for
 ///                                       reading only; it may not
 ///                                       be written. */
 ///     PNFS_SCSI_INVALID_DATA   = 2,  /* the location is valid; the
 ///                                       data is invalid.  It is a
 ///                                       newly (pre-) allocated
 ///                                       extent.  The client MUST
 ///                                       NOT read from this
 ///                                       space */
 ///     PNFS_SCSI_NONE_DATA      = 3   /* the location is invalid.
 ///                                       It is a hole in the file.
 ///                                       The client MUST NOT read
 ///                                       from or write to this
 ///                                       space */
 /// };
      </artwork>
    </figure>

    <figure>
      <artwork>
 ///
 /// struct pnfs_scsi_extent4 {
 ///     deviceid4    se_vol_id;         /* id of the volume on
 ///                                        which extent of file is
 ///                                        stored. */
 ///     offset4      se_file_offset;    /* starting byte offset
 ///                                        in the file */
 ///     length4      se_length;         /* size in bytes of the
 ///                                        extent */
 ///     offset4      se_storage_offset; /* starting byte offset
 ///                                        in the volume */
 ///     pnfs_scsi_extent_state4 se_state;
 ///                                     /* state of this extent */
 /// };
 ///
      </artwork>
    </figure>

    <figure>
      <artwork>
 /// /* SCSI layout-specific type for loc_body */
 /// struct pnfs_scsi_layout4 {
 ///     pnfs_scsi_extent4 sl_extents&lt;&gt;;
 ///                                    /* extents which make up this
 ///                                       layout. */
 /// };
 ///
      </artwork>
    </figure>

    <t>
      The SCSI layout consists of a list of extents that map the regions
      of the file to locations on a volume.  The "se_storage_offset" field
      within each extent identifies a location on the volume specified by
      the "se_vol_id" field in the extent.
      The se_vol_id itself is shorthand for the whole topology of the
      volume on which the file is stored.  The client is responsible for
      translating this volume-relative offset into an offset on the
      appropriate underlying SCSI LU.
    </t>
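
    <t>
      As a non-normative sketch, and assuming that the mapping within a
      single extent is linear, the volume-relative offset for a given file
      offset can be computed as shown below.  The "Extent" tuple and the
      "volume_offset" function are hypothetical, and the result is only
      meaningful for extents whose se_storage_offset is valid.
    </t>

    <figure>
      <artwork><![CDATA[
```python
from collections import namedtuple

# Hypothetical in-memory form of a pnfs_scsi_extent4.
Extent = namedtuple(
    "Extent",
    "se_vol_id se_file_offset se_length se_storage_offset se_state")

def volume_offset(extent, file_offset):
    """Volume-relative offset backing file_offset in this extent."""
    delta = file_offset - extent.se_file_offset
    if not 0 <= delta < extent.se_length:
        raise ValueError("file offset is outside this extent")
    return extent.se_storage_offset + delta

# Made-up extent: 64 KiB of file data at file offset 1 MiB,
# stored 8 MiB into the volume, in state PNFS_SCSI_READ_WRITE_DATA.
ext = Extent(bytes(16), 1048576, 65536, 8388608, 0)
print(volume_offset(ext, 1048576 + 512))
```
]]></artwork>
    </figure>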

    <t>
      Each extent maps a region of the file onto a portion of the
      specified LU.  The se_file_offset, se_length, and se_state fields for
      an extent returned from the server are valid for all extents.  In
      contrast, the interpretation of the se_storage_offset field depends on
      the value of se_state as follows (in increasing order):
    </t>

    <t>
      <list style='hanging'>
        <t hangText='PNFS_SCSI_READ_WRITE_DATA'>
          means that se_storage_offset is valid, and points to
	  valid/initialized data that can be read and written.
        </t>

        <t hangText='PNFS_SCSI_READ_DATA'>
          means that se_storage_offset is valid and points to valid/initialized
	  data that can only be read.  Write operations are prohibited; the
	  client may need to request a read-write layout.
        </t>

        <t hangText='PNFS_SCSI_INVALID_DATA'>
          means that se_storage_offset is valid, but points to invalid,
          uninitialized data.  This data must not be read from the
	  disk until it has been initialized.  A read request for a
	  PNFS_SCSI_INVALID_DATA extent must fill the user buffer with zeros,
	  unless the extent is covered by a PNFS_SCSI_READ_DATA extent of a
	  copy-on-write file system.  Write requests must write whole
	  server-sized blocks to the disk; bytes not initialized by the user
	  must be set to zero.  Any write to storage in a
	  PNFS_SCSI_INVALID_DATA extent changes the written portion of the
	  extent to PNFS_SCSI_READ_WRITE_DATA; the pNFS client is responsible
	  for reporting this change via LAYOUTCOMMIT.
        </t>

        <t hangText='PNFS_SCSI_NONE_DATA'>
          means that se_storage_offset is not valid, and this extent may not
	  be used to satisfy write requests.  Read requests may be satisfied
	  by zero-filling as for PNFS_SCSI_INVALID_DATA.  PNFS_SCSI_NONE_DATA
	  extents may be returned by requests for readable extents; they are
	  never returned if the request was for a writable extent.
        </t>
      </list>
    </t>

    <t>
      An extent list contains all relevant extents in increasing order of
      the se_file_offset of each extent; any ties are broken by increasing
      order of the extent state (se_state).
    </t>

    <section anchor='ssc:layouts' title='Layout Requests and Extent Lists'>
      <t>
        Each request for a layout specifies at least three parameters: file
        offset, desired size, and minimum size.  If the status of a request
        indicates success, the extent list returned must meet the following
        criteria:
      </t>

      <t>
        <list style='symbols'>
          <t>
            A request for a readable (but not writable) layout returns only
            PNFS_SCSI_READ_DATA or PNFS_SCSI_NONE_DATA extents (but not
            PNFS_SCSI_INVALID_DATA or PNFS_SCSI_READ_WRITE_DATA extents).
          </t>

          <t>
            A request for a writable layout returns PNFS_SCSI_READ_WRITE_DATA
            or PNFS_SCSI_INVALID_DATA extents (but not PNFS_SCSI_NONE_DATA
            extents).  It may also return PNFS_SCSI_READ_DATA extents only
            when the offset ranges in those extents are also covered by
            PNFS_SCSI_INVALID_DATA extents to permit writes.
          </t>

          <t>
            The first extent in the list MUST contain the requested starting
            offset.
          </t>

          <t>
            The total size of extents within the requested range MUST cover at
            least the minimum size.  One exception is allowed: the total size
            MAY be smaller if only readable extents were requested and EOF is
            encountered.
          </t>

          <t>
            Extents in the extent list MUST be logically contiguous for a
            read-only layout.  For a read-write layout, the set of writable
            extents (i.e., excluding PNFS_SCSI_READ_DATA extents) MUST be
            logically contiguous.  Every PNFS_SCSI_READ_DATA extent in a
            read-write layout MUST be covered by one or more
            PNFS_SCSI_INVALID_DATA extents.  This overlap of
            PNFS_SCSI_READ_DATA and PNFS_SCSI_INVALID_DATA extents is the
            only permitted extent overlap.
          </t>

          <t>
            Extents MUST be ordered in the list by starting offset, with
            PNFS_SCSI_READ_DATA extents preceding PNFS_SCSI_INVALID_DATA
            extents in the case of equal se_file_offsets.
          </t>
        </list>
      </t>
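
      <t>
        As a non-normative illustration, a client could sanity-check the
        extent list of a read-only layout against the criteria above as
        sketched below.  The "ext" structure, the "ext_state" names, and
        the function name are assumptions made for this example and are
        not part of the protocol; the minimum-size check and arithmetic
        overflow handling are omitted for brevity.
      </t>

      <figure>
        <artwork><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent; not the XDR type itself. */
enum ext_state { EXT_READ_WRITE, EXT_READ, EXT_INVALID, EXT_NONE };

struct ext {
    uint64_t file_offset;   /* mirrors se_file_offset */
    uint64_t length;        /* extent size in bytes */
    enum ext_state state;   /* mirrors se_state */
};

/*
 * Check a read-only layout: only readable states, the first
 * extent contains req_offset, and every extent starts exactly
 * where its predecessor ends (sorted and contiguous).
 */
bool ro_extent_list_ok(const struct ext *e, size_t n,
                       uint64_t req_offset)
{
    if (n == 0)
        return false;
    if (req_offset < e[0].file_offset ||
        req_offset - e[0].file_offset >= e[0].length)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (e[i].state != EXT_READ && e[i].state != EXT_NONE)
            return false;
        if (i > 0 && e[i].file_offset !=
                     e[i - 1].file_offset + e[i - 1].length)
            return false;
    }
    return true;
}
]]></artwork>
      </figure>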

      <t>
        According to <xref target='RFC5661' />,  if the minimum requested
	size, loga_minlength, is zero, this is an indication to the
	metadata server that the client desires any layout at offset
	loga_offset or less that the metadata server has "readily
	available".  Given the lack of a clear definition of this phrase,
	in the context of the SCSI layout type, when loga_minlength is
	zero, the metadata server SHOULD:

        <list style='symbols'>
          <t>
	    when processing requests for readable layouts, return all
	    such layouts, even if some extents are in the
	    PNFS_SCSI_NONE_DATA state.
	  </t>
	  <t>
	     when processing requests for writable layouts, return
	     extents which can be returned in the PNFS_SCSI_READ_WRITE_DATA
	     state.
	  </t>
        </list>
      </t>
    </section>

    <section anchor='ssc:commits' title='Layout Commits'>
      <figure>
        <artwork>
 ///
 /// /* SCSI layout-specific type for lou_body */
 ///
 /// struct pnfs_scsi_range4 {
 ///     offset4      sr_file_offset;   /* starting byte offset
 ///                                       in the file */
 ///     length4      sr_length;        /* size in bytes */
 /// };
 ///
 /// struct pnfs_scsi_layoutupdate4 {
 ///     pnfs_scsi_range4 slu_commit_list&lt;&gt;;
 ///                                    /* list of extents which
 ///                                     * now contain valid data.
 ///                                     */
 /// };
        </artwork>
      </figure>

      <t>
        The "pnfs_scsi_layoutupdate4" structure is used by the client as the
        SCSI layout-specific argument in a LAYOUTCOMMIT operation.  The
        "slu_commit_list" field is a list covering regions of the file layout
	that were previously in the PNFS_SCSI_INVALID_DATA state, but have
	been written by the client and should now be considered in the
	PNFS_SCSI_READ_WRITE_DATA state. The extents in the commit list MUST
	be disjoint and MUST be sorted by sr_file_offset.  Implementors should
	be aware that a server may be unable to commit regions at a granularity
        smaller than a file-system block (typically 4 KB or 8 KB).  As noted
        above, the block size that the server uses is available as an NFSv4
        attribute, and any extents included in the "slu_commit_list" MUST be
        aligned to this granularity and have a size that is a multiple of
        this granularity.
	Since the block in question is in state PNFS_SCSI_INVALID_DATA,
	byte ranges not written should be filled with zeros.  This applies
	even if it appears that the area being written is beyond what the
	client believes to be the end of file.
      </t>
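
      <t>
        The constraints on "slu_commit_list" can be illustrated with the
        following non-normative sketch of a check that a client might run
        before sending a LAYOUTCOMMIT, or a server on receipt.  The
        "range" structure and the function name are hypothetical, and
        block_size stands for the server's block size obtained from the
        NFSv4 attribute mentioned above.
      </t>

      <figure>
        <artwork><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory mirror of pnfs_scsi_range4. */
struct range {
    uint64_t file_offset;   /* sr_file_offset */
    uint64_t length;        /* sr_length */
};

/*
 * Return true if the ranges are sorted by offset, disjoint, and
 * block-aligned in both offset and length.
 */
bool commit_list_ok(const struct range *r, size_t n,
                    uint64_t block_size)
{
    uint64_t prev_end = 0;

    if (block_size == 0)
        return false;
    for (size_t i = 0; i < n; i++) {
        if (r[i].file_offset % block_size != 0 ||
            r[i].length % block_size != 0)
            return false;
        /* Starting before the previous end means the list is
         * either unsorted or overlapping. */
        if (i > 0 && r[i].file_offset < prev_end)
            return false;
        prev_end = r[i].file_offset + r[i].length;
    }
    return true;
}
]]></artwork>
      </figure>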
    </section>

    <section anchor='ssc:returns' title='Layout Returns'>
      <t>
        A LAYOUTRETURN operation represents an explicit release of
	resources by the client.  This may be done in response to a
	CB_LAYOUTRECALL or before any recall, in order to avoid a future
	CB_LAYOUTRECALL. When the LAYOUTRETURN operation specifies a
	LAYOUTRETURN4_FILE return type, then the layoutreturn_file4 data
	structure specifies the region of the file layout that is no
	longer needed by the client.
      </t>
      <t>
        The LAYOUTRETURN operation is done without any SCSI
	layout-specific data.  The opaque "lrf_body" field of the
	"layoutreturn_file4" data structure MUST have length zero.
      </t>
    </section>

    <section anchor='ssc:revoke' title='Layout Revocation'>
      <t>
        Layouts may be unilaterally revoked by the server due to the
	client's lease time expiring or the client failing to return a
	recalled layout in a timely manner.  For the SCSI
	layout type, this is accomplished by fencing off the client from
	access to storage as described in <xref target="ssc:fencing" />.
	When this is done, it is necessary that all I/Os issued by the
	fenced-off client be rejected by the storage.  This includes any
	in-flight I/Os that the client issued before the layout was
	revoked.
      </t>

      <t>
        Note that the granularity of this operation can only be at the
	host/LU level.  Thus, if one of a client's layouts is
	unilaterally revoked by the server, it will effectively render
	useless *all* of the client's layouts for files located on the
	storage units comprising the volume.  This may render
	useless the client's layouts for files in other file systems.
	See <xref target="ssc:fencing:recovery" /> for a discussion of
	recovery from fencing.
      </t>
    </section>

    <section anchor='ssc:copywrite' title='Client Copy-on-Write Processing'>
      <t>
        Copy-on-write is a mechanism used to support file and/or file system
        snapshots.  When writing to unaligned regions, or to regions smaller
        than a file system block, the writer must copy the portions of the
        original file data to a new location on disk.  This behavior can
        either be implemented on the client or the server.  The paragraphs
        below describe how a pNFS SCSI layout client implements access to a
        file that requires copy-on-write semantics.
      </t>

      <t>
        Distinguishing the PNFS_SCSI_READ_WRITE_DATA and
        PNFS_SCSI_READ_DATA extent types in combination with the allowed
        overlap of PNFS_SCSI_READ_DATA extents with PNFS_SCSI_INVALID_DATA
        extents allows copy-on-write processing to be done by pNFS clients.
        In classic NFS, this operation would be done by the server.  Since
        pNFS enables clients to do direct block access, it is useful for
        clients to participate in copy-on-write operations.  All SCSI
        pNFS clients MUST support this copy-on-write processing.
      </t>

      <t>
        When a client wishes to write data covered by a PNFS_SCSI_READ_DATA
        extent, it MUST have requested a writable layout from the server;
        that layout will contain PNFS_SCSI_INVALID_DATA extents to cover all
        the data ranges of that layout's PNFS_SCSI_READ_DATA extents.  More
        precisely, for any se_file_offset range covered by one or more
        PNFS_SCSI_READ_DATA extents in a writable layout, the server MUST
        include one or more PNFS_SCSI_INVALID_DATA extents in the layout
        that cover the same se_file_offset range.  When performing a write
        to such an area of a layout, the client MUST effectively copy the
        data from the PNFS_SCSI_READ_DATA extent for any partial blocks of
        se_file_offset and range, merge in the changes to be written, and
        write the result to the PNFS_SCSI_INVALID_DATA extent for the blocks
        for that se_file_offset and range.  That is, if entire blocks of
        data are to be overwritten by an operation, the corresponding
        PNFS_SCSI_READ_DATA blocks need not be fetched, but any
        partial-block writes must be merged with data fetched via
        PNFS_SCSI_READ_DATA extents before storing the result via
        PNFS_SCSI_INVALID_DATA extents.  For the purposes of this
        discussion, "entire blocks" and "partial blocks" refer to the
        server's file-system block size.  Storing of data in a
        PNFS_SCSI_INVALID_DATA extent converts the written portion of the
        PNFS_SCSI_INVALID_DATA extent to a PNFS_SCSI_READ_WRITE_DATA
        extent; all subsequent reads MUST be performed from this extent; the
        corresponding portion of the PNFS_SCSI_READ_DATA extent MUST NOT be
        used after storing data in a PNFS_SCSI_INVALID_DATA extent.  If a
        client writes only a portion of an extent, the extent may be split at
        block-aligned boundaries.
      </t>

      <t>
        When a client wishes to write data to a PNFS_SCSI_INVALID_DATA
        extent that is not covered by a PNFS_SCSI_READ_DATA extent, it MUST
        treat this write identically to a write to a file not involved with
        copy-on-write semantics.  Thus, data must be written in at least
        block-sized increments, aligned to multiples of block-sized offsets,
        and unwritten portions of blocks must be zero filled.
      </t>
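
      <t>
        The block arithmetic behind both cases can be sketched as follows.
        This is a non-normative illustration; the "write_plan" structure
        and the function name are assumptions made for the example.  A
        block flagged as partial must be completed before it is written:
        with data read via the PNFS_SCSI_READ_DATA extent when one covers
        the range, and with zeros otherwise.
      </t>

      <figure>
        <artwork><![CDATA[
#include <stdbool.h>
#include <stdint.h>

struct write_plan {
    uint64_t start;       /* block-aligned start of device write */
    uint64_t end;         /* block-aligned end (exclusive) */
    bool head_partial;    /* first block only partly overwritten */
    bool tail_partial;    /* last block only partly overwritten */
};

/*
 * Expand the byte range [offset, offset + len) to block
 * boundaries.  Assumes len and bs are nonzero.  For a write that
 * fits inside one block, both flags may refer to the same block.
 */
struct write_plan plan_write(uint64_t offset, uint64_t len,
                             uint64_t bs)
{
    struct write_plan p;
    uint64_t end = offset + len;

    p.start = offset - offset % bs;
    p.end = (end % bs) ? end - end % bs + bs : end;
    p.head_partial = (offset % bs) != 0;
    p.tail_partial = (end % bs) != 0;
    return p;
}
]]></artwork>
      </figure>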
    </section>

    <section anchor='ssc:extperms' title='Extents are Permissions'>
      <t>
        Layout extents returned to pNFS clients grant permission to read or
        write: PNFS_SCSI_READ_DATA and PNFS_SCSI_NONE_DATA are read-only
        (PNFS_SCSI_NONE_DATA reads as zeros), while PNFS_SCSI_READ_WRITE_DATA
        and PNFS_SCSI_INVALID_DATA are read/write (PNFS_SCSI_INVALID_DATA
        reads as zeros; any write converts it to PNFS_SCSI_READ_WRITE_DATA).
        This is the only means a client has of obtaining permission to
        perform direct I/O to storage devices; a pNFS client MUST NOT perform
        direct I/O operations that are not permitted by an extent held by the
        client.  Client adherence to this rule places the pNFS server in
        control of potentially conflicting storage device operations,
        enabling the server to determine what does conflict and how to avoid
        conflicts by granting and recalling extents to/from clients.
      </t>
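
      <t>
        The permissions implied by each extent state can be summarized by
        the following non-normative sketch.  The enumeration and function
        names are illustrative assumptions; the actual state values are
        those defined by the XDR for this layout type.
      </t>

      <figure>
        <artwork><![CDATA[
#include <stdbool.h>

/* Illustrative names; the wire values come from the XDR. */
enum perm_state { ST_READ_WRITE, ST_READ, ST_INVALID, ST_NONE };

/* Direct writes are permitted only by writable extents. */
bool may_write(enum perm_state s)
{
    return s == ST_READ_WRITE || s == ST_INVALID;
}

/* Reads of these states are satisfied by zero-filling. */
bool reads_as_zeros(enum perm_state s)
{
    return s == ST_INVALID || s == ST_NONE;
}

/* Reads of these states are issued to the storage device. */
bool reads_from_storage(enum perm_state s)
{
    return s == ST_READ_WRITE || s == ST_READ;
}
]]></artwork>
      </figure>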

      <t>
        If a client makes a layout request that conflicts with an existing
        layout delegation, the request will be rejected with the error
        NFS4ERR_LAYOUTTRYLATER.  The client is then expected to retry the
        request after a short interval.  During this interval, the server
        SHOULD recall the conflicting portion of the layout delegation from
        the client that currently holds it.  This reject-and-retry approach
        does not prevent client starvation when there is contention for the
        layout of a particular file.  For this reason, a pNFS server SHOULD
        implement a mechanism to prevent starvation.  One possibility is that
        the server can maintain a queue of rejected layout requests.  Each
        new layout request can be checked to see if it conflicts with a
        previously rejected request, and if so, the newer request can be
        rejected.  Once the original requesting client retries its request,
        its entry in the rejected request queue can be cleared, or the entry
        in the rejected request queue can be removed when it reaches a
        certain age.
      </t>

      <t>
        NFSv4 supports mandatory locks and share reservations.  These are
        mechanisms that clients can use to restrict the set of I/O operations
        that are permissible to other clients.  Without pNFS, all I/O
        operations ultimately arrive at the NFSv4 server for processing, so
        the server is in a position to enforce these restrictions.  However,
        with pNFS
        layouts, I/Os will be issued from the clients that hold the layouts
        directly to the storage devices that host the data.  These devices
        have no knowledge of files, mandatory locks, or share reservations,
        and are not in a position to enforce such restrictions.  For this
        reason the NFSv4 server MUST NOT grant layouts that conflict with
        mandatory locks or share reservations.  Further, if a conflicting
        mandatory lock request or a conflicting open request arrives at the
        server, the server MUST recall the part of the layout in conflict
        with the request before granting the request.
      </t>
    </section>

    <section anchor='ssc:partial' title='Partial-Block Updates'>
      <t>
        SCSI storage devices do not provide byte-granularity access and can
        only perform read and write operations atomically at block
        granularity.  Writes to SCSI storage devices thus require
        read-modify-write cycles to write data that is smaller than the
        block size or that is otherwise not block-aligned.
      </t>

      <t>
        Write operations from multiple clients to the same block can thus
        lead to data corruption even if the byte ranges written by the
        applications do not overlap.
      </t>

      <t>
        When there are multiple clients that wish to access the same
        block, a pNFS server MUST avoid these conflicts by implementing a
        concurrency control policy of single writer XOR multiple readers for
        a given data block.
      </t>
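
      <t>
        One way a server might express the single writer XOR multiple
        readers policy is sketched below.  This is a non-normative
        illustration; the per-block "block_holders" bookkeeping and the
        function name are assumptions made for the example, and the counts
        refer to clients other than the requester.
      </t>

      <figure>
        <artwork><![CDATA[
#include <stdbool.h>

/* Hypothetical per-block bookkeeping kept by the server. */
struct block_holders {
    unsigned readers;   /* other clients with readable extents */
    unsigned writers;   /* other clients with writable extents */
};

/* True if granting the requested access would break the policy. */
bool grant_conflicts(const struct block_holders *h, bool want_write)
{
    if (want_write)
        return h->readers > 0 || h->writers > 0;
    return h->writers > 0;
}
]]></artwork>
      </figure>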
    </section>

    <section anchor='ssc:eof' title='End-of-file Processing'>
      <t>
        The end-of-file location can be changed in two ways: implicitly as
        the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file,
        or explicitly as the result of a SETATTR request.  Typically,
        when a file is truncated by an NFSv4 client via the SETATTR call,
        the server frees any disk blocks belonging to the file that are
        beyond the new end-of-file byte, and MUST write zeros to the
        portion of the new end-of-file block beyond the new end-of-file
        byte.  These actions render any pNFS layouts that refer to the
        blocks that are freed or written semantically invalid.  Therefore,
        the server MUST recall from clients the portions of any pNFS
        layouts that refer to blocks that will be freed or written by
        the server before effecting the file truncation.  These recalls
        may take time to complete; as explained in <xref target='RFC5661' />,
        if the server cannot respond to the client SETATTR request
        in a reasonable amount of time, it SHOULD reply to the client
        with the error NFS4ERR_DELAY.
      </t>

      <t>
        Blocks in the PNFS_SCSI_INVALID_DATA state that lie beyond the new
        end-of-file block present a special case.  The server has reserved
        these blocks for use by a pNFS client with a writable layout for the
        file, but the client has yet to commit the blocks, and they are not
        yet a part of the file mapping on disk.  The server MAY free these
        blocks while processing the SETATTR request.  If so, the server MUST
        recall any layouts from pNFS clients that refer to the blocks before
        processing the truncate.  If the server does not free the
        PNFS_SCSI_INVALID_DATA blocks while processing the SETATTR request,
        it need not recall layouts that refer only to the
	PNFS_SCSI_INVALID_DATA blocks.
      </t>

      <t>
        When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond
        the current end-of-file, or extended explicitly by a SETATTR request,
        the server need not recall any portions of any pNFS layouts.
      </t>
    </section>

    <section anchor='ssc:hints' title='Layout Hints'>
      <t>
        The layout hint attribute specified in <xref target='RFC5661' />
	is not supported by the SCSI layout, and the pNFS server MUST
	reject setting a layout hint attribute with a loh_type value
	of LAYOUT4_SCSI_VOLUME during OPEN or SETATTR operations.  On a
	file system only supporting the SCSI layout, a server MUST NOT
	report the layout_hint attribute in the supported_attrs attribute.
      </t>
    </section>

    <section anchor='ssc:fencing' title='Client Fencing'>
      <t>
        The pNFS SCSI protocol must handle situations in which a system
        failure, typically a network connectivity issue, requires the server
        to unilaterally revoke extents from a client after the client fails
	to respond to a CB_LAYOUTRECALL request.  This is implemented by
	fencing off a non-responding client from access to the storage
	device.
      </t>

      <t>
        The pNFS SCSI protocol implements fencing using Persistent
	Reservations (PRs), similar to the fencing method used by existing
	shared disk file systems.  By placing a PR of type
	"Exclusive Access – Registrants Only" on each SCSI LU exported to
	pNFS clients, the MDS prevents access from any client that
	does not have an outstanding device ID that gives the client
	a reservation key to access the LU, and is able to
	revoke access to the logical unit at any time.
      </t>

      <section anchor='ssc:fencing:keys'
		title='PRs - Key Generation'>
      <t>
	To allow fencing individual systems, each system must use a unique
	Persistent Reservation key.  <xref target="SPC4" /> does not specify
	a way to generate keys.  This document assigns the burden of generating
	unique keys to the MDS, which must generate a key for itself before
	exporting a volume, and a key for each client that accesses
	SCSI layout volumes.  Individual keys for each volume that a client
	can access are permitted but not required.
      </t>
      </section>

      <section anchor='ssc:fencing:mds'
		title='PRs - MDS Registration and Reservation'>
      <t>
	Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
	MDS needs to prepare the volume for fencing using PRs.
	This is done by registering the reservation key generated for the MDS with
	the device using the "PERSISTENT RESERVE OUT" command with a service
	action of "REGISTER", followed by a "PERSISTENT RESERVE OUT" command,
	with a service action of "RESERVE" and the type field set to 8h
	(Exclusive Access – Registrants Only).
	To make sure all I_T nexuses (see section 3.1.45 of <xref target="SAM-4" />)
	are registered, the MDS SHOULD set the
	"All Target Ports" (ALL_TG_PT) bit when registering the key, or
	otherwise ensure the registration is performed for each initiator port.
      </t>
      </section>

      <section anchor='ssc:fencing:client'
		title='PRs - Client Registration'>
        <t>
	  Before performing the first I/O to a device returned from a
	  GETDEVICEINFO operation, the client will register the reservation
	  key returned in sbv_pr_key with the storage device by issuing a
	  "PERSISTENT RESERVE OUT" command with a service action of
	  "REGISTER" with the "SERVICE ACTION RESERVATION KEY" set to the
	  reservation key returned in sbv_pr_key.
	  To make sure all I_T nexuses are registered, the client SHOULD set the
	  "All Target Ports" (ALL_TG_PT) bit when registering the key, or
	  otherwise ensure the registration is performed for each initiator port.
        </t>
        <t>
	  When a client stops using a device earlier returned by
	  GETDEVICEINFO it MUST unregister the earlier registered key by
	  issuing a "PERSISTENT RESERVE OUT" command with a service action of
	  "REGISTER" with the "RESERVATION KEY" set to the earlier registered
	  reservation key.
        </t>
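
        <t>
          As a non-normative illustration, the parameter data for the
          "REGISTER" service action could be assembled as sketched below.
          The byte offsets and the ALL_TG_PT bit position reflect one
          reading of the basic PERSISTENT RESERVE OUT parameter list and
          are assumptions that must be verified against
          <xref target="SPC4" />; the function and macro names are
          hypothetical.  The sketch assumes that an initial registration
          passes zero as the current key, and that unregistering passes the
          registered key as the current key with a new key of zero.
        </t>

        <figure>
          <artwork><![CDATA[
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PR_OUT_PARAM_LEN 24    /* assumed basic list length */
#define PR_OUT_ALL_TG_PT 0x04  /* assumed: byte 20, bit 2 */

/* Store a 64-bit key in big-endian byte order. */
static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* Fill buf with parameter data for a REGISTER service action. */
void pr_out_register_params(uint8_t buf[PR_OUT_PARAM_LEN],
                            uint64_t cur_key, uint64_t new_key,
                            bool all_tg_pt)
{
    memset(buf, 0, PR_OUT_PARAM_LEN);
    put_be64(buf, cur_key);       /* RESERVATION KEY */
    put_be64(buf + 8, new_key);   /* SERVICE ACTION RESERVATION KEY */
    if (all_tg_pt)
        buf[20] |= PR_OUT_ALL_TG_PT;
}
]]></artwork>
        </figure>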
      </section>

      <section anchor='ssc:fencing:fence'
		title='PRs - Fencing Action'>
        <t>
	  In case of a non-responding client, the MDS fences the client
	  by issuing a "PERSISTENT RESERVE OUT" command with the service
	  action set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key
	  field set to the server's reservation key, the service action
	  reservation key field set to the reservation key associated with
	  the non-responding client, and the type field set to 8h (Exclusive
	  Access – Registrants Only).
        </t>

	<t>
	  After the MDS preempts a client, all client I/O to the LU
	  fails.  The client should at this point return any layout that
	  refers to the device ID that points to the LU.  Note that
	  the client can distinguish I/O errors due to fencing from other
	  errors based on the "RESERVATION CONFLICT" SCSI status.  Refer to
	  <xref target="SPC4" /> for details.
        </t>
      </section>

      <section anchor='ssc:fencing:recovery'
		title='Client Recovery After a Fence Action'>
        <t>
	  A client that detects a "RESERVATION CONFLICT" SCSI status
	  (I/O error) on the storage devices MUST commit all layouts that
	  use the storage device through the MDS, return all outstanding
	  layouts for the device, forget the device ID and unregister the
	  reservation key.
	  Future GETDEVICEINFO calls may refer to the storage device
	  again, in which case the client will perform a new registration
	  based on the key provided (via sbv_pr_key) at that time.
        </t>
      </section>
    </section>
  </section>

  <section anchor='ssc:recovery' title='Crash Recovery Issues'>
    <t>
      A critical requirement in crash recovery is that both the client and
      the server know when the other has failed.  Additionally, it is
      required that a client see a consistent view of data across
      server restarts.  These requirements and a full discussion of
      crash recovery issues are covered in the "Crash Recovery" section
      of the NFSv4.1 specification <xref target='RFC5661' />.  This
      document contains additional crash recovery material specific
      only to the SCSI layout.
    </t>

    <t>
      When the server crashes while the client holds a writable layout, and
      the client has written data to blocks covered by the layout, and the
      blocks are still in the PNFS_SCSI_INVALID_DATA state, the client has
      two options for recovery.  If the data that has been written to these
      blocks is still cached by the client, the client can simply re-write
      the data via NFSv4, once the server has come back online.  However,
      if the data is no longer in the client's cache, the client MUST NOT
      attempt to source the data from the data servers.  Instead, it should
      attempt to commit the blocks in question to the server during the
      server's recovery grace period, by sending a LAYOUTCOMMIT with the
      "loca_reclaim" flag set to true.  This process is described in detail
      in Section 18.42.4 of <xref target='RFC5661' />.
    </t>
  </section>

  <section anchor='ssc:cb_recall' title='Recalling Resources: CB_RECALL_ANY'>
    <t>
      The server may decide that it cannot hold all of the state for
      layouts without running out of resources.  In such a case, it is free
      to recall individual layouts using CB_LAYOUTRECALL to reduce the
      load, or it may choose to request that the client return any layout.
    </t>
   
    <t>
      The NFSv4.1 spec <xref target='RFC5661' /> defines the following types:
    </t>

    <figure>
      <artwork>
   const RCA4_TYPE_MASK_BLK_LAYOUT = 4;

   struct CB_RECALL_ANY4args {
          uint32_t      craa_objects_to_keep;
          bitmap4       craa_type_mask;
   };
      </artwork>
    </figure>

    <t>
      When the server sends a CB_RECALL_ANY request to a client specifying
      the RCA4_TYPE_MASK_BLK_LAYOUT bit in craa_type_mask, the client
      should immediately respond with NFS4_OK, and then asynchronously
      return complete file layouts until the number of files with layouts
      cached on the client is less than craa_objects_to_keep.
    </t>
  </section>

  <section anchor='ssc:errors' title='Transient and Permanent Errors'>
    <t>
      The server may respond to LAYOUTGET with a variety of error statuses.
      These errors can convey transient conditions or more permanent
      conditions that are unlikely to be resolved soon.
    </t>

    <t>
      The error NFS4ERR_RECALLCONFLICT indicates that the server has
      recently issued a CB_LAYOUTRECALL to the requesting client, making it
      necessary for the client to respond to the recall before processing
      the layout request.  A client can wait for that recall to be received
      and processed, or it can retry as for NFS4ERR_LAYOUTTRYLATER, as
      described below.
    </t>

    <t>
      The error NFS4ERR_LAYOUTTRYLATER is used to indicate that the server
      cannot immediately grant the layout to the client.  This may be due to
      constraints on writable sharing of blocks by multiple clients or to a
      conflict with a recallable lock (e.g., a delegation).  In either case, a
      reasonable approach for the client is to wait several milliseconds
      and retry the request.  The client SHOULD track the number of retries,
      and if forward progress is not made, the client should abandon the
      attempt to get a layout and perform READ and WRITE operations by
      sending them to the server.
    </t>
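
    <t>
      The retry tracking described above can be sketched as follows.  This
      is a non-normative illustration; the names and the idea of a fixed
      retry limit are assumptions made for the example, and the delay
      between attempts is left to the caller.
    </t>

    <figure>
      <artwork><![CDATA[
/* What the client does after a "try later" LAYOUTGET result. */
enum layout_action { LA_RETRY, LA_FALLBACK_MDS };

/*
 * Count one more failed attempt.  Once max_retries attempts have
 * been used up, give up on the layout and send READ and WRITE
 * operations to the server instead.
 */
enum layout_action on_trylater(unsigned *retries,
                               unsigned max_retries)
{
    if (*retries >= max_retries)
        return LA_FALLBACK_MDS;
    (*retries)++;
    return LA_RETRY;
}
]]></artwork>
    </figure>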
    
    <t>
      The error NFS4ERR_LAYOUTUNAVAILABLE may be returned by the server
      if layouts are not supported for the requested file or its containing
      file system.  The server may also return this error code if the server
      is in the process of migrating the file from secondary storage, there is
      a conflicting lock that would prevent the layout from being granted,
      or for any other reason that causes the server to be unable to supply
      the layout.  As a result of receiving NFS4ERR_LAYOUTUNAVAILABLE, the
      client should abandon the attempt to get a layout and perform READ and
      WRITE operations by sending them to the MDS.
      It is expected that a client will not cache the file's layoutunavailable
      state forever.  In particular, when the file is closed or opened by the
      client, issuing a new LAYOUTGET is appropriate.
    </t>
  </section>
  <section anchor='ssc:caches' title='Volatile Write Caches'>
    <t>
      Many storage devices implement volatile write caches that require an
      explicit flush to persist the data from write operations to stable
      storage. Storage devices implementing <xref target="SBC3" /> should
      indicate a volatile write cache by setting the WCE bit to 1 in the
      Caching mode page.
      When a volatile write cache is used, the pNFS server must ensure
      the volatile write cache has been committed to stable storage
      before the LAYOUTCOMMIT operation returns by using one of the
      SYNCHRONIZE CACHE commands.
    </t>
  </section>
</section>

<section anchor="sec:semantics" title="Enforcing NFSv4 Semantics">
  <t>
   The functionality provided by SCSI Persistent Reservations makes it
   possible for the MDS to control access by individual client machines
   to specific LUs.  Individual client machines may be allowed to or
   prevented from reading or writing to certain block devices.
   Finer-grained access control methods are not generally available.
  </t>

  <t>
   For this reason, certain responsibilities for enforcing NFSv4
   semantics, including security and locking, are delegated to pNFS
   clients when SCSI layouts are being used.  The metadata server's role
   is only to grant layouts appropriately, and the pNFS clients have to be
   trusted to perform only accesses allowed by the layout extents they
   currently hold (and, e.g., not to access storage for files on which a
   layout extent is not held).  In general, the server will not be able
   to prevent a client that holds a layout for a file from accessing
   parts of the physical disk not covered by the layout.  Similarly, the
   server will not be able to prevent a client from accessing blocks
   covered by a layout that it has already returned.  The pNFS client
   must respect the layout model for this mapping type to appropriately
   respect NFSv4 semantics.
  </t>

  <t>
   Furthermore, there is no way for the storage to determine the specific
   NFSv4 entity (principal, openowner, lockowner) on whose behalf the I/O
   operation is being done.  This fact may limit the functionality to be
   supported and require the pNFS client to implement server policies
   other than those describable by layouts.
  </t>

  <t>
   In cases in which layouts previously granted become invalid, the
   server has the option of recalling them.  In situations in which
   communication difficulties prevent this from happening, layouts may be
   revoked by the server.  This revocation is accompanied by changes in
   persistent reservation which have the effect of preventing SCSI access
   to the LUs in question by the client.
  </t>

  <section anchor='ssc:semantics:stateid'
	   title='Use of Open Stateids'>
    <t>
     The effective implementation of these NFSv4 semantic constraints is
     complicated by the different granularities of the actors for the
     different types of the functionality to be enforced:

     <list style='symbols'>
     <t>
	To enforce security constraints for particular principals.
     </t>
     <t>
        To enforce locking constraints for particular owners (openowners
	and lockowners).
     </t>
     </list>

     Fundamental to enforcing both of these sorts of constraints is the
     principle that a pNFS client must not issue a SCSI I/O operation
     unless it possesses both:

     <list style='symbols'>
     <t>
	A valid open stateid for the file in question that allows I/O of
	the type in question and that is associated with the openowner
	and principal on whose behalf the I/O is to be done.
     </t>
     <t>
        A valid layout stateid for the file in question that covers the
	byte range on which the I/O is to be done and that allows I/O of that
	type to be done.
     </t>
     </list>

     As a result, if the equivalent of I/O with an anonymous or write-bypass
     stateid is to be done, it MUST NOT be done using the pNFS SCSI layout
     type.  The client MAY attempt such I/O using READs and WRITEs that do
     not use pNFS and are directed to the MDS.
    </t>
    <t>
     When open stateids are revoked, due to lease expiration or any form of
     administrative revocation, the server MUST recall all layouts that
     allow I/O to be done on any of the files for which open revocation
     happens.  When there is a failure to successfully return those
     layouts, the client MUST be fenced.
    </t>
  </section>

  <section anchor='ssc:semantics:security'
	   title='Enforcing Security Restrictions'>
    <t>
     The restriction noted above provides adequate enforcement of
     appropriate security restrictions when the principal issuing the I/O is
     the same as that opening the file.  The server is responsible for
     checking that the I/O mode requested by the open is allowed for the
     principal doing the OPEN.  If the correct sort of I/O is done on behalf
     of the same principal, then the security restriction is thereby
     enforced.
    </t>

    <t>
     If I/O is done by a principal different from the one that opened the
     file, the client SHOULD send the I/O to be performed by the metadata
     server rather than doing it directly to the storage device.
    </t>
  </section>

  <section anchor='ssc:semantics:locking'
	   title='Enforcing Locking Restrictions'>
    <t>
     Mandatory enforcement of whole-file locking by means of share
     reservations is provided when the pNFS client obeys the requirement
     set forth in <xref target='ssc:semantics:stateid' />.  Since performing
     I/O requires a valid open stateid, an I/O that violates an existing
     share reservation would only be possible when the server allows
     conflicting open stateids to exist.
    </t>
    <t>
     The nature of the SCSI layout type is such implementation/enforcement of
     mandatory byte-range locks is very difficult. Given that layouts are
     granted to clients rather than owners, the pNFS client is in no position to
     successfully arbitrate among multiple lockowners on the same client. Suppose
     lockowner A is doing a write and, while the I/O is pending, lockowner B
     requests a mandatory byte-range for a byte range potentially overlapping
     the pending I/O. In such a situation, the lock request cannot be granted
     while the I/O is pending. In a non-pNFS environment, the server would have
     to wait for pending I/O before granting the mandatory byte-range lock. In
     the pNFS environment the server does not issue the I/O and is thus in no
     position to wait for its completion. The server may recall such layouts but
     in doing so, it has no way of distinguishing those being used by lockowners
     A and B, making it difficult to allow B to perform I/O while forbidding A
     from doing so. Given this fact, the MDS needs to successfully recall all
     layouts that overlap the range being locked before returning a successful
     response to the LOCK request. While the lock is in effect, the server
     SHOULD respond to requests for layouts which overlap a currently locked
     area with NFS4ERR_LAYOUTUNAVAILABLE. To simplify the required logic, a
     server MAY do this for all layout requests on the file in question as long
     as there are any byte-range locks in effect.
    </t>
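    <t>
     The server-side behavior described above can be sketched as follows.
     This is a non-normative illustration only: the FileLockState class,
     its method names, the "strict" flag, and the recall callback are
     hypothetical, ranges are assumed to have finite non-zero lengths, and
     conflict checks between locks as well as the handling of a failed
     recall (for example, fencing) are left out.
    </t>
    <figure>
     <artwork><![CDATA[
```python
class FileLockState:
    """Hypothetical per-file MDS state: granted layouts and byte-range locks."""

    def __init__(self, strict=False):
        self.layouts = []     # (client_id, offset, length)
        self.locks = []       # (lockowner, offset, length)
        self.strict = strict  # True: refuse all layouts while any lock exists

    @staticmethod
    def _overlaps(off_a, len_a, off_b, len_b):
        return off_a < off_b + len_b and off_b < off_a + len_a

    def lock(self, owner, offset, length, recall):
        """Grant a mandatory lock once all overlapping layouts are recalled.

        recall is a callable taking a layout tuple and returning True when
        the client has returned that layout.
        """
        overlapping = [l for l in self.layouts
                       if self._overlaps(l[1], l[2], offset, length)]
        for layout in overlapping:
            if not recall(layout):
                return False  # no successful LOCK response without the recall
            self.layouts.remove(layout)
        self.locks.append((owner, offset, length))
        return True

    def layoutget(self, client_id, offset, length):
        """Refuse layouts that overlap a locked range (or all, if strict)."""
        conflict = any(self._overlaps(o, n, offset, length)
                       for _, o, n in self.locks)
        if self.locks and (self.strict or conflict):
            return "NFS4ERR_LAYOUTUNAVAILABLE"
        self.layouts.append((client_id, offset, length))
        return "NFS4_OK"
```
]]></artwork>
    </figure>
    <t>
     Setting the hypothetical "strict" flag corresponds to the simplified
     behavior permitted above, in which every layout request on the file is
     refused while any byte-range lock is in effect.
    </t>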
    <t>
     Given these difficulties, it may be hard for servers supporting
     mandatory byte-range locks to also support SCSI layouts. Servers can
     support advisory byte-range locks instead. The NFSv4 protocol currently has
     no way of determining whether byte-range lock support on a particular file
     system will be mandatory or advisory, except by trying an operation that
     would conflict if mandatory locking is in effect. Therefore, to avoid
     confusion, servers SHOULD NOT switch between mandatory and advisory
     byte-range locking based on whether any SCSI layouts have been obtained or
     whether a client that has obtained a SCSI layout has requested a byte-range
     lock.
    </t>
  </section>
</section>

<section anchor="sec:security" title="Security Considerations">
  <t>
   Access to SCSI storage devices is logically at a lower layer of the
   I/O stack than NFSv4, and hence NFSv4 security is not directly
   applicable to protocols that access such storage directly.  Depending
   on the protocol, some of the security mechanisms provided by NFSv4
   (e.g., encryption, cryptographic integrity) may not be available or
   may be provided via different means.  At one extreme, pNFS with
   SCSI layouts can be used with storage access protocols (e.g., serial
   attached SCSI (<xref target='SAS3' />)) that provide essentially no
   security functionality. At the other extreme, pNFS may be used with
   storage protocols such as iSCSI (<xref target='RFC7143' />) that can
   provide significant security functionality.  It is the responsibility
   of those administering and deploying pNFS with a SCSI storage access
   protocol to ensure that appropriate protection is provided to that
   protocol (physical security is a common means for protocols not based
   on IP).  In environments where the security requirements for the storage
   protocol cannot be met, pNFS SCSI layouts SHOULD NOT be used.
  </t>

  <t>
   When security is available for a storage protocol, it is generally at
   a different granularity and with a different notion of identity than
   NFSv4 (e.g., NFSv4 controls user access to files, iSCSI controls
   initiator access to volumes).  The responsibility for enforcing
   appropriate correspondences between these security layers is placed
   upon the pNFS client.  As with the issues in the first paragraph of
   this section, in environments where the security requirements are
   such that client-side protection from access to storage outside of
   the layout is not sufficient, pNFS SCSI layouts
   SHOULD NOT be used.
  </t>
</section>

<section anchor="sec:iana" title="IANA Considerations">
  <t>
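    IANA is requested to assign a new pNFS layout type in the pNFS Layout
    Types Registry as follows (the value 5 is suggested):
    <list style='hanging'>
      <t hangText='Layout Type Name:'>LAYOUT4_SCSI</t>
      <t hangText='Value:'>0x00000005</t>
      <t hangText='RFC:'>RFCTBD10</t>
      <t hangText='How:'>L (new layout type)</t>
      <t hangText='Minor Versions:'>1</t>
    </list>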
  </t>
</section>

</middle>

<back>

  <references title="Normative References">

    <reference anchor='RFC2119'>
      <front>
	<title abbrev='RFC Key Words'>Key words for use in RFCs to Indicate Requirement Levels</title>
	<author initials='S.' surname='Bradner' fullname='Scott Bradner'>
	  <organization>Harvard University</organization>
	  <address>
	    <postal>
	      <street>1350 Mass. Ave.</street>
	      <street>Cambridge</street>
	    <street>MA 02138</street></postal>
	    <phone>+1 617 495 3864</phone>
     	<email>sob@harvard.edu</email></address></author>
     	<date year='1997' month='March' />
      </front>
      <seriesInfo name='BCP' value='14' />
      <seriesInfo name='RFC' value='2119' />
    </reference>

    <reference anchor='LEGAL'
               target='http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf'>
      <front>
      <title abbrev='Legal Provisions'>Legal Provisions Relating to IETF Documents</title>
        <author>
          <organization>IETF Trust</organization>
        </author>
        <date month="November" year="2008"/>
      </front>
      <format type="PDF" octets="44498"
       target="http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf"/>
    </reference>

  <reference anchor='RFC4506'>
    <front>
    <title abbrev='XDR'>XDR: External Data Representation Standard</title>
    <author initials='M.' surname='Eisler' fullname='Mike Eisler'>
    <organization>Network Appliance, Inc.</organization>
    </author>
    <date month='May' year='2006'/>
    </front>
    <seriesInfo name='STD' value='67' />
    <seriesInfo name="RFC" value="4506"/>
  </reference>

  <reference anchor='RFC5661'>
    <front>
      <title>Network File System (NFS) Version 4 Minor Version 1 Protocol</title>
      <author initials="S." surname="Shepler" fullname="Spencer Shepler" role="editor">
        <organization>Sun Microsystems, Inc.</organization>
      </author>
      <author initials="M." surname="Eisler" fullname="Mike Eisler" role="editor">
        <organization>Network Appliance, Inc.</organization>
      </author>
      <author initials="D." surname="Noveck" fullname="David Noveck" role="editor">
        <organization>Network Appliance, Inc.</organization>
      </author>
      <date month="January" year="2010"/>
    </front>
    <seriesInfo name="RFC" value="5661"/>
  </reference>

  <reference anchor='RFC5662'>
    <front>
      <title>Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description</title>
      <author initials="S." surname="Shepler" fullname="Spencer Shepler" role="editor">
        <organization>Sun Microsystems, Inc.</organization>
      </author>
      <author initials="M." surname="Eisler" fullname="Mike Eisler" role="editor">
        <organization>Network Appliance, Inc.</organization>
      </author>
      <author initials="D." surname="Noveck" fullname="David Noveck" role="editor">
        <organization>Network Appliance, Inc.</organization>
      </author>
      <date month="January" year="2010"/>
    </front>
    <seriesInfo name="RFC" value="5662"/>
  </reference>
  
  <reference anchor='RFC5663'>
    <front>
      <title>Parallel NFS (pNFS) Block/Volume Layout</title>
      <author initials="D." surname="Black" fullname="David L. Black" role="editor">
        <organization>EMC Corporation</organization>
      </author>
      <author initials="S." surname="Fridella" fullname="Stephen Fridella" role="editor">
        <organization>Nasuni Inc</organization>
      </author>
      <author initials="J." surname="Glasgow" fullname="Jason Glasgow" role="editor">
        <organization>Google</organization>
      </author>
      <date month="January" year="2010"/>
    </front>
    <seriesInfo name="RFC" value="5663"/>
  </reference>

  <reference anchor='RFC6688'>
    <front>
      <title>Parallel NFS (pNFS) Block Disk Protection</title>
      <author initials="D." surname="Black" fullname="David L. Black" role="editor">
        <organization>EMC Corporation</organization>
      </author>
      <author initials="J." surname="Glasgow" fullname="Jason Glasgow">
        <organization>Google</organization>
      </author>
      <author initials="S." surname="Faibish" fullname="Sorin Faibish">
        <organization>EMC Corporation</organization>
      </author>
      <date month="July" year="2012"/>
    </front>
    <seriesInfo name="RFC" value="6688"/>
  </reference>


  <reference anchor='RFC7143'>
    <front>
      <title>Internet Small Computer System Interface (iSCSI) Protocol (Consolidated)</title>
      <author initials="M." surname="Chadalapaka" fullname="Mallikarjun Chadalapaka">
        <organization>Microsoft</organization>
      </author>
      <author initials="K." surname="Meth" fullname="Kalman Meth">
        <organization>IBM Haifa Research Lab</organization>
      </author>
      <author initials="D." surname="Black" fullname="David L. Black">
        <organization>EMC Corporation</organization>
      </author>
      <date month="April" year="2014"/>
    </front>
    <seriesInfo name="RFC" value="7143"/>
  </reference>

  <reference anchor='SAM-4'>
    <front>
      <title>SCSI Architecture Model - 4 (SAM-4)</title>
      <author>
         <organization>INCITS Technical Committee T10</organization>
      </author>
      <date year="2008"/>
    </front>
    <seriesInfo name="ANSI INCITS" value="447-2008"/>
    <seriesInfo name="ISO/IEC" value="14776-414"/>
  </reference>

  <reference anchor='SPC4'>
    <front>
      <title>SCSI Primary Commands-4</title>
      <author>
         <organization>INCITS Technical Committee T10</organization>
      </author>
      <date year="2015"/>
    </front>
    <seriesInfo name="ANSI INCITS" value="513-2015"/>
  </reference>

  <reference anchor='SBC3'>
    <front>
      <title>SCSI Block Commands-3</title>
      <author>
         <organization>INCITS Technical Committee T10</organization>
      </author>
      <date year="2014"/>
    </front>
    <seriesInfo name="ANSI INCITS" value="514-2014"/>
    <seriesInfo name="ISO/IEC" value="14776-323"/>
  </reference>

  <reference anchor='SAS3'>
    <front>
      <title>Serial Attached SCSI-3</title>
      <author>
         <organization>INCITS Technical Committee T10</organization>
      </author>
      <date year="2014"/>
    </front>
    <seriesInfo name="ANSI INCITS" value="519-2014"/>
    <seriesInfo name="ISO/IEC" value="14776-154"/>
  </reference>

</references>

<section title="Acknowledgments">
<t>
Large parts of this document were copied verbatim from, and others were
inspired by, <xref target='RFC5663' />.  Thanks to David Black, Stephen Fridella and
Jason Glasgow for their work on the pNFS block/volume layout protocol.
</t>
<t>
David Black, Robert Elliott and Tom Haynes provided a thorough
review of early drafts of this document, and their input led to
the current form of the document.
</t>
<t>
David Noveck provided ample feedback to various drafts of this document,
wrote the section on enforcing NFSv4 semantics and rewrote various
sections to better capture the intent.
</t>
</section>

<section title="RFC Editor Notes">

<t>
[RFC Editor: please remove this section prior to publishing
this document as an RFC]

</t>

<t>
[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10
with RFCxxxx where xxxx is the RFC number of this document]
</t>

</section>

</back>

</rfc>
