<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt'?>

<rfc
 category='std'
 docName='draft-ietf-nfsv4-layrec-03'
 ipr='trust200902'
 obsoletes=''
 scripts='Common,Latin'
 sortRefs='true'
 submissionType='IETF'
 symRefs='true'
 tocDepth='3'
 tocInclude='true'
 consensus='true'
 version='3'
 xml:lang='en'>

<front>
  <title abbrev='LAYOUT_RECOVERY'>
    Reporting of Errors via LAYOUTRETURN in NFSv4.2
  </title>
  <seriesInfo name='Internet-Draft' value='draft-ietf-nfsv4-layrec-03'/>
  <author fullname='Thomas Haynes' initials='T.' surname='Haynes'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>loghyr@gmail.com</email>
    </address>
  </author>
  <author fullname='Trond Myklebust' initials='T.' surname='Myklebust'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>trondmy@hammerspace.com</email>
    </address>
  </author>
  <date year='2024' month='November' day='20'/>
  <area>Transport</area>
  <workgroup>Network File System Version 4</workgroup>
  <keyword>NFSv4</keyword>
  <abstract>
    <t>
      The Parallel Network File System (pNFS) allows a file's metadata
      and data to be on different servers, i.e., a metadata server (MDS)
      and data servers (DS). When the metadata server is restarted, the client
      can still modify the data file component.  During the
      recovery phase of startup, the metadata server and the
      data servers work together to recover state (which files
      are open, last modification time, size, etc.). If the client
      has not encountered errors with the data files, then the state can be
      recovered, avoiding resilvering of the data files. If the client
      has encountered errors, there is no means by which it can report them to the
      metadata server. As such, the metadata server has to
      assume that the file needs resilvering. This document presents an
      extension to RFC 8435 to allow the client to update the metadata
      and avoid the resilvering.
    </t>
  </abstract>

  <note removeInRFC='true'>
    <t>
      Discussion of this draft takes place
      on the NFSv4 working group mailing list (nfsv4@ietf.org),
      which is archived at
      <eref target='https://mailarchive.ietf.org/arch/browse/nfsv4/'/>.
      Working Group information can be found at
      <eref target='https://datatracker.ietf.org/wg/nfsv4/about/'/>.
    </t>
  </note>
</front>

<middle>

<section anchor='sec_intro' numbered='true' removeInRFC='false' toc='default'>
  <name>Introduction</name>
  <t>
    In the Network File System version 4 (NFSv4) with a Parallel NFS
    (pNFS) Flexible File Layout (<xref target='RFC8435' format='default'
    sectionFormat='of'/>) server, there is no mechanism during recovery
    after a metadata server restart for the client
    to inform the metadata server about an error that occurred during a
    WRITE (see Section 18.32 of <xref target='RFC8881' format='default'
    sectionFormat='of'/>) operation to the data servers in the period of
    the outage.
  </t>

  <t>
    Using the process detailed in <xref target='RFC8178' format='default'
    sectionFormat='of'/>, the revisions in this document become an
    extension of NFSv4.2 <xref target='RFC7862' format='default'
    sectionFormat='of'/>. They are built on top of the external data
    representation (XDR) <xref target='RFC4506' format='default'
    sectionFormat='of'/> generated from <xref target='RFC7863'
    format='default' sectionFormat='of'/>.
  </t>

  <section anchor='sec_defs' numbered='true' removeInRFC='false' toc='default'>
    <name>Definitions</name>
    <t>
      See Section 1.1 of <xref target='RFC8435' format='default'
      sectionFormat='of'/> for a set of definitions.
    </t>

    <dl newline="false" spacing="normal">
      <dt>resilvering:</dt>
      <dd>
        the act of rebuilding a mirrored copy of a layout segment from a
        known good copy of the layout segment.  Note that this can also
        be done to create a new mirrored copy of the layout segment.
      </dd>
    </dl>

  </section>
  <section numbered='true' removeInRFC='false' toc='default'>
    <name>Requirements Language</name>
    <t>
      The key words '<bcp14>MUST</bcp14>', '<bcp14>MUST NOT</bcp14>',
      '<bcp14>REQUIRED</bcp14>', '<bcp14>SHALL</bcp14>', '<bcp14>SHALL
      NOT</bcp14>', '<bcp14>SHOULD</bcp14>', '<bcp14>SHOULD NOT</bcp14>',
      '<bcp14>RECOMMENDED</bcp14>', '<bcp14>NOT RECOMMENDED</bcp14>',
      '<bcp14>MAY</bcp14>', and '<bcp14>OPTIONAL</bcp14>' in this
      document are to be interpreted as described in BCP 14 <xref
      target='RFC2119' format='default' sectionFormat='of'/> <xref
      target='RFC8174' format='default' sectionFormat='of'/> when,
      and only when, they appear in all capitals, as shown here.
    </t>
  </section>
</section>

<section anchor='layout_state_recovery' numbered='true' removeInRFC='false' toc='default'>
  <name>Layout State Recovery</name>
  <t>
    When a metadata server restarts, clients are provided a grace recovery period where
    they are allowed to recover any state that
    they had established. With open files, the client can send an OPEN (see
    Section 18.16 of <xref target='RFC8881' format='default' sectionFormat='of'/>)
    operation with a claim type of CLAIM_PREVIOUS (see Section 9.11 of
    <xref target='RFC8881' format='default' sectionFormat='of'/>). The client
    uses the RECLAIM_COMPLETE (see Section 18.51
    of <xref target='RFC8881' format='default' sectionFormat='of'/>) operation
    to notify the metadata server that it is done reclaiming state.
  </t>
  <t>
    The NFSv4 Flexible File Layout Type allows for the client to mirror files
    (see Section 8 of <xref target='RFC8435' format='default' sectionFormat='of'/>).
    With client side mirroring, it is important for the client to inform
    the metadata server of any I/O errors encountered with one of the mirrors.
    This is the only way for the metadata server to determine that one or more
    of the mirrors are corrupt and then repair the mirrors via resilvering.
    The client can use LAYOUTRETURN (see
    Section 18.44 of <xref target='RFC8881' format='default' sectionFormat='of'/>)
    and the ff_ioerr4 (see Section 9.1.1 of <xref target='RFC8435' format='default' sectionFormat='of'/>) structure to inform
    the metadata server of I/O errors.
  </t>
  <t>
    A problem is that when the metadata server restarts and the client has
    errors it needs to report, it cannot do so. Section 12.7.4 of
    <xref target='RFC8881' format='default' sectionFormat='of'/> requires
    that the client <bcp14>MUST</bcp14> stop using layouts. While the
    intent there is that the client <bcp14>MUST</bcp14> stop doing I/O
    to the storage devices, it is also true that the layout stateids
    are no longer valid.  The LAYOUTRETURN needs
    a layout stateid to proceed, and the client cannot get a layout
    during grace recovery (see Section 12.7.4 of
    <xref target='RFC8881' format='default' sectionFormat='of'/>) to
    recover layout state. As such, clients have no choice but to not recover
    files with I/O errors. In turn, the metadata server <bcp14>MUST</bcp14>
    assume that the mirrors are inconsistent and pick one for resilvering.
    It is a <bcp14>MUST</bcp14> because even if the metadata server can
    determine that the client did modify data during the outage, it <bcp14>MUST NOT</bcp14>
    assume those modifications were consistent.
  </t>
  <t>
    To fix this issue, the metadata server <bcp14>MUST</bcp14> accept
    for the lrf_stateid in LAYOUTRETURN (see Section 18.44.1 of
    <xref target='RFC8881' format='default' sectionFormat='of'/>)
    the anonymous stateid of all zeros
    (see Section 8.2.3 of <xref target='RFC8881' format='default' sectionFormat='of'/>).
    The client can use this anonymous stateid to
    inform the metadata server of errors
    encountered. The metadata server can then
    accurately resilver the file by picking the mirror(s) that do not
    have any associated errors.
  </t>
  <t>
    During the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with any value other than the
    anonymous stateid of all zeros, then the metadata server
    <bcp14>MUST</bcp14> respond with an error of
    NFS4ERR_GRACE (see Section 15.1.9.2 of <xref target='RFC8881' format='default' sectionFormat='of'/>).
    After the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with a value of the anonymous stateid of all zeros, then the metadata server
    <bcp14>MUST</bcp14> respond with an error of
    NFS4ERR_NO_GRACE (see Section 15.1.9.3 of <xref target='RFC8881' format='default' sectionFormat='of'/>).
  </t>
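  <t>
    As a minimal sketch of the two checks above, and not a definitive
    implementation, the following Python models a stateid as a seqid
    plus 12 opaque bytes. The Stateid class, the string status
    constants, and check_layoutreturn_stateid are hypothetical names
    introduced only for illustration. A result of NFS4_OK means only
    that these two checks passed; all other LAYOUTRETURN validation
    is omitted.
  </t>
  <sourcecode type='python'><![CDATA[
from dataclasses import dataclass

# Status names as strings (an assumption made for readability).
NFS4_OK = "NFS4_OK"
NFS4ERR_GRACE = "NFS4ERR_GRACE"
NFS4ERR_NO_GRACE = "NFS4ERR_NO_GRACE"


@dataclass(frozen=True)
class Stateid:
    """Hypothetical model of a stateid4."""
    seqid: int
    other: bytes  # 12 opaque bytes


# Anonymous stateid: seqid and other are both all zeros.
ANONYMOUS_STATEID = Stateid(seqid=0, other=bytes(12))


def check_layoutreturn_stateid(lrf_stateid, in_grace):
    """Apply only the grace-period stateid rules."""
    is_anonymous = lrf_stateid == ANONYMOUS_STATEID
    if in_grace and not is_anonymous:
        return NFS4ERR_GRACE
    if not in_grace and is_anonymous:
        return NFS4ERR_NO_GRACE
    return NFS4_OK
]]></sourcecode>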
  <t>
    Also, when the metadata server builds the reply to a LAYOUTRETURN
    whose lrf_stateid has the value of the anonymous stateid of all zeros,
    it <bcp14>MUST NOT</bcp14> bump the seqid of the lorr_stateid.
  </t>
  <t>
    If the metadata server detects that the layout being returned in
    the LAYOUTRETURN does not match the current mirror instances found
    for the file, then it <bcp14>MUST</bcp14> ignore the LAYOUTRETURN and resilver the
    file in question.
  </t>
  <t>
    The metadata server <bcp14>MUST</bcp14> resilver any file
    that is neither explicitly recovered with a CLAIM_PREVIOUS nor
    has an error reported via a LAYOUTRETURN.
    In that case, the client has most likely restarted and lost any state.
  </t>
  <section anchor='sec_when_to_resilver' numbered='true' removeInRFC='false' toc='default'>
    <name>When to Resilver</name>
    <t>
      A write intent occurs when a client opens a file and gets
      a layout with an iomode of LAYOUTIOMODE4_RW from the metadata server. The metadata server
      <bcp14>MUST</bcp14> track outstanding write intents and when it
      restarts, it <bcp14>MUST</bcp14> track recovery of those
      write intents.
      The method that the metadata server uses to track write intents is
      implementation specific, i.e., outside of the scope of this document.
    </t>
    <t>
      The decision to resilver a file depends on how the client recovers the
      file before the grace period ends. If the client reclaims the file
      and reports no errors, the metadata server <bcp14>MUST NOT</bcp14>
      resilver the file. If the client reports an error on the file,
      then the file <bcp14>MUST</bcp14> be resilvered. If the client
      neither reclaims the file nor reports an error before the grace period ends,
      then, as under the old behavior, the metadata server <bcp14>MUST</bcp14>
      resilver the file.
    </t>
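    <t>
      The three outcomes above can be sketched as a small decision
      function. This is an illustration under stated assumptions:
      must_resilver and its two boolean parameters are hypothetical
      names, and the function is assumed to be evaluated for a file
      with a tracked write intent once the grace period has ended.
    </t>
    <sourcecode type='python'><![CDATA[
def must_resilver(reclaimed, reported_error):
    """Decide whether a file needs resilvering after grace.

    reclaimed: the client reclaimed the file before grace ended.
    reported_error: the client reported an I/O error on the file
    via LAYOUTRETURN before grace ended.
    """
    if reported_error:
        # A reported error always leads to resilvering.
        return True
    if reclaimed:
        # Reclaimed with no errors: the mirrors are consistent.
        return False
    # Neither reclaimed nor reported: assume inconsistency.
    return True
]]></sourcecode>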
    <t>
      The resilvering process is broadly to:
    </t>
    <ol>
      <li>
        fence the file (see Section 2.2
        of <xref target='RFC8435' format='default' sectionFormat='of'/>),
      </li>
      <li>
        record the need to resilver,
      </li>
      <li>
        release the write intent, and
      </li>
      <li>
        once there are no write intents on the file, start the resilvering process.
      </li>
    </ol>
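    <t>
      The four steps above can be sketched as per-file bookkeeping.
      The FileRecoveryState class and its fields are hypothetical
      and are not taken from any implementation. The sketch keeps
      its state in memory only; a real metadata server would need
      to record the write intents and the need to resilver so that
      they survive a restart.
    </t>
    <sourcecode type='python'><![CDATA[
class FileRecoveryState:
    """Hypothetical per-file state on the metadata server."""

    def __init__(self, write_intents):
        # Clients holding an outstanding write intent.
        self.write_intents = set(write_intents)
        self.fenced = False
        self.needs_resilver = False
        self.resilver_started = False

    def report_error(self, client_id):
        """Handle an error report from one client."""
        self.fenced = True              # step 1: fence the file
        self.needs_resilver = True      # step 2: record the need
        self.release_intent(client_id)  # step 3: release intent

    def release_intent(self, client_id):
        """Release one client's write intent on the file."""
        self.write_intents.discard(client_id)
        # Step 4: start only once no write intents remain.
        if self.needs_resilver and not self.write_intents:
            self.resilver_started = True
]]></sourcecode>
    <t>
      In this sketch, an error reported by one of two clients fences
      the file and records the need to resilver, but the resilvering
      itself starts only after the second client's write intent is
      also released.
    </t>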
    <t>
      The metadata server <bcp14>MUST NOT</bcp14> resilver a file if there
      are clients with outstanding write intents, as multiple clients
      might have the file open with write intents.  As it <bcp14>MUST</bcp14>
      track write intents, it <bcp14>MUST</bcp14> also track the need to
      resilver. I.e., if the metadata server restarts during the grace
      period, it <bcp14>MUST</bcp14> restart the file recovery if it
      replays the write intent; otherwise, it <bcp14>MUST</bcp14> start
      the resilvering if it replays the resilvering intent.
    </t>

    <t>
      Whether the metadata server prevents all I/O to
      the file until the resilvering is done, forces all I/O to go through
      the metadata server, or allows a proxy server to update the new data
      file as it is being resilvered is an implementation choice. The
      constraint is that the metadata server is responsible for the
      reconstruction of the data file and for the consistency of the
      mirrors.
    </t>

    <t>
      If the metadata server does allow the client access to the
      file during the resilvering, then the client <bcp14>MUST</bcp14> have
      the same layout (set of mirror instances) after the metadata server
      restarts as before. One way that such a resilvering can occur is for a proxy
      server to be inserted into the layout. That server will be copying
      a good mirror instance to a new instance. As it gets I/O via the
      layout, it will be responsible for updating the copy it is performing.
      The requirement is that the proxy server <bcp14>MUST</bcp14>
      stay in the layout until the grace period is finished.
    </t>
  </section>

  <section anchor='sec_vers_mismatch' numbered='true' removeInRFC='false' toc='default'>
    <name>Version Mismatch Considerations</name>
    <t>
      The metadata server has no expectations for the client to use this
      new functionality. Therefore, if the client does not use it, the
      metadata server will function normally.
    </t>
    <t>
      If the client does use the new functionality and the metadata server does
      not support it, then the metadata server <bcp14>MUST</bcp14> reply with
      an NFS4ERR_BAD_STATEID to the LAYOUTRETURN. If the client detects
      an NFS4ERR_BAD_STATEID error in this scenario, it should fall back to
      the old behavior of not reporting errors.
    </t>
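    <t>
      The client-side fallback can be sketched as follows. This is
      a minimal illustration, not a definitive implementation:
      report_errors_in_grace, the send_layoutreturn callable, and
      the returned outcome strings are all hypothetical names, and
      the handling of statuses other than the two discussed above
      is an assumption.
    </t>
    <sourcecode type='python'><![CDATA[
NFS4_OK = "NFS4_OK"
NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"


def report_errors_in_grace(send_layoutreturn, io_errors):
    """Report I/O errors using the anonymous stateid.

    send_layoutreturn is a hypothetical callable that sends a
    LAYOUTRETURN carrying io_errors with the anonymous stateid
    and returns the resulting status as a string.
    """
    if not io_errors:
        return "nothing-to-report"
    status = send_layoutreturn(io_errors)
    if status == NFS4_OK:
        return "reported"
    if status == NFS4ERR_BAD_STATEID:
        # The server does not support the extension: fall back
        # to the old behavior of not reporting errors.
        return "fallback"
    # Any other status is left to the caller (assumption).
    return "failed"
]]></sourcecode>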
  </section>
</section>

<section anchor='sec_security' numbered='true' removeInRFC='false' toc='default'>
  <name>Security Considerations</name>
  <t>
    There are no new security considerations beyond those in
    <xref target='RFC7862' format='default' sectionFormat='of'/>.
  </t>
</section>

<section anchor='sec_iana' numbered='true' removeInRFC='false' toc='default'>
  <name>IANA Considerations</name>
  <t>
    There are no IANA considerations for this document.
  </t>
</section>

</middle>

<back>

<references>
  <name>References</name>

  <references>
  <name>Normative References</name>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml'/>
    <xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
       href='https://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml'/>

  </references>
</references>

<section numbered='true' removeInRFC='false' toc='default'>
      <name>Acknowledgments</name>
      <t>
        Tigran Mkrtchyan, Jeff Layton, and Rick Macklem provided reviews of the document.
      </t>
    </section>

</back>

</rfc>
