<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE rfc [
 <!ENTITY nbsp    "&#160;">
 <!ENTITY zwsp   "&#8203;">
 <!ENTITY nbhy   "&#8209;">
 <!ENTITY wj     "&#8288;">
]> 

<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category='std' docName='draft-ietf-nfsv4-layrec-04' number="9737" ipr='trust200902' obsoletes='' updates="" sortRefs='true' submissionType='IETF' symRefs='true' tocDepth='3' tocInclude='true' consensus='true' version='3' xml:lang='en'>

  <front>
   
  <title abbrev='Reporting Errors via LAYOUTRETURN'>Reporting Errors in NFSv4.2 via LAYOUTRETURN</title>
  <seriesInfo name='RFC' value='9737'/>
  <author fullname='Thomas Haynes' initials='T.' surname='Haynes'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>loghyr@gmail.com</email>
    </address>
  </author>
  <author fullname='Trond Myklebust' initials='T.' surname='Myklebust'>
    <organization abbrev='Hammerspace'>Hammerspace</organization>
    <address>
      <email>trondmy@hammerspace.com</email>
    </address>
  </author>
  <date year='2025' month='February'/>
  <area>WIT</area>
  <workgroup>nfsv4</workgroup>
  <keyword>NFSv4</keyword>

  <abstract>
    <t>
      The Parallel Network File System (pNFS) allows
      for a file's metadata and data to be on different
      servers (i.e., the metadata server (MDS) and the data server (DS)). When the MDS is restarted, the client
      can still modify the data file component.
      During the
      recovery phase of startup, the MDS and the
      DSs work together to recover state. If the client
      has not encountered errors with the data files, then the state can be
      recovered and the resilvering of the data files can be avoided. With any
      errors, there is no means by which the client can report errors to the
      MDS. As such, the MDS has to
      assume that a file needs resilvering. This document presents an
      extension to RFC 8435 to allow the client to update the metadata
      via LAYOUTRETURN and avoid the resilvering.
    </t>
  </abstract>
</front>

<middle>

<section anchor='sec_intro' numbered='true'  toc='default'>
  <name>Introduction</name>
  <t>
    In the Network File System version 4 (NFSv4) with a Parallel NFS
    (pNFS) flexible file layout <xref target='RFC8435' format='default'
    sectionFormat='of'/> server, during recovery after a restart,
    there is no mechanism for the client
    to inform the metadata server (MDS) about an error that occurred during a
    WRITE operation (see <xref section="18.32" target='RFC8881' format='default'
    sectionFormat='of'/>) to the data servers (DSs) in the period of
    the outage.
  </t>

  <t>
    Using the process detailed in <xref target='RFC8178' format='default'
    sectionFormat='of'/>, the revisions in this document become an
    extension of NFSv4.2 <xref target='RFC7862' format='default'
    sectionFormat='of'/>. They are built on top of the External Data
    Representation (XDR) <xref target='RFC4506' format='default'
    sectionFormat='of'/> generated from <xref target='RFC7863'
    format='default' sectionFormat='of'/>.
  </t>

  <section anchor='sec_defs' numbered='true'  toc='default'>
    <name>Definitions</name>
    <t>
      See <xref section="1.1" target='RFC8435' format='default'
      sectionFormat='of'/> for a set of definitions.
    </t>
  </section>
  <section numbered='true'  toc='default'>
    <name>Requirements Language</name>
        <t>
    The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
    NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
    "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
    described in BCP&nbsp;14 <xref target="RFC2119"/> <xref target="RFC8174"/> 
    when, and only when, they appear in all capitals, as shown here.
        </t>
  </section>
</section>

<section anchor='layout_state_recovery' numbered='true'  toc='default'>
  <name>Layout State Recovery</name>
  <t>
    When an MDS restarts, clients are provided a grace recovery period where
    they are allowed to recover any state that
    they had established. With open files, the client can send an OPEN operation (see
    <xref section="18.16" target='RFC8881' format='default' sectionFormat='of'/>)
    with a claim type of CLAIM_PREVIOUS (see <xref section="9.11" target='RFC8881' format='default' sectionFormat='of'/>). The client
    uses the RECLAIM_COMPLETE operation (see <xref section="18.51" target='RFC8881' format='default' sectionFormat='of'/>) 
    to notify the MDS that it is done reclaiming state.
  </t>
  <t>
    The NFSv4 flexible file layout type allows for the client to mirror files
    (see <xref section="8" target='RFC8435' format='default' sectionFormat='of'/>).
    With client-side mirroring, it is important for the client to inform
    the MDS of any I/O errors encountered with one of the mirrors.
    This is the only way for the MDS to determine if one or more
    of the mirrors are corrupt and then repair the mirrors via resilvering
    (see <xref section="1.1" target='RFC8435' format='default' sectionFormat='of'/>).
    The client can use LAYOUTRETURN (see
    <xref section="18.44" target='RFC8881' format='default' sectionFormat='of'/>)
    and the ff_ioerr4 structure (see <xref section="9.1.1" target='RFC8435' format='default' sectionFormat='of'/>) to inform
    the MDS of I/O errors.
  </t>
  <t>
    A problem arises when the MDS restarts and the client has
    errors it needs to report but cannot do so. <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/> requires
    that the client <bcp14>MUST</bcp14> stop using layouts. While the
    intent there is that the client <bcp14>MUST</bcp14> stop doing I/O
    to the storage devices, it is also true that the layout stateids
    are no longer valid. The LAYOUTRETURN needs
    a layout stateid to proceed, and the client cannot get a layout
    during grace recovery (see <xref section="12.7.4" target='RFC8881' format='default' sectionFormat='of'/>) to
    recover layout state. As such, clients have no choice but to not recover
    files with I/O errors. In turn, the MDS <bcp14>MUST</bcp14>
    assume that the mirrors are inconsistent and pick one for resilvering.
    It is a <bcp14>MUST</bcp14> because even if the MDS can
    determine that the client did modify data during the outage, it <bcp14>MUST NOT</bcp14>
    assume those modifications were consistent.
  </t>
  <t>
    To fix this issue, the MDS <bcp14>MUST</bcp14> accept
    the anonymous stateid of all zeros (see <xref section="8.2.3" target='RFC8881' format='default' sectionFormat='of'/>) for the lrf_stateid in LAYOUTRETURN (see <xref section="18.44.1" target='RFC8881' format='default' sectionFormat='of'/>).
    The client can use this anonymous stateid to
    inform the MDS of errors
    encountered. The MDS can then
    accurately resilver the file by picking the mirror(s) that does not
    have any associated errors.
  </t>
  <t>
    During the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with any value other than the
    anonymous stateid of all zeros, then the MDS
    <bcp14>MUST</bcp14> respond with an error of
    NFS4ERR_GRACE (see <xref section="15.1.9.2" target='RFC8881' format='default' sectionFormat='of'/>).
    After the grace period, if the client sends an lrf_stateid
    in the LAYOUTRETURN with a value of the anonymous stateid of all zeros, then the MDS
    <bcp14>MUST</bcp14> respond with an error of
    NFS4ERR_NO_GRACE (see <xref section="15.1.9.3" target='RFC8881' format='default' sectionFormat='of'/>).
  </t>
  <t>
    
    Also, when the MDS builds the reply to the LAYOUTRETURN
    with an lrf_stateid with the value of the anonymous stateid of all zeros,
    it <bcp14>MUST NOT</bcp14> bump the seqid of the lorr_stateid.
  </t>
  <t>
    If the MDS detects that the layout being returned in
    the LAYOUTRETURN does not match the current mirror instances found
    for the file, then it <bcp14>MUST</bcp14> ignore the LAYOUTRETURN and resilver the
    file in question.
  </t>
  <t>
    The MDS <bcp14>MUST</bcp14> resilver any files
    that are neither explicitly recovered with a CLAIM_PREVIOUS nor
    have a reported error via a LAYOUTRETURN.
    The client has most likely restarted and lost any state.
  </t>
  <section anchor='sec_when_to_resilver' numbered='true'  toc='default'>
    <name>When to Resilver</name>
    <t>
      A write intent occurs when a client opens a file and gets
      a LAYOUTIOMODE4_RW from the MDS. The MDS
      <bcp14>MUST</bcp14> track outstanding write intents, and when it
      restarts, it <bcp14>MUST</bcp14> track recovery of those
      write intents.
      The method that the MDS uses to track write intents is
      implementation specific, i.e., outside the scope of this document.
    </t>
    <t>
      The decision to resilver a file depends on how the client recovers the
      file before the grace period ends. If the client reclaims the file
      and reports no errors, the MDS <bcp14>MUST NOT</bcp14>
      resilver the file. If the client reports an error on the file,
      then the file <bcp14>MUST</bcp14> be resilvered. If the client
      does not reclaim or report an error before the grace period ends,
      then under the old behavior, the MDS <bcp14>MUST</bcp14>
      resilver the file.
    </t>
    <t>
      The resilvering process is broadly to:
    </t>
    <ol>
      <li>
        fence the file (see <xref section="2.2" target='RFC8435' format='default' sectionFormat='of'/>),
      </li>
      <li>
        record the need to resilver,
      </li>
      <li>
        release the write intent, and
      </li>
      <li>
        once there are no write intents on the file, start the resilvering process.
      </li>
    </ol>
    <t>
      The MDS <bcp14>MUST NOT</bcp14> resilver a file if there
      are clients with outstanding write intents, i.e., multiple clients
      might have the file open with write intents.  As the MDS <bcp14>MUST</bcp14>
      track write intents, it <bcp14>MUST</bcp14> also track the need to
      resilver, i.e., if the MDS restarts during the grace
      period, it <bcp14>MUST</bcp14> restart the file recovery if it
      replays the write intent, or else it <bcp14>MUST</bcp14> start
      the resilvering if it replays the resilvering intent.
    </t>

    <t>
      Whether the MDS prevents all I/O to
      the file until the resilvering is done, forces all I/O to go through
      the MDS, or allows a proxy server to update the new data
      file as it is being resilvered is all an implementation choice. The
      constraint is that the MDS is responsible for the
      reconstruction of the data file and for the consistency of the
      mirrors.
    </t>

    <t>
      If the MDS does allow the client access to the
      file during the resilvering, then the client <bcp14>MUST</bcp14> have
      the same layout (set of mirror instances) after the MDS
      as before. One way that such a resilvering can occur is for a proxy
      server to be inserted into the layout. That server will be copying
      a good mirror instance to a new instance. As it gets I/O via the
      layout, it will be responsible for updating the copy it is performing.
      This requirement is that the proxy server <bcp14>MUST</bcp14>
      stay in the layout until the grace period is finished.
    </t>
  </section>

  <section anchor='sec_vers_mismatch' numbered='true'  toc='default'>
    <name>Version Mismatch Considerations</name>
    <t>
      The MDS has no expectations for the client to use this
      new functionality. Therefore, if the client does not use it, the
      MDS will function normally.
    </t>
    <t>
      If the client does use the new functionality and the MDS does
      not support it, then the MDS <bcp14>MUST</bcp14> reply with
      a NFS4ERR_BAD_STATEID to the LAYOUTRETURN. If the client detects
      a NFS4ERR_BAD_STATEID error in this scenario, it should fall back to
      the old behavior of not reporting errors.
    </t>
  </section>
</section>

<section anchor='sec_security' numbered='true'  toc='default'>
  <name>Security Considerations</name>
  <t>
    There are no new security considerations beyond those in
    <xref target='RFC7862' format='default' sectionFormat='of'/>.
  </t>
</section>

<section anchor='sec_iana' numbered='true'  toc='default'>
  <name>IANA Considerations</name>
  <t>
 This document has no IANA actions.
  </t>
</section>

</middle>

<back>

  <references>
  <name>Normative References</name>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7862.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.7863.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8178.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8435.xml"/>
    <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8881.xml"/>

</references>

<section numbered='false' toc='default'>
      <name>Acknowledgments</name>
      <t><contact fullname="Tigran Mkrtchyan"/>, <contact fullname="Jeff
      Layton"/>, and <contact fullname="Rick Macklem"/> provided reviews of
      the document.</t>
    </section>  
</back>
</rfc>
