<?xml version="1.0"?>
<?rfc compact="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc toc="yes" ?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc1034 PUBLIC '' 'http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1034.xml'>
<!ENTITY rfc1035 PUBLIC '' 'http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.1035.xml'>
<!ENTITY rfc3225 PUBLIC '' 'http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.3225.xml'>
<!ENTITY rfc5966 PUBLIC '' 'http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.5966.xml'>
	<!ENTITY rfc6891 PUBLIC '' 'http://xml2rfc.ietf.org/public/rfc/bibxml/reference.RFC.6891.xml'>
	]>
	<rfc ipr="trust200902" category="info" docName="draft-andrews-dns-no-response-issue-10.txt">
	  <front>
	    <title abbrev="Failure to respond">A Common Operational Problem in DNS Servers - Failure To Respond.</title>
	    <author initials="M." surname="Andrews" fullname="M. Andrews">
	      <organization abbrev="ISC">Internet Systems Consortium</organization>
	      <address>
		<postal>
		  <street>950 Charter Street</street>
		  <city>Redwood City</city>
		  <region>CA</region>
		  <code>94063</code>
		  <country>US</country>
		</postal>
		<email>marka@isc.org</email>
	      </address>
	    </author>
	    <date month="August" year="2015"/>
	    <abstract>
	      <t>
		The DNS is a query / response protocol.  Failure to respond to
		queries causes both immediate operational problems and long term
		problems with protocol development.
	      </t>
	      <t>
		This document identifies a number of common classes of
		queries that some servers fail to respond to.  This document
		also suggests procedures for TLD and other similar zone
		operators to apply to help reduce or eliminate the problem.
	      </t>
	    </abstract>
	  </front>
	  <middle>
	    <section anchor="intro" title="Introduction">
	      <t>
		The DNS <xref target="RFC1034"/>, <xref target="RFC1035"/>
	is a query / response protocol.  Failure to respond to
	queries causes both immediate operational problems and long
	term problems with protocol development.
      </t>
      <t>
	Failure to respond to a query is indistinguishable from
	packet loss without doing an analysis of query response
	patterns, and it results in unnecessary additional queries
	being made by DNS clients and unnecessary delays being
	introduced to the resolution process.
      </t>
      <t>
	Due to the inability to distinguish between packet loss and
	nameservers dropping EDNS <xref target="RFC6891"/> queries,
	packet loss is sometimes misclassified as lack of EDNS
	support which can lead to DNSSEC validation failures.
      </t>
      <t>
	Allowing servers which fail to respond to queries to remain
	in service results in developers being afraid to deploy
	implementations of recent standards.  Such servers need to be
	identified and corrected or replaced.
      </t>
      <t>
	The DNS has response codes that cover almost any conceivable
	query response.  A nameserver should be able to respond to
	any conceivable query using them.
      </t>
      <t>
	Unless a nameserver is under attack, it should respond to
	all queries directed to it as a result of following
	delegations.  Additionally, code should not assume that there
	isn't a delegation to the server even if it is not configured
	to serve the zone.  Broken delegations are a common occurrence
	in the DNS, and receiving queries for zones that you are not
	configured for is not necessarily an indication that you
	are under attack.  Parent zone operators are supposed to
	regularly check that the delegating NS records are consistent
	with those of the delegated zone and to correct them when
	they are not <xref target="RFC1034"/>.  If this were being done
	regularly, the instances of broken delegations would be much lower.
      </t>
      <t>
	When a nameserver is under attack it may wish to drop packets.
	A common attack is to use a nameserver as an amplifier by
	sending spoofed packets.  This is done because response
	packets are bigger than the queries and big amplification
	factors are available, especially if EDNS is supported.
	Limiting the rate of responses is reasonable when this
	is occurring, and the client should retry.  This, however, only
	works if legitimate clients are not being forced to guess
	whether EDNS queries are accepted or not.  While there is
	still a pool of servers that don't respond to EDNS requests,
	clients have no way to know if the lack of response is due
	to packet loss, EDNS packets not being supported, or rate
	limiting due to the server being under attack.
	Misclassifications of server characteristics are unavoidable
	when rate limiting is done.
      </t>
    </section>
    <section anchor="query-classes" title="Common Query Classes That Result in Non-Responses">
      <t>
	There are three common query classes that result in
	non-responses today.  These are EDNS queries, queries for
	unknown (unallocated) or unsupported types, and TCP queries
	(which are often filtered).
      </t>
      <section anchor="edns-independent" title="EDNS Queries - Version Independent">
	<t>
	  Identifying servers that fail to respond to EDNS queries
	  can be done by first identifying that the server responds
	  to regular DNS queries, then making a series of otherwise
	  identical queries using EDNS, then making the original
	  query again.  A series of EDNS queries is needed as at least
	  one DNS implementation responds to the first EDNS query
	  with FORMERR but fails to respond to subsequent queries
	  from the same address for a period until a regular DNS
	  query is made.  The EDNS query should specify a UDP buffer
	  size of 512 bytes to avoid false classification of not
	  supporting EDNS due to response packet size.
	</t>
	<t>
	  If the server responds to the first and last queries but
	  fails to respond to most or all of the EDNS queries it
	  is probably faulty.  The test should be repeated a number
	  of times to eliminate the likelihood of a false positive
	  due to packet loss.
	</t>
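	<t>
	  As a non-normative illustration, the probe sequence described
	  above could be constructed as in the following Python sketch.
	  The function names (encode_name, build_query, probe_sequence),
	  the use of SOA (type 6) as the query type, and the choice of
	  three EDNS probes are assumptions of this example.  Only the
	  512 byte UDP buffer size is taken from the text.
	  <figure>
	    <artwork><![CDATA[
```python
import struct

def encode_name(name):
    """Encode a domain name in DNS wire format (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_query(qid, name, qtype, edns_version=None, udp_size=512):
    """Build a query; add an OPT record when edns_version is set."""
    arcount = 0 if edns_version is None else 1
    # Header: ID, flags (all zero), QD/AN/NS/AR counts.
    msg = struct.pack("!HHHHHH", qid, 0, 1, 0, 0, arcount)
    msg += encode_name(name)
    msg += struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN
    if edns_version is not None:
        # OPT RR: root owner name, TYPE 41, CLASS = UDP size,
        # TTL = ext-rcode(8) | version(8) | flags(16), RDLEN 0.
        ttl = edns_version << 16
        msg += b"\x00" + struct.pack("!HHIH", 41, udp_size, ttl, 0)
    return msg

def probe_sequence(name, qtype=6, edns_probes=3):
    """Plain query, a series of EDNS queries, then plain again."""
    seq = [build_query(1, name, qtype)]
    for i in range(edns_probes):
        seq.append(build_query(2 + i, name, qtype, edns_version=0))
    seq.append(build_query(2 + edns_probes, name, qtype))
    return seq
```
]]></artwork>
	  </figure>
	  Each message in the returned list would be sent in order to
	  the server under test, noting which of them receive a reply.
	</t>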
	<t>
	  Firewalls may also block larger EDNS responses but there
	  is no easy way to check authoritative servers to see if
	  the firewall is misconfigured.
	</t>
      </section>
      <section anchor="edns-specific" title="EDNS Queries - Version Specific">
	<t>
	  Some servers respond correctly to EDNS version 0 queries
	  but fail to respond to EDNS queries with version numbers
	  that are higher than zero.  Servers should respond
	  with BADVERS to EDNS queries with version numbers that
	  they do not support.
	</t>
	<t>
	  Some servers respond correctly to EDNS version 0 queries
	  but fail to set QR=1 when responding to EDNS versions
	  they do not support.  Such answers are discarded or treated
	  as requests.
	</t>
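	<t>
	  The two failure modes above can be checked mechanically.  The
	  following Python sketch is illustrative only.  It assumes the
	  caller has already extracted the 16-bit header flags word and
	  the 32-bit TTL field of the OPT record from the reply, and that
	  the server under test supports only EDNS version 0.  The
	  function names are hypothetical.
	  <figure>
	    <artwork><![CDATA[
```python
BADVERS = 16

def extended_rcode(header_flags, opt_ttl):
    """Join the OPT upper 8 bits with the 4-bit header RCODE."""
    upper = (opt_ttl >> 24) & 0xFF
    return (upper << 4) | (header_flags & 0x000F)

def edns_version(opt_ttl):
    """Return the EDNS version carried in the OPT TTL field."""
    return (opt_ttl >> 16) & 0xFF

def check_version_probe(header_flags, opt_ttl):
    """List the problems in a reply to an EDNS version 1 query."""
    problems = []
    if not header_flags & 0x8000:
        problems.append("QR not set")
    if extended_rcode(header_flags, opt_ttl) != BADVERS:
        problems.append("rcode is not BADVERS")
    if edns_version(opt_ttl) != 0:
        problems.append("version is not 0")
    return problems
```
]]></artwork>
	  </figure>
	  An empty list indicates that the reply matched what this
	  section expects.
	</t>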
      </section>
      <section anchor="edns-options" title="EDNS Options">
	<t>
	  Some servers fail to respond to EDNS queries with EDNS options
	  set.  Unknown EDNS options are supposed to be ignored by the
	  server <xref target="RFC6891"/>.
	</t>
      </section>
      <section anchor="edns-flags" title="EDNS Flags">
	<t>
	  Some servers fail to respond to EDNS queries with EDNS flags set.
	  Servers should ignore EDNS flags they do not understand and
	  should not add them to the response <xref target="RFC6891"/>.
	</t>
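	<t>
	  A minimal Python sketch of probing with an unknown EDNS option
	  or flag, and of checking that the reply does not copy them
	  back, is given below.  It is an illustration under stated
	  assumptions: the helper names are hypothetical, and the byte
	  string passed to parse_opt is assumed to be a single OPT record
	  already located in the additional section of the reply.
	  <figure>
	    <artwork><![CDATA[
```python
import struct

DO = 0x8000  # the only EDNS flag assumed to be assigned here

def opt_record(version=0, flags=0, options=(), udp_size=512):
    """Encode an OPT RR; options is a sequence of (code, data)."""
    rdata = b""
    for code, data in options:
        rdata += struct.pack("!HH", code, len(data)) + data
    ttl = (version << 16) | flags
    head = struct.pack("!HHIH", 41, udp_size, ttl, len(rdata))
    return b"\x00" + head + rdata

def parse_opt(rec):
    """Decode an OPT RR into (version, flags, [option codes])."""
    _type, _size, ttl, rdlen = struct.unpack("!HHIH", rec[1:11])
    codes, pos, end = [], 11, 11 + rdlen
    while pos + 4 <= end:
        code, length = struct.unpack("!HH", rec[pos:pos + 4])
        codes.append(code)
        pos += 4 + length
    return (ttl >> 16) & 0xFF, ttl & 0xFFFF, codes

def echo_problems(sent_codes, reply_opt):
    """List unknown items that a reply wrongly copied back."""
    _version, flags, codes = parse_opt(reply_opt)
    problems = []
    if flags & ~DO & 0xFFFF:
        problems.append("reserved EDNS flag set in reply")
    for code in sent_codes:
        if code in codes:
            problems.append("option %d echoed" % code)
    return problems
```
]]></artwork>
	  </figure>
	  For example, opt_record(flags=0x40) and
	  opt_record(options=[(100, b"")]) produce probe records with an
	  unassigned flag and an unassigned option code respectively.
	</t>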
      </section>
      <section anchor="dns-flags" title="DNS Flags">
	<t>
	  Some servers fail to respond to DNS queries with various
	  DNS flags set, regardless of whether they are defined
	  or still reserved.  At the time of writing there are
	  servers that fail to respond to queries with the AD
	  bit set to 1 and servers that fail to respond to queries
	  with the last reserved flag bit set.
	</t>
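	<t>
	  For illustration, the header flag bits mentioned above can be
	  set and inspected as in the following Python sketch.  The
	  constant and function names are assumptions of this example.
	  The bit positions follow the header layout in
	  <xref target="RFC1035"/>, with AD occupying one of the
	  originally reserved bits.
	  <figure>
	    <artwork><![CDATA[
```python
import struct

QR = 0x8000  # response bit
Z = 0x0040   # last reserved header flag bit
AD = 0x0020  # authentic data bit

def header(qid, flags, qdcount=1):
    """Pack a 12-octet DNS header with the given flag bits."""
    return struct.pack("!HHHHHH", qid, flags, qdcount, 0, 0, 0)

def flag_probe_headers(first_id=1):
    """Headers for queries with AD set and with the Z bit set."""
    probes = [("ad", AD), ("z", Z)]
    return {name: header(first_id + i, bit)
            for i, (name, bit) in enumerate(probes)}

def reserved_bit_echoed(reply):
    """True if a reply has the last reserved header bit set."""
    flags = struct.unpack("!H", reply[2:4])[0]
    return bool(flags & Z)
```
]]></artwork>
	  </figure>
	  A question section would be appended to each header before
	  sending.  A reply for which reserved_bit_echoed returns True
	  has copied the reserved bit into the response.
	</t>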
      </section>
      <section anchor="unknown" title="Unknown / Unsupported Type Queries">
	<t>
	  Identifying servers that fail to respond to unknown or
	  unsupported types can be done by making an initial DNS
	  query for an A record, making a number of queries for
	  an unallocated type, then making a query for an A record
	  again.  IANA maintains a registry of allocated types.
	</t>
	<t>
	  If the server responds to the first and last queries but
	  fails to respond to the queries for the unallocated type
	  it is probably faulty.  The test should be repeated a
	  number of times to eliminate the likelihood of a false
	  positive due to packet loss.
	</t>
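	<t>
	  The decision rule above, including the repetition used to
	  rule out packet loss, might be sketched in Python as follows.
	  This is an illustration, not a definitive algorithm: the
	  function name, the shape of the input, the result labels, and
	  the reading of "fails to respond" as "fewer than half of the
	  probes were answered" are all assumptions of this example.
	  <figure>
	    <artwork><![CDATA[
```python
def classify(rounds):
    """Classify a server from repeated probe rounds.

    Each round is (first_ok, answered, sent, last_ok): whether
    the first and last A queries were answered, and how many of
    the probe queries sent in between received a response.
    """
    # Only rounds where the server was demonstrably up count.
    usable = [r for r in rounds if r[0] and r[3]]
    if not usable:
        return "inconclusive"
    dropped = [r for r in usable if 2 * r[1] < r[2]]
    if len(dropped) == len(usable):
        return "probably faulty"
    if dropped:
        return "inconsistent"  # e.g. packet loss or rate limiting
    return "responding"
```
]]></artwork>
	  </figure>
	  For example, three rounds of (True, 0, 3, True) yield
	  "probably faulty", while a mix of answered and unanswered
	  rounds yields "inconsistent" and calls for further testing.
	</t>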
      </section>
      <section anchor="opcode" title="Unknown DNS opcodes">
	<t>
	  The use of previously undefined opcodes is to be expected.
	  Since the DNS was first defined two new opcodes have been
	  added, UPDATE and NOTIFY.
	</t>
	<t>
	  NOTIMP is the expected rcode to an unknown / unimplemented
	  opcode.
	</t>
	<t>
	  Note: while new opcodes will most probably use the current
	  layout structure for the rest of the message, there is no
	  requirement that anything other than the DNS header match.
	</t>
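	<t>
	  Because only the header layout can be relied upon, a check of
	  the reply to an unknown opcode need only look at the first
	  four octets.  The following Python sketch illustrates this.
	  The function names are hypothetical, and the check that the
	  opcode is copied into the reply reflects the header
	  description in <xref target="RFC1035"/>.
	  <figure>
	    <artwork><![CDATA[
```python
import struct

NOTIMP = 4

def opcode_flags(opcode):
    """Return a header flags word carrying only the opcode."""
    return (opcode & 0xF) << 11

def opcode_reply_ok(reply, qid, opcode):
    """True if reply is a NOTIMP response to our opcode probe."""
    rid, flags = struct.unpack("!HH", reply[:4])
    return (rid == qid
            and bool(flags & 0x8000)            # QR=1
            and ((flags >> 11) & 0xF) == opcode  # opcode copied
            and (flags & 0xF) == NOTIMP)
```
]]></artwork>
	  </figure>
	  A server that does not reply at all, or that replies with
	  QR=0, fails this check.
	</t>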
      </section>
      <section anchor="tcp" title="TCP Queries">
	<t>
	  All DNS servers are supposed to respond to queries over
	  TCP <xref target="RFC5966"/>.  Firewalls that drop TCP
	  connection attempts rather than resetting the connection
	  attempt or sending an ICMP/ICMPv6 administratively prohibited
	  message introduce excessive delays to the resolution
	  process.
	</t>
	<t>
	  Whether a server accepts TCP connections can be tested
	  by first checking that it responds to UDP queries to
	  confirm that it is up and operating then attempting
	  the same query over TCP.  An additional query should
	  be made over UDP if the TCP connection attempt fails
	  to confirm that the server under test is still operating.
	</t>
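	<t>
	  The test sequence above can be summarised as a small decision
	  function, shown here as a Python sketch together with the
	  two-octet length framing that DNS uses over TCP
	  <xref target="RFC1035"/>.  The function names and the result
	  strings are assumptions of this example.
	  <figure>
	    <artwork><![CDATA[
```python
import struct

def frame(msg):
    """Prefix a DNS message with its two-octet length for TCP."""
    return struct.pack("!H", len(msg)) + msg

def unframe(buf):
    """Return the first complete message in buf, or None."""
    if len(buf) < 2:
        return None
    (length,) = struct.unpack("!H", buf[:2])
    if len(buf) < 2 + length:
        return None
    return buf[2:2 + length]

def tcp_verdict(udp_before, tcp_ok, udp_after):
    """Interpret the UDP, TCP, UDP test sequence."""
    if not udp_before:
        return "server not responding"
    if tcp_ok:
        return "tcp ok"
    if udp_after:
        return "tcp failing"
    return "inconclusive"  # server stopped answering mid-test
```
]]></artwork>
	  </figure>
	  Only the "tcp failing" result indicates a server (or an
	  intervening firewall) that answers over UDP but not over TCP.
	</t>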
      </section>
    </section>
    <section anchor="remediation" title="Remediating">
      <t>
	While the first step in remediating this problem is to get
	the offending nameserver code corrected, there is a very
	long tail problem with DNS servers in that it can often
	take over a decade between the code being corrected and a
	nameserver being upgraded with corrected code.  With that
	in mind it is requested that TLD, and other similar zone
	operators, take steps to identify and inform their customers,
	directly or indirectly through registrars, that they are
	running such servers and that the customers need to correct
	the problem.
      </t>
      <t>
	TLD operators should construct a list of the servers that child
	zones are delegated to, along with a delegated zone name for
	each server.  This name shall be used as the query name when
	testing the server, as it is supposed to exist.
      </t>
      <t>
	For each server the TLD operator shall make an SOA query for the
	delegated zone name.  This should result in the SOA record
	being returned in the answer section.  If the SOA record is
	not returned but some other response is returned, this is an
	indication of a bad delegation and the TLD operator should
	take whatever steps it normally takes to rectify a bad
	delegation.  If more than one zone is delegated to the server,
	it should choose another zone until it finds a zone which
	responds correctly or it exhausts the list of zones delegated
	to the server.
      </t>
      <t>
	If the TLD operator fails to get a response to an SOA query, it
	should make an A query, as some nameservers fail
	to respond to SOA queries but respond to A queries.  If it
	gets no response to the A query, another delegated zone
	should be queried for, as some nameservers fail to respond
	to queries for zones they are not configured for.  If subsequent
	queries find a responding zone, all delegations to this server
	need to be checked and rectified using the TLD's normal procedures.
      </t>
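      <t>
	The search for a usable query name described above could be
	sketched in Python as follows.  This is an illustration under
	assumptions: find_working_tuple and its ask callback are
	hypothetical, and ask is assumed to return "soa" when the SOA
	record is in the answer section, any other string for some other
	response, and None when no response arrives.
        <figure>
	  <artwork><![CDATA[
```python
def find_working_tuple(server, zones, ask):
    """Return ((server, zone, qtype) or None, bad, silent).

    bad lists zones that answered without the SOA record (bad
    delegations); silent lists zones that got no response.
    """
    bad, silent = [], []
    for zone in zones:
        reply = ask(server, zone, "SOA")
        if reply == "soa":
            return (server, zone, "SOA"), bad, silent
        if reply is not None:
            bad.append(zone)
            continue
        # Some servers drop SOA queries but answer A queries.
        if ask(server, zone, "A") is not None:
            return (server, zone, "A"), bad, silent
        silent.append(zone)  # perhaps not configured for it
    return None, bad, silent
```
]]></artwork>
        </figure>
	When a tuple is found, the zones collected in both lists are
	the delegations that need to be checked and rectified.
      </t>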
      <t>
	Having identified a working &lt;server, query name&gt; tuple, the
	TLD operator should now check that the server responds to
	EDNS, Unknown Query Type and TCP tests as described above.
	If the TLD operator finds that a server fails any of the
	tests, the TLD operator shall take steps to inform the
	operator of the server that they are running a faulty
	nameserver and that they need to take steps to correct the
	matter.  The TLD operator shall also record the &lt;server,
	query name&gt; for followup testing.
      </t>
      <t>
	If repeated attempts to inform and get the customer
	to correct / replace the faulty server are unsuccessful
	the TLD operator shall remove all delegations to said
	server from the zone.
      </t>
      <t>
	It will also be necessary for TLD operators to repeat
	the scans periodically.  It is recommended that this
	be performed monthly, backing off to bi-annually once
	the number of faulty servers found drops to less
	than 1 in 100000 servers tested.  Follow up tests for
	faulty servers still need to be performed monthly.
      </t>
      <t>
	Some operators claim that they can't perform checks at
	registration time.  If a check is not performed at registration
	time it needs to be performed within a week of registration
	in order to detect faulty servers swiftly.
      </t>
      <t>
	Checking of delegations by TLD operators should be nothing
	new as they have been required from the very beginnings of
	DNS to do this <xref target="RFC1034"/>.  Checking for
	compliance of nameserver operations should just be an extension
	of such testing.
      </t>
      <t>
	It is recommended that TLD operators set up a test web page
	which performs the tests the TLD operator performs as part
	of their regular audits to allow nameserver operators to
	test that they have correctly fixed their servers.  Such
	tests should be rate limited to avoid these pages being
	a denial of service vector.
      </t>
    </section>
    <section title="Firewalls and Load Balancers">
      <t>
	Firewalls and load balancers can affect the externally
	visible behaviour of a nameserver.  Tests for conformance
	need to be done from outside of any firewall so that the
	system as a whole is tested.
      </t>
      <t>
	Firewalls and load balancers should not drop DNS packets
	that they don't understand.  They should either pass through
	the packets or generate an appropriate error response.
      </t>
      <t>
	Requests for unknown query types are not attacks and
	should not be treated as such.
      </t>
      <t>
	Requests with unassigned flags set (DNS or EDNS) are not
	attacks and should not be treated as such.  The behaviour
	for unassigned flags is to ignore them in the request and to
	not set them in the response.  All that dropping DNS / EDNS
	packets with unassigned flags does is make it harder to deploy
	extensions that make use of them, due to the need to reconfigure
	or update firewalls.
      </t>
      <t>
	Requests with unknown EDNS options are not an attack and
	should not be treated as such.  The correct behaviour for
	unknown EDNS options is to ignore them.
      </t>
      <t>
	Requests with unknown EDNS versions are not an attack and
	should not be treated as such.  The correct behaviour for
	unknown EDNS versions is to return BADVERS along with the
	highest EDNS version the server supports.  All that dropping
	EDNS packets does is break EDNS version negotiation.
      </t>
      <t>
	Firewalls should not assume that there will only be a
	single response message to a request.  There have been
	proposals to use EDNS to signal that multiple DNS messages
	be returned rather than a single UDP message that is
	fragmented at the IP layer.
      </t>
    </section>
    <section anchor="scrubbing" title="Scrubbing Services">
      <t>
	Scrubbing services, like firewalls, can affect the externally
	visible behaviour of a nameserver.  If you use a scrubbing
	service you should check that legitimate queries are not
	being blocked.
      </t>
      <t>
	Scrubbing services, unlike firewalls, are also turned on
	and off in response to denial of service attacks.  One needs
	to take care when choosing a scrubbing service and ask
	questions like:
      </t>
      <t>
	<list>
	  <t>
	    Do they pass unknown DNS query types?
	  </t>
	  <t>
	    Do they pass unknown EDNS versions?
	  </t>
	  <t>
	    Do they pass unknown EDNS options?
	  </t>
	  <t>
	    Do they pass unknown EDNS flags?
	  </t>
	  <t>
	    Do they pass requests with unknown DNS opcodes?
	  </t>
	  <t>
	    Do they pass requests with the remaining reserved DNS header flag bit set?
	  </t>
	</list>
      </t>
      <t>
	None of these are attack vectors, but some scrubbing
	services treat them as such.
      </t>
    </section>
    <section anchor="response" title="Response Code Selection">
      <t>
	Choosing the correct response code when fixing a nameserver
	is important.  Just because a type is not implemented
	does not mean that NOTIMP is the correct response code to
	return.  Response codes need to be chosen considering how
	clients will handle them.
      </t>
      <t>
	For unimplemented opcodes NOTIMP is the expected response code.
	Additionally a new opcode could change the message format by
	extending the header or changing the structure of the records
	etc.  This may result in FORMERR being returned though NOTIMP
	would be more correct.
      </t>
      <t>
	In general, for unimplemented type codes Name Error (NXDOMAIN)
	and NOERROR (no data) are the expected response codes.  A
	server is not supposed to serve a zone which contains
	unsupported types (<xref target="RFC1034"/>), so the only
	thing left is to indicate whether the QNAME exists or not.
	NOTIMP and REFUSED are not useful responses as they force the
	clients to try all the authoritative servers for a zone
	looking for a server which will answer the query.
      </t>
      <t>
	Meta query types may be the exception, but these need to
	be thought about on a case by case basis.
      </t>
      <t>
	If you support EDNS and get a query with an unsupported EDNS version,
	the correct response is BADVERS <xref target="RFC6891"/>.
      </t>
      <t>
	If you do not support EDNS at all, FORMERR and NOTIMP are the expected
	error codes.  That said, a minimal EDNS server implementation just
	requires parsing the OPT records and responding with an empty OPT
	record.  There is no need to interpret any EDNS options present in
	the request as unsupported options are expected to be ignored
	<xref target="RFC6891"/>.
      </t>
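      <t>
	The guidance in this section can be condensed into a lookup, as
	in the following Python sketch.  It is illustrative only: the
	function and parameter names are hypothetical, and it assumes a
	server that implements at most EDNS version 0.  For a server
	without EDNS support, the ordinary answer is also accepted
	because such a server may simply ignore the OPT record.
        <figure>
	  <artwork><![CDATA[
```python
RCODES = {"NOERROR": 0, "FORMERR": 1, "NXDOMAIN": 3,
          "NOTIMP": 4, "REFUSED": 5, "BADVERS": 16}

def expected_rcodes(known_opcode=True, edns_in_query=False,
                    edns_supported=True, edns_version=0,
                    name_exists=True):
    """Return the set of rcode names this section expects."""
    # Known and unknown types alike: say whether the name exists.
    normal = {"NOERROR"} if name_exists else {"NXDOMAIN"}
    if not known_opcode:
        return {"NOTIMP"}
    if edns_in_query and not edns_supported:
        return {"FORMERR", "NOTIMP"} | normal
    if edns_in_query and edns_version > 0:
        return {"BADVERS"}
    return normal

def acceptable(rcode, **situation):
    """True if a numeric rcode is among those expected."""
    names = expected_rcodes(**situation)
    return rcode in {RCODES[name] for name in names}
```
]]></artwork>
        </figure>
	Note that REFUSED never appears in the result, and NOTIMP
	appears only for unknown opcodes or absent EDNS support.
      </t>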
    </section>
    <section anchor="testing" title="Testing">
      <t>
	Verify the server is configured for the zone:
        <figure>
	  <artwork>
dig +noedns +noad +norec soa $zone @$server

expect: status: NOERROR
expect: SOA record
	  </artwork>
        </figure>
      </t>
      <t>
	Check that TCP queries work:
        <figure>
	  <artwork>
dig +noedns +noad +norec +tcp soa $zone @$server

expect: status: NOERROR
expect: SOA record
	  </artwork>
        </figure>
      </t>
      <t>
	Check that queries for an unknown type work:
        <figure>
	  <artwork>
dig +noedns +noad +norec type1000 $zone @$server

expect: status: NOERROR
expect: an empty answer section.
	  </artwork>
        </figure>
      </t>
      <t>
	Check that queries with CD=1 work:
        <figure>
	  <artwork>
dig +noedns +noad +norec +cd soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
	  </artwork>
        </figure>
      </t>
      <t>
	Check that queries with AD=1 work:
        <figure>
	  <artwork>
dig +noedns +norec +ad soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
	  </artwork>
        </figure>
      </t>
      <t>
	Check that queries with the last unassigned DNS header flag work:
        <figure>
	  <artwork>
dig +noedns +noad +norec +zflag soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
expect: MBZ to not be in the response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that plain EDNS queries work:
        <figure>
	  <artwork>
dig +edns=0 +noad +norec soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
expect: OPT record to be present
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that EDNS version 1 queries work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=1 +noednsneg +noad +norec soa $zone @$server

expect: status: BADVERS
expect: SOA record to not be present
expect: OPT record to be present
expect: EDNS Version 0 in response
(this will change when EDNS version 1 is defined)
	  </artwork>
        </figure>
      </t>
      <t>
	Check that EDNS queries with an unknown option work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=0 +noad +norec +ednsopt=100 soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
expect: OPT record to be present
expect: OPT=100 to not be present
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that EDNS queries with an unknown flag work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=0 +noad +norec +ednsflags=0x40 soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
expect: OPT record to be present
expect: MBZ not to be present
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that EDNS version 1 queries with an unknown flag work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=1 +noednsneg +noad +norec +ednsflags=0x40 soa \
    $zone @$server

expect: status: BADVERS
expect: SOA record to NOT be present
expect: OPT record to be present
expect: MBZ not to be present
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that EDNS version 1 queries with an unknown option work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=1 +noednsneg +noad +norec +ednsopt=100 soa $zone @$server

expect: status: BADVERS
expect: SOA record to NOT be present
expect: OPT record to be present
expect: OPT=100 to NOT be present
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that DNSSEC queries work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=0 +noad +norec +dnssec soa $zone @$server

expect: status: NOERROR
expect: SOA record to be present
expect: OPT record to be present
expect: DO=1 to be present if a RRSIG is in the response
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
	DO=1 as per <xref target="RFC3225"/>.
      </t>
      <t>
	Check that EDNS version 1 DNSSEC queries work (EDNS supported):
        <figure>
	  <artwork>
dig +edns=1 +noednsneg +noad +norec +dnssec soa \
    $zone @$server

expect: status: BADVERS
expect: SOA record to not be present
expect: OPT record to be present
expect: DO=1 to be present if the EDNS version 0 DNSSEC query test
        returned DO=1
expect: EDNS Version 0 in response
	  </artwork>
        </figure>
      </t>
      <t>
	Check that new opcodes are handled:
        <figure>
	  <artwork>
dig +noedns +noad +opcode=15 +norec soa $zone @$server

expect: status: NOTIMP
expect: SOA record to not be present
	  </artwork>
        </figure>
      </t>
      <t>
	If EDNS is not supported by the nameserver we expect a response to
	all the above queries.  That response may be a FORMERR or NOTIMP
	error response or the OPT record may just be ignored.
      </t>
      <t>
	It is advisable to run all the above tests in parallel so as to
	minimise the delays due to multiple timeouts when the servers
	do not respond.
      </t>
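      <t>
	One way to run the tests in parallel is sketched below in
	Python.  It is a minimal illustration, not a complete test
	harness: the TESTS table holds only three of the dig
	invocations listed above, the names run_dig and run_all and
	the 30 second timeout are assumptions, and the output of each
	command still has to be compared against the "expect" lines.
        <figure>
	  <artwork><![CDATA[
```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# A subset of the dig option sets used in the tests above.
TESTS = {
    "plain": ["+noedns", "+noad", "+norec", "soa"],
    "tcp": ["+noedns", "+noad", "+norec", "+tcp", "soa"],
    "edns": ["+edns=0", "+noad", "+norec", "soa"],
}

def run_dig(args, zone, server, timeout=30):
    """Run one dig command; return its output or None."""
    cmd = ["dig", *args, zone, "@" + server]
    try:
        done = subprocess.run(cmd, capture_output=True,
                              text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.stdout

def run_all(zone, server, tests=TESTS, runner=run_dig):
    """Start every test at once and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(tests)) as pool:
        futures = {name: pool.submit(runner, args, zone, server)
                   for name, args in tests.items()}
        return {name: f.result() for name, f in futures.items()}
```
]]></artwork>
        </figure>
	With this arrangement the total run time is bounded by the
	slowest single test rather than by the sum of all timeouts.
      </t>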
      <t>
	The above tests use dig from BIND 9.11.0 which is still in development.
      </t>
    </section>
    <section anchor="seccon" title="Security Considerations">
      <t>
	Testing protocol compliance can potentially result in false
	reports of attempts to break services from Intrusion Detection
	Services and firewalls.  None of the tests listed above
	should break nominally EDNS compliant servers.  None of the
	tests above should break non-EDNS servers.  All the tests
	above are well formed, though not necessarily common, DNS
	queries.
      </t>
      <t>
	Relaxing firewall settings to ensure EDNS compliance could
	potentially expose a critical implementation flaw in the
	nameserver.  Nameservers should be tested for conformance
	before relaxing firewall settings.
      </t>
    </section>
    <section anchor="iana" title="IANA Considerations">
      <t>
	IANA / ICANN needs to consider which of the tests above, if
	any, it should add to the zone maintenance procedures
	for zones under its control, including pre-delegation checks.
	Otherwise this document has no actions for IANA.
      </t>
    </section>
  </middle>
  <back>
    <references title="Normative References">
      &rfc1034; &rfc1035; &rfc3225; &rfc5966; &rfc6891;
    </references>
  </back>
</rfc>
