INTERNET-DRAFT                                        Charles H. Lindsey
Usenet Format Working Group                     University of Manchester
                                                           February 2003

                          News Article Format

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This Draft is intended as a standards track document, obsoleting RFC
   1036, which itself dates from 1987.

   This Standard defines the format of Netnews articles and specifies
   the requirements to be met by software which originates,
   distributes, stores and displays them.

   Since the 1980s, Usenet has grown explosively, and many Internet and
   non-Internet sites now participate. In addition, the Netnews
   technology is now in widespread use for other purposes.

   Backward compatibility has been a major goal of this endeavour, but
   where this standard and earlier documents or practices conflict,
   this standard should be followed. In most such cases, current
   practice is already compatible with these changes.

[The use of the words "this standard" within this document when
referring to itself does not imply that this draft yet has pretensions
to be a standard, but rather indicates what will become the case if and
when it is accepted as an RFC with the status of a proposed or draft
standard.]

[Remarks enclosed in square brackets and aligned with the left margin,
such as this one, are not part of this draft, but are editorial notes
to explain matters amongst ourselves, or to point out alternatives, or
to assist the RFC Editor.]

[In this draft, references to [NNTP] are to be replaced by [RFC 977],
or else by references to the RFC arising from the series of drafts
draft-ietf-nntpext-base-*.txt, in the event that such RFC has been
accepted at the time this document is published. Likewise, it may be
possible to replace references to [RFC 2279] by references to
[RFC 2279bis].]

[Please note that this Draft is subject to change as regards the extent
to which the charset UTF-8 is to be used in headers, or even whether it
is used at all. It is published at this time to give a snapshot of the
current state of the project. It is also intended to split it into two
documents, one the standard proper and the other an informational
document setting out best current practice.]

Table of Contents

   1. Introduction
     1.1. Basic Concepts
     1.2. Objectives
     1.3. Historical Outline
     1.4. Transport
   2. Definitions, Notations and Conventions
     2.1. Definitions
     2.2. Textual Notations
     2.3. Relation To Email and MIME
     2.4. Syntax
       2.4.1. Syntax Notation
       2.4.2. Syntax adapted from Email and MIME
       2.4.3. Syntax copied from other standards
     2.5. Language
   3. Changes to the existing protocols
     3.1. Principal Changes
     3.2. Transitional Arrangements
   4. Basic Format
     4.1. Syntax of News Articles
     4.2. Headers
       4.2.1. Naming of Headers
       4.2.2. MIME-style Parameters
       4.2.3. White Space and Continuations
       4.2.4. Comments
       4.2.5. Header Properties
         4.2.5.1. Experimental Headers
         4.2.5.2. Inheritable Headers
         4.2.5.3. Variant Headers
       4.2.6. Undesirable Headers
     4.3. Body
       4.3.1. Body Format Issues
       4.3.2. Body Conventions
     4.4. Characters and Character Sets
       4.4.1. Character Sets within Article Headers
       4.4.2. Character Sets within Article Bodies
       4.4.3. The NEWS-8BIT-HEADERS IMAP Extension
     4.5. Size Limits
     4.6. Example
   5. Mandatory Headers
     5.1. Date
       5.1.1. Examples
     5.2. From
       5.2.1. Examples:
     5.3. Message-ID
     5.4. Subject
       5.4.1. Examples
     5.5. Newsgroups
       5.5.1. Forbidden newsgroup-names
       5.5.2. Encoded newsgroup-names
     5.6. Path
       5.6.1. Format
       5.6.2. Adding a path-identity to the Path-header
       5.6.3. The tail-entry
       5.6.4. Path-Delimiter Summary
       5.6.5. Suggested Verification Methods
       5.6.6. Example
   6. Optional Headers
     6.1. Reply-To
       6.1.1. Examples
     6.2. Sender
     6.3. Organization
     6.4. Keywords
     6.5. Summary
     6.6. Distribution
     6.7. Followup-To
     6.8. Mail-Copies-To
     6.9. Posted-And-Mailed
     6.10. References
       6.10.1. Examples
     6.11. Expires
     6.12. Archive
     6.13. Control
     6.14. Approved
     6.15. Supersedes
     6.16. Xref
     6.17. Lines
     6.18. User-Agent
       6.18.1. Examples
     6.19. Injector-Info
       6.19.1. Usage of Injector-Info-parameters
         6.19.1.1. The posting-host-parameter
         6.19.1.2. The posting-account-parameter
         6.19.1.3. The posting-sender-parameter
         6.19.1.4. The posting-logging-parameter
         6.19.1.5. The posting-date-parameter
       6.19.2. Example
     6.20. Complaints-To
     6.21. MIME headers
       6.21.1. Syntax
       6.21.2. Content-Type
         6.21.2.1. Message/partial
         6.21.2.2. Message/rfc822
         6.21.2.3. Message/external-body
         6.21.2.4. Multipart types
       6.21.3. Content-Transfer-Encoding
       6.21.4. Character Sets
       6.21.5. Content Disposition
       6.21.6. Definition of some new Content-Types
         6.21.6.1. Application/news-transmission
         6.21.6.2. Message/news obsoleted
     6.22. Obsolete Headers
   7. Control Messages
     7.1. Digital Signature of Headers
     7.2. Group Control Messages
       7.2.1. The 'newgroup' Control Message
         7.2.1.1. The Body of the 'newgroup' Control Message
         7.2.1.2. Application/news-groupinfo
         7.2.1.3. Initial Articles
         7.2.1.4. Example
       7.2.2. The 'rmgroup' Control Message
         7.2.2.1. Example
       7.2.3. The 'mvgroup' Control Message
         7.2.3.1. Example
       7.2.4. The 'checkgroups' Control Message
         7.2.4.1. Application/news-checkgroups
     7.3. Cancel
     7.4. Ihave, sendme
     7.5. Obsolete control messages.
   8. Duties of Various Agents
     8.1. General principles to be followed
     8.2. Duties of an Injecting Agent
       8.2.1. Proto-articles
       8.2.2. Procedure to be followed by Injecting Agents
     8.3. Duties of a Relaying Agent
     8.4. Duties of a Serving Agent
     8.5. Duties of a Posting Agent
     8.6. Duties of a Followup Agent
     8.7. Duties of a Moderator
     8.8. Duties of a Gateway
       8.8.1. Duties of an Outgoing Gateway
         8.8.1.1. Gatewaying into email
       8.8.2. Duties of an Incoming Gateway
       8.8.3. Example
   9. Security and Related Considerations
     9.1. Leakage
     9.2. Attacks
       9.2.1. Denial of Service
       9.2.2. Compromise of System Integrity
     9.3. Liability
   10. IANA Considerations
   11. References
   12. Acknowledgements
   13. Contact Address
   Appendix A.1 - A-News Article Format
   Appendix A.2 - Early B-News Article Format
   Appendix A.3 - Obsolete Headers
   Appendix A.4 - Obsolete Control Messages
   Appendix B - Collected Syntax
   Appendix B.1 - Characters, Atoms and Folding
   Appendix B.2 - Basic Forms
   Appendix B.3 - Headers
   Appendix B.3.1 - Header outlines
   Appendix B.3.2 - Control-message outlines
   Appendix B.3.3 - Other header rules
   Appendix C - Notices

1. Introduction

1.1. Basic Concepts

   "Netnews" is a set of protocols for generating, storing and
   retrieving news "articles" (which resemble email messages) and for
   exchanging them amongst a readership which is potentially widely
   distributed. It is organized around "newsgroups", with the
   expectation that each reader will be able to see all articles posted
   to each newsgroup in which he participates. These protocols most
   commonly use a flooding algorithm which propagates copies throughout
   a network of participating servers.
   Typically, only one copy is stored per server, and each server makes
   it available on demand to readers able to access that server.

   An important characteristic of Netnews is the lack of any
   requirement for a central administration or for the establishment of
   any controlling host to manage the network. A network which limits
   participation to some restricted set of hosts (within some company,
   for example) is a "closed" network; otherwise it is an "open"
   network. A set of hosts within a network which, by mutual
   arrangement, operates some variant (whether more or less
   restrictive) of the Netnews protocols is a "cooperating subnet".

   "Usenet" is a particular worldwide open network based upon the
   Netnews protocols, with the newsgroups being organized into
   recognized "hierarchies". Anybody can join (it is simply necessary
   to negotiate an exchange of articles with one or more other
   participating hosts). Usenet "belongs" to those who administer the
   hosts of which it is comprised. There is no Cabal with overall
   authority to direct what is to be allowed. Nevertheless, there do
   exist agencies within Usenet that have authority to establish
   policies and to perform administrative functions, but such authority
   derives solely from the consent of those sites which choose to
   recognize it (and who can decline to exchange articles with sites
   which choose not to recognize it). Usually, the authority of such an
   agency is restricted to a particular hierarchy, or group of
   hierarchies.

   A "policy" is a rule intended to facilitate the smooth operation of
   a network by establishing parameters which restrict behaviour that,
   whilst technically unexceptionable, would nevertheless contravene
   some accepted standard of "Good Netkeeping". Since the ultimate
   beneficiaries of a network are its human readers, who will be less
   tolerant of poorly designed interfaces than mere computers, articles
   in breach of established policy can cause considerable annoyance to
   their recipients.
   Policies may well vary from network to network, from hierarchy to
   hierarchy within one network, and even between individual newsgroups
   within one hierarchy. It is assumed, for the purposes of this
   standard, that agencies with varying degrees of authority to
   establish such policies will exist, and that where they do not,
   policy will be established by mutual agreement. For the benefit of
   networks and hierarchies without such established agencies, and to
   provide a basis upon which all agencies can build, this present
   standard often provides default policy parameters, usually
   introducing them by a phrase such as "As a matter of policy ...".

1.2. Objectives

   The purpose of this present standard is to define the format of
   articles and the protocols to be used for Netnews in general, and
   for Usenet in particular, and to set standards to be followed by
   software that implements those protocols.

   It is NOT the purpose of this standard to define how the authority
   of various agencies to exercise control or oversight of the various
   parts of Usenet is established (that is itself a matter of policy).
   Nevertheless, it is assumed that such authorities will exist, and
   tools are provided within the protocols for their use.

1.3. Historical Outline

   Network news originated as the medium of communication for Usenet,
   circa 1980. Since then, Usenet has grown explosively, and many
   Internet and non-Internet sites participate in it. In addition, the
   news technology is now in widespread use for other purposes, on the
   Internet and elsewhere.

   The earliest news interchange used the so-called "A News" article
   format. Shortly thereafter, an article format vaguely resembling
   Internet Mail was devised and used briefly. Both of those formats
   are completely obsolete; they are documented in Appendix A.1 and
   Appendix A.2 for historical reasons only. With publication of
   [RFC 850] in 1983, news articles came to closely resemble Internet
   Mail messages, with some restrictions and some additional headers.
   [RFC 1036] in 1987 updated [RFC 850] without making major changes.

   A Draft popularly referred to as "Son of 1036" [Son-of-1036] was
   written in 1994 by Henry Spencer. That document formed the original
   basis for this standard, and its author has endorsed this standard
   as its successor. Much is taken directly from Son of 1036, and it is
   hoped that we have followed its spirit and intentions. It is
   anticipated that [Son-of-1036] will be published as an Historic RFC,
   in a suitably relabelled form, following the publication of this
   standard.

1.4. Transport

   As in this standard's predecessors, the exact means used to transmit
   articles from one host to another is not specified. NNTP [NNTP] is
   the most common transmission method on the Internet, but much
   transmission takes place entirely independent of the Internet. Other
   methods in use include the UUCP protocol [RFC 976] extensively used
   in the early days of Usenet, FTP, downloading via satellite, tape
   archives, and physically delivered magnetic and optical media.

2. Definitions, Notations and Conventions

2.1. Definitions

   An "article" is the unit of news, analogous to an [RFC 2822]
   "message". A "proto-article" is one that has not yet been injected
   into the news system.

   A "message identifier" (5.3) is a unique identifier for an article,
   usually supplied by the "posting agent" which posted it or, failing
   that, by the "injecting agent". It distinguishes the article from
   every other article ever posted anywhere. Articles with the same
   message identifier are treated as if they are the same article
   regardless of any differences in the body or headers.

   A "newsgroup" is a single news forum, a logical bulletin board,
   having a name and nominally intended for articles on a specific
   topic. An article is "posted to" a single newsgroup or several
   newsgroups.
   When an article is posted to more than one newsgroup, it is said to
   be "crossposted"; note that this differs from posting the same text
   as part of each of several articles, one per newsgroup.

   A newsgroup may be "moderated", in which case submissions are not
   posted directly, but mailed to a "moderator" for consideration and
   possible posting. Moderators are typically human but may be
   implemented partially or entirely in software.

   A "hierarchy" is the set of all newsgroups whose names share a first
   component (as defined in 5.5). The term "sub-hierarchy" is also used
   where several initial components are shared.

   A "poster" is the person or software that composes and submits a
   possibly compliant article to a "posting agent". The poster is
   analogous to [RFC 2822]'s author(s).

   A "posting agent" is the software that assists posters to prepare
   proto-articles, in compliance with this standard. The proto-article
   is then passed on to an "injecting agent" for final checking and
   injection into the news stream. If the article is not compliant, or
   is rejected by the injecting agent, then the posting agent informs
   the poster with an explanation of the error.

   A "reader" is the person or software reading news articles. A
   "reading agent" is software which presents articles to a reader.

   A "followup" is an article containing a response to the contents of
   an earlier article (the followup's "precursor"). A "followup agent"
   is a combination of reading agent and posting agent that aids in the
   preparation and posting of a followup.

   An (email) "address" is the mailbox [RFC 2822] (or more particularly
   the addr-spec within that mailbox) which directs the delivery of an
   email to its intended recipient, who is said to "own" that address.

   An article's "reply address" is the address to which mailed replies
   should be sent. This is the address specified in the article's
   From-header (5.2), unless it also has a Reply-To-header (6.1).
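
   As an informal illustration of the "reply address" definition, the
   following Python sketch prefers the Reply-To-header and falls back
   to the From-header. It is not part of this standard. The function
   name reply_addresses is hypothetical, and the sketch assumes that
   Python's standard email package (which follows the email standards
   rather than the stricter Netnews syntax) is an acceptable parser for
   the headers concerned.

```python
from email import message_from_string
from email.utils import getaddresses


def reply_addresses(article: str) -> list:
    """Return the addr-specs to which mailed replies should be sent.

    Hypothetical helper: uses the Reply-To-header if the article has
    one, otherwise the From-header, as in the definition above.
    """
    msg = message_from_string(article)
    # get_all() returns None when the header is absent, so fall through.
    values = msg.get_all("Reply-To") or msg.get_all("From") or []
    # getaddresses() yields (display-name, addr-spec) pairs.
    return [addr for _name, addr in getaddresses(values) if addr]


if __name__ == "__main__":
    sample = (
        "From: John Doe <jdoe@site.example>\n"
        "Reply-To: replies@site.example\n"
        "Newsgroups: example.test\n"
        "\n"
        "Body text.\n"
    )
    print(reply_addresses(sample))
```

   The sample article uses the ".example" domain and the "example.*"
   hierarchy, in keeping with the conventions for examples in this
   document.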

   A "sender" is the person or software (usually, but not always, the
   same as the poster) responsible for the operation of the posting
   agent or, which amounts to the same thing, for passing the article
   to the injecting agent. The sender is analogous to [RFC 2822]'s
   sender.

   An "injecting agent" takes the finished article from the posting
   agent (often via the NNTP "post" command), performs some final
   checks and passes it on to a relaying agent for general
   distribution.

   A "relaying agent" is software which receives allegedly compliant
   articles from injecting agents and/or other relaying agents, and
   possibly passes copies on to other relaying agents and serving
   agents.

   A "news database" is the set of articles and related structural
   information stored by a serving agent and made available for access
   by reading agents.

   A "serving agent" receives an article from a relaying agent and
   files it in a news database. It also provides an interface for
   reading agents to access the news database.

   A "control message" is an article which is marked as containing
   control information; a relaying or serving agent receiving such an
   article may (subject to the policies observed at that site) take
   actions beyond just filing and passing on the article.

   A "gateway" is software which receives news articles and converts
   them to messages of some other kind (e.g. mail to a mailing list),
   or vice versa; in essence it is a translating relaying agent that
   straddles boundaries between different methods of message exchange.
   The most common type of gateway connects newsgroup(s) to mailing
   list(s), either unidirectionally or bidirectionally, but there are
   also gateways between news networks using this standard's news
   format and those using other formats.

2.2. Textual Notations

   This standard contains explanatory NOTEs using the following format.
   These may be skipped by persons interested solely in the content of
   the specification.
   The purpose of the notes is to explain why choices were made, to
   place them in context, or to suggest possible implementation
   techniques.

        NOTE: While such explanatory notes may seem superfluous in
        principle, they often help the less-than-omniscient reader
        grasp the purpose of the specification and the constraints
        involved. Given the limitations of natural language for
        descriptive purposes, this improves the probability that
        implementors and users will understand the true intent of the
        specification in cases where the wording is not entirely clear.

   "US-ASCII" is short for "the ANSI X3.4 character set" [ANSI X3.4].
   While "ASCII" is often misused to refer to various character sets
   somewhat similar to X3.4, in this standard "US-ASCII" is used to
   mean X3.4 and only X3.4. US-ASCII is a 7 bit character set. Please
   note that this standard requires that all agents be 8 bit clean;
   that is, they must accept and transmit data without changing or
   omitting the 8th bit.

   Certain words, when capitalized, are used to define the significance
   of individual requirements. The key words "MUST", "REQUIRED",
   "SHOULD", "RECOMMENDED", "MAY" and "OPTIONAL", and any of those
   words associated with the word "NOT", are to be interpreted as
   described in [RFC 2119].

   In addition, the word "Ought", when applied to a poster, or to
   actions of posting and similar agents which a poster may easily
   override, indicates a recommendation whose violation would do no
   more than breach established policy, or accepted best practice.

        NOTE: The use of "MUST" or "SHOULD" implies a requirement that
        would or could lead to interoperability problems if not
        followed. Although not following an "Ought" recommendation
        might do no worse than cause extreme irritation to other
        readers, particularly in the case of the publicly distributed
        Usenet, that is no reason not to take it seriously.
        The essential distinction is that enforcement of a "MUST" or
        "SHOULD" is a matter of ensuring correct implementation,
        whereas enforcement of an "Ought" is more a matter of sensible
        design or of social pressure (whose effectiveness should not be
        underestimated, even though it cannot be prescribed by this
        standard).

        NOTE: A requirement imposed on a relaying or serving agent
        regarding some particular article should be understood as
        applying only if that article is actually accepted for
        processing (since any agent may always reject any article
        entirely, for reasons of site policy).

   Wherever the context permits, use of the masculine includes the
   feminine and use of the singular includes the plural, and vice
   versa.

   Throughout this standard we will give examples of various
   definitions, headers and other specifications. It needs to be
   remembered that these samples are for the aid of the reader only and
   do NOT define any specification themselves. In order to prevent
   possible conflict with "Real World" entities and people the top
   level domain ".example" is used in all sample domains and addresses.
   The hierarchy "example.*" is also used as a sample hierarchy.
   Information on the ".example" top level domain is in [RFC 2606].

2.3. Relation To Email and MIME

   The primary intent of this standard is to describe the news article
   format. Insofar as news articles are a subset of the email message
   format augmented by some new headers, this standard incorporates
   many (though not all) of the provisions of [RFC 2822], with the aim
   of enabling news articles to pass through email systems and vice
   versa, provided only that they contain the minimum headers required
   for the mode of transport being used. Unfortunately, the match is
   not perfect, but it is the intention of this standard that gateways
   between Email and Netnews should be able to operate with the minimum
   of tinkering.

   Likewise, this standard incorporates many (though not all) of the
   provisions of the MIME standards [RFC 2045] et seq which, though
   designed with Email in mind, are mostly applicable to Netnews.

2.4. Syntax

   The complete syntax defined in this standard is repeated, for
   convenience, in Appendix B.

2.4.1. Syntax Notation

   This standard uses the Augmented Backus Naur Form described in
   [RFC 2234]. In particular, it makes significant use of the
   "incremental alternative" feature of that notation. For example, the
   two rules

      header  = other-header
      header =/ Date-header

   are equivalent to the single rule

      header  = other-header / Date-header

2.4.2. Syntax adapted from Email and MIME

   Much of the syntax of Netnews Articles is based on the corresponding
   syntax defined in [RFC 2822] or in the MIME specifications
   [RFC 2045] et seq, which are deemed to have been incorporated into
   this standard as required. However, there are some important
   differences arising from the fact that [RFC 2822] does not recognize
   anything beyond US-ASCII characters, that it does not recognize the
   MIME headers [RFC 2045], and that it includes much syntax described
   as "obsolete" (which is excluded from this standard, as detailed
   below).

        NOTE: Netnews parsers historically have been much less
        permissive than Email parsers, and this is reflected in the
        modifications referred to, and in some further specific rules.

   The following syntactic rules therefore supersede the corresponding
   rules given in [RFC 2822] and [RFC 2045], thus allowing UTF-8
   characters [RFC 2279] to appear in certain contexts (the five rules
   beginning with "strict-" reflect the corresponding original rules
   from [RFC 2822]).

      UTF8-2          = %xC2-DF UTF8-tail
      UTF8-3          = %xE0 %xA0-BF UTF8-tail /
                        %xE1-EC 2(UTF8-tail) /
                        %xED %x80-9F UTF8-tail /
                        %xEE-EF 2(UTF8-tail)
      UTF8-4          = %xF0 %x90-BF 2(UTF8-tail) /
                        %xF1-F7 3(UTF8-tail)
      UTF8-5          = %xF8 %x88-BF 3(UTF8-tail) /
                        %xF9-FB 4(UTF8-tail)
      UTF8-6          = %xFC %x84-BF 4(UTF8-tail) /
                        %xFD 5(UTF8-tail)
      UTF8-tail       = %x80-BF
      UTF8-xtra-char  = UTF8-2 / UTF8-3 / UTF8-4 / UTF8-5 / UTF8-6
      text            = %d1-9 /         ; all UTF-8 characters except
                        %d11-12 /       ;  US-ASCII NUL, CR and LF
                        %d14-127 /
                        UTF8-xtra-char
      ctext           = NO-WS-CTL /     ; all of <text> except
                        %d33-39 /       ;  SP, HTAB, "(", ")"
                        %d42-91 /       ;  "\" and DEL
                        %d93-126 /
                        UTF8-xtra-char
      qtext           = NO-WS-CTL /     ; all of <text> except
                        %d33 /          ;  SP, HTAB, "\" DQUOTE
                        %d35-91 /       ;  and DEL
                        %d93-126 /
                        UTF8-xtra-char
      utext           = NO-WS-CTL /     ; Non white space controls
                        %d33-126 /      ; The rest of UTF-8
                        UTF8-xtra-char
      strict-text     = %d1-9 /         ; text restricted to
                        %d11-12 /       ;  US-ASCII
                        %d14-127
      strict-qtext    = NO-WS-CTL /     ; qtext restricted to
                        %d33 /          ;  US-ASCII
                        %d35-91 /
                        %d93-126
      strict-quoted-pair
                      = "\" strict-text
      strict-qcontent = strict-qtext / strict-quoted-pair
      strict-quoted-string
                      = [CFWS] DQUOTE
                        *( [FWS] strict-qcontent ) [FWS]
                        DQUOTE [CFWS]
      unstructured    = 1*( [FWS] utext ) [FWS]

   The syntax for UTF8-xtra-char excludes those redundant sequences of
   octets which cannot occur in UTF-8, as defined by [RFC 2279], either
   because they would not be the shortest possible encodings of some
   UCS character [ISO/IEC 10646], or they would represent one of the
   characters D800 through DFFF, disallowed in UCS because of their
   surrogate use in the UTF-16 encoding. These sequences MUST NOT be
   generated by posting agents. Where they occur inadvertently, they
   SHOULD be passed on untouched by other agents, but attempts to
   interpret them as malformed UTF-8 MUST NOT be made. However, if
   there is reason to suppose they are representations of some other
   character set they MAY, as suggested in section 4.4.1, be
   interpreted as such.
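
   As a non-normative illustration, the UTF8-xtra-char rules can be
   transcribed alternative by alternative into a byte-level pattern.
   The Python sketch below does this; the names UTF8_XTRA_CHAR and
   is_utf8_xtra_string are hypothetical and are not defined by this
   standard. Note that the grammar is deliberately wider than what
   Python's own UTF-8 decoder accepts (that decoder rejects lead octets
   F5 through FD), so the pattern, not bytes.decode(), is what mirrors
   the rules as written.

```python
import re

T = rb"[\x80-\xBF]"  # UTF8-tail

# One entry per alternative in the ABNF rules, in the same order.
_ALTERNATIVES = [
    rb"[\xC2-\xDF]" + T,            # UTF8-2
    rb"\xE0[\xA0-\xBF]" + T,        # UTF8-3 (no overlong forms)
    rb"[\xE1-\xEC]" + T * 2,
    rb"\xED[\x80-\x9F]" + T,        # excludes D800 through DFFF
    rb"[\xEE-\xEF]" + T * 2,
    rb"\xF0[\x90-\xBF]" + T * 2,    # UTF8-4
    rb"[\xF1-\xF7]" + T * 3,
    rb"\xF8[\x88-\xBF]" + T * 3,    # UTF8-5
    rb"[\xF9-\xFB]" + T * 4,
    rb"\xFC[\x84-\xBF]" + T * 4,    # UTF8-6
    rb"\xFD" + T * 5,
]

# Matches exactly one UTF8-xtra-char.
UTF8_XTRA_CHAR = re.compile(b"(?:" + b"|".join(_ALTERNATIVES) + b")")

_XTRA_STRING = re.compile(b"(?:" + UTF8_XTRA_CHAR.pattern + b")+")


def is_utf8_xtra_string(octets: bytes) -> bool:
    """True if octets consists solely of one or more UTF8-xtra-char."""
    return _XTRA_STRING.fullmatch(octets) is not None


if __name__ == "__main__":
    samples = [
        "\u00e9".encode("utf-8"),   # two-octet character
        b"\xC0\xAF",                # overlong encoding of "/"
        b"\xED\xA0\x80",            # would represent surrogate D800
    ]
    for octets in samples:
        print(octets, is_utf8_xtra_string(octets))
```

   A relaying agent would not need such a check merely to pass octets
   on untouched; the sketch is aimed at posting agents, which must not
   generate the excluded sequences.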

   The syntax also includes, for completeness, the cases UTF8-5 and
   UTF8-6 which cannot, in fact, arise in [UNICODE 3.2] (though they
   might conceivably arise in some future extension).

   Observe, in contradistinction to [RFC 2822], that an unstructured
   header MUST contain at least one non-whitespace character (see also
   remarks about empty headers in 4.2.6).

   Wherever in this standard the syntax is stated to be taken from
   [RFC 2822], it is to be understood as the syntax defined by
   [RFC 2822] after making the above changes, but NOT including any
   syntax defined in section 4 ("Obsolete syntax") of [RFC 2822].
   Software compliant with this standard MUST NOT generate any of the
   syntactic forms defined in that Obsolete Syntax, although it MAY
   accept such syntactic forms. Certain syntax from the MIME
   specifications [RFC 2045] et seq is also considered a part of this
   standard (see 6.21).

2.4.3. Syntax copied from other standards

   The following syntactic forms, taken from [RFC 2234] or from
   [RFC 2822], are repeated here for convenience only:

      ALPHA           = %x41-5A /       ; A-Z
                        %x61-7A         ; a-z
      CR              = %x0D            ; carriage return
      CRLF            = CR LF
      DIGIT           = %x30-39         ; 0-9
      HTAB            = %x09            ; horizontal tab
      LF              = %x0A            ; line feed
      SP              = %x20            ; space
      NO-WS-CTL       = %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  which do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and whitespace characters
                        %d127
      specials        = "(" / ")" /     ; Special characters used in
                        "<" / ">" /     ;  other parts of the syntax
                        "[" / "]" /
                        ":" / ";" /
                        "@" / "\" /
                        "," / "." /
                        DQUOTE
      WSP             = SP / HTAB       ; whitespace characters
      FWS             = ([*WSP CRLF] 1*WSP); folding whitespace
      ccontent        = ctext / quoted-pair / comment
      comment         = "(" *([FWS] ccontent) [FWS] ")"
      CFWS            = *([FWS] comment) ( ([FWS] comment) / FWS )
      DQUOTE          = %d34            ; quote mark
      quoted-pair     = "\" text
      atext           = ALPHA / DIGIT /
                        "!" / "#" /     ; Any US-ASCII character except
                        "$" / "%" /     ;  controls, SP, and specials.
                        "&" / "'" /     ;  Used for atoms
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"
      atom            = [CFWS] 1*atext [CFWS]
      dot-atom        = [CFWS] dot-atom-text [CFWS]
      dot-atom-text   = 1*atext *( "." 1*atext )
      qcontent        = qtext / quoted-pair
      quoted-string   = [CFWS] DQUOTE
                        *( [FWS] qcontent ) [FWS]
                        DQUOTE [CFWS]
      word            = atom / quoted-string
      phrase          = 1*word

        NOTE: Following [RFC 2234], literal text included in the syntax
        is to be regarded as case-insensitive. However, in
        contradistinction to [RFC 2822], the Netnews protocols are
        sensitive to case in some instances (as in newsgroup-names,
        some header parameters, etc.). Care has been taken to indicate
        this explicitly where required.

   As in [RFC 2822], where any quoted-pair appears it is to be
   interpreted as its text character alone. That is to say, the "\"
   character that appears as part of a quoted-pair is semantically
   "invisible".

   Again, as in [RFC 2822], strings of characters that include
   characters not syntactically allowed in some particular context may
   be incorporated into a quoted-string by "encapsulating" them between
   quote (DQUOTE, US-ASCII 34) characters, prefixing every quote and
   backslash character (and possibly other characters too) with a "\"
   so as to form a quoted-pair, and possibly introducing folding by
   prefixing some WSP with CRLF.

   The semantic value of a quoted-string (i.e. the result of reversing
   the encapsulation) is a string of characters which includes neither
   the optional CFWS outside of the quote characters, nor the quote
   characters themselves, nor any CRLF contained within any FWS between
   the two quote characters, nor the "\" which introduces any
   quoted-pair.

2.5. Language

   Various constant strings in this standard, such as header names and
   month names, are derived from English words. Despite their
   derivation, these words do NOT change when the poster or reader
   employing them is interacting in a language other than English.
Posting and reading agents MAY translate as appropriate in their interaction with the poster or reader, but the forms that actually appear in articles as transmitted MUST be the English-derived ones defined in this standard. 3. Changes to the existing protocols This standard prescribes many changes, clarifications and new features since the protocols described in [RFC 1036] and [Son-of- 1036]. It is the intention that they can be assimilated into Usenet as it presently operates without major interruption to the service, though some of the new features may not begin to show benefit until they become widely implemented. This section summarizes the main changes, and comments on some features of the transition. 3.1. Principal Changes o The [RFC 2822] conventions for parenthesis-enclosed comments in headers are supported. o Whitespace is permitted in Newsgroups-headers, permitting folding of such headers. Indeed, all headers can now be folded. o An enhanced syntax for the Path-header enables the injection point of and the route taken by an article to be determined with certainty. o Netnews is firmly established as an 8bit medium and all headers are deemed to be in the UTF-8 character set (thus permitting, in particular, the use of non-ASCII newsgroup-names). o Large parts of MIME are recognized as an integral part of Netnews. o There is a new Control message 'mvgroup' to facilitate moving a group to a different place (name) in a hierarchy. o There are several new headers defined, notably Archive, Complaints-To, Injector-Info, Mail-Copies-To, Posted-And-Mailed and User-Agent, leading to increased functionality. o Provision has been made for almost all headers to have MIME-style parameters (to be ignored if not recognized), thus facilitating extension of those headers in future standards. o Certain headers and Control messages (Appendix A.3 and Appendix A.4) have been made obsolete. 
o Distributions are expected to be checked at the receiving end, as well as the sending end, of a relaying link. o There are numerous other small changes, clarifications and enhancements. 3.2. Transitional Arrangements An important distinction must be made between serving and relaying agents, which are responsible for the distribution and storage of news articles, and user agents, which are responsible for interactions with users. It is important that the former should be upgraded to conform to this standard as soon as possible to provide the benefit of the enhanced facilities. Fortunately, the number of distinct implementations of such agents is rather small, at least so far as the main "backbone" of Usenet is concerned, and many of the new features are already supported. Contrariwise, there are a great number of implementations of user agents, installed on a vastly greater number of small sites. Therefore, the new functionality has been designed so that existing agents may continue to be used, although the full benefits may not be realised until a substantial proportion of them have been upgraded. In the list which follows, care has been taken to distinguish the implications for both kinds of agent. o [RFC 2822] style comments in headers do not affect serving and relaying agents (note that the Message-ID-, Newsgroups-, Distribution- and Path-headers do not contain them). They are unlikely to hinder their proper display in existing reading agents except in the case of the References-header in agents which thread articles. Therefore, it is provided that they SHOULD NOT be generated except where permitted by the previous standards. o Because of its importance to all serving agents, the extension permitting whitespace and folding in Newsgroups-headers SHOULD NOT be used until it has been widely deployed amongst relaying agents. User agents are unaffected. o The new style of Path-header is already consistent with the previous standards. 
However, the intention is that relaying agents should eventually reject articles in the old style, and so this possibility should be offered as a configurable option in relaying agents. User agents are unaffected. o The vast majority of serving, relaying and transport agents are believed to be already 8bit clean (in the slightly restricted sense in which that term is used in the MIME standards). User agents that do not implement MIME may be disadvantaged, but no more so than at present when faced with 8bit characters (which currently abound in spite of the previous standards). o The introduction of MIME reflects a practice that is already widespread. Articles in strict compliance with the previous standards (using strict US-ASCII) will be unaffected. Many user agents already support it, at least to the extent of widely used charsets such as ISO-8859-1. Users expecting to read articles using other charsets will need to acquire suitable reading agents. It is not intended, in general, that any single user agent will be able to display every charset known to IANA, but all such agents MUST support US-ASCII. Serving and relaying agents are not affected. o The use of the UTF-8 charset for headers will not affect any existing usage that complies with the previous standards, since US-ASCII is a strict subset of UTF-8. Insofar as newsgroup-names containing non-ASCII characters can now be expected to arise, some support from serving and relaying agents will be desirable, although it has been established that most current serving agents can already cope with such names without modification (although perhaps not in an ideal manner). Note that it is not necessary for serving and relaying agents to understand all the characters available in UTF-8, though it is desirable for them to be displayable for diagnostic purposes via some escape mechanism using, for example, the visible subset of US-ASCII. 
o Users expecting to read headers containing non-ASCII characters expressed in UTF-8 will need to acquire suitable reading agents (it is not anticipated that current reading agents will fail to display such articles, but those non-ASCII characters will likely appear in some illegible form). The same will be true for users reading such headers in email on the far side of a news-to-email gateway; in that case, it will be necessary for the gateway to be upgraded (see 8.8.1.1). o The new Control: mvgroup command will need to be implemented in serving agents. For the benefit of older serving agents it is therefore RECOMMENDED that it be followed shortly by a corresponding newgroup command and it MUST always be followed by a rmgroup command for the old group after a reasonable overlap period. An implementation of the mvgroup command as an alias for the newgroup command would thus be minimally conforming. User agents are unaffected. o All the headers newly introduced by this standard can safely be ignored by existing software, albeit with loss of the new functionality. 4. Basic Format 4.1. Syntax of News Articles The overall syntax of a news article is: article = 1*( header CRLF ) separator body header = other-header other-header = header-name ":" 1*SP other-content header-name = 1*name-character *( "-" 1*name-character ) name-character = ALPHA / DIGIT other-content = separator = CRLF body = *( *998text CRLF ) However, the rule given above for header is incomplete. Further alternatives will be added incrementally as the various Netnews headers are introduced in this standard (or in future extensions), using the "=/" notation defined in [RFC 2234]. 
For example, a typical USENET-header would be defined as follows: header =/ USENET-header USENET-header = "USENET" ":" SP USENET-content *( ";" ( USENET-parameter / extension-parameter ) ) USENET-content = USENET-parameter = where the USENET-parameter, which MUST always be of the same syntactic form as an extension-parameter (see below), is not provided in all headers, and even the extension-parameter is omitted in some cases (see 4.2.2). Observe that "USENET" is (and MUST be) of the syntactic form of a header-name. extension-parameter= parameter = attribute "=" value attribute = [CFWS] token [CFWS] x-token = "x-" token token = 1* tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" / DQUOTE / "/" / "[" / "]" / "?" / "=" value = [CFWS] token [CFWS] / quoted-string An article consists of some headers followed by a body. An empty line separates the two. The headers contain structured information about the article and its transmission. A header begins with a header-name identifying it, and can be continued onto subsequent lines as described in section 4.2.3. The body is largely unstructured text significant only to the poster and the readers. NOTE: Terminology here follows the current custom in the news community, rather than the [RFC 2822] convention of referring to what is here called a "header" as a "header-field" or "field". Note that the separator line MUST be truly empty, not just a line containing white space. Further empty lines following it are part of the body, as are empty lines at the end of the article. NOTE: The syntax above defines the canonical form of a news article as a sequence of lines each terminated by CRLF. This does not prevent serving agents or transport agents from storing or handling the article in other formats (e.g. using a single LF in place of CRLF) so long as the overall effects achieved are as defined by this standard when operating on the canonical form. 4.2.
Headers The order of headers in an article is not significant. However, posting agents are encouraged to put mandatory headers (section 5) first, followed by optional headers (section 6), followed by experimental headers and headers not defined in this standard or its extensions. Relaying agents MUST NOT change the order of the headers in an article. 4.2.1. Naming of Headers Despite the restrictions on header-name syntax imposed by the grammar, relaying and reading agents SHOULD tolerate header-names containing any US-ASCII printable character other than colon (":", US-ASCII 58). Whilst relaying agents MUST accept, and pass on unaltered, any non- variant header whose header-name is syntactically correct, and reading agents MUST enable them to be displayed, at least optionally, posting and injecting agents SHOULD NOT generate headers other than o headers established by this standard or any extension to it; o those recognized by other IETF-established standards, notably the Email standard [RFC 2822] and its extensions, excluding any explicitly deprecated for Netnews (e.g. see section 9.2.1 for the deprecated Disposition-Notification-To-header); or, alternatively, those listed in some future IANA registry of recognized headers; o experimental headers beginning with "X-" (as defined in 4.2.5.1); o on a provisional basis only, headers related to new protocols under development which are the subject of (or intended to be the subject of) some IETF-approved RFC (whether Informational, Experimental or Standards-Track). However, software SHOULD NOT attempt to interpret headers not specifically intended to be meaningful in the Netnews environment. Header-names are case-insensitive. There is a preferred case convention, which posters and posting agents Ought to use: each hyphen-separated "word" has its initial letter (if any) in uppercase and the rest in lowercase, except that some abbreviations have all letters uppercase (e.g. "Message-ID" and "MIME-Version"). 
The forms given in the various rules defining headers in this standard are the preferred forms for them. Relaying and reading agents MUST, however, tolerate articles not obeying this convention. 4.2.2. MIME-style Parameters A few header-specific MIME-style parameters are defined in this standard, but there is also provision for generic extension- parameters to appear in most headers for the purpose of allowing future extensions to those headers. Observe that such parameters do not, in general, occur in headers defined in other standards, except for the MIME standards [RFC 2045] et seq. and their extensions. Extension-parameters, other than those using x-tokens, MUST NOT be used unless they have first been defined in an IETF-approved RFC (whether Informational, Experimental or Standards-Track) or, on a provisional basis only, in relation to new protocols under development which are the subject of (or intended to be the subject of) some such IETF-approved RFC. They MUST ONLY be defined for use in those headers where the syntax of this standard so allows. They SHOULD NOT, at present, be defined for use in headers in widespread use prior to the introduction of this standard (this restriction is likely to be removed in a future version of this standard). Nevertheless, compliant software MUST accept such parameters wherever syntactically allowed in this standard (ignoring them if their meaning is unknown) and SHOULD accept (and ignore) them in all structured headers wherever defined. [We could go further, and establish an IANA registry for these parameters, preloaded with the ones already defined in this standard. A good model for setting up such a registry is to be found in RFC 2183 (Content-Disposition).] 
NOTE: The syntax does not permit extension-parameters in unstructured headers (where they are unnecessary) or in certain headers (notably the Date-, From-, Message-ID-, Reply-To-, Sender-, Keywords-, Mail-Copies-To-, References-, Supersedes- and Complaints-To-headers) which are the same (or similar to) headers already existing in the Email standards. Each header-specific MIME-style parameter introduced in this standard is described by specifying (a) the token to be used in its attribute, and (b) the syntax rule(s) defining the object(s) permitted in its value. If a value object is not of the syntactic form of a token, it MUST (and otherwise MAY) be encapsulated in a quoted-string (see 2.4.3). Observe that the syntax of a parameter also allows additional WSP, folding and comments. The semantics of a parameter is always to associate the token in its attribute with the object represented by the token, or the semantic value (2.4.3) of the quoted-string, contained in its value. For example, the posting-sender-parameter (6.19) is defined to be where sender-value = mailbox / "verified" A valid posting-sender-parameter would be sender = "\"Joe D. Bloggs\" " (authinfo) The comment (syntactically part of the quoted-string) is irrelevant. The actual mailbox (to be used, for example, if email is to be sent to the sender) is "Joe D. Bloggs" 4.2.3. White Space and Continuations Each header is logically a single line of characters comprising the header-name, the colon with its following SP, the content, and any parameters. For convenience, however, the content and parameters can be "folded" into a multiple line representation by inserting a CRLF before any WSP contained within any FWS or CFWS (but not any other SP or HTAB) allowed by this standard. 
For example, the header:

   Approved: modname@modsite.example (Moderator of example.foo.bar)

can be represented as:

   Approved: modname@modsite.example
      (Moderator of example.foo.bar)

FWS occurs at many places in the syntax (usually within a CFWS) in order to allow the inclusion of comments, whitespace and folding. The syntax is in fact ambiguous insofar as it sometimes allows two consecutive instantiations of FWS (at least one of which is always optional), or of an optional FWS followed by an explicit CRLF. However, all such cases MUST be treated as if the optional instantiation (or one of them) had not been allowed. It is thus precluded that any line of a header should be made up of whitespace characters and nothing else (for such a line might otherwise have been interpreted by a non-compliant agent as the separator between the headers and the body of the article).

NOTE: This does not lead to semantic ambiguity because, unless specifically stated otherwise, the presence or absence of folding, a comment or additional WSP has no semantic meaning and, in particular, it is a matter of indifference whether it forms a part of the syntactic construct preceding it or the one following it.

NOTE: It may be observed that the content part of every header begins and ends with an optional CFWS (or FWS in the case of a few headers). Moreover, every parameter also begins and ends with an optional CFWS.

NOTE: Although contents are defined in such a way that folding can take place between many of the lexical tokens (and even within some of them), folding should be limited to placing the CRLF at higher-level syntactic breaks, and should also avoid leaving trailing WSP on the preceding line. For instance, if a header-content is defined as comma-separated values, it is recommended that folding occur after the comma separating the values, even if it is allowed elsewhere.
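By way of illustration only, the folding rule of this section can be reversed by deleting every CRLF that is immediately followed by WSP. The Python sketch below shows this, together with a check for the precluded case of a header line consisting solely of whitespace. The function names are hypothetical, and the header is assumed to be held as a string in canonical CRLF form.

```python
import re


def unfold(header: str) -> str:
    """Undo folding by deleting each CRLF that immediately precedes SP or HTAB."""
    return re.sub(r"\r\n(?=[ \t])", "", header)


def has_whitespace_only_line(header: str) -> bool:
    """Report whether any physical line of the header contains only WSP."""
    if header.endswith("\r\n"):
        header = header[:-2]        # ignore the CRLF that terminates the header
    return any(line.strip(" \t") == "" for line in header.split("\r\n"))
```

Applied to a folded form of the Approved-header example above, unfold returns the single logical line, with the WSP that followed the CRLF retained.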
In accordance with the syntax, the header-name on the first line MUST be followed by a SP (even if the rest of the header is empty, but see 4.2.6). Even though the syntax allows otherwise, at least some of the content MUST appear on that first line (to avoid the possibility of harm by any non-compliant agent that might eliminate a trailing WSP). Although posting agents are REQUIRED to enforce these restrictions, relaying and serving agents SHOULD accept articles that violate them. NOTE: This standard differs from [RFC 2822] in requiring that SP following the colon (it was also an [RFC 1036] requirement). Posters and posting agents SHOULD use SP, not HTAB, where white space is desired in headers (some existing software expects this). Relaying and serving agents SHOULD accept HTAB in all such cases, however. 4.2.4. Comments Strings of characters which are treated as comments may be included in headers wherever the syntactic element CFWS occurs. They consist of characters enclosed in parentheses. Comments may be nested. NOTE: Although CFWS occurs wherever whitespace is allowed in almost all headers, there are exceptions where only FWS is permitted (hence folding but no comments). Notably, this happens in the case of the Message-ID-, Newsgroups-, Distribution-, Path- and Followup-To-headers, and within the Date-header except right at the end. A comment is normally used to provide some human readable informational text, except at the end of a mailbox which contains no phrase, as in fred@foo.bar.example (Fred Bloggs) as opposed to "Fred Bloggs" . The former is a deprecated, but commonly encountered, usage and reading agents SHOULD take special note of such comments as indicating the name of the person whose mailbox it is. In all other situations a comment is semantically interpreted as a single SP. Since a comment is allowed to contain FWS, folding is permitted within it as well as immediately preceding and immediately following it. 
Also note that, since quoted-pair is allowed in a comment, the parenthesis and backslash characters may appear in a comment so long as they appear as a quoted-pair. Semantically, the enclosing parentheses are not part of the content of the comment; the content is what is contained between the two parentheses. Since comments have not hitherto been permitted in news articles, except in a few specified places, posters and posting-agents SHOULD NOT insert them except in those places, namely following addresses in From and similar headers, and to indicate the name of the timezone in Date-headers. However, compliant software MUST accept them in all places where they are syntactically allowed. 4.2.5. Header Properties There are three special properties that may apply to particular headers, namely: "experimental", "inheritable", and "variant". When a header is defined, in this (or any future) standard, as having one (or possibly more) of these properties, it is subject to special treatment, as indicated below. 4.2.5.1. Experimental Headers Experimental headers are those whose header-names begin with "X-". They are to be used for experimental Netnews features, or for enabling additional material to be propagated with an article. They are not (and will not be) defined by this, or any, standard. NOTE: Experimental headers are suitable for situations where they need only to be human readable. They are not intended to be recognized by widely deployed Netnews software and, should such a requirement be envisaged, it is preferable to use a normal header on the provisional basis set out in section 4.2.1. 4.2.5.2. Inheritable Headers Subject only to the overriding ability of the poster to determine the contents of the headers in a proto-article, headers with the inheritable property MUST be copied by followup agents (perhaps with some modification) into the followup article, and headers without that property MUST NOT be so copied. 
Examples include: o Newsgroups (5.5) - copied from the precursor, subject to any Followup-To-header. o Subject (5.4) - modified by prefixing with "Re: ", but otherwise copied from the precursor. o References (6.10) - copied from the precursor, with the addition of the precursor's Message-ID. o Distribution (6.6) - copied from the precursor. NOTE: The Keywords-header is not inheritable, though some older newsreaders treated it as such. 4.2.5.3. Variant Headers Headers with the variant property may differ between (or even be completely absent from) copies of the same article as stored or relayed throughout a Netnews system. The manner of the difference (or absence) MUST be as specified in this (or any future) standard. Typically, these headers are modified as articles are propagated, or they reflect the status of the article on a particular serving agent, or cooperating group of such agents. The variant header MAY be placed anywhere within the headers (though placing it first is recommended). The principal examples are: o Path (5.6) - augmented at each relaying agent that an article passes through. o Xref (6.16) - used to keep track of the article locators of crossposted articles so that newsreaders serviced by a particular serving agent can mark such articles as read. 4.2.6. Undesirable Headers A header whose content is empty is said to be an empty header (in fact, no such headers are defined by this standard). Relaying and reading agents SHOULD NOT consider presence or absence of an empty header to alter the semantics of an article (although syntactic rules, such as requirements that certain header-names appear at most once, MUST still be satisfied). Posting and injecting agents SHOULD delete empty headers from articles before posting them; relaying agents MUST pass them untouched. 
Headers that merely state defaults explicitly (e.g., a Followup-To- header with the same content as the Newsgroups-header, or a MIME Content-Type-header with contents "text/plain; charset=us-ascii") or state information that reading agents can typically determine easily themselves (e.g. the length of the body in octets) are redundant and posters and posting agents Ought Not to include them. 4.3. Body 4.3.1. Body Format Issues The body of an article SHOULD NOT be empty. A posting or injecting agent which does not reject such an article entirely SHOULD at least issue a warning message to the poster and supply a non-empty body. Note that the separator line MUST be present even if the body is empty. NOTE: Some existing news software is known to react badly to body-less articles, hence the request for posting and injecting agents to insert a body in such cases. The sentence "This article was probably generated by a buggy news reader" has traditionally been used in this situation. Note that an article body is a sequence of lines terminated by CRLFs, not arbitrary binary data, and in particular it MUST end with a CRLF. However, relaying and serving agents SHOULD treat the body of an article as an uninterpreted sequence of octets (except as mandated by changes of CRLF representation and by control message processing, as in 7.2.4) and SHOULD avoid imposing constraints on it. See also section 4.5. Posters SHOULD avoid using control characters and escape sequences except for tab (US-ASCII 9), formfeed (US-ASCII 12) and, possibly, backspace (US-ASCII 8). Tab signifies sufficient horizontal white space to reach the next of a set of fixed positions; posters are warned that there is no standard set of positions, so tabs should be avoided if precise spacing is essential. Formfeed (which is sometimes referred to as the "spoiler character") signifies a point at which a reading agent Ought to pause and await reader interaction before displaying further text. 
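The body requirements above lend themselves to a simple check in a posting agent. The following Python sketch is one possible form of it, offered as an illustration rather than a prescribed algorithm: the name body_warnings and the wording of the messages are hypothetical, and DEL (US-ASCII 127) is assumed to count as a control character.

```python
# Controls which posters may reasonably use: backspace, tab and formfeed.
ALLOWED_CONTROLS = {8, 9, 12}


def body_warnings(body: bytes) -> list:
    """Return human-readable warnings about a body held in canonical CRLF form."""
    warnings = []
    if not body:
        warnings.append("body is empty")
    elif not body.endswith(b"\r\n"):
        warnings.append("body does not end with CRLF")
    for number, line in enumerate(body.split(b"\r\n"), start=1):
        # A bare CR or LF left inside a line is reported here as a control.
        bad = sorted(
            {octet for octet in line
             if (octet < 32 or octet == 127) and octet not in ALLOWED_CONTROLS}
        )
        if bad:
            warnings.append(f"line {number}: control characters {bad}")
    return warnings
```

A posting agent might present such warnings to the poster and leave the decision to post with the poster, in keeping with the SHOULD-level wording of this section.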
NOTE: Passing other control characters or escape sequences unaltered to a display or printing device is likely to have unpredictable results, except in the case of a device adapted to the special needs of some particular character set. NOTE: Backspace was historically used for underlining, done by an underscore (US-ASCII 95), a backspace, and a character, repeated for each character that should be underlined. Posters are warned that underlining is not available on all output devices or supported by all reading agents and is best not relied on for essential meaning. 4.3.2. Body Conventions A body is by default an uninterpreted sequence of octets for most of the purposes of this standard. However, a MIME Content-Type-header may impose some structure or intended interpretation upon it, and may also specify the character set in accordance with which the octets are to be interpreted. The following conventions for quotations, attributions and signatures, although not mandated by this standard, describe widely used practices. They are documented here in order to establish their correct usage, and the use of the words "MUST", "SHOULD", etc. is to be understood in that context. It is conventional for followup agents to enable the incorporation of the followed-up article (the "precursor") as a quotation. This SHOULD be done by prefacing each line of the quoted text (even if it is empty) with the character ">" (or perhaps with "> " in the case of a previously unquoted line). This will result in multiple levels of ">" when quoted content itself contains quoted content, and it will also facilitate the automatic analysis of articles. NOTE: Posters should edit quoted context to trim it down to the minimum necessary. However, followup agents Ought Not to attempt to enforce this beyond issuing a warning (past attempts to do so have been found to be notably counter-productive). 
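The quoting convention just described might be implemented by a followup agent along the following lines. This Python sketch is illustrative only: the name quote_body is hypothetical, and the choice of a bare ">" for empty lines (to avoid trailing whitespace) is an assumption of the example, not a requirement of the convention.

```python
def quote_body(body: str) -> str:
    """Prefix every line of a precursor body (in CRLF form) with a quote mark."""
    lines = body.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()                 # drop the empty piece after the final CRLF
    quoted = []
    for line in lines:
        if line == "" or line.startswith(">"):
            quoted.append(">" + line)    # empty or already quoted: add ">"
        else:
            quoted.append("> " + line)   # previously unquoted: add "> "
    return "".join(line + "\r\n" for line in quoted)
```

Repeated application yields the multiple levels of ">" mentioned above: a line first quoted as "> text" becomes ">> text" in a later followup.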
The followup agent SHOULD also precede the quoted content by an "attribution line" (however, readers are warned not to assume that they are accurate, especially within multiply nested quotations). The following convention for such lines is intended to facilitate their automatic recognition and processing by sophisticated reading agents. The attribution SHOULD contain the name and/or the email address of the precursor's poster, as in Joe D. Bloggs wrote: or Helmut Schmidt schrieb: The attribution MAY contain also a single newsgroup-name (the one from which the followup is being made), the precursor's Message-ID and/or the precursor's Date and Time. Any of these that are present, SHOULD precede the name and/or email address. However, the inclusion or not of such fields Ought always to be under the control of the poster. To enable this line, and the Message-ID and the email address within it, to be recognized (for example to enable suitable reading agents to retrieve the precursor or email its poster by clicking on them), the following conventions SHOULD be observed: o The precursor's Message-ID SHOULD be enclosed within <...> or o The precursor's poster's email address SHOULD be enclosed within <...> o The various fields may be separated by arbitrary text and they may be folded in the same way as headers, but attributions SHOULD always be terminated by a ":" followed by CRLF. Further examples: On comp.foo in <1234@bar.example> on 24 Dec 2001 16:40:20 +0000, Joe D. Bloggs wrote: Am 24. Dez 2001 schrieb Helmut Schmidt : A "personal signature" is a short closing text automatically added to the end of articles by posting agents, identifying the poster and giving his network addresses, etc. Whenever a poster or posting agent appends such a signature to an article, it MUST be preceded with a delimiter line containing (only) two hyphens (US-ASCII 45) followed by one SP (US-ASCII 32). 
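A followup or reading agent can locate a personal signature by searching for that delimiter line. The Python sketch below is one way to do so under stated assumptions: the name split_signature is hypothetical, the body is a single (non-multipart) text in CRLF form, and where several delimiter lines occur the last one is taken as the start of the signature, following the convention of this section.

```python
SIG_DELIMITER = "-- "    # two hyphens followed by exactly one SP


def split_signature(body: str):
    """Split a body into (text_lines, signature_lines) at the last delimiter."""
    lines = body.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()                 # drop the empty piece after the final CRLF
    for index in range(len(lines) - 1, -1, -1):
        if lines[index] == SIG_DELIMITER:
            return lines[:index], lines[index + 1:]
    return lines, []                # no delimiter line, hence no signature
```

Note that a line containing only "--" without the trailing SP is not treated as a delimiter in this sketch.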
The signature is considered to extend from the last occurrence of that delimiter up to the end of the article (or up to the end of the part in the case of a multipart MIME body). Followup agents, when incorporating quoted text from a precursor, Ought Not to include the signature in the quotation. Posting agents Ought to discourage (at least with a warning) signatures of excessive length (4 lines is a commonly accepted limit). 4.4. Characters and Character Sets Transmission paths for news articles MUST treat news articles as uninterpreted sequences of octets, excluding the values 0 (US-ASCII NUL) and 13 and 10 (US-ASCII CR and LF, which MUST ONLY appear in the combination CRLF which denotes a line separator). NOTE: this corresponds to the range of octets permitted for MIME "8bit data" [RFC 2045]. Thus raw binary data cannot be transmitted in an article body except by the use of a Content-Transfer-Encoding such as base64. [Tentative paragraph to deal with IMAP] This requirement includes the transmission paths between posting agents, injecting agents, relaying agents, serving agents and reading agents, but it does NOT include paths traversed by Netnews articles that have been converted to Email (8.8.1.1). It SHOULD extend to IMAP4 servers which provide access to Netnews (see the extension described in section 4.4.3). Character data is represented by octets in accordance with some encoding scheme (UTF-8 for headers, and determined by the Content-Type- and Content-Transfer-Encoding-headers for bodies). If it comes to a relaying agent's attention that it is being asked to pass an article using the Content-Transfer-Encoding "8bit" to a relaying agent that does not support it, it SHOULD report this error to its administrator. It MUST refuse to pass the article and MUST NOT re-encode it with different MIME encodings. NOTE: This strategy will do little harm.
The target relaying agent is unlikely to be able to make use of the article on its own servers, and the usual flooding algorithm will likely find some alternative route to get the article to destinations where it is needed. 4.4.1. Character Sets within Article Headers Within article headers, characters are represented as octets according to the UTF-8 encoding scheme [RFC 2279] or [ISO/IEC 10646], and hence all the characters in Unicode [UNICODE 3.2] or in the Universal Multiple-Octet Coded Character Set (UCS) [ISO/IEC 10646] (which is essentially identical to Unicode and expected to remain so) are potentially available. Although it will usually be unnecessary to use language tagging within headers, the tagging facilities provided in [UNICODE 3.2] (code points U+E0000 through U+E007F) MAY be used for that purpose. NOTE: UTF-8 is an encoding for the [ISO/IEC 10646] character set (in both its 16 and 32 bit forms) with the property that any octet less than 128 immediately represents the corresponding US-ASCII character, thus ensuring upwards compatibility with previous practice. Non-ASCII characters from Unicode are represented by sequences of octets satisfying the syntax of a UTF8-xtra-char (2.4.2), which excludes certain octet sequences not explicitly permitted by [RFC 2279]. Unicode includes all characters from the ISO-8859 series of character sets [ISO 8859] (which includes all Cyrillic, Greek and Arabic characters) together with the more elaborate characters used in Asian countries. See the NOTEs in the following section for the appropriate treatment of Unicode characters by reading agents. [The sentence mentioning [RFC 2279] could be simplified if [RFC 2279bis] has been accepted by the time this standard is published.] Notwithstanding the great flexibility permitted by UTF-8, there is need for restraint in its use in order that the essential components of headers may be discerned using reading agents that cannot present the full Unicode range.
In particular, header-names and tokens MUST be in US-ASCII, and certain other components of headers, as defined elsewhere in this standard - notably msg-ids, date-times, dot-atoms, domains and path-identities - MUST be in US-ASCII. Comments, phrases (as in mailboxes) and unstructured headers (such as the Subject-, Organization- and Summary-headers) MAY use the full range of UTF-8 characters, but SHOULD nevertheless be invariant under Unicode normalization NFC [UNICODE 3.2]. NOTE: Unicode allows for composite characters made up of a starter character - which can be a letter, number, punctuation mark, or symbol - plus zero or more combining marks (such as accents, diacritics, and similar). The requirement that a composite be invariant under normalization NFC means that, where it could be written in more than one way, only one particular one of those ways is allowed (for example, the single character E-acute is preferred over E followed by a non-spacing acute accent, and A-ring is preferred over the Angstrom symbol). At least for the main European languages, for which all the needed composites are already available as single characters, it is unlikely that posting agents will need to take any special steps to ensure normalization. In the particular case of newsgroup-names (see 5.5) there are more stringent requirements regarding the normalization and other usages of Unicode. Where the use of non-ASCII characters is permitted as above, they MAY be encoded in UTF-8 or they MAY be encoded using the MIME mechanisms defined in [RFC 2047] and [RFC 2231]. For this purpose, all headers defined in this standard are to be considered as "extension message header fields" for the purpose of section 5 of [RFC 2047] (insofar as they are not already covered under the existing Email standards). The effect of this is to permit the use of [RFC 2047] encodings within any unstructured header, or within any comment or phrase permitted within any structured header. 
Additionally, [RFC 2047] is considered to incorporate the extension to allow language tags within encoded-words described in [RFC 2231]. Likewise, the syntax for parameter (see 4.1 above) is to be considered as replaced by the revised syntax given in [RFC 2231], the effect of which is to allow the use of parameter value continuations, character sets and language information within the MIME-style parameters introduced in this standard (4.2.2). [We could go further and include that syntax explicitly in this document.] Exceptionally, where some other protocol, for example the authentication protocol based on OpenPGP defined in [RFC 3156], restricts some header to 7-bit data, the [RFC 2047] and [RFC 2231] encodings MUST be used in preference to UTF-8 (see also the similar restriction in 6.21.3). [This presupposes that the extension to permit UTF-8 in body part headers in 6.21.1 survives.] Examples: Organization: Technische =?iso-8859-1?Q?Universit=E4t_M=FCnchen?= Approved: =?iso-8859-1?Q?Fran=E7ois_Faur=E9?= (=?iso-8859-1?Q*fr?Mod=E9rateur_autoris=E9?=) Archive: yes; filename*=iso-8859-1'es'ma=F1ana.txt Reading agents MUST support the use of UTF-8, [RFC 2047] and [RFC 2231] in all those headers defined in this standard and in the Email standards, at least to the extent of their ability to display the characters presented to them. Moreover, since Netnews articles are regularly emailed as well as posted, and the current Email standards do not currently admit the use of full UTF-8 in headers, posting agents MUST ensure that [RFC 2047] and [RFC 2231] are used in preference to UTF-8 in those cases, at least within the emailed version (see also 6.9 and 8.8.1.1). Encoding by other means is not compliant with this standard. 
Nevertheless, encoding using other character sets (with no indication of which one beyond the user's ability to guess based upon other clues in the article, or custom within the newsgroup) has been in use in some hierarchies, and such usage may be expected to continue for some period after the introduction of this standard. Reading agents MAY, when such usage is detected, attempt to interpret the header according to whatever other character set can be deduced, or has been configured as a default by the reader. NOTE: It is possible to determine, with a high degree of accuracy, when a given text containing octets with the 8th bit set was not encoded using UTF-8, and using this test to recover such non-compliant texts is therefore commended where no other harm could arise. The [RFC 2047] encoding is not available within headers which contain a newsgroup-name, notably Newsgroups-headers and Followup-To-headers, because a newsgroup-name is neither a phrase nor a comment. Moreover such headers MUST in any case use UTF-8 in order to ensure that newsgroup-names appear in their canonical form. A special encoding for newsgroup-names is provided in section 5.5.2 for use when mailing to moderators and other gatewaying applications (8.7 and 8.8.1.1). NOTE: The choice between UTF-8 and [RFC 2047] when posting depends on various factors. Some reading agents do not recognize [RFC 2047], and some are incapable of decoding UTF-8 (though there is an increasing tendency for modern reading agents to understand, or to be configurable to understand, both). Since headers encoded in UTF-8 are currently prohibited in Email, special consideration needs to be given to articles that are both posted and mailed (6.9) or which are mailed to moderators (see 8.2.2). Posters and implementors of posting agents need to take account of all these factors when deciding which method to use. 4.4.2.
Character Sets within Article Bodies Within article bodies, characters are represented as octets according to the encoding scheme implied by any Content-Transfer-Encoding- and Content-Type-headers [RFC 2045]. In the absence of such headers, reading agents cannot be relied upon to display correctly more than the US-ASCII characters, though they MUST display at least those. NOTE: The use of non-ASCII characters in the absence of an appropriate Content-Type-header is not compliant with this standard. Nevertheless such usage has been seen in some hierarchies, and it would be reasonable for reading agents to make an informed "guess" when confronted with that situation, and in particular it would be wise at least to test whether they were in the form of valid UTF-8 (see also the suggestion for such a test in 4.4.1). NOTE: It is not expected that reading agents will necessarily be able to present characters in all possible character sets. For example, a reading agent might be able to present only the ISO-8859-1 (Latin 1) characters [ISO 8859], in which case it Ought to present undisplayable characters using some distinctive glyph, or by exhibiting a suitable warning. Followup agents MUST be careful to apply appropriate encodings to the outbound followup. A followup to an article containing non-ASCII material is very likely to contain non-ASCII material itself. 4.4.3. The NEWS-8BIT-HEADERS IMAP Extension [This section is highly tentative, and serves as a placeholder to indicate that an IMAP extension will be needed in order to ensure consistency with the present form of this draft. It shows the minimum extension that seems to be necessary, and would require significant further work for any final version.] The current IMAP4 protocol [RFC 2060] forbids 8-bit characters in headers (so as to conform with the previous Netnews standard [RFC 1036] and with the current Email standards).
[That reference to RFC 2060 should be changed to refer to [RFC 2060bis] if that has been accepted by the time this standard is published.] Implementations of IMAP4 conforming to this extension MUST 1. In the case of Netnews messages only, accept 8-bit octets in headers (part specifiers HEADER or MIME, or the header portion of a MESSAGE/RFC822 part) and pass them on to the client unchanged in any FETCH response; 2. Interpret all octets in such headers as being in the UTF-8 charset; 3. Include the capability NEWS-8BIT-HEADERS in any CAPABILITY response. NOTE: It is the responsibility of the client to interpret such headers. Users who require to see them displayed correctly will need to acquire clients with the necessary UTF-8 facilities. The new capability NEWS-8BIT-HEADERS is to be registered with IANA. [Memo: remember to update the IANA Considerations section.] 4.5. Size Limits Posting agents SHOULD endeavour to keep all header lines, so far as is possible, within 79 characters by folding them at suitable places (see 4.2.3). However, posting agents MUST permit the poster to include longer headers if he so insists, and compliant software MUST support headers of at least 998 octets. Likewise, injecting agents SHOULD fold any headers generated automatically by themselves. Relaying agents MUST NOT fold headers (i.e. they must pass on the folding as received). NOTE: There is NO restriction on the number of lines into which a header may be split, and hence there is NO restriction on the total length of a header (in particular it may, by suitable folding, be made to exceed the 998 octets restriction pertaining to a single header line). The syntax provides for the lines of a body to be up to 998 octets in length, not including the CRLF. All software compliant with this standard MUST support lines of at least that length, both in headers and in bodies, and all such software SHOULD support lines of arbitrary length. 
In particular, relaying agents MUST transmit lines of arbitrary length without truncation or any other modification. NOTE: The limit of 998 octets is consistent with the corresponding limit in [RFC 2822]. In plain-text messages (those with no MIME headers, or those with a MIME Content-Type of text/plain) posting agents Ought to endeavour to keep the length of body lines within some reasonable limit. The size of this limit is a matter of policy, the default being to keep within 79 characters at most, and preferably within 72 characters (to allow room for quoting in followups). Exceptionally, posting agents Ought Not to adjust the length of quoted lines in followups unless they are able to reformat them in a consistent manner. Moreover, posting agents MUST permit the poster to include longer lines if he so insists. NOTE: Plain-text messages are intended to be displayed "as-is" without any special action (such as automatic line splitting) on the part of the recipient. The policy limit (e.g. 72 or 79) should be expressed as a number of characters (as they will be displayed by a reading agent) rather than as the number of octets used to encode them. NOTE: This standard provides no upper bound on the overall size of a single article, but neither does it forbid relaying agents from dropping articles of excessive length. It is, however, suggested that any limits thought appropriate by particular agents would be more appropriately expressed in megabytes than in kilobytes. 4.6. Example Here is a sample article: Path: server.example/unknown.site2.example@site2.example/ relay.site.example/site.example/injector.site.example%jsmith Newsgroups: example.announce,example.chat Message-ID: <9urrt98y53@site1.example> From: Ann Example Subject: Announcing a new sample article. 
Date: Wed, 27 Mar 2002 12:12:50 +0300 Approved: example.announce moderator Followup-To: example.chat Reply-To: Ann Example Expires: Mon, 22 Apr 2002 12:12:50 +0300 Organization: Site1, The Number one site for examples. User-Agent: ExampleNews/3.14 (Unix) Keywords: example, announcement, standards, RFC 1036, Usefor Summary: The URL for the next standard. Injector-Info: injector.site.example; posting-host=du003.site.example Complaints-To: abuse@site.example Just a quick announcement that a new standard example article has been released; it is in the new USEFOR standard obtainable from ftp.ietf.org. Ann. -- Ann Example Sample Poster to the Stars "The opinions in this article are bloody good ones" - J. Clarke. [The RFC Editor is invited to change the above Date and Expires headers to match the actual publication dates and to insert its correct URL.] 5. Mandatory Headers An article MUST have one, and only one, of each of the following headers: Date, From, Message-ID, Subject, Newsgroups, Path. Note also that there are situations, discussed in the relevant parts of section 6, where References-, Sender-, or Approved-headers are mandatory. In control messages, specific values are required for certain headers. A proto-article (see 8.2.1) may lack some of these mandatory headers, but they MUST then be supplied by the injecting agent. 5.1. Date The Date-header contains the date and time that the article was prepared by the poster ready for transmission and SHOULD express the poster's local time. The content syntax makes use of syntax defined in [RFC 2822], subject to the following revised definition of zone. header =/ Date-header Date-header = "Date" ":" SP Date-content Date-content = date-time zone = (( "+" / "-" ) 4DIGIT) / "UT" / "GMT" The forms "UT" and "GMT" (indicating universal time) are to be regarded as obsolete synonyms for "+0000". 
They MUST be accepted, and passed on unchanged, by all agents, but they MUST NOT be generated as part of new articles by posting and injecting agents. The date-time MUST be semantically valid as required by [RFC 2822]. Although folding white space is permitted throughout the date-time syntax, it is RECOMMENDED that a single space be used in each place that FWS appears (whether it is required or optional). NOTE: A convention that is sometimes followed is to add a comment, after the date-time, containing the time zone in human-readable form, but many of the abbreviations commonly used for this purpose are ambiguous. The value given by the zone is the only definitive form. In order to prevent the reinjection of expired articles into the news stream, relaying and serving agents MUST refuse "stale" articles whose Date-header predates the earliest articles of which they normally keep record, or which is more than 24 hours into the future (though they MAY use a margin less than that 24 hours). Relaying agents MUST NOT modify the Date-header in transit. 5.1.1. Examples Date: Sat, 26 May 2001 11:13:00 -0500 (EST) Date: 26 May 2001 16:13 +0000 Date: 26 May 2001 16:13 GMT (Obsolete) 5.2. From The From-header contains the email address(es), possibly including the full name(s), of the article's poster(s). The content syntax makes use of syntax defined in [RFC 2822], subject to the following revised definition of local-part. header =/ From-header From-header = "From" ":" SP From-content From-content = mailbox-list addr-spec = local-part "@" domain local-part = dot-atom / strict-quoted-string NOTE: This syntax ensures that the local-part of an addr-spec is restricted to pure US-ASCII (and is thus in strict compliance with [RFC 2822]), whilst allowing any UTF-8 character to be used in a preceding quoted-string containing the poster's full name. If some future extension to the Email protocols should relax this restriction, one would expect the Netnews protocols to follow.
Observe that there is no provision for parameters in this header, or in other headers containing addresses likely to be used for sending email (see 4.2.2). Each mailbox in the From-content SHOULD be a valid address, belonging to the poster(s) of the article, or person or agent on whose behalf the post is being sent (see the Sender-header, 6.2). When, for whatever reason, the poster does not wish to include such an address, the From-content SHOULD then be an address which ends in the top level domain of ".invalid" [RFC 2606]. NOTE: Since such addresses ending in ".invalid" are undeliverable, user agents Ought to warn any user attempting to reply to them and Ought Not, in any case, to attempt to deliver to them (since that would be pointless anyway). Whether or not a valid address can subsequently be extracted from such an address falls outside the scope of this standard (obviously, posters wishing to disguise their address need to do more than just add ".invalid" to it). Be warned, however, that some injecting agents which are unable to detect that the address belongs to the poster may choose to insert a Sender-header (but see 8.2.2) or some entry in an Injector-Info-header (6.19) which discloses some valid address for the poster. 5.2.1. Examples: From: John Smith From: "John Smith" , dave@isp.example From: "John D. Smith" , andrew@isp.example, fred@site2.example From: Jan Jones From: Jan Jones From: dave@isp.example (Dave Smith) NOTE: the last example shows a now deprecated convention of putting a poster's full name in a comment following the mailbox, rather than in a phrase at the start of it. Observe also the use of the quoted-string "John D. Smith" which is required on account of presence of the '.' character, and which would also have been required had any UTF8-xtra-char been present. 5.3. Message-ID The Message-ID-header contains the article's message identifier, a unique identifier distinguishing the article from every other article. 
The content syntax makes use of syntax defined in [RFC 2822], subject to the following revised definitions of msg-id, no- fold-quote and no-fold-literal. header =/ Message-ID-header Message-ID-header = "Message-ID" ":" SP Message-ID-content Message-ID-content = [FWS] msg-id [FWS] msg-id = "<" id-left "@" id-right ">" id-left = dot-atom-text / no-fold-quote id-right = dot-atom-text / no-fold-literal no-fold-quote = DQUOTE *( strict-qtext / "\\" / "\" DQUOTE ) qspecial *( strict-qtext / "\\" / "\" DQUOTE ) DQUOTE qspecial = "(" / ")" / ; same as specials except "<" / ">" / ; "\" and DQUOTE quoted "[" / "]" / ":" / ";" / "@" / "\\" / "," / "." / "\" DQUOTE no-fold-literal = "[" *( dtext / "\[" / "\]" / "\\" ) "]" The msg-id MUST NOT be more than 250 octets in length. NOTE: Observe that, in contrast to the corresponding header in [RFC 2822], the syntax does not allow comments within the Message-ID-header; this is to simplify processing by relaying and serving agents and to ensure interoperability with existing implementations. The restriction to strict-qtext ensures that no UTF8-xtra-char can appear. Msg-ids as defined here are a "normalized" subset of those defined by [RFC 2822], ensuring that no string of characters is quoted unless strictly necessary (it must contain at least one qspecial) and no single character is prefixed by a "\" in the form of a quoted-pair unless strictly necessary, and moreover there is no possibility for WSP to occur, whether quoted or not. The length restriction ensures that systems which accept message identifiers as a parameter when retrieving an article (e.g. [NNTP]) can rely on a bounded length. Observe that msg-id includes the '<' and '>'. An agent generating an article's message identifier MUST ensure that it is unique (as also required in [RFC 2822]) and that it is chosen in such a way that it will NEVER be applied to any other Netnews article or Email message. 
However, an article emailed (without encapsulation) to a moderator (8.2.2 and 8.7) or gatewayed into some other medium (8.8.1) SHOULD retain the same message identifier throughout its travels so long as it remains recognizably the same article. Even though commonly derived from the domain name of the originating site (and domain names are case-insensitive), a message identifier MUST NOT be altered in any way during transport, or when copied (as into a References-header), and thus a simple (case-sensitive) comparison of octets will always suffice to recognize that same message identifier wherever it subsequently reappears. NOTE: These requirements are to be contrasted with those of the un-normalized msg-ids defined by [RFC 2822], which may perfectly legitimately become normalized (or vice versa) during transport or copying in email systems. NOTE: Some old software may treat message identifiers that differ only in case within their id-right part as equivalent, and implementors of agents that generate message identifiers should be aware of this. 5.4. Subject The Subject-header contains a short string identifying the topic of the message. This is an inheritable header (4.2.5.2) to be copied into the Subject-header of any followup, in which case the new Subject-content SHOULD then default to the string "Re: " (a "back reference") followed by the contents of the pure-subject of the precursor. Any leading "Re: " in that pure-subject MUST be stripped. header =/ Subject-header Subject-header = "Subject" ":" SP Subject-content Subject-content = [ [FWS] back-reference ] pure-subject pure-subject = unstructured back-reference = %x52.65.3A.20 ; which is a case-sensitive "Re: " The pure-subject MUST NOT begin with "Re: ". NOTE: The syntax of unstructured differs from that prescribed in [RFC 2822], so ensuring that the Subject-content is not permitted to be completely empty, or to consist of WSP only (see remarks in 4.2.6 concerning undesirable headers). 
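The default Subject-content for a followup, as described above, can be sketched as follows. This is a minimal illustration only, not part of this standard: the function name followup_subject is hypothetical, and stripping a repeated "Re: Re: " prefix in a loop is an assumption made here so that the resulting pure-subject never begins with "Re: ".

```python
BACK_REFERENCE = "Re: "  # case-sensitive back-reference (%x52.65.3A.20)


def followup_subject(precursor_subject: str) -> str:
    """Return a default Subject-content for a followup (hypothetical helper).

    Any leading "Re: " is stripped from the precursor's subject, then a
    single back-reference is placed in front of what remains.
    """
    pure_subject = precursor_subject
    # Assumption: strip repeatedly, so "Re: Re: x" also yields "Re: x".
    while pure_subject.startswith(BACK_REFERENCE):
        pure_subject = pure_subject[len(BACK_REFERENCE):]
    return BACK_REFERENCE + pure_subject


if __name__ == "__main__":
    for subject in ("Film at 11", "Re: Film at 11", "RE: Film at 11"):
        print(repr(subject), "->", repr(followup_subject(subject)))
```

Note that the sketch deliberately leaves erroneous forms such as "RE: " alone; it only recognizes the exact, case-sensitive string "Re: ".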
Followup agents MAY remove strings that are known to be used erroneously as back-reference (such as "Re(2): ", "Re:", "RE: ", or "Sv: ") from the Subject-content when composing the subject of a followup, and add a correct back-reference in front of the result. NOTE: that would be "SHOULD remove instances" except that we cannot find a sufficiently robust and simple algorithm to do the necessary natural language processing. Followup agents MUST NOT use any other string except "Re: " as a back reference. Specifically, a translation of "Re: " into a local language or usage MUST NOT be used. NOTE: "Re" is an abbreviation for the Latin "In re", meaning "in the matter of", and not an abbreviation of "Reference" as is sometimes erroneously supposed. Agents SHOULD NOT depend on nor enforce the use of back references by followup agents. For compatibility with legacy news software, the Subject-content of a control message (i.e. an article that also contains a Control-header) MAY start with the string "cmsg ", and non-control messages MUST NOT start with the string "cmsg ". See also section 6.13. 5.4.1. Examples In the following examples, please note that only "Re: " is mandated by this standard. "was: " is a convention used by many English- speaking posters to signal a change in subject matter. Software can always recognize such changes from the References-header. Subject: Film at 11 Subject: Re: Film at 11 Subject: Godwin's law considered harmful (was: Film at 11) Subject: Godwin's law (was: Film at 11) Subject: Re: Godwin's law (was: Film at 11) Subject: Re: Godwin's law 5.5. Newsgroups The Newsgroups-header's content specifies the newsgroup(s) in which the article is intended to appear. It is an inheritable header (4.2.5.2) which then becomes the default Newsgroups-header of any followup, unless a Followup-To-header is present to prescribe otherwise. 
Articles MUST NOT be passed between relaying agents or to serving agents unless the sending agent has been configured to supply and the receiving agent to receive at least one of the newsgroup- names in the Newsgroups-header. In order to allow newsgroup-names containing Non-ASCII characters, this section relies heavily on the provisions of the Unicode Standard. All references to "Unicode" mean [UNICODE 3.2] or any standard that supersedes it. That document contains guarantees of strict future upwards compatibility (e.g. no character will be removed or change classification). Implementors should be aware that currently unassigned code points (Unicode category Cn) may become valid characters in future versions of Unicode. Since the poster of an article might have access to a newer version of that standard, relaying and serving agents MUST accept such characters, but posting agents (and indeed all agents) MUST NOT generate them (though they might well follow up to newsgroup-names containing them). header =/ Newsgroups-header Newsgroups-header = "Newsgroups" ":" SP Newsgroups-content *( ";" extension-parameter ) Newsgroups-content = [FWS] newsgroup-name *( [FWS] ng-delim [FWS] newsgroup-name ) [FWS] newsgroup-name = component *( "." component ) component = 1*component-grapheme ng-delim = "," component-grapheme = combiner-base *combiner-mark combiner-base = combiner-ASCII / combiner-extended combiner-ASCII = DIGIT / ALPHA / "+" / "-" / "_" combiner-extended = combiner-mark = NOTE: the excluded characters in a combiner-extended are control characters (Cc), format control characters (Cf), surrogates (Cs), marks (M*) and separators (Z*). In particular, this excludes all whitespace characters. To all intents and purposes, a component-grapheme is what a user might regard as a single "character" as displayed on his screen, though it might be transmitted as several actual characters (e.g. q-circumflex is two characters). 
Note also that, in some writing schemes, several component-graphemes will merge into one visible object of variable size. Each component MUST be invariant under Unicode normalization NFKC (cf. the weaker normalization requirement for other headers in section 4.4.1 which specified no more than normalization NFC, and see also the explanatory NOTE in that section). NOTE: As a result of this restriction, a name has only one valid form. Implementations can assume that a straight (case sensitive) comparison of characters or octets is sufficient to compare two newsgroup-names. The requirement that names be invariant under NFKC, rather than NFC, means that all characters with a "compatibility decomposition" are forbidden (Unicode provides the property "NFKC_NO" to make this test easier). The effect is to exclude variant forms of characters, such as superscripts and subscripts, wide and narrow forms, font variants, encircled forms, ligatures, and so on, as their use could cause confusion. There is insufficient experience in this area to determine whether this is the right long-term solution. Implementors should therefore be aware that a future version of this standard might reduce the requirement in the direction of NFC as opposed to NFKC. NOTE: An implementation is not required to apply NFKC, or any other normalization, to newsgroup-names. Only agencies that create new groups need to be careful to obey this restriction (7.2.1). However, if a posting agent neglects to normalize a newsgroup-name entered manually, this may lead to the user posting to a non-existent group without understanding why. Newsgroup-names containing non-ASCII characters MUST be encoded in UTF-8. The use of [RFC 2047] encoding is inappropriate for reasons explained in section 4.4.1. Components beginning with underline ("_") are reserved for use by future versions of this standard and MUST NOT occur in newsgroup-names (whether in Newsgroups-headers or in newgroup control messages (7.2.1)).
However, such names MUST be accepted. Components beginning with "+" or "-" are reserved for use by implementations and MUST NOT occur in newsgroup-names (whether in Newsgroups-headers or in newgroup control messages). Implementors may assume that this rule will not change in any future version of this standard. NOTE: For example, implementors may safely use leading "+" and "-" to "escape" other entities within something that looks like a newsgroup-name. Agencies responsible for the administration of particular hierarchies Ought to place additional restrictions on the characters they allow in newsgroup-names within those hierarchies (such as to accord with the languages commonly used within those hierarchies, or to avoid perceived ambiguities pertinent to those languages). Where there is no such specific policy, the following restrictions SHOULD be applied to newsgroup-names. NOTE: These restrictions are intended to reflect existing practice, with some additions to accommodate foreseeable enhancements, and are intended both to avoid certain technical difficulties and to avoid unnecessary confusion. It may well be that experience will allow future extensions to this standard to relax some or all of these restrictions. The specific restrictions (to be applied in the absence of established policies to the contrary) are: 1. 
The following characters are forbidden, subject to the comments and notes at the end of the list: characters in category Cn (Other, Not assigned) [1] characters in category Co (Other, Private Use) [2] characters in category Lt (Letter, Titlecase) [3] characters in category Lu (Letter, Uppercase) [3] characters in category Me (Mark, Enclosing) [4] characters in category Pd (Punctuation, Dash) [4][5] characters in category Pe (Punctuation, Close) [4] characters in category Pf (Punctuation, Final quote) [4] characters in category Pi (Punctuation, Initial quote) [4] characters in category Po (Punctuation, Other) [4] characters in category Ps (Punctuation, Open) [4] characters in category Sc (Symbol, Currency) [4] characters in category Sk (Symbol, Modifier) [4] characters in category Sm (Symbol, Math) [4][5] characters in category So (Symbol, Other) [4] [1] As new characters are added to Unicode, the code point moves from category Cn to some other category. As stated above, implementors should be prepared for this. [2] Specific private use characters can be used within a hierarchy or co-operating subnet that has agreed meanings for them. [3] Traditionally, newsgroup-names have been written in lowercase. Posting agents Ought Not to convert uppercase or titlecase characters to the corresponding lowercase forms except under the explicit instructions of the poster. [4] Traditionally newsgroup-names have only used letters, digits, and the three special characters "+", "-" and "_". These categories correspond to characters outside that set. [5] Although the characters "+" and "-" are within categories Pd and Sm, they are not forbidden. 2. A component name is forbidden to consist entirely of digits. NOTE: This requirement was in [RFC 1036] but nevertheless several such groups have appeared in practice and implementors should be prepared for them. 
A common implementation technique uses each component as the name of a directory and uses numeric filenames for each article within a group. Such an implementation needs to be careful when this could cause a clash (e.g. between article 123 of group xxx.yyy and the directory for group xxx.yyy.123). 3. A component is limited to 30 component-graphemes and a newsgroup- name to 71 component-graphemes (counting also the '.'s separating the components). Whilst there is no longer any technical reason to limit the length of a component (formerly, it was limited to 14 octets) nor of a newsgroup-name, it should be noted that these names are also used in the newsgroups-line (7.2.1.2) where an overall policy limit applies and, moreover, excessively long names can be exceedingly inconvenient in practical use. Serving and relaying agents MUST accept any newsgroup-name that meets the above requirements, even if they violate one or more of the policy restrictions. Posting and injecting agents MAY reject articles containing newsgroup-names that do not meet these restrictions, and posting agents MAY attempt to correct them (but only with the explicit agreement of the poster for anything more than NFC or NFKC normalization). However, because of the large and changing tables required to do these checks and corrections throughout the whole of Unicode, this standard does not require them to do so. Rather, the onus is placed on those who create new newsgroups (7.2.1) to check the mandatory requirements, to consider the effects of relaxing the other restrictions, and to consider how all this may affect propagation of the group. Since future extensions to this standard and the Unicode standard, including a possible relaxation of the NFKC normalization, plus any relaxations of the default restrictions introduced by specific hierarchies might invalidate some such checks, warnings, and adjustments, implementations MUST incorporate means to disable them. 
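The checks described above (NFKC invariance, reserved leading characters, and the default policy restrictions 1 to 3) might be approximated as in the following sketch. It is illustrative only: the names check_newsgroup_name and grapheme_count are hypothetical, component-graphemes are approximated by counting every non-mark character, and Python's unicodedata module uses whatever Unicode version ships with the interpreter rather than [UNICODE 3.2]. In keeping with the requirement that such checks can be disabled, the function only reports problems and never rejects or alters a name.

```python
import unicodedata

# Categories forbidden by default policy restriction 1.
FORBIDDEN_CATEGORIES = frozenset(
    "Cn Co Lt Lu Me Pd Pe Pf Pi Po Ps Sc Sk Sm So".split()
)
PERMITTED_ANYWAY = frozenset("+-")  # see note [5] of restriction 1
MAX_COMPONENT = 30                  # component-graphemes per component
MAX_NAME = 71                       # component-graphemes per name, dots included


def grapheme_count(text: str) -> int:
    """Approximate count of component-graphemes: each non-mark character starts one."""
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


def check_newsgroup_name(name: str) -> list:
    """Return a list of problems found in a newsgroup-name (empty if none)."""
    problems = []
    if unicodedata.normalize("NFKC", name) != name:
        problems.append("mandatory: not invariant under NFKC")
    if grapheme_count(name) > MAX_NAME:
        problems.append("policy: name exceeds 71 component-graphemes")
    for component in name.split("."):
        if not component:
            problems.append("syntax: empty component")
            continue
        if component[0] in "_+-":
            problems.append(f"mandatory: reserved leading character in {component!r}")
        if component.isascii() and component.isdigit():
            problems.append(f"policy: component {component!r} is all digits")
        if grapheme_count(component) > MAX_COMPONENT:
            problems.append(f"policy: component {component!r} exceeds 30 component-graphemes")
        for ch in component:
            category = unicodedata.category(ch)
            if category in FORBIDDEN_CATEGORIES and ch not in PERMITTED_ANYWAY:
                problems.append(f"policy: U+{ord(ch):04X} is in forbidden category {category}")
    return problems


if __name__ == "__main__":
    for candidate in ("comp.lang.python", "Comp.lang", "alt.123", "alt.\ufb01le"):
        print(candidate, check_newsgroup_name(candidate))
```

A posting agent using something like this would show the problems to the poster as warnings, leaving the decision to post with the poster.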
      NOTE: The newsgroup-name as encoded in UTF-8 should be regarded as the canonical form. Reading agents may convert it to whatever character set they are able to display and serving agents may possibly need to convert it to some form more suitable as a filename. Simple algorithms for both kinds of conversion are readily available.

   Observe that the syntax does not allow comments within the Newsgroups-header; this is to simplify processing by relaying and serving agents which have a requirement to process this header extremely rapidly.

   The inclusion of folding white space within a Newsgroups-content is a newly introduced feature in this standard. It MUST be accepted by all conforming implementations (relaying agents, serving agents and reading agents). Posting agents should be aware that such postings may be rejected by overly-critical old-style relaying agents. When a sufficient number of relaying agents are in conformance, posting agents SHOULD generate such whitespace in the form of CRLF followed by WSP so as to keep the length of lines in the relevant headers (notably Newsgroups and Followup-To) to no more than 79 characters (or other agreed policy limit - see 4.5). Before such critical mass occurs, injecting agents MAY reformat such headers by removing whitespace inserted by the posting agent, but relaying agents MUST NOT do so.

   Posters SHOULD use only the names of existing newsgroups in the Newsgroups-header. However, it is legitimate to cross-post to newsgroups which do not exist on the posting agent's host, provided that at least one of the newsgroups DOES exist there, and followup agents SHOULD accept this (posting agents MAY accept it, but Ought at least to alert the poster to the situation and request confirmation).

   Relaying agents MUST NOT rewrite Newsgroups-headers in any way, even if some or all of the newsgroups do not exist on the relaying agent's host.
   Serving agents MUST NOT create new newsgroups simply because an unrecognized newsgroup-name occurs in a Newsgroups-header (see 7.2.1 for the correct method of newsgroup creation).

   The Newsgroups-header is intended for use in Netnews articles rather than in email messages. It MAY be used in an email message to indicate that it is a copy also posted to the listed newsgroups, in which case the inclusion of a Posted-And-Mailed header (6.9) would also be appropriate. However, it SHOULD NOT be used in an email-only reply to a Netnews article (thus the "inheritable" property of this header applies only to followups to a newsgroup, and not to followups to the poster). Moreover, if a newsgroup-name contains any non-ASCII character, it may need to be encoded using the mechanism defined in section 5.5.2. See also the further discussion in section 8.8.1.1.

5.5.1. Forbidden newsgroup-names

   The following forms of newsgroup-name MUST NOT be used except for the specific purposes indicated:

   o  Newsgroup-names having only one component. These are reserved for newsgroups whose propagation is restricted to a single host or local network, and for pseudo-newsgroups such as "poster" (which has special meaning in the Followup-To-header - see section 6.7), "junk" (often used by serving agents), and "control" (likewise);

   o  Any newsgroup-name beginning with "control." (used as pseudo-newsgroups by many serving agents);

   o  Any newsgroup-name containing the component "ctl" (likewise);

   o  "to" or any newsgroup-name beginning with "to." (reserved for the ihave/sendme protocol described in section 7.4, and for test messages sent on an essentially point-to-point basis);

   o  Any newsgroup-name beginning with "example." (reserved for examples in this and other standards);

   o  Any newsgroup-name containing the component "all" (because this is used as a wildcard in some implementations).

   A newsgroup-name SHOULD NOT appear more than once in the Newsgroups-header.
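   The reserved forms listed above lend themselves to a simple table-free test. The following is a sketch under stated assumptions, not a normative procedure: the function names reserved_reason and repeated_names are hypothetical, and the comparison is an exact, case-sensitive one on the grounds that newsgroup-names are traditionally written in lowercase.

```python
def reserved_reason(name):
    """Return why a newsgroup-name is reserved, or None if it is not.

    ASSUMPTION: components are compared exactly (case-sensitively).
    """
    components = name.split(".")
    if len(components) == 1:
        # Covers "poster", "junk", "control", "to" and purely local groups.
        return "single component: local or pseudo-newsgroup"
    if components[0] == "control":
        return 'begins with "control."'
    if "ctl" in components:
        return 'contains the component "ctl"'
    if components[0] == "to":
        return 'begins with "to."'
    if components[0] == "example":
        return 'begins with "example."'
    if "all" in components:
        return 'contains the component "all"'
    return None


def repeated_names(names):
    """Return the newsgroup-names that occur more than once, in order."""
    seen, repeated = set(), []
    for name in names:
        if name in seen and name not in repeated:
            repeated.append(name)
        seen.add(name)
    return repeated
```

   For instance, reserved_reason("alt.allied") yields None because "allied" is a different component from "all", whereas reserved_reason("alt.all.stuff") reports the wildcard clash.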
   The order of newsgroup-names in the Newsgroups-header is not significant, except for determining which moderator to send the article to if more than one of the groups is moderated (see 8.2).

5.5.2. Encoded newsgroup-names

   Where it is required to transport an article across some medium that cannot reliably convey the full 8 bits of each octet, such as when gatewaying it into Email (8.8.1.1), or when emailing it to a moderator or constructing the submission address of the moderator (8.2.2), it will be necessary under the current email standards to encode any newsgroup-name that contains some non-ASCII character (such as one occurring within a Newsgroups- or Followup-To-header). For that purpose, the following algorithm is provided:

   1. Initially, the newsgroup-name is in the form of a sequence of octets representing that name in the UTF-8 character set.

   2. Each octet in the name in the range 0x80-FF is replaced by an "=" character (US-ASCII 61), followed by two characters representing that octet in hexadecimal, in which the hexadecimal digits "A" through "F" MUST be in uppercase.

   3. Each octet in the name in the range 0x00-7F remains unaltered (and thus MUST NOT be replaced by its hexadecimal equivalent).

      NOTE: Observe that this algorithm provides a unique encoding for each newsgroup-name. Observe also that within the unaltered range 0x00-7F, only the octets 0x2B, 0x2D-2E, 0x30-39, 0x41-5A, 0x5F, and 0x61-7A can appear in a newsgroup-name.

   This standard provides no authority for the use of this algorithm other than in the context of newsgroup-names occurring within headers being conveyed by email. In particular, it MUST NOT be used within any article conveyed by the Netnews protocols and thus, if an email using it is subsequently returned to the Netnews environment, it MUST be decoded back into UTF-8.

5.6. Path

   The Path-header shows the route taken by a message since its entry into the Netnews system.
   It is a variant header (4.2.5.3), each agent that processes an article being required to add one (or more) entries to it. This is primarily to enable relaying agents to avoid sending articles to sites already known to have them, in particular the site they came from, and additionally to permit tracing the route articles take in moving over the network, and for gathering Usenet statistics. Finally the presence of a '%' path-delimiter in the Path-header can be used to identify an article injected in conformance with this standard.

5.6.1. Format

      header          =/ Path-header
      Path-header     = "Path" ":" SP Path-content
                           *( ";" extension-parameter )
      Path-content    = [FWS]
                           *( path-identity [FWS] path-delimiter [FWS] )
                           tail-entry [FWS]
      path-identity   = ( ALPHA / DIGIT )
                           *( ALPHA / DIGIT / "-" / "." / ":" / "_" )
      path-delimiter  = "/" / "?" / "%" / "," / "!"
      tail-entry      = path-identity

      NOTE: A Path-content will inevitably contain at least one path-identity, except possibly in the case of a proto-article that has not yet been injected onto the network.

      NOTE: Observe that the syntax does not allow comments within the Path-header; this is to simplify processing by relaying and injecting agents which have a requirement to process this header extremely rapidly.

   A relaying agent SHOULD NOT pass an article to another relaying agent whose path-identity (or some known alias thereof) already appears in the Path-content. Since the comparison may be either case sensitive or case insensitive, relaying agents SHOULD NOT generate a name which differs from that of another site only in terms of case.

   A relaying agent MAY decline to accept an article if its own path-identity is already present in the Path-content or if the Path-content contains some path-identity whose articles the relaying agent does not want, as a matter of local policy.
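   By way of illustration, the grammar above can be tokenized with a single regular expression, and the result used for the "already present" test. This is a sketch only: the names parse_path_content and already_present are hypothetical, folding white space is approximated by arbitrary whitespace, any extension-parameters are assumed to have been removed beforehand, the tail-entry is left out of the comparison because section 5.6.3 treats it as a user name, and the case-insensitive comparison is merely one of the two choices the text permits.

```python
import re

# path-identity = ( ALPHA / DIGIT ) *( ALPHA / DIGIT / "-" / "." / ":" / "_" )
IDENTITY = r"[A-Za-z0-9][A-Za-z0-9.:_-]*"
# One path-identity, optionally followed by a path-delimiter.
# ASSUMPTION: \s* stands in for [FWS].
ENTRY = re.compile(r"\s*(" + IDENTITY + r")\s*([/?%,!])?")


def parse_path_content(content):
    """Split a Path-content into ([(identity, delimiter), ...], tail_entry)."""
    entries = []
    pos = 0
    while True:
        match = ENTRY.match(content, pos)
        if match is None:
            raise ValueError(f"no path-identity at offset {pos}")
        identity, delimiter = match.group(1), match.group(2)
        pos = match.end()
        if delimiter is None:
            # No delimiter follows, so this must be the tail-entry.
            if content[pos:].strip():
                raise ValueError(f"unexpected text at offset {pos}")
            return entries, identity
        entries.append((identity, delimiter))


def already_present(content, own_names):
    """True if any of own_names occurs as a path-identity before the tail."""
    entries, _tail = parse_path_content(content)
    wanted = {name.lower() for name in own_names}
    return any(identity.lower() in wanted for identity, _ in entries)
```

   A relaying agent could call already_present() with its own path-identity and known aliases before offering an article to a peer, or with a list of unwanted identities when applying local policy.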
      NOTE: This last facility is sometimes used to detect and decline control messages (notably cancel messages) which have been deliberately seeded with a path-identity to be "aliased out" by sites not wishing to act upon them.

5.6.2. Adding a path-identity to the Path-header

   When an injecting, relaying or serving agent receives an article, it MUST prepend its own path-identity followed by a path-delimiter to the beginning of the Path-content. In addition, it SHOULD then add CRLF and WSP if it would otherwise result in a line longer than 79 characters.

   The path-identity added MUST be unique to that agent. To this end it SHOULD be one of:

   1. A fully qualified domain name (FQDN) associated (by the Internet DNS service [RFC 1034]) with an A record, which SHOULD identify the actual machine prepending this path-identity. Ideally, this FQDN should also be "mailable" (see below).

   2. A fully qualified domain name (FQDN) associated (by the Internet DNS service) with an MX record, which MUST be "mailable".

   3. An arbitrary name believed to be unique and registered at least with all sites immediately downstream from the given site.

   4. An encoding of an IP address - either an IPv4 address or an IPv6 address [RFC 2373] (the requirement to be able to use an IPv6 address is the reason for including ':' as an allowed character within a path-identity).

   The FQDN of an agent is "mailable" if the administrators of that agent can be reached by email using both of the forms "usenet@<FQDN>" and "news@<FQDN>", in conformity with [RFC 2142].

   Of the above options, nos. 1 to 3 are much to be preferred, unless there are strong technical reasons dictating otherwise. In particular, the injecting agent's path-identity MUST, as a special case, be an FQDN as in option 1 or option 2, and MUST be mailable. Additionally, in the case of an injecting agent offering its services to the general public, its administrators MUST also be reachable using the form "abuse@<FQDN>" UNLESS a more specific complaints address has been specified in a Complaints-To-header (6.20).
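   The prepending step described at the start of this section, including the optional fold, might be sketched as follows. This is not a definitive implementation: the function name prepend_path_identity is hypothetical, the 79-character limit is assumed to be measured on the whole first line including the leading "Path: ", and the choice of path-delimiter is left to the caller.

```python
def prepend_path_identity(path_content, identity, delimiter="/", limit=79):
    """Prepend identity and delimiter to a Path-content, folding if needed.

    ASSUMPTION: the line length is counted from the start of the header
    line, i.e. it includes the six characters of "Path: ".
    """
    header_prefix = "Path: "
    # Only the first physical line of an already folded Path-content counts.
    first_line = path_content.split("\r\n", 1)[0]
    new_entry = identity + delimiter
    if len(header_prefix) + len(new_entry) + len(first_line) > limit:
        # Add CRLF and WSP after the new entry to keep the line short.
        return new_entry + "\r\n " + path_content
    return new_entry + path_content
```

   For example, prepend_path_identity("x", "news.example", "%") gives "news.example%x", while a Path-content whose first line is already 70 characters long is pushed onto a continuation line.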
   The injecting agent's path-identity MUST be followed by the special path-delimiter '%' which serves to separate the pre-injection and post-injection regions of the Path-content (see 5.6.3).

   In the case of a relaying or serving agent, the path-delimiter is chosen as follows. When such an agent receives an article, it MUST establish the identity of the source and compare it with the leftmost path-identity of the Path-content. If it matches, a '/' should be used as the path-delimiter when prepending the agent's own path-identity. If it does not match then the agent should prepend two entries to the Path-content; firstly the true established path-identity of the source followed by a '?' path-delimiter, and then, to the left of that, the agent's own path-identity followed by a '/' path-delimiter as usual. This prepending of two entries SHOULD NOT be done if the provided and established identities match.

   Any method of establishing the identity of the source may be used (but see 5.6.5 below), with the consideration that, in the event of problems, the agent concerned may be called upon to justify it.

      NOTE: The use of the '%' path-delimiter marks the position of the injecting agent in the chain. In normal circumstances there should therefore be only one '%' path-delimiter present, and injecting agents MAY choose to reject proto-articles with a '%' already in them. If, for whatever reason, more than one '%' is found, then the path-identity in front of the leftmost '%' is to be regarded as the true injecting agent.

5.6.3. The tail-entry

   For historical reasons, the tail-entry (i.e. the rightmost entry in the Path-content) is regarded as a "user name", and therefore MUST NOT be interpreted as a site through which the article has already passed. Moreover, the Path-content as a whole is not an email address and MUST NOT be used to contact the poster. Posting and/or injecting agents MAY place any string here.
   When it is not an actual user name, the string "not-for-mail" is often used, but in fact a simple "x" would be sufficient.

   Often this field will be the only entry in the region (known as the pre-injection region) after the '%', although there may be entries corresponding to machines traversed between the posting agent and the injecting agent proper. In particular, injecting agents that receive articles from many sources MAY include information to establish the circumstances of the injection such as the identity of the source machine (especially if an Injector-Info-header (6.19) is not being provided). Any such inclusion SHOULD NOT conflict with any genuine site identifier. The '!' path-delimiter may be used freely within the pre-injection region, although '/' and '?' are also appropriate if used correctly.

5.6.4. Path-Delimiter Summary

   A summary of the various path-delimiters. The name immediately to the left of the path-delimiter is always that of the machine which added the path-delimiter.

   '/'    The name immediately to the right is known to be the identity of the machine from which the article was received (either because the entry was made by that machine and we have verified it, or because we have added it ourselves).

   '?'    The name immediately to the right is the claimed identity of the machine from which the article was received, but we were unable to verify it (and have prepended our own view of where it came from, and then a '/').

   '%'    Everything to the right is the pre-injection region followed by the tail-entry. The name on the left is the FQDN of the injecting agent. The presence of two '%'s in a path indicates a double-injection (see 8.2.2).

   '!'    The name immediately to the right is unverified. The presence of a '!' to the left of the '%' indicates that the identity to the left is that of an old-style system not conformant with this standard.

   ','    Reserved for future use, treat as '/'.
   Other  Old software may possibly use other path-delimiters, which should be treated as '!'. But note in particular that ':', '-' and '_' are components of names, not path-delimiters, and FWS on its own MUST NOT be used as the sole path-delimiter.

      NOTE: Old Netnews relaying and injecting agents almost all delimit Path entries with a '!', and these entries are not verified. The presence of '%' indicates that the article was injected by software conforming to this standard, and the presence of '!' to the left of a '%' indicates that the message passed through systems developed prior to this standard. It is anticipated that relaying agents will reject articles in the old style once this new standard has been widely adopted.

5.6.5. Suggested Verification Methods

   It is preferable to verify the claimed path-identity against the source rather than to make routine use of the '?' path-delimiter, with consequential wasteful double-entry Path additions.

   If the incoming article arrives through some TCP/IP protocol such as NNTP, the IP address of the source will be known, and will likely already have been checked against a list of known FQDNs, IP addresses, or other registered aliases that the receiving site has agreed to peer with. Since the source host may have several IP addresses, checking the claimed FQDN or IP address against the source IP, or finding a suitable FQDN to report with a '?' path-delimiter, may involve several DNS lookups, following CNAME chains as required. Note that any reverse DNS lookup that is involved needs to be confirmed by a forward one.

   If the incoming article arrives through some other protocol, such as UUCP, that protocol MUST include a means of verifying the source site. In UUCP implementations, commonly each incoming connection has a unique login name and password, and that login name (or some alias registered for it) would be expected as the path-identity.

5.6.6. Example

      Path: foo.isp.example/
        foo-server/bar.isp.example?10.123.12.2/old.site.example!
        barbaz/baz.isp.example%dialup123.baz.isp.example!x

      NOTE: That article was injected into the news stream by baz.isp.example (complaints may be addressed to abuse@baz.isp.example). The injector has taken care to record that it got it from dialup123.baz.isp.example. "x" is a dummy tail-entry, though sometimes a real userid is put there.

      The article was relayed, perhaps by UUCP, to the machine known, at least to its downstream, as "barbaz".

      Barbaz relayed it to old.site.example, which does not yet conform to this standard (hence the '!' path-delimiter). So one cannot be sure that it really came from barbaz.

      Old.site.example relayed it to a site claiming to have the IP address [10.123.12.2], and claiming (by using the '/' path-delimiter) to have verified that it came from old.site.example.

      [10.123.12.2] relayed it to "foo-server" which, not being convinced that it truly came from [10.123.12.2], did a reverse lookup on the actual source and concluded it was known as bar.isp.example (that is not to say that [10.123.12.2] was not a correct IP address for bar.isp.example, but simply that that connection could not be substantiated by foo-server). Observe that foo-server has now added two entries to the Path.

      "foo-server" is a locally significant name within the complex site of many machines run by foo.isp.example, so the latter should have no problem recognizing foo-server and using a '/' path-delimiter.

      Presumably foo.isp.example then delivered the article to its direct clients.

      It appears that foo.isp.example and old.site.example decided to fold the line, on the grounds that it seemed to be getting a little too long.

6. Optional Headers

   None of the headers appearing in this section is required to appear in every article but some of them are required in certain types of article, such as followups. Any header defined in this (or any other) standard MUST NOT appear more than once in an article unless specifically stated otherwise.
   Experimental headers (4.2.5.1) and headers defined by cooperating subnets are exempt from this requirement. See section 8 "Duties of Various Agents" for the full picture.

6.1. Reply-To

   The Reply-To-header specifies a reply address(es) to be used for personal replies for the poster(s) of the article when this is different from the poster's address(es) given in the From-header. The content syntax makes use of syntax defined in [RFC 2822], but subject to the revised definition of local-part given in section 5.2.

      header          =/ Reply-To-header
      Reply-To-header = "Repl