bpo-35805: Add parser for Message-ID email header. #13397
This only adds parser for |
I realize sadly I no longer remember how this all works, but based on my recollection, this looks good to me. I would be interested in @bitdancer 's opinion too.
438f548 to 244541f
Overall this looks great. Thanks!
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
@warsaw: I'm not sure I remember how it all works, and I wrote it. As far as I can tell @maxking did a great job of figuring out the principles behind this rather baroque parser :) For the record, this parser is essentially a "first draft" (or maybe a 1.5 draft, I forget), which I had hoped to come back to, extract what I learned from writing it, and write a version 2 that was more consistent and easier to understand (and most importantly would produce a parse tree that was easier to interrogate and manipulate). But at this point it seems unlikely I'll ever manage to find time to do that.
This parser is based on the definition of Identification Fields from RFC 5322 Sec 3.6.4. This should also prevent folding of Message-ID header using RFC 2047 encoded words and hence fix bpo-35805.
Also, remove empty lines from classes that don't have any methods.
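A minimal sketch of the behaviour this description aims for, assuming Python 3.8 or later (where this parser is available). The `LONG_MSG_ID` value and the `render_headers` helper are hypothetical and exist only for illustration; the sketch only checks that the msg-id survives serialization intact and is not turned into RFC 2047 encoded words.

```python
import email.message
import email.policy

# Hypothetical, deliberately long Message-ID (well over the 78-char line limit).
LONG_MSG_ID = "<" + "x" * 90 + "@mail.example.com>"


def render_headers(msg_id):
    """Serialize a message that carries only a Message-ID header."""
    msg = email.message.EmailMessage(policy=email.policy.SMTP)
    msg["Message-ID"] = msg_id
    return msg.as_string()


if __name__ == "__main__":
    rendered = render_headers(LONG_MSG_ID)
    # The msg-id token should appear unbroken, with no "=?utf-8?q?...?=" words.
    print(LONG_MSG_ID in rendered, "=?utf-8?" in rendered)
```

Checking for the unbroken token rather than an exact output string keeps the sketch independent of where the generator decides to place the line break relative to the header name.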
I have made the requested changes; please review again
Thanks for making the requested changes! @bitdancer, @warsaw: please review the changes made to this pull request.
@bitdancer I have made the changes you requested for the folding of msg-id tokens.
Reviewing in real-time w/maxking, we think this is ready to land
    except errors.HeaderParseError:
        message_id.defects.append(errors.InvalidHeaderDefect(
            "Expected msg-id but found {!r}".format(value)))
    message_id.append(token)
This appears to bomb out when building HyperKitty, because `token` is referenced before assignment when the `try` block fails.
======================================================================
ERROR: test_long_message_id (hyperkitty.tests.lib.test_incoming.TestAddToList)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./hyperkitty/tests/lib/test_incoming.py", line 295, in test_long_message_id
    msg["Message-ID"] = "X" * 260
  File "/usr/lib/python3.8/email/message.py", line 409, in __setitem__
    self._headers.append(self.policy.header_store_parse(name, val))
  File "/usr/lib/python3.8/email/policy.py", line 148, in header_store_parse
    return (name, self.header_factory(name, value))
  File "/usr/lib/python3.8/email/headerregistry.py", line 602, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.8/email/headerregistry.py", line 197, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.8/email/headerregistry.py", line 530, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2116, in parse_message_id
    message_id.append(token)
UnboundLocalError: local variable 'token' referenced before assignment
Yep, same for me. Also, accessing `value[0]` before checking if `value` is there…
@surkova Thank you! I think this should be fixed up properly. Opened BPO https://bugs.python.org/issue38708
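The failure mode discussed in this thread can be reproduced in isolation. The following is a simplified sketch, not the code from `email/_header_value_parser.py`: `HeaderParseError`, `get_msg_id`, and both `parse_message_id_*` functions are hypothetical stand-ins, and the guarded variant is just one possible shape for a fix, not necessarily the one tracked in bpo-38708.

```python
class HeaderParseError(Exception):
    """Stand-in for email.errors.HeaderParseError (hypothetical)."""


def get_msg_id(value):
    """Greatly simplified tokenizer: return (msg_id, remainder) or raise."""
    # Check that value is non-empty *before* indexing into it.
    if not value or value[0] != "<":
        raise HeaderParseError("expected msg-id but found {!r}".format(value))
    end = value.find(">")
    if end == -1:
        raise HeaderParseError("unterminated msg-id in {!r}".format(value))
    return value[:end + 1], value[end + 1:]


def parse_message_id_buggy(value):
    """Mirrors the reviewed snippet: token is used even if parsing failed."""
    tokens, defects = [], []
    try:
        token, value = get_msg_id(value)
    except HeaderParseError:
        defects.append("Expected msg-id but found {!r}".format(value))
    tokens.append(token)  # UnboundLocalError when the try block raised
    return tokens, defects


def parse_message_id_guarded(value):
    """Only append the token when get_msg_id actually produced one."""
    tokens, defects = [], []
    try:
        token, value = get_msg_id(value)
    except HeaderParseError:
        defects.append("Expected msg-id but found {!r}".format(value))
    else:
        tokens.append(token)
    return tokens, defects
```

With the same input as the failing HyperKitty test (`"X" * 260`), the first variant raises `UnboundLocalError`, while the second records a defect and returns no tokens.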
* bpo-35805: Add parser for Message-ID header.

  This parser is based on the definition of Identification Fields from RFC 5322 Sec 3.6.4. This should also prevent folding of Message-ID header using RFC 2047 encoded words and hence fix bpo-35805.
* Prevent folding of non-ascii message-id headers.
* Add fold method to MsgID token to prevent folding.
@maxking was there any progress made on fixing the folding of other identification fields elsewhere by any chance? Or is this still an open issue? Thanks!
Python 3 before 3.8 has a bug that causes the email.policy classes to incorrectly fold and RFC2047-encode "identification fields" in email messages. This mainly applies to Message-Id, References, and In-Reply-To fields.

We are impacted by this bug since odoo#35929 where we switched to using the "modern" email.message API.

RFC2047 section 5 clearly states that those headers/fields are not to be encoded, and that would violate RFC5322. Further, such a folded Message-Id is considered non-RFC-conformant by popular MTAs (GMail, Outlook), which will then generate *another* Message-Id field, causing the original threading information to be lost. Replies to such a modified message will reference the new, unknown Message-Id, and won't be attached to the original thread.

The solution we adopt here is to monkey-patch the SMTP policies to special-case those identification fields and deactivate the automatic folding, until the bug is properly and fully fixed in the standard lib.

Some considerations taken into account for this patch:

- `email.policy.SMTP` is being monkey-patched globally to make sure we fix all possible places where Messages are being encoded/folded
- the fix is **not** made version-specific, considering that even in Python 3.8 the official bugfix only applies to Message-Id, but still fails to protect other identification fields, like *References* and *In-Reply-To*. The author specifically noted that shortcoming [2]. The fix wouldn't break anything on Python 3.8 anyway.
- the `noFoldPolicy` trick for preventing folding is done with no max line length at all. RFC5322, section 2.1.1 states [3] that the maximum length is 998 due to legacy implementations, but there is no provision to wrap identification fields that are longer than that. Wrapping at 998 chars would corrupt the header anyway. We'll just count on the fact that we don't usually need 1k+ chars in those headers.

The invalid folding/encoding in action on Python 3.6 (in Python 3.8 only the second header gets folded):

```py
>>> msg = email.message.EmailMessage(policy=email.policy.SMTP)
>>> msg['Message-Id'] = '<929227342217024.1596730490.324691772460938-example-30661-some.reference@test-123.example.com>'
>>> msg['In-Reply-To'] = '<92922734221723.1596730568.324691772460444-another-30661-parent.reference@test-123.example.com>'
>>> print(msg.as_string())
Message-Id: =?utf-8?q?=3C929227342217024=2E1596730490=2E324691772460938-exam?=
 =?utf-8?q?ple-30661-some=2Ereference=40test-123=2Eexample=2Ecom=3E?=
In-Reply-To: =?utf-8?q?=3C92922734221723=2E1596730568=2E324691772460444-anot?=
 =?utf-8?q?her-30661-parent=2Ereference=40test-123=2Eexample=2Ecom=3E?=
```

and the expected result after the fix:

```py
>>> msg = email.message.EmailMessage(policy=email.policy.SMTP)
>>> msg['Message-Id'] = '<929227342217024.1596730490.324691772460938-example-30661-some.reference@test-123.example.com>'
>>> msg['In-Reply-To'] = '<92922734221723.1596730568.324691772460444-another-30661-parent.reference@test-123.example.com>'
>>> print(msg.as_string())
Message-Id: <929227342217024.1596730490.324691772460938-example-30661-some.reference@test-123.example.com>
In-Reply-To: <92922734221723.1596730568.324691772460444-another-30661-parent.reference@test-123.example.com>
```

[1] bpo-35805: https://bugs.python.org/issue35805
[2] python/cpython#13397 (comment)
[3] https://tools.ietf.org/html/rfc5322#section-2.1.1
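The idea of special-casing identification fields can be sketched without touching `email.policy.SMTP` globally. The example below is an illustration only, not the patch described in this commit message: it swaps the monkey-patch for a policy subclass, and the names `IDENTIFICATION_FIELDS`, `NoFoldIdentificationPolicy`, and `render` are hypothetical. It relies only on the public `fold` and `fold_binary` hooks of `email.policy.EmailPolicy`.

```python
import email.message
import email.policy

# Header names treated as identification fields (compared case-insensitively).
IDENTIFICATION_FIELDS = {"message-id", "references", "in-reply-to"}


class NoFoldIdentificationPolicy(email.policy.EmailPolicy):
    """Sketch: emit identification fields on one line, never RFC 2047-encoded."""

    def fold(self, name, value):
        if name.lower() in IDENTIFICATION_FIELDS:
            # Collapse any existing line breaks and write the value verbatim.
            flat = " ".join(str(value).split())
            return "{}: {}{}".format(name, flat, self.linesep)
        return super().fold(name, value)

    def fold_binary(self, name, value):
        if name.lower() in IDENTIFICATION_FIELDS:
            return self.fold(name, value).encode("ascii", "surrogateescape")
        return super().fold_binary(name, value)


def render(message_id, in_reply_to):
    """Build a message with both headers and serialize it with the sketch policy."""
    msg = email.message.EmailMessage(policy=NoFoldIdentificationPolicy())
    msg["Message-Id"] = message_id
    msg["In-Reply-To"] = in_reply_to
    return msg.as_string()
```

As in the commit message, no maximum line length is applied to these fields, so a header longer than 998 characters would still be written on a single line.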
This parser is based on the definition of Identification Fields from RFC 5322
Sec 3.6.4.
This should also prevent folding of Message-ID header using RFC 2047 encoded
words and hence fix bpo-35805.
https://bugs.python.org/issue35805