The Wayback Machine - https://web.archive.org/web/20250522123707/https://github.com/python/cpython/pull/133549
Skip to content

gh-133546: Make re.Match a well-rounded Sequence type #133549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

vberlier
Copy link
Contributor

@vberlier vberlier commented May 7, 2025

@@ -1378,6 +1378,27 @@ when there is no match, you can test whether there was a match with a simple
if match:
process(match)

Match objects are proper :class:`~collections.abc.Sequence` types. You can access
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true with this PR, Sequence has a number of other requirements (e.g. an index and count method).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks! I added index and count by adapting the implementation of tuple.index and tuple.count. I also updated the unit test to cover all Sequence mixin methods.


if (start < 0) {
start += self->groups;
if (start < 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should take a quick look at PEP-7 for our style guide. We try to avoid omitting brackets. There is a bit of inconsistency in this file, but it's good to just follow PEP 7 for new code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read through PEP-7 already but I didn't really think about it here since I copied this directly from tupleobject.c. Let me clean this up!


if (index < 0 || index >= self->groups) {
/* raise IndexError if we were given a bad group number */
if (!PyErr_Occurred()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't conditionally raise the error. match_item shouldn't ever be called with an exception set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

PyObject* group = match_getslice_by_index(self, i, Py_None);
if (group == NULL)
return NULL;
int cmp = PyObject_RichCompareBool(group, value, Py_EQ);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we need to incref the value in case the __bool__ releases a reference. cc @picnixz who fixed a bunch of issues related to that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is what you're talking about, but I think this is already handled by PyObject_RichCompareBool out of the box. It's a wrapper around PyObject_RichCompare that unwraps and decref the returned object.

Copy link
Member

@picnixz picnixz May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about the returned object, it's about the inputs. PyObject_RichCompareBool and PyObject_RichCompare can call arbitrary Python code in general, so if such arbitrary code actually releases or changes their inputs, we may end up with what we call a "use-after-free" (and usually it leads to a SIGSEV, but in other occurrences it can be a security issue).

In order to determine whether there is a UAF or not, you need to check whether group or value can actually be obtained from pure Python code and be changed by the call to __eq__. In this case, value can be modified in place by its call, but _sre_SRE_Match_index_impl should have already held a strong reference on it, so there shouldn't be a UAF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation!

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test coverage is sometimes insufficient and sometimes redundant with existing ones.

@@ -1473,6 +1494,37 @@ when there is no match, you can test whether there was a match with a simple

.. versionadded:: 3.6

.. versionchanged:: 3.14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. versionchanged:: 3.14
.. versionchanged:: next

>>> len(m)
3

.. versionadded:: 3.14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. versionadded:: 3.14
.. versionadded:: next


Raises :exc:`ValueError` if the value is not present.

.. versionadded:: 3.14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. versionadded:: 3.14
.. versionadded:: next


Return the number of occurrences of the value among the matched groups.

.. versionadded:: 3.14
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. versionadded:: 3.14
.. versionadded:: next

Comment on lines +627 to +628
self.assertEqual(m[1:-1], ("a", "b"))
self.assertEqual(m[::-1], ("c", "b", "a", "abc"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check more slices.

self.assertEqual(m.count("a"), 1)
self.assertEqual(m.count("b"), 1)
self.assertEqual(m.count("c"), 1)
self.assertEqual(m.count("123"), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check with non-string objects.

Comment on lines 666 to 667
case [_, "a", "b", "c"]:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's catch the first group (the entire match) and assert it.

self.fail()

match re.match(r"(\d+)-(\d+)-(\d+)", "2025-05-07"):
case [_, year, month, day]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do the same here.

@@ -599,8 +603,83 @@ def test_match_getitem(self):
with self.assertRaises(TypeError):
m[0] = 1

# No len().
self.assertRaises(TypeError, len, m)
def test_match_sequence(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's split this into multiple test: _index, _count and _iter, as well as match case. For getitem, let's move them under test_match_getitem as well and reduce the number of duplicated checks.

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional feedback. For the test failures, let's add some assertWarns. I don't think we should make some additional checks in C for the type of the operands (the issue is when you compare the input (which is a bytes) with the matched item (which are strings), hence a BytesWarning). So, instead, let's just expect the warning.

By the way, could you avoid using @picnixz in your commits as I'm getting pinged? TiA.

def test_match_sequence(self):
from collections.abc import Sequence

# Slices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slices can be part of a new test if you want.

@@ -2491,7 +2491,10 @@ match_subscript(PyObject *op, PyObject* item)
if (self->pattern->groupindex) {
PyObject* index = PyDict_GetItemWithError(self->pattern->groupindex, item);
if (index && PyLong_Check(index)) {
return match_item(op, PyLong_AsSsize_t(index));
Py_ssize_t i = PyLong_AsSsize_t(index);
if (i != -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 is strictly speaking a valid one, so you should use PyErr_Occurred()

@@ -2750,12 +2741,12 @@ _sre_SRE_Match_count_impl(MatchObject *self, PyObject *value)
}
int cmp = PyObject_RichCompareBool(group, value, Py_EQ);
Py_DECREF(group);
if (cmp < 0) {
return NULL;
}
if (cmp > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you can use an else if here

@vberlier
Copy link
Contributor Author

Oh, sorry for the ping. Is it okay if I just leave out the cases with bytes as input? I can't seem to trigger the warning locally, even with ./python -Werror -m test.

@vberlier vberlier requested a review from picnixz May 20, 2025 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants