Reduce usage of regex #2644

JelleZijlstra · 2021-11-25T15:02:01Z

This removes all but one usage of the regex dependency. Tricky bits included:

- A bug in `test_black.py` where we were incorrectly using a character range. Fix also submitted separately in "fix regex in test_black.py" #2643.
- `tokenize.py` was the original use case for `regex` (#455, "Fix bug with tricky unicode symbols" #1047). The important bit is that we rely on `\w` to match anything valid in an identifier, and the standard library's `re` fails to match a few characters that are valid in identifiers. My solution is to instead match all characters except those we know to mean something else in Python: whitespace and ASCII punctuation. This will make Black able to parse some invalid Python programs, like those that contain non-ASCII punctuation in the place of an identifier, but that seems fine to me.
- One import of `regex` remains, in `trans.py`. We use a recursive regex to parse f-strings, and only `regex` supports that. I haven't thought of a better fix there (except maybe writing a manual parser), so I'm leaving that for now.
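To make the "manual parser" alternative for `trans.py` a bit more concrete, here is a rough sketch of scanning for balanced braces without a recursive regex. This is not what Black does: `iter_replacement_fields` is a hypothetical helper, and it deliberately ignores details such as quotes inside the expression.

```python
from typing import Iterator, Tuple


def iter_replacement_fields(body: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans of top-level {...} fields in an f-string body.

    Hypothetical sketch: handles "{{" escapes and nested braces only.
    """
    i = 0
    n = len(body)
    while i < n:
        if body[i] != "{":
            i += 1
            continue
        if i + 1 < n and body[i + 1] == "{":
            i += 2  # "{{" is an escaped literal brace, not a field
            continue
        # Track nesting depth until the opening brace is closed.
        depth = 1
        j = i + 1
        while j < n and depth:
            if body[j] == "{":
                depth += 1
            elif body[j] == "}":
                depth -= 1
            j += 1
        if depth:
            return  # unbalanced braces: give up
        yield (i, j)
        i = j


body = "a {x} b {{c}} {d:{w}}"
fields = [body[start:end] for start, end in iter_replacement_fields(body)]
print(fields)
```

The depth counter is the part a non-recursive regex cannot express, which is why the nested format spec in `{d:{w}}` is the interesting case.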

My goal is to remove the regex dependency to reduce the risk of breakage due to dependencies and make life easier for users on platforms without wheels.

ichard26

I like it and so does diff-shades, which reports zero changes. It's probably worth putting some more thought into the tokenize hack before merging this, though.

ichard26 · 2021-11-26T01:09:09Z

CHANGES.md

@@ -6,6 +6,7 @@

 - Fixed Python 3.10 support on platforms without ProcessPoolExecutor (#2631)
 - Fixed `match` statements with open sequence subjects, like `match a, b:` (#2639)
+- Reduce usage of the `regex` dependency (#2644)


Is this worth mentioning in the changelog? This doesn't really have an impact on end users since we still depend on regex unconditionally, so all of the problems involved in that will persist. Not a big deal but I wanted to flag this.

My thinking was that there's some chance it does have an impact on end users, so it's worth mentioning.

ichard26 · 2021-11-26T01:12:50Z

src/blib2to3/pgen2/tokenize.py

@@ -86,7 +86,7 @@ def _combinations(*l):
 Comment = r"#[^\r\n]*"
 Ignore = Whitespace + any(r"\\\r?\n" + Whitespace) + maybe(Comment)
 Name = (  # this is invalid but it's fine because Name comes after Number in all groups
-    r"\w+"
+    r"[^\s#\(\)\[\]\{\}+\-*/!@$%^&=|;:'\",\.<>/?`~\\]+"


At this point we might need to add a FAQ entry describing why Black is so inconsistent at detecting invalid syntax. We don't promise that Black will fail on all invalid code, but people do reasonably assume consistency. We don't need to get into the nitty-gritty; it's enough to explain that this approach requires less work while achieving a high degree of compatibility.

Yes, I can add that separately.
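To illustrate the trade-off discussed here, a small sketch comparing the old and new `Name` patterns under the standard library's `re`. The two patterns are copied from the diff above; the `compare` helper and the sample strings are made up for illustration.

```python
import re
from typing import Tuple

# Old and new Name patterns, copied from the tokenize.py diff.
OLD_NAME = re.compile(r"\w+")
NEW_NAME = re.compile(r"[^\s#\(\)\[\]\{\}+\-*/!@$%^&=|;:'\",\.<>/?`~\\]+")


def compare(text: str) -> Tuple[bool, bool, bool]:
    """Return (valid identifier?, old pattern matches?, new pattern matches?)."""
    return (
        text.isidentifier(),
        OLD_NAME.fullmatch(text) is not None,
        NEW_NAME.fullmatch(text) is not None,
    )


# Illustrative samples: a plain name, U+2118 (a valid identifier that is
# not alphanumeric), U+00AB (non-ASCII punctuation), and an expression.
for sample in ["spam", "\u2118", "\u00ab", "a+b"]:
    print(repr(sample), compare(sample))
```

U+2118 is the kind of identifier `\w` misses under `re`, and U+00AB is the kind of invalid input the new pattern now lets through.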

ichard26 · 2021-11-26T01:14:58Z

tests/test_black.py

@@ -70,7 +70,7 @@
 R = TypeVar("R")

 # Match the time output in a diff, but nothing else
-DIFF_TIME = re.compile(r"\t[\d-:+\. ]+")
+DIFF_TIME = re.compile(r"\t[\d\-:+\. ]+")


Nice catch!
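For anyone wondering why this only surfaced now: as far as I can tell, the standard library's `re` rejects a character range whose endpoint is a class escape such as `\d`, so the unescaped pattern stops compiling once the file imports `re` instead of `regex`. A quick sketch, where the `compiles_with_re` helper and the sample header line are hypothetical:

```python
import re

# Escaped form from the diff: "\-" is a literal hyphen, not a range.
DIFF_TIME = re.compile(r"\t[\d\-:+\. ]+")


def compiles_with_re(pattern: str) -> bool:
    """Report whether the standard library's re accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Hypothetical diff header line with a tab-separated timestamp.
header = "--- src/example.py\t2021-11-25 15:02:01.000000 +0000"

print(compiles_with_re(r"\t[\d-:+\. ]+"))   # unescaped form from the old line
print(compiles_with_re(r"\t[\d\-:+\. ]+"))  # escaped form from the new line
print(DIFF_TIME.sub("", header))
```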

JelleZijlstra · 2021-11-28T03:29:03Z

The fuzz failure is real but happens on main too. Reported as #2651.

We have not used it since GH-2644 and GH-2654. This should hopefully make Black easier to use, as there's one less compiled dependency. The core team also doesn't have to deal with the surprisingly frequent fires the regex packaging setup goes through. Co-authored-by: Richard Si <[email protected]>

JelleZijlstra added 6 commits November 25, 2021 06:19

fix regex

c097111

use re in black

1bcaff6

put it back in trans

d0c7213

remove in conv.py

e7a8d67

remove regex usage in lib2to3

8b14fc5

CHANGELOG

275e811

JelleZijlstra requested a review from ichard26 November 25, 2021 15:02

JelleZijlstra added 3 commits November 25, 2021 07:04

no re.Pattern in 3.6

444415e

exclude all whitespace

2979e2c

backslash

4ab530c

ichard26 reviewed Nov 26, 2021

JelleZijlstra added a commit that referenced this pull request Nov 26, 2021

add FAQ entry about undetected syntax errors

04e1310

This came up in #2644.

JelleZijlstra mentioned this pull request Nov 26, 2021

add FAQ entry about undetected syntax errors #2645

Merged

Merge branch 'main' into noregex

e809b80

JelleZijlstra mentioned this pull request Nov 28, 2021

Instability with "0^)=0#" #2651

Closed

JelleZijlstra added a commit that referenced this pull request Nov 30, 2021

add FAQ entry about undetected syntax errors (#2645)

ebd3e39

This came up in #2644.

Merge branch 'main' into noregex

6dfc313

JelleZijlstra merged commit 5e2bb52 into main Dec 1, 2021

JelleZijlstra deleted the noregex branch December 1, 2021 02:01

ichard26 mentioned this pull request Dec 1, 2021

Look into alternatives to using regex #2197

Closed
