-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce usage of regex #2644
Reduce usage of regex #2644
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it and so does diff-shades - reporting zero changes. Probably worth putting some more thoughts into the tokenize hack before merging this though.
@@ -6,6 +6,7 @@ | |||
|
|||
- Fixed Python 3.10 support on platforms without ProcessPoolExecutor (#2631) | |||
- Fixed `match` statements with open sequence subjects, like `match a, b:` (#2639) | |||
- Reduce usage of the `regex` dependency (#2644) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this worth mentioning in the changelog? This doesn't really have an impact on end users since we still depend on regex unconditionally so all of the problem involved in that will persist. Not a big deal but I wanted to flag this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that there's some chance it does have an impact on end users, so it's worth mentioning.
@@ -86,7 +86,7 @@ def _combinations(*l): | |||
Comment = r"#[^\r\n]*" | |||
Ignore = Whitespace + any(r"\\\r?\n" + Whitespace) + maybe(Comment) | |||
Name = ( # this is invalid but it's fine because Name comes after Number in all groups | |||
r"\w+" | |||
r"[^\s#\(\)\[\]\{\}+\-*/!@$%^&=|;:'\",\.<>/?`~\\]+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point we might need to add a FAQ entry describing why Black is incredibly inconsistent detecting invalid syntax. We don't promise that Black will fail on all invalid code but people do reasonably assume consistency. We don't need to get into the nitty gritty but simply explaining how it requires less work while achieving a high degree compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can add that separately.
tests/test_black.py
Outdated
@@ -70,7 +70,7 @@ | |||
R = TypeVar("R") | |||
|
|||
# Match the time output in a diff, but nothing else | |||
DIFF_TIME = re.compile(r"\t[\d-:+\. ]+") | |||
DIFF_TIME = re.compile(r"\t[\d\-:+\. ]+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
The fuzz failure is real but happens on main too. Reported #2651. |
We were no longer using it since GH-2644 and GH-2654. This should hopefully make using Black easier to use as there's one less compiled dependency. The core team also doesn't have to deal with the surprisingly frequent fires the regex packaging setup goes through. Co-authored-by: Richard Si <[email protected]>
This removes all but one usage of the
regex
dependency. Tricky bits included:tokenize.py
was the original use case for regex (#455 Fix bug with tricky unicode symbols #1047). The important bit is that we rely on\w
to match anything valid in an identifier, andre
fails to match a few characters as part of identifiers. My solution is to instead match all characters except those we know to mean something else in Python: whitespace and ASCII punctuation. This will make Black able to parse some invalid Python programs, like those that contain non-ASCII punctuation in the place of an identifier, but that seems fine to me.regex
remains, intrans.py
. We use a recursive regex to parse f-strings, and onlyregex
supports that. I haven't thought of a better fix there (except maybe writing a manual parser), so I'm leaving that for now.My goal is to remove the
regex
dependency to reduce the risk of breakage due to dependencies and make life easier for users on platforms without wheels.