Python re.sub() how to include and slice the match in the substitution?
I recently started learning regular expressions and I'm trying to use Python's re module, specifically the re.sub() function, to convert a subset of Markdown syntax to HTML whenever it appears in a string. However, I haven't been able to figure out how to slice the source string so that the Markdown syntax is removed.
For example, the string This is a *test* string. should get converted to This is a <i>test</i> string., but my code keeps the asterisks, like so: This is a <i>*test*</i> string.
This is my code (the regex checks for backslashes in case the syntax is escaped and is non-greedy in case of multiple matches):
testString = re.sub(r'(?<!\\)\*.*?\*(?!\\)', r'<i>\g<0></i>', testString)
I've tried splitting up the substitution and using string slicing, like this: r'<i>' + r'\g<0>'[1:-1] + r'</i>', but that just returns an italicized 'g'.
2 answers
To answer your immediate question, use a capture group.
re.sub(r'(?<!\\)\*(.*?)\*(?!\\)', r'<i>\g<1></i>', testString)
With that out of the way, text parsing for a language is probably best served by an existing library.
For instance, r'(?<!\\)\*(.*?)\*(?!\\)' is probably meant to be r'(?<!\\)\*(.*?)(?<!\\)\*' (the check for an escaping backslash has to be a lookbehind placed before the closing asterisk), but r'This is a *test\* string*' will still not come out the way you might expect.
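To make the escaped-asterisk caveat concrete, here is a small sketch that runs the capture-group pattern on that input. The second pattern is a variant that checks for the backslash with a lookbehind before the closing asterisk; whether that matches what was intended is an assumption on my part.

```python
import re

text = r'This is a *test\* string*'

# Pattern as given: lookahead placed after the closing asterisk.
original = re.sub(r'(?<!\\)\*(.*?)\*(?!\\)', r'<i>\g<1></i>', text)

# Variant: lookbehind placed before the closing asterisk, so an
# escaped asterisk cannot close the span.
variant = re.sub(r'(?<!\\)\*(.*?)(?<!\\)\*', r'<i>\g<1></i>', text)

print(original)  # This is a <i>test\</i> string*
print(variant)   # This is a <i>test\* string</i>
```

Neither output is what a Markdown renderer would produce (the escaping backslash is still there), which is the argument for using an existing library.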
The following users marked this post as Works for me:
| User | Comment | Date |
|---|---|---|
| Michael | (no comment) | Nov 1, 2025 at 19:23 |
What went wrong
The call to re.sub is not magic; it works like any other function call: Python evaluates the arguments first and then passes the results to the function. The string that you pass for the replacement has special meaning, but that meaning comes from the regex implementation, not from Python.
In the code (inferred from your description)
testString = re.sub(r'(?<!\\)\*.*?\*(?!\\)', r'<i>' + r'\g<0>'[1:-1] + r'</i>', testString)
The replacement string r'<i>' + r'\g<0>'[1:-1] + r'</i>' is computed before calling re.sub. (Just to be sure: the r prefixes on these strings have nothing to do with regular expressions; they are part of the Python language syntax for describing strings.) The result, naturally, is r'<i>g<0</i>'; the leading backslash and closing angle bracket are cut off. (If you put the result into any kind of HTML viewer, the open angle bracket by itself is invalid; the results may vary according to the viewer.)
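A quick way to see this for yourself is to compute the replacement separately and print it before handing it to re.sub. This is only a sketch; the variable names are mine and the input is the example sentence from the question.

```python
import re

# Python evaluates this expression first; re.sub never sees the slice.
replacement = r'<i>' + r'\g<0>'[1:-1] + r'</i>'
print(replacement)  # <i>g<0</i>

# The sliced string no longer contains a group reference, so it is
# inserted literally in place of each match.
result = re.sub(r'(?<!\\)\*.*?\*(?!\\)', replacement, 'This is a *test* string.')
print(result)  # This is a <i>g<0</i> string.
```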
Understanding the syntax of replacement strings
The \g<0> part of the replacement string is not some arbitrary nonsense to represent the original string. The g stands for "group", and the number in angle brackets indicates which group's match to use.
When a regex is matched, the result may contain any number of sub-matches for "groups" representing a portion of the matching part of the text. Group 0 stores the entire match; additional results come from "capturing groups" in the regex pattern.
In the given regex (?<!\\)\*.*?\*(?!\\), (?<!\\) and (?!\\) are lookaround assertions (a negative lookbehind and a negative lookahead), not capturing groups, so they don't add group matches to the result. (This is not because they don't match any text; empty matches can be captured, and non-empty matches can be intentionally discarded, using the right syntax.)
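If it helps, the group bookkeeping can be inspected directly. The sketch below uses re.search and the compiled pattern's groups attribute on the regex from the question; the variable names are only for illustration.

```python
import re

pattern = re.compile(r'(?<!\\)\*.*?\*(?!\\)')
match = pattern.search('This is a *test* string.')

group_count = pattern.groups   # number of capturing groups in the pattern
whole_match = match.group(0)   # group 0 is always the entire match
sub_matches = match.groups()   # tuple of the capturing groups' matches

print(group_count)  # 0
print(whole_match)  # *test*
print(sub_matches)  # ()
```

Since there are no capturing groups here, asking for match.group(1) would raise an IndexError.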
How to use the system
By adding a capturing group to the regex that captures the part between the asterisks, we can then make a replacement that only uses that part. To make a "simple" capturing group that doesn't have any special effects, just add parentheses around that part of the regex. Thus, (?<!\\)\*(.*?)\*(?!\\) adds a capturing group to the previous regex, matching the part between asterisks.
The capturing group will produce group match 1 in the result, so we can refer to it in the replacement string that way: r'<i>\g<1></i>'. Putting it all together gives Michael's example code.
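As a self-contained sketch of the capturing-group approach (the function name and the second input are my own, for illustration only):

```python
import re

def markdown_italics_to_html(text):
    """Wrap the text between unescaped asterisks in <i> tags."""
    # Group 1 is the part between the asterisks; the asterisks themselves
    # are matched but not captured, so they are dropped from the output.
    return re.sub(r'(?<!\\)\*(.*?)\*(?!\\)', r'<i>\g<1></i>', text)

single = markdown_italics_to_html('This is a *test* string.')
multiple = markdown_italics_to_html('*one* and *two*')

print(single)    # This is a <i>test</i> string.
print(multiple)  # <i>one</i> and <i>two</i>
```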
Advanced technique
Aside from this kind of special string with references back to the group matches, Python allows you to pass a function as the "replacement" used by re.sub. This function is called once for each match and takes one argument, which is a match object created by the regex (the same kind of object you'd get by using re.search with the pattern instead of re.sub); it is expected to return a string, which will then be used for the replacement.
For example, to "slice the match" as originally described (instead of using a capturing group), first write a function that takes in the match, extracts its .group(0) (remember, we are responsible for the logic now, so we don't have access to the special interpretation of '\g<0>') and does the appropriate formatting. Like so:
def asterisks_to_i_tags(match):
    text = match.group(0)[1:-1]
    return f'<i>{text}</i>'
(This also shows the use of an f-string to assemble the desired string.)
Now it can be used in the re.sub call, like so:
testString = re.sub(r'(?<!\\)\*.*?\*(?!\\)', asterisks_to_i_tags, testString)
(Notice that the function is passed by writing only its name, not any parentheses or arguments afterward.)
Of course, for the current case, this is pointless; but knowing about this technique opens up possibilities.
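One possibility, as a hedged illustration: a function can post-process the matched text in ways a replacement string cannot. The sketch below HTML-escapes the text between the asterisks using the standard library's html.escape. The escaping requirement, the function name and the input are my assumptions, not part of the original question.

```python
import html
import re

def asterisks_to_escaped_i_tags(match):
    # Group 1 is the text between the asterisks; escape characters such
    # as '<' so they cannot be read as markup in the output.
    inner = html.escape(match.group(1))
    return f'<i>{inner}</i>'

converted = re.sub(
    r'(?<!\\)\*(.*?)\*(?!\\)',
    asterisks_to_escaped_i_tags,
    'Compare *a < b* here.',
)
print(converted)  # Compare <i>a &lt; b</i> here.
```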
