Add analyze_text to make make check happy. by Anteru · Pull Request #1549 · pygments/pygments

Anteru · 2020-09-19T13:23:14Z

This also fixes a few small bugs:

* Slash uses *.sla as the file ending, not *.sl
* IDL has endelse, not elseelse

birkenfeld

In general I'd prefer for searched-for keywords to be bounded by a \b word delimiter.
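
To illustrate the point about word delimiters, here is a minimal sketch. The helper name `has_keyword` and the sample strings are hypothetical and not part of the PR; only `re.search`, `re.escape`, and the `\b` anchor are standard.

```python
import re


def has_keyword(text, keyword):
    """Return True if `keyword` occurs as a whole word anywhere in `text`.

    Hypothetical helper for illustration only; it is not part of Pygments.
    """
    # \b matches at a transition between a word and a non-word character,
    # so a keyword embedded in a longer identifier is not counted.
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return re.search(pattern, text) is not None


print(has_keyword("if x then y endelse", "endelse"))   # True
print(has_keyword("value = endelsewhere", "endelse"))  # False
```

Without the delimiters, the second call would also report a hit, which is the kind of false positive the comment is concerned about.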

birkenfeld · 2020-09-20T06:17:17Z

pygments/lexers/dotnet.py

+
+    def analyse_text(text):
+        """F# doesn't have that many unique features -- |> for matching
+        seems quite common though, in addition to let/match."""


let/match/-> are quite common in functional and derived languages, e.g. Haskell or Rust. I wouldn't use them.

birkenfeld · 2020-09-20T06:17:50Z

pygments/lexers/esoteric.py

+        other programming language."""
+        if '+++++' in text or '------' in text:
+            return 0.5
+        if '+++' in text:


+++ and --- are common in diff/patch files, which are probably more common than brainfuck :)

The longer +++++++ and ------ sequences are typical for underlines in readable markup like rst/markdown.

Another signature of brainfuck which has fewer conflicts is probably [-], i.e. "clear this cell".

I've changed it to check if 25% of the source consists of +- or <>; that seems fairly unlikely in any other source (even HTML is not that <> heavy).

While this is probably a good test, I'm worried about the performance implications of looping over the string on the Python level... the other tests at least use the C-based regex/stringlib routines.

Given that it's a statistical test, maybe trying it on the first 256 characters or so is enough? The character distribution for brainfuck should be similar across any subset of the program.
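
The idea discussed in this thread could be sketched as follows. This is an assumption-laden illustration, not the code that was merged: the function name `brainfuck_score`, the returned scores, and the choice to treat the `+-` and `<>` groups separately are all hypothetical. It uses `str.count`, which runs in C, and only looks at a prefix of the input.

```python
def brainfuck_score(text, sample_size=256):
    """Guess how likely `text` is brainfuck from a short prefix.

    Hypothetical sketch; the scores 1.0 / 0.5 / 0.0 are illustrative.
    """
    sample = text[:sample_size]
    if not sample:
        return 0.0

    # str.count is implemented in C, so no Python-level loop over characters.
    plus_minus = sample.count('+') + sample.count('-')
    angle = sample.count('<') + sample.count('>')

    # Assumption: either group alone exceeding 25% of the sample is enough.
    threshold = len(sample) * 0.25
    if plus_minus > threshold or angle > threshold:
        return 1.0

    # "[-]" (clear this cell) is a weaker but fairly distinctive hint.
    if '[-]' in sample:
        return 0.5
    return 0.0
```

A unified diff header such as `--- a/file.py` / `+++ b/file.py` followed by ordinary text stays well under the 25% threshold, so it would score 0.0 under this sketch.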

birkenfeld · 2020-09-20T06:22:22Z

pygments/lexers/int_fiction.py

+        """We try to find a keyword which seem relatively common, unfortunately
+        there is a decent overlap with Smalltalk keywords otherwise here.."""
+        result = 0
+        if re.match('origsource', text, re.IGNORECASE):


match vs search again?

birkenfeld · 2020-09-20T06:22:52Z

pygments/lexers/configs.py

+    def analyse_text(text):
+        """This is a quite simple script file, but there are a few keywords
+        which seem unique to this language."""
+        if re.match('osversion|includecmd', text, re.IGNORECASE):


match is anchored at the beginning of the text, is that intended?

Nope, and I'll fix all remaining instances as well.
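
The difference can be seen in a small sketch. The sample text is made up for illustration; the pattern reuses the two keywords from the snippet above and adds `\b` delimiters.

```python
import re

# Hypothetical input: the keyword is not on the first line.
text = "# build settings\nOSVERSION 10.15\n"
pattern = r'\b(?:osversion|includecmd)\b'

# re.match only tries to match at position 0 of the string.
anchored = re.match(pattern, text, re.IGNORECASE)

# re.search scans the whole string for the first match.
anywhere = re.search(pattern, text, re.IGNORECASE)

print(anchored)           # None
print(anywhere.group(0))  # OSVERSION
```

With `re.match`, the check would only succeed if the file happened to start with one of the keywords.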

birkenfeld · 2020-09-20T06:24:02Z

pygments/lexers/matlab.py

+        # A \ B is actually quite uncommon outside of Matlab/Octave
+        if re.match(r'\w+\s*\\\s*\w+',text):
+            return 0.05
+        if re.match(r'\[\s*(?:(?:\d+\s*)+;\s*)+\s*\]', text):


This is a quite complicated regex that might backtrack (not catastrophically, but still a bit much to do for analyse_text...)

I just tried those out and both aren't really good indicators either. I'm inclined to return 0 until someone has a better idea.

birkenfeld · 2020-09-20T06:24:26Z

pygments/lexers/modula2.py

            yield index, token, value
+
+    def analyse_text(text):
+        """Not much we can go by. (* for comments is our best guess."""


that's the same for all pascal-like languages, unfortunately.

Ok. New plan: figure out if it's Pascal-like first, then check if PROCEDURE is present (not valid Pascal), and back out if FUNCTION is found (not valid Modula2).

birkenfeld · 2020-09-20T06:25:08Z

pygments/lexers/scripting.py

+        """public method and private method don't seem to be quite common
+        elsewhere."""
+        result = 0
+        if re.match(r'(?:public|private)\s+method', text):


birkenfeld · 2020-09-23T05:29:30Z

pygments/lexers/ezhil.py

        is obviously horribly off if someone uses string literals in tamil
        in another language."""
-        if len(re.findall('[\u0b80-\u0bff]')) > 10:
+        if len(re.findall(r'[\u0b80-\u0bff]', text)) > 10:


Oops, I missed that.

Me too, but that's what ~~friends~~ tests are for :)
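
For context, a minimal sketch of why the original line failed and what the fixed check does. The names `TAMIL_CHARS` and `looks_like_tamil` are hypothetical; the character range and the threshold of 10 come from the diff above.

```python
import re

# Calling re.findall with only a pattern raises TypeError, because the
# string to search is a required argument.
try:
    re.findall('[\u0b80-\u0bff]')
except TypeError as exc:
    missing_arg_error = exc

# Characters in the Tamil Unicode block, as in the diff.
TAMIL_CHARS = re.compile(r'[\u0b80-\u0bff]')


def looks_like_tamil(text, minimum=10):
    """Return True if `text` contains more than `minimum` Tamil characters."""
    return len(TAMIL_CHARS.findall(text)) > minimum
```

A test that simply calls the check on any sample input would have surfaced the missing argument immediately.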

birkenfeld · 2020-09-23T05:32:10Z

pygments/lexers/modula2.py

-            result += 0.01
+            is_pascal_like += 0.5
+
+        if is_pascal_like == 1:


Looks good, but can you make this a boolean flag (pascal = x in text and y in text and z in text)? The float-adding-and-comparing seems to be unneeded...
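
The suggested shape could look roughly like this. It is a sketch under assumptions, not the merged implementation: the function name `modula2_score`, the three markers used for the Pascal-like test, and the score 0.6 are all chosen for illustration.

```python
import re


def modula2_score(text):
    """Hypothetical sketch of a Modula-2 heuristic using a boolean flag."""
    # Assumed markers for "Pascal-like": block comments and := assignment.
    is_pascal_like = '(*' in text and '*)' in text and ':=' in text
    if not is_pascal_like:
        return 0.0

    # Per the plan in this thread, FUNCTION rules Modula-2 out.
    if re.search(r'\bFUNCTION\b', text):
        return 0.0

    # PROCEDURE is taken as a positive signal; 0.6 is illustrative.
    if re.search(r'\bPROCEDURE\b', text):
        return 0.6
    return 0.0
```

Compared with adding 0.5 twice and testing for equality with 1, the boolean expression states the condition directly and avoids comparing floats.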

Anteru requested a review from birkenfeld September 19, 2020 13:23

Anteru force-pushed the task/add-analyze-text branch from a1dee1a to eb7ecc0 September 19, 2020 13:24

Add analyze_text to make make check happy.

834c48b

This also fixes a few small bugs:

* Slash uses *.sla as the file ending, not *.sl
* IDL has endelse, not elseelse

Anteru force-pushed the task/add-analyze-text branch from eb7ecc0 to 834c48b September 19, 2020 13:26

Anteru mentioned this pull request Sep 19, 2020

Add a check for CR/LF in files. #1547

Merged

birkenfeld requested changes Sep 20, 2020

Anteru added 2 commits September 21, 2020 21:37

Improve various analyse_text methods.

07f596d

Improve various analyse_text methods.

f0d31da

* Make Perl less confident in presence of :=.
* Improve brainfuck check to not parse the whole input.
* Improve Unicon by matching \self, /self
* Fix Ezhil not matching against the input text

birkenfeld approved these changes Sep 23, 2020

Simplify Modula2::analyse_text.

94f0b02

Anteru self-assigned this Sep 23, 2020

Anteru added the changelog-update label Sep 23, 2020

Anteru merged commit 9fca2a1 into master Sep 23, 2020

Anteru deleted the task/add-analyze-text branch September 23, 2020 16:14

Anteru removed the changelog-update label Oct 24, 2020

Anteru added this to the 2.7.2 milestone Oct 24, 2020
