Add analyse_text to make `make check` happy #1549
Conversation
force-pushed from a1dee1a to eb7ecc0
This also fixes a few small bugs:

* Slash uses *.sla as the file ending, not *.sl
* IDL has endelse, not elseelse
force-pushed from eb7ecc0 to 834c48b
birkenfeld left a comment:
In general I'd prefer for searched-for keywords to be bounded by a \b word delimiter.
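The `\b`-bounding the reviewer asks for could look like this; the keyword and score are illustrative, not the exact upstream code:

```python
import re

def analyse_text(text):
    # Hypothetical sketch of the reviewer's suggestion: bound the
    # searched-for keyword with \b so that e.g. 'endelse' inside a
    # longer identifier such as 'endelsewhere' does not count.
    if re.search(r'\bendelse\b', text, re.IGNORECASE):
        return 0.1
    return 0
```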
pygments/lexers/dotnet.py (Outdated)

```python
    def analyse_text(text):
        """F# doesn't have that many unique features -- |> for matching
        seems quite common though, in addition to let/match."""
```
let/match/-> are quite common in functional and derived languages, e.g. Haskell or Rust. I wouldn't use them.
pygments/lexers/esoteric.py (Outdated)

```python
        other programming language."""
        if '+++++' in text or '------' in text:
            return 0.5
        if '+++' in text:
```
+++ and --- are common in diff/patch files, which are probably more common than brainfuck :)
The longer +++++++ and ------ sequences are typical for underlines in readable markup like rst/markdown.
Another signature of brainfuck which has fewer conflicts is probably [-], i.e. "clear this cell".
I've changed it to check whether 25% of the source consists of +- or <>; that seems fairly unlikely in any other source (even HTML is not that <> heavy).
While this is probably a good test, I'm worrying about the performance implications of looping over the string on the Python level... the other tests at least use the C-based regex/stringlib routines.
Given it's statistics, maybe trying it on the first 256 characters or so is enough? The character distribution for brainfuck should be similar across any subset of the program.
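A minimal sketch of the heuristic discussed here, sampling only the first 256 characters and counting with the C-level `str.count` rather than a per-character Python loop; the thresholds are assumptions for illustration, not the values merged upstream:

```python
def analyse_text(text):
    # Look at a fixed-size prefix only: the operator distribution of a
    # Brainfuck program should be similar across any subset of it.
    sample = text[:256]
    if not sample:
        return 0
    # str.count runs in C, so this avoids looping over the string in Python.
    ops = sum(sample.count(c) for c in '+-<>')
    if ops / len(sample) > 0.25:
        return 0.5
    return 0
```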
pygments/lexers/int_fiction.py (Outdated)

```python
        """We try to find a keyword which seems relatively common;
        unfortunately there is a decent overlap with Smalltalk keywords here."""
        result = 0
        if re.match('origsource', text, re.IGNORECASE):
```
pygments/lexers/configs.py (Outdated)

```python
    def analyse_text(text):
        """This is a quite simple script file, but there are a few keywords
        which seem unique to this language."""
        if re.match('osversion|includecmd', text, re.IGNORECASE):
```
re.match is anchored at the beginning of the text; is that intended?
Nope, and I'll fix all remaining instances as well.
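The distinction the reviewer caught, shown on a made-up snippet: `re.match` only tries at position 0, while `re.search` scans the whole string.

```python
import re

text = 'let x = 1\nosversion 10.3\n'
# re.match is anchored at the start, so the keyword on line 2 is missed:
assert re.match('osversion|includecmd', text, re.IGNORECASE) is None
# re.search scans the whole string and finds it:
assert re.search('osversion|includecmd', text, re.IGNORECASE) is not None
```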
pygments/lexers/matlab.py (Outdated)

```python
        # A \ B is actually quite uncommon outside of Matlab/Octave
        if re.match(r'\w+\s*\\\s*\w+', text):
            return 0.05
        if re.match(r'\[\s*(?:(?:\d+\s*)+;\s*)+\s*\]', text):
```
This is a quite complicated regex that might backtrack (not catastrophically, but still a bit much to do for analyse_text...)
I just tried those out and both aren't really good indicators either. I'm inclined to return 0 until someone has a better idea.
pygments/lexers/modula2.py (Outdated)

```python
            yield index, token, value

    def analyse_text(text):
        """Not much we can go by. (* for comments is our best guess."""
```
that's the same for all pascal-like languages, unfortunately.
Ok. New plan: figure out if it's Pascal-like first, then check if PROCEDURE is present (not valid Pascal), and back out if FUNCTION is found (not valid Modula-2).
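The plan above could be sketched roughly like this; the `(*` comment-marker test and the score values are assumptions for illustration, not the exact upstream code:

```python
import re

def analyse_text(text):
    # Step 1: require a Pascal-family signal at all; the (* comment
    # marker is shared by Pascal, Modula-2 and relatives.
    if '(*' not in text:
        return 0
    # Step 2: back out on FUNCTION, which Modula-2 does not have.
    if re.search(r'\bFUNCTION\b', text):
        return 0
    # Step 3: boost on PROCEDURE, which Modula-2 spells in upper case.
    if re.search(r'\bPROCEDURE\b', text):
        return 0.6
    return 0.1
```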
pygments/lexers/scripting.py (Outdated)

```python
        """public method and private method don't seem to be quite common
        elsewhere."""
        result = 0
        if re.match(r'(?:public|private)\s+method', text):
```
* Make Perl less confident in presence of :=.
* Improve brainfuck check to not parse the whole input.
* Improve Unicon by matching \self, /self
* Fix Ezhil not matching against the input text
```diff
         is obviously horribly off if someone uses string literals in tamil
         in another language."""
-        if len(re.findall('[\u0b80-\u0bff]')) > 10:
+        if len(re.findall(r'[\u0b80-\u0bff]', text)) > 10:
```
Me too, but that's what the tests are for :)
pygments/lexers/modula2.py (Outdated)

```python
            result += 0.01
            is_pascal_like += 0.5

        if is_pascal_like == 1:
```
Looks good, but can you make this a boolean flag (pascal = x in text and y in text and z in text)? The float-adding-and-comparing seems to be unneeded...
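The boolean-flag form the reviewer suggests, sketched literally; the keyword set here is an assumption for illustration, not the exact upstream criteria:

```python
# A single boolean replaces the float accumulator: either all the
# Pascal-family signals are present, or the text is not Pascal-like.
def looks_pascal_like(text):
    return '(*' in text and 'BEGIN' in text and 'END' in text
```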