Add analyze_text to make make check happy. #1549

Merged: Anteru merged 4 commits into master from task/add-analyze-text on Sep 23, 2020

Collaborator

Anteru commented Sep 19, 2020

This also fixes a few small bugs:

  • Slash uses *.sla as the file ending, not *.sl
  • IDL has endelse, not elseelse

Anteru requested a review from birkenfeld September 19, 2020 13:23
Anteru force-pushed the task/add-analyze-text branch from a1dee1a to eb7ecc0 September 19, 2020 13:24
Member

birkenfeld left a comment

In general I'd prefer for searched-for keywords to be bounded by a \b word delimiter.
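
A minimal sketch of what the `\b` suggestion changes, using Python's `re` module. The `keyword_hit` helper and the sample string are invented for illustration and are not taken from the PR.

```python
import re

# Hypothetical helper, not part of Pygments: reports whether `keyword`
# occurs in `text` as a whole word rather than inside a longer identifier.
def keyword_hit(keyword, text):
    return re.search(r'\b' + re.escape(keyword) + r'\b', text) is not None

sample = "let outlet = matching_rules"

# A plain substring test also fires inside "matching_rules".
substring_hit = 'match' in sample

# The bounded search only accepts the standalone word.
bounded_let = keyword_hit('let', sample)
bounded_match = keyword_hit('match', sample)
```

With the boundaries in place, `let` is still found as a standalone word, while `match` inside `matching_rules` no longer counts.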


def analyse_text(text):
    """F# doesn't have that many unique features -- |> for matching
    seems quite common though, in addition to let/match."""
Member

let/match/-> are quite common in functional and derived languages, e.g. Haskell or Rust. I wouldn't use them.

other programming language."""
if '+++++' in text or '------' in text:
    return 0.5
if '+++' in text:
Member

+++ and --- are common in diff/patch files, which are probably more common than brainfuck :)

The longer +++++++ and ------ sequences are typical for underlines in readable markup like rst/markdown.

Another signature of brainfuck which has fewer conflicts is probably [-], i.e. "clear this cell".

Collaborator Author

I've changed it to check whether 25% of the source consists of +- or <>. That seems fairly unlikely in any other source (even HTML is not that <> heavy).

Member

While this is probably a good test, I'm worried about the performance implications of looping over the string on the Python level... the other tests at least use the C-based regex/stringlib routines.

Collaborator Author

Given it's statistics, maybe trying it on the first 256 characters or so is enough? The character distribution for brainfuck should be similar across any subset of the program.
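
One way the two concerns above could be combined, as a rough sketch rather than the code that was merged: sample only a prefix of the input and count with `str.count`, which runs in C instead of a Python-level loop. The function name `brainfuck_score`, the `sample_size` parameter, and the returned scores are assumptions made for this example; the 25% threshold and the 256-character prefix come from the discussion.

```python
def brainfuck_score(text, sample_size=256):
    """Hypothetical sketch, not the implementation merged in this PR."""
    sample = text[:sample_size]
    if not sample:
        return 0.0

    # str.count is implemented in C, so no per-character Python loop is needed.
    plus_minus = sample.count('+') + sample.count('-')
    angle = sample.count('<') + sample.count('>')

    # Threshold from the thread: more than 25% of the sampled characters.
    limit = 0.25 * len(sample)
    if plus_minus > limit or angle > limit:
        return 1.0

    # "[-]" (clear this cell) is the lower-conflict signature suggested earlier.
    return 0.5 if '[-]' in text else 0.0
```

A diff header such as `--- a/file` followed by `+++ b/file` contributes only a handful of `+` and `-` characters, so it stays well under the threshold once ordinary text follows.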

"""We try to find a keyword which seem relatively common, unfortunately
there is a decent overlap with Smalltalk keywords otherwise here.."""
result = 0
if re.match('origsource', text, re.IGNORECASE):
Member

match vs search again?

Collaborator Author

Indeed :(

def analyse_text(text):
    """This is a quite simple script file, but there are a few keywords
    which seem unique to this language."""
    if re.match('osversion|includecmd', text, re.IGNORECASE):
Member

match is anchored at the beginning of the text, is that intended?

Collaborator Author

Nope, and I'll fix all remaining instances as well.
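
The difference raised here can be reproduced in a few lines. The pattern is the one from the quoted hunk; the sample input is invented for illustration.

```python
import re

pattern = 'osversion|includecmd'
text = "; build script\nOSVersion = 10\n"  # invented sample input

# re.match only tries the pattern at position 0, so a keyword that appears
# later in the file is never seen.
anchored = re.match(pattern, text, re.IGNORECASE)

# re.search scans the whole string and finds the keyword on the second line.
scanned = re.search(pattern, text, re.IGNORECASE)
```

Here `anchored` is `None`, while `scanned` is a match object for `OSVersion`, which is why `re.search` is the better fit for a keyword that may appear anywhere in the file.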

# A \ B is actually quite uncommon outside of Matlab/Octave
if re.match(r'\w+\s*\\\s*\w+',text):
    return 0.05
if re.match(r'\[\s*(?:(?:\d+\s*)+;\s*)+\s*\]', text):
Member

This is a quite complicated regex that might backtrack (not catastrophically, but still a bit much to do for analyse_text...)

Collaborator Author

I just tried those out, and neither is really a good indicator. I'm inclined to return 0 until someone has a better idea.

yield index, token, value

def analyse_text(text):
    """Not much we can go by. (* for comments is our best guess."""
Member

That's the same for all Pascal-like languages, unfortunately.

Collaborator Author

Ok. New plan: figure out if it's Pascal-like first, then check if PROCEDURE is present (not valid Pascal), and back out if FUNCTION is found (not valid Modula-2).

"""public method and private method don't seem to be quite common
elsewhere."""
result = 0
if re.match(r'(?:public|private)\s+method', text):
Member

match

Collaborator Author

Done

* Make Perl less confident in presence of :=.
* Improve brainfuck check to not parse the whole input.
* Improve Unicon by matching \self, /self
* Fix Ezhil not matching against the input text
    is obviously horribly off if someone uses string literals in tamil
    in another language."""
-   if len(re.findall('[\u0b80-\u0bff]')) > 10:
+   if len(re.findall(r'[\u0b80-\u0bff]', text)) > 10:
Member

Oops, I missed that.

Collaborator Author

Me too, but that's what ~~friends~~ tests are for :)
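
A short sketch of why a test catches this immediately: `re.findall` needs both a pattern and a string, so calling it without the input raises `TypeError` the first time the function runs. The character range is the one from the hunk above; the sample string is constructed for illustration.

```python
import re

TAMIL_RANGE = r'[\u0b80-\u0bff]'

# Constructed sample: twelve Tamil letters followed by ASCII text.
sample = '\u0b85' * 12 + ' print'

# Calling findall without the input string fails as soon as it is executed.
try:
    re.findall(TAMIL_RANGE)
    missing_argument_raises = False
except TypeError:
    missing_argument_raises = True

# With the text passed in, the characters in the range can be counted.
tamil_count = len(re.findall(TAMIL_RANGE, sample))
```

For this sample `tamil_count` is 12, which clears the "more than 10" threshold used in the hunk.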

result += 0.01
is_pascal_like += 0.5

if is_pascal_like == 1:
Member

Looks good, but can you make this a boolean flag (pascal = x in text and y in text and z in text)? The float-adding-and-comparing seems to be unneeded...

Collaborator Author

Sure!
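
The boolean-flag suggestion, combined with the PROCEDURE/FUNCTION plan from earlier in the thread, might look roughly like the sketch below. It is not the merged code: the function name `modula2_score`, the three substrings used as Pascal-like markers, and the 0.6 score are assumptions chosen for this example.

```python
import re

def modula2_score(text):
    """Hypothetical sketch of the plan discussed above, not the merged code."""
    # A single boolean replaces adding 0.5 twice and comparing against 1.
    # The three markers are assumed for illustration.
    is_pascal_like = '(*' in text and '*)' in text and ':=' in text
    if not is_pascal_like:
        return 0.0

    # FUNCTION is treated as a sign of Pascal, so back out.
    if re.search(r'\bFUNCTION\b', text):
        return 0.0

    # PROCEDURE is treated as a sign of Modula-2; 0.6 is an arbitrary score.
    if re.search(r'\bPROCEDURE\b', text):
        return 0.6

    return 0.0
```

The flag also sidesteps exact equality checks on floats, which only happen to work here because 0.5 + 0.5 is exactly representable.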

Anteru self-assigned this Sep 23, 2020
Anteru added the changelog-update label (items which need to get mentioned in the changelog) Sep 23, 2020
Anteru merged commit 9fca2a1 into master Sep 23, 2020
Anteru deleted the task/add-analyze-text branch September 23, 2020 16:14
Anteru removed the changelog-update label (items which need to get mentioned in the changelog) Oct 24, 2020
Anteru added this to the 2.7.2 milestone Oct 24, 2020