
Monday, May 15, 2017

Moving On - Part 2

A big thank you for all the nice feedback and encouraging words that I received after my announcement to leave SMACC. Now that I've had my last day at the company, I think it's time to raise the curtain. And there aren't too many surprises behind it, I guess.
From 01 June 2017 on, I'll be a freelancer and professional consultant. I will build solutions for software developers and solve language engineering problems for my customers. My goal is to help developers and domain experts sharpen their tools so that they can tackle their business challenges more efficiently.
I will also work closely with the great people and friends from itemis and be part of the growing team in the Berlin branch. Of course I'm looking forward to contributing to Xtext again. After my absence of more than 15 months, a few things have changed in the project, but there are plenty of interesting topics to tackle in the framework, for sure. Time to get my hands dirty!
Long story short: I’m happy to be back :)

Monday, April 17, 2017

Why DSLs?

A lot has been written about domain specific languages, their purpose and their application. According to the ever-changing wisdom of Wikipedia, a DSL “is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains.” In other words, a DSL is supposed to help implement software systems, or parts of them, more efficiently. But this raises the question of why engineers should learn new syntaxes, new APIs and new tools rather than using their primary language and just getting things done.

Here is my take on this. To answer that question, let's move the discussion away from programming languages towards a more general understanding of language. And instead of talking in the abstract, I'll use a very concrete example. In fact, one of the most discussed domains ever - and one that literally everyone has an opinion about: the weather.

We all know this situation: when watching the news, the forecaster will tell us something about sunshine duration, wind speed and direction, or temperature. Not being a trained meteorologist, I can still find my way through most of the facts, though the probability of precipitation always gives me a slight headache. If we look at the vocabulary that is used in an average weather forecast, we can clearly call it a domain specific language, though it only scratches the surface of meteorology. But what happens when two meteorologists talk to each other about the weather? My take: they will use a very efficient vocabulary to discuss it unambiguously.
Now let’s move this gedankenexperiment forward. There are approximately 40 non-compound words in the Finnish language that describe snow. Now what happens, when a Finnish forecaster and a German news anchor talk about snowy weather conditions and the anchorman takes English notes on that? I bet it is safe to assume that there will be a big loss of precision when it comes to the mutual agreement on the exact form of snowy weather. And even more so, when this German guy later on tries to explain to another Finn what the weather was like. The bottomline of this: common vocabulary and language is crucial to successful communication.

Back to programming. Let's assume that the English language is a general purpose programming language, the German guy is a software developer and the Finnish forecaster is a domain expert for snowy weather. This may all sound a little far-fetched, but in fact it is exactly how most software projects are run: a domain expert explains the requirements to a developer. The dev will start implementing the requirements. Other developers will be onboarded onto the project. They try to wrap their heads around the state of the codebase and will surely read the subtleties of the implementation differently, no matter how fluent they are in English. Follow-up meetings will be scheduled to clarify questions with the domain experts. And the entire communication is prone to loss of precision. In the end, all involved parties talk about similar yet slightly different things. Misunderstandings potentially go unnoticed and cause a lot of frustration on all sides.

This is where domain specific languages come into play! Instead of a tedious, multi-step translation from one specialized vocabulary to a general purpose language and vice versa, the logic is directly implemented using the domain specific terms and notation. The knowledge is captured with fewer manual transformation steps; the system is easier to write, understand and review. This may even work to the extent that the domain experts do write the code themselves. Or they pair up with the software engineers and form a team.

As usual, there is no such thing as a free lunch. As long as you are not omnilingual, you should probably not waste your time learning Finnish by heart, especially when you are working with Spanish people next week, and the French team the week thereafter. But without any doubt, fluent Finnish will pay off as long as you are working with the Finns.

A development process based on domain specific languages, and thus on a level of abstraction close to the problem domain, can relieve everyone involved. There are fewer chances for misunderstandings and inaccurate translations. Speaking the same language and using the same vocabulary naturally feels like pulling together. And that's what makes projects successful.

Monday, April 10, 2017

Moving on

After an exciting journey of 15 months as Director of Engineering at SMACC, I decided to move on. It was not an easy decision to make, though it's still one that I wanted to make. In the past year I made many new friends, met great people, and had the chance to work in a super nice team. It was a great time with plenty of challenges, important lessons and great fun. But I also realized that I was missing my time as a technical consultant. Language engineering always was and still is a strong passion of mine. So I figured it's about time to move on and refocus. Xtext, Eclipse, language-oriented programming - exciting times ahead. Keeping you posted ...

Friday, November 6, 2015

Improved Grammar Inheritance

Since the very first day of Xtext, it has been possible to extend another grammar and mix in its rule declarations in order to reuse or specialize them. For most use cases that was straightforward and a perfect match. For others it was rather cumbersome because the original declaration was no longer reachable from the sub-language. Copy and paste was the only solution to that problem. The good news? The situation changes significantly with Xtext 2.9.
The newly introduced super call lets you override a rule and still use its super implementation without the need to duplicate it. Along with super, Xtext 2.9 also provides a way to call inherited or locally declared rules explicitly. Explicit rule calls overrule the polymorphism that is usually applied in the context of grammar inheritance. As a language library designer you get fine-grained control over the syntax, even if your language is supposed to be extended by sub-languages.
But let's look at an example:
grammar SuperDsl
  with org.eclipse.xtext.common.Terminals
..
Element:
  'element' name=ID
;
Thing:
  'thing' name=SuperDsl::ID
;
terminal ID: ('a'..'z')+;

grammar SubDsl with SuperDsl
..
Element:
    super // 1; or super::Element
  | 'element' name=super::ID // 2
  | 'element' name=Terminals::ID // 3
;
terminal ID: 'id';
Here we see different use cases for the super call and also for qualified rule calls. The first occurrence of super (1) illustrates the shortest possible notation to reach out to the super implementation. If you override a rule and want to use the original declaration in the rule's body, you can simply call super from there.
It is also possible to use a qualified::RuleCall. Qualified invocations point directly to the referenced rule. The qualifier can either be the generic super qualifier (2) or an explicit language name (3). The latter provides a way to skip the immediate super language and reach out to its parent. This offers great flexibility. You can ensure that you call the rule from your own grammar, even if a sub-language overrides the declaration. The benefit is illustrated by the rule Thing: it calls the ID declaration from SuperDsl explicitly, thus it will also do so in SubDsl. As long as you do not explicitly override the declaration of Thing, its syntax will not change in any inheritor of SuperDsl.
Long story short: super calls add a lot more flexibility for language mixins and greatly reduce the need to copy and paste entire rule bodies in the sub-language. Go ahead and download the latest milestone to give it a try!

Thursday, October 22, 2015

The Xtext Grammar Learned New Tricks

Since the Xtext 2.9 release is around the corner - and you've for sure read about the upcoming support for IntelliJ IDEA or Xtext editors in the browser - it's time to unveil some of the new features of the Xtext grammar language itself. In a nutshell, the enhancements address a couple of long-standing feature requests and non-critical issues that we had. Especially complex grammars sometimes required duplicated or repetitive parts to implement the language syntax. We felt that it was about time to get rid of these idioms.
Long story short: In the next version the grammar language will support a couple of new features:
  1. /* SuppressWarnings[all] */: The severity of errors and warnings in a grammar file can be customized on a per-project level since Xtext 2.8. But sometimes you don't want to disable a validation rule completely just to get rid of one particular false positive (False positive?!? you think? Stay tuned, I'll elaborate on that in a separate post). For that purpose it's now possible to mute a certain validation rule for a selected element, a rule or the entire grammar.
  2. super calls and grammar mixins: Xtext 2.9 renders our former advice 'You have to copy the parent rule into your sub-language' obsolete. Finally, it is possible to simply use a super call instead.
  3. A long-standing feature request for the grammar described a means to extract common parts of parser rules without screwing up the Ecore model. The newly introduced parser fragments allow you to normalize production rules that formerly required copy'n'paste, e.g. due to left factoring. Fragments even sport smarter inference of the Ecore model when it comes to multiple inheritance.
  4. Last but not least, the new JavaScript specification was an inspiration for conditional alternatives in a grammar definition. Advanced language use cases may require enabling or disabling a decision path deep down in some recursive chain of rule calls. Until now there was no concise way to support something like that. This limitation often led to dozens of copied rules if a syntax required conditionally enabled or disabled branches. Parameterized rule calls remove that limitation and enable much more expressive notations. A first sketch of (3) and (4) follows below.
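To give you a first taste of (3) and (4), here is a rough sketch in grammar notation - the rule names are invented, and the exact syntax may still change until the release:

// (3) a fragment: a reusable part of a production rule without an own AST type
fragment Named:
  name=ID
;
Element:
  'element' Named
;

// (4) a parameterized rule: the caller enables or disables the 'in' branch
Expression<AllowIn>:
    <AllowIn> {InExpression} 'in' ref=ID
  | {Literal} value=INT
;
Statement:
  expression=Expression<AllowIn=true>
;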
I'll explain all these new features in depth in a short blog series to make sure that every bit of them gets proper attention. Make sure to follow up if you are curious about them.

Monday, November 3, 2014

After EclipseCon is Before EclipseCon

Now that EclipseCon Europe 2014 is over, it's time to focus on the next big community event: EclipseCon North America 2015 - especially since the deadline for the call for papers is already approaching. Better get your session proposal ready soon if you want to share something new, cool, interesting or enlightening with your peers. If you are really fast, you may even make the deadline for the early bird picks. Chances won't get better.
In other words: San Francisco. March 2015. EclipseCon. Submit now!


https://www.eclipsecon.org/na2015/cfp

In case you don't know it yet: EclipseCon North America will again feature theme days that focus on special topics; one of those will be dedicated to Xtext. If you want to share insights about your application of domain-specific languages, how you solved challenges in your language implementation or how you use the framework in general, I can only encourage you to submit a talk for the Xtext track.
Like every year, I expect EclipseCon to be a great community event with deep technical content. Like nowhere else, you can get in touch with the committers of the various Eclipse projects, discuss solutions and have a great time. So even if you don't plan to submit a proposal, make sure to save the date: March 9 - 12 in sunny California, EclipseCon NA!
Still not convinced? Check out the impressions from past EclipseCons and see what you are going to miss!

Monday, October 6, 2014

Musing about Eclipse UX Metaphors: The Blocking Build



tl;dr

For the upcoming version of Xtext we are revising the approach to building. Rethinking the overall lifecycle of the Xtext builder looks promising, aiming at:
  1. Better user experience by introducing a non-blocking build infrastructure
  2. Improved performance due to better parallelization
  3. Incremental builds also on the command line

The Problem

The Xtext framework implements an Eclipse builder and is thereby immediately affected by the builder's user experience metaphor (even a bad experience is still an experience). Whenever a file is saved or a project is explicitly built, the user is essentially locked out from doing work in the editor.

Go Home, Eclipse! You're drunk!
That's not because the editor isn't generally usable during the build. But it turns out that it becomes quite a habit for Eclipse users to save early and save often. As soon as you have written some code and you save the file that you're working on, the builder kicks in and tries to validate the new file state. Since you continue to edit, it's quite likely that you hit save again and are confronted with that modal dialog with greetings from the 90s. Of course you don't see this message every time you save a file, since the incremental build is usually quite fast, but when you see it, it is definitely not what you expected.

Some Background

Generally speaking, the Eclipse builder is responsible for producing artifacts from the source files in a project. There may be different builders configured for the very same project, and the term artifact does not only describe compilation results in the form of new files, but also validation markers. While a builder is running for a project, it holds a lock not only on that project, including its contained files and folders, but on the entire workspace. This ensures that there are no intermittent events that remove or modify any state on disk (details have been discussed here). And this is where the trouble starts from the user's perspective.
On the one hand, the locking prevents unexpected modifications within Eclipse; on the other hand, it gets in the way of users since they can no longer work without interruption. The mechanism was apparently designed to ensure consistency within the workspace between sources and compilation results. But if you look into the dirty corners, the price paid is way too high. The blocking mechanism creates only the impression of safety but can never guarantee it. Literally every external process may still perform I/O operations on the very same files, and the build would go bananas since the state known to Eclipse is no longer in sync with the actual state on disk. But that's probably another can of worms that is not the subject of this post. Instead, let's focus on ways to improve the situation which may lead to a more responsive UI.

Action Items

For Xtext, we are currently analyzing how we can change the way we build files and projects. Rather than getting in the way of the user, we are thinking about performing the build in the background without unnecessary blocking. The main goal in that regard is to move the complete build out of the coarse-grained project lock and break it into manageable, smaller pieces. E.g. as soon as the files are loaded, they don't need to be locked anymore. In the validation phase only the markers are written, but not the entire files. For incremental builds, only a small subset of files needs to be considered in the first place.
This breakdown of locking is desirable on various levels. First and foremost, the user experience would improve a lot since Xtext would present fewer blocking dialogs to the user. Another positive effect is that the build and its lifecycle would be essentially decoupled from the Eclipse builder and its related UI components. By factoring out the build cycle, Xtext can support incremental compilation on the command line, too.
In times of many-core machines, it also becomes more and more interesting to parallelize the build to go full throttle with today's CPUs. To leverage the potential there, the build process itself has to be analyzed carefully. The Xtext build inherently runs in multiple passes that are currently strictly sequential, especially in the context of Eclipse projects. These steps are performed for each individual project during a build.
  1. First of all, the resources in a project have to be indexed to collect all the reachable names and derive an index structure that can be seen as a global symbol table.
  2. After all symbols and names are known, the cross references are resolved and their resolution state is cached as reference descriptions. The validation is currently also performed at that stage, but it can be seen as step 2.5.
  3. The last step is the code generation. All resources are processed to create derived artifacts from them.
There are already means in Xtext to perform some steps in parallel. E.g. the loading of files into memory for stage (1) can be done in parallel rather than sequentially since Xtext 2.0. In the future, we want to improve on that and allow a lot more parallelization. Given that the build were decoupled from the Eclipse builder's lifecycle, we could index all the resources in the workspace at the same time. In phase (1), there is no need for one project to wait for another. Multiple projects would be processed in parallel rather than sequentially. The reference resolution can also be done in parallel - at least if the projects do not depend on each other transitively. For the code generation, there has been support for parallelization since Xtext 2.7, but there's still room for improvement: e.g. we could not only generate resources within a single project in parallel but also run the full build concurrently for multiple projects.
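To illustrate the direction - and nothing more: this is a toy sketch with made-up index and resolve placeholders, not Xtext code - the phases could be orchestrated like this:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBuildSketch {

    // phase 1: compute the symbol table contribution of one project
    static void index(String project) { /* placeholder */ }

    // phase 2: resolve cross references against the complete index
    static void resolve(String project) { /* placeholder */ }

    public static void main(String[] args) throws Exception {
        List<String> projects = Arrays.asList("a", "b", "c");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> indexing = new ArrayList<Future<?>>();
        for (final String project : projects) {
            indexing.add(pool.submit(new Runnable() {
                public void run() {
                    index(project); // no project waits for another in phase 1
                }
            }));
        }
        for (Future<?> task : indexing) {
            task.get(); // barrier: phase 2 needs the complete global index
        }
        List<Future<?>> resolving = new ArrayList<Future<?>>();
        for (final String project : projects) {
            resolving.add(pool.submit(new Runnable() {
                public void run() {
                    resolve(project); // safe in parallel for independent projects
                }
            }));
        }
        for (Future<?> task : resolving) {
            task.get();
        }
        pool.shutdown();
    }
}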
But there's even more that we are discussing right now about the way Xtext projects are built within Eclipse. We are looking into means to preserve the index state if a project is closed by the user, for example. Instead of rebuilding the entire project, the builder state would be available immediately after the project is reopened, similar to plain Java projects. The general handling of archives and the resources in these archives is under review, too. For bigger projects, it may pay off to have precomputed information available that is packaged together with the resources in the archive.
In the end, the overall goal is to improve the perceived performance and the responsiveness of the IDE. Never ever should a user action be blocked by some task the IDE is performing in the background. The build should also be decoupled from the Eclipse infrastructure. In that regard, the contracts for each build step have to be sharpened, and of course correctness should not be traded for concurrency. Exciting times!

Friday, October 3, 2014

Testing multiple Xtext DSLs

Recently there was a question in the Eclipse forum about testing multiple Xtext languages. The described scenario involves two languages, one of which has cross-references to the other. Since this use case caused some headaches in the past, Michael Vorburger provided a patch (Thanks for that!) that adds information about that particular topic to the official Xtext documentation. The updated docs are available since the recent 2.7 release. To provide some additional guidance, I hacked up a small example that can be used as a blueprint if you want to get a jump start on that issue. This example also documents briefly how Xtext's unit test facilities can be used for effective language testing.

Key to testing the infrastructure of a DSL is the setup of a proper test environment. The Guice services have to be registered, EMF has to be initialized and obviously everything has to be cleaned up again after a test has been executed. For that purpose, Xtext provides a dedicated JUnit4 test runner that uses an injector provider to do the initialization. The nice thing about that approach is that you can directly inject the necessary services into your unit test class. Exactly as you are used to from your production code.

When it comes to testing multiple DSLs, basically the same infrastructure can be used, though you have to make sure that all involved languages are properly initialized. For that purpose, a custom injector provider has to be derived from the one that was already generated for your language. The to-be-written subclass needs to take care of all the prerequisites and register the dependent languages. This mainly involves delegation to their injector providers.

Now that the setup is ready, we can test cross references between multiple DSLs. It is important to know that these references are only properly resolved if all models are available in the same resource set. That's why we need to use an explicit resource set in the tests. Besides that, it's the programming model that you know from Xtext and EMF in general. A sketch of both parts follows below.
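A minimal sketch with made-up language names: usedsl references elements declared in defdsl, and the generated artifacts (StandaloneSetup, InjectorProvider, the Model root EClass) are assumed to exist for both languages.

// UseDslWithDefDslInjectorProvider.java
import com.google.inject.Injector;

public class UseDslWithDefDslInjectorProvider extends UseDslInjectorProvider {
    @Override
    protected Injector internalCreateInjector() {
        // register the referenced language first so that its resources can be loaded
        DefDslStandaloneSetup.doSetup();
        return super.internalCreateInjector();
    }
}

// CrossReferenceTest.java
import static org.junit.Assert.assertTrue;

import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.util.EcoreUtil;
import org.eclipse.xtext.junit4.InjectWith;
import org.eclipse.xtext.junit4.XtextRunner;
import org.eclipse.xtext.junit4.util.ParseHelper;
import org.eclipse.xtext.resource.XtextResourceSet;
import org.eclipse.xtext.util.StringInputStream;
import org.junit.Test;
import org.junit.runner.RunWith;

import com.google.inject.Inject;
import com.google.inject.Provider;

@RunWith(XtextRunner.class)
@InjectWith(UseDslWithDefDslInjectorProvider.class)
public class CrossReferenceTest {

    @Inject ParseHelper<Model> parseHelper;
    @Inject Provider<XtextResourceSet> resourceSetProvider;

    @Test
    public void crossReferenceIsResolved() throws Exception {
        XtextResourceSet resourceSet = resourceSetProvider.get();
        // the referenced model has to live in the same resource set
        Resource defResource = resourceSet.createResource(URI.createURI("lib.defdsl"));
        defResource.load(new StringInputStream("element foo"), null);
        Model model = parseHelper.parse("use foo", resourceSet);
        EcoreUtil.resolveAll(model.eResource());
        assertTrue(model.eResource().getErrors().isEmpty());
    }
}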

A complete exemplary test is available on GitHub.

Friday, March 22, 2013

EclipseCon 2013 - Last Minute Preview


Now that I know how to find the way from Logan Intl Airport to the EclipseCon venue at the Seaport WTC in Boston, it's time for some shameless advertising. Since the conference schedule is again packed with deep technical content and all of you have only limited time, I think it's only fair to tell you in advance what to expect from the sessions that I am giving (of course all of them are highly recommended ;-).


Mon 1:00PM - 4:00PM: Getting Started With Xtend
Monday is Tutorial Day. In the afternoon, Sven and I will give you a jump start with Xtend, a new programming language which makes the day-to-day tasks of us Java developers a real pleasure. We listened to you and prepared some new and entertaining tasks where you get your hands on Active Annotations and some new, interesting puzzlers. To make a long story short: If you want to learn about the hot new programming language that is developed at Eclipse, come and join our tutorial.

Wed 4:15PM - 4:50PM: Null-Safety on Steroids
I'm happy to tell you that the annotation-based nullness analysis of the Eclipse JDT is getting better and better. Since the recent milestones, it includes fields in the analysis and allows inheriting the null specification from super types, so the analysis results become much more reliable and easier to adopt. Nevertheless, the JDT's approach is sometimes still based on assumptions which I consider ... how shall I put it ... not really pragmatic. In this session, I want to outline the pros and cons of the current state of null analysis in Eclipse. Furthermore, I will talk about other approaches to tackle the occasional NPE that each of us developers is familiar with. I want to discuss the implications of the different solutions and offer advice on how to deal with them.

Thu 11:00AM - 11:35AM: Xtext - More Best Practices
My third session at this year's EclipseCon is a follow-up to a talk that I gave in Reston last year. In this year's edition of Xtext - Best Practices, I will focus on other topics, especially on the adoption of the Xbase expressions. If you want to learn more about those, I can also highly recommend Jan's talk on Java DSLs with Xtext on Tuesday.

Anyway, there are still some things to prepare and there is never enough time for polishing. Obviously there are a lot more interesting sessions scheduled than I can list here. I'm really looking forward to a great conference and an intense week packed with interesting discussions. See you in Boston!


Thursday, March 21, 2013

Pimp My Visitors

One of the most noteworthy features of Xtend 2.4 are the Active Annotations. They allow you to participate in the compilation of Xtend code to Java source code. In fact, you implement some kind of mini code generator and hook into the Xtend compiler - and all this via a lightweight library. And the IDE is fully aware of all the changes that you make to the generated Java artifacts. The astonishing thing about it is that this approach allows you to develop the annotation processor and the client code side by side.

Are you facing a repetitive coding problem and want to automate that? Nothing could be simpler. Just annotate your class and implement an Active Annotation and you are good.

Which brings me to design patterns. Those often require a lot of boilerplate code since these patterns describe a blueprint for how several objects interact with each other to solve a specific problem. So they are quite useful but also verbose by definition. One of the most tedious examples is the visitor pattern. Since I actually like to use a visitor to handle heterogeneous data structures (you know, decoupling and externalized traversal can be quite convenient), I decided to write a small annotation that creates all the fluff around this pattern on the fly.

In order to implement a visitor, I just have to annotate the root type in the hierarchy, and all the accept methods as well as the base class of the visitor implementation are automatically unfolded. You don't even have to define the base class for the visitor itself. A small example basically expands to the very same verbose Java code as the example on Wikipedia.
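To recap how much boilerplate that is: written by hand, the pattern looks roughly like this (a sketch using the names from the Wikipedia example):

import java.util.ArrayList;
import java.util.List;

interface CarElementVisitor {
    void visit(Wheel wheel);
    void visit(Engine engine);
    void visit(Car car);
}

abstract class CarElement {
    abstract void accept(CarElementVisitor visitor);
}

class Wheel extends CarElement {
    @Override
    void accept(CarElementVisitor visitor) {
        visitor.visit(this);
    }
}

class Engine extends CarElement {
    @Override
    void accept(CarElementVisitor visitor) {
        visitor.visit(this);
    }
}

class Car extends CarElement {
    final List<CarElement> elements = new ArrayList<CarElement>();

    @Override
    void accept(CarElementVisitor visitor) {
        // externalized traversal: the data structure walks its own children
        for (CarElement element : elements) {
            element.accept(visitor);
        }
        visitor.visit(this);
    }
}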


Especially amazing is the fact that this makes it easy to define different kinds of visitors. Your accept method has to take additional arguments? Just list them in the prototype method signature. You want to return a result? Nothing's easier than that - simply declare a different return type. The base implementation for all concrete visitors is already known? Just add the method body to the prototype and it will be moved to the visitor itself. Have a look at the complete example to see how beautiful this approach is. If you want to learn more about Active Annotations, you may want to dive into the documentation on xtend-lang.org and install Xtend to get your hands dirty.

Monday, January 21, 2013

Java Hacks - Changing Final Fields

Changing final fields? Really? What may sound crazy at first glance may be helpful in order to implement mocks or fix up libraries that don't expose the state that you really wanted them to expose. And after all, there is not always a fork-me button available. But really: final fields? Yes, indeed. You shall never forget: There is no spoon.

Let's consider a simple data class Person that we want to hack: it has a final instance field name and a static final field DEFAULT_NAME (the complete code is shown below).
Once a person has been instantiated, it is not possible to change the value of the field name, is it?

Reflection To The Rescue

Fortunately - or unfortunately - Java allows accessing fields reflectively, and if (ab)used wisely, it is possible to change their values, too - even for final fields. Key is the method Field#setAccessible. It allows you to circumvent visibility rules - which is step one - and interestingly it also implicitly allows changing the value of final instance fields reflectively once they are marked as accessible.
Modifying static final fields is a little trickier. Even if #setAccessible was invoked, the virtual machine will throw an IllegalAccessException (which I would expect anyway) because one 'Can not set static final my.field.Type field', even though that was perfectly ok for instance fields. But since there is still no spoon, there is a way out. Again it's based on reflection, but this one is a little more involved: if we don't just set the field accessible but also change its modifiers to non-final, it's ok to alter the value of the field.
This hack allows changing the value of static final fields, too. That is, as long as they are not initialized with literals that will be inlined by the compiler. Those include number literals and string literals, which are compiled directly into the call site to save some computation cycles (yes, referring to String constants from other classes does not introduce a runtime dependency on those classes). Apart from those cases, other common static field types like loggers or infamous singletons can easily be modified at runtime and even (re)set to null.

The complete code looks like this - a sketch with illustrative names; note that newer JDKs lock down the modifiers trick. As promised, it will print the new name of the person to the console and the changed default name, too. But keep in mind: Don't do this at home!
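import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

public class FinalFieldHack {

    static class Person {
        // new String(..) prevents the compiler from inlining the constant
        static final String DEFAULT_NAME = new String("Trillian");
        private final String name;

        Person(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    public static void main(String[] args) throws Exception {
        // instance field: setAccessible is all it takes
        Person person = new Person("Arthur");
        Field name = Person.class.getDeclaredField("name");
        name.setAccessible(true);
        name.set(person, "Zaphod");
        System.out.println(person.getName()); // prints Zaphod

        // static final field: additionally clear the final modifier
        Field defaultName = Person.class.getDeclaredField("DEFAULT_NAME");
        defaultName.setAccessible(true);
        Field modifiers = Field.class.getDeclaredField("modifiers");
        modifiers.setAccessible(true);
        modifiers.setInt(defaultName, defaultName.getModifiers() & ~Modifier.FINAL);
        defaultName.set(null, "Ford");
        System.out.println(Person.DEFAULT_NAME); // prints Ford
    }
}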

Thursday, December 13, 2012

Fixed Checked Exceptions - The Xtend Way

Recently I stumbled across a post about checked exceptions in Sam Beran's Java 8 blog. What he basically describes is a means to reduce the burden when dealing with legacy APIs that abuse Java's checked exceptions. His example is built around the construction of a java.net.URI, which may throw a URISyntaxException.
Actually the URI class is not too bad, since it already provides a static factory URI#create(String) that wraps the checked URISyntaxException in an IllegalArgumentException, but you get the idea.

An Attempt to Tackle Checked Exception

Now that Java will finally get lambda expressions with JSR 335, Sam suggests using some utility class in order to avoid littering your code with try { .. } catch () statements. For example, Throwables#propagate could take care of that boilerplate.
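Presumably it reads something like this (a sketch - the exact shape of the utility is defined in Sam's post):

// wraps the checked URISyntaxException in an unchecked one on the fly
URI uri = Throwables.propagate(() -> new URI("http://blog.example.org"));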
Does that blend? I don't think so. That's still way too much code in order to deal with something that I cannot handle at all in the current context - and compared to the Java 7 version, it's not much of an improvement either. That one does not even carry the stacktrace, so the actual code would look more like this.
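Assuming a propagate variant that takes the wrapper as a second argument to preserve the cause, something like:

// still the utility style, but with an explicit wrapper to keep the cause
URI uri = Throwables.propagate(() -> new URI("http://blog.example.org"),
        RuntimeException::new);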
Judging by the number of characters, and taking into account that this snippet does not even tell the reader which sort of exception was expected, I would always go for the classic try { .. } catch ().

Or I'd Use Xtend.

Xtend will transparently throw the checked exception if you don't care about it. However, if you want to catch and handle it, feel free to do so. For all other cases, the Xtend compiler uses the sneaky-throw mechanism that is also used in Project Lombok. It just uses some generics magic to trick the Java compiler, thus allowing you to throw a checked exception without declaring it. You are free to catch it whenever you want. There is no need to wrap it into some sort of RuntimeException just to convince the compiler that you know what you are doing.

By the way: You could of course use something like Throwables with Xtend, too.
That's what I consider fixing checked exceptions.

Tuesday, November 27, 2012

Performance Is Not Obvious

Recently there was a post in the Xtext forum about the runtime performance of a particular function in the Xtext code base - a small helper that walks the containment hierarchy to find a container of a given type.


Ed Merks suggested rewriting the method as a loop instead of a recursive function and saving one invocation of the method eContainer, such that the new implementation should become "at least twice as fast."


I really liked the new form since it is much easier to debug and to read, and from that point of view the change is definitely worth it. However, as I had recently done a deep dive into the runtime behavior of the JVM, I doubted that the change would have much impact on the actual performance of that method. So I took the time and sketched a small Caliper benchmark in order to double-check my intuition.
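The two variants look roughly like this (sketched from memory, not the verbatim Xtext code; EObject is the EMF base interface):

import org.eclipse.emf.ecore.EObject;

// the original, recursive variant
public static <T extends EObject> T getContainerOfType(EObject obj, Class<T> type) {
    if (obj == null)
        return null;
    if (type.isInstance(obj))
        return type.cast(obj);
    return getContainerOfType(obj.eContainer(), type);
}

// the suggested rewrite: a loop that saves the extra eContainer invocation
public static <T extends EObject> T getContainerOfTypeLoop(EObject obj, Class<T> type) {
    for (EObject current = obj; current != null; current = current.eContainer()) {
        if (type.isInstance(current))
            return type.cast(current);
    }
    return null;
}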


As it turns out, the refactored variant is approximately 5 to 20% faster for the rare cases where a lot of objects have to be skipped before the result is found, and it takes the same time for the common case where the requested instance is found immediately. So it's not even close to the expected improvement. But what's the take-away?


Before applying optimizations, it's worth measuring the impact. It may be intuitive to assume that cutting the number of method invocations down to a fraction of the original implementation - after all, it was a recursive implementation before - saves a lot of time, but actually the JVM is quite powerful and mature at inlining and producing optimized native code. So I can only repeat the conclusion of my talk about Java performance (which I proposed for EclipseCon Boston, too):
  • [..] Write Readable and Clear Code. [..]
(David Keenan)
  • [..] slavishly follow a principle of simple, clear coding that avoids clever optimizations [..] (Caliper FAQ)
  • Performance advice has a short shelf-life 
(B. Goetz)
From that point of view, the refactored implementation is definitely worth it, even though there is no big impact on the runtime behavior of the method.

Wednesday, November 14, 2012

Xtext Corner #9 - About Keywords, Again

In recent weeks, I compiled some information about the proper usage of keywords and generally about terminals in Xtext:
  • Keywords may help to recover from parse errors in a sense that they guide the parser.
  • It's recommended to use libraries instead of a hard wired keyword-ish representation for some built in language features.
  • Data type rules are the way to go if you want to represent complex syntactical concepts as atomic values in the AST.
In addition to these hints, there is one particular issue that arises quite often in the Xtext forum. People often wonder why their grammar does not work properly for some input files but perfectly well for others. What it boils down to in many of these cases is this:
Spaces are evil!
This seems to be a bold statement, but let me explain why I think that keywords should never contain space characters. I'm assuming you use the default terminals, but actually this holds for almost all terminal rules that I've seen so far. There is usually a concept of an ID which is defined similar to this:

terminal ID: 
  ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*;

IDs and Keywords

IDs start with a letter followed by an arbitrary number of additional letters or digits. And keywords usually look quite similar to an ID. No surprises so far. Now let's assume a keyword definition like 'some' 'input' compared to 'some input'. What happens if the lexer encounters an input sequence 'som ' is the following: it starts to consume the leading 's' and has not yet decided which token to emit, since it could become a keyword or an identifier. Same for the 'o' and the 'm'. The trailing space is the character where it can finally decide that 'som ' contains two tokens: an identifier and a whitespace token. So far so good.
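A minimal illustration of the two options (a sketch, rule names invented):

// error prone: the keyword contains a space
Example: 'some input' name=ID;

// robust: two adjacent keywords
Example: 'some' 'input' name=ID;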

Let the Parser Fail - For Free

Now comes the tricky part, since the user continues to type an 'e' after the 'm': 'some '. Again, the lexer starts with the 's' and continues to consume the 'o', 'm' and 'e'. No decision has been made yet: it could still be an ID or the start of the keyword 'some input'. The next character is a space, and that's the crucial part here: if the grammar contains a keyword 'some input', the space is expected since it is part of the keyword. Now the lexer has only one valid alternative. After the space it is keen on consuming an 'i', 'n', 'p', 'u' and 't'. Unfortunately, there is no 'i' in the parsed text since the lexer has already reached the end of the file.

As already mentioned in an earlier post, the lexer will never roll back to the token 'some' in order to create an ID token and a subsequent whitespace. In fact, the space character was expected as part of a single token, so it was safe to consume it. Instead of rolling back and creating two tokens, the lexer will emit an error token which cannot be handled by the parser. Even though the text appeared to be a perfectly valid ID followed by a whitespace, the parser will fail. That's why spaces in keywords are considered harmful.

In contrast, the grammar variant with two separate keywords works fine. Here, the user is free to apply all sorts of formatting to the two adjacent keywords: any number of spaces, line breaks or even comments can appear between them; all are valid and handled well by the parser. If you are concerned about the convenience in the editor - after all, a single keyword with a space seems to be more user-friendly in the content assistant - I recommend tweaking that one instead of using an error-prone grammar definition.

Thursday, November 8, 2012

Xtext Corner #8 - Libraries Are Key

In today's issue of the Xtext Corner, I want to discuss the library approach and compare it to some hard-coded grammar bits and pieces. The question of which path to choose often arises if you want to implement an IDE for an existing language. Most languages use a run-time environment that exposes some implicit API.

Just to name a few examples: Java includes the JDK with all its classes, and the virtual machine has a notion of primitive types (as a bonus). JavaScript code usually has access to a DOM including its properties and functions. The DOM is provided by the run-time environment that executes the script. SQL in turn has built-in functions like max, avg or sum. All these things are more or less an integral part of the existing language.

As soon as you start to work on an IDE for such a language, you may feel tempted to wire parts of the environment into the grammar. After all, keywords like int, boolean or double feel quite natural in a Java grammar - at least at first glance. In the long run it often turns out to be a bad idea to wire these things into the grammar definition. The alternative is to use a so-called library approach: the information about the language run-time is encoded in an external model that is accessible to the language implementation.

An Example

To use the Java example again (and for the last time in this post): the ultimate goal is to treat types like java.lang.Object and java.util.List in the same way as int or boolean. Since we did this already for Java as part of the Xtext core framework, let's use a different, somewhat artificial example in the following. Our dummy language supports function calls, of which max, min and avg are implicitly available.
The hard-coded approach (sketched below) looks quite simple at first. A simplified view of things will lead to the conclusion that the parser will automatically check that the invoked functions actually exist, content assist works out of the box and even the coloring of keywords suggests that the three enumerated functions are somehow special.
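Such a hard-coded variant could look like this (a sketch; Expression stands in for the rest of the language):

FunctionCall:
  function=('max' | 'min' | 'avg') '(' args+=Expression (',' args+=Expression)* ')'
;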

Not so obvious are the pain points (which come for free): the documentation for these functions has to be hooked up manually, and their complete signatures have to be hard-coded, too. The validation has to be aware of the parameters and return types in order to check conformance with the actual arguments. Things become rather messy beyond the first quickly sketched grammar snippet. And last but not least, there is no guarantee that the set of implicit functions stays stable with each and every version of the run-time. If the language inventor introduces a new function sum in a subsequent release, everything has to be rebuilt and deployed. And you can be sure that the to-be-introduced keyword sum will cause trouble in at least one of the existing files.

Libraries Instead of Keywords

The library approach seems to be more difficult at first, but it pays off quickly. Instead of using hard-coded function names, the grammar uses only a cross-reference to the actual function, as sketched below. The function itself is modeled in another resource that is also deployed with the language.
This external definition of the built-in functions can usually follow the same guidelines as custom functions do. But of course they may even use a simpler representation. Such a stub may only define the signature and some documentation comment, but not the actual implementation body. It's actually pretty similar to header files. As long as there is no existing format that can be used transparently, it's often easiest to define an Xtext language for a custom stub format. The API description should use the same EPackage as the full implementation of the language. This ensures that the built-ins and the custom functions follow the same rules, and all the utilities like the type checker and the documentation provider can be used independently of the concretely invoked function.
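Sketched as grammar fragments, the call site and a stub format could look like this (again with invented names):

FunctionCall:
  function=[FunctionDeclaration] '(' args+=Expression (',' args+=Expression)* ')'
;

// the stub format: built-ins like max, min and avg are declared
// in a library resource that is deployed with the language
FunctionDeclaration:
  'function' name=ID '(' (params+=ID (',' params+=ID)*)? ')'
;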

If there is an existing specification of the implicit features available, that one should be used instead. Creating a model from an existing, processable format is straightforward, and it avoids mistakes because there is no redundant declaration of the very same information. In both cases there is a clear separation of concerns: the grammar remains what it should be - a description of the concrete syntax - and not something that is tied to the run-time. The API specification is concise and easy to grasp, too. And in case an existing format can be used for that purpose, it's likely that the language users are already familiar with it.

Wrap Up

You should always consider using external descriptions or header stubs of the environment. A grammar that is tightly coupled to a particular version or API is quite error-prone and fragile. Any evolution of the run-time will lead to grammar changes which will in turn break existing models (that's a promise). Last but not least, the effort for a seamless integration of built-in and custom functions for the end user far exceeds the effort for a clean separation of concerns.

A very sophisticated implementation of this approach can be explored in the Geppetto repository on GitHub. Geppetto uses puppet files and ruby libraries as the target platform, parses them and puts them into the scope of the project files. This example underlines another advantage of the library approach: it is possible to use a configurable environment. The APIs may differ from version to version, and the concrete variant can be chosen by the user. This would never be possible with a hard-wired set of built-ins.

Tuesday, November 6, 2012

Xtext Corner #7 - Parser Error Recovery

A crucial feature of the parser in Xtext is the ability to recover from errors: the parser must not fail on the first erroneous token in an invalid input but should continue after that token. In fact, it should continue to the end of the document and yield an AST that is incomplete but contains as many nodes as possible. This capability is called error recovery: the ability to consume input text that does not conform to the grammar.

Error recovery is obviously necessary in an interactive environment like an editor, since most of the time the input will actually be invalid. As soon as the user starts to type, the document may be broken in all sorts of ways. Users don't really care whether their actions are in line with some internal AST structure or grammar rules. Copy and paste, duplicating lines, removing portions of the file by using toggle comment or just plain insertion of characters into the editor - none of these operations should cause the parser to fail utterly. After all, content assist, the outline and code navigation are expected to work for broken documents, too - at least to some extent.

Recovery strategies 

The Antlr parser that is generated from an Xtext grammar supports different recovery strategies. If the parser encounters an invalid input, it'll basically perform one of the following operations:
  1. Skip the invalid terminal symbol
    If the current token is unexpected, the following terminal is considered. If that one matches the current expectation, the invalid token is skipped and flagged with an error.
  2. Inject the missing token
    The current terminal is not valid at this position in the parsing process but would be expected as the subsequent token. In that case, the parser may inject the missing token: it skips the current expectation and continues with the next step in the parsing. The information about the missing token is annotated on the current input symbol.
  3. Pop the parsing stack
    If the input is broken in a way that does not allow the parser to skip or inject a single token, it'll start to consume the following terminal symbols until it sees a token that is somehow unique in the expectation according to the current parsing stack. The parser will pop the stack and do a re-spawn on that element. This may happen in the rare case that the input is almost completely messed up.
  4. Fail
    The mentioned recovery strategies may fail due to the structure of the grammar or the concrete error situation in the input. In that case parsing will be aborted. No AST for subsequent input in the document will be produced.

Helping the Parser

There are several things that you should watch out for if you experience poor error recovery in your language. First and foremost, it may be the absence of keywords in the grammar. Keywords are often the only anchor that the parser can use to identify proper recovery points. If you feel tempted to write an overly smart grammar without any keywords because it should look and feel like natural language, you should really reconsider your approach. Even though I don't want to encourage a keyword hell, keywords are quite convenient if they are used properly. And please note that things like curly braces, parentheses or other single-character symbols are as good a keyword as other, longer sequences - at least from the parser's perspective. To give a very popular example: instead of using indentation to describe the structure of your language (similar to Python), using a C-style notation may save you a lot of effort with the grammar itself and provide a better user experience when editing code. And keywords also serve as a nice visual anchor in an editor, so users will have an easier time reading code in your language.

A second strategy to improve the behavior of the parser, and the chance for nice error recovery, is the auto-edit feature. It may have some flaws, but it's quite essential for a good user experience. The most important aspect here is the insertion of closing quotes for strings and comments. As soon as you have an input sequence that is not only broken for the parser but lets even the lexer choke, you are basically screwed. Therefore, multiline comments and strings are automatically closed as soon as you open them. If you use custom terminal rules, you should really consider looking for unmatched characters that should be inserted in pairs according to the lexer definition. The same rule basically applies to paired parentheses, too. Even though the concrete auto-edit features may still need some fine-tuning to not get in the way of the user, they already greatly improve the error recovery of the parser.

Friday, November 2, 2012

Xtext Corner #6 - Data Types, Terminals, Why Should I Care?

Issue #6 of the Xtext Corner is about some particularities of the parsing process in Xtext. As I have mentioned a few times in the past, Xtext uses Antlr under the hood to do the actual parsing. This is basically a two-step process: at first, the input sequence of characters is split into tokens (often referred to as terminals) by a component that is called the lexer. The second step is to process the resulting list of tokens. The actual parser is responsible for that stage. It will create the abstract syntax tree from the token stream.

This divide-and-conquer approach is usually just called parsing altogether, so the distinction between lexing and parsing is quite well encapsulated. Nevertheless, the Xtext grammar definition honors both aspects: it is possible to define (parser) rules that are processed by the parser, and it is also possible to define terminal rules. Those will be handled in the lexer. So when should I use parser rules, and when should I use terminal rules?

Production Rules (also: Parser Rules)

The obvious case, and just for the sake of completeness: production rules will yield an instance in the abstract syntax tree. These can only be implemented by the parser, thus there is no question of whether to use terminals instead. Production rules are the most common rules in almost every Xtext grammar.

Data Type Rules

Those are a completely different thing, even though they are handled by the parser, too: where ordinary parser rules produce instances of EClasses, data type rules return data types (you did not guess that, did you?). Data types in the sense of Xtext and its usage of the Eclipse Modeling Framework are basically primitive Java types, Strings or other common types like BigDecimal or enums. The parser will not create those on its own but rather pass the consumed tokens as a string to a value converter. The language developer is responsible for converting the string to a data type, as sketched below.
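A sketch of such a converter, assuming a data type rule named DECIMAL and the usual declarative value converter service (the class would be bound as the IValueConverterService in the runtime module):

import java.math.BigDecimal;

import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.impl.AbstractNullSafeConverter;
import org.eclipse.xtext.conversion.impl.DefaultTerminalConverters;
import org.eclipse.xtext.nodemodel.INode;

public class MyDslValueConverters extends DefaultTerminalConverters {

    @ValueConverter(rule = "DECIMAL")
    public IValueConverter<BigDecimal> DECIMAL() {
        return new AbstractNullSafeConverter<BigDecimal>() {
            @Override
            protected BigDecimal internalToValue(String string, INode node) {
                // the parser hands over the consumed tokens as plain text
                return new BigDecimal(string);
            }

            @Override
            protected String internalToString(BigDecimal value) {
                return value.toPlainString();
            }
        };
    }
}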

Terminal Rules

Terminal rules are essentially the same as data type rules if you only consider the interface to the grammar. Internally, they are completely different since they are processed not by the parser but by the lexer. The consequences are quite severe if you want to get a working grammar. But one step at a time: as already mentioned, terminal rules can return the very same things as data type rules. That is, they yield Strings, ints or other primitives. But since they are handled by the lexer, they are not quite as powerful as data type rules.

Implementation aspects

The lexer is a pretty dumb component which is generated in a way that favors performance over intuitive behavior. Where the parser generator will produce nice error messages in case of ambiguous data type rules, conflicting terminals are mostly resolved by a first-come-first-served principle. For terminals it's crucial to get the order right. Consider the following terminal rules:
terminal ID: ('a'..'z') ('a'..'z'|'0'..'9')*;
terminal CHARS: ('a'..'z'|'_')+;
The ID rule shall consume something that starts with a lowercase letter, followed by lowercase letters or digits. The CHARS rule is pretty close to that one: it shall match a sequence that contains only lowercase letters or underscores. The problem with these is that the matched sequences are not mutually exclusive. If you take the input abc as an example, it will be matched as an ID, given the two rules above. As soon as you switch the order of the declarations, the sequence abc will all of a sudden be returned as CHARS. That's one thing that you have to be aware of if you use terminal rules. But there is more to keep in mind.

Terminal rules are applied without any contextual information, which has some interesting implications. The plus: it's easily possible to use the lexer on partial input - entry points can be computed almost trivially. There is no such thing as an entry rule as there is for the parser. But the disadvantages have to be taken into account, too. The lexer does not care about the characters that are still to come in the input sequence. Everything that matches will be consumed. And not reverted (as of Antlr 3.2). To explain what that means, let's take another example:
terminal DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;
The rules define decimal numbers in a hypothetical language that also supports method invocation. At first glance, things seem to work fine: 123 is consumed as an INT where 123.456 will be a DECIMAL. But the devil's in the details. Let's try to parse the string 123.toString(). The lexer will find an INT 123 - so far, so good. Now it sees a dot, which is expected by the terminal rule DECIMAL. The lexer will consume the dot and try to read an INT afterwards - which is not present. Now it'll simply fail for the DECIMAL rule but never revert that dot character which was consumed almost by accident. The lexer will create an invalid token sequence for the parser, and the method call cannot be read successfully. That's because the lexer simply does not know about things like the expectation in the current parser state. Attempts to define decimals, qualified names or other more complex strings like dates in terminal rules are very error-prone, while they can often be implemented quite easily by means of data type rules:
DECIMAL: INT '.' INT;
terminal INT: '0'..'9'+;

Terminals Considered Harmful?

Data type rules move the decision from the dumb, highly optimized lexer to the parser, which has a lot more information at hand (the so-called look-ahead) to make decisions. So why not use data type rules everywhere? The simple answer is: performance. The duo of lexer and parser is optimized for a stream of reasonably sized tokens instead of hundreds of single characters. Things will not work out that well with Antlr at run-time if the parser is implemented in a so-called scannerless manner. The rule of thumb here is to use only a small number of terminal rules that can be distinguished easily, and to put data type rules on top of those. It'll simplify your life as a language developer tremendously.

Thursday, November 1, 2012

JUGFFM: A Scenic View, Ebblewoi and Very Nice People

Yesterday I had the opportunity to give a presentation about Xtend at the JUG Frankfurt. I really enjoyed it since the audience had a lot of very good questions, and quite some interesting discussions unfolded from those during the talk and thereafter. Many thanks to Alex who organized the event.


The JUGF Stammtisch took place in the German National Library, which is such an amazing location. We were in a room on the upper floor, and the night view of the skyline of Frankfurt was almost paralyzing - I even forgot to take a picture... For the informal Stammtisch after the talk we changed location to a secret Franconian Apfelwein Schenke whose coordinates may not be disclosed. According to the locals, it's one of the last resorts in Frankfurt that's still rather free from tourists (except for the Kieler guy who will probably never manage to pronounce Ebblewoi correctly).

In the pub, the discussions continued over cutlet with traditional Green Sauce and typical Franconian cider or beer, since only Hessian stomachs can handle proper amounts of Ebblewoi. Unfortunately, I had to leave at 10pm since I had to catch the receptionist at the hotel (thanks for the short briefing before I left; Apple Maps indeed tried to play tricks on me...).

Long story short: the JUGF is a really nice crowd and I enjoyed the company a lot. Their next meeting is already on 07 Nov 2012. If you are in Frankfurt next week, make sure to stop by if you want to discuss DevOps topics.

Wednesday, October 31, 2012

Xtext Corner #5 - Backtracking vs Syntactic Predicates

The Xtext grammar language allows you to create a working parser in almost no time. Its concise notation for describing the concrete syntax and the mapping to an object model gives you quite a jump start when you want to create a language. Nevertheless, it's also quite easy to get into some trouble. Xtext uses Antlr 3.2 as the underlying parser technology, and we try really hard to hide the complexity and peculiarities of Antlr. Unfortunately, that's not possible in all cases. From time to time Antlr will report ambiguities in the grammar definition with a charming message like this:
warning(200): Decision can match input such as "{EOF, RULE_ID, '('}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The parser generator basically complains about an ambiguous grammar. At some point in the syntax description it cannot decide which path to follow for a given input sequence. It's rather obvious that the warning message is not really helpful. Neither is there any chance to find the line that caused the error (which is not a problem of Antlr but caused by the translation of Xtext to Antlr), nor is it easily possible to spot the concrete decision that the parser generator complains about. The worst thing about this is that it's not really a warning either. What the parser generator basically did is the following: it removed a possible path from the grammar description. It will always choose the one remaining path in that particular situation. Which could, by chance, be the one that you'd expect. But it could also be the wrong path.

AntlrWorks

Fortunately, there is a tool that helps to identify the problem: AntlrWorks lets you take a look at the grammar and visualizes all its problems graphically. It's still far from trivial to find the root cause of a problem, but better than nothing. Make sure you pick version 3.2 from the download section if you want to give it a try.

Now you may wonder how to handle cases that are ambiguous by definition and by intention. You could of course enable backtracking for your language, and afterwards everything appears to be fine. However, you can think of backtracking as a wildcard that permits Antlr to remove alternatives from your grammar wherever it spots an ambiguity. This will shadow the real problems in the grammar that may be introduced by subsequent changes, a refactoring, or new language features. That's why I strongly recommend going the hard way and analyzing the root cause of the warnings. As soon as you have found the actual decision that the parser generator complained about, you can use a syntactic predicate to fix it locally. Now you are in control of which alternative to remove and which path to follow. I think it makes perfect sense to be in charge in those cases.
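For completeness - and as something you should rarely do - backtracking is switched on in the MWE2 workflow of the language, roughly like this (a sketch; the exact fragment name may differ between Xtext versions):

    // in the generator workflow of the language
    fragment = parser.antlr.XtextAntlrGeneratorFragment {
        options = {
            backtrack = true
        }
    }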

Backtracking

But the shadowing of errors at generation time is only one drawback of backtracking. It also leads to surprising messages at run-time. Consider the following snippet, where it's easy to see that the right operand of the binary operation is missing.
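Sketched in a hypothetical toy language with function declarations and integer arithmetic (the concrete syntax doesn't matter, only the incomplete expression does):

    function calc() {
        1 +
    }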
The parser will correctly report something along the lines of
    mismatched input '}' expecting RULE_INT
Unfortunately, it does so at a totally unexpected location. With backtracking enabled, if the algorithm decides that the function declaration is not complete - it fails to read a valid function body - the parser rolls back to the start of the function and puts the most specific error message on that token. You'll see an error marker under the keyword function. However, it would be more intuitive to have the error on the binary operation. At least that's what I would expect - wouldn't you?

Syntactic Predicates

Nevertheless, it's not always possible to write an unambiguous grammar. Some common patterns are inherently ambiguous. The most famous one is the dangling else: if a language allows nested if-else constructs, it is not defined which if a subsequent else keyword belongs to. Consider the following Java snippets, which only differ in formatting:
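Two such snippets (with made-up condition and method names) might look like this:

    // The indentation suggests the else belongs to the outer if ...
    if (weatherIsNice)
        if (isWeekend)
            goHiking();
    else
        stayAtHome();

    // ... while here it suggests the else belongs to the inner if.
    if (weatherIsNice)
        if (isWeekend)
            goHiking();
        else
            stayAtHome();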

The semantics of both code snippets should be independent of the formatting. Nevertheless, the input is ambiguous for the parser, in the same way a reader might be confused by inconsistent indentation. Therefore, you have to force the parser into one concrete direction in order to disambiguate the grammar: a syntactic predicate has to be added.
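In the Xtext grammar language, the fix could look roughly like this (a sketch with made-up rule and feature names); the predicate unconditionally binds a trailing else to the innermost if:

    IfStatement:
        'if' '(' condition=Expression ')' then=Statement
        (=>'else' elseStatement=Statement)?;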

The => operator forces the parser to take a certain path if the input sequence would allow two or more possible decisions. It can be read as: if you see these tokens, go this way. It's even possible to use alternatives or groups of elements as the criterion. Only the UnorderedGroup is prohibited in predicates.
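As another sketch in the Xtext grammar language (again with made-up rule and feature names):

    Statement:
        =>(receiver=ID '.' feature=ID '=') value=Expression
        | Expression;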

In this example, the parser will follow the first alternative if it can look ahead to a sequence like person.name= (or, more abstractly, ID '.' ID '=').

Implementation Detail

One important thing to note: syntactic predicates in Xtext are different from plain Antlr predicates. In the Xtext grammar language, it's only possible to use a complete or partial sequence of production tokens as the predicate, whereas Antlr allows arbitrary tokens that seem to be independent of the actual rule content. Antlr's approach appears to be more powerful here, but only at first glance. Firstly, Xtext's variant is easier to use, since you don't have to repeat parts of your grammar manually. Secondly, the framework does the heavy lifting: the syntactic predicates in Xtext are automatically propagated to the right places, which you'd otherwise have to do by hand. Just insert the predicate at the spot that you identified with AntlrWorks, and you're done.

Monday, October 29, 2012

Xtext Corner Revived

It's been a long time since I last wrote about Xtext tips and tricks. However, while preparing my Xtext Best Practices session for this year's EclipseCon, I assembled a bunch of interesting tips and tricks that I want to share with you.



The talk starts with a short overview of how I personally like to tackle the task of implementing a language with Xtext. If the syntax is not yet carved in stone, I usually start off with some sketched sample files to get an idea of the different use cases. In doing so, it's quite important to find a concise notation for the more common cases and to be more verbose with the unusual patterns that are anticipated in the language. As soon as the first version of the syntax is settled, the obvious next step is the grammar declaration.

That's a task that I really like. The grammar language of Xtext is probably the most concise and information-rich DSL that I've ever worked with. With very few orthogonal concepts, it's possible to describe how a text is parsed and, in the very same breath, how the parsed information is mapped to an in-memory representation. This representation is called the abstract syntax tree (AST) and is often referred to as the model. The AST that Xtext yields is strongly typed and therefore heterogeneous, but it still provides generic traversal possibilities since it is based on the Eclipse Modeling Framework (EMF, also: Ed Merks Framework). So the grammar is about the concrete syntax and its mapping to the abstract syntax.
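As a minimal illustration, consider a grammar along the lines of the well-known greeting example from the Xtext documentation (names and namespace URI chosen for illustration). The two rules describe the concrete syntax and, at the same time, the AST types Model and Greeting with their features:

    grammar org.example.MyDsl with org.eclipse.xtext.common.Terminals

    generate myDsl "http://www.example.org/MyDsl"

    Model:
        greetings+=Greeting*;

    Greeting:
        'Hello' name=ID '!';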


As soon as the result of the parsing is satisfying, the next step in implementing a language is scoping. Without it, any subsequent implementation steps are quite a waste of effort. Scoping is the facility that enriches the information in the AST by creating a graph of objects, the abstract syntax graph (ASG). This process is often called cross-linking: some nodes in the tree are linked with others that are not directly related to them in the first place. This is one of the most important aspects of a language implementation, because after linking and scoping are done, the model is far more powerful from a client's perspective. Any code written on top of it can leverage and traverse the complete graph, even if the concrete language is split across many files.
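To give an idea of what this looks like in practice, here is a minimal sketch of a declarative scope provider; Function, its parameters, and VariableRef with its feature variable are hypothetical AST types of a toy language:

    import org.eclipse.emf.ecore.EReference;
    import org.eclipse.xtext.scoping.IScope;
    import org.eclipse.xtext.scoping.Scopes;
    import org.eclipse.xtext.scoping.impl.AbstractDeclarativeScopeProvider;

    public class MyDslScopeProvider extends AbstractDeclarativeScopeProvider {

        // Invoked by naming convention for the feature 'variable' of the
        // type VariableRef: references inside a function are resolved
        // against the function's own parameters.
        IScope scope_VariableRef_variable(Function context, EReference ref) {
            return Scopes.scopeFor(context.getParameters());
        }
    }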

Validation is the next step, and it is implemented on top of the linked ASG. While the parser and the linking algorithm already produce error annotations for invalid input sequences, it's the static constraint checking that finds the remaining semantic problems in the input. If the files were parsed and linked successfully and the static analysis does not reveal any problems, the model can be considered valid.
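In Xtext, such a constraint is typically a method annotated with @Check in the language's validator; a minimal sketch, assuming a hypothetical Variable type and its generated EMF package:

    import org.eclipse.xtext.validation.Check;

    public class MyDslValidator extends AbstractMyDslValidator {

        @Check
        public void checkNameStartsWithLowercase(Variable variable) {
            if (Character.isUpperCase(variable.getName().charAt(0))) {
                warning("Variable names should start with a lowercase letter",
                        MyDslPackage.Literals.VARIABLE__NAME);
            }
        }
    }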

Now that we can be sure that the ASG - the in-memory representation of the files - fulfills the semantic constraints of the language, it's possible to implement the execution layer, which is usually a compiler, a code generator, or an interpreter. Actually, those three are all very similar: you can think of a code generator as an interpreter that evaluates a model to a string, and a compiler is pretty much the same as a code generator except that the output is not plain text but a sequence of bytes. The important thing is that the evaluation layer should (at least in the beginning) only consider valid input models. This dramatically simplifies the implementation, and it's the reason why I like to build it on top of a checked ASG: you don't have to take all the possibly violated constraints into account.
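To make that concrete, here is a minimal sketch of a generator against the Xtext 2.x API, reusing the hypothetical greeting model from above:

    import org.eclipse.emf.ecore.resource.Resource;
    import org.eclipse.xtext.generator.IFileSystemAccess;
    import org.eclipse.xtext.generator.IGenerator;

    public class MyDslGenerator implements IGenerator {

        // 'input' is a parsed, linked, and validated resource; the
        // generator simply evaluates the model to a string.
        @Override
        public void doGenerate(Resource input, IFileSystemAccess fsa) {
            Model model = (Model) input.getContents().get(0);
            fsa.generateFile("greetings.txt",
                    "Number of greetings: " + model.getGreetings().size());
        }
    }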

Then there is of course still the huge field of the user interface that entwines around the editor and its services, like content assist, navigation, or syntax coloring. However, I would usually postpone that until the language runtime works at least to some extent.

The most important message of this intro is that this is not a waterfall process. All of it can be implemented in small iterations, each accompanied by refined sample models, unit tests (!), and feedback from potential users.

In the next days, I'll wrap up some of the main points of my presentation, which will be about grammar tips and some hints on scoping, validation, and content assist. Stay tuned!