Replace U+00a0 (nbsp) (or other unicode) characters with VS Code

On my Mac I have the issue that when I type the pipe character | and then pressing space, often the non-break space character (unicode U+00a0) is inserted. If anyone knows how disable this “feature” drop me a note here, this is quite annoying but I did not found the reason for this behaviour.

This looks like normal space, but VS Code highlights these places. In this example, I’m working on asciidoc files where this happens quite frequently:

This is just a small tip to fix these files with the “Replace in Files” feature. Open this from the “Edit menu”, activate Use Regular Expression (.*) and search for the unicode character with regex [\xa0] and replace by space.

grep command to filter distinct values from XML tags

I have a ton of Oracle Forms XML export files and wanted to know, which different patterns occur for the value of the FormatMask XML attribute. The input looks as follows:

<Item Name="CREATION_DATE" UpdateAllowed="false" DirtyInfo="false" Visible="false" QueryAllowed="false" InsertAllowed="false" Comment="TABLE ALIAS&amp;#10;  FDA&amp;#10;&amp;#10;BASED ON TABLE&amp;#10;  TMI_FINANCIAL_DATA&amp;#10;&amp;#10;COLUMN USAGES&amp;#10;  ...    CREATION_DATE                 SEL&amp;#10;" ParentModule="OBJLIB1" Width="10" Required="false" ColumnName="CREATION_DATE" DataType="Date" ParentModuleType="25" Label="Creation Date" ParentType="15" ParentName="QMSSO$QUERY_ONLY_ITEM" MaximumLength="10" PersistentClientInfoLength="142" ParentFilename="tmiolb65_mla.olb" FormatMask="DD-MM-RRRR">

A naive grep command would print out the whole line, including the file name. After some iterations I came to the following command, which does what I want in a single line.

grep -R -h -o -e FormatMask=\"[^\"]* * | sed 's/FormatMask="//g' | sort | uniq

What the command does is:

  • grep recursively (-R) for a regular expression (-e)
  • search for FormatMask="<any-char-until-quotation>
  • print only the matching part of the line (-o). This will include the prefix FormatMask="
  • print without the file name (-h)
  • strip off the prefix with sed
  • sort the results alphabetically
  • remove duplicate lines (uniq)

The result (excerpt)is:

00
09
099
0999
0999999
09999999
0D0
0D999
0D9999
9
90
90D0
90D000
...