Re: Parsing: Help on ignoring quoted tokens.

On Jun 1, 7:30 am, paktsardi...@xxxxxxxxx wrote:
I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form. Here's a small

forms.txt contains something like:

# Registration Form
registration {
    [heading: Account Details] [ ]
    [label:"User Name:"] [textbox:username:amcnab:mandatory]
    [label:"First Name:"] [textbox:first_name:Andy]
    [label:"Last Name:"] [textbox:last_name:McNab]
    [label:"Password:"] [passbox:passwd::mandatory]
}

# Error form
error {
    [heading:Explosion Error!][]
    [label:"Vent Gas?:"] [select:vent:yes|no:no]
}

[.*] denotes an html table cell.


Now, my question is: what is the best way to approach the parsing of
this file?

If you say "parse a text file", you are usually dealing with brackets
and/or nested { ... } constructs and I can clearly see the
"registration { ... }" - and "error { ... }" - structure in your

I strongly recommend first reading perlfaq4: "How do I find
matching/nesting anything?"
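
For reference, the kind of nested matching that FAQ entry is concerned with can be sketched with a recursive regex (this needs Perl 5.10 or later). The sub name balanced_blocks is made up for this illustration, and this is only one possible approach, not necessarily the one the FAQ recommends:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every top-level balanced { ... } group
# found in $text. (?1) re-enters capture group 1, which is what lets the
# pattern follow arbitrarily deep nesting.
sub balanced_blocks {
    my ($text) = @_;
    my @blocks;
    while ($text =~ m/ ( \{ (?: [^{}]++ | (?1) )* \} ) /gx) {
        push @blocks, $1;
    }
    return @blocks;
}

# Usage: prints the outer group once, inner braces included.
print "$_\n" for balanced_blocks('a { b { c } d } e { f }');
```

If the file never nests braces, none of this machinery is needed, which is the point of the simplification that follows.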

However, in order to keep this simple, I would suggest making a few
assumptions about the structure of your file, thereby effectively
eliminating the inherent nested structure.

Those assumptions would be, for example:
- there are no nested { ... } constructs.
- each { ... } construct begins with a single line of the form
/^\w+\s*{$/ and ends with a single line of the form /^}$/
- inside a { ... } construct, each line begins with whitespace (/^\s+/)
and consists of cells matching /\s*\[.*?\]/g
- the first line inside a { ... } construct would be of the form
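
A minimal sketch of how assumptions like these could be checked up front, before any HTML is generated. The sub name check_structure and the wording of the messages are invented for this example, the braces are escaped to keep newer Perl versions quiet, and the heading rule for the first line is not checked here:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in @lines.
# An empty list means the input fits the assumed line-by-line structure.
sub check_structure {
    my @lines = @_;
    my @problems;
    my $inside = 0;    # true while between "name {" and "}"
    my $n      = 0;
    for my $line (@lines) {
        $n++;
        chomp(my $text = $line);
        next if $text =~ /^\s*$/ or $text =~ /^#/;    # blank or comment
        if ($text =~ /^\w+\s*\{$/) {
            push @problems, "line $n: nested block" if $inside;
            $inside = 1;
        }
        elsif ($text =~ /^\}$/) {
            push @problems, "line $n: unmatched '}'" if !$inside;
            $inside = 0;
        }
        elsif ($inside and $text =~ /^\s+(?:\[.*?\]\s*)+$/) {
            # an indented row of [ ... ] cells: nothing to report
        }
        else {
            push @problems, "line $n: unexpected '$text'";
        }
    }
    push @problems, "end of input: block not closed" if $inside;
    return @problems;
}
```

Running it over the lines of the file before conversion would report a nested or unclosed block early instead of producing broken HTML.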

This would allow you to process the file line by line using only regexes
while still producing valid HTML. At first, this solution seems
oversimplified, but as long as you can keep away from nested
structures, you can easily add/remove/modify more regexes in a trial-
and-error approach as you develop your Perl program from the bottom

Here is how I would start the bottom-up approach with your test-file:

use strict;
use warnings;

my $inputfile = 'forms.txt';
open my $inp, '<', $inputfile
    or die "Error 0010: open < '$inputfile': $!";

my $comment = '';
while (<$inp>) {
    chomp;    # otherwise (.*) under /s would capture the newline too

    # A "# ..." line is remembered as the title of the next form.
    if (m{^\#\s*(.*)$}xms) {
        $comment = $1;
    }

    # An indented line of [ ... ] cells becomes one table row.
    if (m{^\s+\[}xms) {
        my @td = m{\[(.*?)\]}gxms;

        # The first row after a comment must be the heading row.
        if ($comment ne '') {
            if (@td != 2
                or $td[0] !~ m{^heading:(.*)$}xms) {
                die "Error 0020: unexpected '$_'";
            }
            print "<h2>$1 ($comment)</h2>\n";
            print "<table>\n";
            $comment = '';
        }

        print " <tr>\n";
        for my $element (@td) {
            if ($element =~ m{^\s*$}xms) {
                print " <td>&nbsp;</td>\n";
            }
            else {
                print " <td>$element</td>\n";
            }
        }
        print " </tr>\n";
    }

    # A closing brace ends the current form.
    if (/^}/xms) {
        print "</table>\n";
        $comment = '';
    }
}

close $inp;
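
The cells printed by the loop above still hold raw text such as label:"User Name:", where the colon inside the quotes must not be treated as a field separator. A minimal sketch of one way to split a cell on ':' while leaving quoted tokens intact. The sub name split_cell is made up for this example, and it assumes quotes are never escaped or nested:

```perl
use strict;
use warnings;

# Hypothetical helper: splits one cell into fields on ':'.
# A field written as "..." is taken whole, without its quotes,
# so colons inside it do not split.
sub split_cell {
    my ($cell) = @_;
    my @fields;
    while ($cell =~ m/ \G (?: "([^"]*)" | ([^:]*) ) (?: : | \z ) /gcx) {
        push @fields, defined $1 ? $1 : $2;
        last if pos($cell) >= length $cell;    # stop at end of cell
    }
    return @fields;
}

# Usage: an empty field between two colons is kept as ''.
print join('|', split_cell('passbox:passwd::mandatory')), "\n";
```

With this, split_cell('label:"User Name:"') gives the two fields label and User Name: rather than three.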

This approach is very flexible and extremely scalable. I have already
used it successfully to transform a plain old schema listing of a
mainframe database from basic ASCII format into HTML.

Bonus points if your answer makes no reference to lex or yacc. :)

Thanks for the bonus points :-)