Re: How do I get String.split() to do what I want?

From: Chris Smith (cdsmith_at_twu.net)
Date: 06/30/04


Date: Wed, 30 Jun 2004 07:31:59 -0600

Phil Hühn wrote:
> Thanks, I probably should have stated my problem in terms of parsing a
> CSV! ;-)
> Ta for the CSVreader, but I want to use the regex stuff if possible...
> if not I could simply have written a small method to tokenise the string
> at commas, ignoring those within single quotes, but surely I can do that
> with a regular expression..?
> Maybe I need a perl NG...

Regular expressions aren't magic, unfortunately, and CSV is a bit more
complex than a single regular expression can comfortably handle. There
are a lot of special cases. Trust me; I wrote a CSV parser using regular
expressions some time back. In the course of modifying it to handle the
variations on the format produced by the export features of many popular
programs, it became obvious that regular expressions were more trouble
than they were worth.
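
To make one of those special cases concrete, here is a minimal sketch of
what happens when a line with a quoted comma is handed straight to
String.split(). The class name NaiveSplitDemo and the sample line are
invented for illustration; they are not part of the code posted below.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the code posted below.
class NaiveSplitDemo
{
    /** Splits on every comma, with no awareness of quoting. */
    static String[] naiveSplit(String line)
    {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        return line.split(",", -1);
    }

    public static void main(String[] args)
    {
        // Three logical fields; the second contains a comma inside quotes.
        String line = "1,\"Smith, Chris\",42";

        // The quoted field is cut in two, so four pieces come back.
        String[] parts = naiveSplit(line);
        System.out.println(parts.length + " pieces: " + Arrays.asList(parts));
    }
}
```

A pattern can be made to cope with this one case, but doubled quotes,
embedded newlines, and unterminated quotes each need more handling.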

Besides, even if you implement a CSVReader in terms of regular
expressions, CSV is a pseudo-specified file format that sits at a higher
level of abstraction, and it deserves its own encapsulation.

If you don't want to maintain dependence on a third-party (even free or
open-source) utility to do this for you, I'm placing the following code
into the public domain. Just don't sue me. :)

Sorry for the wrapping; no easy way to fix it.

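import java.io.IOException;
import java.io.PushbackReader;
import java.util.ArrayList;
import java.util.List;

public class CSVUtil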
{
    /**
     * Reads the next logical line of the CSV file. Returns the next line
     * as a {@link java.lang.String} without the trailing newline, or
     * <code>null</code> if there is no more data to read before the end
     * of the file.
     *
     * <p>
     * The last line of a file is returned if it contains any characters,
     * but is ignored if it is empty. The contract for this method is
     * intended to approximate the contract of
     * {@link java.io.BufferedReader#readLine}.
     * </p>
     */
    public static String readLine(PushbackReader r)
        throws IOException
    {
        StringBuffer buf = new StringBuffer();
        int ch;
        boolean inQuote = false;

        while ((ch = r.read()) != -1)
        {
            if (ch == '\"')
            {
                inQuote = !inQuote;
            }
            else if (!inQuote && (ch == '\n'))
            {
                break;
            }
            else if (!inQuote && (ch == '\r'))
            {
                /*
                 * See if this is a CRLF pair.
                 */
                int ch2 = r.read();

                if (ch2 == '\n') break;
                else if (ch2 != -1) r.unread(ch2);
            }

            buf.append((char) ch);
        }

        if ((buf.length() == 0) && (ch == -1))
        {
            /*
             * Reached the end of the file, and there was nothing on the
             * line. This indicates an end of file, such that we should
             * return null, rather than the empty string.
             */
            return null;
        }
        else
        {
            /*
             * Return the line that was read.
             */
            return buf.toString();
        }
    }

    /**
     * Parses a logical line of CSV content into fields. An attempt is
     * made to reconstruct CSV data in a generally compatible way across
     * import sources.
     */
    public static String[] parse(String line)
    {
        List<String> fields = new ArrayList<String>();

        int pos = -1;

        while (pos < line.length())
        {
            pos++;

            /*
             * Determine the type of token. This could be plain,
             * single-quoted, double-quoted, or empty.
             */
            pos = skipWhitespace(pos, line);

            if ((pos >= line.length()) || (line.charAt(pos) == ','))
            {
                /*
                 * Empty. When there is no non-whitespace between two
                 * commas, the result is an empty field. Whitespace is
                 * always ignored.
                 */
                fields.add("");
            }
            else if (isQuoteCharacter(line.charAt(pos)))
            {
                char quoteChar = line.charAt(pos);

                /*
                 * Quoted. The contents extend from here to any of
                 * the following:
                 *
                 * 1. The end of the source string.
                 * 2. The next occurrence of a double-quote that is NOT
                 * followed by a second double-quote.
                 */
                StringBuffer field = new StringBuffer();
                pos++; /* skip the quote */

                boolean done = false;

                while (!done)
                {
                    int next = nextOccurrence(pos, line, quoteChar);
                    field.append(line.substring(pos, next));
                    pos = next;

                    if (next >= line.length())
                    {
                        /*
                         * Unterminated quote. Nevertheless, we'll take it.
                         */
                        done = true;
                    }
                    else if (next == line.length() - 1)
                    {
                        /*
                         * Quote is at the end of the line. It is,
                         * therefore, not doubled.
                         */
                        done = true;
                        pos++; /* skip the closing quote */
                    }
                    else if (line.charAt(next + 1) != quoteChar)
                    {
                        /*
                         * Quote is not doubled.
                         */
                        done = true;
                        pos++; /* skip the closing quote */
                    }
                    else
                    {
                        /*
                         * Quote is doubled. It should be considered part
                         * of the content, and does not end the field.
                         */
                        field.append(quoteChar);
                        pos += 2; /* skip both doubled quotes */
                    }
                }

                fields.add(field.toString());
            }
            else
            {
                /*
                 * Plain. Non-quoted fields may not contain quotes,
                 * commas, newlines, or leading or trailing whitespace.
                 */
                int next = nextOccurrence(pos, line, ',');
                String field = line.substring(pos, next);
                pos = next;

                fields.add(field.trim());
            }

            /*
             * Skip to the next comma. Any text found after a complete
             * element (only possible when elements are quoted) is
             * invalid, and will be discarded.
             */
            pos = nextOccurrence(pos, line, ',');
        }

        return (String[]) fields.toArray(new String[fields.size()]);
    }

    /**
     * Determines if a character should be considered a quote. As far as
     * I can tell, only double-quotes are supposed to be used for quoting
     * in CSV. However, I can recall (but can't find documentation for)
     * some mention of single quotes. This method is provided to abstract
     * the identity of a quote in case more are possible.
     */
    private static boolean isQuoteCharacter(char ch)
    {
        return ch == '\"';
    }

    /**
     * Returns the next character index in <code>line</code> that is not
     * whitespace, beginning at <code>pos</code>. If there is no such
     * index, the method returns <code>line.length()</code>.
     */
    private static int skipWhitespace(int pos, String line)
    {
        while (
            (pos < line.length())
            && Character.isWhitespace(line.charAt(pos)))
        {
            pos++;
        }

        return pos;
    }

    /**
     * Returns the next character index in <code>line</code> that
     * contains the specified character, <code>ch</code>, beginning at
     * <code>pos</code>. If there is no such index, the method returns
     * <code>line.length()</code>.
     */
    private static int nextOccurrence(int pos, String line, char ch)
    {
        while ((pos < line.length()) && (line.charAt(pos) != ch))
        {
            pos++;
        }

        return pos;
    }

    /**
     * Forms a logical line of CSV content from separate fields into one
     * String in the CSV format, with appropriate quoting. An attempt is
     * made to be compatible with the most basic CSV possible.
     */
    public static String form(String[] elements)
    {
        StringBuffer line = new StringBuffer();

        for (int i = 0; i < elements.length; i++)
        {
            String element = elements[i];

            if (
                (element.indexOf('\"') != -1)
                || (element.indexOf('\'') != -1)
                || (element.indexOf(',') != -1)
                || (element.indexOf('\r') != -1)
                || (element.indexOf('\n') != -1)
                || !(element.trim().equals(element)))
            {
                /*
                 * The element needs to be quoted. There are four cases
                 * where an element must be quoted:
                 *
                 * 1. It contains a single or double quote.
                 * 2. It contains a comma.
                 * 3. It contains a newline.
                 * 4. It has leading or trailing whitespace.
                 */
                line.append('\"');

                for (int j = 0; j < element.length(); j++)
                {
                    char ch = element.charAt(j);

                    if (ch == '\"')
                    {
                        line.append('\"');
                        line.append('\"');
                    }
                    else
                    {
                        line.append(ch);
                    }
                }

                line.append('\"');
            }
            else
            {
                line.append(element);
            }

            if (i < elements.length - 1)
            {
                line.append(',');
            }
        }

        return line.toString();
    }
}
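
If you would rather keep reading with BufferedReader.readLine(), the
quote tracking that readLine() above does can be sketched as a parity
check: a physical line containing an odd number of double quotes leaves
a quoted field open, so the logical line continues on the next physical
line. The class and method names here are made up for illustration, and
the sketch assumes only double quotes are used for quoting.

```java
// Hypothetical helper; not part of CSVUtil.
class QuoteParitySketch
{
    /**
     * Returns true if the text ends inside a quoted field, i.e. it
     * contains an odd number of double-quote characters. A doubled quote
     * ("") adds two to the count, so it never changes the parity.
     */
    static boolean endsInsideQuote(String text)
    {
        boolean inQuote = false;

        for (int i = 0; i < text.length(); i++)
        {
            if (text.charAt(i) == '"')
            {
                inQuote = !inQuote;
            }
        }

        return inQuote;
    }

    public static void main(String[] args)
    {
        // The quoted field is still open at the end of this physical line.
        System.out.println(endsInsideQuote("a,\"first half"));

        // Every quote is closed here, including the doubled ones.
        System.out.println(endsInsideQuote("a,\"b \"\"c\"\"\",d"));
    }
}
```

A caller would keep appending physical lines (with the newline put back
between them) until endsInsideQuote() returns false, then pass the result
to parse().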

-- 
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

