Re: Having problem with SPLIT



Earl Grieda wrote:
I guess I misunderstand the "split" documentation, which says "split - Split
a string into a proper Tcl list".  Why, after the split, is the line "not
really a list of space separated words, it's really just a string"?
http://www.tcl.tk/man/tcl8.4/TclCmd/split.htm

Well, you wrote that you had to convert multiple spaces to a single space before splitting. That's why I said the original string wasn't really a list of words separated by a space.
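
To illustrate what I mean, here is a small sketch. The sample line and variable names are made up, not taken from your code. Splitting on a single space turns every extra space into an empty list element, and the regexp call is just one possible way to pick out the non-blank words without touching the original string:

```tcl
# Hypothetical sample: two spaces after HEADER, three after First.
set line "HEADER  First   Second"

# split treats every single space as a separator, so each extra
# space produces an empty element in the result.
set words [split $line " "]
puts [llength $words]      ;# 6

# One possible alternative: collect only the runs of non-whitespace,
# leaving $line itself unmodified.
set nonBlank [regexp -all -inline {\S+} $line]
puts [llength $nonBlank]   ;# 3
```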




That seems like a recipe for disaster.


No, it works great.

In that case, problem solved!
...
The problem started when I thought it would be nice to have HTML and/or tabs
in a header, so I tested it to see if it works.  But, "split" changes the
word "\tHeader" to {\tHeader},

That is an incorrect assessment. Believe it or not, split will *not* add any characters to the original data. Internally the header does not contain those curly braces. Split is not what is causing the {}'s to appear in your data.


The problem comes when you use this list as a string. When you do that (such as when you do a puts), Tcl automatically converts the list to a string. When Tcl converts a list to a string, it does so in such a fashion as to guarantee that the resultant string can be used to recreate the original list, precisely.

In the case of your data, Tcl must insert {}'s around the header to enforce this rule. Again, this only happens when you convert a list to a string; internally the list does *not* include these extra curly braces.

Thus, if you print your list or write it to a file, and one or more of the list elements has characters special to the Tcl parser, you'll end up with {}'s or \'s in the string representation of the list.
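
You can see the difference for yourself with something like the following. This is only a sketch: the sample line is invented, and I'm assuming your header contains a literal backslash followed by "t" rather than a real tab character. The element itself has no braces; they only show up when the whole list is printed as a string:

```tcl
# Braces here keep the backslash literal: the data is \ t T i t l e.
set line {header \tTitle}

set words [split $line " "]
set second [lindex $words 1]

# The element is exactly what was in the original string.
puts $second                   ;# \tTitle
puts [string length $second]   ;# 7

# Printing the list as a whole forces a list-to-string conversion,
# and that is where the braces come from.
puts $words                    ;# header {\tTitle}

# join builds a plain string from the elements, with no list quoting.
puts [join $words " "]         ;# header \tTitle
```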


which started this thread. The "split"
documentation  says:

"Extract the list words from a string that is not a well-formed list:

split "Example with {unbalanced brace character"
     => Example with \{unbalanced brace character
"

It doesn't mention doing anything special with chars that could be
interpreted as control characters, such as \t.  I did try using \\tHeader,
but that didn't make any difference.

It bears repeating: split does not do anything special with any special characters at all. Split does nothing extra to or because of the tab or sequence \t. The anomalies you see are a result of converting the list back into a string.




Since you're only interested in the first word at this point, how about
just pulling out the flag and leaving the rest of the data alone?

    # put in whatever pattern fits your actual data
    regexp {^([a-zA-Z0-9_]+)} $orgLine -- flag

Is there a specific reason you're converting a string to a list, other
than to make it easy to pick out the first word? It looks like the crux
of your problem is that program A is changing the nature of the data
before passing it to B. Thus, B gets a string representation of a list
representation of the original string, which is not the same as the
original string.



This is a communications protocol.  Currently I am using email as the
transport mechanism, but am slowly moving to sockets.  The first word
determines how the remainder of the line is treated.  Different flags result
in different operations.  In this case the header flag states that the rest
of the line is a header.



All the more reason *not* to use split on the data. Since the first word determines how the rest of the data is to be interpreted, what if the rest of the data must be preserved? Your code is changing the data because you convert multiple spaces to a single space, split on a single space, and then (apparently) convert the resulting list back to a string. Each of those steps alters your data.
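
If it helps, here is roughly how I would peel off the flag while leaving everything after it untouched. Treat it as a sketch only: the proc name splitFlag and the sample line are invented, and I'm assuming the flag ends at the first space in the line. Adjust to whatever your protocol really uses as a separator.

```tcl
# Hypothetical helper: returns a two-element list {flag rest}.
# Assumes the flag is everything before the first space.
proc splitFlag {line} {
    set idx [string first " " $line]
    if {$idx < 0} {
        # No space at all: the whole line is the flag.
        return [list $line ""]
    }
    set flag [string range $line 0 [expr {$idx - 1}]]
    # Everything after that one separator is kept byte for byte.
    set rest [string range $line [expr {$idx + 1}] end]
    return [list $flag $rest]
}

# Sample data with repeated spaces, HTML, and a literal \t.
set orgLine {header   <b>Title</b>  \tSubtitle}
set parts [splitFlag $orgLine]
set flag [lindex $parts 0]
set rest [lindex $parts 1]

puts $flag
puts $rest
```

The list returned by the proc is only ever taken apart with lindex, never printed as a whole, so no braces or backslashes get added to the data, and "$flag $rest" gives back the original line exactly.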

