Re: Will standard C++ allow me to replace a string in a unicode-encoded text file?

From: Eric Lilja (ericliljaNoSpam_at_yahoo.com)
Date: 02/22/05

  • Next message: Ron Natalie: "Re: try-catch name"
    Date: Tue, 22 Feb 2005 15:27:04 +0100
    
    

    "Eric Lilja" <ericliljaNoSpam@yahoo.com> wrote in message
    news:cvff9h$hii$1@news.island.liu.se...
    >
    > "Chris Croughton" wrote:
    >> On Tue, 22 Feb 2005 01:24:58 +0100, Eric Lilja
    >> <ericliljaNoSpam@yahoo.com> wrote:
    >>
    >>> Thanks for your reply, Jerry. The file starts with 0xFF 0xFE, so that
    >>> means UTF-16? I was thinking of opening it in binary mode, reading the
    >>> first two bytes, and then starting a loop that reads the file byte by
    >>> byte and adds the first, third, fifth byte, etc. to a std::string (or
    >>> maybe a std::vector of chars). When the loop is done I should have the
    >>> actual text of the file. Then I can look for the pattern I want and
    >>> replace it as needed. Then I will open the file for writing (still in
    >>> binary, of course) and write it out as UTF-16. Does this sound like it
    >>> should work?
    >>
    >> It's more likely to be UCS-2 (UTF-16 is an extension to UCS-2 which
    >> allows UCS-4 characters to be embedded in a UCS-2 stream). The Byte
    >> Order Mark is defined to be 0xFEFF, with the character 0xFFFE defined as
    >> invalid, so that the byte order (big/little endian) can be determined.
    >> In your case the order must be LSB MSB, so you want all even numbered
    >> bytes (assuming standard C array indices starting at zero), but you
    >> ought to check for a portable implementation.
    >>
    >> You really should check that the other bytes are zero, as well, and give
    >> some sort of error if not (it's a character not representable in a
    >> normal string, unless you're on an implementation with 16 bit or more
    >> bytes); at minimum I would either ignore such a character or convert it
    >> to an error character ('?' for instance, like my mailer does).
    >>
    >> Or you can do all of your work in UCS-2 (or UCS-4), and thus preserve
    >> any non-ASCII characters. This will be a bit slower as an
    >> implementation, but on modern machines still faster than the I/O.
    >>
    >> If you really want portability, look at interpreting UCS-4, UTF-8 and
    >> UTF-16 as well as UCS-2 (and plain old text), with both big- and
    >> little-endian representations, and write a generic routine which
    >> converts any of them to a string (note that a C++ string type can take
    >> wide characters or longs as its element type). But for your case you
    >> may only need to do one or two of the formats.
    >>
    >> For further reading, see:
    >>
    >> http://www.unicode.org/faq/
    >>
    >> (and its parent if you want to get into the spec.). Warning: if you're
    >> like me, you can waste (er, spend) many happy hours reading the spec.
    >> and forget to do the work <g>...
    >>
    >> Chris C
    >
    > Thanks for your replies, everyone. I wrote the following little test
    > program that I hope to get working for UCS-2 encoded files where all
    > characters are representable in ASCII (i.e., the second byte of every
    > character after the byte-order mark is \0). The program doesn't work
    > as expected, however: read_file reads the byte-order mark into the
    > contents variable, so when I write the new file (where I have replaced
    > some strings), I get the byte-order mark twice, although the second one
    > has padding. If you look at the file in a hex editor you see:
    > FF FE FF 00 FE 00. I can easily work around it, but I want to know why
    > read_file() is doing what it's doing.
    >
    > Here's the complete code:
    > #include <cstdlib>
    > #include <fstream>
    > #include <iostream>
    > #include <string>
    >
    > using std::cerr;
    > using std::cout;
    > using std::endl;
    > using std::exit;
    > using std::ifstream;
    > using std::ios_base;
    > using std::ofstream;
    > using std::string;
    >
    > static string read_file(const char *);
    > static void find_and_replace(string& s, const string&, const string&);
    > static void write_file(const char *, const string&);
    >
    > static const char padding = '\0';
    >
    > int
    > main()
    > {
    >     const string find_what = "foobar";
    >     const string replace_with = "abcdef";
    >
    >     string contents = read_file("testfile.txt");
    >
    >     find_and_replace(contents, find_what, replace_with);
    >
    >     write_file("outfile.txt", contents);
    >
    >     return EXIT_SUCCESS;
    > }
    >
    > static string
    > read_file(const char *filename)
    > {
    >     ifstream file(filename, ios_base::binary);
    >
    >     if(!file)
    >     {
    >         cerr << "Error: Failed to open " << filename << endl;
    >
    >         exit(EXIT_FAILURE);
    >     }
    >
    >     char c = '\0';
    >     string contents;
    >
    >     file.read(&c, sizeof(c));
    >     contents += c;
    >     file.read(&c, sizeof(c));
    >     contents += c;
    >
    >     if((unsigned char)contents[0] != 0xFF ||
    >        (unsigned char)contents[1] != 0xFE)
    >     {
    >         cerr << "Error: The file doesn't appear to be a unicode-file."
    >              << endl;
    >
    >         /* exit() does not run the destructors of local objects such
    >            as file; the file is closed when the process ends. */
    >         exit(EXIT_FAILURE);
    >     }
    >
    >     int count = 0;
    >
    >     while(file.read(&c, sizeof(c)))
    >     {
    >         if(!(count++ % 2))
    >             contents.push_back(c);
    >         else if(c != padding) /* padding is a static global that equals \0 */
    >         {
    >             cerr << "Error: Found a character that is too "
    >                  << "big to fit into a single byte." << endl;
    >
    >             /* As above, exit() skips the destructor of file. */
    >             exit(EXIT_FAILURE);
    >         }
    >     }
    >
    >     /* std::ifstream's destructor will close the file. */
    >     return contents;
    > }
    >
    > static void
    > find_and_replace(string& s, const string& find_what,
    >                  const string& replace_with)
    > {
    >     string::size_type start = 0;
    >     string::size_type offset = 0;
    >     string::size_type occurrences = 0;
    >
    >     while((start = s.find(find_what, offset)) != string::npos)
    >     {
    >         s.replace(start, find_what.length(), replace_with);
    >
    >         /* Resume searching after the inserted text. Otherwise, if
    >            replace_with contains find_what, the same spot could match
    >            over and over again and the loop would never end. */
    >         offset = start + replace_with.length();
    >
    >         ++occurrences;
    >     }
    >
    >     cout << "Replaced " << occurrences << " occurrences." << endl;
    > }
    >
    > static void
    > write_file(const char *filename, const string& contents)
    > {
    >     ofstream file(filename, ios_base::binary);
    >
    >     if(!file)
    >     {
    >         cerr << "Error: Failed to open " << filename << endl;
    >
    >         exit(EXIT_FAILURE);
    >     }
    >
    >     /* Cast explicitly: 0xFF and 0xFE do not fit in a signed char. */
    >     const char byte_order_mark[2] = { static_cast<char>(0xFF),
    >                                       static_cast<char>(0xFE) };
    >
    >     file.write(&byte_order_mark[0], sizeof(char));
    >     file.write(&byte_order_mark[1], sizeof(char));
    >
    >     for(string::size_type i = 0; i < contents.length(); ++i)
    >     {
    >         file.write(&contents[i], sizeof(char));
    >         file.write(&padding, sizeof(char));
    >     }
    > }
    >
    > Thanks for any replies
    >
    > / Eric
    >

    Lol, never mind! I saw that I was using the contents variable for reading the
    byte-order mark. I thought the reading position was being rewound somehow.
    Anyway, if you have any other comments on the code, please share them.

    / Eric
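
One way to avoid the doubled byte-order mark is to check the two BOM bytes without ever appending them to the decoded text. The following is only a minimal sketch under some assumptions: the whole file has already been read into a std::string of raw bytes, the data is little-endian, and every character has a zero high byte (which strictly speaking admits U+0000 through U+00FF, not just ASCII). The names decode_utf16le_narrow, encode_utf16le and replace_all are hypothetical helpers made up for this illustration; they are not part of the posted program.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode little-endian UTF-16 bytes into a narrow string.
// The BOM is validated but never stored in the result.
std::string decode_utf16le_narrow(const std::string& bytes)
{
    if (bytes.size() < 2 || bytes.size() % 2 != 0)
        throw std::runtime_error("truncated UTF-16 data");

    if (static_cast<unsigned char>(bytes[0]) != 0xFF ||
        static_cast<unsigned char>(bytes[1]) != 0xFE)
        throw std::runtime_error("missing little-endian byte-order mark");

    std::string text;
    for (std::string::size_type i = 2; i < bytes.size(); i += 2)
    {
        // The high byte must be zero for the character to fit in a char.
        if (bytes[i + 1] != '\0')
            throw std::runtime_error("character does not fit in one byte");
        text += bytes[i];
    }
    return text;
}

// Encode a narrow string as little-endian UTF-16 with a leading BOM.
std::string encode_utf16le(const std::string& text)
{
    std::string bytes;
    bytes.push_back(static_cast<char>(0xFF));
    bytes.push_back(static_cast<char>(0xFE));
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        bytes.push_back(text[i]); // low byte
        bytes.push_back('\0');    // high byte (padding)
    }
    return bytes;
}

// Replace every occurrence of from with to; returns the number replaced.
// The search resumes after the inserted text, so a replacement that
// contains the search string cannot be matched again.
std::size_t replace_all(std::string& s, const std::string& from,
                        const std::string& to)
{
    if (from.empty())
        return 0; // an empty pattern would match everywhere

    std::size_t count = 0;
    std::string::size_type pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos)
    {
        s.replace(pos, from.length(), to);
        pos += to.length();
        ++count;
    }
    return count;
}
```

Keeping decoding and encoding separate from file I/O means the round trip can be checked on in-memory strings before any file is touched; reading and writing the raw bytes in binary mode stays as in the posted program.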

