minidom xml & non ascii / unicode & files



lo all,

Some of the questions I ask below have almost certainly been discussed here before; I just hope someone is kind enough to answer them again and help me out.

I started a Python 2.3 script that grabs some pages from the web, parses the data with regexes, and stores it locally in an XML file for further use.

At first I had no problem using Python's minidom, and everything in my regex/XML processing worked fine, until I tested the tool on some French pages with non-ASCII characters and the script started to throw errors all over the place.

I've looked into the matter and discovered the unicode / byte string encoding steps involved in handling non-ASCII text, and I must say I almost lost my mind. I'm losing it, actually.

So here are the few questions I'd like answers to:

1. When fetching a web page from the net, how am I supposed to know how it is encoded? And can I decode it to unicode and then encode it back to a byte string in the charset I want, like utf-8, so I can use it in my code?
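To make question 1 concrete, here is a minimal sketch of the decode-then-re-encode round trip, written for a current Python interpreter rather than 2.3. The helper names charset_from_content_type and to_utf8 are made up for illustration, the sample bytes stand in for a fetched page, and falling back to iso-8859-1 when the Content-Type header names no charset is an assumption, not a rule every server follows.

```python
import re

def charset_from_content_type(content_type, default="iso-8859-1"):
    """Pull the charset parameter out of a Content-Type header value.

    The default is an assumption for pages that declare nothing.
    """
    match = re.search(r'charset="?([\w.:-]+)', content_type or "", re.IGNORECASE)
    if match:
        return match.group(1)
    return default

def to_utf8(raw_bytes, content_type):
    """Decode raw page bytes to unicode text, then re-encode them as UTF-8."""
    text = raw_bytes.decode(charset_from_content_type(content_type))
    return text, text.encode("utf-8")

# Stand-in for the body of a French page served as Latin-1 (hypothetical data).
raw = u"caf\u00e9 fran\u00e7ais".encode("iso-8859-1")
text, utf8_bytes = to_utf8(raw, "text/html; charset=ISO-8859-1")
```

The idea is to decode once, right after the fetch, work with unicode text inside the script, and encode again only when writing out.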

2. In the same vein, could anyone post the few lines that would actually parse an XML file containing non-ASCII characters with minidom (parseString, I guess),
then convert a string grabbed from the net so that parts of it can be inserted into that DOM object, in new or existing nodes,
and finally write the DOM object back to a file in a way that lets the same script read it again later?


I've been trying to do that for a few days with no luck.
I can do each separate part of the job (not that I'm quite sure how I decode/encode things in there), but as soon as I try to do everything together I get encoding errors all the time.
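The three steps of question 2 could look roughly like the sketch below. It targets a current Python interpreter, not 2.3, and everything in it is illustrative: the <pages>/<page> document, the "scraped" Latin-1 bytes, and the temporary file name are all invented for the example.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical document: UTF-8 bytes whose XML declaration says so.
xml_bytes = (u'<?xml version="1.0" encoding="utf-8"?>'
             u'<pages><page>d\u00e9j\u00e0 vu</page></pages>').encode("utf-8")

# The parser decodes the bytes according to the declaration.
dom = minidom.parseString(xml_bytes)

# Pretend these bytes came from the web as Latin-1; decode BEFORE touching the DOM.
scraped = u"\u00e9t\u00e9 fran\u00e7ais".encode("iso-8859-1")
new_text = scraped.decode("iso-8859-1")

# Insert the unicode text into a new node.
page = dom.createElement("page")
page.appendChild(dom.createTextNode(new_text))
dom.documentElement.appendChild(page)

# Serialise with an explicit encoding and write the bytes in binary mode.
path = os.path.join(tempfile.mkdtemp(), "pages.xml")
out = open(path, "wb")
out.write(dom.toxml(encoding="utf-8"))
out.close()

# Read the file back to confirm it is still usable.
reloaded = minidom.parse(path)
titles = [node.firstChild.data for node in reloaded.getElementsByTagName("page")]
```

The DOM only ever holds unicode text here; byte strings appear at the two edges, when parsing and when writing the file.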


3. To help me understand what's going on when encoding and decoding, could you please tell me whether, in the following example, s and backToBytes are actually the same thing?

s = "hello normal string"
u = unicode( s, "utf-8" )
backToBytes = u.encode( "utf-8" )

I know they are both byte strings, but I doubt they actually have the same content.
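One way to settle question 3 is to compare the two values directly. The sketch below assumes a current Python interpreter, where the byte string is written with a b prefix and s.decode("utf-8") plays the role of unicode(s, "utf-8"). The second half is an added illustration with made-up Latin-1 data, showing the case where the bytes do change.

```python
# Same round trip as above, spelled for a current interpreter.
s = b"hello normal string"
u = s.decode("utf-8")
back_to_bytes = u.encode("utf-8")
same_content = (s == back_to_bytes)  # decode and encode with the same codec

# The content only changes when the two codecs differ.
latin1_bytes = u"caf\u00e9".encode("iso-8859-1")              # one byte for the accented e
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")  # two bytes for it
changed = (latin1_bytes != utf8_bytes)
```

Decoding and encoding with the same codec gives back the original bytes, so for a pure ASCII string the two values compare equal.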

4. I've also tried to set Python's default encoding for my script using sys.setdefaultencoding('utf-8'), but it keeps telling me that the module does not have that method. That leaves me no choice but to edit site.py manually and change "ascii" to "utf-8", but I won't be able to do that on the client computers.
Anyway, I don't know if it would help my script at all.
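For question 4, the missing method can be checked from inside the script: in Python 2, site.py removes sys.setdefaultencoding during startup, and later versions have no such function at all. The sketch below, for a current interpreter, does that check and shows one possible alternative that needs no change on client machines, namely naming the codec explicitly wherever bytes become text. The as_text helper is hypothetical.

```python
import sys

# False in a normally started interpreter: the setter is not available to scripts.
has_setter = hasattr(sys, "setdefaultencoding")

# The default can still be read, just not changed from here.
default_codec = sys.getdefaultencoding()

def as_text(value, encoding="utf-8"):
    """Return unicode text, decoding byte strings with an explicit codec."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

label = as_text(b"caf\xc3\xa9")  # UTF-8 bytes in, unicode text out
```

With explicit codecs at every boundary, the interpreter's default encoding never comes into play.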


Any help will be greatly appreciated.
Thanks

Marc