Re: cleaning up an ASCII file?



Apologies, I figured there was some easy, obvious solution, since there is in BBedit. I will explain further...

John Machin wrote:
On Jun 11, 6:09 am, Nick Matzke <mat...@xxxxxxxxxxxx> wrote:
Hi all,

So I'm parsing an XML file returned from a database. However, the
database entries have occasional non-ASCII characters, and this is
crashing my parsers.

So fix your parsers. google("unicode"). Deleting stuff that you don't
understand is an "interesting" approach to academic research :-(

Not if it's just weird versions of dash characters, umlauted characters and the like, which is what I bet it is. Those sorts of things, and the apparent inability of lots of email readers and websites to deal with them, have been annoying me for years, so I tend to move straight towards genocidal tactics when I detect their presence.

(My database source is GBIF, they get museum specimen submissions from around the planet, there are zillions of records, I am just a user, so fixing it on their end is not a realistic option.)

Care to divulge what "crash" means? e.g. the full traceback and error
message, plus what version of python on what platform, what version of
ElementTree or other XML software you are using ...

All that is fine; the problem actually shows up when I try to print to the screen in IPython:

============
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 293: ordinal not in range(128)
============
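
For reference, a minimal sketch of what that error amounts to, assuming the string being printed contains u'\xe4' and the output stream only accepts ASCII (in Python 2, printing a unicode string to such a stream encodes it implicitly). The variable names are made up for illustration. Encoding explicitly with an error handler avoids the exception:

```python
# Hypothetical sample text; u'\xe4' is the a-umlaut from the error message.
text = u'Jyv\xe4skyl\xe4 University Museum'

# A strict ASCII encode fails on the first character with ordinal >= 128.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Explicit error handlers keep the encode from raising:
replaced = text.encode('ascii', 'replace')            # a-umlaut becomes '?'
ignored = text.encode('ascii', 'ignore')              # a-umlaut is dropped
as_refs = text.encode('ascii', 'xmlcharrefreplace')   # a-umlaut becomes '&#228;'
```

The 'xmlcharrefreplace' handler produces the same &#228; form that appears in the raw file, so nothing is lost; 'replace' and 'ignore' are lossy.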

Probably this is the line in the file which is causing problems (as displayed in BBedit):

======================
<gbif:statements>-

This document contains data shared through the GBIF Network - see http://data.gbif.org/ for more information.

All usage of these data must be in accordance with the GBIF Data Use Agreement - see http://www.gbif.org/DataProviders/Agreements/DUA

Please cite these data as follows:

Jyv&#228;skyl&#228; University Museum - The Section of Natural Sciences, Vascular plant collection of Jyvaskyla University Museum (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/462, 2009-06-11)
Missouri Botanical Garden, Missouri Botanical Garden (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/621, 2009-06-11)
Museo Nacional de Costa Rica, herbario (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/566, 2009-06-11)
National Science Museum, Japan, Kurashiki Museum of Natural History (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/599, 2009-06-11)
The Swedish Museum of Natural History (NRM), Herbarium of Oskarshamn (OHN) (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/1024, 2009-06-11)
Tiroler Landesmuseum Ferdinandeum, Tiroler Landesmuseum Ferdinandeum (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/1509, 2009-06-11)
UCD, Database Schema for UC Davis [Herbarium Labels] (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/734, 2009-06-11)

-
</gbif:statements>
======================


Presumably "Jyv&#228;skyl&#228; University Museum" is the problem, since there are umlauted a's in there. (Note, though, that I have thousands of records to parse, so there are going to be all kinds of other umlauted and accented characters in these sorts of search results.)
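
A small sketch of why the parsing itself succeeds: &#228; is a numeric character reference, and an XML parser turns it into the single character u'\xe4' in the parsed text. The one-element snippet below is a made-up stand-in for the real GBIF record (the gbif: namespace prefix is left out so that it parses on its own):

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for the real record.
snippet = '<statements>Jyv&#228;skyl&#228; University Museum</statements>'

elem = ET.fromstring(snippet)
name = elem.text

# The parser has already resolved each &#228; into one non-ASCII character,
# so the text is four characters shorter per reference than the raw source.
has_non_ascii = any(ord(ch) > 127 for ch in name)
```

So the non-ASCII characters only become a problem later, when the parsed text is pushed through an ASCII-only encode.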

So the goal is to replace the characters with un-umlauted versions or some such.
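
One possible way to do that, sketched with the standard unicodedata module. It assumes the text is already a unicode string; strip_accents and to_ascii are hypothetical helper names, not anything from an existing library:

```python
import unicodedata

def strip_accents(text):
    """Decompose accented letters, then drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_ascii(text):
    """Strip accents, then replace anything still non-ASCII with '?'."""
    return strip_accents(text).encode('ascii', 'replace').decode('ascii')

plain = strip_accents(u'Jyv\xe4skyl\xe4 University Museum')
dashed = to_ascii(u'Museo \u2013 herbario')   # en dash has no ASCII base form
```

This only handles characters that decompose into a base letter plus a combining mark. Characters with no such decomposition (the en dash above, or letters like o-slash) survive strip_accents unchanged and end up as '?' in to_ascii, so the approach is lossy.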

Cheers!
Nick


PS: versions I am using:
========
nick$ python -V
Python 2.5.2 |EPD Py25 4.1.30101|
========




Center for Theoretical Evolutionary Genomics

If your .sig evolves much more, it will consume all available
bandwidth in the known universe and then some ;-)

....it's easier to have a big sig than to try and remember all that stuff ;-)...




--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke@xxxxxxxxxxxx

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================


