Re: Trying to fix Invalid CSV File



On Aug 5, 6:56 am, Larry Bates <larry.ba...@xxxxxxxxxxx`> wrote:
Ryan Rosario wrote:
On Aug 4, 8:30 am, Emile van Sebille <em...@xxxxxxxx> wrote:
John Machin wrote:
On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@xxxxxxxxx> wrote:
On Aug 4, 1:01 am, John Machin <sjmac...@xxxxxxxxxxx> wrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@xxxxxxxxx> wrote:
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?
<snip>

Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).
Exactly -- just print out the results of the passed argument:

rec.replace(',"',",'''").replace('",',"''',").replace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. Presumably, you've got a consistent field
count expectation per record. Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. :)

Emile

Thanks for your responses. I think John may be right that I am reading
it a second time. I will take a look at the CSV reader documentation
and see if that helps. Then once I run it I can see if I need to worry
about the comma-next-to-quote issue.

This is a perfect demonstration of why tab delimited files are so much better
than comma and quote delimited.

No, it's a perfect demonstration of what happens when a protocol is
not followed.

Virtually all software can handle table
delimited as well as comma and quote delimited, but you would have none of these
problems if you had used tab delimited. The chances of tabs being embedded in
most data is virtually nil.


There may be no tabs in *your* data. There is no guarantee that there
are no tabs in a VARCHAR(n) column in somebody else's database. I've
seen all of \x00, \t, \n, \r and \x1a (Ctrl-Z (EOF in CP/M, *DOS and
Windows text files)).

The possibilities include (1) Don't check (2) check if '\t' in field
and raise an exception (3) silently remove tabs; what do you
recommend?
.