Re: text parsing
- From: Carolyn Marenger <cajunk@xxxxxxxxxxxx>
- Date: Wed, 23 Jan 2008 09:38:34 -0500
McKirahan wrote:
"Carolyn Marenger" <cajunk@xxxxxxxxxxxx> wrote in message
news:81d7b$47973b05$cf70133e$360@xxxxxxxxxxxx
McKirahan wrote:
"Carolyn Marenger" <cajunk@xxxxxxxxxxxx> wrote in message
news:7c0f$4795ea54$cf70133e$1079@xxxxxxxxxxxx
McKirahan wrote:
"Carolyn Marenger" <cajunk@xxxxxxxxxxxx> wrote in message
news:74fb1$479501d1$cf70133e$7458@xxxxxxxxxxxx
Can someone point me in the direction of some good documentation on text
parsing?
I want to take a bunch of text files (rtf), read them in and dump the
contents in a database. The files are effectively a flat file database,
with I suspect some fairly intricate programming needed to process into
data files. Unfortunately, they are laid out for human readability, not
conversion.
A few questions.
How many is a "bunch"?
What would the target database be -- MySQL?
What table and column structures do you envision?
Perhaps simply a single table with two columns:
filename (key) and a memo field containing the data?
What is the purpose behind doing this?
A few answers.
A bunch is about a dozen. Basically one large file that was broken into
sixteen subsets, following the initial letter for each record.
The target database would be MySQL.
I haven't looked too closely at the data, but I think one main table
with a few linked tables for those cases where there may be more than
one piece of data for a category. There are about 25 categories to
the record. Eventually there would be additional structure added around the
imported data, but that isn't relevant to importing the data itself. (I
will confirm this before beginning to code.)
The purpose: I am a D&D fan and I run games. I would like to be able to
reference the material and automate much of the process so I don't have
to lug and reference 20 lbs of books.
Any chance the RTF files are online so I could look at them?
Perhaps http://www.wizards.com/default.asp?x=d20/article/srd35?
http://www.wizards.com/d20/files/v35/SRD.zip contains 88 RTF files.
Also, I gather, this might be a one-time effort; correct?
Not what you requested but ...
I've developed a VBScript solution that takes the following approach:
for a given folder, each RTF file is opened in MS-Word and saved
as a text file which is opened and read then saved in an MS-Access
database table containing 3 columns: id (AutoNumber), file, data.
Using those 86 RTF files it created a 10MB MS-Access database.
Yes, they are online. Yes, you can look at them. Yes, those are the
files except I only care about the 16 monster files. Yes, this is a one
time effort.
My goal is to create an encounter generation program - where I key in
climate, geography, season, encounter level, time of day, proximity to
civilization, and the application gives me a suggested random encounter
suited to the scenario. For example, if the party was wandering around
the city sewers on a hot summer night, they might encounter a pack of
giant rats being led by a wererat. I would then want the program to
determine how many rats, how many hit points each, and any other
pertinent variable data, including what weapons and treasure the wererat
was carrying and using.
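As a rough illustration of the "how many rats, how many hit points each" step, here is a minimal Python sketch. It assumes hit dice are stored as strings in the form used by the files (for example "1d10+1"). The names roll and generate_group, and the group-size range passed in, are made up for the example and not taken from any rulebook.

```python
import random
import re

# Matches the leading dice expression, e.g. "1d10+1" or "15d8+78 (145 hp)".
# The +/- modifier is optional; anything after it is ignored.
DICE_RE = re.compile(r"^\s*(\d+)d(\d+)\s*([+-]\s*\d+)?")


def roll(expr, rng=random):
    """Roll a dice expression such as '1d10+1' and return the total."""
    m = DICE_RE.match(expr)
    if m is None:
        raise ValueError(f"not a dice expression: {expr!r}")
    count, sides = int(m.group(1)), int(m.group(2))
    modifier = int(m.group(3).replace(" ", "")) if m.group(3) else 0
    return sum(rng.randint(1, sides) for _ in range(count)) + modifier


def generate_group(name, hit_dice, min_count, max_count, rng=random):
    """Pick a group size (hypothetical range) and roll hit points for each creature."""
    count = rng.randint(min_count, max_count)
    # Clamp to at least 1 hp in case a negative modifier drives the total to zero.
    return [{"name": name, "hp": max(1, roll(hit_dice, rng))} for _ in range(count)]


if __name__ == "__main__":
    # The hit dice and group size here are placeholders, not real monster stats.
    for creature in generate_group("giant rat", "1d8+1", 2, 5):
        print(creature)
```

Passing a seeded random.Random instance as rng makes the results repeatable, which helps when testing. Weapons and treasure would need their own lookup tables and are not covered here.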
Having the rtfs loaded into a database, as your script does, would
enable faster searches, but it would not go the next step and perform the
various calculations based on the results of the searches. It is a good
start, but if it has stripped any of the rtf encoding, it may make it
harder to have a script find the various 'fields'.
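For comparison, the load-and-search step described above could be sketched in Python roughly as follows. This is not the VBScript solution itself: it assumes the RTF files have already been saved as .txt, and it uses SQLite (via the standard sqlite3 module) in place of MS-Access or MySQL. The table name documents and the function names are assumptions made for the example.

```python
import sqlite3
from pathlib import Path


def load_text_files(folder, db_path=":memory:"):
    """Load every .txt file in `folder` into a table with id, file and data columns."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, file TEXT, data TEXT)"
    )
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps the import going if a file has odd characters.
        text = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT INTO documents (file, data) VALUES (?, ?)", (path.name, text)
        )
    conn.commit()
    return conn


def search(conn, term):
    """Return the names of files whose text contains `term`."""
    rows = conn.execute(
        "SELECT file FROM documents WHERE data LIKE ? ORDER BY file",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

A search like this only reports which file mentions a term. Pulling out individual fields for calculations still needs a separate parsing step.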
Thanks, Carolyn
I counted 17 "Monster" prefixed files.
My version creates ".txt" files which do strip "the rtf encoding".
An alternative version creates ".htm" files which retain the
formatting you want; I don't think you really want all of the
"rtf encoding" unless you fully understand the specification:
(search on "rtf specification".)
Perhaps, as an intermediate step, you would like all of the
"Monster" rtfs converted to HTML and made available via
an interface to open one or more for viewing.
As HTML files they consume 7.5MB.
There are a couple of the monster prefixed files that are not listings of monsters but other information, such as monsters as characters. Anyway, the exact number of files is not overly important.
I just did a little test, and looking at the files, I think the easiest to work with may indeed be the text file.
Here is an example to illustrate: I am pulling the monster name, type and hit dice from each file format.
in rtf...
{
\par }{\fs36
\par DELVER
\par }\trowd \trgaph108\trleft-108\trbrdrh\brdrs\brdrw10 \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3 \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1 \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth3\clwWidth4871
\cellx6840\pard \ql \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20 }{\b\fs19 \cell }{\fs20 Huge Aberration}{\fs19 \cell }\pard \ql \li0\ri0\widctlpar\intbl\aspalpha\aspnum\faauto\adjustright\rin0\lin0 {\fs19 \trowd \trgaph108\trleft-108\trbrdrh
\brdrs\brdrw10 \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3 \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1 \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\row }\trowd
\trgaph108\trleft-108\trbrdrh\brdrs\brdrw10 \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3 \clvertalt\clbrdrt\brdrs\brdrw10 \clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1 \cellx1969\clvertalt\clbrdrt\brdrs\brdrw10 \clbrdrb\brdrs\brdrw10
\cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\pard \ql \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20 Hit Dice:}{\b\fs19 \cell }{\fs20 15d8+78 (145 hp)}{\fs19 \cell }\pard \ql
----------
in .html...
<P STYLE="page-break-after: avoid"><FONT SIZE=5>DARKMANTLE</FONT></P>
<TABLE WIDTH=410 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=7 CELLSPACING=0 FRAME=VOID RULES=ROWS>
<COL WIDTH=124>
<COL WIDTH=258>
<TR VALIGN=TOP>
<TD WIDTH=124>
<P CLASS="western">
</P>
</TD>
<TD WIDTH=258>
<P CLASS="western"><FONT SIZE=2>Small Magical Beast</FONT></P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=124>
<P CLASS="western"><FONT SIZE=2><B>Hit Dice:</B></FONT></P>
</TD>
<TD WIDTH=258>
<P CLASS="western"><FONT SIZE=2>1d10+1 (6 hp)</FONT></P>
</TD>
</TR>
---------
in .txt...
DARKMANTLE
Small Magical Beast
Hit Dice:
1d10+1 (6 hp)
--------
So, looking at that and assuming the rest will be similar, the text version looks the easiest to deal with. If document styling such as 'title', 'heading' and 'subheading' had been used, maybe not, but in this case, a new line seems to denote either a field heading or field data. There are exceptions of course - particularly when denoting a category of monster.
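To make the "heading line, then data line" idea concrete, here is a minimal Python sketch based only on the four-line .txt sample above. It assumes the first line is the monster name, the second is the type, and any later line ending in a colon is a field heading whose value is on the next line. The function names and dictionary keys are invented for the example, and the exceptions mentioned (such as category headings) are not handled.

```python
import re

# Matches values like "1d10+1 (6 hp)" or "15d8+78 (145 hp)"; the modifier is optional.
HIT_DICE_RE = re.compile(r"(\d+)d(\d+)([+-]\d+)?\s*\((\d+) hp\)")


def parse_record(lines):
    """Turn the lines of one monster entry into a dict of field name -> value."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a name line and a type line")
    record = {"name": lines[0].title(), "type": lines[1]}
    i = 2
    while i < len(lines):
        if lines[i].endswith(":") and i + 1 < len(lines):
            # Heading line such as "Hit Dice:" followed by its data line.
            record[lines[i][:-1]] = lines[i + 1]
            i += 2
        else:
            i += 1  # Unrecognised line; a fuller parser would handle these cases.
    return record


def parse_hit_dice(value):
    """Split a hit dice value into its numeric parts, or return None if it does not match."""
    m = HIT_DICE_RE.search(value)
    if m is None:
        return None
    return {
        "count": int(m.group(1)),
        "sides": int(m.group(2)),
        "modifier": int(m.group(3) or 0),
        "listed_hp": int(m.group(4)),
    }


if __name__ == "__main__":
    sample = ["DARKMANTLE", "Small Magical Beast", "Hit Dice:", "1d10+1 (6 hp)"]
    record = parse_record(sample)
    print(record)
    print(parse_hit_dice(record["Hit Dice"]))
```

Each parsed record could then become one row in the main table, with the hit dice split into numeric columns so a program can roll them later.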
That does bring me a little closer to achieving my goal. Thanks for the assistance. :)
Carolyn