Re: finding strings in a text file help

From: Thomas Weidenfeller (nobody_at_ericsson.invalid)
Date: 11/05/04


Date: Fri, 05 Nov 2004 09:51:43 +0100

Chris Pike wrote:
> Hi there,
> i am a beginner at java and i am trying to implement a lexical
> analzyer that reads through a text file of code and determines between
> strings, digits and reserved words and then prints them out in order
> to a terminal window. i have managed to implement this

Thanks for providing your code. It indeed shows that you are a beginner.
I will point you to a few things you might want to read (in the API
documentation), to iron out a few things if you find the time to do it.

But first ...
> but i want to have it so when it finds a string
> it gets the whole string matches it against the reserved words array
> and prints out the word, or prints out the string of characters even
> if it doesn't match the array of reserved words. i am not quite sure
> how to go about this.

The first thing to do is to collect characters as long as you get them,
so you get the complete word into memory. You currently always just hold
one character of a potential word in your "s" string.

You will have to write code that effectively does this:

   loop:
    read next character
    if character is a letter then append character to string, goto loop:
    if not {
       unread the character // or do other tests
       if the string is not empty check if a reserved word
    }
    clear the string
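
A minimal Java sketch of that loop, under a few assumptions of mine: the class name WordCollector and the method collectWords are made up for illustration, a StringReader stands in for your file, and PushbackReader is just one way to "unread" a character.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class WordCollector {

    // Returns every run of letters found in the text, in order of appearance.
    static List<String> collectWords(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        // A StringReader stands in for the file here; any Reader would do.
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isLetter((char) c)) {
                    word.append((char) c);      // still inside a word
                    continue;
                }
                if (word.length() > 0) {
                    in.unread(c);               // give the character back for later tests
                    words.add(word.toString()); // the word is complete
                    word.setLength(0);          // clear the string
                }
                // Otherwise c is a digit, symbol or space: other tests would go here.
            }
            if (word.length() > 0) {
                words.add(word.toString());     // input ended in the middle of a word
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(collectWords("int x1 = print;")); // prints [int, x, print]
    }
}
```

Each collected word can then be checked against the reserved words before it is printed.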

The second thing to do is to compare the found word with the list of
reserved words. You currently have a simple loop. This is ok only if
there are very few reserved words. If there are many, look up the term
binary search in your textbook, and have a look at the methods of the
Arrays class. You could also check out the different implementations of
the Set interface.
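
Both lookups might look roughly like this. The class and method names are hypothetical. Note that Arrays.binarySearch() only works on a sorted array, and in the natural String ordering the capital "S" of "String" sorts before the lowercase words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ReservedWords {

    // Sorted in natural String order, as Arrays.binarySearch() requires.
    private static final String[] SORTED = {"String", "boolean", "int", "print"};

    // The same words in a Set, for a lookup that needs no ordering.
    private static final Set<String> WORDS = new HashSet<>(Arrays.asList(SORTED));

    static boolean isReservedByBinarySearch(String word) {
        // binarySearch returns a negative value when the key is absent.
        return Arrays.binarySearch(SORTED, word) >= 0;
    }

    static boolean isReservedBySet(String word) {
        return WORDS.contains(word);
    }
}
```

With only four words either variant is overkill, but both keep working unchanged when the list grows.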

> public class LexicalAnalyzer {
> // instance variables
> String[] digits = {"0","1","2","3","4","5","6","7","8","9"};

See Character.isDigit() and similar methods in the Character class.
Also, a char[] array would do.

> String[] letters =
> {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};

See the Character class again. And again, if you want to do it this way,
a simple char[] array would do.
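
As a rough sketch of what the Character class gives you (the class CharKind and the labels it returns are my own invention, not anything from your code):

```java
public class CharKind {

    // Classifies a single character using the tests built into Character.
    static String classify(char c) {
        if (Character.isDigit(c)) {
            return "digit";
        }
        if (Character.isLetter(c)) {
            return "letter";   // covers upper and lower case
        }
        if (Character.isWhitespace(c)) {
            return "space";
        }
        return "other";        // symbols would be tested here
    }
}
```

For example, classify('7') yields "digit" and classify('S') yields "letter", with no lookup arrays at all.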

> String[] symbols =
> {"'",";","{","}","*","(","=","+",")","[","]",":","#","<",">",".","/",","};

Again, a char[] array ...

> String[] reservedWords = {"int","print","String","boolean"};

Yes, these are indeed Strings. Also consider the advice and read a
little bit about binary search or sets.

BTW: With your current way of reading/comparing and your current
letters[] array, which holds only lowercase letters, you will have
difficulty identifying the "String" keyword with its capital S.

> String[] space = {" "};

A char would do.

> String s = null;
>
> public void main(String args[]) {
>
> FileInputStream fileInput;
> InputStreamReader inputStream;

Misleading variable name. This should be inputReader: InputStreamReader
is a subclass of Reader, not of InputStream. But also see below.

> char c;
>
> try {
> fileInput = new FileInputStream("testcompiler.dl0.txt");
> inputStream = new InputStreamReader(fileInput);

Read the documentation of FileReader and BufferedReader.
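
Roughly, the two classes combine like this. SourceFile and firstLine are hypothetical names, and reading just the first line is only there to give the sketch something to do.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SourceFile {

    // Returns the first line of the named file, or null if the file is empty.
    static String firstLine(String fileName) throws IOException {
        // FileReader replaces the FileInputStream + InputStreamReader pair;
        // BufferedReader adds buffering and readLine() on top of it.
        BufferedReader reader = new BufferedReader(new FileReader(fileName));
        try {
            return reader.readLine();
        } finally {
            reader.close();
        }
    }
}
```

In your program the call would be something like firstLine("testcompiler.dl0.txt").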

>
> for(int i = 0; i < 233; i++) {
> c = (char) inputStream.read();

Uhhh, and what if your input is just 232 chars? Please study in detail
the documentation of the read() method. There is a reason why it returns
an int, and not a char. Use that additional information returned by read().
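
The extra information is the value -1, which read() returns at the end of the input. A small sketch that uses it instead of a fixed count (ReadUntilEof and readAll are names I made up):

```java
import java.io.IOException;
import java.io.Reader;

public class ReadUntilEof {

    // Reads characters until read() signals end of input with -1.
    static String readAll(Reader reader) throws IOException {
        StringBuilder seen = new StringBuilder();
        int next;                          // int, so that -1 can be told apart from a char
        while ((next = reader.read()) != -1) {
            char c = (char) next;          // safe to cast only after the -1 check
            seen.append(c);
        }
        return seen.toString();
    }
}
```

The loop handles 232 characters, 233 characters or an empty file alike.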

> s = String.valueOf(c);

Doesn't gain you anything. As a rule of thumb, you usually do most work
in a scanner (lexical analyser) with characters, not with strings. And
since you always just have one char in that string, it doesn't make much
sense to have the string. You will need a separate String (or
StringBuffer or StringBuilder) to collect all the chars making up a
word, but all the input can be handled with chars.

> for(int j = 0; j < 10; j++) {
> if (s.equals (digits[j]))
> System.out.println("Character " + (i+1) +
> " is digit " + c);
> }

See the Character class. Also, one should not hard-code the length of an
array. digits.length would be much better than hard-coding the 10.
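
If you do keep your own table, a sketch with a char[] and its length field could look like this (DigitTable is a hypothetical name):

```java
public class DigitTable {

    static final char[] DIGITS = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};

    // Looks the character up in the table; the bound comes from the array itself.
    static boolean isDigit(char c) {
        for (int j = 0; j < DIGITS.length; j++) {
            if (c == DIGITS[j]) {
                return true;
            }
        }
        return false;
    }
}
```

Adding or removing an entry then never requires touching the loop.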

> for(int j = 0; j < 18; j++) {
> if (s.equals (symbols[j]))
> System.out.println("Character " + (i+1) +
> " is symbol " + c);
> }

See above.

> for(int j = 0; j < 26; j++) {
> if (s.equals (letters[j]))
> System.out.println("Character " + (i+1) +
> " is letter " + c);
> }

See above.

> for(int j = 0; j < 4; j++) {
> if (s.equals (reservedWords[j]))
> System.out.println("Character " + (i+1) +
> " is letter " + c);
> }

See the initial discussion.

> }
> fileInput.close();

(a) Close what you call inputStream (the outer Reader). Closing it will
also close the underlying fileInput.

(b) Do this in the finally block of your try.
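
A sketch of both points together. CloseInFinally and countLetters are made-up names, and counting letters is only a placeholder for the real work.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class CloseInFinally {

    // Counts the letters in the input and closes the reader however the loop ends.
    static int countLetters(Reader source) throws IOException {
        Reader reader = new BufferedReader(source);   // the outer Reader
        try {
            int letters = 0;
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetter((char) c)) {
                    letters++;
                }
            }
            return letters;
        } finally {
            reader.close();   // runs on success and on failure; also closes source
        }
    }
}
```

On Java 7 and later, a try-with-resources statement does the same closing for you.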

> }
> catch (Exception e) {// in case the IO goes wrong
> }

(a) Don't discard the exception. At least print it, so you get some
information about what is going wrong.

(b) Catch a less general exception (IOException). As a rule of thumb,
always catch the most specific matching exception, not the general one.

> }
> }
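
Putting (a) and (b) together might look like the sketch below. ReportFailure and canRead are hypothetical names, and printing to System.err is only the bare minimum of error handling.

```java
import java.io.FileReader;
import java.io.IOException;

public class ReportFailure {

    // Tries to read one character from the file; reports any I/O failure.
    static boolean canRead(String fileName) {
        try {
            FileReader reader = new FileReader(fileName);
            try {
                reader.read();
            } finally {
                reader.close();
            }
            return true;
        } catch (IOException e) {
            // Catch only what the I/O calls can throw, and say what went wrong.
            System.err.println("Could not read " + fileName + ": " + e);
            return false;
        }
    }
}
```

A missing file now produces a message naming the file instead of failing silently.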

/Thomas


