Topic: ScriptBasic Word Count Code Challenge  (Read 1522 times)
JRS
Moderator
*****
Posts: 689


WWW Email
« on: May 06, 2009, 10:28:48 PM »

I thought I would post my prototype code that seems to be working with my testing text file.

fox.txt
Code:
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.
The quick brown fox jumped over the lazy dog's back (0123456789) times.

wc.sb
Code:
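' Record the start time; NOW counts in whole seconds.
start_time = NOW

INCLUDE t.bas

' Load the whole file into one string.
text = t::loadString("fox.txt")

' Characters to treat as word separators.
strip = "()0123456789." & CHR(10) & CHR(13)

' Replace each separator character with a space.
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

' Write one lowercase word per line, skipping empty array elements.
OPEN "wc.raw" FOR OUTPUT AS #1

FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x

CLOSE #1

' Shell out to the Windows sort command so duplicates end up adjacent.
ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)

OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3

last_word = ""
word_count = 0
word_total = 0

' Count runs of identical words in the sorted file.
Next_Word:

IF EOF(2) THEN GOTO Done

LINE INPUT #2, this_word

word_total += 1

this_word = CHOMP(this_word)

IF last_word = "" THEN last_word = this_word

IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF

' The word changed: write the finished word and its count.
PRINT #3, last_word & " (" & word_count & ")\n"

last_word = this_word
word_count = 1

GOTO Next_Word

Done:

' word_total is one too high here (the sample output reports 110 for a
' 110-word file), apparently because one empty read is counted at end of file.
PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"

CLOSE #2
CLOSE #3

END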

wc.lst
Code:
back (10)
brown (10)
dog's (10)
fox (10)
jumped (10)
lazy (10)
over (10)
quick (10)
the (20)
times (10)

110 words in 0 seconds.

Done in a fraction of a second.
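For readers who don't have ScriptBasic installed, the same sort-then-count-runs idea can be sketched in Python. This is only an illustration of the technique, not a port of wc.sb: the function names are made up for the example, a regular expression stands in for the strip/SPLITA steps, and Python's built-in sort stands in for the external sort command.

```python
import re

def words_from(text):
    # Lowercase the text and keep runs of letters and apostrophes; digits,
    # brackets, periods and line ends all act as separators.
    return re.findall(r"[a-z']+", text.lower())

def count_sorted_runs(words):
    # Sort first, then emit (word, count) each time the word changes,
    # like the last_word / this_word comparison in the prototype.
    result = []
    last_word, run = None, 0
    for word in sorted(words):
        if word == last_word:
            run += 1
            continue
        if last_word is not None:
            result.append((last_word, run))
        last_word, run = word, 1
    if last_word is not None:
        result.append((last_word, run))  # flush the final run
    return result

line = "The quick brown fox jumped over the lazy dog's back (0123456789) times.\n"
counts = count_sorted_runs(words_from(line * 10))
for word, n in counts:
    print(f"{word} ({n})")
print(sum(n for _, n in counts), "words")
```

On the ten-line fox.txt sample this gives the same ten entries as wc.lst above, with "the" counted 20 times and 110 words in total.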

Next Post: Testing against the code challenge reference file.
JRS
« Reply #1 on: May 07, 2009, 03:28:03 PM »

30091 words in 2 seconds.

Word List - (wc.lst)

William Shakespeare - The Tragedy of King Lear - (lear.txt)

Code:
start_time = NOW

INCLUDE t.bas

text = t::LoadString("lear.txt")

strip = "()[]<>/'@0123456789*.,;:!#?%+-=_~\"\\" & CHR(10) & CHR(13)

FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

OPEN "wc.raw" FOR OUTPUT AS #1

FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x

CLOSE #1

ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)

OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3

last_word = ""
word_count = 0
word_total = 0

Next_Word:

IF EOF(2) THEN GOTO Done

LINE INPUT #2, this_word

this_word = CHOMP(this_word)

IF this_word = "s" THEN GOTO Next_Word

word_total += 1

IF last_word = "" THEN last_word = this_word

IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF

PRINT #3, last_word & " (" & word_count & ")\n"

last_word = this_word
word_count = 1

GOTO Next_Word 

Done:


PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"

CLOSE #2
CLOSE #3

END

Note: The only changes to the prototype were additional characters to the strip string, skipping orphaned s's and changing the name of the file to load.
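The orphaned s's come from treating the apostrophe as a separator. A small Python sketch (the helper name and sample sentence are made up for illustration) shows the effect and the filter:

```python
def split_words(text, strip_chars):
    # Replace each strip character with a space, then split on whitespace.
    for ch in strip_chars:
        text = text.replace(ch, " ")
    return text.lower().split()

sample = "The lazy dog's back."

# With the apostrophe in the strip string, "dog's" falls apart into "dog" and "s".
stripped = split_words(sample, "'.")
print(stripped)    # ['the', 'lazy', 'dog', 's', 'back']

# Skipping the orphaned "s" tokens, as the Lear version does.
kept = [w for w in stripped if w != "s"]
print(kept)        # ['the', 'lazy', 'dog', 'back']

# Leaving the apostrophe out of the strip string keeps the word whole.
print(split_words(sample, "."))    # ['the', 'lazy', "dog's", 'back']
```

Note that the filter turns "dog's" into a count for "dog", so possessives and plain nouns are merged in the totals.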

FYI: Google has about 18,990,000,000 indexes to the word "the".
« Last Edit: May 08, 2009, 02:03:03 AM by John Spikowski »
JRS
« Reply #2 on: May 08, 2009, 02:38:44 PM »

I decided to remove the single quote character and hyphen from the strip string to preserve the words as they were intended. It adds a few oddities to the front of the list, but that is worth it to keep the other words intact. The program now requires you to enter the name of the text file you wish to word count on the command line. I attached a standalone version (wc.exe) you can use on any text file you wish. This is my submission for the word count code challenge.

Usage:  wc filename.txt

The Unrealized Potential of DNA Testing.  -  Word List

William Shakespeare - The Tragedy of King Lear  -  Word List

Code:
fname = COMMAND()

IF TRIM(fname) = "" THEN
  PRINT """
ScriptBasic Word Count Program
Usage: wc.exe filename
   
"""
  END
END IF   

start_time = NOW

fsize = FILELEN(fname)
OPEN fname FOR INPUT AS #1
text = INPUT(fsize, #1)
CLOSE #1

strip = "()[]<>/@0123456789*.,;:!#?%$+=_~\"\\" & CHR(10) & CHR(13)

FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

OPEN "wc.raw" FOR OUTPUT AS #1

FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x

CLOSE #1

ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)

OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3

last_word = ""
word_count = 0
word_total = 0

Next_Word:

IF EOF(2) THEN GOTO Done

LINE INPUT #2, this_word

this_word = CHOMP(this_word)

word_total += 1

IF last_word = "" THEN last_word = this_word

IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF

PRINT #3, last_word & " (" & word_count & ")\n"

last_word = this_word
word_count = 1

GOTO Next_Word 

Done:

PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"

CLOSE #2
CLOSE #3

DELETE "wc.raw"
DELETE "wc.srt"

END

Note: A newer version of the wc.exe program is in a later post and adds a rank list feature.

* wc.zip (177.8 KB - downloaded 28 times.)
« Last Edit: May 10, 2009, 09:47:58 PM by John Spikowski »
JRS
« Reply #3 on: May 09, 2009, 01:10:32 AM »

The Bible in text format. (4.19 MB, 804895 words in 33 seconds.)

Bible Word File


FYI: ScriptBasic created a 1,067,594 element array of words/non-delimiter spaces with the SPLITA function for the Bible translation.
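The gap between the 1,067,594 array elements and the 804,895 words is what you would expect if every extra space between words becomes an empty array element. Python's split on a single space behaves that way, so it makes a convenient illustration (this assumes SPLITA treats consecutive delimiters the same way, which is what the element count above suggests):

```python
text = "In the  beginning   God"

# Splitting on a single space yields an empty string for each extra space.
parts = text.split(" ")
print(parts)         # ['In', 'the', '', 'beginning', '', '', 'God']
print(len(parts))    # 7 elements for 4 words

# Dropping the empty elements leaves only the words, which is the job of
# the LEN(text_out) test in the output loop.
words = [p for p in parts if p]
print(words)         # ['In', 'the', 'beginning', 'God']
```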

I'm curious how the other Basic languages are going to approach this code challenge. I feel lucky to have SPLITA and REPLACE as built-in, C-based functions in ScriptBasic.

Issues:

  • Word array size variable based on text document (if arrays are used for word storage)
  • Extraction of words
  • Sorting / counting words


 
« Last Edit: May 13, 2009, 11:37:26 PM by John Spikowski »
JRS
« Reply #4 on: May 10, 2009, 12:17:12 AM »

I added a couple more lines of code to provide a word rank list as well as the alphabetical word list. I was going to use the sort reverse order option to put the most used words first, but it also reversed the alphabetical order of words with the same frequency count.
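The tie-ordering problem goes away when the sort key can combine a descending count with an ascending word. A minimal Python sketch of that idea (not something the external sort command is shown doing here; the sample counts are made up):

```python
counts = {"the": 20, "fox": 10, "back": 10, "quick": 10}

# Negating the count sorts the most used words first, while the word itself
# breaks ties in normal alphabetical order.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

for word, n in ranked:
    print(f"{n:6d}  {word}")
```

Here "the" comes first, followed by "back", "fox" and "quick" in alphabetical order.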

Bible word list by rank

Code:
fname = COMMAND()

IF TRIM(fname) = "" THEN
  PRINT """
ScriptBasic Word Count Program
Usage: wc.exe filename
   
"""
  END
END IF   

start_time = NOW

fsize = FILELEN(fname)
OPEN fname FOR INPUT AS #1
text = INPUT(fsize, #1)
CLOSE #1

strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)

FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

OPEN "wc.raw" FOR OUTPUT AS #1

FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x

CLOSE #1

ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)

OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3
OPEN "wc.tmp" FOR OUTPUT AS #4

last_word = ""
word_count = 0
word_total = 0

Next_Word:

IF EOF(2) THEN GOTO Done

LINE INPUT #2, this_word

this_word = CHOMP(this_word)

word_total += 1

IF last_word = "" THEN last_word = this_word

IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF

PRINT #3, last_word & " (" & word_count & ")\n"
PRINT #4, FORMAT("%6d", word_count), "  ", last_word & "\n"

last_word = this_word
word_count = 1

GOTO Next_Word 

Done:

PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"

CLOSE #2
CLOSE #3
CLOSE #4

' Do Rank Sort

ok = EXECUTE("sort wc.tmp /O wc.rnk", -1, PID)

DELETE "wc.raw"
DELETE "wc.srt"
DELETE "wc.tmp"

END

The attached wc.zip file contains the above source and a standalone wc.exe program. The following were added since the last zip offering.

  • Additional strip characters added. (seems to work on html files)
  • A word rank (wc.rnk) file is created along with the original alphabetical word list. (wc.lst)

804895 words in 55 seconds - (was 33 seconds) - Extra replace loops and parallel writes for the alphabetical and rank lists.


* wc.zip (178.79 KB - downloaded 30 times.)
« Last Edit: May 10, 2009, 10:31:37 PM by John Spikowski »
JRS
« Reply #5 on: May 17, 2009, 03:45:10 AM »

I'm not done yet.

I think I can shave some time off by using the ScriptBasic HASH extension module. This will allow me to go directly from the array created by SPLITA to a hash table with a count value pair.

I currently have an issue with the hash extension module not returning values. (It worked in the past.)

I'll post new code, word table and results here soon.

I sure hope the ProvideX team hasn't thrown in the towel.
JRS
« Reply #6 on: May 17, 2009, 01:21:48 PM »

I may have misunderstood the use of the HASH extension module.  It looks like the iteration is in entry order and not alphabetical key order.

Quote from: HASH Docs
The hash structure implemented in this module maintains a linked list that allows the programmer to iterate over all elements of the hash in the order they were entered into the hash in both directions. To maintain iteration state there is an iteration pointer for each hash. There can only be a single iteration over a hash at a time. The iteration pointer can be set to the start, to the end of the list and can be moved one element forward and backward. The key and value pairs can also be retrieved pointed by the iteration pointer. The ordering of the elements in the hash are guaranteed to follow the time order the pairs were entered into the hash. The first element returned by the iteration pointer is the pair entered first in to the hash.
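Python dictionaries happen to iterate in insertion order too, so they make a handy stand-in for showing what this means for a word count. The sketch below is Python, not the ScriptBasic HASH module API: counting goes straight from the word list into the table, but an explicit sort of the keys is still needed for an alphabetical list.

```python
def count_words(words):
    # One pass over the word list: each word is a key, its count the value.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

words = "the quick brown fox jumped over the lazy dog".split()
counts = count_words(words)

# Iteration follows the order in which the keys were first entered.
print(list(counts))
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']

# An alphabetical list therefore needs a separate sort of the keys.
for w in sorted(counts):
    print(f"{w} ({counts[w]})")
```

The sort here only touches the distinct words rather than every occurrence, so it works on a much smaller list than wc.raw.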

Looks like I'm going to have to live with 33 seconds unless I have a brain fart.

Observations

It takes 5 seconds to load the file and run a minimal replacement pass.
It takes 10 seconds to do the same with the current strip string.
It takes another 3 seconds to reach the point where the word array is built by the SPLITA function.

This means that, of the 33 seconds, I'm using the remaining 20 to do the following ...

  • Build the wc.raw file of words only (blank array elements stripped)
  • Shell out to sort this list to create wc.srt
  • Read through the wc.srt file counting duplicates and writing an entry to the alphabetical and rank files when the word changes
  • Shell out to sort the rank file.
  • Clean up the mess and end.
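A per-phase breakdown like the one above can be collected by wrapping each step in a timer. The Python sketch below is only a pattern, with made-up names and a toy input; the phases are simplified versions of the strip, split and sort steps rather than the real program.

```python
import time

def run_phases(text):
    timings = []

    def timed(label, func, *args):
        # Run one phase and record how long it took.
        start = time.perf_counter()
        result = func(*args)
        timings.append((label, time.perf_counter() - start))
        return result

    def strip(t):
        for ch in "()0123456789.\r\n":
            t = t.replace(ch, " ")
        return t

    def split(t):
        # Split on single spaces and drop the empty elements.
        return [w for w in t.lower().split(" ") if w]

    stripped = timed("strip", strip, text)
    words = timed("split", split, stripped)
    ordered = timed("sort", sorted, words)
    return ordered, timings

ordered, timings = run_phases("The fox. The dog.\n")
for label, seconds in timings:
    print(f"{label}: {seconds:.6f} s")
```

Timing each phase separately, instead of subtracting from the total, makes it easier to see which of the remaining steps is worth attacking first.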



« Last Edit: May 17, 2009, 09:51:39 PM by John Spikowski »
JRS
« Reply #7 on: May 20, 2009, 09:35:55 PM »

I thought I would give Berkeley DB a try and test the database interface while I was at it. It didn't give a quicker time, but the code might be of interest to some of the members here looking for a keyed file system.

804895 words in 492 seconds.

Bible Word Count File

Code:
start_time = NOW

INCLUDE t.bas
INCLUDE bdb.bas

text = t::loadString("bible.txt")

strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)

FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

dbh = bdb::Open("bible.db", bdb::BTree, bdb::Create, 0)

word_total = 0

FOR x = 0 TO UBOUND(word_list)
  this_word = LCASE(TRIM(word_list[x]))
  IF LEN(this_word) THEN
    word_total += 1   
    ON ERROR GOTO Update_Word
    bdb::Put(dbh, this_word, 1, bdb::NoOverWrite)
  END IF
  Updated:
NEXT x

OPEN "wc.lst" FOR OUTPUT AS #1

word_count = bdb::First(dbh, "")
GOTO First

Next_Word:

word_count = bdb::Next(dbh)
IF word_count = undef THEN GOTO Done
First:
word_key = bdb::Key(dbh)
PRINT #1, word_key & " (" & word_count & ")\n"
GOTO Next_Word

Done:

PRINTNL #1
PRINT #1, word_total, " words in ", NOW - start_time, " seconds.\n"

CLOSE #1
bdb::Close(dbh)

END

Update_Word:

count = bdb::Get(dbh, this_word)
count += 1
bdb::Update(dbh, count)
RESUME Updated

I'm thinking of making a side challenge out of this so we can see how different database options compare. I'm going to use the first part of my original program to create a text file of all the words in the Bible (not in order). Each participating Basic language can use this word list to build a database of words (keys) with their occurrence counts as the data.
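As a rough idea of what an entry could look like outside ScriptBasic, here is a Python sketch of the same insert-then-update-on-duplicate pattern. It swaps in SQLite (from Python's standard library) for Berkeley DB, and the table and column names are invented for the example.

```python
import sqlite3

def build_word_db(words):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE words (word TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    for w in words:
        try:
            # First sighting: store the word with a count of 1.
            db.execute("INSERT INTO words VALUES (?, 1)", (w,))
        except sqlite3.IntegrityError:
            # The key already exists: bump its count instead.
            db.execute("UPDATE words SET n = n + 1 WHERE word = ?", (w,))
    db.commit()
    return db

db = build_word_db(["the", "fox", "the", "dog's"])
rows = db.execute("SELECT word, n FROM words ORDER BY word").fetchall()
print(rows)    # [("dog's", 1), ('fox', 1), ('the', 2)]
```

The primary key plays the role of the no-overwrite put: a duplicate insert fails, and the error handler does the update.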

Code I used to create the Bible word file. (see attached)
Code:
INCLUDE t.bas

text = t::loadString("bible.txt")

strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)

FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i

SPLITA text BY " " TO word_list

OPEN "wc.raw" FOR OUTPUT AS #1

FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x

CLOSE #1

END

I'm planning on submitting the following examples.

  • Using flat files and Windows sort.exe
  • Berkeley DB
  • MySQL

* bible-word.zip (1162.22 KB - downloaded 26 times.)
« Last Edit: May 21, 2009, 12:27:48 AM by John Spikowski »
JRS
« Reply #8 on: May 21, 2009, 05:43:12 PM »

Here is the ScriptBasic word count database extended code challenge. This is the Berkeley DB version and a MySQL version will follow soon.

469 seconds

Bible Word List

Code:
start_time = NOW

INCLUDE bdb.bas

OPEN "bible-word.txt" FOR INPUT AS #1
OPEN "wc.lst" FOR OUTPUT AS #2
dbh = bdb::Open("bible.db", bdb::BTree, bdb::Create, 0)

Next_Word:

IF EOF(1) THEN GOTO Gen_List
LINE INPUT #1, this_word
word_total += 1
this_word = CHOMP(this_word)
ON ERROR GOTO Update_Word
bdb::Put(dbh, this_word, 1, bdb::NoOverWrite)
Updated:
GOTO Next_Word

Gen_List:

word_count = bdb::First(dbh, "")
GOTO First

Next_Rec:

word_count = bdb::Next(dbh)
IF word_count = undef THEN GOTO Done
First:
word_key = bdb::Key(dbh)
PRINT #2, word_key & " (" & word_count & ")\n"
GOTO Next_Rec

Done:

PRINTNL #2
PRINT #2, word_total, " words in ", NOW - start_time, " seconds.\n"

CLOSE #1
CLOSE #2
bdb::Close(dbh)

END

Update_Word:

count = bdb::Get(dbh, this_word)
count += 1
bdb::Update(dbh, count)
RESUME Updated

ScriptBasic Berkeley DB Information
« Last Edit: May 21, 2009, 05:59:18 PM by John Spikowski »
JRS
« Reply #9 on: May 23, 2009, 06:27:52 PM »

I gave the MySQL version a try, but after an hour or so with only 7,000 of the expected 13,000+ words completed, I exited the program. This was a great stress test of the MySQL interface but not the type of processing SQL was made for. After reviewing the database, I found the program did what it was supposed to.

Code:
start_time = NOW

INCLUDE mysql.bas

OPEN "bible-word.txt" FOR INPUT AS #1
OPEN "wcdb.lst" FOR OUTPUT AS #2
dbh = mysql::RealConnect("host","user","password","database")

Next_Word:

IF EOF(1) THEN GOTO Gen_List
LINE INPUT #1, this_word
word_total += 1
this_word = CHOMP(this_word)
s = 1
Next_SQ:
sqp = INSTR(this_word, "'", s)
IF sqp THEN
  this_word = LEFT(this_word, sqp) & "'" & MID(this_word, sqp + 1)
  s = sqp + 2
  GOTO Next_SQ
END IF
ON ERROR GOTO Update_Word
mysql::query(dbh, "INSERT INTO bible VALUES ('" & this_word & "', " & 1 & ")")
Updated:
GOTO Next_Word

Gen_List:

mysql::query(dbh, "SELECT * FROM bible")

WHILE mysql::FetchArray(dbh, row)
  PRINT #2, row[0] & " (" & row[1] & ")\n"
WEND

PRINTNL #2
PRINT #2, word_total, " words in ", NOW - start_time, " seconds.\n"

CLOSE #1
CLOSE #2
mysql::Close(dbh)

END

Update_Word:

mysql::query(dbh, "SELECT word_count FROM bible WHERE word_name = '" & this_word & "'")
mysql::FetchArray(dbh, row)
word_count = row[0] + 1
mysql::query(dbh, "UPDATE bible SET word_count = " & word_count & " WHERE word_name = '" & this_word & "'")
RESUME Updated

ScriptBasic MySQL Information
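A set-based alternative to the word-at-a-time INSERT/UPDATE loop is to load the raw words into a table and let the database do the counting with GROUP BY. The Python sketch below uses SQLite as a stand-in for MySQL, with made-up table names, and makes no claim about how the MySQL timing would change. Bound parameters also take care of apostrophes, so the manual quote doubling is not needed.

```python
import sqlite3

def count_with_group_by(words):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_words (word TEXT)")
    # Bound parameters carry words such as dog's without manual escaping.
    db.executemany("INSERT INTO raw_words VALUES (?)", [(w,) for w in words])
    db.commit()
    # One query groups identical words and counts each group.
    return db.execute(
        "SELECT word, COUNT(*) FROM raw_words GROUP BY word ORDER BY word"
    ).fetchall()

rows = count_with_group_by(["the", "dog's", "the", "fox"])
for word, n in rows:
    print(f"{word} ({n})")
```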
« Last Edit: May 23, 2009, 07:10:04 PM by John Spikowski »

All Basic Community