JRS
« on: May 06, 2009, 10:28:48 PM »
I thought I would post my prototype code that seems to be working with my testing text file.

fox.txt

The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times. The quick brown fox jumped over the lazy dog's back (0123456789) times.

wc.sb

start_time = NOW
INCLUDE t.bas
text = t::loadString("fox.txt")
strip = "()0123456789." & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
OPEN "wc.raw" FOR OUTPUT AS #1
FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x
CLOSE #1
ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)
OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3
last_word = ""
word_count = 0
word_total = 0
Next_Word:
IF EOF(2) THEN GOTO Done
LINE INPUT #2, this_word
word_total += 1
this_word = CHOMP(this_word)
IF last_word = "" THEN last_word = this_word
IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF
PRINT #3, last_word & " (" & word_count & ")\n"
last_word = this_word
word_count = 1
GOTO Next_Word
Done:
PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"
CLOSE #2
CLOSE #3
END

wc.lst

back (10)
brown (10)
dog's (10)
fox (10)
jumped (10)
lazy (10)
over (10)
quick (10)
the (20)
times (10)

110 words in 0 seconds.

Done in a fraction of a second. Next Post: Testing against the code challenge reference file.
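
For readers who don't have ScriptBasic handy, here is a rough Python sketch of the same idea: blank out the strip characters, split on spaces, lowercase, sort, then count runs of identical words. It is only an illustration, not the prototype itself. The sort is done in memory rather than by shelling out to sort.exe, and the name count_words is made up for this example.

```python
from itertools import groupby

STRIP = "()0123456789.\n\r"  # same characters as the prototype's strip string


def count_words(text, strip=STRIP):
    """Return a sorted list of (word, count) pairs for the given text."""
    # Replace every strip character with a space, like the REPLACE loop.
    table = str.maketrans({ch: " " for ch in strip})
    # Split on single spaces and drop the empty elements left by repeated spaces.
    words = [w.lower() for w in text.translate(table).split(" ") if w]
    # Sorting puts identical words next to each other so runs can be counted.
    words.sort()
    return [(word, sum(1 for _ in run)) for word, run in groupby(words)]


sample = "The quick brown fox jumped over the lazy dog's back (0123456789) times. " * 10
counts = count_words(sample)
total = sum(n for _, n in counts)
for word, n in counts:
    print(f"{word} ({n})")
print(total, "words")
```

The run counting with groupby plays the role of the last_word / this_word comparison in the ScriptBasic loop.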

JRS
« Reply #1 on: May 07, 2009, 03:28:03 PM »
30091 words in 2 seconds.

Word List - (wc.lst)
William Shakespeare - The Tragedy of King Lear - (lear.txt)

start_time = NOW
INCLUDE t.bas
text = t::LoadString("lear.txt")
strip = "()[]<>/'@0123456789*.,;:!#?%+-=_~\"\\" & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
OPEN "wc.raw" FOR OUTPUT AS #1
FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x
CLOSE #1
ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)
OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3
last_word = ""
word_count = 0
word_total = 0
Next_Word:
IF EOF(2) THEN GOTO Done
LINE INPUT #2, this_word
this_word = CHOMP(this_word)
IF this_word = "s" THEN GOTO Next_Word
word_total += 1
IF last_word = "" THEN last_word = this_word
IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF
PRINT #3, last_word & " (" & word_count & ")\n"
last_word = this_word
word_count = 1
GOTO Next_Word
Done:
PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"
CLOSE #2
CLOSE #3
END

Note: The only changes to the prototype were adding characters to the strip string, skipping orphaned s's, and changing the name of the file to load.

FYI: Google has about 18,990,000,000 index entries for the word "the".
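
To show where the orphaned s's come from, here is a small Python sketch (not ScriptBasic). The sample phrase and the helper name split_words are made up for illustration. When the single quote is in the strip string, a possessive splits into two tokens and the trailing "s" becomes a word of its own.

```python
def split_words(text, strip):
    """Blank out the strip characters, lowercase, and split on whitespace."""
    table = str.maketrans({ch: " " for ch in strip})
    return text.translate(table).lower().split()


line = "The fool's tale."  # hypothetical sample text

with_quote = split_words(line, "'.,")    # single quote is stripped
without_quote = split_words(line, ".,")  # single quote is kept

print(with_quote)     # ['the', 'fool', 's', 'tale']
print(without_quote)  # ['the', "fool's", 'tale']
```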
« Last Edit: May 08, 2009, 02:03:03 AM by John Spikowski »

JRS
« Reply #2 on: May 08, 2009, 02:38:44 PM »
I decided to remove the single quote character and hyphen from the strip string to preserve the words as they were intended. This adds a few oddities to the front of the list, but it is worth it to keep the other words intact. The program now requires you to enter the name of the text file you wish to count on the command line. I attached a standalone version (wc.exe) you can use on any text file you wish. This is my submission for the word count code challenge.

Usage: wc filename.txt

The Unrealized Potential of DNA Testing. - Word List
William Shakespeare - The Tragedy of King Lear - Word List

fname = COMMAND()
IF TRIM(fname) = "" THEN
  PRINT """
ScriptBasic Word Count Program

Usage: wc.exe filename
"""
  END
END IF
start_time = NOW
fsize = FILELEN(fname)
OPEN fname FOR INPUT AS #1
text = INPUT(fsize, #1)
CLOSE #1
strip = "()[]<>/@0123456789*.,;:!#?%$+=_~\"\\" & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
OPEN "wc.raw" FOR OUTPUT AS #1
FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x
CLOSE #1
ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)
OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3
last_word = ""
word_count = 0
word_total = 0
Next_Word:
IF EOF(2) THEN GOTO Done
LINE INPUT #2, this_word
this_word = CHOMP(this_word)
word_total += 1
IF last_word = "" THEN last_word = this_word
IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF
PRINT #3, last_word & " (" & word_count & ")\n"
last_word = this_word
word_count = 1
GOTO Next_Word
Done:
PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"
CLOSE #2
CLOSE #3
DELETE "wc.raw"
DELETE "wc.srt"
END

Note: A newer version of the wc.exe program, in a later post, adds a rank list feature.
wc.zip (177.8 KB - downloaded 28 times.)
« Last Edit: May 10, 2009, 09:47:58 PM by John Spikowski »

JRS
« Reply #3 on: May 09, 2009, 01:10:32 AM »
The Bible in text format. (4.19 MB, 804895 words in 33 seconds.)

Bible Word File

FYI: ScriptBasic created a 1,067,594 element array of words/non-delimiter spaces with the SPLITA function for the Bible translation. I'm curious how the other Basic languages are going to approach this code challenge. I feel lucky to have SPLITA and REPLACE as built-in C-based functions in ScriptBasic.

Issues:
- Word array size variable based on text document (if arrays are used for word storage)
- Extraction of words
- Sorting / counting words
« Last Edit: May 13, 2009, 11:37:26 PM by John Spikowski »

JRS
« Reply #4 on: May 10, 2009, 12:17:12 AM »
I added a couple more lines of code to provide a word rank list as well as the alphabetical word list. I was going to use sort's reverse-order option to put the most used words first, but it also reversed the alphabetical order of words with the same frequency count.

Bible word list by rank

fname = COMMAND()
IF TRIM(fname) = "" THEN
  PRINT """
ScriptBasic Word Count Program

Usage: wc.exe filename
"""
  END
END IF
start_time = NOW
fsize = FILELEN(fname)
OPEN fname FOR INPUT AS #1
text = INPUT(fsize, #1)
CLOSE #1
strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
OPEN "wc.raw" FOR OUTPUT AS #1
FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x
CLOSE #1
ok = EXECUTE("sort wc.raw /O wc.srt", -1, PID)
OPEN "wc.srt" FOR INPUT AS #2
OPEN "wc.lst" FOR OUTPUT AS #3
OPEN "wc.tmp" FOR OUTPUT AS #4
last_word = ""
word_count = 0
word_total = 0
Next_Word:
IF EOF(2) THEN GOTO Done
LINE INPUT #2, this_word
this_word = CHOMP(this_word)
word_total += 1
IF last_word = "" THEN last_word = this_word
IF this_word = last_word THEN
  word_count += 1
  GOTO Next_Word
END IF
PRINT #3, last_word & " (" & word_count & ")\n"
PRINT #4, FORMAT("%6d", word_count), " ", last_word & "\n"
last_word = this_word
word_count = 1
GOTO Next_Word
Done:
PRINTNL #3
PRINT #3, word_total - 1, " words in ", NOW - start_time, " seconds.\n"
CLOSE #2
CLOSE #3
CLOSE #4
' Do Rank Sort
ok = EXECUTE("sort wc.tmp /O wc.rnk", -1, PID)
DELETE "wc.raw"
DELETE "wc.srt"
DELETE "wc.tmp"
END

The attached wc.zip file contains the above source and a standalone wc.exe program. The following changes were made since the last zip offering:
- Additional strip characters added. (seems to work on html files)
- A word rank (wc.rnk) file is created along with the original alphabetical word list. (wc.lst)

804895 words in 55 seconds (was 33 seconds). The extra time goes to the extra replace loops and the parallel writes for the alphabetical and rank lists.
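
The tie-ordering problem with a reversed sort is easy to reproduce. Here is a small Python sketch (not ScriptBasic, with made-up counts) that builds the same fixed-width "count word" lines as the FORMAT("%6d", ...) call and sorts them in reverse. It then shows one possible alternative: sorting on a two-part key so the count is descending while ties stay alphabetical.

```python
# Hypothetical counts, for illustration only.
counts = {"the": 20, "fox": 10, "back": 10, "dog's": 10}

# Fixed-width rank lines, similar to what the program writes to wc.tmp.
lines = [f"{n:6d} {word}" for word, n in counts.items()]

# A plain reverse sort puts the most used word first,
# but words with equal counts also come out in reverse alphabetical order.
reverse_sorted = sorted(lines, reverse=True)

# Two-part key: count descending, then word ascending.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for line in reverse_sorted:
    print(line)
for word, n in ranked:
    print(f"{n:6d} {word}")
```

An external sort of whole lines has no way to express the mixed ordering, which is why the sketch sorts on a key instead.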
wc.zip (178.79 KB - downloaded 30 times.)
« Last Edit: May 10, 2009, 10:31:37 PM by John Spikowski »

JRS
« Reply #5 on: May 17, 2009, 03:45:10 AM »
I'm not done yet. I think I can shave some time off by using the ScriptBasic HASH extension module. This will allow me to go directly from the array created by SPLITA to a hash table of word/count pairs. I currently have an issue with the hash extension module not returning values (it worked in the past). I'll post new code, the word table, and results here soon. I sure hope the ProvideX team hasn't thrown in the towel.

JRS
« Reply #6 on: May 17, 2009, 01:21:48 PM »
I may have misunderstood the use of the HASH extension module. It looks like the iteration is in entry order and not alphabetical key order.

The hash structure implemented in this module maintains a linked list that allows the programmer to iterate over all elements of the hash in the order they were entered into the hash in both directions. To maintain iteration state there is an iteration pointer for each hash. There can only be a single iteration over a hash at a time. The iteration pointer can be set to the start, to the end of the list and can be moved one element forward and backward. The key and value pairs can also be retrieved pointed by the iteration pointer. The ordering of the elements in the hash are guaranteed to follow the time order the pairs were entered into the hash. The first element returned by the iteration pointer is the pair entered first in to the hash.

Looks like I'm going to have to live with 33 seconds unless I have a brain fart.

Observations

It takes 5 seconds to load the file and run a minimal replacement pass. It takes 10 seconds using the current strip string. It takes another 3 seconds to get to the point the work array is built using the SPLITA function. This means I'm using 20 seconds to do the following:
- Build the wc.raw file of words only (blank array elements stripped)
- Shell out to sort this list to create wc.srt
- Read through the wc.srt file counting duplicates and writing an entry to the alphabetical and rank files when the word changes
- Shell out to sort the rank file.
- Clean up the mess and end.
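
As an illustration of the entry-order behaviour, here is a short Python sketch. A Python dict is used purely as a stand-in for the HASH module (it is not that module's API), and the word list is made up. A dict also iterates in the order keys were first inserted, so an alphabetical list needs a separate sort of the keys. That sort only touches the distinct words, not every word in the text.

```python
# Hypothetical input words, in the order they appear in the text.
words = ["the", "quick", "brown", "the", "fox", "quick", "the"]

counts = {}
for word in words:
    # Insert the word with a count of 1, or bump the existing count.
    counts[word] = counts.get(word, 0) + 1

entry_order = list(counts)          # order of first appearance
alphabetical = sorted(counts)       # sort only the distinct keys

print(entry_order)
print([(word, counts[word]) for word in alphabetical])
```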
« Last Edit: May 17, 2009, 09:51:39 PM by John Spikowski »

JRS
« Reply #7 on: May 20, 2009, 09:35:55 PM »
I thought I would give Berkeley DB a try and test the database interface while I was at it. It didn't help with a quicker time but the code might be of interest to some of the members here looking for a keyed file system.

804895 words in 492 seconds.

Bible Word Count File

start_time = NOW
INCLUDE t.bas
INCLUDE bdb.bas
text = t::loadString("bible.txt")
strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
dbh = bdb::Open("bible.db", bdb::BTree, bdb::Create, 0)
word_total = 0
FOR x = 0 TO UBOUND(word_list)
  this_word = LCASE(TRIM(word_list[x]))
  IF LEN(this_word) THEN
    word_total += 1
    ON ERROR GOTO Update_Word
    bdb::Put(dbh, this_word, 1, bdb::NoOverWrite)
  END IF
Updated:
NEXT x
OPEN "wc.lst" FOR OUTPUT AS #1
word_count = bdb::First(dbh, "")
GOTO First
Next_Word:
word_count = bdb::Next(dbh)
IF word_count = undef THEN GOTO Done
First:
word_key = bdb::Key(dbh)
PRINT #1, word_key & " (" & word_count & ")\n"
GOTO Next_Word
Done:
PRINTNL #1
PRINT #1, word_total, " words in ", NOW - start_time, " seconds.\n"
CLOSE #1
bdb::Close(dbh)
END
Update_Word:
count = bdb::Get(dbh, this_word)
count += 1
bdb::Update(dbh, count)
RESUME Updated

I'm thinking of making a side challenge out of this so we can see how different database options compare. I'm going to use the first part of my original program to create a text file of all the words in the Bible (not in order). Each participating Basic language can use this word list to build a database of words (keys) with their occurrence counts as the data.

Code I used to create the Bible word file. (see attached)

INCLUDE t.bas
text = t::loadString("bible.txt")
strip = "()[]{}|<>/@0123456789*.,;:!#?%$&+=_~\"\\" & CHR(9) & CHR(10) & CHR(13)
FOR i = 1 TO LEN(strip)
  text = REPLACE(text, MID(strip, i, 1), " ")
NEXT i
SPLITA text BY " " TO word_list
OPEN "wc.raw" FOR OUTPUT AS #1
FOR x = 0 TO UBOUND(word_list)
  text_out = TRIM(word_list[x])
  IF LEN(text_out) THEN PRINT #1,LCASE(text_out),"\n"
NEXT x
CLOSE #1
END

I'm planning on submitting the following examples:
- Using flat files and Windows sort.exe
- Berkeley DB
- MySQL
« Last Edit: May 21, 2009, 12:27:48 AM by John Spikowski »

JRS
« Reply #8 on: May 21, 2009, 05:43:12 PM »
Here is the ScriptBasic word count database extended code challenge. This is the Berkeley DB version and a MySQL version will follow soon.

469 seconds

Bible Word List

start_time = NOW
INCLUDE bdb.bas
OPEN "bible-word.txt" FOR INPUT AS #1
OPEN "wc.lst" FOR OUTPUT AS #2
dbh = bdb::Open("bible.db", bdb::BTree, bdb::Create, 0)
Next_Word:
IF EOF(1) THEN GOTO Gen_List
LINE INPUT #1, this_word
word_total += 1
this_word = CHOMP(this_word)
ON ERROR GOTO Update_Word
bdb::Put(dbh, this_word, 1, bdb::NoOverWrite)
Updated:
GOTO Next_Word
Gen_List:
word_count = bdb::First(dbh, "")
GOTO First
Next_Rec:
word_count = bdb::Next(dbh)
IF word_count = undef THEN GOTO Done
First:
word_key = bdb::Key(dbh)
PRINT #2, word_key & " (" & word_count & ")\n"
GOTO Next_Rec
Done:
PRINTNL #2
PRINT #2, word_total, " words in ", NOW - start_time, " seconds.\n"
CLOSE #1
CLOSE #2
bdb::Close(dbh)
END
Update_Word:
count = bdb::Get(dbh, this_word)
count += 1
bdb::Update(dbh, count)
RESUME Updated

ScriptBasic Berkeley DB Information
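
For anyone taking up the side challenge in another language, here is a rough Python sketch of the same insert-or-update pattern against a keyed file. It uses Python's standard dbm.dumb module as a stand-in, not Berkeley DB, and the function name count_into_store is made up. Two differences from the ScriptBasic code are assumptions of this sketch: it checks for the key before writing instead of trapping an error from a no-overwrite put, and it sorts the keys explicitly because this store does not keep them in BTree order.

```python
import dbm.dumb
import os
import tempfile


def count_into_store(words, path):
    """Store each word as a key with its occurrence count as the value."""
    with dbm.dumb.open(path, "c") as db:
        for word in words:
            key = word.encode("utf-8")
            if key in db:
                # Key already present: read, increment, write back.
                db[key] = str(int(db[key]) + 1).encode("ascii")
            else:
                # First occurrence: store a count of 1.
                db[key] = b"1"
        # This store is unordered, so sort the keys for an alphabetical list.
        return [(k.decode("utf-8"), int(db[k])) for k in sorted(db.keys())]


with tempfile.TemporaryDirectory() as tmp:
    result = count_into_store(["the", "fox", "the"], os.path.join(tmp, "words"))

for word, n in result:
    print(f"{word} ({n})")
```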
« Last Edit: May 21, 2009, 05:59:18 PM by John Spikowski »

JRS
« Reply #9 on: May 23, 2009, 06:27:52 PM »
I gave the MySQL version a try, but after an hour or so with only 7000 of the expected 13000+ words completed, I exited the program. This was a great stress test of the MySQL interface, but it is not the type of processing SQL was made for. After reviewing the database, I found the program did what it was supposed to.

start_time = NOW
INCLUDE mysql.bas
OPEN "bible-word.txt" FOR INPUT AS #1
OPEN "wcdb.lst" FOR OUTPUT AS #2
dbh = mysql::RealConnect("host","user","password","database")
Next_Word:
IF EOF(1) THEN GOTO Gen_List
LINE INPUT #1, this_word
word_total += 1
this_word = CHOMP(this_word)
s = 1
Next_SQ:
sqp = INSTR(this_word, "'", s)
IF sqp THEN
  this_word = LEFT(this_word, sqp) & "'" & MID(this_word, sqp + 1)
  s = sqp + 2
  GOTO Next_SQ
END IF
ON ERROR GOTO Update_Word
mysql::query(dbh, "INSERT INTO bible VALUES ('" & this_word & "', " & 1 & ")")
Updated:
GOTO Next_Word
Gen_List:
mysql::query(dbh, "SELECT * FROM bible")
WHILE mysql::FetchArray(dbh, row)
  PRINT #2, row[0] & " (" & row[1] & ")\n"
WEND
PRINTNL #2
PRINT #2, word_total, " words in ", NOW - start_time, " seconds.\n"
CLOSE #1
CLOSE #2
mysql::Close(dbh)
END
Update_Word:
mysql::query(dbh, "SELECT word_count FROM bible WHERE word_name = '" & this_word & "'")
mysql::FetchArray(dbh, row)
word_count = row[0] + 1
mysql::query(dbh, "UPDATE bible SET word_count = " & word_count & " WHERE word_name = '" & this_word & "'")
RESUME Updated

ScriptBasic MySQL Information
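
Here is a Python sketch of the same insert-then-update-on-error pattern. It runs against an in-memory SQLite database as a stand-in, not MySQL, and is only meant to show the shape of the logic. The table and column names follow the queries above; the primary key on word_name is an assumption (something has to make the duplicate INSERT fail), and count_words_sql is a made-up name. Bound parameters take the place of the manual doubling of single quotes, and the whole load runs in one transaction. Whether either change would help the MySQL timing is untested.

```python
import sqlite3


def count_words_sql(words):
    """Insert each word with a count of 1, or bump the count if it already exists."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: word_name must be unique so a repeated INSERT raises an error.
    conn.execute(
        "CREATE TABLE bible (word_name TEXT PRIMARY KEY, word_count INTEGER)"
    )
    with conn:  # commit once at the end instead of after every statement
        for word in words:
            try:
                # Bound parameter: no manual escaping of single quotes needed.
                conn.execute("INSERT INTO bible VALUES (?, 1)", (word,))
            except sqlite3.IntegrityError:
                # Duplicate key: increment in a single UPDATE, no SELECT round trip.
                conn.execute(
                    "UPDATE bible SET word_count = word_count + 1 "
                    "WHERE word_name = ?",
                    (word,),
                )
    rows = conn.execute(
        "SELECT word_name, word_count FROM bible ORDER BY word_name"
    ).fetchall()
    conn.close()
    return rows


rows = count_words_sql(["dog's", "the", "dog's"])
for name, n in rows:
    print(f"{name} ({n})")
```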
« Last Edit: May 23, 2009, 07:10:04 PM by John Spikowski »