In response to Kats
Can you feel how much faster it is?

My processor appreciates the 2 millisecond breath of fresh air.
In response to Kats
Kats wrote:
My processor appreciates the 2 millisecond breath of fresh air.

You're welcome [:


client
    verb
        test1()
            for(var/count in 1 to 100000)
                tokenize1("this is a string and holy shit this is a thing that does stuff")
        test2()
            for(var/count in 1 to 100000)
                tokenize2("this is a string and holy shit this is a thing that does stuff"," ")

proc
    //kats
    tokenize1(t as text)
        var/list/token = new
        var/i = 1
        for(var/pos=1;pos<=length(t);pos++)
            if(pos==length(t)) //Caps off the last token and returns
                token.len++
                token[i] = t
                break
            switch(text2ascii(t,pos))
                if(32) //Whitespace
                    token.len++
                    token[i] = lowertext(copytext(t,1,pos))
                    t = copytext(t,++pos)
                    i++
                    pos = 0
                    continue
                if(34) //Quotation mark
                    var/end = findtext(t,"\"",pos+1)
                    if(!end)
                        //error("MISMATCHED PUNCTUATION")
                        return null
                    token.len++
                    token[i] = copytext(t,2,end)
                    t = copytext(t,end+2)
                    i++
                    pos = 0
                    continue
        return token

    //ter
    tokenize2(str,separator)
        var/slen = length(str)
        var/flen = length(separator)
        var/pos = 1
        var/fpos = 1
        . = list()
        while(fpos && pos<=slen)
            fpos = findtext(str,separator,pos)
            if(!fpos) break //separator not found; the remaining tail is appended below
            if(fpos==pos) //separator found immediately: keep an empty token
                . += ""
            else
                . += copytext(str,pos,fpos)
            pos = fpos + flen
        if(pos<=slen) //append whatever follows the last separator
            . += copytext(str,pos)
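
To make the empty-token behavior concrete, a quick check (the verb name is made up):

client/verb/test3()
    //adjacent separators yield an empty token:
    //tokenize2("a,,b",",") returns list("a","","b")
    for(var/t in tokenize2("a,,b",","))
        world.log << "token: \"[t]\""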


If you do the math, the tokenizer packaged in stdlib is 340% faster than this one. Not exactly a micro-optimization. =D

Kats wrote:
My processor appreciates the 2 millisecond breath of fresh air.

(Well, that missing fpos assignment sure changes things, now dunnit?)

In response to Ter13
Ter13, we're not tokenizing 100,000 things.
You're not, but at least now he's shown you why and when to use text2ascii - maybe in the future you'll remember this in a completely unrelated scenario and be able to solve a real problem quickly and efficiently.
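The gist, as a quick sketch (the proc here is hypothetical, just to isolate the comparison):

proc/count_spaces(t)
    . = 0
    for(var/pos=1;pos<=length(t);pos++)
        //text2ascii() returns a number, so checking against 32 skips the
        //temporary string that copytext(t,pos,pos+1) == " " would allocate
        if(text2ascii(t,pos) == 32)
            .++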
Ter13, we're not tokenizing 100,000 things.

Mike informed me that we're apparently on speaking terms again. There was much rejoicing.

That's actually not really the point. If you look back at what I was saying, I was really just talking about "reinventing the square wheel".

https://en.wikipedia.org/wiki/Reinventing_the_wheel#Related_phrases

https://en.wikipedia.org/wiki/Anti-pattern

I was really just advertising stdlib's new string functions and demonstrating that they make designing software that works with strings a lot easier and faster.

Sure, maybe it's not necessary, but the way I look at it, it's more important to know why something functions better, because that informs better patterns in the future, rather than relying on flawed methodologies that can and will, if unaddressed, ultimately become problematic.

I mean, after all, what are we here for, if not to expand our understanding of DM?
I believe you missed a =1 for fpos, Ter13. Happened to catch that playing around with the code myself.
In response to Ter13
Who is Mike? And yeah, I unblocked you because of Tom's new rules on pager bans. The situation of both of us talking in the same thread but not being able to see each other except in quotations is too annoying.
^Good catch. That definitely changes results to be a little more in line with what I initially expected.
In response to Ter13
>   tokenize2(str,separator)
>       var/slen = length(str)
>       var/flen = length(separator)
>       var/pos = 1
>       var/fpos
>       . = list()
>       while(fpos&&pos<=slen)
>           fpos = findtext(str,separator,pos)
>           if(fpos==pos)
>               . += ""
>           else
>               . += copytext(str,pos,fpos)
>           pos = fpos + flen
>       if(pos<=slen)
>           . += copytext(str,pos)
>


I am actually really confused about why the line while(fpos&&pos<=slen) even works. The way I understand it, operators are read from left to right, so fpos should evaluate as false, because it's uninitialized.
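
A quick test seems to confirm that suspicion (hypothetical verb; an uninitialized var starts as null, which is falsy):

client/verb/falsy_test()
    var/fpos //uninitialized vars start as null
    if(fpos)
        world.log << "fpos is truthy"
    else
        world.log << "fpos is falsy" //this branch runs: null is falsy,
                                     //so the while() body never executes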

It actually took me a minute to read because another thing still confuses me: why you add an empty string to the return value when fpos equals pos, instead of just skipping it. I'm not sure what you'd use that empty string for when you actually get around to parsing the data. Unless that was just a personal choice, after all.

EDIT: Just saw FKI's post. So it was an error, then.

Also, this still doesn't solve the issue of delimiting quotation marks and preventing strings from being broken up by whitespace. That's the only reason I was iterating in the first place, rather than solely using findtext() like a sane person.

However, taking a look at your tokenizer again, I was able to make some small adjustments to mine, mainly removing the i var iteration that I was using.
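Roughly this shape, minus the quote branch (a sketch of the adjustment, not the exact final code; the proc name is made up):

proc/tokenize1b(t as text)
    var/list/token = new
    for(var/pos=1;pos<=length(t);pos++)
        if(pos==length(t)) //cap off the last token
            token += t
            break
        if(text2ascii(t,pos) == 32) //whitespace: list += replaces token.len++/token[i]/i++
            token += lowertext(copytext(t,1,pos))
            t = copytext(t,++pos)
            pos = 0
    return token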
Yours looks similar to mine, Ter. Thief!
It actually took me a minute to read because another thing still confuses me: why you add an empty string to the return value when fpos equals pos, instead of just skipping it. I'm not sure what you'd use that empty string for when you actually get around to parsing the data. Unless that was just a personal choice, after all.

EDIT: Just saw FKI's post. So it was an error, then.

Yeah, and test results have been adjusted. Unfortunately, the stdlib hasn't had much feedback, so there are probably more than a few small bugs floating around in it. It happens, but it's already been fixed.

And I'm definitely open to the idea of getting rid of the empty tokens, but the reason I added them is actually for the sake of consistency. Let's say we want to hunt for mismatched quotation marks.

var/list/tokens = tokenize("this is \"a\" test of mismatched quotes\"","\"")
if(!(tokens.len%2))
    world.log << "error! mismatched quotation marks"
else
    world.log << "quotation marks are matched properly"


You see, if I took out the empty tokens, I'd potentially wind up with false results in the above example.

"\"\"herp\""


^That'd screw things up pretty good

Of course, I suppose that you could wind up using findtext_all() to count the number of delimiters before actually tokenizing it.
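
Something like this, say (a sketch; the proc name is made up, and it leans on findtext_all() returning a list of the positions found):

proc/quotes_balanced(str)
    //an odd number of quote marks means one of them is unmatched
    var/list/quotes = findtext_all(str,"\"")
    return !(quotes.len % 2)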

What do you think? Ditch the empty tokens?
In response to Ter13
Empty tokens you can probably get rid of. As for findtext_all(), yeah, that'd actually be pretty useful. What does it do? Just return a list of all of the positions where it was found?

Also, mismatched quotes aren't really a problem as far as my uses are concerned. It wouldn't take much more thought to add escape characters, but since my example runs as a command line, escape characters don't really have a use.
http://www.byond.com/forum/?post=1872243&page=2#comment15699713

I haven't been able to sit down and document everything that stdlib provides, but that's the thread if you want to take a look.

As for the string functions (there's a rough usage sketch after the list):

str_ends_with() checks if the supplied string ends with the supplied substring

str_begins_with() checks if the supplied string begins with the supplied substring

str_copy_after() returns everything found in a string after the supplied substring.

str_copy_before() returns everything found in a string before the supplied substring.

str_replace() replaces every instance of findstr with repstr in the supplied string, returning the finalized string.

tokenize() breaks the string up according to supplied separator/delimiter.

trim_whitespace() removes any whitespace from the beginning or end of a supplied string. Whitespace here means characters 9, 10, 11, 12, 13, 32, 133, and 160.

findtext_all() searches str for any instance of findstr and returns a list of positions where they are found.

screen_loc2num() transforms a screen_loc value into numeric coordinates, returning them in the format list(x,y).

It should be noted that almost all string functions have "Ex" variants, which denote that they are case-sensitive variants of the normal functions.
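
To give a feel for how these fit together, a rough usage sketch (argument order and exact signatures are assumptions based on the descriptions above):

client/verb/demo(msg as text)
    msg = trim_whitespace(msg) //strip leading/trailing whitespace
    if(str_begins_with(msg,"/")) //assumed (string,substring) argument order
        //split a hypothetical "/command arguments" line
        var/cmd = str_copy_before(msg," ")
        var/rest = str_copy_after(msg," ")
        world.log << "command: [cmd], arguments: [rest]"
    //replace every "::" with "." and output the result
    world.log << str_replace(msg,"::",".")
    //count the spaces via their positions
    var/list/spaces = findtext_all(msg," ")
    world.log << "[spaces.len] space\s found"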