In response to Kats
Can you feel how much faster it is?

My processor appreciates the 2 millisecond breath of fresh air.
In response to Kats
Kats wrote:
My processor appreciates the 2 millisecond breath of fresh air.

You're welcome [:


client
    verb
        test1()
            for(var/count in 1 to 100000)
                tokenize1("this is a string and holy shit this is a thing that does stuff")
        test2()
            for(var/count in 1 to 100000)
                tokenize2("this is a string and holy shit this is a thing that does stuff"," ")

proc
    //kats
    tokenize1(t as text)
        var/list/token = new
        var/i = 1
        for(var/pos=1;pos<=length(t);pos++)
            if(pos==length(t)) //Caps off the last token and returns
                token.len++
                token[i] = t
                break
            switch(text2ascii(t,pos))
                if(32) //Whitespace
                    token.len++
                    token[i] = lowertext(copytext(t,1,pos))
                    t = copytext(t,++pos)
                    i++
                    pos = 0
                    continue
                if(34) //Quotation mark
                    var/end = findtext(t,"\"",pos+1)
                    if(!end)
                        //error("MISMATCHED PUNCTUATION")
                        return null
                    token.len++
                    token[i] = copytext(t,2,end)
                    t = copytext(t,end+2)
                    i++
                    pos = 0
                    continue
        return token

    //ter
    tokenize2(str,separator)
        var/slen = length(str)
        var/flen = length(separator)
        var/pos = 1
        var/fpos = 1
        . = list()
        while(fpos && pos<=slen)
            fpos = findtext(str,separator,pos)
            if(!fpos) break //separator not found; the remaining tail is appended below
            if(fpos==pos) //separator found immediately: keep an empty token
                . += ""
            else
                . += copytext(str,pos,fpos)
            pos = fpos + flen
        if(pos<=slen) //append whatever follows the last separator
            . += copytext(str,pos)
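
To make the empty-token behavior concrete, a quick check (the verb name is made up):

client/verb/test3()
    //adjacent separators yield an empty token:
    //tokenize2("a,,b",",") returns list("a","","b")
    for(var/t in tokenize2("a,,b",","))
        world.log << "token: \"[t]\""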


If you do the math, the tokenizer packaged in stdlib is 340% faster than this one. Not exactly a micro-optimization. =D

Kats wrote:
My processor appreciates the 2 millisecond breath of fresh air.

(Well, that missing fpos assignment sure changes things, now dunnit?)

In response to Ter13
Ter13, we're not tokenizing 100,000 things.
You're not, but at least now he's shown you why and when to use text2ascii - maybe in the future you'll remember this in a completely unrelated scenario and be able to solve a real problem quickly and efficiently.
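The gist, as a quick sketch (the proc here is hypothetical, just to isolate the comparison):

proc/count_spaces(t)
    . = 0
    for(var/pos=1;pos<=length(t);pos++)
        //text2ascii() returns a number, so checking against 32 skips the
        //temporary string that copytext(t,pos,pos+1) == " " would allocate
        if(text2ascii(t,pos) == 32)
            .++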
Ter13, we're not tokenizing 100,000 things.

Mike informed me that we're apparently on speaking terms again. There was much rejoicing.

That's actually not really the point. If you look back at what I was saying, I was really just talking about "reinventing the square wheel".

https://en.wikipedia.org/wiki/Reinventing_the_wheel#Related_phrases

https://en.wikipedia.org/wiki/Anti-pattern

I was really just advertising stdlib's new string functions and demonstrating that they make designing software that works with strings a lot easier and faster.

Sure, maybe it's not necessary, but the way I look at it, it's more important to know why something functions better, because that informs better patterns in the future, rather than relying on flawed methodologies that can and will, if unaddressed, ultimately become problematic.

I mean, after all, what are we here for, if not to expand our understanding of DM?
I believe you missed a =1 for fpos, Ter13. Happened to catch that playing around with the code myself.
In response to Ter13
Who is Mike? And yeah, I unblocked you because of Tom's new rules on pager bans. The situation of both of us talking in the same thread but not being able to see each other except in quotations is too annoying.
^Good catch. That definitely changes results to be a little more in line with what I initially expected.
In response to Ter13
>   tokenize2(str,separator)
>       var/slen = length(str)
>       var/flen = length(separator)
>       var/pos = 1
>       var/fpos
>       . = list()
>       while(fpos&&pos<=slen)
>           fpos = findtext(str,separator,pos)
>           if(fpos==pos)
>               . += ""
>           else
>               . += copytext(str,pos,fpos)
>           pos = fpos + flen
>       if(pos<=slen)
>           . += copytext(str,pos)
>


I am actually really confused about why the line while(fpos&&pos<=slen) even works. The way I understand it, operators are read from left to right, so fpos should evaluate as false, because it's uninitialized.
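
A quick test seems to confirm that suspicion (hypothetical verb; an uninitialized var starts as null, which is falsy):

client/verb/falsy_test()
    var/fpos //uninitialized vars start as null
    if(fpos)
        world.log << "fpos is truthy"
    else
        world.log << "fpos is falsy" //this branch runs: null is falsy,
                                     //so the while() body never executes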

It actually took me a minute to read because another thing still confuses me: why you add an empty string to the return value when fpos equals pos, instead of just skipping it. I'm not sure what you'd use that empty string for when you actually get around to parsing the data. Unless that was just a personal choice, after all.

EDIT: Just saw FKI's post. So it was an error, then.

Also, this still doesn't solve the issue of delimiting quotation marks and preventing strings from being broken up by whitespace. That's the only reason I was iterating in the first place, rather than solely using findtext() like a sane person.

However, taking a look at your tokenizer again, I was able to make some small adjustments to mine, mainly removing the i var iteration that I was using.
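Roughly this shape, minus the quote branch (a sketch of the adjustment, not the exact final code; the proc name is made up):

proc/tokenize1b(t as text)
    var/list/token = new
    for(var/pos=1;pos<=length(t);pos++)
        if(pos==length(t)) //cap off the last token
            token += t
            break
        if(text2ascii(t,pos) == 32) //whitespace: list += replaces token.len++/token[i]/i++
            token += lowertext(copytext(t,1,pos))
            t = copytext(t,++pos)
            pos = 0
    return token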
Yours looks similar to mine, Ter. Thief!
It actually took me a minute to read because another thing still confuses me: why you add an empty string to the return value when fpos equals pos, instead of just skipping it. I'm not sure what you'd use that empty string for when you actually get around to parsing the data. Unless that was just a personal choice, after all.

EDIT: Just saw FKI's post. So it was an error, then.

Yeah, and test results have been adjusted. Unfortunately, the stdlib hasn't had much feedback, so there are probably more than a few small bugs floating around in it. It happens, but it's already been fixed.

And I'm definitely open to the idea of getting rid of the empty tokens, but the reason I added them is actually for the sake of consistency. Let's say we want to hunt for mismatched quotation marks.

var/list/tokens = tokenize("this is \"a\" test of mismatched quotes\"","\"")
if(!(tokens.len%2))
    world.log << "error! mismatched quotation marks"
else
    world.log << "quotation marks are matched properly"


You see, if I took out the empty tokens, I'd potentially wind up with false results in the above example.

"\"\"herp\""


^That'd screw things up pretty good

Of course, I suppose that you could wind up using findtext_all() to count the number of delimiters before actually tokenizing it.
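
Something like this, say (a sketch; the proc name is made up, and it leans on findtext_all() returning a list of the positions found):

proc/quotes_balanced(str)
    //an odd number of quote marks means one of them is unmatched
    var/list/quotes = findtext_all(str,"\"")
    return !(quotes.len % 2)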

What do you think? Ditch the empty tokens?
In response to Ter13
Empty tokens you can probably get rid of. As for findtext_all(), yeah, that'd actually be pretty useful. What does it do? Just return a list of all of the positions where it was found?

Also, mismatched quotes aren't really a problem as far as my uses are concerned. It wouldn't take much more thought to add escape characters, but since my example runs as a command line, escape characters don't really have a use.
http://www.byond.com/forum/?post=1872243&page=2#comment15699713

I haven't been able to sit down and document everything that stdlib provides, but that's the thread if you want to take a look.

As for the string functions (there's a rough usage sketch after the list):

str_ends_with() checks if the supplied string ends with the supplied substring

str_begins_with() checks if the supplied string begins with the supplied substring

str_copy_after() returns everything found in a string after the supplied substring.

str_copy_before() returns everything found in a string before the supplied substring.

str_replace() replaces every instance of findstr with repstr in the supplied string, returning the finalized string.

tokenize() breaks the string up according to supplied separator/delimiter.

trim_whitespace() removes any whitespace from the beginning or end of a supplied string. Whitespace here means characters 9, 10, 11, 12, 13, 32, 133, and 160.

findtext_all() searches str for any instance of findstr and returns a list of positions where they are found.

screen_loc2num() transforms a screen_loc value into numeric coordinates, returning them in the format list(x,y).

It should be noted that almost all string functions have "Ex" variants, which denote that they are case-sensitive variants of the normal functions.
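
To give a feel for how these fit together, a rough usage sketch (argument order and exact signatures are assumptions based on the descriptions above):

client/verb/demo(msg as text)
    msg = trim_whitespace(msg) //strip leading/trailing whitespace
    if(str_begins_with(msg,"/")) //assumed (string,substring) argument order
        //split a hypothetical "/command arguments" line
        var/cmd = str_copy_before(msg," ")
        var/rest = str_copy_after(msg," ")
        world.log << "command: [cmd], arguments: [rest]"
    //replace every "::" with "." and output the result
    world.log << str_replace(msg,"::",".")
    //count the spaces via their positions
    var/list/spaces = findtext_all(msg," ")
    world.log << "[spaces.len] space\s found"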