Text Filter Efficiency

BYOND Forums

Announcements · BYOND Help · Bug Reports · Feature Requests · Beta Testers · Beta Bugs · Developer Help · Design Philosophy · Demos & Libraries · Tutorials & Snippets · Art & Sound · Classified Ads · Game Updates · Contests & Events · Linux Talk · On Topic · Off Topic

ID:151824

Jan 24 2009, 1:11 pm

Stupot

I am creating a Text Filter (which could be used for good, i.e. blocking out curse words, but I am using it for evil).

Regardless, I would like efficient code. Really, my question is how efficient is findtext() ?

Right now, I have a list of words to be found in message and for each word, I search the entire message using findtext(). This means, if I'm looking for six words, findtext() is called six times for every message, which doesn't strike me as very efficient (depending on how well findtext() works).

My other idea is to have the list of words I'm looking for to be sorted alphabetically. Then, when a message is received, it is broken down into individual words and then each word is compared to see if it is in the list. Since the list is sorted, this comparison can be done in O(log(n)). Binary Search Algorithm FTW?

Clearly the second method will require more coding on my part, but findtext() will be removed completely. The question is, is it worth it?

Jan 24 2009, 1:22 pm
Stephen001	I quite like the second approach, it's novel. If it helps at all, DM lists are backed by binary search trees. I assume findtext() performs reasonably similarly to C's strstr() (which seems commonly to be O(n²), although glibc manages O(n) under most memory allocation scenarios).

Jan 24 2009, 3:44 pm

Theodis

I did this awhile ago http://www.byond.com/developer/Theodis/LanguageFilter2

Basically it builds a tree structure from a list of words with each letter pointing to the next until it reaches a null in which the word is finished. So for say

ball
bat

The base node would be b, followed by a, a would then have two sub nodes l and t, t would then have a null child to indicate that it's the end of a set.

So when filtering it just follows down the tree as it moves through the text and if it hits a null then a word is matched. If it runs into no matching base nodes then it isn't a match.

If you have a giant list to go through this method is awesome since it requires only one pass regardless of the size of the list and passing through the list is very efficient due to the tree structure of the list.

This was however built when there was the 65k limit making this system impractical because there is a node per letter which adds up very fast. However with the new 16 mil limit it should be a very practical library. It's been over 4 years since I touched that code though so I don't know what state it's in or how much touch up work it needs.

Jan 24 2009, 4:51 pm In response to Theodis
Stupot	Ahh, I actually looked at your first Language Filter to get started on the one I made. I didn't realize you had made a second. I don't think I'm going to make mine quite so complicated though. I'll just use the array index to eliminate half of the words at a time. I don't see the need for a whole tree structure.

Jan 24 2009, 4:53 pm

In response to Stupot

Theodis

Ahh, I actually looked at your first Language Filter to get started on the one I made. I didn't realize you had made a second.

Yeah from the look of it since I still have old code commented out and verbs in there for profiling I assume I just never finished it and left it hidden due to the 65k limit. But it should actually make a decent library now. I might polish it up again since it's much more practical now.

Jan 24 2009, 8:17 pm (Edited on Jan 24 2009, 8:36 pm)

In response to Theodis

Stupot

Ok, so I've implemented it. I've only been coding in DM for about 3 days, so it's not the prettiest I'm sure, and there's probably a better place to share this, but this is what I did (in a nutshell).

proc
    SearchThroughMessage(msg as text)
    {
        var/str
        var/index
        index = findtext(msg," ")
        var/msg2 = msg
        while(index != 0)
            str = copytext(msg2,1,index)
            msg2 = copytext(msg2,index)
            index = findtext(msg2," ",2)
            SearchForWord(str)
        str = msg2
        SearchForWord(msg2)
    }
    SearchForWord(msg as text)
    {
        var/start = 1
        var/end = CurseWords.len
        var/curr = CurseWords.len/2
        while((curr != start) && (curr != end)) 
            if(findtext(msg,CurseWords[curr]))
                DisplayMessage(CurseWords[curr])
                return
            if(lowertext(CurseWords[curr]) > lowertext(msg))
                end = curr-1             
            else
                start = curr+1;
            curr = round((start+end)/2)
            if(end-start < 2)
                if(findtext(msg,CurseWords[start]))
                    DisplayMessage(CurseWords[start])
                    return
                if(findtext(msg,CurseWords[end]))
                    DisplayMessage(CurseWords[end])
                return
        if(findtext(msg,CurseWords[curr]))
            DisplayMessage(CurseWords[curr])
    }

When a usr calls upon their say verb, I have it call SearchThroughMessage(msg)

Note: CurseWords has been sorted alphabetically, ignoring upper/lower case.

Also, I'm using DisplayMessage() do display a message when a CurseWord appears, but you can see how one might easily change what to do when a filtered word is found (although it might take a bit of changing). Regardless, I think this method is efficient and I feel it's easier than setting up a tree.

Edit: Corrected mistake before anyone saw. >.>

Jan 26 2009, 11:32 am

In response to Stupot

Stupot

After some though, I did realize that this concept does have a flaw (or feature?)

Say you are trying to detect the world "apple". If you passed the phrase, "I'd like an apple." it would be detected.

If, however, you passed, "I'd like a snapple." Apple would not get detected. "I'd like an appleufagus." would be detected, as apple is the start of the word. Furthermore, "I'd like an aapple" does stand a chance of being detected. "-apple" could also cause problems.

If you were going to use this as a language filter, it is not going to catch everything.

That being said, there is always a way around a language filter, so this could still be a viable solution on those merits (assuming a language filter that can be easily gotten around is your goal).

I am simply using it to detect choice words, not as a language filter, so it fulfills my purposes just fine.