HTML tag balancer

BYOND Forums

ID:155009

Oct 11 2011, 9:14 am

LordAndrew

Hey there. I posted a few months ago about this but figured it'd be best to make a new thread.

I'm trying to write a script that'll look through an HTML log, find tags that aren't closed, and close them. I've never really used BYOND's text manipulation procs, so I honestly don't know what I'm doing.

I'm not concerned with actually putting the tags in order. I just want to make sure that for every opening b tag on a line there's a closing b tag as well, and likewise for the other tags.

Oct 11 2011, 9:55 am

DarkProtest

Well, off the top of my head, I would imagine the process would look like this:

-Create a list to use as a "stack", which allows us to track which tags are active, and in which order they were opened.

-Create several variables to track positions and boundaries within the line you're working on.

-Get the text and grab your first line. If you're storing all of the lines as one string, you'll have to use findtext() to search for "\n" to identify the boundaries, and store the end of the current line.

-Use findtext() to find the first "<" within the line, and then check if the next character is a "/" or not (I like to use text2ascii() for this kind of thing, to avoid a string copy). If there is no slash, then it's an opening tag. Otherwise it's a closing tag. Store which it is.

-Either way, you need to get the tag name. Write a proc (something like parseGetTag(string,start,end)) that returns the tag name. Basically, it should search from start (which should be the opening bracket "<") to end (the end-of-line limit) to find the next ">". Then search between the found "<" and ">" for any white-space characters (space, tab). The tag name ("b", "font", "strong") runs from the character after the opening bracket ("<") to the first white-space, or to the closing bracket (">") if no white-space was found.

-For an opening tag, add it to the end of the list we're using as a "stack"

-For a closing tag, check if the tag name is in the list. If it is, loop from the END of the list towards the front until you find it. As you do so, add closing tags for all of those tags that you hit BEFORE it, and insert them before the existing closing tag you found (removing the tags from the list as you add them to the line). If the existing tag isn't in the list, you can ignore it, encode it, or cut it out.

-When you reach the end of the line, add closing tags for any tags still in the list (from the END to the front, removing them as you go). Then move on to the next line.

That's the gist of it. Once you get the algorithm down it's not too bad, but the real trick is how you handle poor formatting and special characters. I'm not sure if you can count on non-html to be encoded (such as the emoticon >_>)
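The closing-tag step described above can be sketched on its own. This is only an illustration under stated assumptions: closeDownTo and its parameters are hypothetical names (not BYOND built-ins and not code from this thread), and the stack is assumed to hold lower-case tag names with the most recently opened tag last.

```dm
// Hypothetical helper for the closing-tag step.
// If tagName is currently open, pop the stack from the end down to (and
// including) tagName and return closing tags for everything popped.
// Returns "" and leaves the stack alone when tagName was never opened.
proc/closeDownTo(list/stack, tagName)
    . = ""
    if(!(tagName in stack))
        return
    while(stack.len)
        var/openTag = stack[stack.len]  // most recently opened tag
        stack.len--                     // pop it
        . += "</[openTag]>"
        if(openTag == tagName)
            break                       // reached the tag being closed
```

For input like <tt>a <i>b</tt>, the stack holds "tt" then "i" when the closing tt is found, so closeDownTo(stack, "tt") returns the text </i></tt> and leaves the stack empty. The caller would emit that text in place of the original closing tag.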

Oct 11 2011, 10:31 am

In response to DarkProtest

LordAndrew

Ah, hmm.

The (highly unsuccessful) method I was using before getting stuck was using this proc written by Garthor to break up the lines:

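proc
    // Split string into a list of pieces, cutting at every occurrence of delimiter
    tokenize(string, delimiter)
        . = list()

        var
            len = length(delimiter)
            lastpos = 1
            pos = findtext(string, delimiter)

        while(pos != 0)
            // Copy the piece before the delimiter, then skip past the delimiter
            . += copytext(string, lastpos, pos)
            lastpos = pos + len
            pos = findtext(string, delimiter, lastpos)

        // Whatever follows the last delimiter is the final piece
        . += copytext(string, lastpos)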

using the <br> tag as the separator.

mob
    verb
        Strip()
            var
                log = file2text('thelog.htm')

            // Show the untouched log for comparison
            src << browse(log, "window=log_raw")

            var
                list/L = tokenize(log, "<br>")

            for(var/line in L)
                src << html_encode(line)

                // Count the opening <b> tags on this line
                var
                    openMarker = findtext(line, "<b>")
                    bTags = 0

                while(openMarker != 0)
                    bTags++
                    openMarker = findtext(line, "<b>", openMarker + 1)

                src << "OPEN B TAGS: [bTags]"

                // Count the closing </b> tags on this line
                var
                    closeMarker = findtext(line, "</b>")
                    alsobTags = 0

                while(closeMarker != 0)
                    alsobTags++
                    closeMarker = findtext(line, "</b>", closeMarker + 1)

                src << "CLOSE B TAGS: [alsobTags]"

This is what I have, and it does report the tags per line. I don't know where to go from there, though, and even then I think I'm messing up big time with what I have here (especially considering it only checks one tag in the first place, and I'd hate to do it manually for every darn tag.)

Also here's the HTML that I'm testing it on:

<b>hi there</b><br>
<i>sup bro<br>
<b><u>dude bro what is up<br>
<s>uh oh these tags are bleeding downwards<br>
<tt>oh lord <i>this isn't good</tt><br>

(tl;dr i don't know how to program at all and messing with text procs confuses me)
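One way to avoid repeating the counting loop for every tag is to move it into a helper and loop over a list of tag names. The sketch below is an illustration under assumptions, not code from this thread: countText and closeByCount are hypothetical names, only bare tags such as <b> are recognised (no attributes), and missing closing tags are simply appended to the end of the line in list order, so the nesting may come out wrong.

```dm
// Hypothetical helper: count non-overlapping occurrences of needle in haystack.
proc/countText(haystack, needle)
    . = 0
    var/pos = findtext(haystack, needle)
    while(pos)
        . += 1
        pos = findtext(haystack, needle, pos + length(needle))

// Hypothetical helper: for each tag name, append one closing tag per
// opening tag that has no matching closing tag on this line.
proc/closeByCount(line, list/tagNames)
    . = line
    for(var/tagName in tagNames)
        var/missing = countText(line, "<[tagName]>") - countText(line, "</[tagName]>")
        for(var/i = 1 to missing)   // does nothing when missing <= 0
            . += "</[tagName]>"
```

With the second line of the sample HTML, closeByCount("<i>sup bro", list("b", "i", "u", "s", "tt")) returns the text <i>sup bro</i>.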

Oct 11 2011, 12:03 pm (Edited on Oct 11 2011, 12:08 pm)

In response to LordAndrew

DarkProtest

Well, I can try to put together something for you, but I don't have access to BYOND at the moment, so ymmv.

mob
    verb
        Strip()
            var/log = file2text('thelog.htm')

            src << browse(log, "window=log_raw")

            log = parseCleanHTML(log, "<br>", FALSE, TRUE)

            src << browse(log, "window=log_clean")

// List of tags that can close themselves, ie <br />
var/list/selfClosingHtmlTags = list("area", "base", "basefont", "br", "hr", "input", "img", "link", "meta")

proc
    // Return a copy of string that has had all open tags closed
    //  Operates per line, as delimited by passed delimiter
    parseCleanHTML(string, delimiter="<br>", encodeTags=FALSE, stripAttributes=FALSE)
    
        var/result = "" // We'll build our return string as we parse the input
        var/list/stack = list() // Stack of currently open tags
        
        var
            // Text tracking variables
            stringLength = length(string)
            delimiterLength = length(delimiter)
            processingIndex = 1
            lineEndIndex
            openingBracketIndex
            closingBracketIndex
            tagName
            otherTag
            addText
            addTag
        
        do
            lineEndIndex = findtext(string, delimiter, processingIndex)
            
            // Process this line
            while(processingIndex <= stringLength)
                // Search for the opening bracket
                openingBracketIndex = findtext(string, "<", processingIndex, lineEndIndex)
                if(!openingBracketIndex) break
                
                // Search for the closing bracket
                closingBracketIndex = findtext(string, ">", openingBracketIndex, lineEndIndex)
                if(!closingBracketIndex) break
                
                tagName = parseGetFirstWord(string, openingBracketIndex+1, closingBracketIndex)
                if(!tagName || tagName=="/")
                    // Mangled tag, just treat it as normal text and add the current chunk in
                    addText = html_encode(copytext(string, processingIndex, closingBracketIndex+1))
                    addTag = ""
                else if(text2ascii(tagName,1) == 47) // 47 is the ASCII value for a forward slash (/)
                    // Closing tag, close any tags opened before it
                    tagName = ckey(tagName) // Remove the slash
                    
                    if(tagName in stack)
                        addText = html_encode(copytext(string, processingIndex, openingBracketIndex))
                        
                        // Loop backwards through stack until we hit our tag, and close them
                        addTag = ""
                        for(var/i = stack.len, i>=1, i++)
                            otherTag = stack[i]
                            stack.len--
                            if(otherTag in selfClosingHtmlTags)
                                continue // We'll treat certain tags as optional to close
                            
                            addTag += "</[otherTag]>" // Add closing tag
                            
                            if(otherTag == tagName)
                                break // Found our tag, done!
                        
                    else
                        // No opening tag found, just treat as plain text
                        addText = html_encode(copytext(string, processingIndex, closingBracketIndex+1))
                        addTag = ""
                else
                    // Opening tag, add to list of open tags
                    tagName = ckey(tagName) // Remove any trailing slashes (self-closing)
                    stack += tagName
                    
                    // Add in chunk of encoded normal text
                    addText = html_encode(copytext(string, processingIndex, openingBracketIndex))
                    
                    // Add in the tag itself
                    if(encodeTags)
                        addTag = html_encode(copytext(string, openingBracketIndex, closingBracketIndex+1))
                    else if(stripAttributes)
                        addTag = "<[tagName]>"
                    else
                        addTag = copytext(string, openingBracketIndex, closingBracketIndex+1)
                
                // Add in parsed text
                result += addText + addTag
                
                // Move past it
                processingIndex = closingBracketIndex+1
                
            // Copy over remaining text
            result += html_encode(copytext(string, processingIndex, lineEndIndex))
            
            // Add any necessary closing tags (skips any self-closing tags)
            for(var/i = stack.len, i>=1, i++)
                otherTag = stack[i]
                stack.len--
                if(otherTag in selfClosingHtmlTags)
                    continue // We'll treat certain tags as optional to close
                addTag += "</[otherTag]>" // Add closing tag
            
            // Add end-of-line
            result += delimiter
            
            // Move to the next line (if exists)
            if(lineEndIndex)
                processingIndex = lineEndIndex + delimiterLength
            
        while(processingIndex <= stringLength && lineEndIndex>0)
        
        return result
    
    // Get first word in string, delimited by white-space: space, tab, or newline
    // end is the index AFTER the last character to be considered
    parseGetFirstWord(string, start=1, end=0)
        // Find the closest whitespace character to the start
        var/whitespace = findtext(string, " ", start, end) // Space
        var/otherspace = findtext(string, "\t", start, (whitespace > 0 ? whitespace : end)) // Tab
        if(otherspace > 0) whitespace = otherspace
        otherspace = findtext(string, "\n", start, (whitespace > 0 ? whitespace : end)) // Newline
        if(otherspace > 0) whitespace = otherspace
        
        // The first word is the text from start to the first whitespace character
        //   (or to end if no white space)
        return copytext(string, start, (whitespace > 0 ? whitespace : end))

I think I covered most of the edge cases reasonably well. I'll take another look at it tonight when I have access to BYOND, but let me know if it works or not ;)

Oct 11 2011, 12:13 pm

In response to DarkProtest

LordAndrew

Thanks!

Plugged it in and got a few errors that were easily fixed. Tried it out though and got hit with:

runtime error: list index out of bounds
proc name: parseCleanHTML (/proc/parseCleanHTML)
  source file: DarkProtest.dm,105
  usr: Guest-3550777048 (/mob)
  src: null
  call stack:
parseCleanHTML("hi there
\nsup br...", "
", 0, 1)
Guest-3550777048 (/mob): Strip()

            // Add any necessary closing tags (skips any self-closing tags)
            for(var/i = stack.len, i>=1, i++)
                otherTag = stack[i] //line 105
                stack.len--
                if(otherTag in selfClosingHtmlTags)
                    continue // We'll treat certain tags as optional to close
                addTag += "</[otherTag]>" // Add closing tag

Oct 11 2011, 12:24 pm

In response to LordAndrew

DarkProtest

Whoops, change that i++ in the for loop iterator to i--

Oct 11 2011, 12:52 pm

In response to DarkProtest

LordAndrew

Changed it, got the same error on line 64:

                        for(var/i = stack.len, i>=1, i++)
                            otherTag = stack[i] //line 64
                            stack.len--
                            if(otherTag in selfClosingHtmlTags)
                                continue // We'll treat certain tags as optional to close

Changed that one to i-- and compiled it. Got no errors but the text it outputted:

BEFORE:

hi there
sup bro
dude bro what is up
<s>uh oh these tags are bleeding downwards
oh lord this isn't good

AFTER:

hi there
sup bro
dude bro what is up
<s>uh oh these tags are bleeding downwards
oh lord this isn't good

Oct 11 2011, 2:22 pm

In response to LordAndrew

DarkProtest

This is why you don't write code in notepad...

Before the second for() loop, add:

addTag = ""

Then change "result += delimiter" to "result += addTag + delimiter"
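Taken together with the earlier i-- correction, the end-of-line step would look roughly like the sketch below. It is pulled out into a stand-alone proc purely for illustration: finishLine and voidTags are hypothetical names (the posted code does this inline and uses selfClosingHtmlTags), and the tag list is shortened.

```dm
// Shortened stand-in for the thread's selfClosingHtmlTags list (assumption).
var/list/voidTags = list("br", "hr", "img")

// Hypothetical stand-alone version of the end-of-line step.
// Closes every tag still on the stack, newest first, then ends the line.
proc/finishLine(result, list/stack, delimiter = "<br>")
    var/addTag = ""                      // start from an empty string
    for(var/i = stack.len, i >= 1, i--)  // walk from the end to the front
        var/otherTag = stack[i]
        stack.len--                      // pop the tag just read
        if(otherTag in voidTags)
            continue                     // these tags need no closing tag
        addTag += "</[otherTag]>"
    return result + addTag + delimiter   // emit the closing tags, then the delimiter
```

Without the addTag reset, the loop keeps appending to whatever the last tag on the line left behind, and without adding addTag to result, the closing tags are built but never written out.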

Oct 11 2011, 2:44 pm (Edited on Oct 11 2011, 3:02 pm)

In response to DarkProtest

LordAndrew

Added the fixes and tested it. It does in fact balance tags! It does seem to convert apostrophes into HTML entities, though.

Which... isn't too much of a deal but I dunno why it'd do that.

<s>After testing it on an actual log that required cleaning, the stripper removed all the contents from <font> tags, as well as <img> tags. Hrm.</s> I am an idiot and didn't spot that one of the proc's parameters allowed me to toggle this on and off. WHOOPS.

Everything seems to be in working order with no problems I can spot aside from apostrophes being made into entities. Thank you very much DarkProtest! :D
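On the apostrophe question: the only transformation the proc applies to plain text is html_encode(), so that call is the likely source of the entities. If it matters, one option is to escape only the characters that can be mistaken for markup. The sketch below assumes that is acceptable for the log being cleaned. replaceAll and encodeMarkupOnly are hypothetical helpers written for this illustration, not BYOND built-ins.

```dm
// Hypothetical helper: replace every occurrence of needle in string.
proc/replaceAll(string, needle, replacement)
    . = ""
    var/lastpos = 1
    var/pos = findtext(string, needle)
    while(pos)
        . += copytext(string, lastpos, pos) + replacement
        lastpos = pos + length(needle)
        pos = findtext(string, needle, lastpos)
    . += copytext(string, lastpos)  // text after the last match

// Hypothetical alternative to html_encode() that leaves apostrophes
// and quotes untouched and only escapes &, < and >.
proc/encodeMarkupOnly(string)
    string = replaceAll(string, "&", "&amp;")  // must run first
    string = replaceAll(string, "<", "&lt;")
    string = replaceAll(string, ">", "&gt;")
    return string
```

Swapping this in for the html_encode() calls on plain-text chunks would keep stray brackets such as the >_> emoticon from being read as tags while leaving apostrophes as typed.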