ID:155009
 
Hey there. I posted a few months ago about this but figured it'd be best to make a new thread.

I'm trying to write a script that'll look through an HTML log and find tags that aren't closed, then close them. I've never really used BYOND's text manipulation procs so I honestly don't even know what I'm doing.

I'm not concerned with actually putting the tags in order, I just want to make sure that for every opening b tag on a line, there's a closing b tag as well, and so on for the other tags.
Well, off the top of my head, I would imagine the process would look like this:

-Create a list to use as a "stack", which allows us to track which tags are active, and in which order they were opened.

-Create several variables to track positions and boundaries within the line you're working on.

-Get the text and grab your first line. If you're storing all of the lines as one string, you'll have to use findtext() to search for "\n" to identify the boundaries, and store the end of the current line.

-Use findtext() to find the first "<" within the line, and then check if the next character is a "/" or not (I like to use text2ascii() for this kind of thing, to avoid a string copy). If there is no slash, then it's an opening tag. Otherwise it's a closing tag. Store which it is.

-Either way, you need to get the tag name. Write a process (something like parseGetTag(string,start,end)) that returns the tag name. Basically, it should search from start (which should be the opening bracket "<") to end (end of line limit) to find the next ">". Search between the found "<" and ">" for any white-space characters (space, tab). The tag name ("b", "font", "strong") will run from the character after the opening bracket ("<") to the first white-space, or to the closing bracket (">") if no white-space was found.

-For an opening tag, add it to the end of the list we're using as a "stack"

-For a closing tag, check if the tag name is in the list. If it is, loop from the END of the list towards the front until you find it. As you do so, add closing tags for all of those tags that you hit BEFORE it, and insert them before the existing closing tag you found (removing the tags from the list as you add them to the line). If the existing tag isn't in the list, you can ignore it, encode it, or cut it out.

-When you reach the end of the line, add closing tags for any tags still in the list (from the END to the front, removing them as you go). Then move on to the next line.

That's the gist of it. Once you get the algorithm down it's not too bad, but the real trick is how you handle poor formatting and special characters. I'm not sure if you can count on non-html to be encoded (such as the emoticon >_>)
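
The tag-name step above could look something like this. It's only a rough, untested sketch: parseGetTag is the name suggested in the step, but the body, the local variable names, and the choice to lowercase the result are assumptions.

```dm
// Hypothetical helper following the description above.
// start is the index of the "<"; end is the index just past the last
// character to search (0 means "to the end of the string").
proc/parseGetTag(string, start, end = 0)
    var/closeIndex = findtext(string, ">", start, end)
    if(!closeIndex)
        return null // no closing bracket before the end of the line

    var/nameStart = start + 1
    // Skip the slash of a closing tag such as </b>
    if(text2ascii(string, nameStart) == 47) // 47 is "/"
        nameStart++

    // The name runs to the first space or tab, or to ">" if there is none
    var/nameEnd = closeIndex
    var/spaceIndex = findtext(string, " ", nameStart, nameEnd)
    if(spaceIndex)
        nameEnd = spaceIndex
    var/tabIndex = findtext(string, "\t", nameStart, nameEnd)
    if(tabIndex)
        nameEnd = tabIndex

    return lowertext(copytext(string, nameStart, nameEnd))
```

Called as parseGetTag("<font color=red>hi", 1) this should give back "font". Whether the tag was an opening or closing one still has to be checked separately by looking at the character after the "<".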
In response to DarkProtest
Ah, hmm.

The (highly unsuccessful) method I was using before getting stuck was using this proc written by Garthor to break up the lines:

proc
    tokenize(string, delimiter)
        . = list()

        var
            len = length(delimiter)
            lastpos = 1
            pos = findtext(string, delimiter)

        while(pos != 0)
            . += copytext(string, lastpos, pos)
            lastpos = pos + len
            pos = findtext(string, delimiter, lastpos)

        . += copytext(string, lastpos)


using the <br> tag as the separator.

mob
    verb
        Strip()
            var
                log = file2text('thelog.htm')

            src << browse(log, "window=log_raw")

            var
                list/L = tokenize(log, "<br>")

            for(var/line in L)
                src << html_encode(line)

                var
                    openMarker = findtext(line, "<b>")
                    bTags = 0

                while(openMarker != 0)
                    bTags++
                    openMarker = findtext(line, "<b>", openMarker + 1)

                src << "OPEN B TAGS: [bTags]"

                var
                    closeMarker = findtext(line, "</b>")
                    alsobTags = 0

                while(closeMarker != 0)
                    alsobTags++
                    closeMarker = findtext(line, "</b>", closeMarker + 1)

                src << "CLOSE B TAGS: [alsobTags]"


This is what I have and it does report the tags per line. It will indeed count the tags but I don't know where to go from there, and even then I think I'm messing up big time with what I have here (especially considering it only checks one tag in the first place, and I'd hate to do it manually for every darn tag.)
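
One way to avoid repeating the counting loop for every tag would be to move it into a helper and feed it a list of tag names. A rough, untested sketch: countText and reportTagCounts are names made up for illustration, and the "<[tagName]>" pattern only matches tags written without attributes.

```dm
// Hypothetical helper: counts non-overlapping occurrences of needle in haystack.
// findtext() is case-insensitive, so "<B>" and "<b>" are counted together.
proc/countText(haystack, needle)
    var/count = 0
    var/needleLength = length(needle)
    if(!needleLength)
        return 0
    var/pos = findtext(haystack, needle)
    while(pos)
        count++
        pos = findtext(haystack, needle, pos + needleLength)
    return count

// Hypothetical helper: logs every tag whose open and close counts differ on a line.
proc/reportTagCounts(line, list/tagNames)
    for(var/tagName in tagNames)
        var/opened = countText(line, "<[tagName]>")
        var/closed = countText(line, "</[tagName]>")
        if(opened != closed)
            world.log << "[tagName]: [opened] opened, [closed] closed"
```

Inside the for(var/line in L) loop, a call like reportTagCounts(line, list("b", "i", "u", "s", "tt")) would then cover all of the tags in the test log in one go. It still only counts; it doesn't fix anything.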

Also here's the HTML that I'm testing it on:

<b>hi there</b><br>
<i>sup bro<br>
<b><u>dude bro what is up<br>
<s>uh oh these tags are bleeding downwards<br>
<tt>oh lord <i>this isn't good</tt><br>


(tl;dr i don't know how to program at all and messing with text procs confuses me)
In response to LordAndrew
Well, I can try to put together something for you, but I don't have access to BYOND at the moment, so ymmv.

mob
    verb
        Strip()
            var/log = file2text('thelog.htm')

            src << browse(log, "window=log_raw")

            log = parseCleanHTML(log, "<br>", FALSE, TRUE)

            src << browse(log, "window=log_clean")

// List of tags that can close themselves, ie <br />
var/list/selfClosingHtmlTags = list("area", "base", "basefont", "br", "hr", "input", "img", "link", "meta")

proc
    // Return a copy of string that has had all open tags closed
    // Operates per line, as delimited by passed delimiter
    parseCleanHTML(string, delimiter="<br>", encodeTags=FALSE, stripAttributes=FALSE)

        var/result = "" // We'll build our return string as we parse the input
        var/list/stack = list() // Stack of currently open tags

        var
            // Text tracking variables
            stringLength = length(string)
            delimiterLength = length(delimiter)
            processingIndex = 1
            lineEndIndex
            openingBracketIndex
            closingBracketIndex
            tagName
            otherTag
            addText
            addTag

        do
            lineEndIndex = findtext(string, delimiter, processingIndex)

            // Process this line
            while(processingIndex <= stringLength)
                // Search for the opening bracket
                openingBracketIndex = findtext(string, "<", processingIndex, lineEndIndex)
                if(!openingBracketIndex) break

                // Search for the closing bracket
                closingBracketIndex = findtext(string, ">", openingBracketIndex, lineEndIndex)
                if(!closingBracketIndex) break

                tagName = parseGetFirstWord(string, openingBracketIndex+1, closingBracketIndex)
                if(!tagName || tagName=="/")
                    // Mangled tag, just treat it as normal text and add the current chunk in
                    addText = html_encode(copytext(string, processingIndex, closingBracketIndex+1))
                    addTag = ""
                else if(text2ascii(tagName,1) == 47) // 47 is the ASCII value for a forward slash (/)
                    // Closing tag, close any tags opened before it
                    tagName = ckey(tagName) // Remove the slash

                    if(tagName in stack)
                        addText = html_encode(copytext(string, processingIndex, openingBracketIndex))

                        // Loop backwards through stack until we hit our tag, and close them
                        addTag = ""
                        for(var/i = stack.len, i>=1, i++)
                            otherTag = stack[i]
                            stack.len--
                            if(otherTag in selfClosingHtmlTags)
                                continue // We'll treat certain tags as optional to close

                            addTag += "</[otherTag]>" // Add closing tag

                            if(otherTag == tagName)
                                break // Found our tag, done!

                    else
                        // No opening tag found, just treat as plain text
                        addText = html_encode(copytext(string, processingIndex, closingBracketIndex+1))
                        addTag = ""
                else
                    // Opening tag, add to list of open tags
                    tagName = ckey(tagName) // Remove any trailing slashes (self-closing)
                    stack += tagName

                    // Add in chunk of encoded normal text
                    addText = html_encode(copytext(string, processingIndex, openingBracketIndex))

                    // Add in the tag itself
                    if(encodeTags)
                        addTag = html_encode(copytext(string, openingBracketIndex, closingBracketIndex+1))
                    else if(stripAttributes)
                        addTag = "<[tagName]>"
                    else
                        addTag = copytext(string, openingBracketIndex, closingBracketIndex+1)

                // Add in parsed text
                result += addText + addTag

                // Move past it
                processingIndex = closingBracketIndex+1

            // Copy over remaining text
            result += html_encode(copytext(string, processingIndex, lineEndIndex))

            // Add any necessary closing tags (skips any self-closing tags)
            for(var/i = stack.len, i>=1, i++)
                otherTag = stack[i]
                stack.len--
                if(otherTag in selfClosingHtmlTags)
                    continue // We'll treat certain tags as optional to close
                addTag += "</[otherTag]>" // Add closing tag

            // Add end-of-line
            result += delimiter

            // Move to the next line (if exists)
            if(lineEndIndex)
                processingIndex = lineEndIndex + delimiterLength

        while(processingIndex <= stringLength && lineEndIndex>0)

        return result

    // Get first word in string, delimited by white-space: space, tab, or newline
    // End is the index AFTER the last character to be considered
    parseGetFirstWord(string, start=1, end=0)
        // Find the closest whitespace character to the start
        var/whitespace = findtext(string, " ", start, end) // Space
        var/otherspace = findtext(string, "\t", start, (whitespace > 0 ? whitespace : end)) // Tab
        if(otherspace > 0) whitespace = otherspace
        otherspace = findtext(string, "\n", start, (whitespace > 0 ? whitespace : end)) // Newline
        if(otherspace > 0) whitespace = otherspace

        // The first word is the text from start to the first whitespace character
        // (or to end if no white space)
        return copytext(string, start, (whitespace > 0 ? whitespace : end))


I think I covered most of the edge cases reasonably well. I'll take another look at it tonight when I have access to BYOND, but let me know if it works or not ;)
In response to DarkProtest
Thanks!

Plugged it in and got a few errors that were easily fixed. Tried it out though and got hit with:
runtime error: list index out of bounds
proc name: parseCleanHTML (/proc/parseCleanHTML)
  source file: DarkProtest.dm,105
  usr: Guest-3550777048 (/mob)
  src: null
  call stack:
parseCleanHTML("hi there
\nsup br...", "
", 0, 1)
Guest-3550777048 (/mob): Strip()


// Add any necessary closing tags (skips any self-closing tags)
for(var/i = stack.len, i>=1, i++)
    otherTag = stack[i] //line 105
    stack.len--
    if(otherTag in selfClosingHtmlTags)
        continue // We'll treat certain tags as optional to close
    addTag += "</[otherTag]>" // Add closing tag
In response to LordAndrew
Whoops, change that i++ in the for loop iterator to i--
In response to DarkProtest
Changed it, got the same error on line 64:

for(var/i = stack.len, i>=1, i++)
    otherTag = stack[i] //line 64
    stack.len--
    if(otherTag in selfClosingHtmlTags)
        continue // We'll treat certain tags as optional to close


Changed that one to i-- and compiled it. Got no errors but the text it outputted:

BEFORE:

hi there
sup bro
dude bro what is up
<s>uh oh these tags are bleeding downwards
oh lord this isn't good
AFTER: hi there
sup bro
dude bro what is up
<s>uh oh these tags are bleeding downwards
oh lord this isn't good
In response to LordAndrew
This is why you don't write code in notepad...

Before the second for() loop, add:
addTag = ""


Then change "result += delimiter" to "result += addTag + delimiter"
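
For reference, with the i-- change and both of these fixes applied, the end-of-line step could also be pulled out into its own small proc. This is just a sketch I haven't compiled: closeRemainingTags and its optionalTags argument are made-up names, not something the proc above already has.

```dm
// Hypothetical helper: returns closing tags for everything still on the
// stack, innermost tag first, and empties the stack as it goes.
proc/closeRemainingTags(list/stack, list/optionalTags)
    var/closing = ""
    for(var/i = stack.len, i >= 1, i--)
        var/otherTag = stack[i]
        stack.len-- // pop the entry we just read
        if(optionalTags && (otherTag in optionalTags))
            continue // tags such as br or img need no closing tag
        closing += "</[otherTag]>"
    return closing
```

With a stack of list("b", "u") this should return "</u></b>" and leave the list empty, so the end-of-line code would shrink to result += closeRemainingTags(stack, selfClosingHtmlTags) + delimiter.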
In response to DarkProtest
Added the fixes and tested it. It does in fact balance tags! It does seem to convert apostrophes into

&#39;

Which... isn't too much of a deal but I dunno why it'd do that.




<s>After testing it on an actual log that required cleaning, the stripper removed all the contents from <font> tags, as well as <img> tags. Hrm.</s> I am an idiot and didn't spot that one of the proc's parameters allowed me to toggle this on and off. WHOOPS.

Everything seems to be in working order with no problems I can spot aside from apostrophes being made into entities. Thank you very much DarkProtest! :D