Hey there. I posted a few months ago about this but figured it'd be best to make a new thread.
I'm trying to write a script that'll look through an HTML log, find tags that aren't closed, and then close them. I've never really used BYOND's text manipulation procs so I honestly don't even know what I'm doing.
I'm not concerned with actually putting the tags in order, I just want to make sure that for every b tag on a line, there's a closing b tag as well, etc.
ID:155009
Oct 11 2011, 9:14 am
Ah, hmm.
The (highly unsuccessful) method I was using before getting stuck was using this proc written by Garthor to break up the lines:
proc
using the <br> tag as the separator.
mob
This is what I have, and it does report the tags per line. It will indeed count the tags, but I don't know where to go from there, and even then I think I'm messing up big time with what I have here (especially considering it only checks one tag in the first place, and I'd hate to do it manually for every darn tag).
Also, here's the HTML that I'm testing it on:
<b>hi there</b><br>
(tl;dr I don't know how to program at all and messing with text procs confuses me)
Well, I can try to put together something for you, but I don't have access to BYOND at the moment, so ymmv.
mob
I think I covered most of the edge cases reasonably well. I'll take another look at it tonight when I have access to BYOND, but let me know if it works or not ;)
Thanks!
Plugged it in and got a few errors that were easily fixed. Tried it out though and got hit with:
runtime error: list index out of bounds
proc name: parseCleanHTML (/proc/parseCleanHTML)
source file: DarkProtest.dm,105
usr: Guest-3550777048 (/mob)
src: null
call stack:
parseCleanHTML("hi there \nsup br...", " ", 0, 1)
Guest-3550777048 (/mob): Strip()
// Add any necessary closing tags (skips any self-closing tags)
Changed it, got the same error on line 64:
for(var/i = stack.len, i>=1, i++)
Changed that one to i-- and compiled it. Got no errors, but the text it outputted:
BEFORE: hi there
This is why you don't write code in notepad...
Before the second for() loop, add:
addTag = ""
Then change "result += delimiter" to "result += addTag + delimiter"
Added the fixes and tested it. It does in fact balance tags! It does seem to convert apostrophes into HTML entities, which... isn't too much of a deal, but I dunno why it'd do that.
<s>After testing it on an actual log that required cleaning, the stripper removed all the contents from <font> tags, as well as <img> tags. Hrm.</s> I am an idiot and didn't spot that one of the proc's parameters allowed me to toggle this on and off. WHOOPS. Everything seems to be in working order with no problems I can spot aside from apostrophes being made into entities. Thank you very much, DarkProtest! :D
-Create a list to use as a "stack", which allows us to track which tags are active, and in which order they were opened.
-Create several variables to track positions and boundaries within the line you're working on.
-Get the text and grab your first line. If you're storing all of the lines as one string, you'll have to do findtext("\n") to identify the boundaries, and store the end of the current line.
-Use findtext() to find the first "<" within the line, and then check if the next character is a "/" or not (I like to use text2ascii() for this kind of thing, to avoid a string copy). If there is no slash, then it's an opening tag. Otherwise it's a closing tag. Store which it is.
-Either way, you need to get the tag name. Write a process (something like parseGetTag(string,start,end)) that returns the tag name. Basically, it should search from start (which should be the opening bracket "<") to end (end of line limit) to find the next ">". Search between the found "<" and ">" for any white-space characters (space, tab). The tag name ("b", "font", "strong") will run from the character after the opening bracket ("<") to the first white-space, or to the closing bracket (">") if no white-space was found.
-For an opening tag, add it to the end of the list we're using as a "stack"
-For a closing tag, check if the tag name is in the list. If it is, loop from the END of the list towards the front until you find it. As you do so, add closing tags for all of those tags that you hit BEFORE it, and insert them before the existing closing tag you found (removing the tags from the list as you add them to the line). If the existing tag isn't in the list, you can ignore it, encode it, or cut it out.
-When you reach the end of the line, add closing tags for any tags still in the list (from the END to the front, removing them as you go). Then move on to the next line.
That's the gist of it. Once you get the algorithm down it's not too bad, but the real trick is how you handle poor formatting and special characters. I'm not sure if you can count on non-HTML text to be encoded (such as the emoticon >_>).
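For anyone who wants something to start from, here's a rough, untested sketch of the steps above for a single line. It is not DarkProtest's parseCleanHTML, just one way the stack idea could look. parseGetTag follows the signature suggested above; balanceLine and void_tags are names made up for this example, and the short list of tags that never get a closing tag is an assumption. It drops unmatched closing tags and makes no attempt to handle comments, unencoded text like >_>, or other poor formatting.

```dm
// Returns the lowercased tag name for the tag whose "<" is at lt and ">" is at gt.
proc/parseGetTag(line, lt, gt)
	var/pos = lt + 1
	// Skip the "/" of a closing tag (ASCII 47) without copying any text.
	if(text2ascii(line, pos) == 47)
		pos++
	if(pos >= gt)
		return ""
	var/stop = gt
	// The name ends at the first space or tab, if one appears before ">".
	var/space = findtext(line, " ", pos, stop)
	if(space)
		stop = space
	var/tab = findtext(line, ascii2text(9), pos, stop)
	if(tab)
		stop = tab
	// Treat "<br/>" like "<br>".
	if(text2ascii(line, stop - 1) == 47)
		stop--
	if(stop <= pos)
		return ""
	return lowertext(copytext(line, pos, stop))

// Returns the line with missing closing tags added.
proc/balanceLine(line)
	var/list/stack = list()                        // open tags, oldest first
	var/list/void_tags = list("br", "hr", "img")   // assumption: never closed
	var/result = ""
	var/pos = 1
	while(pos <= length(line))
		var/lt = findtext(line, "<", pos)
		if(!lt)
			break
		var/gt = findtext(line, ">", lt)
		if(!gt)
			break                                  // stray "<": copy the rest as-is below
		if(lt > pos)
			result += copytext(line, pos, lt)      // plain text before the tag
		var/tagName = parseGetTag(line, lt, gt)
		var/tagText = copytext(line, lt, gt + 1)
		if(text2ascii(line, lt + 1) == 47)
			// Closing tag: look for its opener from the END of the stack.
			var/idx = 0
			for(var/i = stack.len, i >= 1, i--)
				if(stack[i] == tagName)
					idx = i
					break
			if(idx)
				// Close everything opened after it, newest first.
				for(var/j = stack.len, j > idx, j--)
					var/inner = stack[j]
					result += "</[inner]>"
				stack.Cut(idx)                     // remove the match and everything after it
				result += tagText
			// An unmatched closing tag is simply dropped here.
		else
			result += tagText
			var/selfClosed = (text2ascii(line, gt - 1) == 47)
			if(tagName && !(tagName in void_tags) && !selfClosed)
				stack += tagName
		pos = gt + 1
	if(pos <= length(line))
		result += copytext(line, pos)
	// End of line: close whatever is still open, newest first.
	for(var/k = stack.len, k >= 1, k--)
		var/leftover = stack[k]
		result += "</[leftover]>"
	return result
```

You'd call balanceLine() once per line after splitting the log on its separator and join the results back together. Note that both loops over the stack count down with i--, j-- and k--, which is the same direction the out-of-bounds fix earlier in the thread needed.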