Buffistas Building a Better Board
Do you have problems, concerns or recommendations about the technical side of the Phoenix? Air them here. Compliments also welcome.
To-do list
Having thought about it for a while, it's only the tags that would create formatting problems further down the page that need to be handled. An unclosed LI tag won't cause the rest of the page to be indented, but an unclosed OL or UL will. Unclosed P tags won't cause any problems either.
I'm seeing the logic like this:
Go through the post, creating a list of all potential troublesome tags, i.e. when we encounter the first A tag, put it on a list.
When we encounter the next </A>, take that A tag off the list again.
If there's anything on the list at the end, close it.
It needs some more refinement, but that's essentially it, right?
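For what it's worth, the list-based logic above could be sketched roughly like this. It is written in Python purely for illustration. The function name, the regex, and the set of "troublesome" tags are all assumptions, not anything from the board's actual code.

```python
import re

# Tags assumed to cause trouble further down the page when left open.
# This set is an illustrative guess, not the board's real list.
TROUBLESOME = {"a", "b", "i", "font", "ol", "ul"}

# Matches an opening or closing tag and captures the slash and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def close_open_tags(post):
    """Append closing tags for any troublesome tag still open at the end."""
    open_tags = []
    for match in TAG_RE.finditer(post):
        is_close, name = match.group(1), match.group(2).lower()
        if name not in TROUBLESOME:
            continue  # LI, P and friends are left alone
        if not is_close:
            open_tags.append(name)
        elif name in open_tags:
            # Take the most recent matching opener off the list.
            idx = len(open_tags) - 1 - open_tags[::-1].index(name)
            del open_tags[idx]
    # Close whatever is left, innermost first.
    return post + "".join("</%s>" % name for name in reversed(open_tags))
```

A post like `<b>bold <i>both` would come back with `</i></b>` appended, while an unclosed P or LI is ignored.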
I've just remembered that the last big HTML problems were caused by unclosed attributes, not unclosed tags. I don't have any idea how to sort that out...
Could you keep track of all open attributes and close them at the close of a tag?
It's possible but scary, to my brain.
I've already roughed out a version in Perl which just counts opening/closing tags. There are either more openings than closings, or there aren't.
It tells me, for this page, that the A tags are right on target, there are exactly the same number opening as closing, but the P tags aren't. Which we'd expect.
If it were a troublesome tag, like B, then it would just insert the closing B tag x times, where x is the difference between openers and closers.
For tags like A, FONT, B, I and so on, it won't necessarily perfect that post as the writer intended, but it should stop the error cascading on down the page at least.
For other tags, for instance table tags, it's a bit more scary.
[EDIT: just realised this script forces me to write "for(@openers)". Ha ha geek ha.]
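A minimal sketch of the counting approach described in this post, in Python rather than the Perl original (which isn't shown here). The function name and the default tag list are made up for illustration.

```python
import re

def balance_by_count(post, tags=("a", "b", "i", "font")):
    """Append one closing tag for every opener that has no closer.

    This only compares counts, so it cannot restore the nesting the
    writer intended; it just stops the error running down the page.
    """
    extra = []
    for tag in tags:
        # \b keeps <b from matching <br>, <i from matching <img>, etc.
        openers = len(re.findall(r"<%s\b[^>]*>" % tag, post, re.IGNORECASE))
        closers = len(re.findall(r"</%s\s*>" % tag, post, re.IGNORECASE))
        extra.append(("</%s>" % tag) * max(0, openers - closers))
    return post + "".join(extra)
```

Because the closers are simply tacked on at the end in the order the tags are listed, the result may be mis-nested, which is the "won't necessarily perfect that post" caveat above.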
Well, you know tags can't nest, so you only have to deal with one at a time. It's just like the logic you laid out for tags, only for attributes.
Actually, it's easier than that, isn't it? It's just counting double-quotes, isn't it?
Start counting double quotes when you see a <, and if you have an odd number of them when you see > throw in an extra. Plus a pinch of salt.
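Something like the following is what that quote-counting rule might look like, sketched in Python. The function name is hypothetical, and it assumes a `>` never appears inside a quoted attribute value, which real posts could violate.

```python
def fix_odd_quotes(post):
    """Insert a closing double quote before '>' when a tag has an odd count."""
    out = []
    in_tag = False
    quotes = 0
    for ch in post:
        if ch == "<" and not in_tag:
            in_tag = True
            quotes = 0  # start counting afresh for each tag
        elif ch == '"' and in_tag:
            quotes += 1
        elif ch == ">" and in_tag:
            if quotes % 2 == 1:
                out.append('"')  # throw in the extra quote
            in_tag = False
        out.append(ch)
    return "".join(out)
```

Quotes in ordinary text outside of tags are never counted, so prose with a stray double quote is left untouched.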
you know tags can't nest
Er, do I? You mean you can't have:
<tag blah blah blah <nothertag> blah blah>
right?
The idea of checking the syntax of every attribute of every tag seems rather processor-intensive to me. No reason it can't be done.
But, but but, coming home from the bottle shop (liquor store) it occurred to me that what was really scary the last time we had a major HTML snag was that it stopped the post from even being editable.
So I think what we need is not only an edit but a "safe edit" for admins, which would somehow rob such malformed posts of their power to break the browser/form/interface. Because ita had to go into the SQL DB and edit by hand, command-line style. Or do I mean "commando-style"? Anyway.
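One guess at what a "safe edit" could involve: escape the raw post before dropping it into the edit form, so that nothing in it is parsed as markup. This is only a sketch in Python. The function name and the `body` field name are invented, and it assumes the breakage came from the post's HTML being interpreted inside the form, which the thread doesn't confirm.

```python
import html

def safe_edit_field(raw_post):
    """Return a textarea whose contents are inert, escaped text."""
    # html.escape turns &, <, > and both quote characters into entities,
    # so even a stray </textarea> in the post cannot end the field early.
    return '<textarea name="body">%s</textarea>' % html.escape(raw_post, quote=True)
```

The browser shows the admin the original characters, and the form submits them back unescaped, so the malformed post can be repaired without touching the database by hand.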
I'm 99% certain you can't nest tags like that. But it occurs to me that even if you can, you would just have to recursively parse.
In order to do the tag closing you'll have to parse the tags. Counting double quotes to make sure each tag has an even number of them shouldn't make it more processor intensive. And it should only happen on post, so it's not like it's going to make page serving more expensive.
One thing I just thought of, as I was reading the HTML spec: you'll have to count single quotes too. In fact, once you see one sort of quote, you'll need to ignore occurrences of the other until it closes. In other words, double quotes found between pairs of single quotes are normal, as are single quotes found between pairs of double quotes.
I could hack together some perl that does the checking, if that would help.
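Here is a rough Python sketch of that "ignore the other kind of quote" rule, applied to the text of a single tag. The function name is made up, and it assumes the caller has already pulled out the text between `<` and `>`.

```python
def unbalanced_quote(tag_body):
    """Return the quote character left open in a tag, or None if balanced."""
    current = None  # the kind of quote we are currently inside, if any
    for ch in tag_body:
        if current is None:
            if ch in "\"'":
                current = ch  # opening quote of either kind
        elif ch == current:
            current = None  # matching close; the other kind is ignored
    return current
```

So `a href="it's.html"` counts as balanced, while `a href="x.html` reports a dangling double quote that could then be closed.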
I'm 99% certain you can't nest tags like that.
That's OK because I'm 100% certain you can't. But there's nothing but nesting if you just mean:
<b> blah blah <i> blah blah </i></b>
once you see one sort of quote, you'll need to ignore occurrences of the other. In other words, double quotes found between pairs of single quotes are normal, as are single quotes found between pairs of double quotes.
Well there shouldn't be that kind of nesting in regular HTML, though you'll get it in JavaScript for sure.
I could hack together some perl that does the checking, if that would help.
When I said I was working in Perl, I should have said "but of course this board uses PHP" -- have you ever done PHP? If you've worked in Perl it will be no big deal.
Betsy's post with the mismatched quotes? The one that broke the board and ita had to edit by hand? It was particularly devious. If you look at the resultant html in the page, it's not at all obvious how it ended up the way it did. ita reproduced the problem in our test environment. I took the page and copied it into a "regular" html page. You can look at it here. It's post #4 that broke things. I'd analyze it some more, but it's amazing what a few shots of tequila can do to one's analytical abilities.
I looked at the broken post, and it's not a case of too few quotes. Instead, it's someone including a URL that uses double-quotes.
If this anchor tag was handwritten by the author of the post, I don't think we can do anything about it. If instead it was created by the code that automagically wraps anchor tags around URLs, then the code needs to detect single or double quotes in the URL and use the other kind of quote.
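If it does turn out to be the auto-linking code, the "use the other kind of quote" idea might look something like this. It's a Python sketch with a hypothetical function name; the board's real linking code isn't shown in this thread, and the fallback for URLs containing both kinds of quote is my own assumption.

```python
import html

def wrap_url(url):
    """Wrap a URL in an anchor tag, choosing a quote the URL doesn't contain."""
    # Link text: escape &, < and > but leave quotes alone (harmless in text).
    text = html.escape(url, quote=False)
    if '"' not in url:
        quote, href = '"', url
    elif "'" not in url:
        quote, href = "'", url
    else:
        # Both kinds present (assumed fallback): entity-encode the doubles.
        quote, href = '"', url.replace('"', "&quot;")
    return "<a href=%s%s%s>%s</a>" % (quote, href, quote, text)
```

A URL with double quotes in it would then get a single-quoted href, so the attribute closes where it should.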
The quote counting is still a good idea, but it won't fix this problem.