Loud Joe and I have been working on a large-scale data mining project. Right now, there are some issues/mystique/intellectual property that surround it, so I won’t get into details as to what we’re doing exactly. Suffice it to say that we’re pulling and pushing a large amount of content that’s representative of — and has been generated by — the general public.
And when you’re talking about the general public, you can (and should) never, ever trust the integrity of the data that you’ve been given. You’ve got to remind yourself that you’re dealing with data that was likely generated by the same segment of the population that “double-punched” ballots in Florida, or that thought “Cheese Sandwich” might make a fine President of the United States in 2004. You’re also dealing with people who generate this data by using applications that emphasize content formatting (italics, colored fonts, pictures, etc.) over data integrity, character sets, or, frankly, sanity.
While scrubbing this data today, I came across a series of characters with which I was unfamiliar. In my text editor, these characters showed up as “glowing”, which didn’t give me much of a clue as to their width or composition. I tried a couple of ways to match (and eliminate) these characters, but none of them was successful. So, I Googled “smart” characters.
Thankfully, this problem has already been solved, apparently by someone who was having problems harvesting stories.
And to think I’d expected to find a solution in someplace other than a gay stories forum! What was I thinking?!
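For anyone stuck on the same scrub, here is a rough Python sketch of the general idea. It is not the forum's solution, and it assumes the offending characters were the usual typographic "smart" punctuation (curly quotes, dashes, ellipsis, no-break space). The names `SMART_TO_ASCII`, `describe`, `find_non_ascii`, and `scrub_smart_chars` are made up for illustration, and the mapping is almost certainly incomplete for real-world data.

```python
import unicodedata

# Hypothetical mapping from common "smart" characters to plain ASCII.
# Extend it with whatever find_non_ascii() turns up in your own data.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # no-break space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(SMART_TO_ASCII)


def describe(ch):
    """Return (code point, Unicode name, UTF-8 byte width) for one character."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch, "UNKNOWN"),
        len(ch.encode("utf-8")),
    )


def find_non_ascii(text):
    """List the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


def scrub_smart_chars(text):
    """Replace the mapped smart characters with their ASCII stand-ins."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "\u201cHi\u201d \u2014 it\u2019s\u2026"
    for ch in find_non_ascii(sample):
        print(describe(ch))
    print(scrub_smart_chars(sample))
```

Running `find_non_ascii` first answers the "what are these and how wide are they" question that the editor's glowing glyphs didn't; `scrub_smart_chars` then swaps only the characters listed in the table and leaves everything else (including legitimate accented letters) alone.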