I've been in Charlotte, NC this week for my "day" job and have been working on a side project from the hotel that involves converting a large amount of data from a, let's just say, "non-standard" database program to MySQL. One of the tables stores information about 14,000+ legal documents that we have in PDF format on the filesystem.
As part of the conversion project, I thought I'd be really cute and import the OCR text layer of the PDFs into a field in the table and set up a Verity index that included that field. I figured, hey, this will give the client something he doesn't have today and make me look good in the process, right?
The problem is that the text for each document can run anywhere from one printed page to 40 printed pages (meaning up to several hundred KB of text apiece). Importing all this was fairly easy (if a little slow, just because of the amount of data going into the database), but for the life of me I cannot figure out how to get the query indexed by Verity.
I obviously can't just grab all the records in the table with one query (the result set takes too much memory that way). I've tried a few things, like grabbing only the primary keys of all the records, then looping over them in groups of 500 or so and using session variables to keep track of the position from one page request to the next, but even that isn't working.
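For what it's worth, here is a rough sketch of the batching idea. It is not ColdFusion or Verity code and it is not the solution to the problem above. It uses Python with an in-memory-friendly `sqlite3` connection as a stand-in for MySQL, and the table name `documents`, the columns `id` and `ocr_text`, and the `index_batch` callback are all hypothetical names made up for illustration. The one idea it shows is remembering the last primary key seen and asking for the next 500 rows after it, so only one batch of OCR text is ever held in memory.

```python
import sqlite3
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, str]  # (primary key, OCR text); hypothetical column layout


def iter_batches(conn: sqlite3.Connection, batch_size: int = 500) -> Iterator[List[Row]]:
    """Yield rows in primary-key order, one batch at a time.

    Assumes a hypothetical table `documents(id, ocr_text)` whose ids are
    positive integers.
    """
    last_id = 0  # every real id is assumed to be greater than this
    while True:
        rows = conn.execute(
            "SELECT id, ocr_text FROM documents "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return  # nothing left past last_id
        yield rows
        last_id = rows[-1][0]  # resume after the highest id in this batch


def index_all(
    conn: sqlite3.Connection,
    index_batch: Callable[[List[Row]], None],
    batch_size: int = 500,
) -> int:
    """Hand each batch to `index_batch` (a stand-in for whatever feeds the
    search index) and return the number of rows processed."""
    total = 0
    for batch in iter_batches(conn, batch_size):
        index_batch(batch)
        total += len(batch)
    return total
```

Because the only state carried between batches is `last_id`, that single number is all that would need to survive from one page request to the next if the work were split across requests.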
So, right now I'm a bit "perturbed," as my mother sometimes says. When I figure out a solution that works for this, I'll post a comment back.
At least it's Friday and I'm headed to the airport to go home in a couple of hours.