Why do I get different results when applying an extraction template to a PDF printed through UNIFY vs. the same PDF archived into NuGenesis SDMS? - WKB56619
ENVIRONMENT
- NuGenesis 8 SDMS
ANSWER
Extracting text from a printed PDF and from an archived copy of the same PDF may not produce the same results, because the two paths hand the extraction template different data. When a PDF file is archived, the original data format is retained in the database, and the extraction template builder uses a PDF reader library to parse text from the PDF. When you print the same file, it is first converted to an Enhanced Metafile (EMF). This initial conversion is done by the print spooler, because UNIFY printers are registered as EMF-type printers. To confirm this, open the Printer Properties for a UNIFY printer, select the Advanced tab, and then click Print processor. For UNIFY, the default print processor is NG80print and the default data type is "NT EMF 1.003". The spool file delivered to UNIFY therefore contains EMF data, and UNIFY can pull text from the EMF only if the EMF actually contains text records. In many cases the EMF contains no text at all, only an image of the page, and UNIFY does not perform optical character recognition (OCR) on images.
If you need to run extraction templates on PDFs printed to UNIFY, the only way to influence the result is upstream of UNIFY: the spool file must be generated so that the EMF it contains has text records. Try printing the PDF to UNIFY from different PDF applications to see whether any of them produces an acceptable result.
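When comparing PDF applications, it can help to check whether a given EMF contains any text-drawing records at all. The following Python sketch is an illustration only, not a Waters, UNIFY, or NuGenesis tool. It assumes you have already obtained a standalone EMF file by some other means (extracting EMF pages from a spool file is not covered here), and the function names are hypothetical. It walks the EMF record stream and counts the standard text-output record types defined by the EMF format. Whether UNIFY reads exactly these record types is an assumption; the sketch also does not look inside EMF+ data embedded in comment records.

```python
import struct
from collections import Counter

# Text-drawing record types from the EMF format (type number -> name).
TEXT_RECORDS = {
    83: "EMR_EXTTEXTOUTA",
    84: "EMR_EXTTEXTOUTW",
    96: "EMR_POLYTEXTOUTA",
    97: "EMR_POLYTEXTOUTW",
    108: "EMR_SMALLTEXTOUT",
}
EMR_HEADER = 1   # must be the first record in an EMF stream
EMR_EOF = 14     # marks the end of the record stream


def count_text_records(data: bytes) -> Counter:
    """Walk the EMF records in `data` and count text-drawing records by name."""
    counts = Counter()
    offset = 0
    # Every EMF record starts with two little-endian uint32 values:
    # the record type and the total record size in bytes.
    while offset + 8 <= len(data):
        rec_type, rec_size = struct.unpack_from("<II", data, offset)
        if offset == 0 and rec_type != EMR_HEADER:
            raise ValueError("not an EMF stream: first record is not EMR_HEADER")
        if rec_size < 8:
            raise ValueError(f"corrupt record size {rec_size} at offset {offset}")
        if rec_type in TEXT_RECORDS:
            counts[TEXT_RECORDS[rec_type]] += 1
        if rec_type == EMR_EOF:
            break
        offset += rec_size
    return counts


def has_text(data: bytes) -> bool:
    """Return True if the EMF stream contains at least one text record."""
    return sum(count_text_records(data).values()) > 0


def emf_file_has_text(path: str) -> bool:
    """Convenience wrapper: read an EMF file from disk and test it for text."""
    with open(path, "rb") as f:
        return has_text(f.read())
```

If `emf_file_has_text` returns False for the EMF produced by one PDF application, that application is most likely sending the page as an image, and a different PDF application may be worth trying.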
ADDITIONAL INFORMATION
id56619, SDMS, SDMS8, SDMS8NU, SUPISDMS, SUPNG