United States Department of Energy Office of Scientific and Technical Information

STIP 2007 Systems Focus: Special Characters and More
Lance Vowell
United States Department of Energy Office of Scientific and Technical Information

Let’s review
Many different character encodings exist in the computer world. Each encoding “methodology” uses numbers to represent the letters, digits, and symbols (characters) that it can display. OSTI used the ASCII character set and ISO for many years. Newer software systems are moving toward a character encoding called Unicode.

Just a little more review to go
Late in 2005 OSTI began adopting UTF-8, a specific implementation of Unicode:
• as a necessary part of upgrading hardware/software
• as a step toward realizing the goal of incorporating “true” special characters into DOE’s scientific and technical metadata records

So, what was the problem, exactly?
Upgrading is a continual, staggered process. Different parts of the workflow processing system were at different stages of upgrade…with different character encodings! Two encoding methodologies can use the same number to represent two entirely different characters, or use different numbers for the same character. So, some characters that were handled correctly under one encoding might display as garbage characters when a different encoding handles them at a different point in the workflow.
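The mismatch described on this slide can be reproduced in a few lines. This is a sketch for illustration only, not a picture of OSTI’s workflow: ISO-8859-1 and ISO-8859-15 are assumed here as stand-ins for the “ISO” encodings, and the sample name is made up.

```python
# Illustration only: the encodings and the sample text are assumptions,
# not taken from OSTI's systems.

# Same number, different characters: byte 0xA4 under two ISO encodings.
same_byte = bytes([0xA4])
as_latin1 = same_byte.decode("iso8859-1")     # currency sign U+00A4
as_latin9 = same_byte.decode("iso8859-15")    # euro sign U+20AC

# Same character, different numbers: e-acute (U+00E9) in two encodings.
e_acute_latin1 = "\u00e9".encode("iso8859-1")  # one byte: E9
e_acute_utf8 = "\u00e9".encode("utf-8")        # two bytes: C3 A9

# Garbage characters: text written as UTF-8 by one stage of a workflow
# and read back as ISO-8859-1 by the next stage.
original = "Schr\u00f6dinger"                  # hypothetical author name
garbled = original.encode("utf-8").decode("iso8859-1")

print(repr(as_latin1), repr(as_latin9))
print(e_acute_latin1.hex(), e_acute_utf8.hex())
print(garbled)
```

The last line prints the name with two junk characters in place of the single “ö”, which is the kind of garbage a record can pick up when two stages disagree about the encoding.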

So, what was the problem, exactly?
Displaying characters correctly is only one third of the overall issue. Another “third” is figuring out how to get them in the door correctly…and the most critical “third” of all is figuring out how to ensure records are still retrievable when they do include special characters. And that’s just at OSTI. Sites will have their own issues!

This calls for a phased approach
Phase 1: Deal with corruption of characters caused by the disconnect between “old” and “new” character encodings.
• Catch/stop new problems from occurring.
• Identify the existing corruption and fix it.
• Upgrade the remainder of OSTI’s systems to ensure consistency.

Phase 1 Actions – The “Eyeball” Review
The Review Process:
• Call up each record in E-Link.
• Look at all the fields.
• If anything strange is found, open the full text to compare.
• If there is no full text, search for the item elsewhere on the Internet (journals, etc.) and compare.
• Make copies of the records with errors and of the full text pages proving what the correct text should be.

Phase 1 Actions – The “Eyeball” Review
Things we discovered:
• Only 19 records had garbage in two or more fields. This was odd, because often the same word was in the record several times…but only messed up in one place.
• Title and author fields were most commonly corrupted, with keywords next, and abstracts a distant fourth.
• Of the 140 “actually corrupted” records, 83 were first submitted prior to 2005; the remaining 57 were newer.
• Full text links in the record existed for 97 of the 140. However, the full text was almost always found somewhere else online.

Tedious, but rewarding. The numbers kept getting better and better.

Phase 1 Actions – Correction
July 2006 – January 2007
The Correction Process for DOE’s “Actually Corrupted” Records
• The record is called up and edited in E-Link.
• The record is “saved” in E-Link after correction.
• A different person later pulls up the record and reviews the correction for accuracy.
• That person determines whether the record can be re-released and, if so, performs that operation. If not, the reviewer re-saves the record for referral to the appropriate site.

Phase 1 Actions (still to come)
May 2007 – June 2007: Sites will receive individual emails about the few records that need further action. Emails will detail what to do for each record.
June 2007: Run a new search that will look for corruption only in records received since June 2006. This will check that the fixes made last year by the Technical Team are doing their job.
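The slides do not say how the corruption search works. As one possible approach, the sketch below flags text that looks like UTF-8 bytes misread as ISO-8859-1. The function names, field names, and sample records are all hypothetical, and the heuristic catches only this one corruption pattern.

```python
def looks_like_mojibake(text: str) -> bool:
    """Return True if text looks like UTF-8 bytes misread as ISO-8859-1.

    Heuristic (an assumption, not OSTI's method): undo the suspected
    misreading; if that succeeds and changes the text, flag it.
    """
    try:
        repaired = text.encode("iso8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text cannot be the result of this particular misreading.
        return False
    return repaired != text


def suspect_fields(record: dict) -> list:
    """List the names of the fields in a record that look corrupted."""
    return [name for name, value in record.items() if looks_like_mojibake(value)]


# Hypothetical records; real E-Link records have their own field names.
records = [
    {"id": "A", "title": "Schr\u00c3\u00b6dinger equation solvers", "author": "Smith, J."},
    {"id": "B", "title": "Schr\u00f6dinger equation solvers", "author": "Smith, J."},
]

# Keep only the records with at least one suspect field.
flagged = {}
for record in records:
    fields = suspect_fields(record)
    if fields:
        flagged[record["id"]] = fields

print(flagged)
```

Record A is flagged on its title; record B, which holds the correct “ö”, is not. A search like this only narrows the list, so flagged records would still need a review by eye.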

Phase 2: The Goals
This year, as we actually move into Phase 2, we can see ahead more clearly. Our goals are:
• Work with submitting sites to ensure their XML harvest or XML batch input files are fully UTF-8 compliant. Some sites will need to modify the program that outputs their XML. This will be a one-on-one effort between OSTI and each individual site. More information will be sent out in the July timeframe.
• Conduct a pilot project with a UTF-8 compliant site wanting to submit records containing special characters that fall outside of the ASCII or ISO set (i.e., “real” special characters).
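A site could run a quick check on an XML file before submitting it. The sketch below is one way to do that with the Python standard library; the function name and the record/title element names are made up for illustration and do not reflect OSTI’s actual XML schema.

```python
import xml.etree.ElementTree as ET


def check_utf8_xml(data: bytes) -> list:
    """Return a list of problems found; an empty list means none detected."""
    problems = []
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 at byte offset {err.start}")
        return problems
    try:
        # A second, independent check: the file must also be well-formed XML.
        ET.fromstring(data)
    except ET.ParseError as err:
        problems.append(f"not well-formed XML: {err}")
    return problems


# Hypothetical sample record, written out in two different encodings.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<record><title>Schr\u00f6dinger equation solvers</title></record>"
)
good_bytes = sample.encode("utf-8")
bad_bytes = sample.encode("iso8859-1")   # declares UTF-8 but is not

print(check_utf8_xml(good_bytes))
print(check_utf8_xml(bad_bytes))
```

The second file is the typical failure: the XML declaration says UTF-8, but the program that wrote the file used a different encoding.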

Phase 2: The Real Thing
• Continue upgrading OSTI’s system, focusing this year on output products (Dublin Core databases, applications that “write” files out to other applications such as Science.gov, etc.).
• Continue to research the industry for the best ways to move more deeply into incorporating special characters into scientific databases.

Phase 3
• Still foggy, but the pilot project in Phase 2 will help us formulate Phase 3 goals when the time comes.

PDFs and Extractable Text

PDFs and Extractable FT
At OSTI we do accept PDF documents that do not have extractable full text. Usually 10 – 20 per month get stuck in processing. When we receive these types of PDFs, we attempt to OCR them using our Adobe Capture software. This is an automated process. Recently we have received some PDFs (approx. 75 last month) that we could not OCR automatically because of the image conversion software used to create them.

PDFs and Extractable FT cont.
Adobe Acrobat 7.0 Professional Edition
We can convert them manually, but this takes significant man-hours and a high degree of workstation resources. It took us over 17 hours with a dedicated machine to convert 34 of these.
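To put those figures in perspective, here is a rough estimate based only on the numbers quoted on these slides. Because the conversion took “over” 17 hours, both results are lower bounds, and the variable names are just for illustration.

```python
# Figures quoted on the slides; "over 17 hours" makes these lower bounds.
hours_spent = 17
pdfs_converted = 34
backlog = 75            # approximate count of problem PDFs last month

minutes_per_pdf = hours_spent * 60 / pdfs_converted
backlog_hours = backlog * minutes_per_pdf / 60

print(f"at least {minutes_per_pdf:.0f} minutes per PDF")
print(f"at least {backlog_hours:.1f} machine-hours for {backlog} PDFs")
```

At that rate, a month like the last one would tie up a dedicated machine for most of a working week.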

What we need from you…
Best case scenario: All PDFs submitted to OSTI have extractable full text.
If this is not possible: Make sure your image scanning software uses CCITT Fax Group 4 compression when converting TIFF images.

Submission Processes

Metadata Harvested vs. 241.1 and Batch, FY2002-FY2007
Harvesting Sites: ANL, BNL, LLNL, NREL, PNNL, FNAL, SLAC, SANDIA
FY2007 numbers are as of March 30, 2007

Harvested Metadata vs. 241.1 and Batch, FY2002-FY2007
FY2007 numbers are as of March 30, 2007

Questionnaire Time
