Choosing a Publishing Model

From MPublishing

Revision as of 14:33, 30 August 2009 by Kshawkin

The Scholarly Publishing Office (SPO) recognizes that a publishing partner may have material in many different formats. Whether the material is to be republished by SPO or published for the first time, there may be paper copies, electronic files in various formats, or both.

We can discuss conversion of paper copies with you; the best practices for this fall outside the scope of this document.

This document explains how to choose an SPO publishing model based on the types of electronic source documents available. Which source format we work from depends on how easily it converts to SPO's preservation-quality digital formats and on how complete it is; that choice in turn determines the publishing model we use for your publication.

Page images scanned from paper

SPO uses the page image model for publications with a large print backfile, the content of which is not available in electronic form. Users can "turn" pages in a sequence or jump directly to a given page number.

We process these images with OCR software to allow users to search the full text of a document. This software can operate on most languages written in the Latin script; however, it works best on texts in a single language. The word recognition rate is much lower for texts in multiple languages, though words in the predominant language are recognized more accurately than words in the other languages.

Model A: Journal issue as unit

ex: http://quod.lib.umich.edu/b/basp/

When scanning a journal from paper, it's most efficient to scan an entire issue without attempting to determine article boundaries at the time of scanning. We can still provide links to individual articles, but the user will be able to turn pages from one article to another. It's best to use this model when there is a large print backfile and when pagination of articles is not consistent. Because this model prevents SPO from implementing article-level access restrictions, we generally choose Model B instead when scanning from paper.

Model B: Journal article as unit

ex: http://quod.lib.umich.edu/m/mjcsl/ (before volume 8)

It can be more worthwhile to scan articles separately so that each becomes a separate unit in the delivery system. This is best for a publication where new documents sent to SPO will also be split at the article level or where article-level access restrictions in the delivery system are critical.

Model C: True electronic text

ex: http://quod.lib.umich.edu/w/wsfh/

SPO generally prefers to publish true electronic text since it allows for hyperlinks, multimedia, and accurate searching of the full text based on the structure of the text. In addition, true electronic text allows the documents to be disseminated in various ways not tied to the print page. If the publishing partner provides PDF files, SPO can put these online as an alternative format for readers.

For this model, SPO generally needs electronic source documents. For small volumes of text, or when supplementary funding is available, we can create electronic text from print sources by running OCR software, reviewing the words flagged by a spellchecker, and correcting them as needed.
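The review step described above can be pictured with a minimal sketch. It assumes that a plain word list stands in for the spellchecker; the function name `flag_suspect_words` and the sample word list are hypothetical and are not SPO's actual tools. The sketch only flags words for a person to check and does not correct anything itself.

```python
import re

def flag_suspect_words(ocr_text, known_words):
    """Return the words in ocr_text that are absent from known_words.

    known_words is a stand-in for a real spellchecker dictionary
    (an assumption made for this illustration).
    """
    words = re.findall(r"[A-Za-z]+", ocr_text)
    return [w for w in words if w.lower() not in known_words]

# Hypothetical word list and OCR output with two typical misreadings.
known = {"the", "modern", "journal", "of", "history"}
suspects = flag_suspect_words("The rnodern journal of histoiy", known)
print(suspects)  # ['rnodern', 'histoiy']
```

A person would then compare each flagged word against the page image and correct it by hand.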

Model D: Page images from PDF files

ex: http://quod.lib.umich.edu/m/mjcsl/ (volume 8 to present)

For some publications, we display page images but also have electronic text underneath that allows for more accurate searching. We do this when:

  • There are many diagrams and figures that would be difficult to render in electronic text.
  • You value precise page layout that can't be consistently replicated online.

For this model, we need PDF files in which the text can be selected (highlighted) when you open the file. Text set in more than one column can cause problems during extraction, so have us test a sample file if this applies to you. Note that our current software can extract only text written in the Latin script, so non-Latin text will not be searchable by users.

Unfortunately, extracting text from PDF files leads to a number of problems that decrease the accuracy of searching:

  • Words hyphenated across line breaks can't be automatically reconstructed into whole words.
  • Other words at line breaks are often not followed by a space character, so after extraction they run together with the first word on the next line.
  • Users can't search for phrases that span pages, columns, or sometimes even lines.
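The first two problems in the list above can be reproduced with a small sketch. It assumes that extraction yields one string per printed line with no trailing space; the function names and sample lines are hypothetical and do not describe SPO's extraction software. The second function shows why hyphens can't simply be dropped: a blind rule repairs "search-ing" but also destroys the genuine hyphen in "well-known".

```python
def join_lines_naively(lines):
    """Concatenate extracted lines exactly as given, adding no separators."""
    return "".join(lines)

def dehyphenate_blindly(lines):
    """Join lines with spaces, dropping every line-final hyphen (a guess)."""
    out = ""
    for line in lines:
        if line.endswith("-"):
            out += line[:-1]   # assume the hyphen was only a line-break hyphen
        else:
            out += line + " "
    return out.strip()

extracted = ["accurate search-", "ing of the full", "text is hard"]
joined = join_lines_naively(extracted)

print(joined)                 # accurate search-ing of the fulltext is hard
print("searching" in joined)  # False: the hyphenated word is not found
print("full text" in joined)  # False: the two words ran together

# The blind rule fixes the sample above but breaks a real compound.
print(dehyphenate_blindly(["a well-", "known journal"]))  # a wellknown journal
```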

We only use this model for publications where the journal article is the unit.
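The four models described so far can be summarized as a rough decision sketch. This is one reading of the guidance in this document, not SPO's official procedure: the `Source` fields, the function name `suggest_model`, the order of the checks, and the "ask SPO" fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Source:
    # All field names are hypothetical, chosen for this illustration.
    has_electronic_text: bool = False            # source files usable as true electronic text
    has_text_pdf: bool = False                   # PDF whose text can be highlighted
    figures_or_layout_critical: bool = False     # many figures, or precise layout matters
    consistent_article_pagination: bool = True   # articles start and end cleanly on pages
    needs_article_access_control: bool = False   # article-level access restrictions required

def suggest_model(src: Source) -> str:
    """Suggest a publishing model letter from the available source material."""
    if src.has_text_pdf and src.figures_or_layout_critical:
        return "D"        # page images from PDF with searchable text underneath
    if src.has_electronic_text:
        return "C"        # true electronic text, generally preferred
    if src.has_text_pdf:
        return "ask SPO"  # this document does not settle this case
    # Only paper is available from here on.
    if src.needs_article_access_control:
        return "B"        # Model A cannot restrict access per article
    if not src.consistent_article_pagination:
        return "A"        # scan whole issues
    return "B"            # the usual choice when scanning from paper

print(suggest_model(Source(has_electronic_text=True)))            # C
print(suggest_model(Source(consistent_article_pagination=False))) # A
```

In practice the choice also depends on factors this sketch leaves out, such as the size of the print backfile and available funding.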

Combination of models

We often use one model for backfiles and another for new documents sent to SPO. Possible combinations are:

For more information on these models, see the Scholarly Publishing Office whitepaper "Choice of DocEncodingType and encoding level for SPO publications."
