Document Encoding Standards

sacred-texts.com collage of texts, (c) 1999, J.B. Hare

These coding standards apply to new files scanned at sacred-texts. A bibliography of all files scanned at this site is available. Because these conventions have evolved over time, not all texts have all of the markup specified here, particularly the page numbering and footnote conventions. Quite a few files at this site were scanned before these features were added, and many texts were developed elsewhere. These files are being upgraded incrementally to reflect the newer standards; this process is likely to take quite a while.


General Standards

The text is, as far as practically possible, transcribed letter for letter from the original book. The etext includes all of the original illustrations and graphics, where possible.

The etext includes a transcription of the title page, all prefatory material, a linked table of contents, and all footnotes and other apparatus, except (in most cases) for the index. We omit the index because it is sometimes impossible or difficult to convert to etext, and once the text is on the Internet, every word is indexed by multiple search engines. We reproduce italics and bold text using standard HTML markup.

Etexts are posted in HTML format (and sometimes plain ASCII text) because these are open formats. PDF and other ebook formats are vendor-specific, and it is often difficult to migrate text out of them, particularly with regard to formatting.

The etexts are uncensored and uncut. In no case is any text omitted from the transcription, even in case of egregious factual errors or language which might be considered offensive.

Non-English passages are reproduced verbatim, and no translation is supplied (unless the translation is part of the original text).

Each text is carefully checked against a specific printed copy of the book. We attempt to find an early edition of the book (if possible, the first edition). If that is not practical, a photographic reproduction of an early edition is used. In some cases a later printing may be used, if we have a high degree of confidence that it accurately reflects the complete, uncut original text, with no major editing. If a later edition is used, that will be noted. We do not normally use multiple editions of the book to prepare the text.

Page numbers are included in the transcription. We feel that it is important to indicate page numbers so the etexts can be cited in academic and other publications.

Use of Unicode

Characters not within the standard ISO-8859-1 HTML coding are represented using the closest or exact equivalent in the Unicode character set. In older files, these were transcribed systematically using close 8 bit equivalents. Such substitutions and omissions are noted on a case by case basis.

Unicode is inserted using extended character entities, rather than UTF-8, since this documents the character more clearly. In some cases UTF-8 may be employed for longer files for reasons of space.
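
As a rough illustration of the entity approach, the following Python sketch keeps ISO-8859-1 characters as they are and replaces everything else with a numeric character reference. It assumes that 'extended character entities' means decimal references of the form &#NNNN;, and the function name to_entities is made up for this example; it is not the tool used at sacred-texts.

```python
def to_entities(text: str) -> str:
    """Keep ISO-8859-1 characters; write all others as &#NNNN; references."""
    # 'xmlcharrefreplace' is a standard Python codec error handler that
    # substitutes a decimal character reference for each unencodable character.
    encoded = text.encode("latin-1", errors="xmlcharrefreplace")
    return encoded.decode("latin-1")


# U+016B (u with macron) is outside ISO-8859-1, so it becomes a reference.
print(to_entities("s\u016btra"))  # -> s&#363;tra
# U+00E9 (e with acute) is inside ISO-8859-1, so it is left alone.
print(to_entities("caf\u00e9"))
```

The resulting file can be served as ISO-8859-1 while still naming each Unicode character explicitly by its code point.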

In some cases vowel diacritics may be omitted from the transcription, particularly in the case of well-known words such as 'Sufi' or 'Nirvana'. This will be noted.

Macron Vowels

Where vowels with a 'macron' (a straight line above the vowel, usually indicating a 'long' vowel) in a text are used consistently, these may be transcribed using the 'circumflex' (the 'hat' diacritic, such as â, ê, etc.).
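
A minimal Python sketch of this substitution follows. It assumes the input uses precomposed macron vowels (not a base letter plus a combining macron), and the table and function names are hypothetical.

```python
# Map precomposed macron vowels to their circumflex counterparts.
MACRON_TO_CIRCUMFLEX = str.maketrans(
    "\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a",  # a e i o u, A E I O U with macron
    "\u00e2\u00ea\u00ee\u00f4\u00fb\u00c2\u00ca\u00ce\u00d4\u00db",  # same vowels with circumflex
)


def macrons_to_circumflex(text: str) -> str:
    """Replace macron vowels with circumflex vowels; leave all else unchanged."""
    return text.translate(MACRON_TO_CIRCUMFLEX)


# 'R', 'm', 'y' and the dotted n are untouched; only the long a changes.
print(macrons_to_circumflex("R\u0101m\u0101ya\u1e47a"))
```

All of the circumflex vowels fall within ISO-8859-1, which is presumably why they are convenient stand-ins.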

In-text commentary

It is the policy of this site to not add any commentary in the body of the etext. Editorial matter written at sacred-texts about the text is reserved for the index page, and will have a byline. Where it is necessary to add brief in-text commentary, it is printed in a green font. Such editorial annotations may also be italicized and the initials of the transcriber (normally, 'jbh') noted. In-text commentary is usually limited to technical notes about the source book, such as notes of illegible or ambiguous type, missing pages, or the redactors' attribution paragraph on the title page.

Errata

Each text is spell-checked during the proof-reading stage using the standard MS Word spell-check dictionary. Specialized vocabulary is added to a custom dictionary on a per-text basis. The text is also vetted for known 'OCR bums'--words that are OCRed incorrectly, but spell-check as valid words, such as 'burn' and 'bum', 'bad' and 'had', 'arid' and 'and'.
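
Because these words pass a spell-check, they have to be looked for by name. The Python sketch below flags them for a human proofreader rather than changing them, since either member of a pair can be correct in context. The watch list and the function name are illustrative assumptions, not the actual list used at sacred-texts.

```python
import re

# Hypothetical watch list: a valid word that is often an OCR misreading,
# mapped to the word that was more likely printed.
OCR_SUSPECTS = {"bum": "burn", "arid": "and"}


def flag_suspects(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, word found, likely intended word) for review."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group(0)
            if word.lower() in OCR_SUSPECTS:
                hits.append((line_no, word, OCR_SUSPECTS[word.lower()]))
    return hits


for hit in flag_suspects("Let the fire bum.\nSalt arid pepper."):
    print(hit)
```

A very common word such as 'had' is left off the sample list because flagging every occurrence would bury the real errors.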

British, archaic and dialect spelling has been retained where it occurs. If an idiosyncratic spelling occurs more than twice in a given text, it is normally not marked as errata.

Typos are corrected and those corrections are noted. For instance, if 'Greenland' is spelled 'Greeenland', that is considered a typo. In such cases, the word (correctly spelled) is linked to an errata file which documents the original spelling. This file is named 'errata.htm' in the same directory as the text.

In some cases minor punctuation errors have been silently corrected. However, care has been taken to preserve as closely as possible the original punctuation, particularly in the case of older books. Some books (for instance the Ganguli translation of the Mahabharata) have major problems with punctuation, particularly nested dialog, and this has been corrected where it was deemed necessary for comprehension.

Poetry

Poetry often requires complex indentation to transcribe its appearance on the printed page. This is simulated using nested DIR tags and non-breaking spaces at the start (and in some cases the middle) of the line. Where possible, each block of poetry has line breaks (BR) between each line, and paragraph breaks at the end of each stanza. Page numbers are placed within the body of a block of poetry to avoid breaking up the layout.
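
The line-break and non-breaking-space part of this layout can be sketched in Python as follows. The sketch assumes the poem arrives as plain text with leading spaces for indentation and a blank line between stanzas; it does not attempt the nested DIR tags, and the function name is hypothetical.

```python
import html


def poetry_to_html(poem: str) -> str:
    """Render stanzas as <P> blocks with <BR> between lines and &nbsp; indents."""
    stanzas_out = []
    for stanza in poem.strip("\n").split("\n\n"):
        lines_out = []
        for line in stanza.split("\n"):
            body = line.lstrip(" ")
            indent = len(line) - len(body)  # number of leading spaces
            # Escape &, < and > in the verse itself; quotes are left as-is.
            lines_out.append("&nbsp;" * indent + html.escape(body, quote=False))
        stanzas_out.append("<P>" + "<BR>\n".join(lines_out) + "</P>")
    return "\n".join(stanzas_out)


print(poetry_to_html("There was a young lady named Bright\n  Who travelled much faster than light"))
```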

Quotation Marks

Where a space appears on both sides of a double or single quote, the inner space is deleted:

" Look at the size of that thing! ", said Wedge.

is transcribed

"Look at the size of that thing!", said Wedge.

For the most part, only 'straight' quotes are used, except where it may be required to resolve ambiguity such as

'‘til Tuesday', she said.

Very rarely, curved quotes will be used in specialized transcriptions (typically indicating a glottal stop), and the straight quote is reserved for an accent mark. Backquote (`) is only used in specialized transcriptions of non-English texts.
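
The inner-space rule for straight double quotes can be sketched in Python as below. The sketch assumes that double quotes are balanced within a line and pairs them left to right; lines with an odd number of quote marks are returned unchanged. Single quotes are not handled, since they cannot be told apart from apostrophes this simply. The function name is hypothetical.

```python
def tighten_double_quotes(line: str) -> str:
    """Remove spaces just inside each pair of straight double quotes."""
    parts = line.split('"')
    # N quote marks give N + 1 parts, so an even part count means the
    # quotes are unbalanced and the line is left for manual correction.
    if len(parts) % 2 == 0:
        return line
    # Odd-indexed parts are the text inside a pair of quotes.
    for i in range(1, len(parts), 2):
        parts[i] = parts[i].strip()
    return '"'.join(parts)


print(tighten_double_quotes('" Look at the size of that thing! ", said Wedge.'))
```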

Page Breaks, Numbering and Continuations

Page breaks are indicated as p. NNN, where NNN is the page number. The HTML page markup is as follows:

<P><A NAME="page_91"><FONT SIZE=1 COLOR=GREEN>p. 91</FONT></A></P>

Thus you could link to this specific page using the HTML anchor 'page_91'. For instance, if the name of the file was 'foo42.htm' in the directory 'http://www.sacred-texts.com/bar' you could link to it using the following HTML markup:

<A HREF="http://www.sacred-texts.com/bar/foo42.htm#page_91">On page 91</A> we are warned not to run with scissors...
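
For anyone generating or citing these anchors programmatically, here is a small Python sketch. The marker string is copied from the markup shown above; the function names, and the assumption that page labels may also be non-numeric (for instance roman numerals in prefatory pages), are illustrative.

```python
import re


def page_marker(page) -> str:
    """Build the page-break markup for a given page label."""
    return f'<P><A NAME="page_{page}"><FONT SIZE=1 COLOR=GREEN>p. {page}</FONT></A></P>'


def page_link(file_url: str, page) -> str:
    """Build a URL that points at a specific page within a file."""
    return f"{file_url}#page_{page}"


# Find every page anchor in a file's HTML, in document order.
PAGE_ANCHOR = re.compile(r'<A NAME="page_(\w+)">', re.IGNORECASE)


def pages_in(html_text: str) -> list[str]:
    return PAGE_ANCHOR.findall(html_text)


print(page_link("http://www.sacred-texts.com/bar/foo42.htm", 91))
print(pages_in(page_marker("xii") + page_marker(91)))
```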

No attempt is made currently to add indentation to the start of paragraphs, due to the limitations of vanilla HTML. We have a system for resolving whether a paragraph that begins at the start of a page is a new or continued paragraph.

The page number is placed in a paragraph by itself, left indented. If the page break occurs within or between paragraphs, it looks like this:

Fred was sure it wouldn't

p. 45

rain that afternoon.


The Queen was pacing back and forth.

p. 45

In the next paragraph, a shot rang out.

If the page break occurs between two sentences in the same paragraph, or if the first word on the successive page does not begin with a lower case letter (including numbers, quotation marks and so on), a continuation is noted as follows:

This paragraph continues on the next page.

p. 45

[paragraph continues] This sentence is part of the paragraph on the previous page.


"Now wait just one minute", said

p. 45

[paragraph continues]Fred, emphatically.


There are

p. 45

[paragraph continues]17 reasons not to run with scissors...


The Queen commanded Fred,

p. 45

[paragraph continues]"Don't run with scissors!"


The continuation is added even if a human being could recognize the paragraph continuation from context. This is so that the text could be broken down into paragraphs correctly by a computer program.
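
A program of that kind might look like the Python sketch below. It assumes the text has already been split into blocks (one per HTML paragraph, with markup removed), that a page marker is a block of the form 'p. NNN', and that the input contains no blockquotes or illustrations. The names are hypothetical, and page markers are simply dropped from the output.

```python
import re

PAGE_MARK = re.compile(r"^p\. \w+$")
CONTINUES = "[paragraph continues]"


def merge_paragraphs(blocks: list[str]) -> list[str]:
    """Rejoin paragraphs that were split by page markers."""
    paragraphs: list[str] = []
    after_break = False  # True when the previous block was a page marker
    for block in blocks:
        block = block.strip()
        if not block:
            continue
        if PAGE_MARK.match(block):
            after_break = True
            continue
        if block.startswith(CONTINUES):
            # Explicit continuation: always part of the previous paragraph.
            text = block[len(CONTINUES):].strip()
            joins = True
        else:
            # Implicit continuation: a lowercase start right after a page break.
            text = block
            joins = after_break and text[0].islower()
        if joins and paragraphs:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        after_break = False
    return paragraphs


sample = [
    "Fred was sure it wouldn't", "p. 45", "rain that afternoon.",
    '"Now wait just one minute", said', "p. 46",
    "[paragraph continues]Fred, emphatically.",
]
for paragraph in merge_paragraphs(sample):
    print(paragraph)
```

Without the explicit marker, the second example could not be resolved by the lowercase test alone, because 'Fred' begins with a capital letter.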

If a paragraph continues midpage after a blockquote or an illustration (that is, it is not indented in the book and doesn't begin with a lowercase letter), the continuation is noted in the same way.



There was a young lady named Bright
Who travelled much faster than light...

[paragraph continues]This is one of the best limericks about relativity.

Hyphenation

An em dash is transcribed as two successive ASCII hyphens (--).

In general, hyphenated words in the original text which appear because of line breaks are joined. This is to facilitate search engines. If a page break occurs in the middle of a word, any words hyphenated across the page boundary are concatenated on the prior page. For instance, if the word 'abracadabra' appears on page 42 hyphenated as 'abra-' and continued on page 43 as 'cadabra', it is transcribed as:

abracadabra,

p. 42

The one exception to this standard is if a hyphenated word is footnoted, and the footnote is on the successive page, then the word is concatenated and moved to the successive page, to avoid moving the footnote.

A paragraph continuation is inserted if the concatenation creates one.
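
The two simplest rules here, the dash transcription and the joining of words hyphenated at a line break, can be sketched in Python as follows. The sketch works on raw OCR text with one printed line per text line and does not cover the page-boundary or footnote cases. It will also join a genuine compound that happens to break at its hyphen, so its output still needs proofreading. The function names are hypothetical.

```python
import re


def normalize_dashes(text: str) -> str:
    """Transcribe each em dash (U+2014) as two ASCII hyphens."""
    return text.replace("\u2014", "--")


def join_line_hyphens(text: str) -> str:
    """Join a word split as 'abra-' at a line end and 'cadabra' on the next line."""
    # Match a single hyphen between a letter and a newline followed by a
    # lowercase letter. A double hyphen (a transcribed dash) is not matched,
    # because the character before its final hyphen is not a letter.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


print(join_line_hyphens("abra-\ncadabra, he said"))
print(normalize_dashes("wait\u2014then"))
```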

Footnotes

The following terms are used in this document to describe footnotes. The location in the body of the text is called a 'footnote reference' or just 'footref'. The term 'footnote' means the actual text of the footnote.

Footrefs are hyperlinked to the corresponding footnote. All footnotes in a chapter are moved to the end of the chapter. They are anchored using the concatenated page number and footref on that page. The number next to the footnote is linked back to the first instance of the footref that references it. If a footnote continues over more than one page, a page number is inserted into the body of the footnote text in small green text. This does not have an HTML anchor markup.

In the case where asterisk, virgule, dagger, etc. are used for footnotes, these are retranscribed silently as simple numerals.

In older files, footnotes are indicated using square brackets, as in etexts produced by Project Gutenberg. This was abandoned because it is difficult or impossible to distinguish between a footref and a footnote marked in this way using a computer program, and it makes the text clumsy to read.

In some cases, books have footnotes placed in another logical chapter than the footref ('endnotes'). Normally these are not relocated in the etext. In this case, the footrefs will appear as normal, but the footnote will be in a separate file, in the sequence it was found in the book. The endnote footnote numbers are not currently linked back to the footref; to get back to the original location after reading the footnote, you will need to use the 'back' button on your browser.
