Sunday, 30 September 2012

Text Analysis

Description of the texts:
  • What is Text Analysis - by Geoffrey Rockwell and Ian Lancashire
  • Computer-Assisted Reading: Reconceiving Text Analysis - by Stéfan Sinclair
  • The Measured Words - by Geoffrey Rockwell and Stéfan Sinclair
I'm not sure whether to acknowledge or ignore that the person marking this blog is also the co-author of the majority of this week's readings. It will undoubtedly affect my comments on some level. Let's start with Stéfan Sinclair's work and delay the issue temporarily.

I feel that in Computer-Assisted Reading: Reconceiving Text Analysis Sinclair presents several valid and noteworthy points. However, I also believe that his underlying principle is incorrect: there are not, in my mind, any fundamental "...incongruities between humanistic and scientific traditions". Theoretical physics, for example, functions very much like traditional humanities subjects: the scholar observes a peculiarity and speculates as to why it is so, using previous research to support the theory. For a philosophy scholar this might be George Grote; for a physicist it could be Niels Bohr; regardless, the 'scholarly tradition' being followed is the same. An experimental physicist might pick up his theoretical cohort's work and design an experiment (perhaps one based in computers) to test it, but so might a humanities computing scholar apply text analysis techniques to his comrade's theory. My point is that Sinclair (and much of academia) treats the realm of subjects as divisible among sciences, social sciences, and humanities. I think in reality you could better understand research and researchers not by arranging them by their disciplines, but by placing them on a continuum running (for the purposes of Sinclair's argument) from computer-based analysis to purely theoretical research, on which each scholar could be plotted according to their preferred research techniques, with no regard for their subject matter. If it should happen that there are more science-based researchers on the computer side of this spectrum, that is likely a result of more software existing to support their regularly quantity-based research, and it is in this that I wholeheartedly agree with Sinclair's conclusions: we do need "tools that accentuate reading rather than counting" if humanities computing is to progress.

That being said, I am troubled by a possible implication in Sinclair's article that while in "...traditional text-analysis tools, data completely displace the text..." in his proposed tools this would not be the case, allowing a researcher to read and analyze at the same time. I am not certain whether he is suggesting that, this being the case, a scholar should proceed directly to a text-analysis tool without first having read the text sans markup. I sincerely hope this is not the case, as text analysis tools can't help, in my opinion, but bias the reader towards focusing on particular features. The word frequency feature of HyperPo, for example, which changes the colour of a word depending on how often it appears, would naturally cause a reader to place more importance on those more common terms which appear in darker colours. A word that appears only once, however, might be of the utmost importance in its usage, but be overlooked because the programming implies it is unimportant. I have trouble believing that Sinclair meant such text-analysis tools could replace unadorned text, so it is perhaps the fact that this is not clearly stated that bothers me, particularly since he felt it necessary to point out explicitly that "whimsical and playful musings" were not a replacement for literary criticism.
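
To make that worry concrete, here is a minimal sketch of frequency-based shading; it is my own toy illustration in Python, not anything HyperPo actually does, and the sample passage and 'shade' numbers are invented.

    from collections import Counter
    import re

    text = ("It was the best of times, it was the worst of times, "
            "it was the age of wisdom, it was the age of foolishness")

    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)

    # Shade each word by how often it occurs: the more frequent, the darker.
    # A word appearing once gets the faintest shade, no matter how much
    # interpretive weight it might actually carry.
    darkest = max(counts.values())
    for word, count in counts.most_common():
        shade = round(count / darkest, 2)  # 1.0 = darkest, near 0 = barely visible
        print(f"{word:12} {count:2}  shade={shade}")

Even in this toy passage the function words come out darkest, while 'wisdom' and 'foolishness', arguably the words doing the interpretive work, are rendered faintest.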

I'll start my discussion of The Measured Words with a list of typos and grammatical mistakes:
  • pg 5 - "Strings are sequences of character that either have a beginning and end."
    • I'm guessing it was supposed to be either 'beginning or end' or 'beginning or end or both'. Either way, there should have been an 'or' somewhere to complete the 'either'.
  • pg 7 - "...non-printing characters like "line feed" and "return" than could be used..."
    • 'that could be used'
  • pg 7 - "<emph>...</emp>"
    • I don't know offhand what the correct tag is, but I'm pretty sure the start and end tags should match
  • pg 11 - "Most scholarly projects use open formats based on XML and following guidelines..."
    • 'and follow'
 I don't know whether this article has been published yet, but in case it hasn't, enjoy these corrections.

I'd also like to compliment a few of the examples of difficulties faced when treating words as strings of characters, specifically the crossword puzzle reference (pg 6) and contractions ruining 'end word upon punctuation' commands (pg 20). I doubt I would have thought of either of these scenarios. It makes me wonder whether the authors thought them up ahead of time, stumbled across them while writing, or racked their brains afterwards trying to think of exceptions. Similarly, I enjoyed the comparison of reading foreign language texts to how computers read (pg 8); I have definitely ended up just looking for patterns in letters when trying to read Latin and Greek.
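
Just to see the contraction problem in action, here is a tiny sketch of my own (not from the article): a naive 'end word upon punctuation' rule splits "can't" into two fragments, while even a slightly more forgiving rule keeps contractions and possessives whole.

    import re

    sentence = "I can't believe it's the printers' fault."

    # Naive rule: a word ends as soon as any punctuation is hit.
    naive = re.findall(r"[A-Za-z]+", sentence)
    # -> ['I', 'can', 't', 'believe', 'it', 's', 'the', 'printers', 'fault']

    # Slightly more tolerant rule: allow an apostrophe inside or at the end of a word.
    tolerant = re.findall(r"[A-Za-z]+(?:'[A-Za-z]*)?", sentence)
    # -> ['I', "can't", 'believe', "it's", 'the', "printers'", 'fault']

    print(naive)
    print(tolerant)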

As far as the content of the article goes, I find myself wondering whether research tools can ever truly be useful to a person who merely goes to a website and accesses them. The person who designs the program has the benefit of the "thinking through of formalizing and modeling" process, whereas a user sees only the end product: a text which has been broken apart and rearranged in the ways the encoder deemed informative. The quality of this is ultimately dependent on both the skill of the encoder and the effort they're willing to put into the work. Depending on the programmer's goal this also becomes a question of productivity: if they can encode three texts adequately in the time it takes to encode one thoroughly, which is the better option?

The question posed (albeit rhetorically) of "when is a variant way of writing a letter actually a new letter?" was particularly intriguing because of an issue I am facing in marking up a text in XML for another class. The typeface of the text includes odd letter forms such as an elongated 's' and 'VV' for W. If the endgame of this project were text analysis, then these features would likely have to be ignored to ensure the computer could process the letters for their intended meaning, even though this means information about the printer would be lost. For the 's' specifically, an alternative Unicode character would have to be substituted (likely &#643;), which would replicate the design of the letter but carry an entirely different purpose. I am therefore left with, as the article says, a choice as to "what is important, and what is not essential to the representation".
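
Since the analysis route would mean flattening these forms away entirely, here is a rough sketch of the sort of normalization pass I have in mind; the substitution table and function are hypothetical stand-ins, and deciding what goes into that table is exactly the 'what is important' choice the article describes.

    # Hypothetical pre-analysis normalization: the printer's letter forms are
    # flattened into ordinary letters so an analysis tool can read them, at the
    # cost of losing the typographical evidence they carry.
    substitutions = {
        "\u017f": "s",   # the long s itself (U+017F)
        "\u0283": "s",   # esh, if it was used as a lookalike stand-in for the long s
        "VV": "W",
        "vv": "w",
    }

    def normalize(text: str) -> str:
        for old, new in substitutions.items():
            text = text.replace(old, new)
        return text

    print(normalize("A moſt vvonderful and VVitty diſcourſe"))
    # -> "A most wonderful and Witty discourse"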

Monday, 24 September 2012

Markup and Enrichment Reader Response

Discussion of the texts:
Two years ago the markup of texts was my first foray into Humanities Computing, and it is interesting, looking back, to reflect on just how little I understood about what I was doing. As this week's readings discuss, one of the benefits of TEI is its simplicity; given basic instruction, a person can mark up text without any knowledge of the principles behind the process. That being said, it's nice to finally have an explanation of why Oxygen rejected certain arrangements of tags.
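
For anyone wondering what 'certain arrangements of tags' looks like in practice, here is a small sketch using Python's standard XML parser rather than Oxygen itself (the tags are made up): properly nested elements parse cleanly, while overlapping ones are thrown out as not well-formed.

    import xml.etree.ElementTree as ET

    well_formed = "<l>A line with <emph>nested</emph> emphasis</l>"
    overlapping = "<l>A line with <emph>overlapping</l> emphasis</emph>"

    ET.fromstring(well_formed)        # parses without complaint

    try:
        ET.fromstring(overlapping)    # elements cross each other's boundaries
    except ET.ParseError as err:
        print("rejected:", err)       # reported as a mismatched tag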

My main takeaway from these articles is an extension of the discussion in class on digitization: what is gained by text markup and what is lost? In Text Encoding Allen Renear outlines numerous advantages of descriptive markup, my favourites being information retrieval and the support of analytical tools. He then launches into a discussion of OHCO, SGML, XML and TEI and breezes by several potential pitfalls of markup. For instance he states that in OHCO, the idea underlying the others, texts "...are not things like pages, columns, (typographical) lines, font shifts, vertical spacing, horizontal spacing, and so on". For modern books and papers this may be the case, but in older documents, particularly handwritten manuscripts, the layout of pages can be just as important as the text itself, depending on the interests of the researcher. I myself have struggled, when transcribing and encoding texts, with the information lost by not indicating, for example, that a line is centered on the page.
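
As a toy example of my own (not anything from Renear's chapter), even something as simple as a rend attribute lets the encoding record that a line was centered, so the layout fact survives alongside the content hierarchy:

    import xml.etree.ElementTree as ET

    # Toy encoding: the hierarchy carries the content, while a rend attribute
    # records the layout fact (a centered line) that a purely content-object
    # view of the text would discard.
    snippet = """
    <lg>
      <l rend="center">TO THE READER</l>
      <l>Gentle reader, peruse these pages with care.</l>
    </lg>
    """

    for line in ET.fromstring(snippet).findall("l"):
        print(f'[{line.get("rend", "default"):>7}] {line.text}')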

Renear, Mylonas and Durand examine the evolution of the OHCO thesis, and the continued problems with it, in the third reading. The authors explain its founding principle: that texts naturally conform to a set structure based on their type, whose elements nest inside each other and do not overlap. They then show how this was refuted through counter-examples of overlapping hierarchies, and how the theory was softened as a result. The most intriguing idea presented in this article, in my opinion, is one that the authors introduce and then abandon in their discussion. The 'Theoretical' defense of OHCO they present states that a "layout feature" of a text can change without affecting the content, but the structure cannot. As discussed in the previous paragraph, I disagree with this assertion; layout can be integral to a text. But the underlying principle (as described by the authors) is in my mind the key to understanding a text and creating a good markup: differentiating between "essential and accidental properties".


My first impression of A very gentle introduction to the TEI markup language was to wonder how far I should trust an author who, in his own words, hasn't "...quite learned how to write an XML document and display it with links in a frameset". This trepidation soon passed, however, first because I don't know how difficult that would be to do, and more importantly because this document is a great example of why you don't want an expert to explain technical concepts. I can't speak for a reader with no previous knowledge of XML or TEI, but as someone who's had some experience with it I feel this explanation was simple enough to be easily followed and at the same time brief enough not to be frustrating or repetitive. The numerous examples, a feature often underused by 'experts', were a big part of its clarity. [I know these blogs are supposed to be prompting discussion about the texts, not just praising them, but I really have nothing else on this one.] It was a great way to launch into HuCo 520's instruction on XML tomorrow, and I hope it is a solid enough foundation for the upcoming HuCo 500 discussion.