HUCO 500: Text Analysis

Description of the texts:

What is Text Analysis - by Geoffrey Rockwell and Ian Lancashire
Computer-Assisted Reading: Reconceiving Text Analysis - by Stéfan Sinclair
The Measured Words - by Geoffrey Rockwell and Stéfan Sinclair

I'm not sure whether to acknowledge or ignore that the person marking this blog is also the co-author of the majority of this week's readings. It will undoubtedly affect my comments on some level. Let's start with Stéfan Sinclair's work and delay the issue temporarily.

I feel that in Computer-Assisted Reading: Reconceiving Text Analysis Sinclair presents several valid and noteworthy points. However, I also believe that his underlying principle is incorrect: there are not, in my mind, any fundamental "...incongruities between humanistic and scientific traditions". Theoretical physics, for example, functions very much like traditional humanities subjects: the scholar observes a peculiarity and speculates as to why it is so, using previous research to support the theory. For a philosophy scholar this might be George Grote; for a physicist it could be Niels Bohr. Regardless, the 'scholarly tradition' being followed is the same. An experimental physicist might pick up his theoretical cohort's work and design an experiment (perhaps one based in computers) to test it, but so might a humanities computing scholar apply text analysis techniques to his comrade's theory. My point is that Sinclair (and much of academia) treats the realm of subjects as divisible among sciences, social sciences, and humanities. I think in reality you could better understand research and researchers not by arranging them by their disciplines, but by placing them on a continuum from (for the purposes of Sinclair's argument) computer-based analysis to purely theoretical research, on which each scholar could be plotted according to their preferred research techniques, with no regard for their subject matter. If it should happen that there are more science-based researchers on the computer side of this spectrum, then that is likely a result of more software existing to support their usually quantity-based research, and it is in this that I wholeheartedly agree with Sinclair's conclusions: we do need "tools that accentuate reading rather than counting" if humanities computing is to progress.

That being said, I am troubled by a possible implication in Sinclair's article that while in "...traditional text-analysis tools, data completely displace the text..." in his proposed tools this would not be the case, allowing a researcher to read and analyze at the same time. I am not certain whether he is suggesting that, this being the case, a scholar should proceed directly to a text-analysis tool without first having read the text sans markup. I sincerely hope this is not the case, as text analysis tools can't help, in my opinion, but bias the reader towards focusing on particular features. The word frequency feature of HyperPo, for example, which changes the colour of each word depending on how often it appears, would naturally cause a reader to place more importance on the more common terms, which appear in darker colours. A word that appears only once, however, might be of the utmost importance in its usage, but be overlooked because the programming implies it is unimportant. I have trouble believing that Sinclair meant such text-analysis tools could replace unadorned text, so perhaps what bothers me is that this is not clearly stated, particularly since he felt it necessary to point out clearly that "whimsical and playful musings" were not a replacement for literary criticism.
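To make the worry concrete, here is a minimal sketch of frequency-based shading. It is not HyperPo's actual algorithm, which I have not seen; the function name `frequency_shades` and the `levels` parameter are my own inventions for illustration, and I am assuming a simple linear scale against the most common word.

```python
import re
from collections import Counter

def frequency_shades(text, levels=5):
    """Map each distinct word to a shade from 1 (lightest) to `levels` (darkest).

    Hypothetical illustration only: a linear scale against the most
    frequent word, not the method used by any real tool.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # Scale each count against the top count; never go below shade 1.
    return {w: max(1, round(levels * c / top)) for w, c in counts.items()}

sample = "the whale and the sea and the ship and the albatross"
shades = frequency_shades(sample)
print(shades)
```

Under this scheme "the" and "and" receive the darkest shades while "albatross" receives the lightest, even though a reader might well consider it the most significant word in the line.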

I'll start my discussion of The Measured Words with a list of typos and grammatical mistakes:

pg 5 - "Strings are sequences of character that either have a beginning and end."

I'm guessing it was supposed to be either 'beginning or end' or 'beginning or end or both'. There should have been an 'or' somewhere anyway to complete the 'either'.

pg 7 - "...non-printing characters like "line feed" and "return" than could be used..."

'that could be used'

pg 7 - "<emph>...</emp>"

I don't know offhand what the correct tag is, but I'm pretty sure the start and end tags should match.

pg 11 - "Most scholarly projects use open formats based on XML and following guidelines..."

'and follow'

I don't know whether this article has been published yet, but in case it hasn't, enjoy these corrections.

I'd also like to compliment a few of the examples used to show the difficulties faced when treating words as strings of characters, specifically the crossword puzzle reference (pg 6) and the way contractions ruin 'end word upon punctuation' commands (pg 20). I doubt I would have thought of either of these scenarios. It makes me wonder whether the authors thought them up ahead of time, stumbled across them while writing, or racked their brains afterwards trying to think of exceptions. Similarly, I enjoyed the comparison of foreign language texts to how computers read (pg 8). I have definitely ended up just looking for patterns in letters when trying to read Latin and Greek.
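The contraction problem is easy to reproduce. The sketch below is my own illustration, not code from the article: `naive_tokens` and `apostrophe_aware_tokens` are hypothetical names, and the second function is just one possible patch, assuming plain ASCII apostrophes.

```python
import re

def naive_tokens(text):
    # End a word at any character that is not a letter,
    # which is the 'end word upon punctuation' rule.
    return re.findall(r"[A-Za-z]+", text)

def apostrophe_aware_tokens(text):
    # Allow an apostrophe only when it sits between letters,
    # so "don't" stays whole but a closing quote mark is dropped.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

sentence = "I don't think it's finished."
print(naive_tokens(sentence))
print(apostrophe_aware_tokens(sentence))
```

The naive rule breaks "don't" into "don" and "t". The patched rule keeps contractions together, but it is still imperfect: a plural possessive such as "dogs'" would lose its trailing apostrophe, which rather proves the article's point about exceptions.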

As far as the content of the article goes, I find myself wondering whether research tools can ever truly be useful to a person who merely goes to a website and accesses them. The person who designs the program has the benefit of the "thinking through of formalizing and modeling" process, whereas a user sees only the end product: a text which has been broken and rearranged in the ways the encoder deemed informative. The quality of this is ultimately dependent on both the skill of the encoder and the effort they're willing to put into the work. Depending on the programmer's goal this also becomes a question of productivity: if they can encode 3 texts adequately in the time it takes to encode 1 thoroughly, which is the better option?

The question posed (albeit rhetorically) of "when is a variant way of writing a letter actually a new letter?" was particularly intriguing because of an issue I am facing in marking up a text in XML for another class. The typeface of the text includes odd letter forms such as an elongated 's' and 'VV' for W. If the end goal of this project were text analysis then these features would likely have to be ignored to ensure the computer could process each letter for its intended meaning, even though this means information regarding the printer would be lost. For the 's' specifically an alternative Unicode character would have to be substituted (likely ʃ), which would replicate the design of the letter but carries an entirely different purpose. I am therefore left with, as the article says, a choice as to "what is important, and what is not essential to the representation".
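One way around the choice might be to keep the transcription as printed and derive a normalized copy only when it is needed for analysis. The sketch below is an assumption about how that could work, not something from the article or from my XML project: `normalize_for_analysis` is a hypothetical helper, and it assumes the elongated 's' was transcribed either as ʃ (U+0283, the esh) or as ſ (U+017F, which Unicode names LATIN SMALL LETTER LONG S).

```python
def normalize_for_analysis(text):
    """Return a copy of `text` with early-print letter forms modernized.

    Hypothetical helper: the replacement table is an assumption about
    which characters the transcription uses.
    """
    replacements = {
        "\u017f": "s",  # LATIN SMALL LETTER LONG S
        "\u0283": "s",  # LATIN SMALL LETTER ESH, if used as a stand-in
        "VV": "W",      # doubled V printed in place of W
        "vv": "w",
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

transcribed = "VVhen the \u017fhip was lo\u017ft"
print(normalize_for_analysis(transcribed))
```

Because the function returns a new string, the original transcription (and with it the evidence about the printer) is left untouched. The 'vv' rule is blunt, though: it would also rewrite any word that legitimately contains a double v, so the table would need checking against the actual text.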

Sunday, 30 September 2012
