EESE 4/1996

Françoise Deconinck-Brossard (Paris-Nanterre)

The Case for Computer-Aided Textual Analysis

The number of texts available in machine-readable form has increased significantly in the last decade. Many books and periodicals have now been either produced or transcribed on computers, and can easily be retrieved from such electronic libraries as the Oxford Text Archive or freely downloaded from relevant servers on the Internet.1 Thus, new avenues of computerised textual analysis have opened up. Not all texts, however, are produced on computer.2 Scholars working on manuscript material, diachronic corpora or secondary authors will have to go on typing in their documents in order to be able to process them by computer, as long as optical character recognition cannot yield satisfactory results.3 One may therefore wonder which texts will be singled out for computational analysis.4 It would be a pity if the emergence of the new medium focused mostly either on the major writers or genres, in spite of recent eulogies of the canon,5 or on contemporary literature. Another problem derives from the poor quality of some electronic editions, which sometimes do not measure up to scholarly expectations.6

The following paper tries to account for work in progress on a larger project focusing on computer-aided discourse analysis of eighteenth-century English sermons.7 Data capture is so cumbersome and time-consuming that every available sample has been included in the temporary corpus under consideration, which has now reached over 80,000 words -- a length that may be regarded as statistically relevant, by the received standards of literary computation.8

The nonconformist subset contains not only full transcripts, such as Joseph Towers' first manuscript in the collection held at Dr. Williams's Library,9 together with the sermon that he preached, and published, in 1777 for the benefit of a London charity-school,10 but also excerpts from both manuscript and printed sources,11 within a time span extending over most of the century, from Daniel Neal (1678-1743) to the contemporaries of the American war of independence and the French Revolution. It may also be worth mentioning that several varieties of Dissent, including the drift to unitarianism at the end of the century, are exemplified here. Any disproportionate representation of individual authors or denominations merely arises from differences in availability of primary material.

The Church of England subset, on the other hand, may be divided into two. As I have shown elsewhere,12 the small selection of manuscript sermons by John Sharp (1723-1792), vicar of Hartburn, prebendary of Durham and archdeacon of Northumberland, may be regarded as representative of an approximately ten times longer corpus.13 In this case, the samples are units of a composite work. On the other hand, all but one of the other items are full transcripts of a very homogeneous group of single political sermons published in the first two decades of the century.14

The two subsets are of similar length, which may make statistical interpretation easier, as a comparative approach soon proves essential. One should bear in mind, however, that such an arrangement does not at all mirror the reality of eighteenth-century sermon production. Indeed, Dissenters published only a fraction of all sermons printed at the time, i.e. 19% by the most optimistic account, though the figure might have been as low as 6.7% at an earlier stage.15 In this respect, it would be no exaggeration to describe non-conformist preaching as marginal. Interestingly enough, the Dissenting population was an even smaller minority in the nation. On average, the three main denominations of Old Dissent accounted for about 5% of the overall population at the beginning of the century, albeit with a very uneven geographic distribution. By 1773, the much-lamented decline of the Dissenting interest had brought the figure down by one percentage point at least.16 It may therefore be worth wondering whether non-conformist preaching exerted less marginal influence than demography might suggest.

One of the most obvious features of a particular genre lies in its vocabulary. Hence, the crudest tools that computer-aided textual analysis can offer are wordlists in which word-types may be sorted either alphabetically or by ascending (or descending) frequency. A quick look at such raw data (fig. 1) may already show some of the Dissenters' stylistic idiosyncrasies. The use of personal pronouns and modal auxiliaries will be particularly telling. Hence the need to keep function words within the table of lexical items! A scholar may feel quite frustrated when they have been weeded out or "reserved," as in K. Monkman's concordance of Sterne's sermons,17 or when software packages like Sphinx™ offer the possibility to concentrate on content words only. This does not necessarily mean that one should go to the other extreme, and only pay attention to the most common word-types, in spite of their remarkable pervasiveness.18
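
By way of illustration only, a wordlist of the kind described above might be sketched as follows in Python. This is not the software used for the project; the function names and the sample sentence are invented for the example, and function words are deliberately kept in the table.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into word tokens, keeping case and internal apostrophes."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def word_list(text, order="frequency"):
    """Return (word, count) pairs sorted alphabetically or by descending frequency."""
    counts = Counter(tokenize(text))  # function words are NOT filtered out
    if order == "alphabetical":
        return sorted(counts.items(), key=lambda item: (item[0].lower(), item[0]))
    # Descending frequency; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Invented sample sentence, not taken from the corpus.
sample = "We trust in God, and we pray that God may guide us."
by_frequency = word_list(sample)
alphabetical = word_list(sample, order="alphabetical")
```

Because the sketch is case-sensitive, "We" and "we" are counted as two separate types, which is precisely the kind of decision that a transcriber has to make explicitly.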

Like their anglican counterparts, nonconformist preachers preferred to address their congregations in the first person plural, and hardly ever used the feminine pronoun she. Similarly, and perhaps not unexpectedly, the non-grammatical word that all preachers used most often was God, while Christ and Jesus rank far lower down the list. It goes without saying that such crude information needs refining, particularly because it relates to a motley set of texts.

Distribution graphs and other breakdown devices showing how (un)evenly given words are located may be of interest. Such displays will show that the few occurrences of third person feminine pronouns amongst Dissenters are almost exclusively confined to funeral sermons, just as in the Sharp corpus they almost only appear in one particular text, namely the charity sermon preached in favour of clergymen's widows, in which the relatively exceptional use of the feminine is understandable.19 More interestingly, it seems that the component texts might well differ in their stylistic and theological approaches. Not only is Joseph Towers' inordinate manner of opening sentences with and made obvious, but his theocentricity appears glaringly. Neither feature should come as a surprise to a peruser of the Dictionary of National Biography: the author did not receive much formal education, became an Arian from an early age, when he was apprenticed to Richard Goadby, and later in life moved in the same circles as Richard Price. However, his lack of Christocentricity did not marginalise him at all, either among his fellow-Dissenters, or amid many of his anglican counterparts. By contrast, the Christ-centred approach of such Dissenters as Robert Bragge (1665-1738), whose manuscripts20 constantly repeat the abbreviation X for Christ, appears as the exception rather than the norm.
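
A breakdown of this sort can be approximated with very little code. The following Python sketch, offered under the assumption that each sermon is held as a separate string, counts a given word in every text and normalises the count per 1,000 tokens so that texts of different lengths can be compared. The two sample "sermons" and their labels are invented for the example.

```python
import re

def distribution(texts, word):
    """Return {label: (count, rate per 1,000 tokens)} for `word` in each text."""
    target = word.lower()
    table = {}
    for label, text in texts.items():
        tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
        count = tokens.count(target)
        rate = 1000 * count / len(tokens) if tokens else 0.0
        table[label] = (count, rate)
    return table

# Invented miniature texts; a real run would load the transcribed sermons.
texts = {
    "funeral_sermon": "She was a pattern of piety and she bore her trials with patience",
    "political_sermon": "We are bound to obey the laws and to honour the king",
}
she = distribution(texts, "she")
```

An uneven spread then shows up as a few texts with high rates and many with a rate of zero.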

Retrieving relevant segments of text assists further analysis. The intriguingly high rank of the first person singular in the wordlist, and its relatively even distribution in the array of dissenting sermons, cannot be attributed to chance. Unlike John Sharp, who consistently took his father's advice "to avoid Egoisms (or mention of ones [sic] self) as much as may be"21 except in transitions such as "I shall discourse &c., [...] I now proceed to &c., I conclude with &c.," Dissenting preachers conveyed personal feelings that their anglican contemporaries had been trained to marginalise in their sermons. As the trend may be noticed across the non-conformist spectrum, and throughout the century, it may reasonably be assumed to result from a deliberate protest against the established impersonal style whereby the ego is detestable. In one of the most extreme examples, Edwards exclaimed, at the end of the century, "This I say in my own behalf." A few years earlier, in a purple-patch pro-revolutionary passage, Richard Price had emphasised the repetition of the first person singular:

Edwards and Price were only building on a tradition of personal homiletics that had allowed Josiah Owen to cry out likewise, at the time of the '45 rebellion, "methinks, I hear a voice from Heaven, answering, Go and sin no more, lest a worse thing befal you."23 When his friend Colonel Gardiner died on the battlefield, the more moderate Doddridge was able to write of his personal faith and inspiration: "I trust in GOD, that so heroic a behaviour will inspire our warriors with augmented courage."24 In less troubled times, though, when discussing the education of children, he evinced typical moderation: "I would by no means drive Matters to Extremities." (1736b, 3, 63).

Correlatively, the second person (singular or plural) is more widely used by Dissenters than in the established church. In many instances indeed, you co-occurs with I in the same sentence. In his sermons on education, Doddridge alternately addresses parents and children. He begs the former not to be too lenient: "I intreat you, as you love your Children, [...] that it may not be said of you ... that you have neglected to restrain them." (1736b, 3, 63) Then he concludes his address to the children with comforting words: "I have one Thing to tell you for your Encouragement." (1736b, 4, 88) By contrast, most sermons preached by prominent members of the Established church in favour of charity-schools hardly ever referred to the children they spoke about in any other terms than they.25 Thus, the characteristic use of pronouns in the Dissenting pulpit reveals a rhetoric that is based on the skilful development of a personal relationship between the preacher and his audience. Hence the use of affectionate collocates such as "brethren," or even "my brethren" by preachers who belonged to what has been characterised as the affectionate religious tradition of Old Dissent.26 The impersonal strategy prevailing in the Church of England, on the other hand, implies that the preacher only acts as a channel of communication, as if he were concealing the position of authority from which he speaks, for he is bound by the same duties as the congregation. It may be worth noting that in this respect, the short excerpt by John Wesley clearly stands on the side of Old Dissent. However, one should resist the temptation of inferring quick conclusions from fragmented data.
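
The co-occurrence of two pronouns within the same sentence can be retrieved along the following lines. The Python sketch is only an approximation: it assumes that sentences end with a full stop, question mark or exclamation mark followed by a space, which will not always hold for eighteenth-century punctuation. The first sample sentence is the Doddridge quotation given above; the second is invented for contrast.

```python
import re

def cooccurrences(text, first, second):
    """Return the sentences of `text` that contain both words (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
        if first.lower() in words and second.lower() in words:
            hits.append(sentence)
    return hits

sample = (
    "I have one Thing to tell you for your Encouragement. "
    "They are taught to read and to work."  # invented contrast sentence
)
i_and_you = cooccurrences(sample, "I", "you")
```

Only the first sentence is returned, since the second contains neither pronoun.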

The vexing question relates to the significance of word frequencies and stylistic similarities or differences, particularly with texts of diverse lengths. As soon as one raises the issues of typicality, representativeness or specificity, the notions of marginality and aberration become momentous. The limitations of this short paper have led me, so far, to pinpoint areas where the Dissenters' strategy of persuasion diverged from the prevailing rhetoric. Other examples might include a slightly different emphasis in expressions of modality, which must mirror minor differences in the approach to intrasubjective relationships.27 Differences in word usage also convey diverging worldviews: while John Sharp systematically correlates "duty" with "pleasure" and "delight," Dissenters consistently characterised it as a "reasonable" obligation. It would have been equally feasible, however, to underline many instances of congruence. To quote but one example, one may highlight the preachers' reluctance to use negative forms with expressions of compulsion. Clearly, whether they were dissenters or anglicans, these preachers considered negation as a useless rhetorical approach. Neither did they mention "atheists," as if they simply ignored anybody who lived beyond the pale of Christianity. Likewise, they seldom referred to "superstition," not even always as synonymous with catholicism, partly because they felt, like Towers, that "the nation may not now be in the same danger of Popery, as it had been before." (1777, 21)

The answer will necessarily be based on the statistical notion of standard deviation. The end-result may be mapped on graphs derived from factorial analysis of correspondence, a method initially developed in France.28 Fortunately, one does not need any special expertise in mathematics, but it takes a little while to get used to interpreting such visual displays.
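
To give a rough idea of how standard deviation enters the picture, one may compare the number of times a word actually occurs in one part of the corpus with the number expected from its overall frequency, and express the gap in standard deviations. The Python sketch below uses a normal approximation to the binomial distribution; this is an assumption made for the sake of a short example, not a description of the computation performed by the package used here, and the figures in the usage example are invented.

```python
import math

def specificity_z(count_in_part, part_size, count_in_corpus, corpus_size):
    """Deviation of an observed count from its expected value, in standard deviations.

    Assumes a binomial model: each of the `part_size` tokens is the word
    with probability p = count_in_corpus / corpus_size.
    """
    p = count_in_corpus / corpus_size
    expected = part_size * p
    sd = math.sqrt(part_size * p * (1 - p))
    return (count_in_part - expected) / sd

# Invented figures: a word found 100 times in an 80,000-word corpus,
# 30 of them in one preacher's 10,000-word subset.
z = specificity_z(30, 10_000, 100, 80_000)
```

A value close to zero indicates that the word is used about as often as the rest of the corpus would lead one to expect; large positive or negative values point to over-use or under-use.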

First, the graph in figure 2, representing Towers' set of texts, was produced under a case-sensitive operating system (UNIX™), so that one sometimes finds two traces for the same term whenever the spelling varies in the original document. Such is the case for gospel, for instance. Louis Milic's recommendations about the "normalization" of punctuation, capitalisation, and italicisation29 might exclude such variants. So far, however, I have opted against standardising the preachers' language because the purpose of the transcription process was twofold. Not only did it aim at a quantitative investigation of the authors' style, but it also attempted to provide a faithful reproduction of texts that are otherwise almost unavailable. Time and resources have not allowed me to prepare two separate versions of the corpus in order to meet the requirements of these different approaches. Besides, there is evidence that, in the case of manuscript sermons at least, the use of punctuation and capitalisation cannot be deemed insignificant.30
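
The effect of case sensitivity on a frequency table is easily demonstrated. In the Python sketch below, the token list is invented; it simply shows how capitalised and lower-case spellings yield separate entries unless they are folded together, and what is lost in the process.

```python
from collections import Counter

# Invented token list imitating variable capitalisation in a manuscript.
tokens = ["Gospel", "gospel", "gospel", "God", "reason", "Reason"]

# Case-sensitive count: "Gospel" and "gospel" are two distinct types.
case_sensitive = Counter(tokens)

# Case-folded count: variants merge, but "God" becomes indistinguishable
# from a hypothetical lower-case "god".
folded = Counter(token.lower() for token in tokens)
```

Keeping the transcription faithful and folding case only at counting time is one way of serving both purposes from a single version of the corpus.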

The graph is plotted on an invisible grid (fig. 3), which, unfortunately, is too tight for all the significant vocabulary items to appear in full, so that often only the first letters of a given word may be read. By 'word' here and throughout is meant what is sometimes called 'word type': short of a procedure of lemmatization,31 advantage and advantages are different 'types' or 'words'. The reason why lemmas have not been grouped does not arise from a theoretical standpoint on this controversial issue32 but simply from the fact that the package (SPADT™) only generates unlemmatised tables of lexical items. However, experience has shown that it may be useful to break down information about word usage by number or tense. To quote but one example from the homiletic corpus under review, passion in the singular refers to the life of Jesus Christ, whereas passions in the plural refers to emotions, as in the work of Descartes or the baroque theory of affections. Admittedly, it would be convenient if we, our, ours and us appeared together in frequency distribution tables.
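
Selective grouping of this kind can be sketched with a simple lookup table. In the Python example below, both the table of lemmas and the frequency counts are invented; the point is only that we, our, ours and us may be merged while passion and passions, which are absent from the table, remain distinct.

```python
from collections import Counter

# Hypothetical lemma table: inflected or related form -> base form.
LEMMAS = {"our": "we", "ours": "we", "us": "we", "advantages": "advantage"}

def lemmatize_counts(type_counts, lemmas):
    """Merge the counts of all forms listed in `lemmas` under their base form."""
    grouped = Counter()
    for form, count in type_counts.items():
        grouped[lemmas.get(form, form)] += count
    return grouped

# Invented counts, for illustration only.
type_counts = Counter(
    {"we": 40, "our": 25, "us": 10, "ours": 1, "passion": 3, "passions": 7}
)
grouped = lemmatize_counts(type_counts, LEMMAS)
```

Forms left out of the table are passed through unchanged, so the decision to group or not to group can be taken word by word.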

Moreover, the package produces a list of "hidden words," which, though distinctive enough, cannot be visualized, simply because they are concealed by others with exactly the same coordinates. Insignificant units are either literally marginalised or even excluded from the graph. As a gross oversimplification, one may say that the more central, and the nearer to either of the two imaginary diagonal lines of projection, the less specific the word is, in its particular context. Not surprisingly, the most usual grammatical forms (the, of, to, an) are very central indeed. So is the first person plural, though the subject pronoun I is not insignificant. Furthermore, Towers' favourite sentence opening, And, is well located on the axis. That God, Deity, Almighty and Being should lie in a cluster may be of interest, all the more so as Christ stands further away and the centrality of reason cannot be denied. Interestingly enough, the shorter excerpts appear as less representative than the two full transcripts. One may therefore wonder whether the use of fragmented text should be questioned in the light of such evidence, even though it has been common practice since the pioneer studies of Louis Milic, whose first compilation of a historical corpus dates back to 1972.33

On the other hand, the Kiddell graph (fig. 4) seems to reveal remarkable consistency in the preacher's manuscripts, while the printed text stands out as more peripheral. Therefore, the working hypothesis of a possible differentiation between printed and manuscript sermons cannot be ruled out.

What attracts attention, moreover, when the aggregate body of Dissenting sermons is considered together (fig. 5), is not only the key position held by Doddridge, but also the relative cohesion of each individual preacher compared with the whole. The common assumption underlying many stylometric studies, namely, that the consistency of word usage within any author's works acts as a lexical/syntactic fingerprint that may allow one to draw his/her stylistic profile, could not be dismissed easily in view of such graphs. In addition, Doddridge's textual centrality is all the more remarkable as it echoes his considerable influence and extraordinary contribution to English religion in the eighteenth century. As his vast correspondence shows,34 he maintained a large, ecumenical circle of friends and acquaintances, at home and abroad, with Dissenters and Anglicans alike. His biography and writings reveal his wider interests in philanthropic schemes to benefit the community, and, as has been mentioned before, in stirring up patriotic fervour at the time of the '45 rebellion.35 That his discourse should be as central to the Dissenting pulpit as his influence on his own and later generations may prove momentous. Besides, the striking parallel thus drawn between textual and meta-textual information gives food for thought.36

If the collection of texts under examination becomes too large, the data have to be plotted on more than two axes, so that several graphs are produced. Otherwise, the charts would be overcrowded and well nigh impossible to read. However, the least negligible information is concentrated on the first few axes and diagrams. To start with, the Sharp sample was combined with the body of Dissenting sermons. The results were plotted on 10 axes and 9 graphs. The first graph (fig. 5) confirms each author's internal consistency, except that Towers' printed sermon appears as relatively peripheral to his other texts. That Bragge's manuscripts are to be found at the bottom edge comes as no surprise: his Calvinistic theology has little to do with a constellation of sermons where God and reason are at least as central as Christ. The very noticeable cluster of Dissenters highlights the underlying unity of what was then known as "the Dissenting interest",37 with the possible exception of Richard Price who, incidentally, has been reproached by some modern historians for using injudicious language that gave the Dissenting pulpit such an undeservedly bad reputation.38 The data is more sparsely scattered on the second graph (fig. 6), but the chief novelty there is John Wesley's literally marginal position. Although it should be borne in mind that only a very short excerpt of his sermons has been included in the set of texts under consideration, it will be worth enquiring whether such radical differentiation in style from both the Established church and Old Dissent may be confirmed, and add stylistic evidence to the historical issue of the relationship between methodism and the Church of England.

The next stage of the investigation consisted in amalgamating the group of early eighteenth-century anglican sermons to the previous body of texts, plotting the data again on 10 axes and 9 charts. The interest of the first graph (fig. 7) derives from the fact that one could almost draw a line between the Established church and the Dissenters, whose stylistic and theological approaches thus appear to differ considerably, whereas a traditional approach to the problem had only been able to pinpoint many instances of thematic congruence.

Indeed, one of the unspoken disappointments in my doctoral thesis long ago39 was the convergence of ideas within eighteenth-century sermon literature as a whole. Computer-aided statistical analysis has revealed that "Protestant dissenters" had succeeded in their concern to "distinguish themselves" from their fellow-preachers. New avenues of investigation have been opened up, and the case for exploratory textual analysis may be substantiated.

Françoise Deconinck-Brossard
Université de Paris X
fadeco@u-paris10.fr

Graphs available in a separate file

Notes

1 Some of the most famous servers on the web, for instance, are The English Server at Carnegie Mellon University (http://www.cs.cmu.edu/Web/books.html) or the Online Archive of Electronic Texts at the University of Virginia (http://www.lib.virginia.edu/etext/texts.html).

2 Unlike what Maurice Gross claims in his paper "On Counting Meaningful Units of Text," JADT 1995: III Giornate Internazionali di Analisi Statistica dei Dati Testuali, eds. S. Bolasco, L. Lebart & A. Salem (Rome CISU) 1:5.

3 It is worth noting that many recent editing ventures have opted against optical character recognition: such is the case of the Thesaurus Linguae Graecae and the English Poetry Full-text Database, for instance (see The English Poetry Full-text Database Newsletter 3, p. 3). I wish to thank the IBM Almaden Research Center (San Jose, CA) for their generous hospitality and assistance with the question of data capture and OCR when I visited them while on research leave in the early months of 1992.

4 The issue was raised by Michael Leslie, the former editor of the Hartlib Papers Project at the University of Sheffield: "Electronic Edition and the Hierarchy of Texts," The Politics of the Electronic Text, eds. Warren Chernaik, Caroline Davis, and Marilyn Deegan (Oxford: Office for Humanities Communication Publications, 1993) 41-51.

5 Harold Bloom, The Western Canon (London: Papermac, 1995).

6 See Karen Lunsford, "Electronic Texts and the Internet: A Review of The English Server," Computers and the Humanities 29:4 (1995): 297-305.

7 The core of the following essay is based on a paper given at the 3rd International Conference on Statistical Analysis of Textual Data: "Stylistic Marginality of Eighteenth-century Dissenting Preachers," JADT 1995, II: 321-8.

8 See A. Kenny, The Computation of Style: An Introduction to Statistics for Students of Literature and the Humanities (Oxford: Pergamon Press, 1982) 98-103.

9 Modern octavo ms. 28.14(I).

10 The Professors of the Gospel under the Strongest Obligations to Labour to Distinguish Themselves by an Eminent Degree of Piety and Virtue. A Sermon Preached at St. Thomas's, January 1, 1777, for the Benefit of the Charity-School in Gravel-Lane, Southwark. (London: J. Johnson and J. Buckland, 1777).

11 Some of the excerpts from printed material have been published in English Sermons: Mirrors of Society, ed. Christiane d'Haussy (Toulouse: Presses universitaires du Mirail, 1995).

12 "Ecritures sermonnaires," forthcoming in Confluences VIII (Université de Paris X: Centre de recherches sur les origines de la modernité et les pays anglophones).

13 The full transcription, which has now been completed, amounts to approximately 200,000 words.

14 My debt of gratitude to Professor Neumann, who has been generous enough to pass this transcription on to me, is beyond words.

15 The figure is derived from Professor Spaulding's computerisation of John Cooke's The Preacher's Assistant, (After the Manner of Mr. Letsome), Containing a Series of the Texts of Sermons and Discourses Published Either Singly, or in volume, by Divines of the Church of England, and by the Dissenting Clergy, since the Restoration to the Present Time, Specifying Also the Several Authors Alphabetically Arranged Under Each Text -- with the Size, Date, Occasion, or Subject-matter of Each Sermon or Discourse (Oxford: Clarendon P, 1783). The figure that may be drawn from Letsome is 6.7% : see my paper on "Eighteenth-Century Sermons and the Age," Crown and Mitre: Religion and Society in Northern Europe since the Reformation (Woodbridge: The Boydell P, 1993) 106-9.

16 On the demography of decline, see James E. Bradley, Religion, Revolution and English Radicalism: Non-conformity in Eighteenth-century Politics and Society (Cambridge U P, 1990) 92-6.

17 A printout is on deposit at the Cambridge University Library, and the master copy at Shandy Hall, of course.

18 J.F. Burrows reckons that they represent up to 41 per cent of the whole dialogue in Jane Austen's novels: see Computation into Criticism: A Study of Jane Austen's Novels And an Experiment in Method (Oxford: Clarendon Press, 1987) 82. Louis Milic's computation leads him to the more general figure of 30 per cent in any 2000-word sample: "The Century of Prose Corpus: A Half-Million Word Historical Data Base," Computers and the Humanities 29:5 (October 1995): 328.

19 I have given a detailed analysis of the use of the feminine pronoun in the Sharp corpus in my article on "The Computerisation of a Manuscript Corpus: Expressions of Compulsion in Eighteenth-century Sermons," Asp 4 (July 1994): 62-73.

20 Modern octavo ms. 24.33 at Dr. Williams's Library.

21 Northumberland Record Office, ms. NRO: 452/C3/3 3.

22 A Discourse on the Love of our Country (London: George Stafford for R. Cadell, 1789), quoted in d'Haussy 85.

23 All is Well: or The Defeat of the Late Rebellion, and Deliverance from its Dreadful Consequences, an Exalted and Illustrious Blessing (London: J. Hodges, [1745]), quoted in d'Haussy 83.

24 The Christian Warrior (London: J. Waugh, 1745), quoted in d'Haussy 80.

25 I have already discussed the we/they relationship in charity-sermons: see Vie politique, sociale et religieuse en Grande-Bretagne d'après les sermons prêchés ou publiés en Grande-Bretagne 1738-1760 (Paris: Didier-Erudition, 1984) II, 599 & sqq.

26 Isabel Rivers, Reason, Grace, and Sentiment: A Study of the Language of Religion and Ethics in England 1660-1780, Volume I: Whichcote to Wesley (Cambridge UP, 1991) chap. 4.

27 "The Computerisation of a Manuscript Corpus," passim.

28 For a history of, and a good introduction to, correspondence analysis, see Michael J. Greenacre, Theory and Applications of Correspondence Analysis (London: Academic P, 1984). The standard textbook explanation is to be found in Ludovic Lebart & André Salem, Statistique textuelle (Paris: Dunod, 1994) 79-109.

29 Milic 330-1.

30 I have made this point in "Ecritures sermonnaires," and in my lecture on "Dr. John Sharp: An Eighteenth-century Northumbrian Preacher," (Durham, St. Mary's College, 1995).

31 The procedure consists in assembling all the forms of a given lemma under the base form. Thus, I, me, my and mine would count as only one item.

32 See L. Lebart & A. Salem 36-8.

33 Louis Milic's first compilation of diachronic texts, The Augustan Prose Sample, is available from the Oxford Text Archive. On his more recent corpus, see "The Century of Prose Corpus," 327-37, passim.

34 Geoffrey F. Nuttall, Calendar of the Correspondence of Philip Doddridge, D.D. 1702-1751 (Northamptonshire Record Society and the Royal Commission on Historical Manuscripts, 1979).

35 The most recent biography is by Malcolm Deacon, Philip Doddridge of Northampton 1702-1751 (Northamptonshire Libraries, 1980).

36 On the notion of meta-information, see Ludovic Lebart, "Analyse statistique des données textuelles: quelques problèmes actuels et futurs," JADT 1995, I, xvii-xxiv.

37 On the unity of Dissent, see Bradley 53.

38 Bradley mentions the controversy about Richard Price's language.

39 Vie politique, sociale et religieuse en Grande-Bretagne ....